Annotation Interface ExtractPdfConfig


@Retention(RUNTIME)
@Target(TYPE)
public @interface ExtractPdfConfig
Annotation-based configuration for ExtractPDF4J parsers.

Apply this annotation to a class to declare parser settings that can be materialized via ExtractPdfAnnotations.
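
As a rough illustration of class-level configuration, the sketch below declares a local stand-in annotation that mirrors a subset of the documented elements and defaults, applies it to a hypothetical class, and reads the values back through standard reflection. The stand-in, the `InvoiceTables` class, and the `describe` helper are assumptions made for this example; real code would import the library's own ExtractPdfConfig and would normally hand the annotated class to ExtractPdfAnnotations, whose methods are not shown on this page.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

public class ConfigDemo {

    // Local stand-in (assumption): mirrors a subset of the documented elements
    // and their defaults so the example compiles without the library.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ExtractPdfConfig {
        String pages() default "1";
        float dpi() default 450.0f;
        boolean debug() default false;
        boolean keepCells() default false;
        String[] requiredHeaders() default {};
    }

    // Hypothetical annotated class: overrides pages and requiredHeaders,
    // leaves every other element at its default.
    @ExtractPdfConfig(pages = "2-5", requiredHeaders = {"Date", "Amount"})
    static class InvoiceTables {}

    // Reads the annotation reflectively and formats the effective settings.
    static String describe(Class<?> type) {
        ExtractPdfConfig cfg = type.getAnnotation(ExtractPdfConfig.class);
        if (cfg == null) {
            return "no config";
        }
        return "pages=" + cfg.pages()
                + " dpi=" + cfg.dpi()
                + " debug=" + cfg.debug()
                + " keepCells=" + cfg.keepCells()
                + " requiredHeaders=" + Arrays.toString(cfg.requiredHeaders());
    }

    public static void main(String[] args) {
        System.out.println(describe(InvoiceTables.class));
    }
}
```

Because the annotation has RUNTIME retention, elements that are not set on the class still resolve to their declared defaults when read this way.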

  • Optional Element Summary

    Optional Elements
    Modifier and Type   Optional Element   Description
    boolean             debug              Enables debug artifact output for lattice/ocr/hybrid.
    String              debugDir           Directory where debug artifacts should be written.
    float               dpi                DPI for image-based parsing (lattice/ocr/hybrid).
    boolean             keepCells          Whether to keep empty cells in lattice parsing.
    double              minScore           Minimum average score for hybrid parser selection.
    String              pages              Page selection string (e.g., "all", "1", "2-5", "1,3-4").
    ParserMode          parser             Parser strategy to use when materializing a parser.
    String[]            requiredHeaders    Required OCR headers to look for before returning results.
    boolean             stripText          Whether to strip/normalize text for stream-based extraction.
  • Element Details

    • parser

      ParserMode parser
      Parser strategy to use when materializing a parser.
      Default:
      HYBRID
    • pages

      String pages
      Page selection string (e.g., "all", "1", "2-5", "1,3-4").
      Default:
      "1"
    • stripText

      boolean stripText
      Whether to strip/normalize text for stream-based extraction.
      Default:
      true
    • dpi

      float dpi
      DPI for image-based parsing (lattice/ocr/hybrid).
      Default:
      450.0f
    • debug

      boolean debug
      Enables debug artifact output for lattice/ocr/hybrid.
      Default:
      false
    • keepCells

      boolean keepCells
      Whether to keep empty cells in lattice parsing.
      Default:
      false
    • minScore

      double minScore
      Minimum average score for hybrid parser selection.
      Default:
      0.0
    • debugDir

      String debugDir
      Directory where debug artifacts should be written.
      Default:
      ""
    • requiredHeaders

      String[] requiredHeaders
      Required OCR headers to look for before returning results.
      Default:
      {}
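
The page selection format accepted by the pages element ("all", "1", "2-5", "1,3-4") can be illustrated with a small parser. This is only a sketch of one plausible interpretation, not the library's own parsing code: the `PageSelection` class and its `parse` method are hypothetical, and it assumes 1-based page numbers, a case-insensitive "all", inclusive ranges, and rejection of out-of-range pages.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class PageSelection {

    // Expands a selection such as "all", "1", "2-5" or "1,3-4" into
    // 1-based page numbers (assumed semantics, see the lead-in).
    static SortedSet<Integer> parse(String spec, int pageCount) {
        SortedSet<Integer> pages = new TreeSet<>();
        String trimmed = spec.trim();
        if (trimmed.equalsIgnoreCase("all")) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return pages;
        }
        for (String part : trimmed.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            // A token is either a single page ("3") or an inclusive range ("3-4").
            int first = Integer.parseInt(dash < 0 ? token : token.substring(0, dash).trim());
            int last = dash < 0 ? first : Integer.parseInt(token.substring(dash + 1).trim());
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("Invalid page range: " + token);
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(parse("1,3-4", 10)); // prints [1, 3, 4]
        System.out.println(parse("all", 3));    // prints [1, 2, 3]
    }
}
```

Returning a sorted set removes duplicates from overlapping ranges such as "1-3,2-4" and yields pages in document order.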