Class HybridParser

java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.HybridParser

public class HybridParser extends BaseParser

A high-level parser that tries multiple underlying strategies and returns the best table set for the requested page(s). Specifically, it runs:

  • StreamParser — text-position based parsing (good for digitally created PDFs)
  • LatticeParser — grid/line detection using OpenCV (good for ruled or scanned PDFs)
  • OcrStreamParser — OCR-backed stream parsing (good for image PDFs without text layer)
It then selects the winning result with a simple heuristic scoring function (see score(Table)).
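The selection step can be pictured with the sketch below. It is not the library's implementation: `SimpleTable`, the fill-ratio `score`, and `pickBest` are hypothetical stand-ins, because the actual score(Table) heuristic is not described on this page. The sketch only shows the general shape of "run several strategies, score each result, keep the best".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: keeps the candidate table list with the highest average score. */
final class BestOfStrategies {

    /** Hypothetical stand-in for a table: a grid of cell strings. */
    static final class SimpleTable {
        final List<List<String>> rows;
        SimpleTable(List<List<String>> rows) { this.rows = rows; }
    }

    /** Assumed heuristic (not the library's): fraction of non-blank cells, in [0, 1]. */
    static double score(SimpleTable table) {
        long total = 0;
        long filled = 0;
        for (List<String> row : table.rows) {
            for (String cell : row) {
                total++;
                if (cell != null && !cell.trim().isEmpty()) {
                    filled++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) filled / total;
    }

    /** Mean score of a candidate list; an empty list scores 0.0. */
    static double averageScore(List<SimpleTable> tables) {
        return tables.stream().mapToDouble(BestOfStrategies::score).average().orElse(0.0);
    }

    /** Returns the candidate with the highest average score; the first one wins ties. */
    static List<SimpleTable> pickBest(List<List<SimpleTable>> candidates) {
        List<SimpleTable> best = Collections.emptyList();
        double bestScore = -1.0;
        for (List<SimpleTable> candidate : candidates) {
            double s = averageScore(candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SimpleTable sparse = new SimpleTable(Arrays.asList(
                Arrays.asList("a", ""), Arrays.asList("", "")));
        SimpleTable dense = new SimpleTable(Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c", "d")));
        // Pretend these came from the stream and lattice strategies respectively.
        List<SimpleTable> winner = pickBest(Arrays.asList(
                Arrays.asList(sparse), Arrays.asList(dense)));
        System.out.println("winning average score: " + averageScore(winner));
    }
}
```

In this sketch a strategy that finds nothing scores 0.0, so it can only win when every other strategy also comes back empty.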

Usage


 List<Table> tables = new HybridParser("path/to/file.pdf")
     .dpi(300f)       // optional, helps scans
     .debug(true)     // optional, writes lattice/OCR debug artifacts
     .pages("all")    // "1", "2-5", "1,3-4", or "all"
     .parse();
 

Page selection contract

Inherits the BaseParser convention: if parsePage(int) is invoked with -1, the implementation must parse all pages. For any non-negative value, only that page is parsed. This class narrows its internal subparsers accordingly.

Thread-safety

Instances are not inherently thread-safe. Create one instance per input file or perform external synchronization if sharing across threads.
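One way to follow the "one instance per input file" advice is to build the parser inside each task, so that no parser object is ever shared between threads. The sketch below is an assumption-laden illustration: `OneParserPerFile` and `parseAll` are hypothetical names, and the `parseOne` function stands in for code that would construct a fresh parser and call it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustrative only: runs one independent parse task per input path. */
final class OneParserPerFile {

    /**
     * Applies parseOne to every path on a fixed-size thread pool and returns the
     * results in input order. parseOne is expected to create its own parser
     * instance on each call, so nothing mutable is shared across threads.
     */
    static <R> List<R> parseAll(List<String> paths, Function<String, R> parseOne, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String path : paths) {
                futures.add(pool.submit(() -> parseOne.apply(path)));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A parse task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in task: a real caller would construct a new parser for each path here.
        List<Integer> lengths = parseAll(List.of("a.pdf", "report.pdf"), String::length, 2);
        System.out.println(lengths);
    }
}
```

Because parse() declares IOException, a real `parseOne` would need to catch it and rethrow it unchecked, or return a result type that can carry the failure.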

Since:
2025
Author:
Mehuli Mukherjee
  • Constructor Details

    • HybridParser

      public HybridParser(String filepath)
      Creates a HybridParser for the given PDF file path.
      Parameters:
      filepath - path to the PDF file
    • HybridParser

      public HybridParser()
      Creates a HybridParser for in-memory processing. The PDF document must be passed to the parse(PDDocument) method.
  • Method Details

    • dpi

      public HybridParser dpi(float dpi)
      Sets the DPI for image-based parsing (used by the lattice and OCR strategies).
      Parameters:
      dpi - dots per inch used for rasterization (e.g., 300f for scans)
      Returns:
      this parser
    • debug

      public HybridParser debug(boolean on)
      Enables or disables debug outputs for lattice/OCR strategies.
      Parameters:
      on - true to enable, false to disable
      Returns:
      this parser
    • keepCells

      public HybridParser keepCells(boolean on)
      Sets whether empty cells are preserved when reconstructing grids (lattice strategy only).
      Parameters:
      on - true to keep empty cells
      Returns:
      this parser
    • debugDir

      public HybridParser debugDir(File dir)
      Sets the directory where debug artifacts are written (lattice and OCR strategies).
      Parameters:
      dir - destination directory
      Returns:
      this parser
    • minScore

      public HybridParser minScore(double score)
      Sets the minimum acceptable average score for a candidate list of tables. A list whose average score falls below this threshold is rejected.
      Parameters:
      score - minimum score in [0, 1]
      Returns:
      this parser
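The threshold rule can be sketched as a small gate over per-table scores. This is a minimal illustration under stated assumptions: `MinScoreGate` and `accepts` are hypothetical names, the range check is assumed rather than documented, and treating an empty list as an average of 0.0 is a choice made for the sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: accepts a candidate list when its mean score reaches the threshold. */
final class MinScoreGate {

    /**
     * Returns true when the average of tableScores is at least minScore.
     * Assumptions: scores lie in [0, 1]; an empty list averages to 0.0.
     */
    static boolean accepts(List<Double> tableScores, double minScore) {
        if (minScore < 0.0 || minScore > 1.0) {
            // Assumed validation; this page does not say how out-of-range values are handled.
            throw new IllegalArgumentException("minScore must be in [0, 1]: " + minScore);
        }
        double average = tableScores.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        return average >= minScore;
    }

    public static void main(String[] args) {
        System.out.println(accepts(Arrays.asList(0.9, 0.5), 0.6));
        System.out.println(accepts(Arrays.asList(0.2, 0.4), 0.6));
    }
}
```

Averaging means one strong table can carry a weak one past the threshold, which is a consequence of gating on the list rather than on each table.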
    • pages

      public BaseParser pages(String pages)
      Sets the page selection for this parser and propagates the same selection to all underlying strategies.
      Overrides:
      pages in class BaseParser
      Parameters:
      pages - page selection string (e.g., "all", "1", "2-5", "1,3-4")
      Returns:
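      this parser, typed as BaseParser; when chaining, call HybridParser-specific setters such as dpi(float) before pages(String)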
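To make the selection syntax concrete, the sketch below expands a selection string into page numbers. It is not the library's parser: `PageSpec` and `expand` are hypothetical, the pages are assumed to be 1-based as the examples suggest, and the error handling is a guess.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: expands "all", "1", "2-5", or "1,3-4" into sorted page numbers. */
final class PageSpec {

    /** Assumes 1-based page numbers and a known page count. */
    static List<Integer> expand(String spec, int pageCount) {
        TreeSet<Integer> pages = new TreeSet<>();
        String trimmed = spec.trim();
        if (trimmed.equalsIgnoreCase("all")) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return new ArrayList<>(pages);
        }
        for (String part : trimmed.split(",")) {
            // "3-4" yields two bounds; "1" yields a single bound used for both ends.
            String[] bounds = part.trim().split("-", 2);
            int first = Integer.parseInt(bounds[0].trim());
            int last = bounds.length == 2 ? Integer.parseInt(bounds[1].trim()) : first;
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("Bad page range: " + part.trim());
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return new ArrayList<>(pages);
    }

    public static void main(String[] args) {
        System.out.println(expand("1,3-4", 5));
        System.out.println(expand("all", 3));
    }
}
```

Collecting into a sorted set means overlapping ranges such as "1-3,2-4" produce each page once.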
    • stripText

      public HybridParser stripText(boolean strip)
      Enables or disables text normalization for stream-style extraction across all underlying strategies.
      Overrides:
      stripText in class BaseParser
      Parameters:
      strip - true to normalize/strip text, false to keep raw text
      Returns:
      this parser (for chaining)
    • parsePage

      protected List<Table> parsePage(int page) throws IOException
      Runs the stream, lattice, and OCR-backed stream strategies for the requested page(s) and returns the best-scoring set of tables.

      If page == -1, each strategy is run across all pages. Otherwise, each strategy is temporarily narrowed to the single requested page (restoring the original page spec afterward).

      Specified by:
      parsePage in class BaseParser
      Parameters:
      page - page index to parse, or -1 to parse all pages
      Returns:
      the winning list of Table objects (possibly empty)
      Throws:
      IOException - if an underlying parser fails
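The "narrow, run, restore" behaviour described for parsePage is commonly written with try/finally, so the original page spec comes back even when a strategy throws. The sketch below illustrates that pattern only: `Strategy`, its `pages` field, and `runOnPage` are hypothetical stand-ins, and the page value is passed through unchanged because this page does not state the index base.

```java
/** Illustrative only: temporarily overrides a strategy's page spec for a single run. */
final class NarrowAndRestore {

    /** Hypothetical stand-in for an underlying strategy with a mutable page spec. */
    static final class Strategy {
        String pages;
        Strategy(String pages) { this.pages = pages; }

        /** Reports which page spec was active during the run. */
        String run() {
            return "parsed pages=" + pages;
        }
    }

    /** Runs the strategy on one page, or on all pages when page == -1, then restores its spec. */
    static String runOnPage(Strategy strategy, int page) {
        String saved = strategy.pages;
        strategy.pages = (page == -1) ? "all" : String.valueOf(page);
        try {
            return strategy.run();
        } finally {
            // Restore the caller's selection whether run() returned or threw.
            strategy.pages = saved;
        }
    }

    public static void main(String[] args) {
        Strategy strategy = new Strategy("2-5");
        System.out.println(runOnPage(strategy, 3));
        System.out.println("spec afterwards: " + strategy.pages);
    }
}
```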
    • parse

      public List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
      Description copied from class: BaseParser
      Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
      Specified by:
      parse in class BaseParser
      Parameters:
      document - The PDDocument to parse.
      Returns:
      A list of extracted tables.
      Throws:
      IOException - for I/O issues during parsing.