Package com.extractpdf4j.parsers
Class HybridParser
java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.HybridParser
HybridParser
A high-level parser that tries multiple underlying strategies and returns the best table set for the requested page(s). Specifically, it runs:
- StreamParser: text-position based parsing (good for digitally created PDFs)
- LatticeParser: grid/line detection using OpenCV (good for ruled or scanned PDFs)
- OcrStreamParser: OCR-backed stream parsing (good for image PDFs without a text layer)
The best-scoring result set is returned (see score(Table)).
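The selection step can be pictured with a small, self-contained sketch. It does not use the library's classes: strategy names are plain strings, each table is reduced to a hypothetical score, and ranking candidates by average score is an assumption, not the documented behavior of score(Table).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: choose the candidate result whose tables score best on average. */
final class BestResultSketch {

    /** Average of the per-table scores; an empty result counts as 0 (assumption). */
    static double averageScore(List<Double> tableScores) {
        return tableScores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /**
     * Returns the name of the strategy with the highest average score.
     * Ties keep the earlier entry; an empty map yields null.
     */
    static String bestStrategy(Map<String, List<Double>> candidates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, List<Double>> entry : candidates.entrySet()) {
            double score = averageScore(entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Insertion order mirrors the order in which the strategies are listed above.
        Map<String, List<Double>> candidates = new LinkedHashMap<>();
        candidates.put("stream", List.of(0.4));
        candidates.put("lattice", List.of(0.9, 0.7));
        candidates.put("ocr", List.of());
        System.out.println(bestStrategy(candidates)); // prints "lattice"
    }
}
```

Keeping the comparison strict (`>`) means an earlier strategy wins a tie, which is one simple way to make the choice deterministic.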
Usage
List<Table> tables = new HybridParser("path/to/file.pdf")
    .dpi(300f)     // optional, helps scans
    .debug(true)   // optional, write lattice/ocr debug artifacts
    .pages("all")  // "1", "2-5", "1,3-4", or "all"
    .parse();
Page selection contract
Inherits the BaseParser convention: if parsePage(int) is invoked with
-1, the implementation must parse all pages. For any non-negative
value, only that page is parsed. This class narrows its internal subparsers accordingly.
Thread-safety
Instances are not inherently thread-safe. Create one instance per input file or perform external synchronization if sharing across threads.
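One way to follow this advice is to build a fresh parser inside each task instead of sharing one across threads. The sketch below is library-independent: `parseEach` and its `parseWithFreshParser` argument are hypothetical names, and the function passed in stands for whatever creates and runs a new parser for a single path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: one parser instance per input file, never shared between threads. */
final class PerFileParserSketch {

    /**
     * Runs parseWithFreshParser once per path on a fixed-size pool and returns the
     * results in input order. The function is expected to create its own parser.
     */
    static <R> List<R> parseEach(List<String> paths,
                                 Function<String, R> parseWithFreshParser,
                                 int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<R> task = () -> parseWithFreshParser.apply(path);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that file is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A parse task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In real use the function would presumably wrap something like `new HybridParser(path).parse()` and translate any checked exception it throws; that wiring is an assumption and is left out so the sketch stays self-contained.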
- Since:
- 2025
- Author:
- Mehuli Mukherjee
-
Field Summary
Fields inherited from class com.extractpdf4j.parsers.BaseParser
filepath, pages, stripText
Constructor Summary
Constructors
Constructor: Description
- HybridParser(): Creates a HybridParser for in-memory processing.
- HybridParser(String filepath): Creates a HybridParser for the given PDF file path.
Method Summary
Method: Description
- debug(boolean on): Enables or disables debug outputs for lattice/OCR strategies.
- debugDir(dir): Directory where debug artifacts should be written (lattice + OCR).
- dpi(float dpi): Sets DPI for image-based parsing (used by lattice + OCR strategies).
- keepCells(boolean on): Whether to preserve empty cells when reconstructing grids (lattice only).
- minScore(double score): Sets the minimum allowed average score across a list of tables.
- pages(String pages): Sets the page selection for this parser and propagates the same selection to all underlying strategies.
- parse(org.apache.pdfbox.pdmodel.PDDocument document): Parses a previously loaded PDF document.
- parsePage(int page): Runs stream, lattice, and OCR-backed stream for the requested page(s) and returns the best-scoring set of tables.
- stripText(boolean strip): Enables or disables text normalization for stream-style extraction across all underlying strategies.
Methods inherited from class com.extractpdf4j.parsers.BaseParser
finalizeResults, parse
-
Constructor Details
-
HybridParser
Creates a HybridParser for the given PDF file path.
- Parameters:
filepath - path to the PDF file
-
HybridParser
public HybridParser()
Creates a HybridParser for in-memory processing. The PDF document must be passed to the parse(PDDocument) method.
-
-
Method Details
-
dpi
Sets DPI for image-based parsing (used by lattice + OCR strategies).
- Parameters:
dpi - dots per inch used for rasterization (e.g., 300f for scans)
- Returns:
- this parser
-
debug
Enables or disables debug outputs for lattice/OCR strategies.
- Parameters:
on - true to enable, false to disable
- Returns:
- this parser
-
keepCells
Sets whether empty cells are preserved when reconstructing grids (lattice only).
- Parameters:
on - true to keep empty cells
- Returns:
- this parser
-
debugDir
Sets the directory where debug artifacts are written (lattice + OCR).
- Parameters:
dir - destination directory
- Returns:
- this parser
-
minScore
Sets the minimum allowed average score across a list of tables. If the list's average score is below this threshold, the list is rejected.
- Parameters:
score - minimum score in [0, 1]
- Returns:
- this parser
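The threshold rule can be written down as a short check. This is a minimal sketch, not the library's code: tables are reduced to hypothetical scores, and treating an empty list as having an average of 0 is an assumption.

```java
import java.util.List;

/** Hypothetical sketch of the minimum-average-score rule. */
final class MinScoreSketch {

    /**
     * Returns true when the average of tableScores is at least minScore.
     * A list whose average is below the threshold is rejected.
     */
    static boolean accept(List<Double> tableScores, double minScore) {
        if (minScore < 0.0 || minScore > 1.0) {
            throw new IllegalArgumentException("minScore must be in [0, 1]: " + minScore);
        }
        // Assumption: an empty list averages to 0, so it passes only when minScore is 0.
        double average = tableScores.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        return average >= minScore;
    }
}
```

Because the rule applies to the average, a single weak table does not by itself cause rejection when the other tables in the list score well.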
-
pages
Sets the page selection for this parser and propagates the same selection to all underlying strategies.
- Overrides:
pages in class BaseParser
- Parameters:
pages - page selection string (e.g., "all", "1", "2-5", "1,3-4")
- Returns:
- this parser
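To make the selection syntax concrete, the sketch below expands such a string into explicit page numbers. It is an illustration only: the class and method names are hypothetical, pages are assumed to be 1-based, and returning an empty list for "all" is a convention chosen for this example, not something the library documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: expand a page selection string into sorted page numbers. */
final class PageSpecSketch {

    /**
     * Accepts "all", a single page ("1"), a range ("2-5"), or a comma-separated
     * mix ("1,3-4"). Returns an empty list for "all" (meaning: no restriction).
     */
    static List<Integer> expand(String spec) {
        String trimmed = spec.trim();
        if (trimmed.equalsIgnoreCase("all")) {
            return List.of();
        }
        TreeSet<Integer> pages = new TreeSet<>(); // sorted, duplicates dropped
        for (String part : trimmed.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            int start = Integer.parseInt(dash < 0 ? token : token.substring(0, dash).trim());
            int end = dash < 0 ? start : Integer.parseInt(token.substring(dash + 1).trim());
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("Bad page range: " + token);
            }
            for (int page = start; page <= end; page++) {
                pages.add(page);
            }
        }
        return new ArrayList<>(pages);
    }
}
```

For example, `expand("1,3-4")` yields the pages 1, 3 and 4, and overlapping parts such as "2-4,3" collapse to 2, 3 and 4.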
-
stripText
Enables or disables text normalization for stream-style extraction across all underlying strategies.
- Overrides:
stripText in class BaseParser
- Parameters:
strip - true to normalize/strip text, false to keep raw text
- Returns:
- this parser (for chaining)
-
parsePage
Runs stream, lattice, and OCR-backed stream for the requested page(s) and returns the best-scoring set of tables.
If page == -1, each strategy is run across all pages. Otherwise, each strategy is temporarily narrowed to the single requested page (restoring the original page spec afterward).
- Specified by:
parsePage in class BaseParser
- Parameters:
page - page index to parse, or -1 to parse all pages
- Returns:
- the winning list of Table objects (possibly empty)
- Throws:
IOException - if an underlying parser fails
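The narrow-then-restore behavior described for parsePage is commonly written with try/finally so the original selection comes back even when parsing throws. The sketch below uses a stand-in `SubParser` class with a `pages` field; both are hypothetical, and the real strategies may store and restore their page spec differently.

```java
/** Hypothetical sketch: temporarily narrow a sub-parser to one page, then restore it. */
final class NarrowPagesSketch {

    /** Minimal stand-in for an underlying strategy that holds a page selection. */
    static final class SubParser {
        String pages = "all";

        String parse() {
            return "parsed:" + pages; // placeholder for real table extraction
        }
    }

    /**
     * Runs the sub-parser for a single page, or for all pages when page == -1,
     * and restores the original page selection afterward.
     */
    static String parseNarrowed(SubParser parser, int page) {
        String original = parser.pages;
        parser.pages = (page == -1) ? "all" : String.valueOf(page);
        try {
            return parser.parse();
        } finally {
            parser.pages = original; // restored even if parse() throws
        }
    }
}
```

Mutating and restoring shared state like this is also one reason an instance should not be used from several threads at once.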
-
parse
Description copied from class: BaseParser
Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
- Specified by:
parse in class BaseParser
- Parameters:
document - The PDDocument to parse.
- Returns:
- A list of extracted tables.
- Throws:
IOException - for I/O issues during parsing.
-