Package com.extractpdf4j.parsers
Class LatticeParser
java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.LatticeParser
LatticeParser
Detects table structure by rasterizing pages and finding horizontal/vertical ruling lines with OpenCV. Reconstructs a cell grid, maps PDF text into cells, and optionally runs OCR for sparsely filled cells.
Pipeline
- Render page to image at renderDpi.
- Binarize for line detection (adaptive threshold).
- Extract horizontal/vertical lines via morphology; project to get line positions.
- Build grid from line intersections; map PDF glyphs to cell coords.
- Fallback OCR for cells if text coverage is low.
- Emit Table with grid + row/column boundaries.
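The projection and cell-mapping steps of the pipeline can be illustrated in plain Java without OpenCV. This is a minimal sketch, not the library's implementation: the class LatticeGridSketch, its method names, and the minFill threshold are hypothetical, and it assumes the morphology step has already produced a binary mask in which only horizontal ruling-line pixels are set. The same projection applies to vertical lines on a transposed mask.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and not part of ExtractPDF4J. */
final class LatticeGridSketch {

    /**
     * Projects a binary line mask onto the vertical axis and returns the center
     * row of each run of consecutive rows whose "on" fraction reaches minFill.
     */
    static List<Integer> linePositions(boolean[][] mask, double minFill) {
        List<Integer> centers = new ArrayList<>();
        int runStart = -1;
        // Iterate one step past the last row so a run touching the edge is closed.
        for (int y = 0; y <= mask.length; y++) {
            boolean isLine = false;
            if (y < mask.length && mask[y].length > 0) {
                int on = 0;
                for (boolean px : mask[y]) {
                    if (px) on++;
                }
                isLine = on >= minFill * mask[y].length;
            }
            if (isLine && runStart < 0) {
                runStart = y;                          // a thick line starts here
            } else if (!isLine && runStart >= 0) {
                centers.add((runStart + y - 1) / 2);   // collapse the run to its center
                runStart = -1;
            }
        }
        return centers;
    }

    /**
     * Returns the index of the band containing coord, given sorted line positions,
     * or -1 if coord lies outside the grid. Band i spans [lines[i], lines[i + 1]).
     */
    static int bandIndex(int[] lines, double coord) {
        if (lines.length < 2 || coord < lines[0] || coord >= lines[lines.length - 1]) {
            return -1;
        }
        int i = 0;
        while (coord >= lines[i + 1]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        boolean[][] mask = new boolean[7][4];
        Arrays.fill(mask[1], true);   // rows 1-2 form one thick ruling line
        Arrays.fill(mask[2], true);
        Arrays.fill(mask[5], true);   // row 5 is a second ruling line
        System.out.println(linePositions(mask, 0.8));
        System.out.println(bandIndex(new int[] {0, 10, 20}, 12.5));
    }
}
```

A glyph's row and column would then be found by calling bandIndex once with the horizontal line positions and once with the vertical ones, after converting the glyph position to image pixels.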
Page indexing follows the BaseParser convention: this class expects
parsePage(1) for the first page; parsePage(-1) means “all pages”.
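To make the indexing convention concrete, the following sketch shows one way a 1-based page argument with a -1 "all pages" sentinel could be translated to zero-based indices, which is what PDFBox's PDDocument.getPage(int) expects. The class PageIndexSketch and the method resolvePageIndices are hypothetical, and rejecting 0 and out-of-range values is an assumption rather than documented behavior of LatticeParser.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and not part of ExtractPDF4J. */
final class PageIndexSketch {

    /** Sentinel meaning "all pages", as described for parsePage(-1). */
    static final int ALL_PAGES = -1;

    /**
     * Converts a 1-based page number (or ALL_PAGES) into zero-based page indices.
     * Assumption: any other value outside 1..pageCount is rejected.
     */
    static List<Integer> resolvePageIndices(int page, int pageCount) {
        List<Integer> indices = new ArrayList<>();
        if (page == ALL_PAGES) {
            for (int i = 0; i < pageCount; i++) {
                indices.add(i);
            }
            return indices;
        }
        if (page < 1 || page > pageCount) {
            throw new IllegalArgumentException(
                    "page must be -1 or in 1.." + pageCount + ", got " + page);
        }
        indices.add(page - 1);   // first page (1) maps to index 0
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(resolvePageIndices(1, 3));    // prints [0]
        System.out.println(resolvePageIndices(-1, 3));   // prints [0, 1, 2]
    }
}
```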
Field Summary
Fields inherited from class com.extractpdf4j.parsers.BaseParser
filepath, pages, stripText
Constructor Summary
Constructor - Description
LatticeParser() - Creates a LatticeParser for in-memory processing.
LatticeParser(String filepath)
Method Summary
Method - Description
debug(boolean on) - Toggle debug overlays/artifacts.
debugDir - Set debug artifact directory.
dpi(float dpi) - Set rasterization DPI.
keepCells(boolean on) - Keep empty cells in the final grid (useful for fixed layouts).
parse(org.apache.pdfbox.pdmodel.PDDocument document) - Parses a previously loaded PDF document.
parsePage(int page) - Deprecated. This method loads the document from disk on every call.
Methods inherited from class com.extractpdf4j.parsers.BaseParser
finalizeResults, pages, parse, stripText
Constructor Details
LatticeParser
LatticeParser(String filepath)
LatticeParser
public LatticeParser()
Creates a LatticeParser for in-memory processing. The PDF document must be passed to the parse() method.
Method Details
debug
Toggle debug overlays/artifacts.
keepCells
Keep empty cells in the final grid (useful for fixed layouts).
dpi
Set rasterization DPI.
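As a rough illustration of what the DPI setting implies, the sketch below converts PDF lengths and coordinates (in points, 72 per inch in default user space) to pixels in the rasterized image. The class DpiScaleSketch and its methods are hypothetical and not part of this library; the sketch assumes an unrotated page whose origin is at the bottom-left corner, whereas image rows count down from the top.

```java
/** Illustrative sketch only; this class is hypothetical and not part of ExtractPDF4J. */
final class DpiScaleSketch {

    /** PDF default user space uses 72 units (points) per inch. */
    static final float POINTS_PER_INCH = 72f;

    /** Pixel length of a span measured in points when rendered at the given DPI. */
    static int toPixels(float points, float dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Maps a PDF x coordinate (points from the left edge) to an image column. */
    static float toImageX(float pdfX, float dpi) {
        return pdfX * dpi / POINTS_PER_INCH;
    }

    /**
     * Maps a PDF y coordinate (points from the bottom edge, y pointing up)
     * to an image row (pixels from the top edge, y pointing down).
     */
    static float toImageY(float pdfY, float pageHeightPoints, float dpi) {
        return (pageHeightPoints - pdfY) * dpi / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        float dpi = 300f;
        // US Letter is 612 x 792 points.
        System.out.println(toPixels(612f, dpi) + " x " + toPixels(792f, dpi)); // prints 2550 x 3300
    }
}
```

Higher DPI values give thin ruling lines more pixels to survive thresholding, at the cost of larger images; a Letter page grows from 2550 x 3300 pixels at 300 DPI to 5100 x 6600 at 600 DPI.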
debugDir
Set debug artifact directory.
parsePage
Deprecated. This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
Description copied from class: BaseParser
Parses a single page or the entire document.
Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).
Specified by:
parsePage in class BaseParser
Parameters:
page - page index to parse, or -1 to parse all pages
Returns:
a list of Table objects extracted from the requested page(s) (possibly empty)
Throws:
IOException - if an error occurs while parsing
parse
Description copied from class: BaseParser
Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
Specified by:
parse in class BaseParser
Parameters:
document - The PDDocument to parse.
Returns:
A list of extracted tables.
Throws:
IOException - for I/O issues during parsing.