Package com.extractpdf4j.parsers
Class StreamParser
java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.StreamParser
StreamParser
Extracts tables from digitally generated PDFs by reading text positions via PDFBox and grouping glyphs into rows and columns. This strategy works best when a reliable text layer exists (non-scanned documents).
High-level steps
- Collect glyphs on the page using PDFBox (PDFTextStripper).
- Group glyphs into visual rows using Y proximity.
- Within each row, merge adjacent glyphs into word spans; sort by X.
- Infer column boundaries from persistent gaps across rows.
- Assign spans to columns to build a Table grid.
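The high-level steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the StreamParser source: the Span type, the method names groupRows, columnBoundaries, and toGrid, and the tolerance values are all hypothetical. The sketch starts from word spans rather than raw PDFBox glyphs, and it assumes Y grows downward so that sorting by Y yields rows from top to bottom.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not the StreamParser implementation. */
class StreamGridSketch {

    /** Hypothetical word span: text, left and right X, and a Y position. */
    static final class Span {
        final String text;
        final double x0, x1, y;
        Span(String text, double x0, double x1, double y) {
            this.text = text; this.x0 = x0; this.x1 = x1; this.y = y;
        }
    }

    /** Groups spans into rows: a span joins the current row if its Y is within yTol of the row's first span. */
    static List<List<Span>> groupRows(List<Span> spans, double yTol) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble((Span s) -> s.y));
        List<List<Span>> rows = new ArrayList<>();
        for (Span s : sorted) {
            List<Span> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - s.y) <= yTol) {
                last.add(s);
            } else {
                List<Span> row = new ArrayList<>();
                row.add(s);
                rows.add(row);
            }
        }
        // Within each row, order spans left to right.
        for (List<Span> row : rows) {
            row.sort(Comparator.comparingDouble((Span s) -> s.x0));
        }
        return rows;
    }

    /** Returns column boundaries: midpoints of X gaps at least minGap wide that no span in any row covers. */
    static List<Double> columnBoundaries(List<List<Span>> rows, double minGap) {
        List<Span> all = new ArrayList<>();
        for (List<Span> row : rows) all.addAll(row);
        all.sort(Comparator.comparingDouble((Span s) -> s.x0));
        List<Double> bounds = new ArrayList<>();
        double coveredTo = Double.NEGATIVE_INFINITY; // rightmost X covered so far
        for (Span s : all) {
            if (coveredTo != Double.NEGATIVE_INFINITY && s.x0 - coveredTo >= minGap) {
                bounds.add((coveredTo + s.x0) / 2.0);
            }
            coveredTo = Math.max(coveredTo, s.x1);
        }
        return bounds;
    }

    /** Builds a grid of cell text; each span goes to the column that contains its horizontal center. */
    static List<List<String>> toGrid(List<List<Span>> rows, List<Double> bounds) {
        List<List<String>> grid = new ArrayList<>();
        for (List<Span> row : rows) {
            String[] cells = new String[bounds.size() + 1];
            Arrays.fill(cells, "");
            for (Span s : row) {
                double center = (s.x0 + s.x1) / 2.0;
                int col = 0;
                while (col < bounds.size() && center > bounds.get(col)) col++;
                cells[col] = cells[col].isEmpty() ? s.text : cells[col] + " " + s.text;
            }
            grid.add(Arrays.asList(cells));
        }
        return grid;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
            new Span("Item", 50, 80, 100.0), new Span("Qty", 200, 220, 100.5),
            new Span("Bolt", 50, 78, 120.0), new Span("12", 205, 215, 119.8));
        List<List<Span>> rows = groupRows(spans, 2.0);
        System.out.println(toGrid(rows, columnBoundaries(rows, 10.0)));
    }
}
```

Treating a boundary as the midpoint of an X range that no span in any row covers is only one simple reading of "persistent gaps". A production parser may need to tolerate rows that cross a gap, such as a header spanning several columns.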
Field Summary
Fields inherited from class com.extractpdf4j.parsers.BaseParser
filepath, pages, stripText
Constructor Summary
Constructors
- StreamParser(): Creates a StreamParser for in-memory processing.
- StreamParser(String filepath)
Method Summary
Methods inherited from class com.extractpdf4j.parsers.BaseParser
finalizeResults, pages, parse, stripText
Constructor Details
StreamParser
-
StreamParser
public StreamParser()
Creates a StreamParser for in-memory processing. The PDF document must be passed to the parse(PDDocument) method.
Method Details
parsePage
Deprecated. This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
Description copied from class: BaseParser
Parses a single page or the entire document.
Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).
- Specified by: parsePage in class BaseParser
- Parameters: page - page index to parse, or -1 to parse all pages
- Returns: a list of Table objects extracted from the requested page(s) (possibly empty)
- Throws: IOException - if an error occurs while parsing
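The page == -1 contract can be illustrated with a small helper that turns the page argument into concrete page numbers. This is a sketch only: resolvePages and PageSelectionSketch are hypothetical names that do not appear in the library, and the 1-based numbering is an assumption, since the contract leaves the indexing convention to the implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; one way to honor the page == -1 contract. */
class PageSelectionSketch {

    /**
     * Resolves the page argument to the page numbers to parse.
     * Assumption: pages are numbered from 1; the real convention is implementation-defined.
     */
    static List<Integer> resolvePages(int page, int pageCount) {
        List<Integer> result = new ArrayList<>();
        if (page == -1) {
            // -1 selects the entire document.
            for (int p = 1; p <= pageCount; p++) {
                result.add(p);
            }
        } else if (page >= 1 && page <= pageCount) {
            // Any other valid value selects exactly one page.
            result.add(page);
        } else {
            throw new IllegalArgumentException("page out of range: " + page);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(resolvePages(-1, 3)); // all pages
        System.out.println(resolvePages(2, 3));  // a single page
    }
}
```

Under a 0-based convention the range check would be page >= 0 && page < pageCount instead; whichever is chosen should match what the concrete class documents.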
parse
Description copied from class: BaseParser
Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
- Specified by: parse in class BaseParser
- Parameters: document - The PDDocument to parse.
- Returns: A list of extracted tables.
- Throws: IOException - for I/O issues during parsing.