Class BaseParser
- Direct Known Subclasses:
HybridParser, LatticeParser, OcrStreamParser, StreamParser
Abstract base for all PDF table parsers in com.extractpdf4j.
Concrete implementations (e.g., StreamParser, LatticeParser,
OcrStreamParser, HybridParser) should extend this class and
implement parsePage(int).
Responsibilities
- Holds common configuration shared by all parsers (file path, page ranges, flags).
- Provides a final, high-level parse() that resolves the page selection and delegates work to parsePage(int).
Page selection contract
Page ranges are provided as a human-friendly string via pages(String).
The format supports values such as "1", "2-5", "1,3-4",
and "all". The helper PageRange.parse(..) converts this into
a list of integers. Implementations must honor the following convention:
- If parsePage(-1) is called, it indicates all pages should be parsed.
- Otherwise, parsePage(p) is called once per requested page number p.
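The selection format above can be illustrated with a minimal sketch. PageRangeSketch is a hypothetical stand-in written for this example and is not the library's PageRange; it assumes that "all" maps to a list containing only -1, that ranges are inclusive, and that the input is well-formed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PageRange; not the library's implementation. */
final class PageRangeSketch {
    private PageRangeSketch() {}

    /** Converts a selection such as "1,3-4" or "all" into a list of page numbers. */
    static List<Integer> parse(String spec) {
        List<Integer> pages = new ArrayList<>();
        String trimmed = spec.trim();
        if (trimmed.equals("all")) {
            pages.add(-1); // assumed sentinel: a list holding only -1 means "all pages"
            return pages;
        }
        for (String part : trimmed.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            if (dash < 0) {
                // Single page, e.g. "1"
                pages.add(Integer.parseInt(token));
            } else {
                // Inclusive range, e.g. "2-5"
                int start = Integer.parseInt(token.substring(0, dash).trim());
                int end = Integer.parseInt(token.substring(dash + 1).trim());
                for (int p = start; p <= end; p++) {
                    pages.add(p);
                }
            }
        }
        return pages;
    }
}
```

Under these assumptions, "1,3-4" yields the pages 1, 3, and 4, while "all" yields the single sentinel -1. Malformed input is not handled here and would raise a NumberFormatException.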
Thread-safety
Instances are not inherently thread-safe. Create one parser instance per input file or synchronize external access if you share state across threads.
- Since:
- 2025
- Author:
- Mehuli Mukherjee
-
Field Summary
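- filepath: Absolute or relative path to the PDF file being parsed.
- pages: Page selection string, defaulting to "1".
- stripText: Whether to normalize/strip text (e.g., trim, collapse whitespace) in stream-based extraction.
-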
Constructor Summary
- protected BaseParser(): Constructs a parser for in-memory processing.
- protected BaseParser(String filepath): Constructs a parser for the given PDF file.
-
Method Summary
- finalizeResults(List<Table> tables, String sourcePath): Normalizes parser output for "no tables" situations.
- pages(String pages): Sets the pages to parse.
- parse(): Parses the configured pages from the PDF file.
- parse(org.apache.pdfbox.pdmodel.PDDocument document): Parses a previously loaded PDF document.
- parsePage(int page): Parses a single page or the entire document.
- stripText(boolean strip): Enables or disables text normalization for stream-style extraction.
-
Field Details
-
filepath
Absolute or relative path to the PDF file being parsed. This may be null when processing in-memory documents.
-
pages
Page selection string, defaulting to "1". Accepts formats like "1", "2-5", "1,3-4", or "all".
-
stripText
protected boolean stripText
Whether to normalize/strip text (e.g., trim, collapse whitespace) in stream-based extraction. Implementations may choose how to interpret this flag.
-
-
Constructor Details
-
BaseParser
Constructs a parser for the given PDF file.
- Parameters:
- filepath - path to the PDF file
-
BaseParser
protected BaseParser()
Constructs a parser for in-memory processing. The filepath will be null.
-
-
Method Details
-
pages
Sets the pages to parse. See the class docs for supported formats.
- Parameters:
- pages - page selection string (e.g., "all", "1", "2-5", "1,3-4")
- Returns:
- this parser (for chaining)
-
stripText
Enables or disables text normalization for stream-style extraction. Implementations may ignore this flag if not applicable.
- Parameters:
- strip - true to normalize/strip text, false to keep raw text
- Returns:
- this parser (for chaining)
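Because pages(String) and stripText(boolean) both return the parser, configuration calls can be chained. The sketch below shows the pattern on a hypothetical FluentConfigSketch class that mirrors only the documented fields; it is not BaseParser itself, and the initial value of stripText is an assumption (Java's default of false).

```java
/** Hypothetical holder mirroring the documented fields and fluent setters. */
class FluentConfigSketch {
    protected String pages = "1";   // documented default selection
    protected boolean stripText;    // assumption: starts as false (Java default)

    /** Sets the page selection and returns this instance for chaining. */
    FluentConfigSketch pages(String pages) {
        this.pages = pages;
        return this;
    }

    /** Sets the text normalization flag and returns this instance for chaining. */
    FluentConfigSketch stripText(boolean strip) {
        this.stripText = strip;
        return this;
    }
}
```

With this shape, a call such as new FluentConfigSketch().pages("1,3-4").stripText(true) configures both options in one expression.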
-
parse
Parses the configured pages from the PDF file.
This method resolves the page selection via PageRange.parse(pages) and then delegates to parsePage(int). If the parsed list contains only -1, parsePage(int) is called with -1 to indicate "all pages". Otherwise, it is called once for each requested page number.
- Returns:
- a list of Table instances extracted from the requested pages (possibly empty)
- Throws:
- IOException - if reading the file fails or a parsing error occurs
- Since:
- 2025
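The delegation described for parse() follows the template-method pattern, which can be sketched roughly as follows. DispatchSketch and RecordingParser are hypothetical names, String stands in for Table, and the page list is passed in directly rather than resolved from a selection string; the real method may differ in its details.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical miniature of the parse()/parsePage(int) split; String stands in for Table. */
abstract class DispatchSketch {
    /** Subclasses supply the per-page work; -1 means the whole document. */
    protected abstract List<String> parsePage(int page);

    /** Final template method: decides how parsePage is called and collects the results. */
    public final List<String> parse(List<Integer> pageList) {
        List<String> out = new ArrayList<>();
        if (pageList.size() == 1 && pageList.get(0) == -1) {
            // A list holding only -1 is forwarded as a single "all pages" call
            out.addAll(parsePage(-1));
        } else {
            // Otherwise one call per requested page number
            for (int p : pageList) {
                out.addAll(parsePage(p));
            }
        }
        return out;
    }
}

/** Toy subclass that reports which call it received. */
final class RecordingParser extends DispatchSketch {
    @Override
    protected List<String> parsePage(int page) {
        return List.of(page == -1 ? "all pages" : "page " + page);
    }
}
```

Keeping parse() final means subclasses cannot change the dispatch rule and only need to implement the per-page extraction.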
-
parsePage
Parses a single page or the entire document.
Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).
- Parameters:
- page - page index to parse, or -1 to parse all pages
- Returns:
- a list of Table objects extracted from the requested page(s) (possibly empty)
- Throws:
- IOException - if an error occurs while parsing
- Since:
- 2025
-
finalizeResults
Normalizes parser output for "no tables" situations.
If tables is null or empty, logs a concise message and returns Collections.emptyList(). Otherwise, returns the input list unchanged.
- Parameters:
- tables - tables collected for the requested page(s)
- sourcePath - path to the input PDF (logging only)
- Returns:
- a non-null list of tables
- Since:
- 2025
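The normalization rule for finalizeResults can be sketched as below. FinalizeSketch is a hypothetical class, the method is generic so the example compiles without the library's Table type, and the use of java.util.logging is an assumption, since the source does not say which logging framework is used.

```java
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch of the "no tables" normalization; not the library's code. */
final class FinalizeSketch {
    // Assumption: java.util.logging stands in for whatever logger the library uses
    private static final Logger LOG = Logger.getLogger(FinalizeSketch.class.getName());

    private FinalizeSketch() {}

    /** Returns an empty list for null or empty input, otherwise the same list instance. */
    static <T> List<T> finalizeResults(List<T> tables, String sourcePath) {
        if (tables == null || tables.isEmpty()) {
            LOG.info("No tables found in " + sourcePath);
            return Collections.emptyList();
        }
        return tables;
    }
}
```

Callers can then iterate over the result without a null check, which is the point of returning a non-null list.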
-
parse
Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
- Parameters:
- document - the PDDocument to parse
- Returns:
- a list of extracted tables
- Throws:
- IOException - for I/O issues during parsing
-