Package com.extractpdf4j.parsers
Class OcrStreamParser
java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.OcrStreamParser
OcrStreamParser (header-aware):
- Removes horizontal *and* vertical rules before OCR.
- Uses Tesseract TSV to read words.
- Anchors column boundaries on the table header ("Date", "Description", "Debit", "Credit", "Balance")
  via fuzzy matching; falls back to a histogram of mid-gaps when the header cannot be confidently detected.
- Normalizes numeric/date columns.
This version is a drop-in replacement for the original OcrStreamParser.
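The fuzzy header matching mentioned in the description can be illustrated with a minimal, self-contained sketch. It is not the parser's actual implementation: the class name HeaderMatchSketch, the use of Levenshtein edit distance, and the tolerance of roughly one error per four characters are all assumptions chosen for illustration. Only the five header labels come from the description above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration; not part of com.extractpdf4j.
final class HeaderMatchSketch {
    // Header labels named in the class description.
    static final List<String> HEADERS =
            List.of("Date", "Description", "Debit", "Credit", "Balance");

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the header an OCR word most plausibly represents, if any. */
    static Optional<String> matchHeader(String ocrWord) {
        String word = ocrWord.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String header : HEADERS) {
            int d = editDistance(word, header.toLowerCase(Locale.ROOT));
            if (d < bestDist) {
                bestDist = d;
                best = header;
            }
        }
        // Assumed tolerance: about one OCR error per four characters, at least one.
        int allowed = Math.max(1, best.length() / 4);
        return bestDist <= allowed ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(matchHeader("Descripton")); // one missing letter
        System.out.println(matchHeader("Ba1ance"));    // digit misread for a letter
        System.out.println(matchHeader("Total"));      // not a known header
    }
}
```

In a sketch like this, a word that matches no header within the tolerance yields an empty result, which is the situation in which the description says the parser falls back to the gap histogram.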
-
Field Summary
Fields inherited from class com.extractpdf4j.parsers.BaseParser
filepath, pages, stripText
-
Constructor Summary
Constructors
Constructor                         Description
OcrStreamParser()                   Creates an OcrStreamParser for in-memory processing.
OcrStreamParser(String filepath)
-
Method Summary
Methods inherited from class com.extractpdf4j.parsers.BaseParser
finalizeResults, pages, parse, stripText
-
Constructor Details
-
OcrStreamParser
-
OcrStreamParser
public OcrStreamParser()
Creates an OcrStreamParser for in-memory processing. The PDF document must be passed to the parse() method.
-
-
Method Details
-
dpi
-
debug
-
debugDir
-
requiredHeaders
-
parsePage
Deprecated.
This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
Description copied from class: BaseParser
Parses a single page or the entire document.
Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).
Specified by:
parsePage in class BaseParser
Parameters:
page - page index to parse, or -1 to parse all pages
Returns:
a list of Table objects extracted from the requested page(s) (possibly empty)
Throws:
IOException - if an error occurs while parsing
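The page-selection contract above can be sketched as a small helper that expands the page argument into the list of pages to process. This is an illustration only: the class PageSelectionSketch and the method resolvePages are hypothetical names, and the 1-based indexing is an assumption, since the contract leaves the base to the implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of com.extractpdf4j.
final class PageSelectionSketch {
    /** Sentinel from the BaseParser contract: parse the entire document. */
    static final int ALL_PAGES = -1;

    /**
     * Expands the page argument into the page numbers to process.
     * Assumption: pages are 1-based here; the contract does not fix the base.
     */
    static List<Integer> resolvePages(int page, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        if (page == ALL_PAGES) {
            // -1 selects every page in document order.
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
        } else if (page >= 1 && page <= pageCount) {
            // Any other valid value selects exactly one page.
            pages.add(page);
        } else {
            throw new IllegalArgumentException("page out of range: " + page);
        }
        return pages;
    }
}
```

Under these assumptions, resolvePages(-1, 3) selects pages 1 through 3, while resolvePages(2, 3) selects only page 2.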
-
parse
Description copied from class: BaseParser
Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
Specified by:
parse in class BaseParser
Parameters:
document - The PDDocument to parse.
Returns:
A list of extracted tables.
Throws:
IOException - for I/O issues during parsing.
-