com.extractpdf4j.parsers.OcrStreamParser

public class OcrStreamParser extends BaseParser

OcrStreamParser (header-aware):
- Removes horizontal and vertical rules before OCR.
- Uses Tesseract TSV output to read words.
- Anchors column boundaries using the table header ("Date", "Description", "Debit", "Credit", "Balance") via fuzzy matching; falls back to a histogram of mid-gaps when the header cannot be confidently detected.
- Normalizes numeric and date columns.

This version is a drop-in replacement for the original OcrStreamParser.
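
The header-anchoring step described above can be illustrated with a small, standalone sketch. It is not the OcrStreamParser implementation: the class HeaderAnchorSketch, the Word type, the Levenshtein-based matcher, and the distance threshold are all assumptions chosen for illustration. It matches OCR tokens against the expected header names with an edit-distance tolerance, then places column boundaries at the midpoints of the gaps between adjacent matched header words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; not the OcrStreamParser implementation. */
class HeaderAnchorSketch {

    /** Hypothetical OCR word: text plus left/right x-coordinates in pixels. */
    static final class Word {
        final String text;
        final double left;
        final double right;

        Word(String text, double left, double right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /** Header names taken from the class description. */
    static final List<String> HEADERS =
            Arrays.asList("Date", "Description", "Debit", "Credit", "Balance");

    /** Levenshtein edit distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the expected header closest to the token, or null if none is close enough. */
    static String matchHeader(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String h : HEADERS) {
            int d = editDistance(t, h.toLowerCase(Locale.ROOT));
            if (d < bestDist) {
                bestDist = d;
                best = h;
            }
        }
        // Assumed threshold: roughly one OCR error per four header characters.
        return (best != null && bestDist <= Math.max(1, best.length() / 4)) ? best : null;
    }

    /** Column boundaries as midpoints of the gaps between adjacent matched header words. */
    static List<Double> boundaries(List<Word> headerRow) {
        List<Word> matched = new ArrayList<>();
        for (Word w : headerRow) {
            if (matchHeader(w.text) != null) {
                matched.add(w);
            }
        }
        matched.sort((x, y) -> Double.compare(x.left, y.left));
        List<Double> cuts = new ArrayList<>();
        for (int i = 0; i + 1 < matched.size(); i++) {
            cuts.add((matched.get(i).right + matched.get(i + 1).left) / 2.0);
        }
        return cuts;
    }
}
```

In this sketch a misread token such as "Descripton" still resolves to "Description", while an unrelated word such as "Total" is rejected. A real parser would also need the fallback path for pages where too few headers match.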

Field Summary

Fields inherited from class com.extractpdf4j.parsers.BaseParser
filepath, pages, stripText

Constructor Summary

Constructors

Constructor

Description

OcrStreamParser()

Creates an OcrStreamParser for in-memory processing.

OcrStreamParser(String filepath)

Method Summary

Modifier and Type

Method

Description

OcrStreamParser

debug(boolean on)

OcrStreamParser

debugDir(File dir)

OcrStreamParser

dpi(float dpi)

List<Table>

parse(org.apache.pdfbox.pdmodel.PDDocument document)

Parses a previously loaded PDF document.

protected List<Table>

parsePage(int page)

Deprecated.
This method loads the document from disk on every call.

OcrStreamParser

requiredHeaders(List<String> headers)

Methods inherited from class com.extractpdf4j.parsers.BaseParser
finalizeResults, pages, parse, stripText

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- OcrStreamParser
  
  public OcrStreamParser(String filepath)
- OcrStreamParser
  
  public OcrStreamParser()
  
  Creates an OcrStreamParser for in-memory processing. The PDF document must be passed to the parse() method.

Method Details
- dpi
  
  public OcrStreamParser dpi(float dpi)
- debug
  
  public OcrStreamParser debug(boolean on)
- debugDir
  
  public OcrStreamParser debugDir(File dir)
- requiredHeaders
  
  public OcrStreamParser requiredHeaders(List<String> headers)
- parsePage
  
  @Deprecated protected List<Table> parsePage(int page) throws IOException
  
  Deprecated.
  This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
  
  Description copied from class: BaseParser
  
  Parses a single page or the entire document.
  Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).
  
  Specified by:
  
  parsePage in class BaseParser
  
  Parameters:
  
  page - page index to parse, or -1 to parse all pages
  
  Returns:
  
  a list of Table objects extracted from the requested page(s) (possibly empty)
  
  Throws:
  
  IOException - if an error occurs while parsing
- parse
  
  public List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
  
  Description copied from class: BaseParser
  
  Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
  
  Specified by:
  
  parse in class BaseParser
  
  Parameters:
  
  document - The PDDocument to parse.
  
  Returns:
  
  A list of extracted tables.
  
  Throws:
  
  IOException - for I/O issues during parsing.
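
The parsePage contract above (-1 selects the entire document, any other value selects a single page) can be sketched as a small page-resolution helper. This is a hypothetical illustration, not library code: the class PageContractSketch and the method resolvePages are invented names, and the sketch assumes 1-based page numbers, which the contract leaves implementation-defined.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the class and method names are hypothetical. */
class PageContractSketch {

    /**
     * Resolves a parsePage-style argument to the concrete page numbers to process.
     * Assumes 1-based page numbers; the documented contract does not fix the base.
     */
    static List<Integer> resolvePages(int page, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        if (page == -1) {
            // -1 means "parse the entire document".
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return pages;
        }
        if (page < 1 || page > pageCount) {
            // Under the 1-based assumption, 0 and out-of-range values are rejected.
            throw new IllegalArgumentException("page out of range: " + page);
        }
        pages.add(page);
        return pages;
    }
}
```

Rejecting out-of-range values early is a design choice of this sketch; a concrete parser might instead return an empty list.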
