Class OcrStreamParser

java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.OcrStreamParser

public class OcrStreamParser extends BaseParser
OcrStreamParser (header-aware):
  - Removes both horizontal and vertical rules before OCR.
  - Uses Tesseract TSV output to read words.
  - Anchors column boundaries using the table header ("Date", "Description", "Debit", "Credit", "Balance") via fuzzy matching; falls back to a histogram of mid-gaps when the header cannot be confidently detected.
  - Normalizes numeric and date columns.
This version is a drop-in replacement for the original OcrStreamParser.
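The fuzzy header matching mentioned in the class description can be illustrated with a small, library-independent sketch. It is not the parser's actual implementation: the class name HeaderMatchSketch, the method names, the use of Levenshtein edit distance, and the maxDistance threshold are all assumptions chosen for illustration. Only the five header labels come from the description.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: match a noisy OCR word against the expected header labels.
public class HeaderMatchSketch {
    // Header labels taken from the class description.
    public static final List<String> HEADERS =
            List.of("Date", "Description", "Debit", "Credit", "Balance");

    // Levenshtein edit distance using two rolling rows of the DP table.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest header label, or null when none is within maxDistance edits.
    // maxDistance is an assumed tuning parameter, not a documented setting.
    public static String matchHeader(String ocrWord, int maxDistance) {
        String word = ocrWord.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String header : HEADERS) {
            int d = editDistance(word, header.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = header;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        System.out.println(matchHeader("Ba1ance", 2));    // prints "Balance"
        System.out.println(matchHeader("Descripton", 2)); // prints "Description"
        System.out.println(matchHeader("Total", 2));      // prints "null"
    }
}
```

A word that matches no header within the threshold returns null, which corresponds to the case where the header "cannot be confidently detected" and a fallback strategy would take over.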
  • Constructor Details

    • OcrStreamParser

      public OcrStreamParser(String filepath)
      Creates an OcrStreamParser for the PDF file at the given path.
    • OcrStreamParser

      public OcrStreamParser()
      Creates an OcrStreamParser for in-memory processing. The PDF document must be passed to parse(PDDocument).
  • Method Details

    • dpi

      public OcrStreamParser dpi(float dpi)
    • debug

      public OcrStreamParser debug(boolean on)
    • debugDir

      public OcrStreamParser debugDir(File dir)
    • requiredHeaders

      public OcrStreamParser requiredHeaders(List<String> headers)
    • parsePage

      @Deprecated protected List<Table> parsePage(int page) throws IOException
      Deprecated.
      This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
      Description copied from class: BaseParser
      Parses a single page or the entire document.

      Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).

      Specified by:
      parsePage in class BaseParser
      Parameters:
      page - page index to parse, or -1 to parse all pages
      Returns:
      a list of Table objects extracted from the requested page(s) (possibly empty)
      Throws:
      IOException - if an error occurs while parsing
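The page-selection contract above (-1 means the entire document, any other valid value means one page) can be sketched without the library. This is a hypothetical helper, not part of BaseParser: the names PageSelectionSketch, ALL_PAGES, and pagesToParse are invented for illustration, and 1-based page numbering is an assumption, since the contract leaves the base to the concrete class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "page == -1 means all pages" contract.
public class PageSelectionSketch {
    // Sentinel value described in the contract; the constant name is an assumption.
    public static final int ALL_PAGES = -1;

    // Resolves the page argument to a list of page numbers.
    // Assumption: pages are 1-based, so valid single pages are 1..pageCount.
    public static List<Integer> pagesToParse(int page, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        if (page == ALL_PAGES) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
        } else if (page >= 1 && page <= pageCount) {
            pages.add(page);
        } else {
            throw new IllegalArgumentException(
                    "page must be -1 or in 1.." + pageCount + ", got " + page);
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(pagesToParse(ALL_PAGES, 3)); // prints "[1, 2, 3]"
        System.out.println(pagesToParse(2, 3));         // prints "[2]"
    }
}
```

Rejecting out-of-range values with an exception is a design choice of this sketch; a concrete parser could equally return an empty list, which the documented return value permits.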
    • parse

      public List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
      Description copied from class: BaseParser
      Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
      Specified by:
      parse in class BaseParser
      Parameters:
      document - The PDDocument to parse.
      Returns:
      A list of extracted tables.
      Throws:
      IOException - for I/O issues during parsing.