Class StreamParser

java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.StreamParser

public class StreamParser extends BaseParser

Extracts tables from digitally generated PDFs by reading text positions via PDFBox and grouping glyphs into rows and columns. This strategy works best when a reliable text layer exists (non-scanned documents).

High-level steps

  1. Collect glyphs on the page using PDFBox (PDFTextStripper).
  2. Group glyphs into visual rows using Y proximity.
  3. Within each row, merge adjacent glyphs into word spans; sort by X.
  4. Infer column boundaries from persistent gaps across rows.
  5. Assign spans to columns to build a Table grid.
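
Steps 2 to 5 can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the Span type, the method names, and the yTolerance and minGap thresholds are all hypothetical, and the input is assumed to be word spans that have already been merged from glyphs (the merging part of step 3). A "persistent gap" is approximated here as a horizontal interval that no span in any row overlaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: hypothetical names, not part of extractpdf4j. */
final class StreamGroupingSketch {

    /** A word with its left edge (x), baseline (y), and width in page units. */
    static final class Span {
        final String text;
        final double x, y, width;
        Span(String text, double x, double y, double width) {
            this.text = text; this.x = x; this.y = y; this.width = width;
        }
        double right() { return x + width; }
    }

    /** Step 2 (plus the sort in step 3): rows by Y proximity, each sorted by X. */
    static List<List<Span>> groupIntoRows(List<Span> spans, double yTolerance) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble((Span s) -> s.y));
        List<List<Span>> rows = new ArrayList<>();
        double rowY = 0;
        for (Span s : sorted) {
            // Start a new row when this span is too far from the row's first span.
            if (rows.isEmpty() || Math.abs(s.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = s.y;
            }
            rows.get(rows.size() - 1).add(s);
        }
        for (List<Span> row : rows) {
            row.sort(Comparator.comparingDouble((Span s) -> s.x));
        }
        return rows;
    }

    /** Step 4: midpoints of horizontal gaps (>= minGap) that no span overlaps. */
    static List<Double> inferColumnBoundaries(List<List<Span>> rows, double minGap) {
        List<Span> all = new ArrayList<>();
        for (List<Span> row : rows) all.addAll(row);
        all.sort(Comparator.comparingDouble((Span s) -> s.x));
        List<Double> boundaries = new ArrayList<>();
        boolean first = true;
        double coveredRight = 0; // right edge of the text swept so far
        for (Span s : all) {
            if (!first && s.x - coveredRight >= minGap) {
                boundaries.add((coveredRight + s.x) / 2.0);
            }
            coveredRight = first ? s.right() : Math.max(coveredRight, s.right());
            first = false;
        }
        return boundaries;
    }

    /** Step 5: place each span in the column its left edge falls into. */
    static List<List<String>> buildGrid(List<List<Span>> rows, List<Double> boundaries) {
        List<List<String>> grid = new ArrayList<>();
        for (List<Span> row : rows) {
            String[] cells = new String[boundaries.size() + 1];
            Arrays.fill(cells, "");
            for (Span s : row) {
                int col = 0;
                while (col < boundaries.size() && s.x >= boundaries.get(col)) col++;
                cells[col] = cells[col].isEmpty() ? s.text : cells[col] + " " + s.text;
            }
            grid.add(Arrays.asList(cells));
        }
        return grid;
    }
}
```

Projecting every span onto the X axis and looking for uncovered intervals is one simple way to find gaps that persist across rows. It fails when a single cell (for example a wide title) straddles a column gap, so a real parser would likely need a more tolerant rule.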

  • Constructor Details

    • StreamParser

      public StreamParser(String filepath)
      Creates a StreamParser for the PDF file at the given path.
    • StreamParser

      public StreamParser()
      Creates a StreamParser for in-memory processing. The PDF document must be passed to the parse(PDDocument) method.
  • Method Details

    • parsePage

      @Deprecated protected List<Table> parsePage(int page) throws IOException
      Deprecated.
      This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
      Description copied from class: BaseParser
      Parses a single page or the entire document.

      Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the page at that index. Whether indices are 1-based or 0-based is implementation-defined, but the choice should be consistent across the codebase and documented in concrete classes.

      Specified by:
      parsePage in class BaseParser
      Parameters:
      page - page index to parse, or -1 to parse all pages
      Returns:
      a list of Table objects extracted from the requested page(s) (possibly empty)
      Throws:
      IOException - if an error occurs while parsing
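
The page contract above can be sketched as a small helper that turns the page argument into the list of page indices to process. This is a hypothetical illustration, not code from BaseParser or StreamParser: the class and method names are invented, and 0-based indexing is assumed purely for the example, since the contract leaves the base to the implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of extractpdf4j. */
final class PageSelectionSketch {

    /** Sentinel from the contract: parse every page. */
    static final int ALL_PAGES = -1;

    /**
     * Resolves the page argument to concrete page indices.
     * Assumes 0-based indices; a 1-based implementation would shift the range.
     */
    static List<Integer> resolvePages(int page, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        if (page == ALL_PAGES) {
            for (int p = 0; p < pageCount; p++) {
                pages.add(p);
            }
        } else if (page >= 0 && page < pageCount) {
            pages.add(page);
        } else {
            throw new IllegalArgumentException(
                "page " + page + " is out of range for " + pageCount + " pages");
        }
        return pages;
    }
}
```

Rejecting out-of-range values with an exception is a design choice of this sketch; the contract itself only specifies the behavior for -1 and for non-negative values.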
    • parse

      public List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
      Description copied from class: BaseParser
      Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
      Specified by:
      parse in class BaseParser
      Parameters:
      document - The PDDocument to parse.
      Returns:
      A list of extracted tables.
      Throws:
      IOException - if an I/O error occurs during parsing.