com.extractpdf4j.parsers.StreamParser

public class StreamParser extends BaseParser

StreamParser

Extracts tables from digitally generated PDFs by reading text positions via PDFBox and grouping glyphs into rows and columns. This strategy works best when a reliable text layer exists (non-scanned documents).

High-level steps

1. Collect glyphs on the page using PDFBox (PDFTextStripper).
2. Group glyphs into visual rows using Y proximity.
3. Within each row, merge adjacent glyphs into word spans; sort by X.
4. Infer column boundaries from persistent gaps across rows.
5. Assign spans to columns to build a Table grid.
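
The row-grouping, column-inference, and assignment steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the library's actual implementation: the `StreamSketch` class, the `Span` type, and the `yTol` and `minGap` thresholds are hypothetical names, glyphs are assumed to be already merged into word spans, and a column boundary is assumed to be any horizontal gap of at least `minGap` that no span on any row crosses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not the library's actual implementation. */
final class StreamSketch {

    /** Hypothetical word span: text plus left/right X extent and a Y position. */
    static final class Span {
        final String text;
        final double x0, x1, y;

        Span(String text, double x0, double x1, double y) {
            this.text = text; this.x0 = x0; this.x1 = x1; this.y = y;
        }
    }

    /** Groups spans into rows: a span joins the current row if its Y is within yTol of the row's first span. */
    static List<List<Span>> groupRows(List<Span> spans, double yTol) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble((Span s) -> s.y));
        List<List<Span>> rows = new ArrayList<>();
        double rowY = 0;
        for (Span s : sorted) {
            if (rows.isEmpty() || Math.abs(s.y - rowY) > yTol) {
                rows.add(new ArrayList<>());
                rowY = s.y;
            }
            rows.get(rows.size() - 1).add(s);
        }
        for (List<Span> row : rows) {
            row.sort(Comparator.comparingDouble((Span s) -> s.x0));
        }
        return rows;
    }

    /** Finds X positions where a gap of at least minGap is left uncovered by every span on every row. */
    static List<Double> columnBoundaries(List<List<Span>> rows, double minGap) {
        List<Span> all = new ArrayList<>();
        for (List<Span> row : rows) all.addAll(row);
        all.sort(Comparator.comparingDouble((Span s) -> s.x0));
        List<Double> cuts = new ArrayList<>();
        boolean first = true;
        double coveredTo = 0;
        for (Span s : all) {
            if (!first && s.x0 - coveredTo >= minGap) {
                cuts.add((coveredTo + s.x0) / 2); // cut at the middle of the gap
            }
            coveredTo = first ? s.x1 : Math.max(coveredTo, s.x1);
            first = false;
        }
        return cuts;
    }

    /** Places each span in the column that contains the span's horizontal midpoint. */
    static List<List<String>> toGrid(List<List<Span>> rows, List<Double> cuts) {
        List<List<String>> grid = new ArrayList<>();
        for (List<Span> row : rows) {
            String[] cells = new String[cuts.size() + 1];
            Arrays.fill(cells, "");
            for (Span s : row) {
                double mid = (s.x0 + s.x1) / 2;
                int col = 0;
                while (col < cuts.size() && mid > cuts.get(col)) col++;
                cells[col] = cells[col].isEmpty() ? s.text : cells[col] + " " + s.text;
            }
            grid.add(Arrays.asList(cells));
        }
        return grid;
    }

    /** Made-up sample data: two rows of three cells with slightly jittered Y values. */
    static List<Span> sample() {
        return Arrays.asList(
            new Span("Name", 10, 40, 100.0), new Span("Qty", 100, 120, 100.8),
            new Span("Price", 180, 210, 100.0), new Span("Widget", 10, 50, 115.0),
            new Span("2", 110, 115, 115.5), new Span("9.99", 185, 210, 115.0));
    }

    static List<List<String>> demo() {
        List<List<Span>> rows = groupRows(sample(), 2.0);
        return toGrid(rows, columnBoundaries(rows, 20.0));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[Name, Qty, Price], [Widget, 2, 9.99]]
    }
}
```

Merging all spans into one sweep along X is only one way to find gaps that persist across rows. It is simple, but a single wide cell spanning two columns would hide the gap between them, so a real parser may need a more tolerant rule.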

Field Summary

Fields inherited from class com.extractpdf4j.parsers.BaseParser
filepath, pages, stripText

Constructor Summary

Constructors

Constructor

Description

StreamParser()

Creates a StreamParser for in-memory processing.

StreamParser(String filepath)

Creates a StreamParser for the PDF file at the given path.

Method Summary

Modifier and Type

Method

Description

List<Table>

parse(org.apache.pdfbox.pdmodel.PDDocument document)

Parses a previously loaded PDF document.

protected List<Table>

parsePage(int page)

Deprecated.
This method loads the document from disk on every call.

Methods inherited from class com.extractpdf4j.parsers.BaseParser
finalizeResults, pages, parse, stripText

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- StreamParser
  
  public StreamParser(String filepath)
- StreamParser
  
  public StreamParser()
  
  Creates a StreamParser for in-memory processing. No file path is set, so the loaded PDDocument must be passed to parse(PDDocument).

Method Details
- parsePage
  
  @Deprecated protected List<Table> parsePage(int page) throws IOException
  
  Deprecated.
  This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
  
  Description copied from class: BaseParser
  
  Parses a single page or the entire document.
  Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the page at that index. Whether the index is 1-based or 0-based is implementation-defined, but it should be consistent across the codebase and documented in concrete classes.
  
  Specified by:
  
  parsePage in class BaseParser
  
  Parameters:
  
  page - page index to parse, or -1 to parse all pages
  
  Returns:
  
  a list of Table objects extracted from the requested page(s) (possibly empty)
  
  Throws:
  
  IOException - if an error occurs while parsing
- parse
  
  public List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
  
  Description copied from class: BaseParser
  
  Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
  
  Specified by:
  
  parse in class BaseParser
  
  Parameters:
  
  document - The PDDocument to parse.
  
  Returns:
  
  A list of extracted tables.
  
  Throws:
  
  IOException - for I/O issues during parsing.
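
The page argument contract described under parsePage (-1 selects every page, any other value selects a single page) can be illustrated with a small helper. This is a hedged sketch, not code from the library: the `PageSelection` class and its `resolve` method are hypothetical, and 1-based page numbers are an assumption, since the contract leaves the index base to the concrete class.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and the 1-based convention are assumptions. */
final class PageSelection {

    /** Sentinel from the parsePage contract: parse the whole document. */
    static final int ALL_PAGES = -1;

    /** Returns the page numbers to visit for a document with pageCount pages. */
    static List<Integer> resolve(int page, int pageCount) {
        List<Integer> selected = new ArrayList<>();
        if (page == ALL_PAGES) {
            for (int p = 1; p <= pageCount; p++) {
                selected.add(p);
            }
        } else if (page >= 1 && page <= pageCount) {
            selected.add(page);
        } else {
            // Assumed behavior: reject values outside the document instead of guessing.
            throw new IllegalArgumentException("page out of range: " + page);
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(resolve(ALL_PAGES, 3)); // prints [1, 2, 3]
        System.out.println(resolve(2, 3));         // prints [2]
    }
}
```

Resolving the page list once, before any extraction work, keeps the sentinel check in one place so the per-page logic only ever sees real page numbers.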
