Stream Parser

StreamParser is designed for text-based PDFs where the document already contains a usable text layer.

Instead of relying on OCR or visible table borders, it uses the positions of text elements to infer rows and columns.

When to use StreamParser

Use StreamParser when:

the PDF is digitally generated
you can highlight or copy text in a PDF viewer
table borders are absent or inconsistent
the structure is implied by alignment rather than visible grid lines

Common examples:

bank statements
generated financial reports
system-exported tables
machine-produced invoices

How it works

At a high level:

Read the PDF text layer
Collect text positions from the page
Group nearby text into rows
Infer column boundaries from alignment and spacing
Build table cells
Return List<Table>
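
The grouping steps above can be sketched in plain Java. This is a minimal illustration, not ExtractPDF4J's internal code: the `Word` record, the method names, and the `yTolerance` and `xGap` thresholds are all hypothetical, and the sketch assumes a top-left origin where `y` grows down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of position-based grouping; not the library's implementation.
class StreamGroupingSketch {

    // A positioned text fragment: x is the left edge, y is the baseline (assumed to grow downward).
    record Word(String text, double x, double y) {}

    // Group words into rows: a word joins the current row if its baseline is
    // within yTolerance of the first word in that row.
    static List<List<Word>> groupRows(List<Word> words, double yTolerance) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble(Word::y));
        List<List<Word>> rows = new ArrayList<>();
        for (Word w : sorted) {
            List<Word> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(w.y() - last.get(0).y()) <= yTolerance) {
                last.add(w);
            } else {
                List<Word> row = new ArrayList<>();
                row.add(w);
                rows.add(row);
            }
        }
        // Order each row left to right.
        for (List<Word> row : rows) {
            row.sort(Comparator.comparingDouble(Word::x));
        }
        return rows;
    }

    // Infer column start positions: sort all left edges and open a new column
    // whenever the gap to the previous left edge exceeds xGap.
    static List<Double> inferColumnStarts(List<Word> words, double xGap) {
        List<Double> xs = new ArrayList<>();
        for (Word w : words) {
            xs.add(w.x());
        }
        xs.sort(Comparator.naturalOrder());
        List<Double> starts = new ArrayList<>();
        double prev = Double.NEGATIVE_INFINITY;
        for (double x : xs) {
            if (starts.isEmpty() || x - prev > xGap) {
                starts.add(x);
            }
            prev = x;
        }
        return starts;
    }

    // Build a grid of cell strings: each word goes to the right-most column
    // whose start is at or left of the word. Assumes columnStarts was derived
    // from the same words, so it is non-empty whenever rows is non-empty.
    static List<List<String>> buildCells(List<List<Word>> rows, List<Double> columnStarts) {
        List<List<String>> grid = new ArrayList<>();
        for (List<Word> row : rows) {
            List<String> cells = new ArrayList<>();
            for (int i = 0; i < columnStarts.size(); i++) {
                cells.add("");
            }
            for (Word w : row) {
                int col = 0;
                for (int i = 0; i < columnStarts.size(); i++) {
                    if (w.x() >= columnStarts.get(i)) {
                        col = i;
                    }
                }
                String current = cells.get(col);
                cells.set(col, current.isEmpty() ? w.text() : current + " " + w.text());
            }
            grid.add(cells);
        }
        return grid;
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
                new Word("Date", 10, 100), new Word("Amount", 200, 100),
                new Word("2024-01-02", 10, 120.5), new Word("42.00", 205, 120));
        List<List<Word>> rows = groupRows(words, 2.0);
        System.out.println(buildCells(rows, inferColumnStarts(words, 20.0)));
    }
}
```

The sketch also shows why the limitations listed later matter: a wrapped row produces an extra baseline and therefore an extra row, and a column that drifts by more than `xGap` is split in two.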

Strengths

Fast for clean text PDFs
No OCR overhead
Works well on structured exported documents
Good default for digital reports and statements

Limitations

StreamParser can struggle when:

rows wrap unpredictably
columns drift across pages
sections are visually close but semantically separate
spacing is inconsistent
the file is scanned and lacks a real text layer

If that happens, try:

HybridParser
LatticeParser (if lines exist)
OcrStreamParser (if the page is image-based)
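
The fallback advice above can be written down as a small decision helper. This is a sketch only: `ParserChoiceSketch` and its three boolean inputs are hypothetical and not part of ExtractPDF4J, and how you determine each flag (for example, by inspecting a sample page) is left to the caller. The method returns the parser class name as a string, so the sketch does not depend on the library.

```java
// Hypothetical helper, not part of ExtractPDF4J: maps observations about a
// document to the parser names mentioned in this page.
class ParserChoiceSketch {

    static String suggestParser(boolean hasTextLayer,
                                boolean hasRuledLines,
                                boolean streamResultLooksWrong) {
        if (!hasTextLayer) {
            // Image-based pages have no text positions to work from.
            return "OcrStreamParser";
        }
        if (!streamResultLooksWrong) {
            // Text layer present and output is acceptable: keep the fast path.
            return "StreamParser";
        }
        // StreamParser struggled: use line structure if the table has it,
        // otherwise fall back to the hybrid approach.
        return hasRuledLines ? "LatticeParser" : "HybridParser";
    }
}
```

For instance, `suggestParser(true, false, true)` models a text-based PDF with no ruled lines where StreamParser output looked wrong, and suggests `HybridParser`.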

Example

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.StreamParser;

import java.util.List;

public class StreamExample {
    public static void main(String[] args) throws Exception {
        // Parse only pages 1 through 3 of the text-based PDF.
        List<Table> tables = new StreamParser("statement.pdf")
                .pages("1-3")
                .parse();

        // Print the first detected table as comma-separated values.
        if (!tables.isEmpty()) {
            System.out.println(tables.get(0).toCSV(','));
        }
    }
}

Best practices

Use page ranges to focus on the table-bearing pages
Normalize headers downstream when layouts vary slightly
Validate extracted columns before relying on them for production ingestion
Prefer HybridParser if you are not fully sure the input is text-based
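
Header normalization and column validation can both run on the CSV text before ingestion. The following is a minimal sketch under stated assumptions: `CsvCheckSketch` and its methods are hypothetical, the input is assumed to be comma-separated text such as the string returned by `toCSV(',')` in the example above, and the naive split does not handle quoted fields that contain commas or line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical downstream checks; not part of ExtractPDF4J.
class CsvCheckSketch {

    // Normalize a header cell: trim, collapse inner whitespace, lowercase.
    static String normalizeHeader(String raw) {
        return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Return a list of human-readable problems; an empty list means the CSV
    // has the expected headers and every row has the same number of cells.
    // Assumption: no quoted fields containing commas or newlines.
    static List<String> problems(String csv, List<String> expectedHeaders) {
        List<String> issues = new ArrayList<>();
        String[] lines = csv.strip().split("\\R");
        String[] header = lines[0].split(",", -1);

        List<String> normalized = new ArrayList<>();
        for (String h : header) {
            normalized.add(normalizeHeader(h));
        }
        if (!normalized.equals(expectedHeaders)) {
            issues.add("unexpected headers: " + normalized);
        }

        for (int i = 1; i < lines.length; i++) {
            int cells = lines[i].split(",", -1).length;
            if (cells != header.length) {
                issues.add("row " + i + " has " + cells + " cells, expected " + header.length);
            }
        }
        return issues;
    }
}
```

A caller might reject or quarantine a document whenever `problems(...)` returns a non-empty list, rather than loading a partially misaligned table.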

Good fit vs poor fit

Good fit

consistent row spacing
stable column alignment
selectable text
repeated statement/report layouts

Poor fit

scans
photographs
image-only PDFs
heavy skew or layout noise
visually ruled tables where line structure is stronger than text alignment