How It Works¶

ExtractPDF4J supports multiple extraction strategies because no single parser works well for every PDF layout.

Some PDFs contain a clean text layer. Others are scanned images. Some have clear ruled tables. Others are visually tabular but structurally inconsistent.

That is why ExtractPDF4J provides multiple parser modes and a coordinating hybrid strategy.

High-level pipeline¶

PDF (text-based) ──► PDFBox text positions ─┐
                                            ├─► StreamParser ──► Table (cells)
PDF (scanned)    ──► Render to image ───────┼─► LatticeParser ─► Table
                                            └─► OCR ───────────► OcrStreamParser

HybridParser ──► chooses / coordinates strategies and returns List<Table>

Core parser roles¶

BaseParser¶

BaseParser provides the shared workflow used by all concrete parsers.

It typically manages:

file path input
page selection
parser configuration
the common parse() pipeline
returning List<Table>

StreamParser¶

Uses PDF text positions to infer tabular structure from text layout.

Best when:

the PDF has a real text layer
rows and columns can be inferred from alignment
no OCR is needed
LatticeParser

Works by detecting visible table lines and constructing a grid.

Best when:

table borders are drawn
the document is scanned
a structured grid is visually present

OcrStreamParser¶

Uses OCR to recover text from image-based documents, then applies table-oriented interpretation.

Best when:

the document has no usable text layer
OCR can recover readable text
you still want row/column-style extraction

HybridParser¶

Combines or coordinates parser strategies.

Best when:

the document type is mixed
you want a strong default
production input varies across files

Why multiple parsers matter

Real-world PDFs vary by:

source system
scan quality
border visibility
text encoding
page complexity
OCR recoverability

A single extraction strategy often fails across a mixed batch. Multiple parser modes improve practical reliability.

Output model¶

All parsers return:

List<Table>

This gives you a consistent downstream contract even when the underlying extraction strategy differs.

Typical decision flow¶

Use StreamParser when

the PDF is digitally generated
you can select text in a PDF viewer
the layout is text-aligned

Use LatticeParser when

cell borders are visible
the table is ruled
line-based structure is strong

Use OcrStreamParser when

the PDF is scanned
the text layer is missing or unusable
OCR recovery is required

Use HybridParser when

you are unsure
the batch contains mixed document types
you want the safest general-purpose default

How It Works¶

High-level pipeline¶

Core parser roles¶

BaseParser¶

StreamParser¶

OcrStreamParser¶

HybridParser¶

Output model¶

Typical decision flow¶

Related pages¶