Getting Started¶

This guide helps you understand what ExtractPDF4J is for, when to use each parser, and how to get your first successful extraction.

What ExtractPDF4J solves¶

Many PDF extraction tools work for ideal documents, but real-world files are messy:

scanned pages with no text layer
inconsistent column alignment
wrapped descriptions
mixed structured and unstructured sections
partial tables across multiple pages

ExtractPDF4J handles these cases with four parsing strategies: StreamParser, LatticeParser, OcrStreamParser, and HybridParser.

Supported document types¶

ExtractPDF4J is useful for:

invoices
bank statements
utility bills
transaction reports
financial statements
tabular operational reports
mixed scanned + text PDFs

Parser overview¶

StreamParser¶

Use this when the PDF has a clean text layer and the table structure can be inferred from text positions.

Best for:

digital statements
generated reports
text-based exports

LatticeParser¶

Use this when the PDF contains visible table borders, ruled cells, or grid-like structure.

Best for:

boxed invoices
ruled statements
scanned forms with table lines

OcrStreamParser¶

Use this when the document is scanned and lacks a usable text layer, but OCR can recover readable text.

Best for:

scanned statements
photographed PDFs
image-heavy pages

HybridParser¶

Use this when:

the document type is mixed
you are unsure which strategy fits
you want a strong default for production

Best for:

unknown PDFs
mixed text/scanned batches
automation pipelines
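The selection guidance above can be summarized as a small decision function. The sketch below is illustrative only: `ParserChooser`, `Strategy`, and `choose` are hypothetical names that are not part of ExtractPDF4J, and the rule order (ruled lines first, then text layer) is one assumed reading of the guidance, not the library's own logic.

```java
/** Hypothetical helper (not part of ExtractPDF4J): maps document traits to a parser choice. */
final class ParserChooser {

    /** The four strategies described in this guide. */
    enum Strategy { STREAM, LATTICE, OCR_STREAM, HYBRID }

    /**
     * Picks a strategy from two observations about a document.
     * Pass null for a trait that is unknown or varies across a batch.
     */
    static Strategy choose(Boolean hasTextLayer, Boolean hasRuledLines) {
        // Unknown or mixed traits: fall back to the general-purpose default.
        if (hasTextLayer == null || hasRuledLines == null) {
            return Strategy.HYBRID;
        }
        // Visible borders or grid lines: structure comes from the ruling.
        if (hasRuledLines) {
            return Strategy.LATTICE;
        }
        // No ruling but a clean text layer: infer structure from text positions.
        if (hasTextLayer) {
            return Strategy.STREAM;
        }
        // Scanned page without ruling or text layer: recover text with OCR.
        return Strategy.OCR_STREAM;
    }
}
```

For example, `ParserChooser.choose(true, false)` returns `Strategy.STREAM`, and passing `null` for either argument returns `Strategy.HYBRID`.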

Recommended first run¶

For most users, start with HybridParser:

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.HybridParser;

import java.util.List;

public class FirstRun {
    public static void main(String[] args) throws Exception {
        // Run HybridParser over every page of sample.pdf.
        List<Table> tables = new HybridParser("sample.pdf")
                .pages("all")
                .dpi(300f)
                .parse();

        // Report how many tables were extracted.
        System.out.println("Tables found: " + tables.size());
    }
}

Sample PDFs¶

The project includes sample PDFs in /examples to help you test extraction behavior safely.

Examples include:

utility-bill style extraction
multi-page statement-style extraction

These are intended to help you understand:

page structure
table repetition
realistic output formatting

What success looks like¶

A successful extraction typically means:

the expected number of tables is found
rows and columns are aligned correctly
CSV output is usable downstream
headers are readable enough for normalization
scanned pages return stable OCR-backed content
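Some of these criteria can be checked automatically. The following is a minimal sketch, assuming each extracted table has already been converted to rows of cell strings (`List<List<String>>`); that conversion is not shown. `ExtractionChecks` and its methods are hypothetical names, not ExtractPDF4J API, and "aligned" is approximated here as "every row has the same number of cells".

```java
import java.util.List;

/** Hypothetical sanity checks (not part of ExtractPDF4J) for extracted tables. */
final class ExtractionChecks {

    /** True when the table has at least one row and every row has as many cells as the first. */
    static boolean isRectangular(List<List<String>> rows) {
        if (rows.isEmpty()) {
            return false;
        }
        int width = rows.get(0).size();
        for (List<String> row : rows) {
            if (row.size() != width) {
                return false;
            }
        }
        return true;
    }

    /** True when the first row is non-empty and none of its cells is null or blank. */
    static boolean hasReadableHeaders(List<List<String>> rows) {
        if (rows.isEmpty() || rows.get(0).isEmpty()) {
            return false;
        }
        for (String header : rows.get(0)) {
            if (header == null || header.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    /** True when the table count matches and every table passes both checks above. */
    static boolean looksSuccessful(List<List<List<String>>> tables, int expectedTables) {
        if (tables.size() != expectedTables) {
            return false;
        }
        for (List<List<String>> table : tables) {
            if (!isRectangular(table) || !hasReadableHeaders(table)) {
                return false;
            }
        }
        return true;
    }
}
```

Checks like these do not replace a visual review of the output, but they can flag ragged rows or missing headers early in an automated pipeline.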

Next steps¶

Go to Installation to set up dependencies
Then use Quickstart for real code examples
If you prefer terminal-based use, go to CLI