ExtractPDF4J

Java-native PDF table extraction for text-based, scanned, and image-heavy documents.

ExtractPDF4J is a production-focused Java library for extracting tables and structured data from PDFs in real-world conditions.

It is designed for documents where extraction often becomes unreliable in practice:

text-based PDFs with inconsistent layout
scanned PDFs with no usable text layer
ruled and borderless tables
mixed multi-page documents
OCR-heavy operational files

Whether you are processing invoices, statements, reports, forms, or internal business documents, ExtractPDF4J gives you multiple parsing strategies under one Java-first API.

Why ExtractPDF4J?

PDF table extraction is not one problem; it is a family of problems.

A single parser often works for ideal documents but fails when:

one PDF is text-based and the next is scanned
some tables are ruled while others are spacing-based
rows wrap across lines
headers drift between document versions
OCR quality changes across scans

ExtractPDF4J addresses this by providing multiple extraction modes built for different layout types:

StreamParser for text-based PDFs
LatticeParser for ruled and grid-based tables
OcrStreamParser for OCR-backed recovery
HybridParser for mixed or uncertain input

Because each mode targets a specific layout type, you can match the parser to the document instead of forcing every PDF through a single strategy.
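The mapping from layout type to extraction mode can be sketched as a small routing function. This is only an illustration of the decision, not library code: `DocumentTraits`, `Mode`, and `chooseMode` are hypothetical names introduced here, and the order of the checks (ruled tables before text layer) is an assumption rather than documented ExtractPDF4J behavior.

```java
/**
 * Illustrative sketch only: routes a document to one of the four extraction
 * modes named above. DocumentTraits, Mode, and chooseMode are hypothetical
 * helpers and are not part of the ExtractPDF4J API.
 */
final class ParserModeSketch {

    /** The four modes listed in the text, as plain labels. */
    enum Mode { STREAM, LATTICE, OCR_STREAM, HYBRID }

    /** Hypothetical traits the caller is assumed to know before parsing. */
    static final class DocumentTraits {
        final boolean layoutKnown;
        final boolean hasTextLayer;
        final boolean hasRuledTables;

        DocumentTraits(boolean layoutKnown, boolean hasTextLayer, boolean hasRuledTables) {
            this.layoutKnown = layoutKnown;
            this.hasTextLayer = hasTextLayer;
            this.hasRuledTables = hasRuledTables;
        }
    }

    /** Picks a mode; the precedence of the checks is an assumption of this sketch. */
    static Mode chooseMode(DocumentTraits doc) {
        if (!doc.layoutKnown) {
            return Mode.HYBRID;      // mixed or uncertain input
        }
        if (doc.hasRuledTables) {
            return Mode.LATTICE;     // ruled and grid-based tables
        }
        if (doc.hasTextLayer) {
            return Mode.STREAM;      // text-based PDFs
        }
        return Mode.OCR_STREAM;      // no usable text layer: OCR-backed recovery
    }

    public static void main(String[] args) {
        DocumentTraits scannedForm = new DocumentTraits(true, false, false);
        DocumentTraits unknownUpload = new DocumentTraits(false, true, false);
        System.out.println("scanned form   -> " + chooseMode(scannedForm));
        System.out.println("unknown upload -> " + chooseMode(unknownUpload));
    }
}
```

In practice the traits are often not known up front, which is the case HybridParser is meant to cover.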

Quick example

Start with HybridParser if you want the safest default for mixed or unknown PDFs.

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.HybridParser;

import java.util.List;

public class QuickStart {
    public static void main(String[] args) throws Exception {
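        // HybridParser is the suggested default for mixed or unknown PDFs.
        List<Table> tables = new HybridParser("sample.pdf")
                .pages("all")   // process every page of the document
                .dpi(300f)
                .parse();

        // Print the first extracted table as comma-delimited CSV, if any table was found.
        if (!tables.isEmpty()) {
            System.out.println(tables.get(0).toCSV(','));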
        }
    }
}
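
The quick example prints only the first table. To keep every table, one option is to write each CSV string to its own file. The sketch below uses only the JDK and works on plain strings so it can run without the library. `CsvExportSketch`, `writeAll`, and the `table-N.csv` naming are hypothetical choices made here. With real output, the strings would presumably come from calling `toCSV(',')` on each `Table`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only: persists already-rendered CSV strings to disk.
 * CsvExportSketch and writeAll are hypothetical helpers, not ExtractPDF4J API.
 */
final class CsvExportSketch {

    /**
     * Writes each CSV string to outputDir/table-N.csv (N starts at 1)
     * and returns the paths in the order they were written.
     */
    static List<Path> writeAll(List<String> csvTables, Path outputDir) {
        List<Path> written = new ArrayList<>();
        try {
            Files.createDirectories(outputDir);
            for (int i = 0; i < csvTables.size(); i++) {
                Path target = outputDir.resolve("table-" + (i + 1) + ".csv");
                Files.write(target, csvTables.get(i).getBytes(StandardCharsets.UTF_8));
                written.add(target);
            }
        } catch (IOException e) {
            // Surface I/O failures without forcing callers to declare a checked exception.
            throw new UncheckedIOException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        // Stand-in data; real input would be one CSV string per extracted table.
        List<String> csvTables = new ArrayList<>();
        csvTables.add("item,amount\nwidget,10\n");
        csvTables.add("date,total\n2024-01-01,42\n");

        Path outputDir = Paths.get(System.getProperty("java.io.tmpdir"), "extracted-tables");
        for (Path path : writeAll(csvTables, outputDir)) {
            System.out.println("wrote " + path);
        }
    }
}
```

One file per table keeps column layouts separate, which matters when a document mixes tables with different headers.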