
Extractor

This page covers the common parser usage pattern in ExtractPDF4J.

ExtractPDF4J has multiple parser implementations, but they are typically used the same way:

  1. create a parser instance
  2. configure extraction options
  3. call parse()
  4. work with the returned List<Table>

Common construction pattern

A parser is generally created with a PDF path:

new HybridParser("input.pdf")

Then you optionally add configuration:

new HybridParser("input.pdf")
    .pages("1-3")
    .dpi(300f)

Finally, call parse() to run the extraction and get the tables back:

.parse()

Full example

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.HybridParser;

import java.util.List;

public class ExtractorExample {
    public static void main(String[] args) throws Exception {
        List<Table> tables = new HybridParser("statement.pdf")
                .pages("1-3")
                .dpi(300f)
                .parse();

        for (Table table : tables) {
            System.out.println(table.toCSV(','));
        }
    }
}

Common fluent methods

Depending on parser type, you may use methods like:

pages(...)

.pages("1")
.pages("1-3")
.pages("1,3-5")
.pages("all")

Restricts extraction to specific pages: a single page, a range, a comma-separated mix of both, or "all".
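
To make the page-spec forms above concrete, here is a minimal sketch of how a string such as "1,3-5" could be expanded into page numbers. PageSpec and expand are hypothetical names used only for illustration; this is not ExtractPDF4J's internal implementation, and the library may treat edge cases (whitespace, out-of-range pages) differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of ExtractPDF4J. */
final class PageSpec {
    private PageSpec() {}

    /**
     * Expands a spec such as "1", "1-3", "1,3-5" or "all" into sorted,
     * de-duplicated 1-based page numbers, given the document's page count.
     */
    static List<Integer> expand(String spec, int pageCount) {
        TreeSet<Integer> pages = new TreeSet<>();
        String trimmed = spec.trim();

        // "all" selects every page in the document.
        if (trimmed.equalsIgnoreCase("all")) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return new ArrayList<>(pages);
        }

        // Otherwise the spec is a comma-separated list of pages and ranges.
        for (String part : trimmed.split(",")) {
            String[] bounds = part.trim().split("-", 2);
            int first = Integer.parseInt(bounds[0].trim());
            int last = bounds.length == 2 ? Integer.parseInt(bounds[1].trim()) : first;
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("Invalid page range: " + part.trim());
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return new ArrayList<>(pages);
    }
}
```

For a six-page document, PageSpec.expand("1,3-5", 6) yields pages 1, 3, 4 and 5.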

dpi(...)

.dpi(300f)
.dpi(400f)

Sets the resolution in dots per inch. Used mainly for scanned or image-based parsing.

debug(...)

.debug(true)

Enables debug output.

debugDir(...)

.debugDir(new File("out/debug"))

Controls where debug artifacts are written.

keepCells(...)

.keepCells(true)

Useful when preserving explicit cell structure is important.

CSV conversion

Once a Table is returned, a common next step is converting it to CSV:

table.toCSV(',')

This is useful for:

  • quick inspection
  • local validation
  • downstream file generation
  • pipeline handoff

Writing CSV to disk

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.StreamParser;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvWriteExample {
    public static void main(String[] args) throws Exception {
        List<Table> tables = new StreamParser("report.pdf")
                .pages("1")
                .parse();

        if (!tables.isEmpty()) {
            Files.writeString(Path.of("out.csv"), tables.get(0).toCSV(','));
        }
    }
}
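
The example above writes only the first table. When a document yields several tables, one option is to write each to its own file. The sketch below works on plain CSV strings so that it stays independent of the library; it assumes each string was produced by table.toCSV(','). CsvFiles, writeAll, and the prefix-N.csv naming scheme are hypothetical choices, not part of ExtractPDF4J.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of ExtractPDF4J. */
final class CsvFiles {
    private CsvFiles() {}

    /**
     * Writes each CSV string to its own file (prefix-1.csv, prefix-2.csv, ...)
     * inside dir, creating the directory if needed, and returns the paths written.
     */
    static List<Path> writeAll(List<String> csvTexts, Path dir, String prefix) {
        List<Path> written = new ArrayList<>();
        try {
            Files.createDirectories(dir);
            for (int i = 0; i < csvTexts.size(); i++) {
                Path target = dir.resolve(prefix + "-" + (i + 1) + ".csv");
                Files.writeString(target, csvTexts.get(i)); // UTF-8 by default
                written.add(target);
            }
        } catch (IOException e) {
            // Surface I/O failures as unchecked so callers can chain this in a pipeline.
            throw new UncheckedIOException("Could not write CSV output to " + dir, e);
        }
        return written;
    }
}
```

In the example above, the input list would be built by calling toCSV(',') on every element of tables rather than only on tables.get(0).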

Error handling guidance

In production usage, you should handle cases such as:

  • no tables found
  • scanned input requiring OCR
  • weak or malformed layouts
  • unexpected document template changes

A simple defensive pattern:

if (tables.isEmpty()) {
    System.out.println("No tables found.");
}
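
In a pipeline, printing a message may not be enough to stop bad output from moving downstream. A slightly stricter variant of the check above is to fail fast with a descriptive error. The sketch below is generic over the element type so it compiles without the library; ExtractionChecks and requireNonEmpty are hypothetical names, not ExtractPDF4J API.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of ExtractPDF4J. */
final class ExtractionChecks {
    private ExtractionChecks() {}

    /**
     * Returns the list unchanged when it contains at least one table;
     * otherwise throws with a message naming the source document.
     */
    static <T> List<T> requireNonEmpty(List<T> tables, String source) {
        if (tables == null || tables.isEmpty()) {
            throw new IllegalStateException(
                    "No tables found in " + source
                    + "; the input may be scanned, malformed, or based on a changed template");
        }
        return tables;
    }
}
```

A caller could wrap the result of parse() in this check, for example requireNonEmpty(tables, "statement.pdf"), and let the exception propagate to whatever error handling the pipeline already has.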

First pass

Use HybridParser to get a baseline.

Stabilization phase

Once a document family is well understood, switch to a more specialized parser where needed.

Production phase

Validate:

  • expected table count
  • required headers
  • column count consistency
  • output quality before ingestion
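
The header and column-count checks above can be sketched as a small validator. It operates on rows represented as lists of strings, with the first row treated as the header; how those rows are obtained from a Table is left open here, and TableValidator and validate are hypothetical names, not ExtractPDF4J API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of ExtractPDF4J. */
final class TableValidator {
    private TableValidator() {}

    /**
     * Checks that the first row contains every required header and that all
     * later rows have the same number of columns as the header row.
     * Returns a list of problems; an empty list means the table passed.
     */
    static List<String> validate(List<List<String>> rows, List<String> requiredHeaders) {
        List<String> problems = new ArrayList<>();
        if (rows.isEmpty()) {
            problems.add("table has no rows");
            return problems;
        }

        List<String> header = rows.get(0);
        for (String required : requiredHeaders) {
            if (!header.contains(required)) {
                problems.add("missing required header: " + required);
            }
        }

        // Row indices in messages are 0-based, with row 0 being the header.
        for (int i = 1; i < rows.size(); i++) {
            int actual = rows.get(i).size();
            if (actual != header.size()) {
                problems.add("row " + i + " has " + actual + " columns, expected " + header.size());
            }
        }
        return problems;
    }
}
```

Checking the expected table count is then a plain size comparison on the list returned by parse(), and ingestion can be skipped whenever any table reports problems.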