Models

This page describes the core output model used by ExtractPDF4J.

The most important model for consumers is:

  • Table

This is the main structure returned by parser operations.

Table

Parser flows typically return:

List<Table>

Each Table represents one extracted table from the PDF.

A Table is the bridge between extraction and downstream use cases such as:

  • CSV export
  • data normalization
  • validation
  • ingestion into applications or services

Common usage pattern

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.HybridParser;

import java.util.List;

public class ModelsExample {
    public static void main(String[] args) throws Exception {
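        // Run the hybrid parser over every page of the document.
        List<Table> tables = new HybridParser("sample.pdf")
                .pages("all")
                .dpi(300f)
                .parse();

        // The result may be empty, so check before reading the first table.
        if (!tables.isEmpty()) {
            Table first = tables.get(0);
            System.out.println(first.toCSV(','));
        }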
    }
}

CSV export

A common operation on a Table is:

table.toCSV(',')

This is useful for:

  • quick manual review
  • writing to files
  • testing extraction quality
  • feeding CSV-based downstream pipelines

Example: writing CSV

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.StreamParser;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TableCsvExample {
    public static void main(String[] args) throws Exception {
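        // Parse only page 1 of the statement.
        List<Table> tables = new StreamParser("statement.pdf")
                .pages("1")
                .parse();

        // Write the first table, if any, to statement.csv as comma-separated text.
        if (!tables.isEmpty()) {
            Files.writeString(Path.of("statement.csv"), tables.get(0).toCSV(','));
        }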
    }
}
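
For the "testing extraction quality" use mentioned above, one cheap check is whether every row of the exported CSV has the same number of columns as the first row. The following is a minimal sketch that works on the CSV string only and is not part of ExtractPDF4J. The class name CsvShapeCheck and its methods are hypothetical, and it assumes double-quote quoting in the style of RFC 4180 with no line breaks inside quoted fields, which may not match what toCSV actually emits.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvShapeCheck {
    /** Counts comma-separated fields in one CSV line, ignoring commas inside double quotes. */
    public static int countColumns(String line) {
        int columns = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the quoted state is preserved.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                columns++;
            }
        }
        return columns;
    }

    /** Returns the 1-based line numbers whose column count differs from the first line. */
    public static List<Integer> raggedLines(String csv) {
        List<Integer> ragged = new ArrayList<>();
        String[] lines = csv.split("\\R"); // split on any line terminator
        int expected = countColumns(lines[0]);
        for (int i = 1; i < lines.length; i++) {
            if (countColumns(lines[i]) != expected) {
                ragged.add(i + 1);
            }
        }
        return ragged;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by table.toCSV(',').
        String csv = "Date,Description,Amount\n"
                + "2024-01-05,\"Coffee, large\",3.50\n"
                + "2024-01-06,Tea\n";
        System.out.println("Ragged lines: " + raggedLines(csv));
    }
}
```

A non-empty result does not prove the extraction is wrong, but it is a useful signal that a table needs manual review or a different parser configuration.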

Cells and structure

Depending on parser mode and configuration, a Table may reflect:

  • inferred row/column structure
  • explicit cell-grid structure
  • OCR-backed text grouping
  • lattice-detected cell boundaries

When using options like:

.keepCells(true)

the parser may preserve more explicit cell-level structure where supported.

Downstream normalization

After extraction, it is common to normalize:

  • header names
  • date formats
  • numeric formatting
  • empty values
  • column ordering

Typical examples:

  • mapping Txn Date → Date
  • trimming whitespace
  • standardizing currency columns
  • dropping noise rows
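
A few of the steps above can be sketched in plain Java. This is an illustration under stated assumptions, not an ExtractPDF4J API: it assumes the table has already been turned into rows of strings (for example by parsing the output of toCSV(',')), and the names TableNormalizer, HEADER_ALIASES, and normalize are hypothetical. It trims whitespace, drops rows with no content, and renames known headers in the first remaining row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TableNormalizer {
    // Hypothetical alias map: raw header text -> canonical header name.
    private static final Map<String, String> HEADER_ALIASES = Map.of(
            "Txn Date", "Date",
            "Amt", "Amount");

    /**
     * Trims every cell, drops rows whose cells are all blank, and renames
     * known headers in the first row that is kept.
     */
    public static List<List<String>> normalize(List<List<String>> rows) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> cleaned = new ArrayList<>();
            boolean allBlank = true;
            for (String cell : row) {
                String value = cell == null ? "" : cell.trim();
                if (!value.isEmpty()) {
                    allBlank = false;
                }
                cleaned.add(value);
            }
            if (allBlank) {
                continue; // drop noise rows that carry no content
            }
            if (out.isEmpty()) {
                // The first kept row is treated as the header row.
                cleaned.replaceAll(h -> HEADER_ALIASES.getOrDefault(h, h));
            }
            out.add(cleaned);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> raw = List.of(
                List.of(" Txn Date ", "Description", "Amt"),
                List.of("", "  ", ""),
                List.of("2024-01-05", " Coffee ", " 3.50 "));
        normalize(raw).forEach(row -> System.out.println(String.join(",", row)));
    }
}
```

Date and currency standardization are left out on purpose, since the right rules depend on the documents being processed.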

Defensive handling

Always handle:

  • empty result sets
  • multiple tables
  • partial or noisy tables

Example:

if (tables.isEmpty()) {
    System.out.println("No tables found.");
} else {
    for (Table table : tables) {
        System.out.println(table.toCSV(','));
    }
}
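
When a document yields several tables, writing each one to its own file keeps partial or empty results from overwriting good ones. The sketch below is a hypothetical helper, not part of ExtractPDF4J: CsvBatchWriter and writeAll are illustrative names, and the input is assumed to be the list of strings obtained by calling toCSV(',') on each Table.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CsvBatchWriter {
    /**
     * Writes each non-blank CSV string to <prefix>-<n>.csv inside dir and
     * returns the paths written. n is the 1-based position in the input list.
     */
    public static List<Path> writeAll(List<String> csvTables, Path dir, String prefix) {
        List<Path> written = new ArrayList<>();
        try {
            Files.createDirectories(dir);
            for (int i = 0; i < csvTables.size(); i++) {
                String csv = csvTables.get(i);
                if (csv == null || csv.isBlank()) {
                    continue; // skip tables that produced no content
                }
                Path target = dir.resolve(prefix + "-" + (i + 1) + ".csv");
                Files.writeString(target, csv);
                written.add(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the CSV strings produced from a real List<Table>.
        List<String> csvTables = List.of("a,b\n1,2\n", "", "x,y\n3,4\n");
        Path dir = Files.createTempDirectory("tables");
        List<Path> written = writeAll(csvTables, dir, "statement");
        if (written.isEmpty()) {
            System.out.println("No tables found.");
        } else {
            written.forEach(p -> System.out.println("Wrote " + p));
        }
    }
}
```

File numbers follow the position in the original list, so a skipped table leaves a gap that makes it easy to see which one was empty.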

A Table is often used together with:

  • parser configuration methods
  • CSV file writing
  • template validation logic
  • downstream mappers and schema normalizers

Javadocs

For exact type details and method signatures, use: