Extractor¶
This page covers the common parser usage pattern in ExtractPDF4J.
Although there are multiple parser implementations, they are typically used in a similar way:
- create a parser instance
- configure extraction options
- call parse()
- work with the returned List<Table>
Common construction pattern¶
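A parser is generally created with a PDF path, for example new HybridParser("statement.pdf").
You then optionally chain configuration calls such as .pages("1-3") or .dpi(300f).
Finally, you call parse(), which returns a List<Table>.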
Full example¶
import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.HybridParser;

import java.util.List;

public class ExtractorExample {
    public static void main(String[] args) throws Exception {
        List<Table> tables = new HybridParser("statement.pdf")
                .pages("1-3")
                .dpi(300f)
                .parse();

        for (Table table : tables) {
            System.out.println(table.toCSV(','));
        }
    }
}
Common fluent methods¶
Depending on parser type, you may use methods like:
pages(...)¶
Restricts extraction to specific pages, given as a string such as "1" or "1-3".
dpi(...)¶
Sets the resolution, in dots per inch, for example dpi(300f). Used mainly for scanned/image-based parsing.
debug(...)¶
Enables debug output.
debugDir(...)¶
Controls where debug artifacts are written.
keepCells(...)¶
Useful when preserving explicit cell structure is important.
CSV conversion¶
Once a Table is returned, a common next step is converting it to CSV with table.toCSV(','), where the argument is the delimiter character.
This is useful for:
- quick inspection
- local validation
- downstream file generation
- pipeline handoff
Writing CSV to disk¶
import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.StreamParser;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvWriteExample {
    public static void main(String[] args) throws Exception {
        List<Table> tables = new StreamParser("report.pdf")
                .pages("1")
                .parse();

        if (!tables.isEmpty()) {
            Files.writeString(Path.of("out.csv"), tables.get(0).toCSV(','));
        }
    }
}
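The example above writes only the first table. When a document yields several tables, one option is to write each to its own file. The sketch below is an illustration, not part of ExtractPDF4J: CsvBatchWriter, writeAll, and the table-N.csv naming scheme are hypothetical. It takes plain CSV strings so that it runs without the library on the classpath.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ExtractPDF4J.
final class CsvBatchWriter {

    /**
     * Writes each CSV string to outDir/table-N.csv (N starts at 1)
     * and returns the paths that were written, in order.
     */
    static List<Path> writeAll(List<String> csvTexts, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        for (int i = 0; i < csvTexts.size(); i++) {
            Path target = outDir.resolve("table-" + (i + 1) + ".csv");
            Files.writeString(target, csvTexts.get(i));
            written.add(target);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in CSV text; in real use each string would come from table.toCSV(',').
        List<String> csvTexts = List.of(
                "date,amount\n2024-01-01,10.00\n",
                "id,name\n1,Ada\n");
        Path outDir = Files.createTempDirectory("csv-out");
        for (Path path : writeAll(csvTexts, outDir)) {
            System.out.println("wrote " + path);
        }
    }
}
```

With the library available, the input list would be built by calling toCSV(',') on each Table returned by parse().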
Error handling guidance¶
In production usage, you should handle cases such as:
- no tables found
- scanned input requiring OCR
- weak or malformed layouts
- unexpected document template changes
A simple defensive pattern:
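What follows is a minimal sketch of such a pattern, not a helper shipped with the library. It assumes only that the parse step may throw and that it returns a list. DefensiveExtraction and extractOrEmpty are hypothetical names, and the parse step is passed in as a Callable so the sketch runs without ExtractPDF4J on the classpath.

```java
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of ExtractPDF4J.
final class DefensiveExtraction {

    /**
     * Runs the parse step and returns its tables. If the step throws or
     * yields nothing, logs a warning and returns an empty list instead.
     */
    static <T> List<T> extractOrEmpty(Callable<List<T>> parseStep, String source) {
        try {
            List<T> tables = parseStep.call();
            if (tables == null || tables.isEmpty()) {
                // Possible causes: scanned input needing OCR, or a weak layout.
                System.err.println("No tables found in " + source);
                return List.of();
            }
            return tables;
        } catch (Exception e) {
            System.err.println("Extraction failed for " + source + ": " + e.getMessage());
            return List.of();
        }
    }

    public static void main(String[] args) {
        // Stand-in parse steps; strings take the place of Table objects here.
        List<String> found = extractOrEmpty(() -> List.of("table-1"), "statement.pdf");
        List<String> none = extractOrEmpty(() -> List.<String>of(), "scanned.pdf");
        System.out.println(found.size() + " table(s) found, then " + none.size());
    }
}
```

With the library available, the call would presumably look like extractOrEmpty(() -> new HybridParser("statement.pdf").parse(), "statement.pdf"). Returning an empty list keeps the caller's loop simple. A pipeline that must not silently skip documents may prefer to rethrow instead.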
Recommended usage strategy¶
First pass¶
Use HybridParser to get a baseline.
Stabilization phase¶
Once a document family is well understood, switch to a more specialized parser, such as StreamParser, if the baseline is not sufficient.
Production phase¶
Validate:
- expected table count
- required headers
- column count consistency
- output quality before ingestion
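The header and column checks above can be sketched as follows. This is an illustration under stated assumptions, not the library's API: TableValidation and validate are hypothetical names, and a table is represented as a plain List<List<String>> whose first row is the header, because this page does not show how cells are read from a Table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ExtractPDF4J.
final class TableValidation {

    /**
     * Checks one table given as rows of cell text, with rows.get(0) as the header.
     * Returns a list of problems; an empty list means every check passed.
     */
    static List<String> validate(List<List<String>> rows, List<String> requiredHeaders) {
        List<String> problems = new ArrayList<>();
        if (rows.isEmpty()) {
            problems.add("table has no rows");
            return problems;
        }
        List<String> header = rows.get(0);
        // Required headers: each must appear somewhere in the header row.
        for (String required : requiredHeaders) {
            if (!header.contains(required)) {
                problems.add("missing required header: " + required);
            }
        }
        // Column count consistency: every data row must match the header width.
        for (int i = 1; i < rows.size(); i++) {
            int width = rows.get(i).size();
            if (width != header.size()) {
                problems.add("row " + i + " has " + width
                        + " columns, expected " + header.size());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("Date", "Amount"),
                List.of("2024-01-01", "10.00"),
                List.of("2024-01-02")); // short row, reported as a problem
        for (String problem : validate(rows, List.of("Date", "Amount", "Balance"))) {
            System.out.println(problem);
        }
    }
}
```

The expected table count is a one-line comparison against the size of the list returned by parse(), so it is left out of the sketch. Collecting problems in a list, rather than failing on the first one, makes it easier to see every way a template has drifted before ingestion.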