CLI
ExtractPDF4J includes CLI support for running table extraction from the terminal.
Default behavior
If you do not pass --mode, the CLI defaults to:
hybrid
That means it behaves as if you had passed --mode hybrid explicitly.
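Scripts that rely on the default can also spell it out, so the chosen mode is visible in logs and reviews. The following is a minimal sketch rather than part of ExtractPDF4J: the function name `extract_tables` and the `EXTRACTPDF4J_JAR` variable are hypothetical, and it assumes the jar is launched with `java -jar` as in the examples on this page.

```shell
# Hypothetical wrapper: EXTRACTPDF4J_JAR must point at the parser jar.
extract_tables() {
  local pdf="$1"
  # Fall back to hybrid when no mode is given, mirroring the CLI default.
  local mode="${2:-hybrid}"
  java -jar "$EXTRACTPDF4J_JAR" "$pdf" --mode "$mode"
}
```

With this wrapper, `extract_tables statement.pdf` and `extract_tables statement.pdf hybrid` run the same command.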
Basic usage
Supported modes
Mode guidance
- stream → text-based PDFs
- lattice → ruled/grid tables
- ocrstream → scanned/OCR-heavy pages
- hybrid → best general-purpose default
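The guidance above can be encoded as a small lookup for scripts that already know what kind of document they are handling. This is only an illustrative sketch: the function `pick_mode` and the labels `text`, `ruled`, and `scanned` are made up for this example, and only the mode names come from the list above.

```shell
# Map a hypothetical document label to one of the documented modes.
pick_mode() {
  case "$1" in
    text)    echo stream ;;     # text-based PDFs
    ruled)   echo lattice ;;    # ruled/grid tables
    scanned) echo ocrstream ;;  # scanned/OCR-heavy pages
    *)       echo hybrid ;;     # general-purpose default
  esac
}
```

The result can be passed straight to the CLI, for example `--mode "$(pick_mode scanned)"`.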
Common flags
Page selection
Examples:
- --pages 1 → page 1 only
- --pages 1-3 → pages 1, 2, 3
- --pages 1-3,5 → pages 1, 2, 3, and 5
- --pages all → all pages
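To make the notation concrete, the sketch below expands a page spec into the individual page numbers it selects. It illustrates the syntax shown in the examples only and is not the CLI's own parser; `expand_pages` is a hypothetical helper, and the total page count used to resolve `all` must be supplied by the caller.

```shell
# Expand a --pages style spec ("1", "1-3", "1-3,5", "all") into page numbers.
# The second argument (total page count) is only used to resolve "all".
expand_pages() {
  local spec="$1" total_pages="${2:-0}"
  local old_ifs part start end page out=""
  if [ "$spec" = "all" ]; then
    spec="1-$total_pages"
  fi
  # Split the spec on commas into positional parameters.
  old_ifs=$IFS
  IFS=,
  set -- $spec
  IFS=$old_ifs
  for part in "$@"; do
    case "$part" in
      *-*) start=${part%-*}; end=${part#*-} ;;  # range such as 1-3
      *)   start=$part; end=$part ;;            # single page
    esac
    page=$start
    while [ "$page" -le "$end" ]; do
      out="$out${out:+ }$page"
      page=$((page + 1))
    done
  done
  printf '%s\n' "$out"
}
```

For instance, `expand_pages 1-3,5` prints `1 2 3 5`, matching the third example above.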
CSV separator
Output file
If --out is omitted, output is written to STDOUT.
DPI
Set the resolution with --dpi. Recommended:
- use 300–450 for scanned PDFs
Debug output
Use --debug and --debug-dir when you want intermediate artifacts for troubleshooting.
OCR mode
The --ocr flag controls how OCR helpers are selected.
Extra controls
Use these to tighten output control in more advanced workflows.
Example: scanned PDF with lattice mode
java -jar extractpdf4j-parser-<version>.jar scan.pdf \
--mode lattice \
--pages 1 \
--dpi 450 \
--ocr cli \
--debug \
--keep-cells \
--debug-dir debug_out \
--out p1.csv
Example: full-document hybrid extraction
java -jar extractpdf4j-parser-<version>.jar statement.pdf \
--mode hybrid \
--pages all \
--dpi 400 \
--out tables.csv
Output behavior
When --out is omitted:
Tables are printed to STDOUT as CSV.
When --out is provided:
Output is written to file.
When multiple tables are found:
Files may be numbered with suffixes, for example:
- out-1.csv
- out-2.csv
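Because the output may land in one file or in several numbered ones, a script that consumes the results can check for both. The helper below is a sketch under that assumption: `list_table_outputs` is a hypothetical name, and the `-N.csv` suffix pattern is inferred from the example file names above rather than from a documented guarantee.

```shell
# Print every existing output file for a given --out path:
# the path itself plus any numbered siblings such as out-1.csv, out-2.csv.
list_table_outputs() {
  local out_path="$1"
  local stem="${out_path%.csv}"
  local f
  for f in "$out_path" "$stem"-*.csv; do
    # An unmatched glob stays literal, so keep only real files.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
  return 0
}
```

After a run with `--out out.csv`, `list_table_outputs out.csv` prints whichever of these files were actually written.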
When to use the CLI
The CLI is useful for:
- ad hoc extraction
- batch processing
- shell scripting
- CI jobs
- pre-ingestion validation
- debugging parser behavior before integrating the Java API
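As an illustration of the batch and CI use cases, the sketch below runs the CLI over every PDF in a directory and writes one CSV per input. It uses only flags shown on this page (`--mode`, `--pages`, `--out`). The function name `batch_extract` and the `EXTRACTPDF4J_JAR` variable are hypothetical, and the error handling assumes the CLI exits with a non-zero status on failure, which this page does not confirm.

```shell
# Hypothetical batch runner: EXTRACTPDF4J_JAR must point at the parser jar.
batch_extract() {
  local in_dir="$1" out_dir="$2"
  local pdf name failed=0
  mkdir -p "$out_dir"
  for pdf in "$in_dir"/*.pdf; do
    # Skip the literal pattern when the directory holds no PDFs.
    [ -e "$pdf" ] || continue
    name=$(basename "$pdf" .pdf)
    # Assumption: a non-zero exit status signals a failed extraction.
    if ! java -jar "$EXTRACTPDF4J_JAR" "$pdf" \
        --mode hybrid --pages all --out "$out_dir/$name.csv"; then
      echo "extraction failed: $pdf" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

In a CI job, `batch_extract incoming csv_out` would then fail the step if any single PDF could not be processed, while still attempting the remaining files.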