Hybrid Parser
HybridParser is the most flexible parser mode in ExtractPDF4J.
It is intended as the best general-purpose default, especially when you do not know in advance whether a PDF is text-based, scanned, ruled, or mixed.
When to use HybridParser
Use HybridParser when:
- you are unsure which parser is best
- your batch contains mixed PDF types
- some pages are text-based and others are scanned
- you want one practical entry point for automation pipelines
For many production use cases, this is the safest starting point.
How it works
At a high level, HybridParser coordinates multiple parser strategies.
It can:
- choose the most suitable strategy
- combine results from multiple approaches
- return a consistent List<Table> output
- reduce the need for manual parser selection
This is useful when:
- document quality varies
- layouts change across files
- input sources are inconsistent
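The "choose the most suitable strategy" idea can be illustrated with a small, self-contained sketch. It does not show HybridParser's actual internals, which this page does not describe. The `Strategy` and `Result` types, the `pickBest` method, and the "count non-blank cells" score are all hypothetical names and assumptions made for illustration. The sketch covers only selection, not the combining of results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: none of these types are part of ExtractPDF4J.
final class StrategySketch {

    /** A named extraction strategy: raw page text in, rows of cells out. */
    static final class Strategy {
        final String name;
        final Function<String, List<List<String>>> extract;

        Strategy(String name, Function<String, List<List<String>>> extract) {
            this.name = name;
            this.extract = extract;
        }
    }

    /** Rows produced by one strategy, tagged with the strategy name. */
    static final class Result {
        final String strategy;
        final List<List<String>> rows;

        Result(String strategy, List<List<String>> rows) {
            this.strategy = strategy;
            this.rows = rows;
        }

        /** Crude quality score (an assumption): the number of non-blank cells. */
        int score() {
            int filled = 0;
            for (List<String> row : rows) {
                for (String cell : row) {
                    if (!cell.trim().isEmpty()) filled++;
                }
            }
            return filled;
        }
    }

    /**
     * Runs every strategy and keeps the highest-scoring result.
     * The first strategy wins ties; returns null if the list is empty.
     */
    static Result pickBest(String page, List<Strategy> strategies) {
        Result best = null;
        for (Strategy s : strategies) {
            Result r = new Result(s.name, s.extract.apply(page));
            if (best == null || r.score() > best.score()) best = r;
        }
        return best;
    }

    /** Toy extractor: splits each line into cells on the given delimiter regex. */
    static List<List<String>> split(String page, String delimiterRegex) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : page.split("\n")) {
            rows.add(Arrays.asList(line.trim().split(delimiterRegex)));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Strategy> strategies = Arrays.asList(
                new Strategy("whitespace", p -> split(p, "\\s{2,}")),
                new Strategy("pipe", p -> split(p, "\\s*\\|\\s*")));

        Result best = pickBest("id | name | total\n1 | Ann | 10", strategies);
        System.out.println(best.strategy + " -> " + best.rows);
    }
}
```

Whatever the strategy, the caller receives the same result shape, which mirrors the "consistent output" point above. A real coordinator would likely use richer signals than a cell count.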
Why HybridParser is the recommended default
In real systems, PDF inputs are rarely uniform.
You may receive:
- clean exported statements
- partially scanned PDFs
- mixed-layout invoices
- OCR-needed archival files
- files where one strategy works for some pages but not others
HybridParser reduces operational guesswork by giving you a single default path, so you do not have to pick a parser for each file.
Example
    import com.extractpdf4j.helpers.Table;
    import com.extractpdf4j.parsers.HybridParser;

    import java.util.List;

    public class HybridExample {
        public static void main(String[] args) throws Exception {
            // Configure the parser for all pages at 300 DPI, then extract the tables.
            List<Table> tables = new HybridParser("mixed.pdf")
                    .pages("all")
                    .dpi(300f)
                    .parse();

            System.out.println("Tables found: " + tables.size());
        }
    }
Strengths
- Best default for unknown inputs
- Useful for mixed text/scanned batches
- Reduces parser selection effort
- Good for production ingestion pipelines
Limitations
HybridParser is broad and practical, but it is not magic.
For highly specialized documents, a direct parser may still be better:
- use StreamParser for clearly text-based PDFs
- use LatticeParser for strongly ruled tables
- use OcrStreamParser for OCR-first recovery
If you already know the exact document type, a specialized parser may be more predictable.
Recommended workflow
For first-time users
Start with HybridParser.
For debugging
Once you understand the document better:
- switch to a specialized parser if needed
- compare outputs
- keep the better strategy for that document family
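Comparing outputs can be partly automated. The sketch below is one possible heuristic, not a feature of ExtractPDF4J. It models each table as rows of strings rather than as a `Table` object, and the `filledCells`, `raggedRows`, and `prefer` helpers are hypothetical. It assumes that a run with fewer ragged rows and more non-blank cells is the better one, which may not hold for every document family.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: each table is modeled as rows of strings,
// not as an ExtractPDF4J Table object.
final class OutputComparison {

    /** Counts non-blank cells across all tables of one parser run. */
    static int filledCells(List<List<List<String>>> tables) {
        int filled = 0;
        for (List<List<String>> table : tables) {
            for (List<String> row : table) {
                for (String cell : row) {
                    if (!cell.trim().isEmpty()) filled++;
                }
            }
        }
        return filled;
    }

    /** Counts rows whose width differs from the first row of their table. */
    static int raggedRows(List<List<List<String>>> tables) {
        int ragged = 0;
        for (List<List<String>> table : tables) {
            if (table.isEmpty()) continue;
            int width = table.get(0).size();
            for (List<String> row : table) {
                if (row.size() != width) ragged++;
            }
        }
        return ragged;
    }

    /**
     * Returns the label of the preferred run: a run with no content loses,
     * then fewer ragged rows wins, then more filled cells wins (A on ties).
     */
    static String prefer(String labelA, List<List<List<String>>> a,
                         String labelB, List<List<List<String>>> b) {
        int filledA = filledCells(a);
        int filledB = filledCells(b);
        if (filledA == 0 || filledB == 0) return filledA >= filledB ? labelA : labelB;
        int raggedA = raggedRows(a);
        int raggedB = raggedRows(b);
        if (raggedA != raggedB) return raggedA < raggedB ? labelA : labelB;
        return filledA >= filledB ? labelA : labelB;
    }

    public static void main(String[] args) {
        // Run A failed to split the second row into two cells; run B did not.
        List<List<List<String>>> runA = Arrays.asList(
                Arrays.asList(Arrays.asList("id", "total"), Arrays.asList("1 10")));
        List<List<List<String>>> runB = Arrays.asList(
                Arrays.asList(Arrays.asList("id", "total"), Arrays.asList("1", "10")));
        System.out.println("Preferred: " + prefer("run A", runA, "run B", runB));
    }
}
```

A check like this is a first filter only. Inspect a sample of the extracted tables by hand before you settle on a strategy for a document family.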
For production
Use:
- HybridParser as the default route
- targeted overrides only when specific document classes need them
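One way to express "default route plus targeted overrides" is a small routing table. The sketch below is an assumption about how a pipeline might organize this, not an ExtractPDF4J API. The `ParserRouting` class, the `Route` enum, and the document class labels are hypothetical. The enum constants only mirror the parser names mentioned on this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a pipeline-side routing table, not part of ExtractPDF4J.
final class ParserRouting {

    /** Labels mirroring the parsers named on this page. */
    enum Route { HYBRID, STREAM, LATTICE, OCR_STREAM }

    private final Map<String, Route> overrides = new HashMap<>();

    /** Registers a targeted override for one document class. */
    ParserRouting override(String documentClass, Route route) {
        overrides.put(documentClass, route);
        return this;
    }

    /** Returns the override if one exists, otherwise the HYBRID default. */
    Route routeFor(String documentClass) {
        return overrides.getOrDefault(documentClass, Route.HYBRID);
    }

    public static void main(String[] args) {
        // The document class names below are made up for illustration.
        ParserRouting routing = new ParserRouting()
                .override("bank-statement", Route.STREAM)
                .override("ruled-invoice", Route.LATTICE);

        System.out.println(routing.routeFor("bank-statement"));
        System.out.println(routing.routeFor("unclassified-upload"));
    }
}
```

Keeping the overrides in one place makes it easy to see which document classes deviate from the default, and to remove an override once it is no longer needed.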
Good fit vs poor fit
Good fit
- unknown PDFs
- varied input sources
- mixed batches
- automation pipelines
Less ideal when
- the document type is fully known and stable
- you want tightly specialized extraction behavior
- you are tuning one narrow layout family