Hybrid Parser

HybridParser is the most flexible parser mode in ExtractPDF4J.

It is intended as the best general-purpose default, especially when you do not know in advance whether a PDF is text-based, scanned, ruled, or mixed.

When to use HybridParser

Use HybridParser when:

  • you are unsure which parser is best
  • your batch contains mixed PDF types
  • some pages are text-based and others are scanned
  • you want one practical entry point for automation pipelines

For many production use cases, this is the safest starting point.

How it works

At a high level, HybridParser coordinates multiple parser strategies.

It can:

  • choose the most suitable strategy
  • combine results from multiple approaches
  • return a consistent List<Table> output
  • reduce the need for manual parser selection

This is useful when:

  • document quality varies
  • layouts change across files
  • input sources are inconsistent

In real systems, PDF inputs are rarely uniform.

You may receive:

  • clean exported statements
  • partially scanned PDFs
  • mixed-layout invoices
  • OCR-needed archival files
  • files where one strategy works for some pages but not others

HybridParser reduces operational guesswork by giving you one default path for all of these inputs, so you do not have to pick a parser per file.
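
The coordination idea described above can be illustrated with a small, self-contained sketch. This is not ExtractPDF4J's actual implementation: `SimpleTable`, `Strategy`, `score`, and `parseWithBest` are hypothetical names invented for this example, and the scoring rule (count of non-blank cells) is only one assumed way of deciding which strategy did best.

```java
import java.util.List;

// Hypothetical sketch: none of these types belong to ExtractPDF4J.
final class StrategySelectionSketch {

    /** Stand-in for a parsed table: rows of cell text. */
    static final class SimpleTable {
        final List<List<String>> rows;

        SimpleTable(List<List<String>> rows) {
            this.rows = rows;
        }
    }

    /** One extraction strategy: maps a document path to a list of tables. */
    interface Strategy {
        List<SimpleTable> extract(String path);
    }

    /** Assumed quality measure: total number of non-blank cells recovered. */
    static int score(List<SimpleTable> tables) {
        int cells = 0;
        for (SimpleTable table : tables) {
            for (List<String> row : table.rows) {
                for (String cell : row) {
                    if (cell != null && !cell.trim().isEmpty()) {
                        cells++;
                    }
                }
            }
        }
        return cells;
    }

    /** Runs every strategy and keeps the highest-scoring result. */
    static List<SimpleTable> parseWithBest(String path, List<Strategy> strategies) {
        List<SimpleTable> best = List.of();
        int bestScore = -1;
        for (Strategy strategy : strategies) {
            List<SimpleTable> result;
            try {
                result = strategy.extract(path);
            } catch (RuntimeException e) {
                // A failing strategy counts as "found nothing" instead of aborting the run.
                result = List.of();
            }
            int currentScore = score(result);
            if (currentScore > bestScore) {
                best = result;
                bestScore = currentScore;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Fake strategies standing in for text-based, ruled, and OCR-style extraction.
        Strategy textLike = path -> List.of(new SimpleTable(List.of(List.of("Date", "Amount"))));
        Strategy ruledLike = path -> List.of(new SimpleTable(
                List.of(List.of("Date", "Amount"), List.of("2024-01-01", "10.00"))));
        Strategy ocrLike = path -> {
            throw new IllegalStateException("OCR engine unavailable");
        };

        List<SimpleTable> tables = parseWithBest("mixed.pdf", List.of(textLike, ruledLike, ocrLike));
        System.out.println("Tables found: " + tables.size());
        System.out.println("Non-blank cells: " + score(tables));
    }
}
```

Whatever strategy wins, the caller receives the same list type, which is the property that makes a single entry point practical for pipelines.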

Example

import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.HybridParser;

import java.util.List;

public class HybridExample {
    public static void main(String[] args) throws Exception {
        // Parse every page; dpi sets the resolution used when pages are rendered as images.
        List<Table> tables = new HybridParser("mixed.pdf")
                .pages("all")
                .dpi(300f)
                .parse();

        System.out.println("Tables found: " + tables.size());
    }
}

Strengths

  • Best default for unknown inputs
  • Useful for mixed text/scanned batches
  • Reduces parser selection effort
  • Good for production ingestion pipelines

Limitations

HybridParser is broad and practical, but it is not magic.

For highly specialized documents, a direct parser may still be better:

  • use StreamParser for clearly text-based PDFs
  • use LatticeParser for strongly ruled tables
  • use OcrStreamParser for OCR-first recovery

If you already know the exact document type, a specialized parser may be more predictable.

For first-time users

Start with HybridParser.

For debugging

Once you understand the document better:

  • switch to a specialized parser if needed
  • compare outputs
  • keep the better strategy for that document family

For production

Use:

  • HybridParser as the default route
  • targeted overrides only when specific document classes need them
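
One way to picture "default route plus targeted overrides" is a small lookup that falls back to the hybrid path. The sketch below is an assumption-laden illustration, not part of ExtractPDF4J: the `ParserChoice` enum, the `OVERRIDES` map, the `choose` method, and the document-class labels are all hypothetical. In a real pipeline each enum value would correspond to constructing the matching parser.

```java
import java.util.Map;

// Hypothetical sketch: the labels and this router are illustrative, not a library API.
final class ParserRoutingSketch {

    /** One value per parser named in this page. */
    enum ParserChoice { HYBRID, STREAM, LATTICE, OCR_STREAM }

    // Overrides only for document families already known to need a specialized parser.
    static final Map<String, ParserChoice> OVERRIDES = Map.of(
            "bank-statement-export", ParserChoice.STREAM,
            "ruled-invoice", ParserChoice.LATTICE,
            "archival-scan", ParserChoice.OCR_STREAM);

    /** Returns the override for a known document class, otherwise the hybrid default. */
    static ParserChoice choose(String documentClass) {
        if (documentClass == null) {
            // Map.of maps reject null keys, so unknown or unlabeled input goes to the default.
            return ParserChoice.HYBRID;
        }
        return OVERRIDES.getOrDefault(documentClass, ParserChoice.HYBRID);
    }

    public static void main(String[] args) {
        System.out.println(choose("ruled-invoice"));   // prints LATTICE
        System.out.println(choose("unknown-vendor"));  // prints HYBRID
    }
}
```

Keeping the override table small and explicit makes it easy to see which document families deviate from the default and why.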

Good fit vs poor fit

Good fit

  • unknown PDFs
  • varied input sources
  • mixed batches
  • automation pipelines

Less ideal when

  • the document type is fully known and stable
  • you want tightly specialized extraction behavior
  • you are tuning one narrow layout family