com.extractpdf4j.parsers.HybridParser

public class HybridParser extends BaseParser

HybridParser

A high-level parser that tries multiple underlying strategies and returns the best table set for the requested page(s). Specifically, it runs:

- StreamParser: text-position based parsing (good for digitally created PDFs)
- LatticeParser: grid/line detection using OpenCV (good for ruled or scanned PDFs)
- OcrStreamParser: OCR-backed stream parsing (good for image PDFs without a text layer)

and chooses the result using a simple heuristic scoring function (see score(Table)).
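
The selection step can be pictured with a small, self-contained sketch. Everything in it is an assumption made for illustration: BestOfSelection, the Grid stand-in for Table, and the fill-ratio heuristic are hypothetical names and are not the library's score(Table) implementation. The sketch only shows the general shape of scoring each strategy's result list and keeping the highest-scoring one.

```java
import java.util.List;

/** Hypothetical sketch of "run several strategies, keep the best-scoring result". */
final class BestOfSelection {

    /** Stand-in for the library's Table: just a grid of cell strings. */
    static final class Grid {
        final String[][] cells;

        Grid(String[][] cells) {
            this.cells = cells;
        }
    }

    /** Assumed heuristic (not the library's): fraction of non-blank cells, in [0, 1]. */
    static double score(Grid grid) {
        int total = 0;
        int filled = 0;
        for (String[] row : grid.cells) {
            for (String cell : row) {
                total++;
                if (cell != null && !cell.trim().isEmpty()) {
                    filled++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) filled / total;
    }

    /** Mean score of a result list; an empty list scores 0. */
    static double averageScore(List<Grid> grids) {
        return grids.stream().mapToDouble(BestOfSelection::score).average().orElse(0.0);
    }

    /** Returns the candidate list with the highest mean score; the earliest candidate wins ties. */
    static List<Grid> pickBest(List<List<Grid>> candidates) {
        List<Grid> best = List.of();
        double bestScore = Double.NEGATIVE_INFINITY;
        for (List<Grid> candidate : candidates) {
            double s = averageScore(candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }
}
```

In this sketch each candidate list would come from one strategy (stream, lattice, OCR), and an empty result simply scores 0, so it can only win when no strategy produced anything better.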

Usage


 List<Table> tables = new HybridParser("path/to/file.pdf")
     .dpi(300f)       // optional, helps scans
     .debug(true)     // optional, write lattice/ocr debug artifacts
     .pages("all")    // "1", "2-5", "1,3-4", or "all"
     .parse();
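
The page selection strings listed in the comment above ("1", "2-5", "1,3-4", "all") could be expanded roughly as follows. This is a sketch under stated assumptions, not the library's parsing code: PageSpec and expand are hypothetical names, page numbers are assumed to be 1-based, and out-of-range values are assumed to be rejected rather than clamped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: expands a page selection string into page numbers. */
final class PageSpec {

    /**
     * Expands "all", "1", "2-5" or "1,3-4" against a document of pageCount pages.
     * Assumes 1-based page numbers; the result is sorted and free of duplicates.
     */
    static List<Integer> expand(String spec, int pageCount) {
        TreeSet<Integer> pages = new TreeSet<>();
        String trimmed = spec.trim();
        if (trimmed.equalsIgnoreCase("all")) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return new ArrayList<>(pages);
        }
        for (String part : trimmed.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            // A token is either a single page ("3") or an inclusive range ("3-4").
            int first = Integer.parseInt(dash < 0 ? token : token.substring(0, dash).trim());
            int last = dash < 0 ? first : Integer.parseInt(token.substring(dash + 1).trim());
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("Bad page range: " + token);
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return new ArrayList<>(pages);
    }
}
```

Under these assumptions, expanding "1,3-4" for a five-page document yields pages 1, 3 and 4, and a range such as "4-9" is rejected with an IllegalArgumentException.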

Page selection contract

Inherits the BaseParser convention: if parsePage(int) is invoked with -1, the implementation must parse all pages. For any non-negative value, only that page is parsed. This class narrows its internal subparsers accordingly.
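
One way to picture the narrowing described here is a set-then-restore pattern around each subparser call. The sketch below is illustrative only: NarrowedPages, Strategy, and withPage are hypothetical names, and the real class may organize this differently. It assumes -1 maps to the "all" selection and any other value maps to that single page.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of narrowing a page selection for one call, then restoring it. */
final class NarrowedPages {

    /** Stand-in for an underlying strategy that remembers its page selection. */
    static final class Strategy {
        String pages = "all";
    }

    /**
     * Runs work with the strategy set to a single page, or to "all" when page == -1.
     * The original selection is restored afterward, even if work throws.
     */
    static <R> R withPage(Strategy strategy, int page, Supplier<R> work) {
        String original = strategy.pages;
        strategy.pages = (page == -1) ? "all" : Integer.toString(page);
        try {
            return work.get();
        } finally {
            strategy.pages = original; // restore the caller's page spec
        }
    }
}
```

The try/finally is the design point: a failure inside one strategy should not leave the shared parser stuck on a single page for later calls.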

Thread-safety

Instances are not inherently thread-safe. Create one instance per input file or perform external synchronization if sharing across threads.

Since: 2025
Author: Mehuli Mukherjee

Field Summary

Fields inherited from class com.extractpdf4j.parsers.BaseParser
filepath, pages, stripText

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| HybridParser() | Creates a HybridParser for in-memory processing. |
| HybridParser(String filepath) | Creates a HybridParser for the given PDF file path. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| HybridParser | debug(boolean on) | Enables or disables debug outputs for lattice/OCR strategies. |
| HybridParser | debugDir(File dir) | Directory where debug artifacts should be written (lattice + OCR). |
| HybridParser | dpi(float dpi) | Sets DPI for image-based parsing (used by lattice + OCR strategies). |
| HybridParser | keepCells(boolean on) | Whether to preserve empty cells when reconstructing grids (lattice only). |
| HybridParser | minScore(double score) | Sets the minimum allowed average score across a list of tables. |
| BaseParser | pages(String pages) | Sets the page selection for this parser and propagates the same selection to all underlying strategies. |
| List<Table> | parse(org.apache.pdfbox.pdmodel.PDDocument document) | Parses a previously loaded PDF document. |
| protected List<Table> | parsePage(int page) | Runs stream, lattice, and OCR-backed stream for the requested page(s) and returns the best-scoring set of tables. |
| HybridParser | stripText(boolean strip) | Enables or disables text normalization for stream-style extraction across all underlying strategies. |

Methods inherited from class com.extractpdf4j.parsers.BaseParser
finalizeResults, parse

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- HybridParser
  
  public HybridParser(String filepath)
  
  Creates a HybridParser for the given PDF file path.
  
  Parameters:
  
  filepath - path to the PDF file
- HybridParser
  
  public HybridParser()
  
  Creates a HybridParser for in-memory processing. The PDF document must then be passed to parse(PDDocument).
Method Details
- dpi
  
  public HybridParser dpi(float dpi)
  
  Sets DPI for image-based parsing (used by lattice + OCR strategies).
  
  Parameters:
  
  dpi - dots per inch used for rasterization (e.g., 300f for scans)
  
  Returns:
  
  this parser
- debug
  
  public HybridParser debug(boolean on)
  
  Enables or disables debug outputs for lattice/OCR strategies.
  
  Parameters:
  
  on - true to enable, false to disable
  
  Returns:
  
  this parser
- keepCells
  
  public HybridParser keepCells(boolean on)
  
  Whether to preserve empty cells when reconstructing grids (lattice only).
  
  Parameters:
  
  on - true to keep empty cells
  
  Returns:
  
  this parser
- debugDir
  
  public HybridParser debugDir(File dir)
  
  Directory where debug artifacts should be written (lattice + OCR).
  
  Parameters:
  
  dir - destination directory
  
  Returns:
  
  this parser
- minScore
  
  public HybridParser minScore(double score)
  
  Sets the minimum allowed average score across a list of tables. If the list's average score is below this threshold, it will be rejected.
  
  Parameters:
  
  score - minimum score in [0, 1]
  
  Returns:
  
  this parser
- pages
  
  public BaseParser pages(String pages)
  
  Sets the page selection for this parser and propagates the same selection to all underlying strategies.
  
  Overrides:
  
  pages in class BaseParser
  
  Parameters:
  
  pages - page selection string (e.g., "all", "1", "2-5", "1,3-4")
  
  Returns:
  
  this parser
- stripText
  
  public HybridParser stripText(boolean strip)
  
  Enables or disables text normalization for stream-style extraction across all underlying strategies.
  
  Overrides:
  
  stripText in class BaseParser
  
  Parameters:
  
  strip - true to normalize/strip text, false to keep raw text
  
  Returns:
  
  this parser (for chaining)
- parsePage
  
  protected List<Table> parsePage(int page) throws IOException
  
  Runs stream, lattice, and OCR-backed stream for the requested page(s) and returns the best-scoring set of tables.
  If page == -1, each strategy is run across all pages. Otherwise, each strategy is temporarily narrowed to the single requested page (restoring the original page spec afterward).
  
  Specified by:
  
  parsePage in class BaseParser
  
  Parameters:
  
  page - page index to parse, or -1 to parse all pages
  
  Returns:
  
  the winning list of Table objects (possibly empty)
  
  Throws:
  
  IOException - if an underlying parser fails
- parse
  
  public List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
  
  Description copied from class: BaseParser
  
  Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
  
  Specified by:
  
  parse in class BaseParser
  
  Parameters:
  
  document - The PDDocument to parse.
  
  Returns:
  
  A list of extracted tables.
  
  Throws:
  
  IOException - for I/O issues during parsing.
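
Returning to minScore above: the rejection rule it describes (a list whose average score is below the threshold is rejected) can be sketched in a few lines. MinScoreGate and its methods are hypothetical names, the per-table scores are passed in as plain numbers, and treating an empty list as averaging to 0 is an assumption rather than documented behavior.

```java
import java.util.List;

/** Hypothetical sketch of rejecting a result whose average score is below a threshold. */
final class MinScoreGate {

    /** Mean of the scores; an empty list is assumed to average to 0. */
    static double average(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** True when the list's average score reaches minScore, which must lie in [0, 1]. */
    static boolean accept(List<Double> scores, double minScore) {
        if (minScore < 0.0 || minScore > 1.0) {
            throw new IllegalArgumentException("minScore must be in [0, 1]: " + minScore);
        }
        return average(scores) >= minScore;
    }
}
```

With this rule a list scoring 0.75 and 0.25 averages 0.5, so it passes a threshold of 0.5 and fails a threshold of 0.6.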
