Class LatticeParser

java.lang.Object
com.extractpdf4j.parsers.BaseParser
com.extractpdf4j.parsers.LatticeParser

public class LatticeParser extends BaseParser
LatticeParser

Detects table structure by rasterizing pages and finding horizontal/vertical ruling lines with OpenCV. Reconstructs a cell grid, maps PDF text into cells, and optionally runs OCR for sparsely filled cells.

Pipeline

  1. Render page to image at renderDpi.
  2. Binarize for line detection (adaptive threshold).
  3. Extract horizontal/vertical lines via morphology; project to get line positions.
  4. Build grid from line intersections; map PDF glyphs to cell coords.
  5. Fallback OCR for cells if text coverage is low.
  6. Emit Table with grid + row/column boundaries.
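Steps 3 and 4 can be pictured with a small, library-free sketch. It is not LatticeParser's implementation, which uses OpenCV. The class LatticeSketch, the methods linePositions and bandIndex, and the minInk threshold are hypothetical names chosen for illustration. The input is assumed to be a binary mask in which ruling-line pixels are true.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: projection-based line finding on a boolean mask. */
public class LatticeSketch {

    /**
     * Projects the mask onto one axis and returns the center of each run of
     * rows (horizontal == true) or columns (horizontal == false) whose count
     * of set pixels is at least minInk.
     */
    static List<Integer> linePositions(boolean[][] mask, boolean horizontal, int minInk) {
        int rows = mask.length;
        int cols = rows == 0 ? 0 : mask[0].length;
        int n = horizontal ? rows : cols;

        // Projection profile: number of set pixels per row or per column.
        int[] profile = new int[n];
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                if (mask[y][x]) {
                    profile[horizontal ? y : x]++;
                }
            }
        }

        // Collapse each consecutive run above the threshold to its center,
        // so a ruling line several pixels thick yields a single position.
        List<Integer> positions = new ArrayList<>();
        int runStart = -1;
        for (int i = 0; i <= n; i++) {
            boolean on = i < n && profile[i] >= minInk;
            if (on && runStart < 0) {
                runStart = i;
            } else if (!on && runStart >= 0) {
                positions.add((runStart + i - 1) / 2);
                runStart = -1;
            }
        }
        return positions;
    }

    /**
     * Returns the index i with bounds[i] <= v < bounds[i + 1], or -1 when v
     * lies outside the grid. Applying it to x and y gives a cell coordinate.
     */
    static int bandIndex(List<Integer> bounds, double v) {
        for (int i = 0; i + 1 < bounds.size(); i++) {
            if (v >= bounds.get(i) && v < bounds.get(i + 1)) {
                return i;
            }
        }
        return -1;
    }
}
```

With line positions for both axes, a glyph whose center is (x, y) in image coordinates would fall in column bandIndex(colLines, x) and row bandIndex(rowLines, y). A linear scan is used here for clarity; a binary search would serve equally well for large grids.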

Page indexing follows the BaseParser convention: this class expects parsePage(1) for the first page; parsePage(-1) means “all pages”.
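The indexing convention can be made concrete with a short sketch. PageIndexSketch and resolvePages are hypothetical names, not part of the library. Rejecting 0 and out-of-range values is an assumption as well, since the documentation does not say how such arguments are handled.

```java
import java.util.stream.IntStream;

/** Illustrative only: maps a parsePage-style argument to 0-based indices. */
public class PageIndexSketch {

    /**
     * Returns the 0-based page indices selected by the page argument:
     * -1 selects every page, and a 1-based page number selects one page.
     */
    static int[] resolvePages(int page, int pageCount) {
        if (page == -1) {
            return IntStream.range(0, pageCount).toArray();
        }
        // Assumption: anything else outside 1..pageCount is rejected.
        if (page < 1 || page > pageCount) {
            throw new IllegalArgumentException("page out of range: " + page);
        }
        return new int[] { page - 1 };
    }
}
```

Under these assumptions, resolvePages(1, 3) selects only the first page and resolvePages(-1, 3) selects all three.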

  • Constructor Details

    • LatticeParser

      public LatticeParser(String filepath)
      Creates a LatticeParser for the PDF file at the given path.
    • LatticeParser

      public LatticeParser()
      Creates a LatticeParser for in-memory processing. The PDF document must be passed to the parse() method.
  • Method Details

    • debug

      public LatticeParser debug(boolean on)
      Enables or disables debug overlays and artifacts.
    • keepCells

      public LatticeParser keepCells(boolean on)
      Keep empty cells in the final grid (useful for fixed layouts).
    • dpi

      public LatticeParser dpi(float dpi)
      Sets the DPI used to rasterize each page (step 1 of the pipeline).
    • debugDir

      public LatticeParser debugDir(File dir)
      Sets the directory where debug artifacts are written.
    • parsePage

      @Deprecated protected List<Table> parsePage(int page) throws IOException
      Deprecated.
      This method loads the document from disk on every call. Prefer loading the PDDocument once and using parse(PDDocument).
      Description copied from class: BaseParser
      Parses a single page or the entire document.

      Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).

      Specified by:
      parsePage in class BaseParser
      Parameters:
      page - page index to parse, or -1 to parse all pages
      Returns:
      a list of Table objects extracted from the requested page(s) (possibly empty)
      Throws:
      IOException - if an error occurs while parsing
    • parse

      public List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
      Description copied from class: BaseParser
      Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
      Specified by:
      parse in class BaseParser
      Parameters:
      document - The PDDocument to parse.
      Returns:
      A list of extracted tables.
      Throws:
      IOException - for I/O issues during parsing.