Lattice Parser¶
LatticeParser is designed for ruled or grid-based tables where visible lines define the table structure.
This is especially useful for scanned PDFs, forms, or boxed tables where text alignment alone is not reliable.
When to use LatticeParser¶
Use LatticeParser when:
- the table has visible horizontal and vertical lines
- the document is scanned
- rows and columns are defined by borders
- structure is visually grid-based
Common examples: - boxed invoices - forms with table cells - ruled financial statements - scanned reports with explicit table lines
How it works¶
At a high level:
- Render the page to an image
- Detect horizontal and vertical lines
- Find line intersections (joints)
- Construct a cell grid
- Assign text into cells
- Return
List<Table>
This is useful when the table is visually obvious, even if the text layer is missing or weak.
Strengths¶
- Excellent for ruled tables
- Handles grid-heavy scans well
- Can preserve more explicit table structure
- Useful when text-only parsing fails
Limitations¶
LatticeParser can struggle when:
- borders are faint or broken
- the document is low-resolution
- lines are skewed or noisy
- the table is implied only by spacing, not borders
In those cases:
- increase DPI
- enable debug output
- try HybridParser
- consider OcrStreamParser if the text is readable but borders are weak
Example¶
import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.LatticeParser;
import java.io.File;
import java.util.List;
public class LatticeExample {
public static void main(String[] args) throws Exception {
List<Table> tables = new LatticeParser("scanned.pdf")
.pages("all")
.dpi(300f)
.keepCells(true)
.debug(true)
.debugDir(new File("out/debug"))
.parse();
System.out.println("Tables found: " + tables.size());
}
}
Why DPI matters¶
For scanned documents, resolution strongly affects:
- line detection
- cell boundary accuracy
- OCR text assignment quality
Recommended starting point:
300f
For difficult scans:
400fto450f
Higher DPI can improve accuracy, but increases CPU and memory usage.
Debug mode¶
Use debug mode when:
- line detection seems wrong
- cells are merging incorrectly
- borders are partially missing
- you want to inspect intermediate output
Typical settings:
Good fit vs poor fit¶
Good fit
- strong borders
- clearly ruled tables
- boxed cells
- structured forms
Poor fit
- borderless tables
- loosely aligned text-only layouts
- documents where spacing matters more than drawn lines