OCR Stream Parser
OcrStreamParser is designed for scanned or image-heavy PDFs where the text layer is missing, weak, or unusable.
It uses OCR to recover text and then applies stream-style interpretation to build tabular output.
When to use OcrStreamParser
Use OcrStreamParser when:
- the PDF is a scan
- text cannot be selected in the PDF viewer
- the text layer is corrupt or incomplete
- OCR can recover readable content from the page image
Common examples:
- scanned invoices
- scanned bank statements
- photographed documents
- image-heavy archival PDFs
How it works
At a high level:
- Render the page to an image
- Run OCR on the rendered content
- Recover text blocks and positions
- Group text into row/column-like structure
- Build tables from OCR output
- Return List<Table>
This gives you a way to extract tables even when the original PDF has no usable embedded text.
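To make the grouping step more concrete, here is a minimal sketch of one way OCR words could be clustered into rows by vertical position and then ordered left to right. It is an illustration under stated assumptions, not the library's actual algorithm: `RowGrouper`, `OcrWord`, and the `yTolerance` parameter are hypothetical names introduced for this example, and a real implementation would also need to align cells into columns across rows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: RowGrouper and OcrWord are hypothetical names,
// not part of the library's API.
final class RowGrouper {

    /** One recognized word with the top-left position of its bounding box. */
    static final class OcrWord {
        final String text;
        final double x;
        final double y;

        OcrWord(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups words into rows. A word joins the current row when its y
     * differs from the y of the row's first word by at most yTolerance;
     * otherwise it starts a new row. Each row is returned left to right.
     */
    static List<List<String>> groupIntoRows(List<OcrWord> words, double yTolerance) {
        List<OcrWord> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((OcrWord w) -> w.y));

        List<List<OcrWord>> rows = new ArrayList<>();
        double rowY = 0;
        for (OcrWord w : sorted) {
            if (rows.isEmpty() || Math.abs(w.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = w.y; // the first word of a row anchors its baseline
            }
            rows.get(rows.size() - 1).add(w);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<OcrWord> row : rows) {
            row.sort(Comparator.comparingDouble((OcrWord w) -> w.x));
            List<String> cells = new ArrayList<>();
            for (OcrWord w : row) {
                cells.add(w.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        List<OcrWord> words = List.of(
                new OcrWord("Total", 40, 152),
                new OcrWord("Date", 40, 100),
                new OcrWord("Amount", 200, 101),
                new OcrWord("12.50", 200, 150));
        // prints [[Date, Amount], [Total, 12.50]]
        System.out.println(groupIntoRows(words, 5.0));
    }
}
```

The tolerance absorbs small vertical jitter between words on the same line, which is also why skewed scans hurt: once the drift across a line exceeds the tolerance, one visual row splits into several.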
Strengths
- Works on image-only PDFs
- Useful for legacy scans
- Helps recover structure where no text layer exists
- Good fallback for OCR-readable documents
Limitations
OcrStreamParser depends on OCR quality.
It can degrade when:
- the scan is blurry
- the page is skewed
- text is faint or noisy
- the language data is missing
- resolution is too low
If visible table borders are strong, LatticeParser may perform better.
Example
import com.extractpdf4j.helpers.Table;
import com.extractpdf4j.parsers.OcrStreamParser;

import java.util.List;

public class OcrStreamExample {
    public static void main(String[] args) throws Exception {
        // OCR pages 1-2 of the scan, rendered at 300 DPI
        List<Table> tables = new OcrStreamParser("scan.pdf")
                .pages("1-2")
                .dpi(300f)
                .parse();

        // Print the first recovered table as comma-separated text
        if (!tables.isEmpty()) {
            System.out.println(tables.get(0).toCSV(','));
        }
    }
}
OCR setup notes
For OCR-backed parsing, you may need:
- Tesseract language data
- the TESSDATA_PREFIX environment variable, if language data is not found automatically
Example:
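As a minimal sketch, the pre-flight check below reports whether TESSDATA_PREFIX is set and whether English language data can be found under it. `TessdataCheck` and `hasLanguageData` are hypothetical helpers written for this illustration, not part of the library. The sketch assumes language files are named `<language>.traineddata` and, because the expected location has varied between Tesseract versions, looks both in the directory itself and in a `tessdata` subdirectory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: TessdataCheck is a hypothetical helper,
// not part of the library's API.
final class TessdataCheck {

    /** Returns true if dir contains a file named <language>.traineddata. */
    static boolean hasLanguageData(Path dir, String language) {
        return dir != null
                && Files.isRegularFile(dir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null || prefix.isEmpty()) {
            System.out.println("TESSDATA_PREFIX is not set; "
                    + "Tesseract will use its default search location.");
            return;
        }

        Path dir = Paths.get(prefix);
        // Assumption: depending on the Tesseract version, the variable points
        // either at the tessdata directory or at its parent, so check both.
        boolean found = hasLanguageData(dir, "eng")
                || hasLanguageData(dir.resolve("tessdata"), "eng");

        System.out.println(found
                ? "Found eng.traineddata under " + dir
                : "No eng.traineddata under " + dir
                        + "; check TESSDATA_PREFIX or install the language data.");
    }
}
```

Running a check like this before parsing makes a missing-language failure easier to tell apart from a genuinely unreadable scan.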
DPI guidance
OCR quality often improves when you render at a higher DPI.
- Recommended starting point: 300f
- For low-quality scans: 400f to 450f
Trade-off:
- higher DPI may improve recognition
- but increases processing cost
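The cost side of this trade-off can be estimated with simple arithmetic: the number of rendered pixels grows with the square of the DPI. The sketch below assumes a US Letter page (8.5 x 11 inches) and uses pixel count as a rough proxy for render and OCR work; `DpiCost` is a hypothetical helper for illustration, and actual processing time depends on the OCR engine and hardware.

```java
// Illustrative only: DpiCost is a hypothetical helper,
// not part of the library's API.
final class DpiCost {

    /** Number of pixels in a page of the given size (in inches) rendered at dpi. */
    static long pixelCount(double widthInches, double heightInches, double dpi) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx;
    }

    public static void main(String[] args) {
        double[] dpis = {300, 400, 450};
        long baseline = pixelCount(8.5, 11, 300);

        for (double dpi : dpis) {
            long pixels = pixelCount(8.5, 11, dpi);
            // Ratio relative to the 300 DPI starting point
            System.out.printf("%.0f DPI: %d pixels (%.2fx of 300 DPI)%n",
                    dpi, pixels, (double) pixels / baseline);
        }
    }
}
```

For this page size, 300 DPI gives 2550 x 3300 pixels. Moving to 400 DPI renders about 1.78 times as many pixels, and 450 DPI renders 2.25 times as many, so it is usually worth trying 300f first and raising the value only when recognition is poor.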
When OCR Stream is a better choice than Lattice
Choose OcrStreamParser over LatticeParser when:
- text is readable after OCR
- borders are weak or absent
- layout is not strongly ruled
- you need text-first recovery rather than line-first reconstruction