Package com.extractpdf4j.helpers
Class Ocr
java.lang.Object
com.extractpdf4j.helpers.Ocr
OCR helper utilities.
(Javadocs of class and methods updated to reflect new heuristic functionality.)
Nested Class Summary
Nested Classes
Method Summary
Modifier and Type | Method | Description
static String | ocrPng | Runs OCR on a PNG file and returns plain text.
static List<Ocr.OcrWord> | ocrTsv |
static List<Ocr.OcrWord> | ocrTsv |
static List<Ocr.OcrWord> | ocrTsvHeuristically(String pngPath, String lang) | Runs OCR on a PNG image using a heuristic to find the best Page Segmentation Mode (PSM).
Method Details
ocrTsvHeuristically
Runs OCR on a PNG image using a heuristic to find the best Page Segmentation Mode (PSM). It tries a predefined list of PSMs, finds the one with the best word coverage, and prints a summary of the best combination found.
Parameters:
pngPath - Path to the PNG image.
lang - Language string for Tesseract (e.g., "eng", "por", "eng+fra").
Returns:
A list of OCR words (possibly empty).
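The selection step described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the class PsmHeuristicSketch, the methods coverage and pickBestPsm, and the candidate PSM values are hypothetical. The source does not define "word coverage", so the sketch assumes it means the number of non-blank recognized words. The Tesseract call is replaced by a caller-supplied function returning canned results.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical sketch of a "try several PSMs, keep the best" heuristic. */
public class PsmHeuristicSketch {

    /** Assumed coverage score: the number of non-blank words recognized. */
    public static int coverage(List<String> words) {
        int count = 0;
        for (String w : words) {
            if (w != null && !w.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /**
     * Calls the supplied OCR function once per candidate PSM and returns the
     * PSM whose output scores highest. Ties keep the earlier candidate.
     * Returns -1 when the candidate list is empty.
     */
    public static int pickBestPsm(List<Integer> candidates,
                                  IntFunction<List<String>> ocrForPsm) {
        int bestPsm = -1;
        int bestScore = -1;
        for (int psm : candidates) {
            int score = coverage(ocrForPsm.apply(psm));
            if (score > bestScore) {
                bestScore = score;
                bestPsm = psm;
            }
        }
        return bestPsm;
    }

    public static void main(String[] args) {
        // Canned results standing in for real Tesseract runs (assumption).
        Map<Integer, List<String>> fake = new HashMap<>();
        fake.put(6, List.of("Invoice", "Total"));
        fake.put(4, List.of("Invoice", "Total", "42.00"));
        fake.put(11, List.of("Invoice", " "));

        int best = pickBestPsm(List.of(6, 4, 11), psm -> fake.get(psm));
        System.out.println("Best PSM: " + best); // prints "Best PSM: 4"
    }
}
```

Passing the OCR step in as a function keeps the selection logic testable without Tesseract installed. A real scorer might weight recognition confidence as well as word count.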
ocrPng
Runs OCR on a PNG file and returns plain text. (Legacy method) NOTE: This method does not use the new heuristic.
ocrTsv
ocrTsv