Class Ocr

java.lang.Object
com.extractpdf4j.helpers.Ocr

public final class Ocr extends Object
OCR helper utilities, including a heuristic search for the best Tesseract Page Segmentation Mode (PSM).
  • Method Details

    • ocrTsvHeuristically

      public static List<Ocr.OcrWord> ocrTsvHeuristically(String pngPath, String lang)
      Runs OCR on a PNG image, using a heuristic to choose the Page Segmentation Mode (PSM). The method tries each PSM in a predefined list, keeps the one that gives the best word coverage, and prints a summary of the best combination found.
      Parameters:
      pngPath - Path to the PNG image.
      lang - Language string for Tesseract (e.g., "eng", "por", "eng+fra").
      Returns:
      A list of OCR words (possibly empty).
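The PSM search described for ocrTsvHeuristically can be pictured with a minimal sketch. It is not the library's implementation: the class PsmHeuristicSketch, the method pickBest, the Result holder, and the runOcr callback are hypothetical names introduced only for illustration, and "word coverage" is approximated here as the number of non-blank words, which may differ from the metric the library actually uses.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from the Ocr class.
final class PsmHeuristicSketch {

    /** Holds the winning PSM and the words it produced. */
    static final class Result {
        final String psm;          // null when no PSM was tried
        final List<String> words;  // never null, possibly empty

        Result(String psm, List<String> words) {
            this.psm = psm;
            this.words = words;
        }
    }

    /**
     * Tries each PSM in order and keeps the one that yields the most
     * non-blank words. Ties keep the earlier PSM in the list.
     * ASSUMPTION: coverage == count of non-blank words.
     */
    static Result pickBest(List<String> psms, Function<String, List<String>> runOcr) {
        Result best = new Result(null, List.of());
        long bestCount = -1;
        for (String psm : psms) {
            List<String> words = runOcr.apply(psm);
            long count = words.stream().filter(w -> !w.trim().isEmpty()).count();
            if (count > bestCount) {
                bestCount = count;
                best = new Result(psm, words);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in for a real OCR engine: canned word lists per PSM.
        Map<String, List<String>> fake = Map.of(
                "6", List.of("Invoice", "Total"),
                "4", List.of("Invoice", "Total", "42.00"),
                "11", List.of("Inv"));

        Result r = pickBest(List.of("6", "4", "11"),
                psm -> fake.getOrDefault(psm, List.of()));

        // Mirrors the "prints a summary" step in the method description.
        System.out.println("best psm=" + r.psm + " words=" + r.words.size());
    }
}
```

Passing the OCR call in as a callback keeps the sketch runnable without Tesseract installed. In real use the callback would wrap a call such as ocrTsv(pngPath, lang, psm).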
    • ocrPng

      public static String ocrPng(String pngPath)
      Legacy method. Runs OCR on a PNG file and returns plain text. Note: this method does not use the PSM heuristic applied by ocrTsvHeuristically.
    • ocrTsv

      public static List<Ocr.OcrWord> ocrTsv(String pngPath)
    • ocrTsv

      public static List<Ocr.OcrWord> ocrTsv(String pngPath, String lang, String psm)