Class BaseParser

java.lang.Object
com.extractpdf4j.parsers.BaseParser
Direct Known Subclasses:
HybridParser, LatticeParser, OcrStreamParser, StreamParser

public abstract class BaseParser extends Object

Abstract base for all PDF table parsers in com.extractpdf4j. Concrete implementations (e.g., StreamParser, LatticeParser, OcrStreamParser, HybridParser) should extend this class and implement parsePage(int).

Responsibilities

  • Holds common configuration shared by all parsers (file path, page ranges, flags).
  • Provides a final, high-level parse() that resolves the page selection and delegates work to parsePage(int).

Page selection contract

Page ranges are provided as a human-friendly string via pages(String). The format supports values such as "1", "2-5", "1,3-4", and "all". The helper PageRange.parse(..) converts this into a list of integers. Implementations must honor the following convention:

  • A call to parsePage(-1) indicates that all pages should be parsed.
  • Otherwise, parsePage(p) is called once per requested page number p.
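
To make the string format concrete, the following is a minimal sketch of how such a selection string could be turned into page numbers. PageRangeSketch is a hypothetical stand-in written for illustration, not the library's PageRange; the case-insensitive match on "all" and the lack of validation for malformed or descending ranges are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for PageRange.parse(..); the real helper may behave differently.
final class PageRangeSketch {

    /** Converts "1", "2-5", "1,3-4" or "all" into page numbers; "all" becomes [-1]. */
    static List<Integer> parse(String spec) {
        List<Integer> pages = new ArrayList<>();
        String trimmed = spec.trim();
        if (trimmed.equalsIgnoreCase("all")) { // case-insensitivity is an assumption
            pages.add(-1);                     // sentinel meaning "every page"
            return pages;
        }
        for (String part : trimmed.split(",")) {
            String token = part.trim();
            int dash = token.indexOf('-');
            if (dash < 0) {
                // Single page, e.g. "1"
                pages.add(Integer.parseInt(token));
            } else {
                // Inclusive range, e.g. "3-4"
                int start = Integer.parseInt(token.substring(0, dash).trim());
                int end = Integer.parseInt(token.substring(dash + 1).trim());
                for (int p = start; p <= end; p++) {
                    pages.add(p);
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(parse("1,3-4")); // prints [1, 3, 4]
        System.out.println(parse("all"));   // prints [-1]
    }
}
```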

Thread-safety

Instances are not inherently thread-safe. Create one parser instance per input file, or synchronize access externally if an instance is shared across threads.

Since:
2025
Author:
Mehuli Mukherjee
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected final String
    filepath
    Absolute or relative path to the PDF file being parsed.
    protected String
    pages
    Page selection string, defaulting to "1".
    protected boolean
    stripText
    Whether to normalize/strip text (e.g., trim, collapse whitespace) in stream-based extraction.
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    protected
    BaseParser()
    Constructs a parser for in-memory processing.
    protected
    BaseParser(String filepath)
    Constructs a parser for the given PDF file.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected List<Table>
    finalizeResults(List<Table> tables, String sourcePath)
    Normalizes parser output for "no tables" situations.
    BaseParser
    pages(String pages)
    Sets the pages to parse.
    List<Table>
    parse()
    Parses the configured pages from the PDF file.
    abstract List<Table>
    parse(org.apache.pdfbox.pdmodel.PDDocument document)
    Parses a previously loaded PDF document.
    protected abstract List<Table>
    parsePage(int page)
    Parses a single page or the entire document.
    BaseParser
    stripText(boolean strip)
    Enables or disables text normalization for stream-style extraction.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • filepath

      protected final String filepath
      Absolute or relative path to the PDF file being parsed. This may be null when processing in-memory documents.
    • pages

      protected String pages
      Page selection string, defaulting to "1". Accepts formats like "1", "2-5", "1,3-4", or "all".
    • stripText

      protected boolean stripText
      Whether to normalize/strip text (e.g., trim, collapse whitespace) in stream-based extraction. Implementations may choose how to interpret this flag.
  • Constructor Details

    • BaseParser

      protected BaseParser(String filepath)
      Constructs a parser for the given PDF file.
      Parameters:
      filepath - path to the PDF file
    • BaseParser

      protected BaseParser()
      Constructs a parser for in-memory processing. The filepath will be null.
  • Method Details

    • pages

      public BaseParser pages(String pages)
      Sets the pages to parse. See the class docs for supported formats.
      Parameters:
      pages - page selection string (e.g., "all", "1", "2-5", "1,3-4")
      Returns:
      this parser (for chaining)
    • stripText

      public BaseParser stripText(boolean strip)
      Enables or disables text normalization for stream-style extraction. Implementations may ignore this flag if not applicable.
      Parameters:
      strip - true to normalize/strip text, false to keep raw text
      Returns:
      this parser (for chaining)
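
The chaining behavior of pages(String) and stripText(boolean) can be illustrated with a small self-contained sketch. FluentBaseSketch and FluentDemoParser are hypothetical stand-ins that mirror only the two documented setters; the initial value of stripText shown here is an assumption, since the default is not stated above.

```java
// Stand-in mirroring only the two documented fluent setters; not the real BaseParser.
abstract class FluentBaseSketch {
    protected String pages = "1";        // documented default
    protected boolean stripText = false; // default chosen for the sketch (assumption)

    public FluentBaseSketch pages(String pages) {
        this.pages = pages;
        return this; // returning this enables chaining
    }

    public FluentBaseSketch stripText(boolean strip) {
        this.stripText = strip;
        return this;
    }
}

// Hypothetical concrete parser used only to observe the configured state.
final class FluentDemoParser extends FluentBaseSketch {
    String describe() {
        return "pages=" + pages + ", stripText=" + stripText;
    }

    public static void main(String[] args) {
        FluentDemoParser parser = new FluentDemoParser();
        parser.pages("2-5").stripText(true);   // both calls mutate the same instance
        System.out.println(parser.describe()); // prints pages=2-5, stripText=true
    }
}
```

Because both setters are declared to return BaseParser, a chained expression has the base type. Keep a reference to the concrete parser if subclass-specific methods are needed after configuration.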
    • parse

      public List<Table> parse() throws IOException
      Parses the configured pages from the PDF file.

      This method resolves the page selection via PageRange.parse(pages) and then delegates to parsePage(int). If the parsed list contains only -1, parsePage(int) is called with -1 to indicate "all pages". Otherwise, it is called once for each requested page number.

      Returns:
      a list of Table instances extracted from the requested pages (possibly empty)
      Throws:
      IOException - if reading the file fails or a parsing error occurs
      Since:
      2025
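
The delegation described for parse() can be sketched as follows. This is an illustration under stated assumptions, not the library's source: DispatchSketch and RecordingParser are hypothetical, strings stand in for Table objects, and the already-resolved page list is passed in directly instead of being read from the pages field.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for illustration; strings play the role of Table objects.
abstract class DispatchSketch {

    /** Mirrors the parsePage(int) contract: -1 means the whole document. */
    protected abstract List<String> parsePage(int page);

    /** One call with -1 for "all pages", otherwise one call per requested page. */
    final List<String> parse(List<Integer> resolvedPages) {
        if (resolvedPages.size() == 1 && resolvedPages.get(0) == -1) {
            return parsePage(-1);
        }
        List<String> out = new ArrayList<>();
        for (int p : resolvedPages) {
            out.addAll(parsePage(p)); // results are concatenated in request order
        }
        return out;
    }
}

// Hypothetical subclass that records which page arguments it receives.
final class RecordingParser extends DispatchSketch {
    final List<Integer> calls = new ArrayList<>();

    @Override
    protected List<String> parsePage(int page) {
        calls.add(page);
        return List.of(page == -1 ? "table@all" : "table@" + page);
    }
}
```

With this sketch, new RecordingParser().parse(List.of(1, 3, 4)) invokes parsePage three times, while parse(List.of(-1)) invokes it once with the sentinel.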
    • parsePage

      protected abstract List<Table> parsePage(int page) throws IOException
      Parses a single page or the entire document.

      Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).

      Parameters:
      page - page index to parse, or -1 to parse all pages
      Returns:
      a list of Table objects extracted from the requested page(s) (possibly empty)
      Throws:
      IOException - if an error occurs while parsing
      Since:
      2025
    • finalizeResults

      protected List<Table> finalizeResults(List<Table> tables, String sourcePath)
      Normalizes parser output for "no tables" situations.

      If tables is null or empty, logs a concise message and returns Collections.emptyList(). Otherwise, returns the input list unchanged.

      Parameters:
      tables - tables collected for the requested page(s)
      sourcePath - path to the input PDF (logging only)
      Returns:
      a non-null list of tables
      Since:
      2025
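
A minimal sketch of the normalization described for finalizeResults follows. FinalizeSketch is hypothetical, a generic element type stands in for Table, and the use of java.util.logging and the message wording are assumptions; the logging framework actually used is not specified above.

```java
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in; T plays the role of Table.
final class FinalizeSketch {
    // java.util.logging is an assumption made for the sketch.
    private static final Logger LOG = Logger.getLogger(FinalizeSketch.class.getName());

    /** Returns an empty list for null or empty input, otherwise the same list instance. */
    static <T> List<T> finalizeResults(List<T> tables, String sourcePath) {
        if (tables == null || tables.isEmpty()) {
            LOG.info("No tables found in " + sourcePath); // sourcePath is used for logging only
            return Collections.emptyList();
        }
        return tables;
    }
}
```

Callers can then iterate over the result without a null check, which is the point of returning a non-null list.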
    • parse

      public abstract List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
      Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
      Parameters:
      document - The PDDocument to parse.
      Returns:
      A list of extracted tables.
      Throws:
      IOException - for I/O issues during parsing.