com.extractpdf4j.parsers.BaseParser

Direct Known Subclasses: HybridParser, LatticeParser, OcrStreamParser, StreamParser

public abstract class BaseParser extends Object

BaseParser

Abstract base for all PDF table parsers in com.extractpdf4j. Concrete implementations (e.g., StreamParser, LatticeParser, OcrStreamParser, HybridParser) should extend this class and implement its abstract methods, parsePage(int) and parse(PDDocument).

Responsibilities

- Holds common configuration shared by all parsers (file path, page ranges, flags).
- Provides a final, high-level parse() that resolves the page selection and delegates work to parsePage(int).

Page selection contract

Page ranges are provided as a human-friendly string via pages(String). The format supports values such as "1", "2-5", "1,3-4", and "all". The helper PageRange.parse(..) converts this into a list of integers. Implementations must honor the following convention:

- A call to parsePage(-1) means that all pages should be parsed.
- Otherwise, parsePage(p) is called once per requested page number p.
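
The conversion from a selection string to page numbers can be sketched as follows. This is a minimal illustration, not the library's PageRange implementation: the class name PageSpecSketch and its error handling are assumptions, and only the formats listed above ("1", "2-5", "1,3-4", "all") are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PageRange; not the library's implementation. */
final class PageSpecSketch {
    private PageSpecSketch() {}

    /** Converts "1", "2-5", "1,3-4" or "all" into a list of page numbers. */
    static List<Integer> parse(String spec) {
        String s = spec.trim();
        if (s.equalsIgnoreCase("all")) {
            return List.of(-1); // sentinel: every page
        }
        List<Integer> pages = new ArrayList<>();
        for (String part : s.split(",")) {
            String p = part.trim();
            int dash = p.indexOf('-');
            if (dash < 0) {
                pages.add(Integer.parseInt(p)); // single page, e.g. "3"
            } else {
                int start = Integer.parseInt(p.substring(0, dash).trim());
                int end = Integer.parseInt(p.substring(dash + 1).trim());
                if (start > end) {
                    throw new IllegalArgumentException("descending range: " + p);
                }
                for (int i = start; i <= end; i++) {
                    pages.add(i); // inclusive range, e.g. "2-5"
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(parse("1,3-4")); // prints [1, 3, 4]
        System.out.println(parse("all"));   // prints [-1]
    }
}
```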

Thread-safety

Instances are not inherently thread-safe. Create one parser instance per input file or synchronize external access if you share state across threads.
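
One way to follow this advice is to create a fresh parser inside each task, so no parser state is shared between threads. The sketch below uses a hypothetical FakeParser in place of a concrete parser class; the pool size and all names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerFileParserSketch {
    /** Hypothetical stand-in for a concrete parser with mutable per-instance state. */
    static final class FakeParser {
        private final String filepath;
        private String pages = "1";

        FakeParser(String filepath) { this.filepath = filepath; }

        FakeParser pages(String pages) { this.pages = pages; return this; }

        String parse() { return filepath + ":" + pages; }
    }

    /** Parses each file on a worker thread, using one parser instance per file. */
    static List<String> parseAll(List<String> files) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String file : files) {
                // The parser is created inside the task, so it is never shared.
                tasks.add(() -> new FakeParser(file).pages("all").parse());
            }
            List<String> results = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```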

Since: 2025
Author: Mehuli Mukherjee

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected final String | filepath | Absolute or relative path to the PDF file being parsed. |
| protected String | pages | Page selection string, defaulting to "1". |
| protected boolean | stripText | Whether to normalize/strip text (e.g., trim, collapse whitespace) in stream-based extraction. |

Constructor Summary

Constructors

| Modifier | Constructor | Description |
| --- | --- | --- |
| protected | BaseParser() | Constructs a parser for in-memory processing. |
| protected | BaseParser(String filepath) | Constructs a parser for the given PDF file. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected List<Table> | finalizeResults(List<Table> tables, String sourcePath) | Normalizes parser output for "no tables" situations. |
| BaseParser | pages(String pages) | Sets the pages to parse. |
| List<Table> | parse() | Parses the configured pages from the PDF file. |
| abstract List<Table> | parse(org.apache.pdfbox.pdmodel.PDDocument document) | Parses a previously loaded PDF document. |
| protected abstract List<Table> | parsePage(int page) | Parses a single page or the entire document. |
| BaseParser | stripText(boolean strip) | Enables or disables text normalization for stream-style extraction. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- filepath
  
  protected final String filepath
  
  Absolute or relative path to the PDF file being parsed. This may be null when processing in-memory documents.
- pages
  
  protected String pages
  
  Page selection string, defaulting to "1". Accepts formats like "1", "2-5", "1,3-4", or "all".
- stripText
  
  protected boolean stripText
  
  Whether to normalize/strip text (e.g., trim, collapse whitespace) in stream-based extraction. Implementations may choose how to interpret this flag.

Constructor Details
- BaseParser
  
  protected BaseParser(String filepath)
  
  Constructs a parser for the given PDF file.
  
  Parameters:
  
  filepath - path to the PDF file
- BaseParser
  
  protected BaseParser()
  
  Constructs a parser for in-memory processing. The filepath will be null.

Method Details
- pages
  
  public BaseParser pages(String pages)
  
  Sets the pages to parse. See the class docs for supported formats.
  
  Parameters:
  
  pages - page selection string (e.g., "all", "1", "2-5", "1,3-4")
  
  Returns:
  
  this parser (for chaining)
- stripText
  
  public BaseParser stripText(boolean strip)
  
  Enables or disables text normalization for stream-style extraction. Implementations may ignore this flag if not applicable.
  
  Parameters:
  
  strip - true to normalize/strip text, false to keep raw text
  
  Returns:
  
  this parser (for chaining)
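
  Because pages(String) and stripText(boolean) both return the parser, the calls can be chained. The sketch below mirrors that pattern on a hypothetical stand-in class; FluentConfigSketch is not part of the library, and the initial value of stripText is an assumption because the documentation above does not state it.

```java
/** Hypothetical stand-in that mirrors the documented fluent setters. */
class FluentConfigSketch {
    protected String pages = "1"; // documented default
    protected boolean stripText;  // default not stated in the source; Java's false is used here

    /** Sets the page selection and returns this instance for chaining. */
    public FluentConfigSketch pages(String pages) {
        this.pages = pages;
        return this;
    }

    /** Sets the normalization flag and returns this instance for chaining. */
    public FluentConfigSketch stripText(boolean strip) {
        this.stripText = strip;
        return this;
    }

    public static void main(String[] args) {
        FluentConfigSketch config = new FluentConfigSketch().pages("2-5").stripText(true);
        System.out.println(config.pages + " " + config.stripText); // prints 2-5 true
    }
}
```

  Since the declared return type is BaseParser, any setter that exists only on a concrete subclass would have to be called before these two in a chain, unless the subclass overrides them with a covariant return type.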
- parse
  
  public List<Table> parse() throws IOException
  
  Parses the configured pages from the PDF file.
  This method resolves the page selection via PageRange.parse(pages) and then delegates to parsePage(int). If the parsed list contains only -1, parsePage(int) is called with -1 to indicate "all pages". Otherwise, it is called once for each requested page number.
  
  Returns:
  
  a list of Table instances extracted from the requested pages (possibly empty)
  
  Throws:
  
  IOException - if reading the file fails or a parsing error occurs
  
  Since:
  
  2025
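
  The delegation described for parse() can be sketched as follows. This is an illustration under stated assumptions, not the library's code: DelegationSketch and RecordingParserSketch are hypothetical, String stands in for Table, and the resolved parameter stands in for the list that PageRange.parse(pages) would produce.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for BaseParser; String replaces Table to stay self-contained. */
abstract class DelegationSketch {
    /** Dispatches an already-resolved page list to parsePage(int). */
    final List<String> parseResolved(List<Integer> resolved) {
        // A list holding only -1 means "all pages": a single call with the sentinel.
        if (resolved.size() == 1 && resolved.get(0) == -1) {
            return parsePage(-1);
        }
        // Otherwise call parsePage once per requested page and merge the results.
        List<String> tables = new ArrayList<>();
        for (int page : resolved) {
            tables.addAll(parsePage(page));
        }
        return tables;
    }

    protected abstract List<String> parsePage(int page);
}

/** Records each call so the dispatch order can be inspected. */
class RecordingParserSketch extends DelegationSketch {
    final List<Integer> calls = new ArrayList<>();

    @Override
    protected List<String> parsePage(int page) {
        calls.add(page);
        return List.of("table@" + page);
    }
}
```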
- parsePage
  
  protected abstract List<Table> parsePage(int page) throws IOException
  
  Parses a single page or the entire document.
  Contract: If page == -1, the implementation must parse the entire document. For any non-negative value, the implementation must parse only the specified page index (1-based or 0-based is implementation-defined, but should be consistent across the codebase and documented in concrete classes).
  
  Parameters:
  
  page - page index to parse, or -1 to parse all pages
  
  Returns:
  
  a list of Table objects extracted from the requested page(s) (possibly empty)
  
  Throws:
  
  IOException - if an error occurs while parsing
  
  Since:
  
  2025
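
  A minimal sketch of an implementation that honors this contract is shown below. It is hypothetical throughout: InMemoryPageParserSketch is not a library class, a list of strings stands in for the document's pages, String stands in for Table, and 1-based page numbers are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser over in-memory page texts; one non-empty text counts as one "table". */
final class InMemoryPageParserSketch {
    private final List<String> pageTexts; // stand-in for the pages of a document

    InMemoryPageParserSketch(List<String> pageTexts) {
        this.pageTexts = pageTexts;
    }

    /** Parses one page (1-based, an assumption) or every page when page is -1. */
    List<String> parsePage(int page) {
        if (page == -1) {
            // Sentinel: walk the whole document and merge the per-page results.
            List<String> all = new ArrayList<>();
            for (int p = 1; p <= pageTexts.size(); p++) {
                all.addAll(parsePage(p));
            }
            return all;
        }
        if (page < 1 || page > pageTexts.size()) {
            throw new IllegalArgumentException("page out of range: " + page);
        }
        String text = pageTexts.get(page - 1);
        // An empty page yields an empty list rather than null.
        return text.trim().isEmpty() ? List.of() : List.of(text);
    }
}
```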
- finalizeResults
  
  protected List<Table> finalizeResults(List<Table> tables, String sourcePath)
  
  Normalizes parser output for "no tables" situations.
  If tables is null or empty, logs a concise message and returns Collections.emptyList(). Otherwise, returns the input list unchanged.
  
  Parameters:
  
  tables - tables collected for the requested page(s)
  
  sourcePath - path to the input PDF (logging only)
  
  Returns:
  
  a non-null list of tables
  
  Since:
  
  2025
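
  The normalization described for finalizeResults can be sketched as follows. This is a minimal illustration, not the library's code: the use of java.util.logging, the message text, and the generic element type (in place of Table) are assumptions.

```java
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch of the "no tables" normalization. */
final class FinalizeResultsSketch {
    // The logging framework is an assumption; the source does not name one.
    private static final Logger LOG = Logger.getLogger(FinalizeResultsSketch.class.getName());

    private FinalizeResultsSketch() {}

    /** Returns a non-null list: an empty list for null or empty input, else the input itself. */
    static <T> List<T> finalizeResults(List<T> tables, String sourcePath) {
        if (tables == null || tables.isEmpty()) {
            LOG.info("No tables found in " + sourcePath); // sourcePath is used for logging only
            return Collections.emptyList();
        }
        return tables;
    }
}
```

  Returning an empty list instead of null lets callers iterate over the result without a null check.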
- parse
  
  public abstract List<Table> parse(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
  
  Parses a previously loaded PDF document. This is the preferred method for in-memory processing.
  
  Parameters:
  
  document - the PDDocument to parse
  
  Returns:
  
  a list of extracted tables
  
  Throws:
  
  IOException - if an I/O error occurs during parsing
