
Installation

This page covers dependencies, runtime requirements, and native setup notes for ExtractPDF4J.


Requirements

  • Java: 17+
  • OS: Linux, macOS, or Windows
  • Build tool: Maven or Gradle
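
As a quick sanity check of the requirements above, the following sketch prints the running Java feature version and operating system, and exits with an error if the runtime is older than Java 17. The class name `RequirementsCheck` and its methods are hypothetical helpers written for this page; they are not part of ExtractPDF4J.

```java
public class RequirementsCheck {

    /** Minimum Java feature version listed in the requirements above. */
    public static final int MIN_JAVA = 17;

    /** Returns true when the given feature version satisfies the Java 17+ requirement. */
    public static boolean isSupportedJava(int featureVersion) {
        return featureVersion >= MIN_JAVA;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() is available since Java 10.
        int feature = Runtime.version().feature();
        System.out.println("Java feature version: " + feature);
        System.out.println("Operating system: " + System.getProperty("os.name"));

        if (!isSupportedJava(feature)) {
            System.err.println("ExtractPDF4J requires Java " + MIN_JAVA + " or newer.");
            System.exit(1);
        }
    }
}
```

With a JDK 11 or newer, this can be run directly as a single-file program with `java RequirementsCheck.java`.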

Recommended Installation (Using BOM)

Starting from v2.1.0, ExtractPDF4J provides a BOM (Bill of Materials) that simplifies dependency management.

Import the BOM once, then declare modules without specifying versions.

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.github.extractpdf4j</groupId>
      <artifactId>extractpdf4j-bom</artifactId>
      <version>2.1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Then declare the modules you need without a version, for example:

<dependency>
  <groupId>io.github.extractpdf4j</groupId>
  <artifactId>extractpdf4j-service</artifactId>
  <!-- no <version>: it is managed by the imported BOM -->
</dependency>

Modules

ExtractPDF4J is split into multiple modules so you can depend only on what you need.

Module                 Description
extractpdf4j-core      Core extraction utilities and shared models
extractpdf4j-service   High-level API for table extraction
extractpdf4j-cli       Command-line interface
extractpdf4j-bom       Centralized dependency management

Maven Usage (Without BOM)

If you prefer not to use the BOM, you can declare modules directly.

Core helpers

<dependency>
  <groupId>io.github.extractpdf4j</groupId>
  <artifactId>extractpdf4j-core</artifactId>
  <version>2.1.0</version>
</dependency>

Service module

<dependency>
  <groupId>io.github.extractpdf4j</groupId>
  <artifactId>extractpdf4j-service</artifactId>
  <version>2.1.0</version>
</dependency>

CLI module

<dependency>
  <groupId>io.github.extractpdf4j</groupId>
  <artifactId>extractpdf4j-cli</artifactId>
  <version>2.1.0</version>
</dependency>

Gradle

Using the BOM:

dependencies {
    // The BOM pins versions for all ExtractPDF4J modules
    implementation platform("io.github.extractpdf4j:extractpdf4j-bom:2.1.0")
    implementation "io.github.extractpdf4j:extractpdf4j-service"
}

Without BOM:

dependencies {
    implementation "io.github.extractpdf4j:extractpdf4j-core:2.1.0"
    implementation "io.github.extractpdf4j:extractpdf4j-service:2.1.0"
    implementation "io.github.extractpdf4j:extractpdf4j-cli:2.1.0"
}

Native Dependencies

ExtractPDF4J relies on several PDF and image-processing libraries. Apache PDFBox is pure Java, while OpenCV, Tesseract, and Leptonica need native binaries.

The libraries are:

  • Apache PDFBox
  • OpenCV
  • optionally Tesseract + Leptonica for OCR

Recommended Native Setup

The easiest setup is to use Bytedeco *-platform artifacts, which bundle native binaries automatically.

This avoids manual native library installation.


Manual Native Setup (Advanced)

If you provide your own native libraries, ensure they are available on your system library path.

Environment variables:

OS        Variable
Linux     LD_LIBRARY_PATH
macOS     DYLD_LIBRARY_PATH
Windows   PATH
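
When a manually installed native library fails to load, it can help to see what the JVM actually received. The sketch below maps the current OS to the variable from the table above and prints its value together with `java.library.path`. `NativePathCheck` and `libraryPathVariable` are hypothetical names used only for illustration, and treating every non-Windows, non-macOS system like Linux is an assumption of this sketch.

```java
import java.util.Locale;

public class NativePathCheck {

    /**
     * Maps an os.name value (for example "Linux", "Mac OS X", "Windows 10")
     * to the library-path environment variable from the table above.
     * Assumption: any other OS is treated like Linux.
     */
    public static String libraryPathVariable(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            return "PATH";
        }
        if (os.startsWith("mac")) {
            return "DYLD_LIBRARY_PATH";
        }
        return "LD_LIBRARY_PATH";
    }

    public static void main(String[] args) {
        String variable = libraryPathVariable(System.getProperty("os.name"));
        String value = System.getenv(variable);

        // Show the environment variable as seen by this JVM process.
        System.out.println(variable + " = " + (value == null ? "(not set)" : value));
        // java.library.path is what System.loadLibrary searches.
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
    }
}
```

If the directory holding your native libraries appears in neither value, the JVM was most likely started from an environment in which the variable was not exported.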

OCR Setup

If OCR is enabled and Tesseract cannot find its language data, point TESSDATA_PREFIX at the directory that contains it:

export TESSDATA_PREFIX=/path/to/tessdata

On Windows, set TESSDATA_PREFIX in System Environment Variables.
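
To confirm that the variable is visible to the JVM and points at usable data, a small check along these lines may help. It is a minimal sketch that assumes language data is stored as `*.traineddata` files directly inside the configured directory, which matches the `/path/to/tessdata` layout shown above. `TessdataCheck` and `hasLanguageData` are hypothetical helpers, not part of ExtractPDF4J or Tesseract.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataCheck {

    /** Returns true if dir is a directory containing at least one *.traineddata file. */
    public static boolean hasLanguageData(Path dir) {
        if (dir == null || !Files.isDirectory(dir)) {
            return false;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.traineddata")) {
            return files.iterator().hasNext();
        } catch (IOException e) {
            // An unreadable directory is reported the same way as a missing one.
            return false;
        }
    }

    public static void main(String[] args) {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null || prefix.isEmpty()) {
            System.out.println("TESSDATA_PREFIX is not set");
            return;
        }
        Path dir = Paths.get(prefix);
        System.out.println("TESSDATA_PREFIX = " + dir.toAbsolutePath());
        System.out.println(hasLanguageData(dir)
                ? "Found Tesseract language data"
                : "No *.traineddata files found in this directory");
    }
}
```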


Local Documentation Setup

If you want to build the documentation locally:

pip install -r docs/requirements.txt
mkdocs build --strict

For local preview:

mkdocs serve

Installation Guidance by Scenario

Calling the Java API

Use:

extractpdf4j-service

Building command-line workflows

Use:

extractpdf4j-cli

Exposing extraction as a service

Use:

extractpdf4j-service

Building full applications

Use:

extractpdf4j-service

and add additional modules only if needed.


See OCR tuning for recommended OCR configuration and performance settings.