license: Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.

layout: default
title: External Links

External Links

This page lists projects that use PDFBox and articles that have been written about it. To have a project or article added, or to update the information for an existing link, please file an improvement issue.

Projects Using PDFBox

| Project Name | License | Project Description |
| --- | --- | --- |
| Alfresco | LGPL (commercial services, support, and training are available) | Alfresco is an open source, open-standards content repository built by the most experienced content management team that includes the co-founder of Documentum. |
| Apache Nutch | Apache License v2 | Apache Nutch is open source web-search software. It builds on Apache Lucene, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats. |
| Apache Tika | Apache License v2 | Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. |
| Canoo Webtest | BSD-like | Free open source tool for XP-style acceptance testing of Java-based web applications. |
| ECM REWOO Scope | Commercial | REWOO Scope is Enterprise Content Management (ECM) software for organizing, structuring, and consolidating enterprise data. Apache PDFBox is an integral part of how it reads and indexes PDF documents. |
| Jomic | GPL | Jomic is a viewer for comic book archives. |
| JpdfUnit | Apache License v2 | pdfUnit is a framework for testing a generated PDF document with the JUnit test framework. |
| Liferay Portal | MIT | Liferay Portal is an open source portal that helps organizations collaborate more efficiently by providing a consolidated view of disparate applications. |
| LuceGene | Artistic License | LuceGene is an open source document/object search and retrieval system specially tuned for bioinformatics text databases and documents. |
| Lutece | BSD-like | Lutece is a portal engine that lets you easily create websites or intranets based on HTML and XML content. |
| MMBase Lucene Module | MPL | Lucenemodule is a plugin (module) for the MMBase content management system that enables Lucene full-text search through its content and, thanks to PDFBox, through PDF content as well. |
| OpenCms | LGPL | OpenCms is a professional-level open source website content management system. |
| OpenSearchServer | GPLv3 | An open source search engine and crawler based on the best open source technologies. It is a modern search engine and a suite of high-powered full-text search algorithms. |
| Orbeon PresentationServer | LGPL | Orbeon PresentationServer (OPS) is an open source J2EE-based platform for XML-centric web applications. OPS is built around XHTML, XForms, XSLT, XML pipelines, and Web Services, which makes it ideal for applications that capture, process, and present XML data. Commercial consulting, training, and support are available through Orbeon. |
| PDFJuice | Apache License 2.0 | This project provides tools that help the user extract structured information from PDF documents. Currently, the program is able to export them to HTML. |
| SearchBlox | Commercial | SearchBlox is high-performance corporate search software designed for the Java 2 Enterprise Edition (J2EE) platform. |
| Semantic Scholar | Web-based | Semantic Scholar is a new service from AI2 for scientific literature search and discovery, focusing on semantics and textual understanding. |
| SimplexRepaginator | Apache License v2 | Simplex Repaginator converts simplex-scanned PDFs into properly duplex-paginated PDFs and vice versa. |
| Terrier | MPL | Terrier is software for the rapid development of web, intranet, and desktop search engines. |
| Triboni GinkGO | Commercial | Triboni GinkGO is a highly scalable J2EE services platform based on a simple XML business object definition and scripting language. Together with XSLT, content-centric web applications can be configured in a very short time. |

Articles/Books

| Article Name | Article Abstract |
| --- | --- |
| Build an eDoc Reader for your iPod (Part 1 - User Interface; Part 2 - Document Reading Engine; Part 3 - Integration with PDFBox) | A three-part article that discusses the implementation of the PodReader application. PodReader is a Cocoa application written in Objective-C, and the article discusses how to use the Cocoa-Java bridge to integrate with the Java version of PDFBox. |
| Lucene In Action | A book that discusses integrating with the Lucene search engine. One chapter discusses how to index various file formats and highlights PDFBox for indexing PDF documents. |
| Java Developers Journal - March 2005 | An article written by the lead developer of PDFBox discussing text extraction and AcroForm integration using PDFBox functionality. |
| Refactoring trends across N versions of N Java open source systems: an empirical study | This article describes an empirical study of multiple versions of a range of open source Java systems in an attempt to understand whether refactoring occurs and, if so, which types of refactoring were most (and least) common. PDFBox is used as a case study. |