| Release 2.3.1 - ??? |
| |
| * Upgrade deeplearning4j to 1.0.0-M1.1 (TIKA-3458) |
| |
| Release 2.3.0 - 02/02/2022 |
| |
| * Upgrade to Apache POI 5.2.0. This is the first upgrade to POI |
| 5.x and represents a major refactoring. Users may experience |
| significantly more logging (TIKA-3164). |
| |
| * Upgrade to log4j2 2.17.1 (TIKA-3638). |
| |
| * Improve consistency in reporting package-entry divs across |
| all parsers for embedded files (TIKA-3644). This leads |
| to some more text (embedded file names) in files with |
| many embedded attachments. |
| |
| * Improve configuration of maps as params for parsers in |
| TikaConfig (TIKA-3645). |
| |
| * Improve identification of iWorks 13 files and add parsing |
| for thumbnails, some metadata and attachments (TIKA-3634). |
| Skip handling of .iwa files, which are not yet supported. |
| |
| * Limit the default in-memory processing (maxMainMemoryBytes) in |
| the PDFParser to 512MB as in the 1.x branch (TIKA-3642). |
| |
| * Added IDML Parser from 1.x series to 2.x series (TIKA-3188). |
| |
| * Extract annotation types and subtypes for PDFs into metadata (TIKA-3653). |
| |
| * Add metadata value for PDFs that contain 3D annotations (TIKA-3653). |
| |
| * Add parser for Translation Memory eXchange (TMX) files (TIKA-3660). |
| |
| * Add Bill of Materials (Maven BOM) for centralized module version management (TIKA-3367). |
| |
| |
| Release 2.2.1 - 12/19/2021 |
| |
| * Fix multithreading bug for ooxml files (TIKA-3627). |
| |
| * Upgrade log4j to 2.17.0 (TIKA-3625). |
| |
| * Upgrade to PDFBox 2.0.25 (TIKA-3622) |
| |
| * Fix bug that prevented metadata keys in the UnpackerResource |
| in tika-server (TIKA-3624). |
| |
| * Upgrade log4j to 2.16.0 (TIKA-3623) |
| |
| Release 2.2.0 - 12/13/2021 |
| |
| * Add support for OneNote files downloaded from O365 (TIKA-3446). |
| |
| * Fix logic bug in PipesServer that prevented concatenation of |
| content from attachments (TIKA-3609). |
| |
| * Improve extraction of embedded files from MSOffice files created |
| by non-Microsoft tools (TIKA-3526). |
| |
| * Added back ability to ignore load errors in TikaConfig (TIKA-3575). |
| |
| * Make SecureContentHandler and other parameters configurable in |
| AutoDetectParser programmatically and via tika-config.xml (TIKA-3594). |
| |
| * Fix default logging in tika-app in batch mode (TIKA-3589). |
| |
| * Fix bug that prevented specifying a config with the long |
| --config= option in tika-app in batch mode (TIKA-3589). |
| |
| * Fix thread starvation after numerous restarts in |
| PipesClient (TIKA-3588). |
| |
| * Fix race condition when starting multiple forked |
| servers on multiple ports (TIKA-3586). |
| |
| * Add timeout per task to be configured via headers |
| for tika-server's legacy endpoints /tika and /rmeta. |
| Note that this timeout cannot be greater than taskTimeoutMillis (TIKA-3582). |
| |
| * Add metadata item for whether or not a PDF has a collection/ |
| is a Portfolio PDF (TIKA-3579). |
| |
| * Add detection of ESRI Layer files (TIKA-3570). |
| |
| * Add detection of JPEG XL, MARC, ICC profiles, NES-ROM file types |
| (TIKA-3562 and TIKA-3563) |
| |
| * Remove duplicate "subject" metadata keys that were intended |
| for backwards compatibility within 1.x only (TIKA-3564). |
| |
| * Fix Open Office mime types to be subclasses of application/zip |
| and no longer require OPCPackageDetector-last ordering of zip |
| detectors (TIKA-3556). |
| |
| * Improve robustness and features of the httpfetcher (TIKA-3543) |
| |
| * Add optional fetch ranges to FetchEmitTuple to allow range fetching from, |
| e.g. http or s3 (TIKA-3542). |
| |
| * Exclude dependencies on jsoup and ehcache in ucar grib/cdm (TIKA-3003). |
| |
| |
| Release 2.1.0 - 08/18/2021 |
| |
| MAJOR CHANGES in 2.1.0: |
| |
| * Improved packaging for tika-parsers-extended. Use the tika-parser-scientific-package and |
| tika-parser-sqlite3-package artifacts if you want fat jars with dependencies. (TIKA-3510) |
| |
| * Tika app writes UTF-8 when an encoding is not specified; the legacy behavior |
| was UTF-8 on Mac OS, but System default on other OSs (TIKA-3515). |
| |
| * Change the default rendering strategy for PDFs from NO_TEXT to ALL (TIKA-3520). |
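| |
| As a minimal sketch of why the UTF-8 default for tika-app noted above matters, the |
| following JDK-only example decodes the same bytes with two different charsets. The |
| class and method names are hypothetical and are not part of Tika. |

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Tika class: shows the effect of the charset choice.
final class ExplicitCharsetDemo {

    // Encodes text with an explicitly chosen charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decodes bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        // Round trip with the same charset preserves the text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));
        // Decoding with the platform default may or may not preserve it,
        // which is the ambiguity an explicit UTF-8 default removes.
        System.out.println(decode(utf8, Charset.defaultCharset()).equals(text));
    }
}
```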
| |
| Other changes: |
| |
| * Fixed bug that pointed to the wrong tessdata directory if the user specified |
| a tesseract path but not also a tessdata path (TIKA-3518). |
| |
| * Fixed bug in Icu4j's encoding detector where it would return non-standard |
| names for charsets, e.g. IBM424_rtl is now returned as IBM424 (TIKA-3516). |
| |
| * Add a simple UrlFetcher in tika-core as a basic alternative |
| to tika-fetcher-http (TIKA-3527). |
| |
| * Add tika-pipes support for Google Cloud Storage (TIKA-3524). |
| |
| * Fix markup ordering errors in xhtml output for ODT files (TIKA-2242). |
| |
| * Fix serialization of embedded docs in OpenSearch emitter |
| and fix embedded documents not being indexed in some use |
| cases in the Solr emitter (TIKA-3490). |
| |
| * Add pipesClientId system property to PipesServer so that each |
| forked process can log to its own logger (TIKA-3480). |
| |
| * Add DateNormalizingMetadataFilter to let users ensure that all dates |
| emitted to Solr/OpenSearch are in UTC. Users can configure which |
| timezone they'd like to use in cases where the file format does |
| not store a timezone (TIKA-3496). |
| |
| * Breaking change in the Solr and OpenSearch emitters. To achieve |
| the SKIP or CONCATENATE attachment strategy, modify the |
| parseMode in the pipesiterators or in the FetchEmitTuple (TIKA-3494). |
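| |
| The date normalization attributed to DateNormalizingMetadataFilter above can be pictured |
| with java.time. This is only a sketch of the idea, not Tika's implementation: the class |
| UtcDateNormalizer, its constructor argument and the accepted input formats are assumptions. |

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeParseException;

// Hypothetical class illustrating "normalize to UTC, with a configurable
// fallback zone for values that carry no timezone".
final class UtcDateNormalizer {
    private final ZoneId defaultZone;

    UtcDateNormalizer(ZoneId defaultZone) {
        this.defaultZone = defaultZone;
    }

    // Returns the instant in UTC as an ISO-8601 string ending in "Z".
    String normalize(String value) {
        try {
            // The value carries its own offset, e.g. 2021-08-18T10:00:00+02:00.
            return OffsetDateTime.parse(value).toInstant().toString();
        } catch (DateTimeParseException e) {
            // No offset stored: interpret the value in the configured zone.
            return LocalDateTime.parse(value).atZone(defaultZone).toInstant().toString();
        }
    }

    public static void main(String[] args) {
        UtcDateNormalizer normalizer = new UtcDateNormalizer(ZoneId.of("America/New_York"));
        System.out.println(normalizer.normalize("2021-08-18T10:00:00+02:00"));
        System.out.println(normalizer.normalize("2021-08-18T10:00:00"));
    }
}
```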
| |
| Release 2.0.0 - 07/07/2021 |
| |
| * Cleanup of fetcher integration with tika-server. |
| |
| * Update dependencies. |
| |
| Release 2.0.0-BETA - 05/19/2021 |
| |
| * Refactor pipes module for resilience |
| |
| * Add transcribe capability (TIKA-94). |
| |
| Release 2.0.0-ALPHA - 01/13/2021 |
| |
| BREAKING CHANGES in 2.0.0 |
| * General |
| * OCR is now triggered automatically for PDFs if tesseract |
| is on the user's path; see https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr |
| for how to disable OCR. |
| * We upgraded from log4j to log4j2 in tika-app, tika-server and anywhere else |
| we used to use log4j. |
| * By default, when rendering a page for OCR, the PDFParser does not render glyphs/text. |
| * Removed deprecated Metadata keys/properties (TIKA-1974). |
| * Removed deprecated PDFPreflightParser (TIKA-3437). |
| * Removed dangerous calls to read an inputstream or convert to bytes |
| without specifying a charset |
| * Parsers can be configured via tika-config.xml on instantiation. |
| We have moved away from configuration via .properties files because |
| of confusion among users. This affects the PDFParser, TesseractOCRParser |
| and the StringsParser. |
| * Changed namespaces of translator implementations (o.a.t.language.translate.impl) to avoid |
| split-package with tika-core |
| |
| * tika-parsers |
| * The parser modules have been broken into three main modules: |
| tika-parsers-standard, tika-parsers-extended and tika-parsers-ml. |
| Users may now need to add tika-parsers-extended's |
| tika-parser-scientific-module or tika-parser-sqlite3-module to tika-app and |
| tika-server to include parsers that used to be included by default |
| (for example: envi, gdal, grib, isatab, netcdf, sqlite3). |
| * PDFParser -- a) see above on OCR. b) This parser no longer warns if the jpeg2000 |
| dependency is not included. Tika now relies on PDFBox to log an error if a jpeg2000 |
| image should be processed but can't because the required external dependency is |
| not available. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io |
| for the non-ASF-2.0-compatible jpeg2000 library. |
| * CompressorParser -- users must add the com.github.luben:zstd-jni dependency to |
| the classpath to process zstd files. This is an optional library that is no longer bundled |
| in tika-parsers-standard-package because it contains native libs. |
| * ChmParser was moved to org.apache.tika.parser.microsoft.chm |
| * RTFParser was moved to org.apache.tika.parser.microsoft.rtf |
| * We are now using non-shaded versions of xmpcore with namespaces com.adobe.internal.* |
| vs com.adobe.*. |
| |
| * tika-app |
| * See above on default inclusion of only tika-parsers-standard. |
| |
| * tika-server |
| * tika-server now by default forks a process to isolate the parsing |
| in the forked process (this was called the -spawnChild option |
| in tika-1.x). Clients must now expect that tika-server |
| will restart on OOM, timeouts, crashes or after parsing a |
| large number of files. When this happens, tika-server will restart and not |
| accept connections for brief periods. The less robust, legacy behavior |
| of not forking a process is available with "-noFork". |
| * Most of tika-server's legacy configuration via the commandline has been moved |
| into configuration via a tika-config.xml file. |
| * tika-server's "enableFileUrl" has been removed in favor of a FileSystemFetcher. |
| * tika-server's /metadata endpoint requires tika-server-standard to write XMP/rdf output. |
| This output is not available in tika-server-core. |
| * In tika-server, for those parsers that can be configured per parse via a config object |
| passed in through the ParseContext, the config object will only update those fields |
| that the user has modified. The config object will no longer |
| fully reset all settings to the default settings per parse. |
| This gives the more intuitive behavior of updating the base/configured settings with |
| what has been changed in the config object. |
| |
| * tika-eval |
| * tika-eval's default profile and comparison reports no longer include tag reports. |
| Users can get the report configs that include tags (*-tags.xml): |
| https://github.com/apache/tika/tree/main/tika-eval/tika-eval-app/src/main/resources |
| |
| Release 1.27 - 06/30/2021 |
| |
| * Migrate MP4 parsing to Drew Noakes' metadata-extractor (TIKA-3459). |
| To revert to legacy parser turn off NoakesMP4Parser and turn on MP4Parser |
| via tika-config.xml. |
| |
| * Prevent rare infinite loop in tika-server's -spawnChild mode |
| when restart fails because of failure to bind to the port (TIKA-3441). |
| |
| * Improve likelihood that tesseract will not be orphaned on |
| jvm restart in tika-server (TIKA-3441). |
| |
| * Deprecate experimental PDFPreflightParser (TIKA-3437). |
| |
| * Apply encoding detection to zip entry names via Ryan421 (TIKA-3374). |
| |
| * Add json output for /tika endpoint in tika-server (TIKA-3352). |
| |
| * Tika's PDFParser should use the underlying file if one is passed in |
| via a TikaInputStream (TIKA-3350) |
| |
| Release 1.26 - 03/24/2021 |
| |
| * Fix thread safety bug in OpenOffice parser (TIKA-3334). |
| |
| * The "writeLimit" header now pertains to the combined characters |
| written per container document (and embedded documents) in the /rmeta |
| endpoint in tika-server (TIKA-3325); it no longer functions only |
| per container or embedded document. |
| |
| * Extract more embedded files in PDFs by recursively processing the |
| embedded file tree (TIKA-3332). |
| |
| * Allow for case insensitive headers for configuration of the PDFParser |
| and the TesseractOCRParser in tika-server via Subhajit Das (TIKA-3320). |
| |
| * Improve detection and parsing of XPS files (TIKA-3316). |
| |
| * General dependency upgrades (TIKA-3244). |
| |
| * Great optimization in ForkParser (TIKA-3237). |
| |
| * Fix parsing of emails attached to other emails in PST files (TIKA-3004). |
| |
| * MP3 parser should output the xmpDM:duration metadata as seconds not |
| milliseconds, consistent with the other Audio and Video parsers (TIKA-3318). |
| |
| * MP4 parser now checks whether any of the Compatible Brands match when identifying |
| the subtype (TIKA-3310). |
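| |
| The combined "writeLimit" described for the /rmeta endpoint above amounts to one shared |
| character budget for a container document and its embedded documents. A minimal sketch, |
| assuming a hypothetical CombinedWriteLimit class that is not part of tika-server: |

```java
// Hypothetical class: one character budget shared by a container document
// and all of its embedded documents.
final class CombinedWriteLimit {
    private final int limit;
    private int written;

    CombinedWriteLimit(int limit) {
        this.limit = limit;
    }

    // Appends as much of text as the remaining budget allows.
    // Returns false once the text had to be truncated.
    boolean write(StringBuilder out, String text) {
        int remaining = limit - written;
        int accepted = Math.min(remaining, text.length());
        out.append(text, 0, accepted);
        written += accepted;
        return accepted == text.length();
    }

    public static void main(String[] args) {
        CombinedWriteLimit budget = new CombinedWriteLimit(10);
        StringBuilder out = new StringBuilder();
        budget.write(out, "abcdefg");   // container text fits
        budget.write(out, "hijklmn");   // embedded text is cut at the shared limit
        System.out.println(out);
    }
}
```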
| |
| Release 1.25 - 11/25/2020 |
| |
| * Fix inconsistent license in xmpcore (TIKA-3204). |
| |
| * General upgrades including some dependencies with |
| recently found security vulnerabilities (TIKA-3119). |
| |
| * Add detection and a parser for flat ODF files (TIKA-3159). |
| |
| * Add extraction of macros from ODF files (TIKA-3161). |
| |
| * Add mime detection for hprof and hprof text files (TIKA-3144). |
| |
| * Add TextSignature and TextProfileSignature to tika-eval (TIKA-3145 and TIKA-3146) |
| |
| * Create a metadata filter to trigger tika-eval stats post parsing (TIKA-3140) |
| |
| * Add a configurable metadata-filter for the RecursiveParserWrapper (TIKA-3137) |
| |
| * Parameterize writeLimit and maxEmbeddedResources for RecursiveParserWrapper |
| in tika-server (TIKA-3133) |
| |
| * Add status endpoint to tika-server (TIKA-3129). |
| |
| * Remove whitelist/blacklist terminology (TIKA-3120) |
| |
| * Add detection for parquet files (TIKA-3115). |
| |
| * Add detection and parsing for bplist (TIKA-3104). |
| |
| * Enable metadata value filtering for RecursiveParserWrapper (TIKA-3137) |
| |
| * Add a basic parser for plist files based on com.googlecode.plist:dd-plist (TIKA-3104). |
| |
| * Read hyperlinked images from ODT files (TIKA-3156). |
| |
| * Updated GrobidRESTParser to use new API location (TIKA-3191). |
| |
| * Add FileProfiler to tika-eval (TIKA-3216). |
| |
| * Improved handling of zip files with STORED entries with |
| data descriptor (TIKA-3196). |
| |
| * Add parsers for XLZ, IDML and MIF (TIKA-2976, TIKA-3188 and TIKA-3189). |
| |
| * Add the beginnings of a format-aware fuzzing module (TIKA-3083). |
| |
| * Add wrapper for Linux 'file' command for mime detection (TIKA-3215). |
| |
| * Added ability to skip parsing of embedded files in Tika Server (TIKA-3227). |
| |
| Release 1.24.1 - 4/17/2020 |
| |
| * Allow gzip compression of input and output streams for tika-server (TIKA-3073). |
| |
| Release 1.24 - 3/11/2020 |
| |
| * Add scripts to run tika-server as a service via Eric Pugh, |
| and add these scripts and jar as a new artifact in the release (TIKA-3010). |
| |
| * Upgrade Drew Noakes' metadata-extractor (TIKA-2952). |
| |
| * Enable optional extraction of structural tags in PDFs (alpha-grade) (TIKA-3026). |
| |
| * Tika app's --extract mode now outputs to STDOUT (TIKA-3035). |
| |
| * Add an optional Preflight parser for PDFs (TIKA-3055). |
| |
| * Improve detection of some zip-based formats (TIKA-3057). |
| |
| * Upgrade metadata-extractor to 2.13.0 (TIKA-2952). |
| |
| * Upgrade to POI 4.1.2 (TIKA-3047). |
| |
| * Extract XMP from PSD files (TIKA-3050). |
| |
| * Added XMLProfiler as an optional parser to profile XFA and XMP |
| in PDFs (TIKA-3045). |
| |
| * Extract inline images that rely on the DCT filter from PDFs (TIKA-3041). |
| |
| * Upgrade to PDFBox 2.0.19 (TIKA-3033). |
| |
| * Fix bug in ASM parser configuration (TIKA-2992). |
| |
| * Upgrade to java-libpst 0.9.3 (TIKA-2546). |
| |
| * Fixed XLIFF12Parser failures with ToXMLHandler (TIKA-3014). |
| |
| Release 1.23 - 12/02/2019 |
| |
| * NOTE: The PDFParser now relies on OCRDPI to render page images when |
| users configure OCR on rendered page images. This will have the effect |
| of increasing rendered image size (TIKA-2624). |
| |
| * NOTE: tika-server no longer returns 415 for file types for which there |
| is no parser. |
| |
| * Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002). |
| |
| * Fix incorrect height and width metadata extraction from JPEG images (TIKA-2630). |
| |
| * Upgrade to POI 4.1.1 (TIKA-2851). |
| |
| * Upgrade to PDFBox 2.0.17 (TIKA-2951). |
| |
| * Ensure that the PDFParser respects custom configuration of Tesseract |
| from tika-config.xml via Eric Pugh (TIKA-2970). |
| |
| * Add parser for XLIFF v1.2 files (TIKA-2975). |
| |
| * Add mime type detection support for WebAssembly (TIKA-2894), |
| HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988); |
| and xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989). |
| |
| * Add an XLZ Parser (TIKA-2976). |
| |
| * Fix deadlock with ForkParser when InputStream throws IOException (TIKA-2892). |
| |
| Release 1.22 - 07/29/2019 |
| |
| * NOTE: tika-server no longer hard-codes the HtmlParser to handle |
| XML files (TIKA-2910). Users must now configure that behavior |
| via a tika-config.xml file. |
| |
| * NOTE: Known regression: PDFBOX-4587 -- PDF passwords with codepoints |
| between 0xF000 and 0xF0000 will cause an exception. |
| |
| * Add parser for HWP v5 files via SooMyung Lee (soomyung) and |
| JinSup Kim (ddoleye) (TIKA-2909). |
| |
| * Fix order of closing streams to avoid "Failed to close temporary resource" |
| exception in TesseractOCRParser (TIKA-2908). |
| |
| * Improve AutoDetectReader performance by caching encoding |
| detector (TIKA-1568). |
| |
| * Prevent RTFParser from outputting illegal tag combinations (TIKA-2889). |
| |
| * Fix RereadableInputStream to release all resources (TIKA-2903). |
| |
| * Implement custom language identifier in the tika-eval module based on |
| OpenNLP's language detector; add 18 languages and add common words |
| lists for all 121 languages (TIKA-2790). |
| |
| * Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders (TIKA-2896). |
| |
| * Fix RTFParser to extract more content (TIKA-2883). |
| |
| * Add clientSubmitTime to the metadata extracted from PST files (TIKA-2898). |
| |
| * Improve StreamingZipContainerDetector for xltx, xltm and |
| several other file formats (TIKA-2886). |
| |
| Release 1.21 - 05/14/2019 |
| |
| * Add optional AUTO mode to OCR'ing of PDFs. If tesseract is installed |
| and on the path, and this option is selected programmatically |
| or via TikaConfig(), the PDFParser will use heuristics to decide |
| whether or not to run OCR per page on PDFs. (TIKA-2749) |
| |
| * The ZipContainerDetector's default behavior was changed to run |
| streaming detection up to its markLimit. Users can get the |
| legacy behavior (spool-to-file/rely-on-underlying-file-in-TikaInputStream) |
| by setting markLimit=-1. The POIFSContainerDetector requires an underlying file; |
| it will try to spool the file to disk; if the file's length is > markLimit, |
| it will not attempt detection; set markLimit to -1 for legacy behavior (TIKA-2849). |
| |
| * Upgrade PDFBox to 2.0.14 (TIKA-2834). |
| |
| * Add CSV detection and replace TXTParser with TextAndCSVParser; |
| users can turn off CSV detection by excluding the TextAndCSVParser |
| and adding back the TXTParser via tika-config (TIKA-2833). |
| |
| * Add a CSVParser. CSV detection is currently based solely on filename |
| and/or information conveyed via Metadata (TIKA-2826). |
| |
| * General upgrades: asm, bouncycastle, commons-codec, commons-lang3, cxf, |
| guava, h2, httpcomponents, jackcess, junrar, Lucene, mime4j, opennlp, parso, |
| sqlite-jdbc (provided), zstd-jni (provided) (TIKA-2824) |
| |
| * Bundle xerces2 with tika-parsers (TIKA-2802). |
| |
| * Upgrade jaxb to 2.3.2 (TIKA-2819). |
| |
| * Upgrade jackson to 2.9.8 (TIKA-2717). |
| |
| * Update tika-eval's common tokens lists (TIKA-2822). |
| |
| * Handle bad tags in tika-eval more robustly (TIKA-2810). |
| |
| * Add reports for tags in tika-eval (TIKA-2809). |
| |
| * Extract text from SDT element within textboxes in .docx files (TIKA-2807). |
| |
| * Try to handle truncated OOXML files more robustly (TIKA-2765). |
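| |
| The ZipContainerDetector entry above describes streaming detection bounded by a markLimit. |
| A rough JDK-only sketch of that idea follows: peek at no more than markLimit bytes, then |
| reset so that a later parser still sees the whole stream. MarkLimitPeek is a hypothetical |
| class; Tika's detectors do considerably more, and the markLimit=-1 legacy mode is not modeled. |

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical class, not a Tika detector.
final class MarkLimitPeek {

    // Returns true if the first four bytes are the zip local-file-header
    // magic "PK\003\004". Reads at most markLimit bytes and rewinds afterwards.
    static boolean looksLikeZip(InputStream stream, int markLimit) {
        if (!stream.markSupported() || markLimit < 0) {
            throw new IllegalArgumentException("need mark/reset support and markLimit >= 0");
        }
        byte[] head = new byte[4];
        int wanted = Math.min(head.length, markLimit);
        int read = 0;
        try {
            stream.mark(markLimit);
            while (read < wanted) {
                int n = stream.read(head, read, wanted - read);
                if (n < 0) {
                    break; // stream ended before enough bytes were available
                }
                read += n;
            }
            stream.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return read == 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
    }

    public static void main(String[] args) {
        byte[] zipLike = {'P', 'K', 3, 4, 0, 0};
        System.out.println(looksLikeZip(new ByteArrayInputStream(zipLike), 1024));
        // With a limit too small to see the magic, detection is not attempted.
        System.out.println(looksLikeZip(new ByteArrayInputStream(zipLike), 2));
    }
}
```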
| |
| Release 1.20 - 12/17/2018 |
| |
| * Upgrade to POI 4.0.1 (TIKA-2751). |
| |
| * Integrate/parameterize new angles handling in |
| PDFBox (TIKA-2779). |
| |
| * Upgrade to PDFBox 2.0.13 (TIKA-2788). |
| |
| * Prevent content within <style/> and <script/> elements |
| from being written in the ToTextContentHandler (TIKA-2550). |
| |
| * Switch child to parent communication to a shared memory-mapped |
| file in tika-server's -spawnChild mode. |
| |
| * Fix bug in tika-server when run in legacy mode (not -spawnChild) |
| that caused it to return 503 on documents submitted after |
| it hit an OutOfMemoryError (TIKA-2776). |
| |
| * Upgrade jaxb-runtime and javax.activation (TIKA-2778). |
| |
| * tika-app in batch mode now requires an interrupt or |
| kill signal to the parent process to stop the parent |
| and the child processes (TIKA-2780). |
| |
| * Bulk upgrade of dependencies (TIKA-2775). |
| |
| * Improve language id efficiency in tika-eval (TIKA-2777). |
| |
| * Upgrade sqlite "provided" dependency to 3.25.2 (TIKA-2773). |
| |
| * Remove duplication of notes in PPT slides (TIKA-2735) |
| |
| * Use -javaHome or $JAVA_HOME (if they exist) when |
| spawning child in tika-server's -spawnChild mode. |
| |
| * Fixed closing of styles around Hyperlinks in Word Parser |
| Contributed by Ronan O'Sullivan (TIKA-2599). |
| |
| Release 1.19.1 - 10/4/2018 |
| |
| * Update PDFBox to 2.0.12, jempbox to 1.8.16 |
| and jbig2 to 3.0.2 (TIKA-2745). |
| |
| * Fix regression in parser for MP3 files (TIKA-2730). |
| |
| * Updated Python Dependency Check for TesseractOCR (TIKA-2740). |
| |
| * Improve SAXParser robustness (TIKA-2727). |
| |
| * Remove dependency on slf4j-log4j12 by upgrading jmatio (TIKA-2742). |
| |
| * Replace com.sun.xml.bind:jaxb-impl and jaxb-core with |
| org.glassfish.jaxb:jaxb-runtime and jaxb-core (TIKA-2743) |
| |
| Release 1.19 - 9/14/2018 |
| |
| * Require Java 8 (TIKA-2679). |
| |
| * Enable building with Java 11 (TIKA-2668) |
| |
| * Add an option to make tika-server robust against infinite loops, |
| OOMs, and memory leaks (TIKA-2725). |
| |
| * Allow configuration of the Tesseract parser via the standard |
| tika-config.xml options (TIKA-2705). |
| |
| * Improve handling of empty cells across table-based |
| formats (TIKA-2479). |
| |
| * Add a Standards compliant HTML encoding detector |
| via Gerard Bouchar (TIKA-2673). |
| |
| * Improved XML parsing -- limited default entity expansions to 20. |
| To raise this limit, add -Djdk.xml.entityExpansionLimit=XXX to |
| your commandline. |
| |
| * Mime magic improvements for Olympus RAW (TIKA-2658), interpreted |
| server-side languages via HTTP (TIKA-2648), MHTML (TIKA-2723) |
| |
| * Add absolute timeout to ForkParser rather than testing |
| for active (TIKA-2656). |
| |
| * Make the RecursiveParserWrapper work with the ForkParser (TIKA-2655). |
| |
| * Allow the ForkParser to specify a directory containing tika-app.jar |
| for use by the ForkServer. This allows users to keep most of the |
| parser dependencies out of their code; and it allows for an easy |
| addition of optional jars for Parser dependencies, |
| such as the xerial sqlite jar (TIKA-2653). |
| |
| * Use a pool for SAXParsers and DOMBuilders rather than creating |
| a new parser/builder for every parse. |
| For better performance, set XMLReaderUtils.setPoolSize() to the |
| number of threads you're using with Tika (TIKA-2645). |
| |
| * Add the RecursiveParserWrapperHandler to improve the RecursiveParserWrapper |
| API slightly (TIKA-2644). |
| |
| * Upgraded to Commons-Compress 1.18 (TIKA-2707). |
| |
| * Upgraded to Apache POI 4.0.0 (TIKA-2552). |
| |
| * Upgraded to Apache PDFBox 2.0.11 (TIKA-2681). |
| |
| * Upgraded to deeplearning4j 1.0.0-beta2 (TIKA-2672). |
| |
| * Upgraded jmatio to 1.4 (TIKA-2667) |
| |
| * Upgraded Apache Lucene to 7.4.0 in tika-eval and tika-examples (TIKA-2695). |
| |
| * Upgraded junrar to 1.0.1 (TIKA-2664). |
| |
| * Numerous other upgrades (TIKA-2692). |
| |
| * Excluded Spring as a transitive dependency (TIKA-2721). |
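| |
| The entry above on pooling SAXParsers and DOMBuilders (TIKA-2645) refers to reusing parsers |
| instead of creating one per parse. The following is a simplified sketch of such a pool using |
| only the JDK. SaxParserPool is a hypothetical class and is not how XMLReaderUtils is |
| implemented; it also assumes the JDK's built-in parser, which supports SAXParser.reset(). |

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical fixed-size pool; size it to the number of parsing threads.
final class SaxParserPool {
    private final BlockingQueue<SAXParser> idle;

    SaxParserPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            for (int i = 0; i < size; i++) {
                idle.add(factory.newSAXParser());
            }
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not create SAX parsers", e);
        }
    }

    // Blocks until a parser is free.
    SAXParser acquire() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a parser", e);
        }
    }

    // Resets the parser to its initial configuration and returns it to the pool.
    void release(SAXParser parser) {
        parser.reset();
        idle.add(parser);
    }

    int idleCount() {
        return idle.size();
    }
}
```

| A caller would pair acquire() with release() in a try/finally block so that a parser is |
| returned to the pool even when a parse fails. |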
| |
| Release 1.18 - 4/20/2018 |
| |
| * Upgrade jackson to 2.9.5 (TIKA-2634). |
| |
| * Add support for brotli (TIKA-2621). |
| |
| * Upgrade PDFBox to 2.0.9 and include new jbig2-imageio |
| from org.apache.pdfbox (TIKA-2579 and TIKA-2607). |
| |
| * Support for TIFF images in PDF files (TIKA-2338) |
| |
| * Detection of fully encrypted 7z files (TIKA-2568) |
| |
| * Various new mimes and typo fixes in tika-mimetypes.xml |
| via Andreas Meier (TIKA-2527). |
| |
| * Revert to listenForAllRecords=false in ExcelExtractor |
| via Grigoriy Alekseev (TIKA-2590) |
| |
| * Add workaround to identify TIFFs that might confuse |
| commons-compress's tar detection via Daniel Schmidt |
| (TIKA-2591) |
| |
| * Ignore non-IANA supported charsets in HTML meta-headers |
| during charset detection in HTMLEncodingDetector |
| via Andreas Meier (TIKA-2592) |
| |
| * Add detection and parsing of zstd (if user provides |
| com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576) |
| |
| * Allow for RFC822 detection for files starting with "dkim-" |
| and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587) |
| |
| * Extract xlsx files embedded in OLE objects within PPT and PPTX |
| via Brian McColgan (TIKA-2588). |
| |
| * Extract files embedded in HTML and javascript inside HTML |
| that are stored in the Data URI scheme (TIKA-2563). |
| |
| * Extract text from grouped text boxes in PPT (TIKA-2569). |
| |
| * Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559) |
| |
| * RFC822 with multipart/mixed, first text element should be treated |
| as the main body of the email, not an attachment (TIKA-2547). |
| |
| * Swap out com.tdunning:json for com.github.openjson:openjson to avoid |
| jar conflicts (TIKA-2556). |
| |
| * No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551). |
| |
| * Require Java 8 (TIKA-2553). |
| |
| * Add a parser for XPS (TIKA-2524). |
| |
| * Mime magic for Dolby Digital AC3 and EAC3 files |
| |
| * Fixed bug where TesseractOCRParser ignores configured ImageMagickPath, |
| and set rotation script to ignore Python warnings (TIKA-2509) |
| |
| * Upgrade geo-apis to 3.0.1 (TIKA-2535) |
| |
| * Mime definition and magic improvements for text-based programming |
| and config formats (TIKA-2554, TIKA-2567, TIKA-1141) |
| |
| * Added local Docker image build using dockerfile-maven-plugin to allow |
| images to be built from source (TIKA-1518). |
| |
| * Support for SAS7BDAT data files (TIKA-2462) |
| |
| * Handle .epub files using .htm rather than .html extensions for the |
| embedded contents (TIKA-1288) |
| |
| * Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629) |
| |
| * For sparse XLSX and XLSB files, always output missing cells to |
| the left of filled ones (matching XLS), and optionally output |
| missing rows on all 3 formats if requested via the |
| OfficeParserContext (TIKA-2479) |
| |
| Release 1.17 - 12/8/2017 |
| |
| ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN |
| ON Java 7. The next versions will require Java 8*** |
| |
| * Fix thread-safety in ChmExtractor (TIKA-2519). |
| |
| * Upgrade cxf to 3.0.16 (TIKA-2516). |
| |
| * Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213). |
| |
| * Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512). |
| |
| * Cache TikaConfig in EmbeddedDocumentUtil for better performance |
| in documents with large number of attachments (TIKA-2511). |
| |
| * Extract media files from ooxml (TIKA-2510). |
| |
| * Standardize the way the Image and Video captioning |
| dockers and extraction work (TIKA-2400, GitHub-208) |
| |
| * Upgrade to xmpcore 5.1.3 (TIKA-2034). |
| |
| * Upgrade to metadata-extractor 2.10.1 (TIKA-2486). |
| |
| * Upgrade to OpenNLP 1.8.3 (TIKA-2502). |
| |
| * Upgrade to Jackson 2.9.2 (TIKA-2501). |
| |
| * Catch potential NPE in getting InputStream for attachments |
| in PST file (TIKA-2488). |
| |
| * Upgrade to PDFBox 2.0.8 (TIKA-2489). |
| |
| * Allow configuration of markLimit in EncodingDetectors |
| via tika-config.xml (TIKA-2485). |
| |
| * RFC822Parser now selects the best alternative for |
| multipart/alternative body components. This aligns with the |
| behavior of the OutlookParser (TIKA-2478). Users can select |
| legacy behavior via the "extractAllAlternatives" parameter |
| in the RFC822 parser definition in tika-config.xml. |
| |
| * Narrow mime detection for ms-owner files and add detection |
| for .nls files (TIKA-2469). |
| |
| * Fix bug in CharsetDetector that led to different detected charsets |
| depending on whether user setText with a byte[] or an InputStream |
| via Sean Story (TIKA-2475). |
| |
| * Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466). |
| |
| * Upgrade to POI 3.17 (TIKA-2429). |
| |
| * Enabling extraction of standard references from text (TIKA-2449). |
| |
| * Load external custom mimetypes XML from system property |
| tika.custom-mimetypes (TIKA-2460). |
| |
| * Extract number of tiffs in a multi-page tiff (TIKA-2451). |
| |
| * Fix detection of emails extracted from mbox (TIKA-2456). |
| |
| * Add OverrideDetector and allow PSTParser to specify body content type |
| as text or html -- to avoid incorrect auto-detection of |
| rfc/mbox, etc. (TIKA-2454) |
| |
| * AutoDetectParser throws ZeroByteFileException for zero-byte files after |
| detection on the file extension (TIKA-2450). |
| |
| * Extract phonetic runs in docx with experimental SAX parser (TIKA-2448). |
| |
| * Extract phonetic runs from xls and allow users to turn off extraction |
| of phonetic runs in both xls and xlsx (TIKA-2440). |
| |
| * OOXML locale should be set by POI's LocaleUtil not Locale.getDefault(). |
| Fix unit tests to be robust against different locales in OOXML |
| and ExcelParser (TIKA-2438). |
| |
| * Upgrade to PDFBox 2.0.7 (TIKA-2431). |
| |
| * Tika now has support for automatic image captioning, that |
| combines Computer Vision and Natural Language Processing to |
| automatically generate a readable caption for an image |
| (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189). |
| |
| * Add TestCorruptedFiles to allow devs to test parsers against |
| corrupted input files (TIKA-2430). |
| |
| * Correct Mimetype definition for Windows batch files (CMD and BAT) |
| which are the same (TIKA-2445) |
| |
| * PSDParser memory use improvements (TIKA-2447) |
| |
| * Add underline extraction from Word documents (doc/docx) via Stuart Hendren |
| as well as strikethrough extraction in docx (TIKA-2347, GitHub-173) |
| |
| * Corrected Tesseract OCR rotation.py script and made it a configurable |
| option via Peter Weiss (TIKA-2385) |
| |
| Release 1.16 - 7/7/2017 |
| |
| * Exclude jj2000 from edu.ucar grib to avoid potential |
| license conflicts with ASL 2.0 |
| |
| * Add Age recognition using Ensemble model for Linear regression |
| and Apache OpenNLP Maximum Entropy. Tika can now detect age from |
| text (TIKA-1988). |
| |
| * Add Tika Deep Learning support for the VGG16 model for |
| Very Deep Convolutional Networks for Large-Scale Image Recognition. |
| Now Tika supports both Inception v3/v4 and VGG16 based image |
| recognition (TIKA-2298). |
| |
| * Extract macros from PPT (TIKA-2089). |
| |
| * Extract absolute path for last saved location when available |
| in .xlsx and .xlsb (TIKA-2335). |
| |
| * Rename SentimentParser to SentimentAnalysisParser to |
| prevent conflict with dependency (TIKA-2368). |
| |
| * tika-app now extracts inline images in PDFs by |
| default, and it includes a warning to users that this is not the |
| default behavior elsewhere in Tika (TIKA-2374). |
| |
| * Allow configurability of warnings for problems during |
| parser initialization (TIKA-2389). |
| |
| * Upgrade to Jackcess 2.1.8 (TIKA-2380). |
| |
| * Upgrade to POI 3.17-beta1 (TIKA-2336). |
| |
| * Remove non-ASL-2.0-compatible org.json (TIKA-1804). |
| |
| * Allow extraction of <script> elements in HTML as embedded "MACRO". |
| Users must turn this on via TikaConfig (TIKA-2391). |
| |
| * Allow users to turn off extraction of headers and footers |
| from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362) |
| |
| * Extract text from charts in .docx, .pptx, .xlsx and .xlsb |
| (TIKA-2254). |
| |
| * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb |
| (TIKA-1945). |
| |
| * Fix bug in tika-server that led to an attempt to close the |
| input stream twice (TIKA-2384). |
| |
| * Enable base32 encoding of digests and enable BouncyCastle implementations |
| of digest algorithms (TIKA-2386). |
| |
| * Add snap builds to codebase (TIKA-2401) |
| |
| * Canonical Mimetype of WAVE audio changed to match the RFC 2361 defined |
| version, audio/vnd.wave; the older audio/x-wav remains as an alias |
| |
| * Upgrade "provided" xerial to 3.19.3 (TIKA-2412). |
| |
| * Upgrade Gson to 2.8.1 (TIKA-2414). |
| |
| * Upgrade mime4j to 0.8.1 (TIKA-2413). |
| |
| * Mime magic improvements for GraphViz (TIKA-2422), HTML files which |
| claim to be XML but aren't quite valid XML (TIKA-2419) and QuickTime |
| / MP4 (TIKA-2418) |
| |
| Release 1.15 - 05/23/2017 |
| |
| * Tika now has a module for Deep Learning powered by the |
| DL4J toolkit. The initial included model is for InceptionV3 |
| and so using this module, natively in Java, Tika can use |
| Deep learning for metadata/text extraction from Images using |
| the power of the Inception model (Github-165). |
| |
| * A new parser for sentiment analysis using a categorical |
| (multi-class, angry, sad, neutral, like, love) and binary |
| (positive/negative) was added leveraging the USC data |
| science work (TIKA-2016). |
| |
| * Tika now has the ability to automatically detect objects in videos, |
| using OpenCV and Tensorflow (TIKA-2322). |
| |
| * Change default behavior to parse embedded documents even if the user |
| forgets to specify a Parser.class in the ParseContext (TIKA-2096). |
| Users who wish to parse only the container document should set |
| an EmptyParser as the Parser.class in the ParseContext. |
| |
| * Change default behavior of Office Parsers to _not_ extract |
| Macros. User needs to setExtractMacros to "true" (TIKA-2302). |
| |
| * Added tika-eval module (TIKA-1332). |
| |
| * Unified logging across Tika: SLF4J as logging API, Apache Log4j as |
| implementation with JCL and JUL bridges in standalone tools like |
| tika-app, tika-batch and tika-server (TIKA-2245). |
| |
| * Add parser for XLSB files (TIKA-1195). |
| |
| * Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247). |
| |
| * Add parsers for WordPerfect and QuattroPro (.qpw) files. |
| Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228). |
| |
| * Add experimental SAX parser for .pptx files. To select this parser, |
| set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210). |
| |
| * Add experimental SAX parser for .docx files. To select this parser, |
| set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191). |
| |
| * Add mime detection and parser for Word 2006ML format (TIKA-2179). |
| |
| * Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352). |
| |
| * Added "text-main" equivalent option to tika-server via |
| /tika/main (TIKA-2343). |
| |
| * Enabled configuration of the EncodingDetector used by |
| parsers that extend AbstractEncodingDetectorParser (TIKA-2273). |
| |
| * Prevent easily preventable OOMs for both detection and parsing |
| of some compression formats (TIKA-2330). |
| |
| * Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295). |
| |
| * Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269). |
| |
| * Official mime types for BMP, EMF and WMF have been registered with |
| IANA, so switch to these (image/bmp image/emf image/wmf) (TIKA-2250) |
| |
| * Be more parsimonious with BufferedInputStreams via Josh Hight |
| (TIKA-2244). |
| |
| * Enable handling of hyphenated language codes in TesseractOCRParser |
| via Graham Russell (TIKA-2231). |
| |
| * Improve style tags in ODT (TIKA-2242). |
| |
| * Add container detection for embedded MSEquation files (TIKA-2238). |
| |
| * Add parsing of JBIG2 and extraction of JBIG2 from PDFs when |
| required dependencies are added to class path by user. |
| Contributed by Pascal Essiembre (TIKA-2232). |
| |
| * Mime magic for the OneNote family (.one / .onetoc / .onepkg), no parser |
| (TIKA-2224). |
| |
| * Add configurability of "preserve-interword-spacing" to |
| TesseractOCRParser (TIKA-2190). |
| |
| * Upgrade to PDFBox 2.0.6 and JempBox 1.8.13 (TIKA-2209/TIKA-2236/TIKA-2361). |
| |
| * Refactor MockParser to consolidate service loading |
| and mime types into tika-core/src/test (TIKA-2195). |
| |
| * Enabled extraction of embedded objects from headers, footers, |
| footnotes, endnotes and comments in legacy .docx parser (TIKA-2192). |
| |
| * Allow extraction of PDActions (including Javascript) from |
| PDFs (TIKA-2090). This is turned off by default. Users |
| must setExtractActions(true) on the PDFParserConfig. |
| |
| * Change default behavior in experimental .docx parser to ignore |
| deleted text to align with .doc (TIKA-2187). |
| |
| * Upgrade to POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329). |
| |
| * Allow configuration of timeout for ForkParser (TIKA-2170). |
| |
| * Add extraction of .jpx inline images from PDFs when required |
| dependencies are added by user to class path (TIKA-2175). |
| |
| * Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174). |
| |
| * Upgrade SQLite "provided" dependency to 3.16.1 (TIKA-2334). |
| |
| * Update Apache CXF version to 3.0.12 (TIKA-2292). |
| |
| * Add Lingo24 Language Detector (TIKA-2297). |
| |
| * Further mime magic for WebVTT (TIKA-1772) |
| |
| * Extend support for increased PSM options up to 13 for modern |
| versions of Tesseract (TIKA-2357). |
| |
| * Prevent potential resource leak by closing TrueTypeFont |
| via Cameron Rollheiser (TIKA-2370). |
| |
| Release 1.14 - 10/19/2016 |
| |
| * Extract all headers from MSG/RFC822 (TIKA-2122). |
| |
| * Upgrade metadata-extractor to 2.9.1 (TIKA-2113). |
| |
| * Extract PDF DocInfo metadata into separate keys to prevent |
| overwriting by XMP metadata (TIKA-2057). |
| |
| * Re-enable fileUrl for tika-server (TIKA-2081). If you choose |
| to use this feature, beware of the security vulnerabilities! |
| See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271 |
| |
| * Add Tesseract's hOCR output format as an option, via Eric Pugh |
| (TIKA-2093) |
| |
| * Extract macros from MSOffice files (TIKA-2069). |
| |
| * Maintain passed-in mime in TXTParser (TIKA-2047). |
| |
| * Upgrade to POI 3.15 (TIKA-2013). |
| |
| * Upgrade to PDFBox 2.0.3 (TIKA-2051). |
| |
| * Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 |
| and TIKA-2078) |
| |
| * Tika now is integrated with the Tensorflow library from Google |
| and it can use its Inception v3 image classification model to |
| identify objects in images (TIKA-1993). |
| |
| * Parser configuration is now type-safe and parameters for parsers |
| can have assigned types (TIKA-1508, TIKA-1986). |
| |
| * Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040). |
| |
| * Upgrade ICU4J charset detection components to fix multithreading |
| bug (TIKA-2041). |
| |
| * Upgrade to Jackcess 2.1.4 (TIKA-2039). |
| |
| * Maintain more significant digits in cells of "General" format |
| in XLS and XLSX (TIKA-2025). |
| |
| * Avoid mark/reset issues when extracting or detecting embedded resources |
| in RFC822 emails (TIKA-2037). |
| |
| * Improve accuracy of Tesseract for better extraction of numeric |
| and alphanumeric text from images (TIKA-2021, TIKA-2031). |
| |
| * Improve extraction of embedded documents from PPT, PPTX and XLSX |
| (TIKA-2026). |
| |
| * Add parser for applefile (AppleSingle) (TIKA-2022). |
| |
| * Add mime types, mime magic and/or globs for: |
| * Endnote Import File (TIKA-2011) |
| * DJVU files (TIKA-2009) |
| * MS Owner File (TIKA-2008) |
| * Windows Media Metafile (TIKA-2004) |
| * iCal and vCalendar (TIKA-2006) |
| * MBOX (TIKA-2042) |
| * Stata DTA (TIKA-2064) |
| |
| * Add configurable maximum threshold for number of events extracted |
| from the XMP Media Management Schema in JempboxExtractor (TIKA-1999). |
| |
| * Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994). |
| |
| * Add mime detection via Nick C and parser for DBF files (TIKA-1513). |
| |
| * Add mime detection and parsers for MSOffice 2003 XML Word |
| and Excel formats (TIKA-1958). |
| |
| * Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454). |
| |
| * Upgrade to Commons Compress 1.12 (supports progress on TIKA-1358) |
| |
| Release 1.13 - 05/08/2016 |
| |
| * Upgrade to PDFBox 2.0.1 (TIKA-1285/TIKA-1959). |
| MAJOR CHANGES in PDFParser: |
| * The classic sequential parser is no longer available. |
| * Tiff files are no longer extracted by default. See |
| https://pdfbox.apache.org/2.0/dependencies.html#optional-components |
| for optional components to process Tiff files. |
| * Some truncated/corrupted files that had some content extracted |
| with 1.8.x may have no content extracted in 2.0.x (see TIKA-1912). |
| |
| * The MIT-NLP Information Extraction (MITIE) Named Entity |
| Recognition (NER) system is now supported in Tika |
| (TIKA-1913, GitHub-108). |
| |
| * Tika now supports the use of the Yandex translation |
| service (TIKA-1943, GitHub-106). |
| |
| * Tika now uses NER to extract scientific measurements |
| from text using either GROBID Quantities, which uses |
| conditional random fields, or NLTK, which uses regular |
| expressions (TIKA-1917, GitHub-104). |
| |
| * Fixed JournalParser to handle null responses from |
| GROBID and to log a message (TIKA-1925). |
| |
| * Refactored Language Detector into tika-langdetect module, |
| added default N-Gram implementation, Optimaize Lang |
| Detector and MIT Text.jl implementation |
| (TIKA-1872, TIKA-1696, TIKA-1723). |
| |
| * Extract metadata from MP4 videos whether or not the |
| PooledTimeSeries parser is available via Aditya Dhulipala |
| (TIKA-1844). |
| |
| * Fix NPE when trying to get embedded image identifier in |
| WordParser (TIKA-1956). |
| |
| * Improvements to MIME database for detection of Scientific |
| and other formats present in the TREC-DD-Polar dataset |
| (TIKA-1881, GitHub-85, TIKA-1883, TIKA-1884, TIKA-1886, |
| TIKA-1882). |
| |
| * LinkContentHandler now extracts links from script tags |
| via Joseph Naegele (TIKA-1937). |
| |
| * Handle per page IOExceptions more robustly in PDFParser (TIKA-1948). |
| |
| * Upgrade commons-compress to 1.11 (TIKA-1949). |
| |
| * Add detection for embedded MSChart.Graph files (TIKA-1033). |
| |
| * Fix NPE in Sqlite parser from Nick C (TIKA-1927). |
| |
| * Fix NPE in Open Document parser from Nick C (TIKA-1916). |
| |
| * Upgrade mp4parser's isoparser to 1.1.7 (TIKA-1924 and TIKA-1931). |
| |
| * Upgrade BouncyCastle to 1.54 (TIKA-1923). |
| |
| * Upgrade Jackcess to 2.1.3 (TIKA-1922). |
| |
| * Upgrade Drew Noakes' metadata-extractor to 2.8.1 (TIKA-1921). |
| |
| * Upgrade Gson in tika-serialization to 2.6.2 (TIKA-1920). |
| |
| * Upgrade commons-cli in tika-batch to 1.3.1 (TIKA-1919). |
| |
| * Add XMPMM support to PDFParser and JpegParser via Jempbox (TIKA-1894). |
| |
| * Move serialization of TikaConfig to tika-core and enable dumping |
| of the config file via tika-app (TIKA-1657). |
| |
| * Tika now incorporates the Natural Language Toolkit (NLTK) from the |
| Python community as an option for Named Entity Recognition (TIKA-1876). |
| |
| * Add support for XFA extraction via Pascal Essiembre (TIKA-1857). |
| |
| * Upgrade to sqlite-jdbc 3.8.11.2 (TIKA-1861). NOTE: this dependency |
| is still <scope>provided</scope>. You need to include this dependency |
| in order to parse sqlite files. |
| |
| * Upgrade to POI 3.15-beta1 (TIKA-1895). |
| |
| * Upgrade to Jackson 2.7.1 (TIKA-1869). |
| |
| * Upgrade to Apache SIS 0.6 (TIKA-1878). |
| |
| * RichTextContentHandler moved from the Server package to Core (TIKA-1870). |
| |
| * Added ZeroSizeFileDetector to support application/x-zerovalue via |
| Adesh Gupta (TIKA-1885). |
| |
| * Addition of types information to Grobid quantities parser via |
| Can Menekse (TIKA-1965). |
| |
| Release 1.12 - 01/24/2016 |
| |
| * Support for iFrames and element link extraction is provided in |
| the link Content Handler (TIKA-1835). |
| |
| * Slide notes are now linked to the slide XHTML in the PPT output |
| (TIKA-1840). |
| |
| * JSON tests in Tika server were updated to remove impossible casts |
| (Github-73). |
| |
| * Fix bug in GeoTopicParser where NER is reused instead of instantiated |
| with each request (TIKA-1834). |
| |
| * Upgrade rome to 1.5.1 && Downgrade Rome dependency to 0.9 to avoid |
| nasty NPE (TIKA-1820, TIKA-1516) |
| |
| * The NamedEntityParser was enhanced to generate text content |
| in addition to metadata (TIKA-1815, TIKA-1816). |
| |
| * A significant speed-up is made to the GeoTopicParser by |
| using the new REST server capabilities from Lucene Geo |
| Gazetteer (TIKA-1803). |
| |
| * A parser to compute motion properties in Videos, e.g., |
| Histogram of Oriented Gradients and Histogram of Optical Flows |
| using the Pooled Time Series algorithm, was added (TIKA-1798). |
| |
| * Provide NamedEntityParser which exposes Named Entity Recognition |
| from OpenNLP and Stanford NER providers (TIKA-1787, GitHub-61, |
| GitHub-62). |
| |
| * Allow XHTMLContentHandler to pass attributes of html element |
| via Markus Jelsma (TIKA-1782). |
| |
| * Fix regression with spacing in PPT via Andreas Beeker (TIKA-1777). |
| |
| * Tika Facade parse methods for Path and File added which take a |
| Metadata object, to mirror the existing InputStream one (GitHub-60) |
| |
| * GeoParser fix for loading the NER model from a jar file (TIKA-1791) |
| |
| |
| Release 1.11 - 10/18/2015 |
| |
| * Java7 API support for allowing java.nio.file.Path as method arguments |
| was added to Tika and to ParsingReader, TikaFileTypeDetector, and to |
| Tika Config (TIKA-1745, TIKA-1746, TIKA-1751). |
| |
| * MIME support was added for WebVTT: The Web Video Text Tracks Format |
| files (TIKA-1772). |
| |
| * MIME magic improved to ensure emails detected as message/rfc822 |
| (TIKA-1771). |
| |
| * Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility |
| with Bouncy Castle (TIKA-1736). |
| |
| * Make div and other markup more consistent between PPT and |
| PPTX (TIKA-1755). |
| |
| * Parse multiple authors from MSOffice's semi-colon delimited |
| author field (TIKA-1765). |
| |
| * Include CTAKESConfig.properties within tika-parsers resources |
| by default (TIKA-1741). |
| |
| * Prevent infinite recursion when processing inline images |
| in PDF files by limiting extraction of duplicate images |
| within the same page (TIKA-1742). |
| |
| * Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707). |
| |
| * Upgraded tika-batch to use Path throughout (TIKA-1747 and |
| TIKA-1754). |
| |
| * Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744). |
| |
| * Changed default content handler type for "/rmeta" in tika-server |
| to "xml" to align with "-J" option in tika-app. |
| Clients can now specify handler types via PathParam. (TIKA-1716). |
| |
| * The fantastic GROBID (or Grobid) GeneRation Of BIbliographic Data |
| for machine learning from PDF files is now integrated as a |
| Tika parser (TIKA-1699, TIKA-1712). |
| |
| * The ability to specify the Tesseract Config Path was added |
| to the OCR Parser (TIKA-1703). |
| |
| * Upgraded to ASM 5.0.4 (TIKA-1705). |
| |
| * Corrected Tika Config XML detector definition explicit loading |
| of MimeTypes (TIKA-1708) |
| |
| * In Tika Parsers, Batch, Server, App and Examples, use Apache |
| Commons IO instead of inlined ex-Commons classes, and the Java 7 |
| Standard Charset definitions (TIKA-1710) |
| |
| * Upgraded to Commons Compress 1.10, which enables zlib compressed |
| archives support (TIKA-1718) |
| |
| |
| Release 1.10 - 8/1/2015 |
| |
| * Tika Config XML can now be used to create composite detectors, |
| and exclude detectors that DefaultDetector would otherwise |
| have used. This brings support in-line with Parsers. (TIKA-1702) |
| |
| * Reverted to legacy sort order of parsers that was |
| mistakenly reversed in Tika 1.9 (TIKA-1689). |
| |
| * Upgrade to POI 3.13-beta1 (TIKA-1667). |
| |
| * Upgrade to PDFBox 1.8.10 (TIKA-1588). |
| |
| * MimeTypes now tries to find a registered type with and |
| without parameters (TIKA-1692). |
| |
| * Added more robust error handling for encoding detection |
| of .MSG files (TIKA-1238). |
| |
| * Fixed bug in Tika's use of the Jackcess parser that |
| prevented reading of v97 Access files (TIKA-1681). |
| |
| * Upgrade xerial.org's sqlite-jdbc to 3.8.10.1. NOTE: |
| as of Tika 1.9, this jar is "provided." Make sure |
| to upgrade your provided jar! (TIKA-1687). |
| |
| * Add header/footer extraction to xls (via Aeham Abushwashi) |
| (TIKA-1400). |
| |
| * Drop the source file name from the embedded file path in |
| RecursiveParserWrapper's "X-TIKA:embedded_resource_path" |
| (TIKA-1673). |
| |
| * Upgraded to Java 7 (TIKA-1536). |
| |
| * Non-standards compliant emails are now correctly detected |
| as message/rfc822 (TIKA-1602). |
| |
| * Added parser for MS Access files via Jackcess. Many thanks |
| to Health Market Science, Brian O'Neill and James Ahlborn |
| for relicensing Jackcess to Apache v2! (TIKA-1601) |
| |
| * GDALParser now correctly sets "nitf" as a supported |
| MediaType (TIKA-1664). |
| |
| * Added DigestingParser to calculate digest hashes |
| and record them in metadata. Integrated with |
| tika-app and tika-server (TIKA-1663). |
| |
| * Fixed ZipContainerDetector to detect all IPA files |
| (TIKA-1659). |
| |
| |
| Release 1.9 - 6/6/2015 |
| |
| * The ability to use the cTAKES clinical text |
| knowledge extraction system for biomedical data is |
| now included as a Tika parser (TIKA-1645, TIKA-1642). |
| |
| * Tika-server allows a user to specify the Tika config |
| from the command line (TIKA-1652, TIKA-1426). |
| |
| * Matlab file detection has been improved (TIKA-1634). |
| |
| * The EXIFTool was added as an External parser |
| (TIKA-1639). |
| |
| * If FFMPEG is installed and on the PATH, it is a |
| usable Parser in Tika now (TIKA-1510). |
| |
| * Fixes have been applied to the ExternalParser to make |
| it functional (TIKA-1638). |
| |
| * Tika service loading can now be more verbose with the |
| org.apache.tika.service.error.warn system property (TIKA-1636). |
| |
| * Tika Server now allows for metadata extraction from remote |
| URLs and in addition it outputs the detected language as a |
| metadata field (TIKA-1625). |
| |
| * Fix OUTPUT_FILE_TOKEN not being replaced in ExternalParser, |
| contributed by Pascal Essiembre (TIKA-1620). |
| |
| * Tika REST server now supports language identification |
| (TIKA-1622). |
| |
| * All of the example code from the Tika in Action book has |
| been donated to Tika and added to tika-examples (TIKA-1562). |
| |
| * Tika server now logs errors determining ContentDisposition |
| (TIKA-1621). |
| |
| * An algorithm for using Byte Histogram frequencies to construct |
| a Neural Network and to perform MIME detection was added |
| (TIKA-1582). |
| |
| * A Bayesian algorithm for MIME detection by probabilistic |
| means was added (TIKA-1517). |
| |
| * Tika now incorporates the Apache Spatial Information |
| System capability of parsing Geographic ISO 19139 |
| files (TIKA-443). It can also detect those files as |
| well. |
| |
| * Update the MimeTypes code to support inheritance |
| (TIKA-1535). |
| |
| * Provide ability to parse and identify Global Change |
| Master Directory Interchange Format (GCMD DIF) |
| scientific data files (TIKA-1532). |
| |
| * Improvements to detect CBOR files by extension (TIKA-1610). |
| |
| * Change xerial.org's sqlite-jdbc jar to "provided" (TIKA-1511). |
| Users will now need to add sqlite-jdbc to their classpath for |
| the Sqlite3Parser to work. |
| |
| * ExternalParser.check now catches (suppresses) SecurityException |
| and returns false, so it's OK to run Tika with a security policy |
| that does not allow execution of external processes (TIKA-1628). |
| |
| Release 1.8 - 4/13/2015 |
| |
| * Fix null pointer when processing ODT footer styles (TIKA-1600). |
| |
| * Upgrade com.drewnoakes' metadata-extractor to 2.0 and |
| add parser for webp metadata (TIKA-1594). |
| |
| * Duration extracted from MP3s with no ID3 tags (TIKA-1589). |
| |
| * Upgraded to PDFBox 1.8.9 (TIKA-1575). |
| |
| * Tika now supports the IsaTab data standard for bioinformatics |
| both in terms of MIME identification and in terms of parsing |
| (TIKA-1580). |
| |
| * Tika server can now enable CORS requests with the command line |
| "--cors" or "-C" option (TIKA-1586). |
| |
| * Update jhighlight dependency to avoid using LGPL license. Thanks |
| to @kkrugler for his great contribution (TIKA-1581). |
| |
| * Updated HDF and NetCDF parsers to output file version in |
| metadata (TIKA-1578 and TIKA-1579). |
| |
| * Upgraded to POI 3.12-beta1 (TIKA-1531). |
| |
| * Added tika-batch module for directory to directory batch |
| processing. This is a new, experimental capability, and the API will |
| likely change in future releases (TIKA-1330). |
| |
| * Translator.translate() Exceptions are now restricted to |
| TikaException and IOException (TIKA-1416). |
| |
| * Tika now supports MIME detection for Microsoft Enhanced |
| Metafiles (EMF) (TIKA-1554). |
| |
| * Tika has improved delineation in XML and HTML MIME detection |
| (TIKA-1365). |
| |
| * Upgraded the Drew Noakes metadata-extractor to version 2.7.2 |
| (TIKA-1576). |
| |
| * Added basic style support for ODF documents, contributed by |
| Axel Dörfler (TIKA-1063). |
| |
| * Move Tika server resources and writers to separate |
| org.apache.tika.server.resource and writer packages (TIKA-1564). |
| |
| * Upgrade UCAR dependencies to 4.5.5 (TIKA-1571). |
| |
| * Fix Paths in Tika server welcome page (TIKA-1567). |
| |
| * Fixed infinite recursion while parsing some PDFs (TIKA-1038). |
| |
| * XHTMLContentHandler now properly passes along body attributes, |
| contributed by Markus Jelsma (TIKA-995). |
| |
| * TikaCLI option --compare-file-magic to report mime types known to |
| the file(1) tool but not known / fully known to Tika. |
| |
| * MediaTypeRegistry support for returning known child types. |
| |
| * Support for excluding certain Parsers from being |
| used by DefaultParser via the Tika Config file, using the new |
| parser-exclude tag (TIKA-1558). |
| |
| * Detect Global Change Master Directory (GCMD) Directory |
| Interchange Format (DIF) files (TIKA-1561). |
| |
| * Tika's JAX-RS server can now return stacktraces for |
| parse exceptions (TIKA-1323). |
| |
| * Added MockParser for testing handling of exceptions, errors |
| and hangs in code that uses parsers (TIKA-1553). |
| |
| * The ForkParser service was removed from Activator. Rollback of (TIKA-1354). |
| |
| * Increased the speed of language identification by |
| a factor of two -- contributed by Toke Eskildsen (TIKA-1549). |
| |
| * Added parser for Sqlite3 db files. Some users will need to |
| exclude the dependency on xerial.org's sqlite-jdbc because |
| it contains native libs (TIKA-1511). |
| |
| * Use POST instead of PUT for tika-server form methods |
| (TIKA-1547). |
| |
| * A basic wrapper around the UNIX file command was |
| added to extract Strings. In addition, a parser to |
| handle Strings parsing from octet-streams using Latin1 |
| charsets was added (TIKA-1541, TIKA-1483). |
| |
| * Add test files and detection mechanism for Gridded |
| Binary (GRIB) files (TIKA-1539). |
| |
| * The RAR parser was updated to handle Chinese characters |
| using the functionality provided by allowing encoding to |
| be used within ZipArchiveInputStream (TIKA-936). |
| |
| * Fix out of memory error in surefire plugin (TIKA-1537). |
| |
| * Build a parser to extract data from GRIB formats (TIKA-1423). |
| |
| * Upgrade to Commons Compress 1.9 (TIKA-1534). |
| |
| * Include media duration in metadata parsed by MP4Parser (TIKA-1530). |
| |
| * Support password protected 7zip files (using a PasswordProvider, |
| in keeping with the other password supporting formats) (TIKA-1521). |
| |
| * Password protected Zip files should not trigger an exception (TIKA-1028). |
| |
| Release 1.7 - 1/9/2015 |
| |
| * Fixed resource leak in OutlookPSTParser that caused TikaException |
| when invoked via AutoDetectParser on Windows (TIKA-1506). |
| |
| * HTML tags are properly stripped from content by FeedParser |
| (TIKA-1500). |
| |
| * Tika Server support for selecting a single metadata key; |
| wrapped MetadataEP into MetadataResource (TIKA-1499). |
| |
| * Tika Server support for JSON and XMP views of metadata (TIKA-1497). |
| |
| * Tika Parent uses dependency management to keep duplicate |
| dependencies in different modules the same version (TIKA-1384). |
| |
| * Upgraded slf4j to version 1.7.7 (TIKA-1496). |
| |
| * Tika Server support for RecursiveParserWrapper's JSON output |
| (endpoint=rmeta) equivalent to (TIKA-1451's) -J option |
| in tika-app (TIKA-1498). |
| |
| * Tika Server support for providing the password for files on a |
| per-request basis through the Password http header (TIKA-1494). |
| |
| * Simple support for the BPG (Better Portable Graphics) image format |
| (TIKA-1491, TIKA-1495). |
| |
| * Prevent exceptions from being thrown for some malformed |
| mp3 files (TIKA-1218). |
| |
| * Reformat pom.xml files to use two spaces per indent (TIKA-1475). |
| |
| * Fix warning of slf4j logger on Tika Server startup (TIKA-1472). |
| |
| * Tika CLI and GUI now have option to view JSON rendering of output |
| of RecursiveParserWrapper (TIKA-1451). |
| |
| * Tika now integrates the Geospatial Data Abstraction Library |
| (GDAL) for parsing hundreds of geospatial formats (TIKA-605, |
| TIKA-1503). |
| |
| * ExternalParsers can now use Regexs to specify dynamic keys |
| (TIKA-1441). |
| |
| * Thread safety issues in ImageMetadataExtractor were resolved |
| (TIKA-1369). |
| |
| * The ForkParser service is now registered in Activator |
| (TIKA-1354). |
| |
| * The Rome Library was upgraded to version 1.5 (TIKA-1435). |
| |
| * Add markup for files embedded in PDFs (TIKA-1427). |
| |
| * Extract files embedded in annotations in PDFs (TIKA-1433). |
| |
| * Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442). |
| |
| * Add RecursiveParserWrapper (aka Jukka's and Nick's |
| RecursiveMetadataParser) (TIKA-1329) |
| |
| * Add example for how to dump TikaConfig to XML (TIKA-1418). |
| |
| * Allow users to specify a tika config file for tika-app (TIKA-1426). |
| |
| * PackageParser includes the last-modified date from the archive |
| in the metadata, when handling embedded entries (TIKA-1246) |
| |
| * Created a new Tesseract OCR Parser to extract text from images. |
| Requires installation of Tesseract before use (TIKA-93). |
| |
| * Basic parser for older Excel formats, such as Excel 4, 5 and 95, |
| which can get simple text, and metadata for Excel 5+95 (TIKA-1490) |
| |
| |
| Release 1.6 - 08/31/2014 |
| |
| * Parse output should indicate which Parser was actually used |
| (TIKA-674). |
| |
| * Use the forbidden-apis Maven plugin to check for unsafe Java |
| operations (TIKA-1387). |
| |
| * Created an ExternalTranslator class to interface with command |
| line Translators (TIKA-1385). |
| |
| * Created a MosesTranslator as a subclass of ExternalTranslator |
| that calls the Moses Decoder machine translation program (TIKA-1385). |
| |
| * Created the tika-example module. It will have examples of how to |
| use the main Tika interfaces (TIKA-1390). |
| |
| * Upgraded to Commons Compress 1.8.1 (TIKA-1275). |
| |
| * Upgraded to POI 3.11-beta1 (TIKA-1380). |
| |
| * Tika now extracts SDTCell content from tables in .docx files (TIKA-1317). |
| |
| * Tika now supports detection of the Persian/Farsi language. |
| (TIKA-1337) |
| |
| * The Tika Detector interface is now exposed through the JAX-RS |
| server (TIKA-1336). |
| |
| * Tika now has support for parsing binary Matlab files as part of |
| our larger effort to increase the number of scientific data formats |
| supported. (TIKA-1327) |
| |
| * The Tika Server URLs for the unpacker resources have been changed, |
| to bring them under a common prefix (TIKA-1324). The mapping is |
| /unpacker/{id} -> /unpack/{id} |
| /all/{id} -> /unpack/all/{id} |
| |
| * Added module and core Tika interface for translating text between |
| languages and added a default implementation that calls Microsoft's |
| translate service (TIKA-1319) |
| |
| * Added a Translator implementation that calls Lingo24's Premium |
| Machine Translation API (TIKA-1381) |
| |
| * Made RTFParser's list handling slightly more robust against corrupt |
| list metadata (TIKA-1305) |
| |
| * Fixed bug in CLI json output (TIKA-1291/TIKA-1310) |
| |
| * Added ability to turn off image extraction from PDFs (TIKA-1294). |
| Users must now turn on this capability via the PDFParserConfig. |
| |
| * Upgrade to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352) |
| |
| * Zip Container Detection for DWFX and XPS formats, which are OPC |
| based (TIKA-1204, TIKA-1221) |
| |
| * Added a user facing welcome page to the Tika Server, which |
| says what it is, and a very brief summary of what is available. |
| (TIKA-1269) |
| |
| * Added Tika Server endpoints to list the available mime types, |
| Parsers and Detectors, similar to the --list-<foo> methods on |
| the Tika CLI App (TIKA-1270) |
| |
| * Improvements to NetCDF and HDF parsing to mimic the output of |
| ncdump and extract text dimensions and spatial and variable |
| information from scientific data files (TIKA-1265) |
| |
| * Extract attachments from RTF files (TIKA-1010) |
| |
| * Support Outlook Personal Folders File Format *.pst (TIKA-623) |
| |
| * Added mime entries for additional Ogg based formats (TIKA-1259) |
| |
| * Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider |
| range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113) |
| |
| * PDF: Images in PDF documents can now be extracted as embedded resources. |
| (TIKA-1268) |
| |
| * Fixed RuntimeException thrown for certain Word Documents (TIKA-1251). |
| |
| * CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs |
| the list of supported parsers in APT format. This is used to generate the list |
| on the formats page (TIKA-411). |
| |
| Release 1.5 - 02/04/2014 |
| |
| * Fixed bug in handling of embedded file processing in PDFs (TIKA-1228). |
| |
| * Added SourceCodeParser to support Java, Groovy, C++ files (TIKA-1224). |
| |
| * Updated Tika Server to support multipart/form-data payloads (TIKA-1198). |
| |
| * Updated Tika Server to CXF 2.7.8 (TIKA-1197). |
| |
| * Updated Tika Server to accept requests over wildcard addresses (TIKA-1196). |
| |
| * Added option to use alternate NonSequentialPDFParser (TIKA-1201). |
| |
| * Content from PDF AcroForms is now extracted (TIKA-973). |
| |
| * Fixed invalid asterisks from master slide in PPT (TIKA-1171). |
| |
| * Added test cases to confirm handling of auto-date in PPT and PPTX (TIKA-817). |
| |
| * Text from tables in PPT files is once again extracted correctly (TIKA-1076). |
| |
| * Text is extracted from text boxes in XLSX (TIKA-1100). |
| |
| * Tika no longer hangs when processing Excel files with custom fraction format (TIKA-1132). |
| |
| * Disconcerting stacktrace from missing beans no longer printed for some DOCX files (TIKA-792). |
| |
| * Upgraded POI to 3.10-beta2 (TIKA-1173). |
| |
| * Upgraded PDFBox to 1.8.4 (TIKA-1230). |
| |
| * Made HtmlEncodingDetector more flexible in finding meta |
| header charset (TIKA-1001). |
| |
| * Added sanitized test HTML file for local file test (TIKA-1139). |
| |
| * Fixed bug that prevented attachments within a PDF from being processed |
| if the PDF itself was an attachment (TIKA-1124). |
| |
| * Text from paragraph-level structured document tags in DOCX files is now extracted (TIKA-1130). |
| |
| * RTF: Fixed ArrayIndexOutOfBoundsException when parsing list override (TIKA-1192). |
| |
| * CLI: TikaCLI now escapes invalid filename characters as hex |
| characters (TIKA-1078). |
| |
| Release 1.4 - 06/15/2013 |
| |
| * Removed a test HTML file with a poorly chosen GPL text in it (TIKA-1129). |
| |
| * Improvements to tika-server to allow it to produce text/html and |
| text/xml content (TIKA-1126, TIKA-1127). |
| |
| * Improvements were made to the Compressor Parser to handle g'zipped files |
| that require the decompressConcatenated option set to true (TIKA-1096). |
| |
| * Fixed a typographic error that was preventing detection of |
| awk files (TIKA-1081). |
| |
| * Added a new end-point to Tika's JAX-RS REST server that only detects |
| the media-type based on a small portion of the document submitted |
| (TIKA-1047). |
| |
| * RTF: Ordered and unordered lists are now extracted (TIKA-1062). |
| |
| * MP3: Audio duration is now extracted (TIKA-991) |
| |
| * Java .class files: upgraded from ASM 3.1 to ASM 4.1 for parsing |
| the Java bytecodes (TIKA-1053). |
| |
| * Mime Types: Definitions extended to optionally include Link (URL) and |
| UTI, along with details for several common formats (TIKA-1012 / TIKA-1083) |
| |
| * Exceptions when parsing OLE10 embedded documents, when parsing |
| summary information from Office documents, and when saving |
| embedded documents in TikaCLI are now logged instead |
| of aborting extraction (TIKA-1074) |
| |
| * MS Word: line tabular character is now replaced with newline |
| (TIKA-1128) |
| |
| * XML: ElementMetadataHandlers can now optionally accept duplicate |
| and empty values (TIKA-1133) |
| |
| Release 1.3 - 01/19/2013 |
| |
| * Mimetype definitions added for more common programming languages, |
| including common extensions, but not magic patterns. (TIKA-1055) |
| |
| * MS Word: When a Word (.doc) document contains embedded files or |
| links to external documents, Tika now places a <div |
| class="embedded" id="_XXX"/> placeholder into the XHTML so you can |
| see where in the main text the embedded document occurred |
| (TIKA-956, TIKA-1019). Embedded Wordpad/RTF documents are now |
| recognized (TIKA-982). |
| |
| * PDF: Text from pop-up annotations is now extracted (TIKA-981). |
| Text from bookmarks is now extracted (TIKA-1035). |
| |
| * PKCS7: Detached signatures no longer throw NullPointerException |
| (TIKA-986). |
| |
| * iWork: The chart name for charts embedded in numbers documents is |
| now extracted (TIKA-918). |
| |
| * CLI: TikaCLI -m now handles multi-valued metadata keys correctly |
| (previously it only printed the first value). (TIKA-920) |
| |
| * MS Word (.docx): When a Word (.docx) document contains embedded |
| files, Tika now places a <div class="embedded" id="XXX"/> into the |
| XHTML so you can see where in the main text the embedded document |
| occurred. The id (rId) is included in the Metadata of each |
| embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID |
| key, and TikaCLI prepends the rId (if present) onto the filename |
| it extracts (TIKA-989). Fixed NullPointerException when style is |
| null (TIKA-1006). Text inside text boxes is now extracted |
| (TIKA-1005). |
| |
| * RTF: Page, word, character count and creation date metadata are |
| now extracted for RTF documents (TIKA-999). |
| |
| * MS PowerPoint (.pptx): When a PowerPoint (.pptx) document contains |
| embedded files, Tika now places a <div class="embedded" id="XXX"/> into the |
| XHTML so you can see where in the main text the embedded document |
| occurred. The id (rId) is included in the Metadata of each |
| embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID |
| key, and TikaCLI prepends the rId (if present) onto the filename |
| it extracts (TIKA-997, TIKA-1032). |
| |
| * MS PowerPoint (.ppt): When a PowerPoint (.ppt) document contains |
| embedded files, Tika now places a <div class="embedded" id="XXX"/> into the |
| XHTML so you can see where in the main text the embedded document |
| occurred (TIKA-1025). Text from the master slide is now extracted |
| (TIKA-712). |
| |
| * MHTML: fixed Null charset name exception when a mime part has an |
| unrecognized charset (TIKA-1011). |
| |
| * MP3: if an ID3 tag was encoded in UTF-16 with only the BOM then on |
| certain JVMs this would incorrectly extract the BOM as the tag's |
| value (TIKA-1024). |
| |
| * ZIP: placeholders (<div class="embedded" id="<entry name>"/>) are |
| now left in the XHTML so you can see where each archive member |
| appears (TIKA-1036). TikaCLI would hit FileNotFoundException when |
| extracting files that were under sub-directories from a ZIP |
| archive, because it failed to create the parent directories first |
| (TIKA-1031). |
| |
| * XML: a space character is now added before each element |
| (TIKA-1048) |
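
The embedded-document placeholders described in the 1.3 notes above can be located in the XHTML output with a standard XML parser. The following is a minimal sketch using only the JDK; the class EmbeddedPlaceholders and its embeddedIds method are hypothetical helpers, not part of Tika, and the sample XHTML is hand-written rather than real parser output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper (not a Tika class): lists the ids of
// <div class="embedded" id="..."/> placeholders in document order.
final class EmbeddedPlaceholders {

    static List<String> embeddedIds(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted input cannot load external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            NodeList divs = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getElementsByTagName("div");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("embedded".equals(div.getAttribute("class"))) {
                    ids.add(div.getAttribute("id"));
                }
            }
            return ids;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample input, assumed to resemble the placeholder markup.
        String xhtml = "<html><body><p>Intro</p>"
                + "<div class=\"embedded\" id=\"rId7\"/>"
                + "<p>More text</p>"
                + "<div class=\"embedded\" id=\"rId9\"/>"
                + "</body></html>";
        System.out.println(embeddedIds(xhtml));
    }
}
```

For .docx and .pptx input, the returned ids could then be matched against the Metadata.EMBEDDED_RELATIONSHIP_ID value of each embedded document.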
| |
| Release 1.2 - 07/10/2012 |
| --------------------------------- |
| |
| * Tika's JAX-RS based Network server now is based on Apache CXF, |
| which is available in Maven Central and now allows the server |
| module to be packaged and included in our release |
| (TIKA-593, TIKA-901). |
| |
| * Tika: parseToString now lets you specify the max string length |
| per-call, in addition to per-Tika-instance. (TIKA-870) |
| |
| * Tika now has the ability to detect FITS (Flexible Image Transport System) |
| files (TIKA-874). |
| |
| * Images: Fixed file handle leak in ImageParser. (TIKA-875) |
| |
| * iWork: Comments in Pages files are now extracted (TIKA-907). |
| Headers, footers and footnotes in Pages files are now extracted |
| (TIKA-906). Don't throw NullPointerException on password |
| protected iWork files, even though we can't parse their contents |
| yet (TIKA-903). Text extracted from Keynote text boxes and bullet |
| points no longer runs together (TIKA-910). Also extract text for |
| Pages documents created in layout mode (TIKA-904). Table names |
| are now extracted in Numbers documents (TIKA-924). Content added |
| to master slides is also extracted (TIKA-923). |
| |
| * Archive and compression formats: The Commons Compress dependency was |
| upgraded from 1.3 to 1.4.1. With this change Tika can now also parse |
| Unix dump archives and documents compressed using the XZ and Pack200 |
| compression formats. (TIKA-932) |
| |
| * KML: Tika now has basic support for Keyhole Markup Language documents |
| (KML and KMZ) used by tools like Google Earth. See also |
| http://www.opengeospatial.org/standards/kml/. (TIKA-941) |
| |
| * CLI: You can now use the TIKA_PASSWORD environment variable or the |
| --password=X command line option to specify the password that Tika CLI |
| should use for opening encrypted documents (TIKA-943). |
| |
| * Character encodings: Tika's character encoding detection mechanism was |
| improved by adding integration to the juniversalchardet library that |
| implements Mozilla's universal charset detection algorithm. The slower |
| ICU4J algorithms are still used as a fallback thanks to their wider |
| coverage of custom character encodings. (TIKA-322, TIKA-471) |
| |
| * Charset parameter: Related to the character encoding improvements |
| mentioned above, Tika now returns the detected character encoding as |
| a "charset" parameter of the content type metadata field for text/plain |
| and text/html documents. For example, instead of just "text/plain", the |
| returned content type will be something like "text/plain; charset=UTF-8" |
| for a UTF-8 encoded text document. Character encoding information is still |
| present also in the content encoding metadata field for backwards |
| compatibility, but that field should be considered deprecated. (TIKA-431) |
| |
| * Extraction of embedded resources from OLE2 Office Documents, where |
| the resource isn't another office document, has been fixed (TIKA-948) |
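
Client code that previously read the content encoding metadata field may want to read the "charset" parameter from the content type value instead, as described in the 1.2 notes above. The following is a minimal sketch in plain Java; ContentTypeCharset and charsetOf are hypothetical names, not Tika APIs, and the parsing is deliberately simple (it does not handle every quoting rule for media type parameters).

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper (not a Tika class): extracts the charset parameter
// from a content type string such as "text/plain; charset=UTF-8".
final class ContentTypeCharset {

    static Optional<String> charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the bare media type; the remaining parts are name=value parameters.
        for (int i = 1; i < parts.length; i++) {
            String[] nameValue = parts[i].split("=", 2);
            if (nameValue.length == 2
                    && nameValue[0].trim().toLowerCase(Locale.ROOT).equals("charset")) {
                String value = nameValue[1].trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=UTF-8"));
        System.out.println(charsetOf("text/plain"));
    }
}
```

Returning an Optional keeps the "no charset reported" case explicit, which matters for content types other than text/plain and text/html.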
| |
| Release 1.1 - 3/7/2012 |
| --------------------------------- |
| |
| * Link Extraction: The rel attribute is now extracted from |
| links per the LinkContentHandler. (TIKA-824) |
| |
| * MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously |
| the last character in a UTF-16 tag could be corrupted) (TIKA-793) |
| |
| * Performance: Loading of the default media type registry is now |
| significantly faster. (TIKA-780) |
| |
| * PDF: Allow controlling whether overlapping duplicated text should |
| be removed. Disabling this (the default) can give big |
| speedups to text extraction and may work around cases where |
| non-duplicated characters were incorrectly removed (TIKA-767). |
| Allow controlling whether text tokens should be sorted by their x/y |
| position before extracting text (TIKA-612); this is necessary for |
| certain PDFs. Fixed cases where too many </p> tags appear in the |
| XHTML output, causing NPE when opening some PDFs with the GUI |
| (TIKA-778). |
| |
| * RTF: Fixed case where a font change would result in processing |
| bytes in the wrong font's charset, producing bogus text output |
| (TIKA-777). Don't output whitespace in ignored group states, |
| avoiding excessive whitespace output (TIKA-781). Binary embedded |
| content (using \bin control word) is now skipped correctly; |
| previously it could cause the parser to incorrectly extract binary |
| content as text (TIKA-782). |
| |
| * CLI: New TikaCLI option "--list-detectors", which displays the |
| mimetype detectors that are available, similar to the existing |
| "--list-parsers" option for parsers. (TIKA-785). |
| |
| * Detectors: The order of detectors, as supplied via the service |
| registry loader, is now controlled. User supplied detectors are |
| preferred, then Tika detectors (such as the container aware ones), |
| and finally the core Tika MimeTypes is used as a backup. This |
| allows for specific, detailed detectors to take preference over |
| the default mime magic + filename detector. (TIKA-786) |
| |
| * Microsoft Project (MPP): Filetype detection has been fixed, |
| and basic metadata (but no text) is now extracted. (TIKA-789) |
| |
| * Outlook: fixed NullPointerException in TikaGUI when messages with |
| embedded RTF or HTML content were filtered (TIKA-801). |
| |
| * Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio |
| files, which extract audio metadata and tags (TIKA-747) |
| |
| * MP4: Improved mime magic detection for MP4 based formats (including |
| QuickTime, MP4 Video and Audio, and 3GPP) (TIKA-851) |
| |
| * MP4: Basic metadata extracting parser for MP4 files added, which includes |
| limited audio and video metadata, along with the iTunes media metadata |
| (such as Artist and Title) (TIKA-852) |
| |
| * Document Passwords: A new ParseContext object, PasswordProvider, |
| has been added. This provides a way to supply the password for |
| a document during processing. Currently, only password protected |
| PDFs and Microsoft OOXML Files are supported. (TIKA-850) |
| |
| Release 1.0 - 11/4/2011 |
| --------------------------------- |
| |
| The most notable changes in Tika 1.0 over previous releases are: |
| |
| * API: All methods, classes and interfaces that were marked as |
| deprecated in Tika 0.10 have been removed to clean up the API |
| (TIKA-703). You may need to adjust and recompile client code |
| accordingly. The declared OSGi package versions are now 1.0, and |
| will thus not resolve for client bundles that still refer to 0.x |
| versions (TIKA-565). |
| |
| * Configuration: The context class loader of the current thread is |
| no longer used as the default for loading configured parser and |
| detector classes. You can still pass an explicit class loader |
| to the configuration mechanism to get the previous behaviour. |
| (TIKA-565) |
| |
| * OSGi: The tika-core bundle will now automatically pick up and use |
| any available Parser and Detector services when deployed to an OSGi |
| environment. The tika-parsers bundle provides such services |
| for all the supported file formats for which the upstream parser library |
| is available. If you don't want to track all the parser libraries as |
| separate OSGi bundles, you can use the tika-bundle bundle that packages |
| tika-parsers together with all its upstream dependencies. (TIKA-565) |
| |
| * RTF: Hyperlinks in RTF documents are now extracted as an <a |
| href=...>...</a> element (TIKA-632). The RTF parser is also now |
| more robust when encountering too many closing }'s vs. opening {'s |
| (TIKA-733). |
| |
| * MS Word: From Word (.doc) documents we now extract optional hyphen |
| as Unicode zero-width space (U+200B), and non-breaking hyphen as |
| Unicode non-breaking hyphen (U+2011). (TIKA-711) |
| |
| * Outlook: Tika can now also process attachments in Outlook messages. |
| (TIKA-396) |
| |
| * MS Office: Performance of extracting embedded office docs was improved. |
| (TIKA-753) |
| |
| * PDF: The PDF parser now extracts paragraphs within each page |
| (TIKA-742) and can now optionally extract text from PDF |
| annotations (TIKA-738). There's also an option to enable (the |
| default) or disable auto-space insertion (TIKA-724). |
| |
| * Language detection: Tika can now detect Belarusian, Catalan, |
| Esperanto, Galician, Lithuanian (TIKA-582), Romanian, Slovak, |
| Slovenian, and Ukrainian (TIKA-681). |
| |
| * Java: Tika no longer ships retrotranslated Java 1.4 binaries along |
| with the normal ones that work with Java 5 and higher. (TIKA-744) |
| |
| * OpenOffice documents: header/footer text is now extracted for text, |
| presentation and spreadsheet documents (TIKA-736) |
| |
| Tika 1.0 relies on the following set of major dependencies (generated using |
| mvn dependency:tree from tika-parsers): |
| |
| org.apache.tika:tika-parsers:bundle:1.0 |
| +- org.apache.tika:tika-core:jar:1.0:compile |
| +- edu.ucar:netcdf:jar:4.2-min:compile |
| | \- org.slf4j:slf4j-api:jar:1.5.6:compile |
| +- org.apache.james:apache-mime4j-core:jar:0.7:compile |
| +- org.apache.james:apache-mime4j-dom:jar:0.7:compile |
| +- org.apache.commons:commons-compress:jar:1.3:compile |
| +- commons-codec:commons-codec:jar:1.5:compile |
| +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile |
| | +- org.apache.pdfbox:fontbox:jar:1.6.0:compile |
| | +- org.apache.pdfbox:jempbox:jar:1.6.0:compile |
| | \- commons-logging:commons-logging:jar:1.1.1:compile |
| +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile |
| +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile |
| +- org.apache.poi:poi:jar:3.8-beta4:compile |
| +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile |
| +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile |
| | +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile |
| | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile |
| | \- dom4j:dom4j:jar:1.6.1:compile |
| +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile |
| +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile |
| +- asm:asm:jar:3.1:compile |
| +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile |
| +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile |
| +- rome:rome:jar:0.9:compile |
| \- jdom:jdom:jar:1.0:compile |
| |
| The following people have contributed to Tika 1.0 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Andrzej Bialecki |
| Antoni Mylka |
| Benson Margulies |
| Chris A. Mattmann |
| Cristian Vat |
| Dave Meikle |
| David Smiley |
| Dennis Adler |
| Erik Hetzner |
| Ingo Renner |
| Jeremias Maerki |
| Jeremy Anderson |
| Jeroen van Vianen |
| John Bartak |
| Jukka Zitting |
| Julien Nioche |
| Ken Krugler |
| Mark Butler |
| Maxim Valyanskiy |
| Michael Bryant |
| Michael McCandless |
| Nick Burch |
| Pablo Queixalos |
| Uwe Schindler |
| Žygimantas Medelis |
| |
| |
| See http://s.apache.org/Zk6 for more details on these contributions. |
| |
| |
| Release 0.10 - 09/25/2011 |
| ------------------------- |
| |
| The most notable changes in Tika 0.10 over previous releases are: |
| |
| * A parser for CHM help files was added. (TIKA-245) |
| |
| * TIKA-698: Invalid characters are now replaced with the Unicode |
| replacement character (U+FFFD), whereas before such characters were |
| replaced with spaces, so you may need to change your processing of |
| Tika's output to now handle U+FFFD. |
| |
| * The RTF parser was rewritten to perform its own direct shallow |
| parse of the RTF content, instead of using RTFEditorKit from |
| javax.swing. This fixes several issues in the old parser, |
| including doubling of Unicode characters in certain cases |
| (TIKA-683), exceptions on mal-formed RTF docs (TIKA-666), and |
| missing text from some elements (header/footer, hyperlinks, |
| footnotes, text inside pictures). |
| |
| * Handling of temporary files within Tika was much improved |
| (TIKA-701, TIKA-654, TIKA-645, TIKA-153) |
| |
| * The Tika GUI got a facelift and some extra features (TIKA-635) |
| |
| * The apache-mime4j dependency of the email message parser was upgraded |
| from version 0.6 to 0.7 (TIKA-716). The parser also now accepts a |
| MimeConfig object in the ParseContext as configuration (TIKA-640). |
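
Code that post-processes Tika's text output may need adjusting for the U+FFFD change (TIKA-698) noted above. The following is a minimal sketch in plain Java; ReplacementChars and its methods are hypothetical helpers, not part of Tika. It shows two possible reactions: counting replacement characters as a rough quality signal, or mapping them back to spaces to mimic the earlier behaviour.

```java
// Hypothetical helper (not a Tika class) for dealing with U+FFFD in extracted text.
final class ReplacementChars {

    // The Unicode replacement character that now marks invalid input characters.
    static final char REPLACEMENT = '\uFFFD';

    // Counts how many replacement characters the extracted text contains.
    static long count(String text) {
        return text.chars().filter(c -> c == REPLACEMENT).count();
    }

    // Maps each replacement character to a space, mimicking the pre-0.10 output.
    static String toSpaces(String text) {
        return text.replace(REPLACEMENT, ' ');
    }

    public static void main(String[] args) {
        String extracted = "caf\uFFFD au lait";
        System.out.println(count(extracted));
        System.out.println(toSpaces(extracted));
    }
}
```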
| |
| Tika 0.10 relies on the following set of major dependencies (generated using |
| mvn dependency:tree from tika-parsers): |
| |
| org.apache.tika:tika-parsers:bundle:0.10 |
| +- org.apache.tika:tika-core:jar:0.10:compile |
| +- edu.ucar:netcdf:jar:4.2-min:compile |
| | \- org.slf4j:slf4j-api:jar:1.5.6:compile |
| +- org.apache.james:apache-mime4j-core:jar:0.7:compile |
| +- org.apache.james:apache-mime4j-dom:jar:0.7:compile |
| +- org.apache.commons:commons-compress:jar:1.1:compile |
| +- commons-codec:commons-codec:jar:1.4:compile |
| +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile |
| | +- org.apache.pdfbox:fontbox:jar:1.6.0:compile |
| | +- org.apache.pdfbox:jempbox:jar:1.6.0:compile |
| | \- commons-logging:commons-logging:jar:1.1.1:compile |
| +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile |
| +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile |
| +- org.apache.poi:poi:jar:3.8-beta4:compile |
| +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile |
| +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile |
| | +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile |
| | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile |
| | \- dom4j:dom4j:jar:1.6.1:compile |
| +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile |
| +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile |
| +- asm:asm:jar:3.1:compile |
| +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile |
| +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile |
| +- rome:rome:jar:0.9:compile |
| \- jdom:jdom:jar:1.0:compile |
| |
| The following people have contributed to Tika 0.10 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Alain Viret |
| Alex Ott |
| Alexander Chow |
| Andreas Kemkes |
| Andrew Khoury |
| Babak Farhang |
| Benjamin Douglas |
| Benson Margulies |
| Chris A. Mattmann |
| chris hudson |
| Chris Lott |
| Cristian Vat |
| Curt Arnold |
| Cynthia L Wong |
| Dave Brosius |
| David Benson |
| Enrico Donelli |
| Erik Hetzner |
| Erna de Groot |
| Gabriele Columbro |
| Gavin |
| Geoff Jarrad |
| Gregory Kanevsky |
| gunter rombauts |
| Henning Gross |
| Henri Bergius |
| Ingo Renner |
| Ingo Wiarda |
| Izaak Alpert |
| Jan Høydahl |
| Jens Wilmer |
| Jeremy Anderson |
| Joseph Vychtrle |
| Joshua Turner |
| Jukka Zitting |
| Julien Nioche |
| Karl Heinz Marbaise |
| Ken Krugler |
| Kostya Gribov |
| Luciano Leggieri |
| Mads Hansen |
| Mark Butler |
| Matt Sheppard |
| Maxim Valyanskiy |
| Michael McCandless |
| Michael Pisula |
| Murad Shahid |
| Nick Burch |
| Oleg Tikhonov |
| Pablo Queixalos |
| Paul Jakubik |
| Raimund Merkert |
| Rajiv Kumar |
| Robert Trickey |
| Sami Siren |
| samraj |
| Selva Ganesan |
| Sjoerd Smeets |
| Stephen Duncan Jr |
| Tran Nam Quang |
| Uwe Schindler |
| Vitaliy Filippov |
| |
| See http://s.apache.org/vR for more details on these contributions. |
| |
| |
| Release 0.9 - 02/13/2011 |
| ------------------------ |
| |
| The most notable changes in Tika 0.9 over previous releases are: |
| |
| * A critical bug that prevented metadata from printing to the |
| command line when the underlying Parser didn't generate |
| XHTML output was fixed. (TIKA-596) |
| |
| * The 0.8 version of Tika included a NetCDF jar file that pulled |
| in tremendous amounts of redundant dependencies. This has |
| been addressed in Tika 0.9 by republishing a minimal NetCDF |
| jar and changing Tika to depend on that. (TIKA-556) |
| |
| * MIME detection for iWork, and OpenXML documents has been |
| improved. (TIKA-533, TIKA-562, TIKA-588) |
| |
| * A critical backwards incompatible bug in PDF parsing that |
| was introduced in Tika 0.8 has been fixed. (TIKA-548) |
| |
| * Support for forked parsing in separate processes was added. |
| (TIKA-416) |
| |
| * Tika's language identifier now supports the Lithuanian |
| language. (TIKA-582) |
| |
| Tika 0.9 relies on the following set of major dependencies (generated using |
| mvn dependency:tree from tika-parsers): |
| |
| org.apache.tika:tika-parsers:bundle:0.9 |
| +- org.apache.tika:tika-core:jar:0.9:compile |
| +- edu.ucar:netcdf:jar:4.2-min:compile |
| | \- org.slf4j:slf4j-api:jar:1.5.6:compile |
| +- commons-httpclient:commons-httpclient:jar:3.1:compile |
| | +- commons-logging:commons-logging:jar:1.1.1:compile (version managed from 1.0.4) |
| | \- commons-codec:commons-codec:jar:1.2:compile |
| +- org.apache.james:apache-mime4j:jar:0.6:compile |
| +- org.apache.commons:commons-compress:jar:1.1:compile |
| +- org.apache.pdfbox:pdfbox:jar:1.4.0:compile |
| | +- org.apache.pdfbox:fontbox:jar:1.4.0:compile |
| | \- org.apache.pdfbox:jempbox:jar:1.4.0:compile |
| +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile |
| +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile |
| +- org.apache.poi:poi:jar:3.7:compile |
| +- org.apache.poi:poi-scratchpad:jar:3.7:compile |
| +- org.apache.poi:poi-ooxml:jar:3.7:compile |
| | +- org.apache.poi:poi-ooxml-schemas:jar:3.7:compile |
| | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile |
| | \- dom4j:dom4j:jar:1.6.1:compile |
| +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile |
| +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile |
| +- asm:asm:jar:3.1:compile |
| +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile |
| +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile |
| +- rome:rome:jar:0.9:compile |
| \- jdom:jdom:jar:1.0:compile |
| |
| The following people have contributed to Tika 0.9 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Alex Skochin |
| Alexander Chow |
| Antoine L. |
| Antoni Mylka |
| Benjamin Douglas |
| Benson Margulies |
| Chris A. Mattmann |
| Cristian Vat |
| Cyriel Vringer |
| David Benson |
| Erik Hetzner |
| Gabriel Miklos |
| Geoff Jarrad |
| Jukka Zitting |
| Ken Krugler |
| Kostya Gribov |
| Leszek Piotrowicz |
| Martijn van Groningen |
| Maxim Valyanskiy |
| Michel Tremblay |
| Nick Burch |
| paul |
| Paul Pearcy |
| Peter van Raamsdonk |
| Piotr Bartosiewicz |
| Reinhard Schwab |
| Scott Severtson |
| Shinsuke Sugaya |
| Staffan Olsson |
| Steve Kearns |
| Tom Klonikowski |
| Žygimantas Medelis |
| |
| See http://s.apache.org/qi for more details on these contributions. |
| |
| |
| Release 0.8 - 11/07/2010 |
| ------------------------ |
| |
| The most notable changes in Tika 0.8 over previous releases are: |
| |
| * Language identification is now dynamically configurable, |
| managed via a config file loaded from the classpath. (TIKA-490) |
| |
| * Tika now supports parsing Feeds by wrapping the underlying |
| Rome library. (TIKA-466) |
| |
| * A quick-start guide for Tika parsing was contributed. (TIKA-464) |
| |
| * An approach for plumbing through XHTML attributes was added. (TIKA-379) |
| |
| * Media type hierarchy information is now taken into account when |
| selecting the best parser for a given input document. (TIKA-298) |
| |
| * Support for parsing common scientific data formats including netCDF |
| and HDF4/5 was added (TIKA-400 and TIKA-399). |
| |
| * Unit tests for Windows have been fixed, allowing TestParsers |
| to complete. (TIKA-398) |
| |
| Tika 0.8 relies on the following set of major dependencies (generated using |
| mvn dependency:tree from tika-parsers): |
| |
| org.apache.tika:tika-parsers:bundle:0.8 |
| +- org.apache.tika:tika-core:jar:0.8:compile |
| +- edu.ucar:netcdf:jar:4.2:compile |
| | \- org.slf4j:slf4j-api:jar:1.5.6:compile |
| +- commons-httpclient:commons-httpclient:jar:3.1:compile |
| | +- commons-logging:commons-logging:jar:1.1.1:compile (version managed from 1.0.4) |
| | \- commons-codec:commons-codec:jar:1.2:compile |
| +- org.apache.commons:commons-compress:jar:1.1:compile |
| +- org.apache.pdfbox:pdfbox:jar:1.3.1:compile |
| | +- org.apache.pdfbox:fontbox:jar:1.3.1:compile |
| | \- org.apache.pdfbox:jempbox:jar:1.3.1:compile |
| +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile |
| +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile |
| +- org.apache.poi:poi:jar:3.7:compile |
| +- org.apache.poi:poi-scratchpad:jar:3.7:compile |
| +- org.apache.poi:poi-ooxml:jar:3.7:compile |
| | +- org.apache.poi:poi-ooxml-schemas:jar:3.7:compile |
| | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile |
| | \- dom4j:dom4j:jar:1.6.1:compile |
| +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile |
| +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile |
| +- asm:asm:jar:3.1:compile |
| +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile |
| +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile |
| +- rome:rome:jar:0.9:compile |
| \- jdom:jdom:jar:1.0:compile |
| |
| The following people have contributed to Tika 0.8 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Łukasz Wiktor |
| Adam Wilmer |
| Alex Baranau |
| Alex Ott |
| André Ricardo |
| Andrey Barhatov |
| Andrey Sidorenko |
| Antoni Mylka |
| Arturo Beltran |
| Attila Király |
| Brad Greenlee |
| Bruno Dumon |
| Chris A. Mattmann |
| Chris Bamford |
| Christophe Gourmelon |
| Dave Meikle |
| David Weekly |
| Dmitry Kuzmenko |
| Erik Hetzner |
| Geoff Jarrad |
| Gerd Bremer |
| Grant Ingersoll |
| Jan Høydahl |
| Jean-Philippe Ricard |
| Jeremias Maerki |
| Joao Garcia |
| Jukka Zitting |
| Julien Nioche |
| Ken Krugler |
| Liam O'Boyle |
| Mads Hansen |
| Marcel May |
| Markus Goldbach |
| Martijn van Groningen |
| Maxim Valyanskiy |
| Mike Hays |
| Miroslav Pokorny |
| Nick Burch |
| Otis Gospodnetic |
| Peter van Raamsdonk |
| Peter Wolanin |
| Peter_Lenahan@ibi.com |
| Piotr Bartosiewicz |
| Radek |
| Rajiv Kumar |
| Reinhard Schwab |
| rick cameron |
| Robert Muir |
| Sanjeev Rao |
| Simon Tyler |
| Sjoerd Smeets |
| Slavomir Varchula |
| Staffan Olsson |
| Tom De Leu |
| Uwe Schindler |
| Victor Kazakov |
| |
| See http://s.apache.org/ab0 for more details on these contributions. |
| |
| |
| Release 0.7 - 3/31/2010 |
| ----------------------- |
| |
| The most notable changes in Tika 0.7 over previous releases are: |
| |
| * MP3 file parsing was improved, including Channel and SampleRate |
| extraction and ID3v2 support (TIKA-368, TIKA-372). Further, audio |
| parsing mime detection was also improved for the MIDI format. (TIKA-199) |
| |
| * Tika no longer relies on X11 for its RTF parsing functionality. (TIKA-386) |
| |
| * A thread-safety bug in the AutoDetectParser was discovered and |
| addressed. (TIKA-374) |
| |
| * Upgrade to PDFBox 1.0.0. The new PDFBox version improves PDF parsing |
| performance and fixes a number of text extraction issues. (TIKA-380) |
| |
| The following people have contributed to Tika 0.7 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Adam Rauch |
| Benson Margulies |
| Brett S. |
| Chris A. Mattmann |
| Daan de Wit |
| Dave Meikle |
| Durville |
| Ingo Renner |
| Jukka Zitting |
| Ken Krugler |
| Kenny Neal |
| Markus Goldbach |
| Maxim Valyanskiy |
| Nick Burch |
| Sami Siren |
| Uwe Schindler |
| |
| See http://tinyurl.com/yklopby for more details on these contributions. |
| |
| |
| Release 0.6 - 01/20/2010 |
| ------------------------ |
| |
| The most notable changes in Tika 0.6 over the previous release are: |
| |
| * Mime-type detection for HTML (and all types) has been improved. Malformed |
| HTML files, and HTML files that require a bit more observed content |
| before the type is properly detected, are now correctly identified by |
| the AutoDetectParser. (TIKA-327, TIKA-357, TIKA-366, TIKA-367) |
| |
| * Tika now has an additional OSGi bundle packaging that includes all the |
| required parser libraries. This bundle package makes it easy to use all |
| Tika features in an OSGi environment. (TIKA-340, TIKA-342) |
| |
| * The Apache POI dependency used for parsing Microsoft Office file formats |
| has been upgraded to version 3.6. The most visible improvement in this |
| version is the notably reduced ooxml jar file size. The tika-app jar size |
| is now down to 15MB from the 25MB in Tika 0.5. (TIKA-353) |
| |
| * Handling of character encoding information in input metadata and HTML |
| <meta> tags has been improved. When no applicable encoding information is |
| available, the encoding is detected by looking at the input data. |
| (TIKA-332, TIKA-334, TIKA-335, TIKA-341) |
| |
| * Some document types like Excel spreadsheets contain content like |
| numbers or formulas whose exact text format depends on the current locale. |
| So far Tika has used the platform default locale in such cases, but |
| clients can now explicitly specify the locale by passing a Locale instance |
| in the parse context. (TIKA-125) |
| |
| * The default text output encoding of the tika-app jar is now UTF-8 |
| when running on Mac OS X. This is because the default encoding used |
| by Java is not compatible with the console application in Mac OS X. |
| On all other platforms the text output from tika-app still uses |
| the platform default encoding. (TIKA-324) |
| |
| * A flash video (video/x-flv) parser has been added. (TIKA-328) |
| |
| * Handling of Number and Date cell formatting within Microsoft Excel |
| documents has been added. This includes currencies, percentages and |
| scientific formats. (TIKA-103) |
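
The locale note above (TIKA-125) matters because the same numeric cell value renders differently depending on the locale. The following is a minimal sketch using only java.text from the JDK, not Tika's own formatting code; LocaleFormatting is a hypothetical class name chosen for illustration.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical illustration (not a Tika class) of locale-dependent number text.
final class LocaleFormatting {

    // Formats a value with the grouping and decimal separators of the given locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US));      // 1,234.5
        System.out.println(format(1234.5, Locale.GERMANY)); // 1.234,5
    }
}
```

Passing an explicit Locale instead of relying on the platform default keeps extracted text stable across machines with different regional settings.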
| |
| The following people have contributed to Tika 0.6 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Andrzej Bialecki |
| Bertrand Delacretaz |
| Chris A. Mattmann |
| Dave Meikle |
| Erik Hetzner |
| Felix Meschberger |
| Jukka Zitting |
| Julien Nioche |
| Ken Krugler |
| Luke Nezda |
| Maxim Valyanskiy |
| Niall Pemberton |
| Peter Wolanin |
| Piotr B. |
| Sami Siren |
| Yuan-Fang Li |
| |
| See http://tinyurl.com/yc3dk67 for more details on these contributions. |
| |
| |
| Release 0.5 - 11/14/2009 |
| ------------------------ |
| |
| The most notable changes in Tika 0.5 over the previous release are: |
| |
| * Improved RDF/OWL mime detection using both MIME magic as well as |
| pattern matching (TIKA-309) |
| |
| * An org.apache.tika.Tika facade class has been added to simplify common |
| text extraction and type detection use cases. (TIKA-269) |
| |
| * A new parse context argument was added to the Parser.parse() method. |
| This context map can be used to pass things like a delegate parser or |
| other settings to the parsing process. The previous parse() method |
| signature has been deprecated and will be removed in Tika 1.0. (TIKA-275) |
| |
| * A simple ngram-based language detection mechanism has been added along |
| with predefined language profiles for 18 languages. (TIKA-209) |
| |
| * The media type registry in Tika was synchronized with the MIME type |
| configuration in the Apache HTTP Server. Tika now knows about 1274 |
| different media types and can detect 672 of those using 927 file |
| extension and 280 magic byte patterns. (TIKA-285) |
| |
| * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing PDF |
| documents. This version is notably better than the 0.7.3 release used |
| earlier. (TIKA-158) |
| |
| The following people have contributed to Tika 0.5 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Alex Baranov |
| Bart Hanssens |
| Benson Margulies |
| Chris A. Mattmann |
| Daan de Wit |
| Erik Hetzner |
| Frank Hellwig |
| Jeff Cadow |
| Joachim Zittmayr |
| Jukka Zitting |
| Julien Nioche |
| Ken Krugler |
| Maxim Valyanskiy |
| MRIT64 |
| Paul Borgermans |
| Piotr B. |
| Robert Newson |
| Sascha Szott |
| Ted Dunning |
| Thilo Goetz |
| Uwe Schindler |
| Yuan-Fang Li |
| |
| See http://tinyurl.com/yl9prwp for more details on these contributions. |
| |
| |
| Release 0.4 - 07/14/2009 |
| ------------------------ |
| |
| The most notable changes in Tika 0.4 over the previous release are: |
| |
| * Tika has been split into three different components for increased |
| modularity. The tika-core component contains the key interfaces and |
| core functionality of Tika, tika-parsers contains all the adapters |
| to external parser libraries, and tika-app bundles everything together |
| in a single executable jar file. (TIKA-219) |
| |
| * All three Tika components are packaged as OSGi bundles. (TIKA-228) |
| |
| * Tika now uses the new Commons Compress library for improved support |
| of compression and packaging formats like gzip, bzip2, tar, cpio, |
| ar, zip and jar. (TIKA-204) |
| |
| * The memory use of parsing Excel sheets with lots of numbers |
| has been considerably reduced. (TIKA-211) |
| |
| * The AutoDetectParser now has basic protection against "zip bomb" |
| attacks, where a specially crafted input document can expand to |
| a practically infinite amount of output text. (TIKA-216) |
| |
| * The ParsingReader class can now use a thread pool or a more complex |
| execution model (java.util.concurrent.Executor) for the background |
| parsing task. (TIKA-215) |
| |
| * Automatic type detection of text- and XML-based documents has been |
| improved. (TIKA-225) |
| |
| * Charset detection functionality from the ICU4J library was inlined |
| in Tika to avoid the dependency on the large ICU4J jar. (TIKA-229) |
| |
| * Composite parsers like the AutoDetectParser now make sure that any |
| RuntimeExceptions, IOExceptions or SAXExceptions unrelated to the given |
| document stream or content handler are converted to TikaExceptions |
| before being passed to the client. (TIKA-198, TIKA-237) |
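
One way to picture the "zip bomb" protection noted above (TIKA-216) is
a guard that compares the amount of output produced with the amount of
input consumed. The sketch below is an assumption about the general
technique, not Tika's code: ExpansionGuard, threshold and maxRatio are
hypothetical names and values.

```java
/** Hypothetical output/input ratio guard (illustration only). */
final class ExpansionGuard {
    private final long threshold; // output size below which no check is made
    private final long maxRatio;  // allowed output characters per input byte
    private long bytesRead;
    private long charsWritten;

    ExpansionGuard(long threshold, long maxRatio) {
        this.threshold = threshold;
        this.maxRatio = maxRatio;
    }

    /** Records input bytes consumed by the parser. */
    void onInput(long bytes) {
        bytesRead += bytes;
    }

    /** Records output characters and fails if the expansion looks suspicious. */
    void onOutput(long chars) {
        charsWritten += chars;
        if (charsWritten > threshold && charsWritten > bytesRead * maxRatio) {
            throw new IllegalStateException(
                    "Suspected zip bomb: " + bytesRead + " input bytes produced "
                    + charsWritten + " output characters");
        }
    }

    public static void main(String[] args) {
        ExpansionGuard guard = new ExpansionGuard(1_000, 100);
        guard.onInput(10);
        guard.onOutput(900);      // still below the threshold, accepted
        try {
            guard.onOutput(200);  // 1100 characters from 10 bytes, rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The threshold keeps small documents from tripping the check, since a
short input can legitimately expand by a large factor.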
| |
| The following people have contributed to Tika 0.4 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Chris A. Mattmann |
| Daan de Wit |
| Dave Meikle |
| David Weekly |
| Jeremias Maerki |
| Jonathan Koren |
| Jukka Zitting |
| Karl Heinz Marbaise |
| Keith R. Bennett |
| Maxim Valyanskiy |
| Niall Pemberton |
| Robert Burrell Donkin |
| Sami Siren |
| Siddharth Gargate |
| Uwe Schindler |
| |
| See http://tinyurl.com/mgv9o3 for more details on these contributions. |
| |
| |
| Release 0.3 - 03/09/2009 |
| ------------------------ |
| |
| The most notable changes in Tika 0.3 over the previous release are: |
| |
| * Tika now supports mime type glob patterns specified as regular |
| expressions in standard JDK 1.4 (and later) syntax via the isregex |
| attribute on the glob tag. See: |
| |
| http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html |
| |
| for more information. (TIKA-194) |
| |
| * Tika now supports the Office Open XML format used by |
| Microsoft Office 2007. (TIKA-152) |
| |
| * All the metadata keys for Microsoft Office document properties are now |
| included as constants in the MSOffice interface. Clients should use |
| these constants instead of the raw string values to refer to specific |
| metadata items. (TIKA-186) |
| |
| * Automatic detection of document types in Tika has been improved. |
| For example Tika can now detect plain text just by looking at the first |
| few bytes of the document. (TIKA-154) |
| |
| * Tika now disables the loading of all external entities in XML files |
| that it parses as input documents. This improves security and avoids |
| problems with potentially broken references. (TIKA-185) |
| |
| * Tika now replaces all invalid XML characters in the extracted text |
| content with spaces. This prevents problems when output from Tika |
| is processed with XML tools. (TIKA-180) |
| |
| * The Tika CLI now correctly flushes its buffers when invoked with the |
| --text argument. This prevents the end of the text output from being |
| lost. (TIKA-179) |
| |
| * Embedded text in MIDI files is now extracted. For example many karaoke |
| files contain song lyrics embedded as MIDI text. |
| |
| * The text content of Microsoft Outlook message files no longer appears as |
| multiple copies in the extracted text. (TIKA-197) |
| |
| * The ParsingReader class now makes most document metadata available |
| already before any of the extracted text is consumed. This makes it |
| easier for example to construct Lucene Document instances that contain |
| both extracted text and metadata. (TIKA-203) |
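
The replacement of invalid XML characters with spaces noted above
(TIKA-180) can be sketched as follows. The valid ranges are those of
the Char production in XML 1.0; the class XmlCharSketch and its method
names are hypothetical and are not taken from Tika.

```java
/** Hypothetical helper that blanks out characters not allowed in XML 1.0. */
final class XmlCharSketch {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every invalid code point with a single space. */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "form" + (char) 0x0C + "feed";
        System.out.println(sanitize(raw));
    }
}
```

Iterating by code point rather than by char keeps valid supplementary
characters intact, while an unpaired surrogate is treated as invalid.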
| |
| See http://tinyurl.com/tika-0-3-changes for a list of all changes in Tika 0.3. |
| |
| The following people have contributed to Tika 0.3 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Andrzej Rusin |
| Chris A. Mattmann |
| Dave Meikle |
| Georger Araújo |
| Guillermo Arribas |
| Jonathan Koren |
| Jukka Zitting |
| Karl Heinz Marbaise |
| Kumar Raja Jana |
| Paul Borgermans |
| Peter Becker |
| Sébastien Michel |
| Uwe Schindler |
| |
| See http://tinyurl.com/tika-0-3-contributions for more details on |
| these contributions. |
| |
| |
| Release 0.2 - 12/04/2008 |
| ------------------------ |
| |
| 1. TIKA-109 - WordParser fails on some Word files (Dave Meikle) |
| |
| 2. TIKA-105 - Excel parser implementation based on POI's Event API |
| (Niall Pemberton) |
| |
| 3. TIKA-116 - Streaming parser for OpenDocument files (Jukka Zitting) |
| |
| 4. TIKA-117 - Drop JDOM and Jaxen dependencies (Jukka Zitting) |
| |
| 5. TIKA-115 - Tika package with all the dependencies (Jukka Zitting) |
| |
| 6. TIKA-97 - Tika GUI (Jukka Zitting) |
| |
| 7. TIKA-96 - Tika CLI (Jukka Zitting) |
| |
| 8. TIKA-112 - Use Commons IO 1.4 (Jukka Zitting) |
| |
| 9. TIKA-127 - Add support for Visio files (Jukka Zitting) |
| |
| 10. TIKA-129 - node() support for the streaming XPath utility (Jukka Zitting) |
| |
| 11. TIKA-130 - self-or-descendant axis does not match self in streaming XPath |
| (Jukka Zitting) |
| |
| 12. TIKA-131 - Lazy XHTML prefix generation (Jukka Zitting) |
| |
| 13. TIKA-128 - HTML parser should produce XHTML SAX events (Jukka Zitting) |
| |
| 14. TIKA-133 - TeeContentHandler constructor should use varargs (Jukka Zitting) |
| |
| 15. TIKA-132 - Refactor Excel extractor to parse per sheet and add |
| hyperlink support (Niall Pemberton) |
| |
| 16. TIKA-134 - mvn package does not produce packages for bin/src |
| (Karl Heinz Marbaise) |
| |
| 17. TIKA-138 - Ignore HTML style and script content (Jukka Zitting) |
| |
| 18. TIKA-113 - Metadata (such as title) should not be part of content |
| (Jukka Zitting) |
| |
| 19. TIKA-139 - Add a composite parser (Jukka Zitting) |
| |
| 20. TIKA-142 - Include application/xhtml+xml as valid mime type for XMLParser |
| (mattmann) |
| |
| 21. TIKA-143 - Add ParsingReader (Jukka Zitting) |
| |
| 22. TIKA-144 - Upgrade nekohtml dependency (Jukka Zitting) |
| |
| 23. TIKA-145 - Separate NOTICEs and LICENSEs for binary and source packages |
| (Jukka Zitting) |
| |
| 24. TIKA-146 - Upgrade to POI 3.1 (Jukka Zitting) |
| |
| 25. TIKA-99 - Support external parser programs (Jukka Zitting) |
| |
| 26. TIKA-149 - Parser for Zip files (Dave Meikle & Jukka Zitting) |
| |
| 27. TIKA-150 - Parser for tar files (Jukka Zitting) |
| |
| 28. TIKA-151 - Stream compression support (Jukka Zitting) |
| |
| 29. TIKA-156 - Some MIME magic patterns are ignored by MimeTypes |
| (Jukka Zitting) |
| |
| 30. TIKA-155 - Java class file parser (Dave Brosius & Jukka Zitting) |
| |
| 31. TIKA-108 - New Tika logos (Yongqian Li & Jukka Zitting) |
| |
| 32. TIKA-120 - Add support for retrieving ID3 tags from MP3 files |
| (Dave Meikle & Jukka Zitting) |
| |
| 33. TIKA-54 - Outlook msg parser |
| (Rida Benjelloun, Dave Meikle & Jukka Zitting) |
| |
| 34. TIKA-114 - PDFParser : Getting content of the document using |
| "writer.ToString ()" , some words are stuck together |
| (Dave Meikle) |
| |
| 35. TIKA-161 - Enable PMD reports (Jukka Zitting) |
| |
| 36. TIKA-159 - Add support for parsing basic audio types: wav, aiff, au, midi |
| (Sami Siren) |
| |
| 37. TIKA-140 - HTML parser unable to extract text |
| (Julien Nioche & Jukka Zitting) |
| |
| 38. TIKA-163 - GUI does not support drag and drop in Gnome or KDE (Dave Meikle) |
| |
| 39. TIKA-166 - Update HTMLParser to parse contents of meta tags (Dave Meikle) |
| |
| 40. TIKA-164 - Upgrade of the nekohtml dependency to 1.9.9 (Jukka Zitting) |
| |
| 41. TIKA-165 - Upgrade of the ICU4J dependency to version 3.8 (Jukka Zitting) |
| |
| 42. TIKA-172 - New Open Document Parser that emits structured XHTML content |
| (Uwe Schindler & Jukka Zitting) |
| |
| 43. TIKA-175 - Retrotranslate Tika for use in Java 1.4 environments (Jukka Zitting) |
| |
| 44. TIKA-177 - Improvements to build instruction in README (Chris Hostetter & Jukka Zitting) |
| |
| 45. TIKA-171 - New ContentHandler for plain text output that has no problem with |
| missing white space after XHTML block tags (Uwe Schindler & Jukka Zitting) |
| |
| |
| Release 0.1-incubating - 12/27/2007 |
| ----------------------------------- |
| |
| 1. TIKA-5 - Port Metadata Framework from Nutch (mattmann) |
| |
| 2. TIKA-11 - Consolidate test classes into a src/test/java directory tree (mattmann) |
| |
| 3. TIKA-15 - Utils.print does not print a Content having no value (jukka) |
| |
| 4. TIKA-19 - org.apache.tika.TestParsers fails (bdelacretaz) |
| |
| 5. TIKA-16 - Issues with data files used for testing by TestParsers (bdelacretaz) |
| |
| 6. TIKA-14 - MimeTypeUtils.getMimeType() returns the default mime type for |
| .odt (Open Office) file (bdelacretaz) |
| |
| 7. TIKA-12 - Add URL capability to MimeTypesUtils (jukka) |
| |
| 8. TIKA-13 - Fix obsolete package names in config.xml (siren) |
| |
| 9. TIKA-10 - Remove MimeInfoException catch clauses and import from TestParsers (siren) |
| |
| 10. TIKA-8 - Replaced the jmimeinfo dependency with a trivial mime type detector (jukka) |
| |
| 11. TIKA-7 - Added the Lius Lite code. Added missing dependencies to POM (jukka) |
| |
| 12. TIKA-18 - "Office" interface should be renamed "MSOffice" (mattmann) |
| |
| 13. TIKA-23 - Decouple Parser from ParserConfig (jukka) |
| |
| 14. TIKA-6 - Port Nutch (or better) MimeType detection system into Tika (J. Charron & mattmann) |
| |
| 15. TIKA-25 - Removed hardcoded reference to C:\oo.xml in OpenOfficeParser (K. Bennett & jukka) |
| |
| 16. TIKA-17 - Need to support URL's for input resources. (K. Bennett & mattmann) |
| |
| 17. TIKA-22 - Remove @author tags from the java source (mattmann) |
| |
| 18. TIKA-21 - Simplified configuration code (jukka) |
| |
| 19. TIKA-17 - Rename all "Lius" classes to be "Tika" classes (jukka) |
| |
| 20. TIKA-30 - Added utility constructors to TikaConfig (K. Bennett & jukka) |
| |
| 21. TIKA-28 - Rename config.xml to tika-config.xml or similar (mattmann) |
| |
| 22. TIKA-26 - Use Map<String, Content> instead of List<Content> (jukka) |
| |
| 23. TIKA-31 - protected Parser.parse(InputStream stream, |
| Iterable<Content> contents) (jukka & K. Bennett) |
| |
| 24. TIKA-36 - A convenience method for getting a document's content's text |
| would be helpful (K. Bennett & mattmann) |
| |
| 25. TIKA-33 - Stateless parsers (jukka) |
| |
| 26. TIKA-38 - TXTParser adds a space to the content it reads from a file (K. Bennett & ridabenjelloun) |
| |
| 27. TIKA-35 - Extract MsOffice properties, use RereadableInputStream developed by K. Bennett (ridabenjelloun & K. Bennett) |
| |
| 28. TIKA-39 - Excel parsing improvements (siren & ridabenjelloun) |
| |
| 29. TIKA-34 - Provide a method that will return a default configuration |
| (TikaConfig) (K. Bennett & mattmann) |
| |
| 30. TIKA-42 - Content class needs (String, String, String) constructor (K. Bennett) |
| |
| 31. TIKA-43 - Parser interface (jukka) |
| |
| 32. TIKA-47 - Remove TikaLogger (jukka) |
| |
| 33. TIKA-46 - Use Metadata in Parser (jukka & mattmann) |
| |
| 34. TIKA-48 - Merge MS Extractors and Parsers (jukka) |
| |
| 35. TIKA-45 - RereadableInputStream needs to be able to read to |
| the end of the original stream on first rewind. (K. Bennett) |
| |
| 36. TIKA-41 - Resource files occur twice in jar file. (jukka) |
| |
| 37. TIKA-49 - Some files have old-style license headers, fixed (Robert Burrell Donkin & bdelacretaz) |
| |
| 38. TIKA-51 - Leftover temp files after running Tika tests, fixed (bdelacretaz) |
| |
| 39. TIKA-40 - Tika needs to support diverse character encodings (jukka) |
| |
| 40. TIKA-55 - ParseUtils.getParser() method variants should have consistent parameter orders |
| (K. Bennett) |
| |
| 41. TIKA-52 - RereadableInputStream needs to support not closing the input stream it wraps. |
| (K. Bennett via bdelacretaz) |
| |
| 42. TIKA-53 - XHTML SAX events from parsers (jukka) |
| |
| 43. TIKA-57 - Rename org.apache.tika.ms to org.apache.tika.parser.ms (jukka) |
| |
| 44. TIKA-62 - Use TikaConfig.getDefaultConfig() instead of a hardcoded |
| config path in TestParsers (jukka) |
| |
| 45. TIKA-58 - Replace jtidy html parser with nekohtml based parser (siren) |
| |
| 46. TIKA-60 - Rename Microsoft parser classes (jukka) |
| |
| 47. TIKA-63 - Avoid multiple passes over the input stream in Microsoft parsers |
| (jukka) |
| |
| 48. TIKA-66 - Use Java 5 features in org.apache.tika.mime (jukka) |
| |
| 49. TIKA-56 - Mime type detection fails with upper case file extensions such as "PDF" |
| (mattmann) |
| |
| 50. TIKA-65 - Add encoding detection support for HTML parser (siren) |
| |
| 51. TIKA-68 - Add dummy parser classes to be used as sentinels (jukka) |
| |
| 52. TIKA-67 - Add an auto-detecting Parser implementation (jukka) |
| |
| 53. TIKA-70 - Better MIME information for the Open Document formats (jukka) |
| |
| 54. TIKA-71 - Remove ParserConfig and ParserFactory (jukka) |
| |
| 55. TIKA-83 - Create a org.apache.tika.sax package for SAX utilities (jukka) |
| |
| 56. TIKA-84 - Add MimeTypes.getMimeType(InputStream) (jukka) |
| |
| 57. TIKA-85 - Add glob patterns from the ASF svn:eol-style documentation (jukka) |
| |
| 58. TIKA-100 - Structured PDF parsing (jukka) |
| |
| 59. TIKA-101 - Improve site and build (mattmann) |
| |
| 60. TIKA-102 - Parser implementations loading a large amount of content |
| into a single String could be problematic (Niall Pemberton) |
| |
| 61. TIKA-107 - Remove use of assertions for argument checking (Niall Pemberton) |
| |
| 62. TIKA-104 - Add utility methods to throw IOException with the cause |
| initialized (jukka & Niall Pemberton) |
| |
| 63. TIKA-106 - Remove dependency on Jakarta ORO - use JDK 1.4 Regex |
| (Niall Pemberton) |
| |
| 64. TIKA-111 - Missing license headers (jukka) |
| |
| 65. TIKA-112 - XMLParser improvement (ridabenjelloun) |