| Apache Tika Change Log |
| ====================== |
| |
| Release 0.3 - 03/09/2009 |
| ------------------------ |
| |
| The most notable changes in Tika 0.3 over the previous release are: |
| |
| * Tika now supports MIME type glob patterns written in the standard |
| JDK 1.4 (and later) regular expression syntax, enabled via the isregex |
| attribute on the glob tag. See: |
| |
| http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html |
| |
| for more information. (TIKA-194) |
| |
| * Tika now supports the Office Open XML format used by |
| Microsoft Office 2007. (TIKA-152) |
| |
| * All the metadata keys for Microsoft Office document properties are now |
| included as constants in the MSOffice interface. Clients should use |
| these constants instead of the raw string values to refer to specific |
| metadata items. (TIKA-186) |
| |
| * Automatic detection of document types in Tika has been improved. |
| For example, Tika can now detect plain text just by looking at the first |
| few bytes of the document. (TIKA-154) |
| |
| * Tika now disables the loading of all external entities in XML files |
| that it parses as input documents. This improves security and avoids |
| problems with potentially broken references. (TIKA-185) |
| |
| * Tika now replaces all invalid XML characters in the extracted text |
| content with spaces. This prevents problems when output from Tika |
| is processed with XML tools. (TIKA-180) |
| |
| * The Tika CLI now correctly flushes its buffers when invoked with the |
| --text argument. This prevents the end of the text output from being |
| lost. (TIKA-179) |
| |
| * Embedded text in MIDI files is now extracted. For example, many karaoke |
| files contain song lyrics embedded as MIDI text. |
| |
| * The text content of Microsoft Outlook message files no longer appears as |
| multiple copies in the extracted text. (TIKA-197) |
| |
| * The ParsingReader class now makes most document metadata available |
| before any of the extracted text is consumed. This makes it easier, |
| for example, to construct Lucene Document instances that contain |
| both extracted text and metadata. (TIKA-203) |
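
As an illustration of the regular-expression glob support listed above (TIKA-194), the following minimal sketch shows how a file name can be tested against a java.util.regex pattern. The class name, the method name, and the pattern itself are hypothetical, and the sketch assumes that the whole name must match. It does not reproduce Tika's own matching code.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: tests a file name against a
// regular expression of the kind an isregex glob could carry.
final class RegexGlobSketch {

    // Example pattern (an assumption): "report-", four digits, ".pdf".
    private static final Pattern REPORT_PDF =
            Pattern.compile("report-\\d{4}\\.pdf");

    // Returns true only if the entire file name matches the pattern.
    static boolean matchesGlob(String fileName) {
        return REPORT_PDF.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesGlob("report-2009.pdf"));
        System.out.println(matchesGlob("report-09.pdf"));
    }
}
```

A plain glob such as *.pdf cannot express "exactly four digits", which is the kind of case where a regular expression is useful.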
| |
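The two XML-related changes listed above (TIKA-185 and TIKA-180) can be sketched with the standard JAXP and SAX APIs. This is only an illustration under stated assumptions, not Tika's implementation: the class and method names are hypothetical, external entities are neutralized by resolving each one to an empty document, and the valid character ranges are those of the XML 1.0 Char production.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code.
final class XmlHardeningSketch {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that is not valid in XML 1.0 with a space.
    static String replaceInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Collects the character content of an XML document while resolving
    // every external entity to an empty document, so nothing is loaded
    // from the file system or the network.
    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The entity below points at a file that is never opened.
        String xml = "<!DOCTYPE d [<!ENTITY x SYSTEM "
                + "\"file:///nonexistent/secret.txt\">]><d>a&x;b</d>";
        System.out.println(extractText(xml));
        System.out.println(replaceInvalidXmlChars("a\0b"));
    }
}
```

Resolving entities to empty content lets documents with broken references still parse. Replacing invalid characters with spaces, rather than dropping them, keeps the surrounding words separated.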
| See http://tinyurl.com/tika-0-3-changes for a list of all changes in Tika 0.3. |
| |
| The following people have contributed to Tika 0.3 by submitting or commenting |
| on the issues resolved in this release: |
| |
| Andrzej Rusin |
| Chris A. Mattmann |
| Dave Meikle |
| Georger Araújo |
| Guillermo Arribas |
| Jonathan Koren |
| Jukka Zitting |
| Karl Heinz Marbaise |
| Kumar Raja Jana |
| Paul Borgermans |
| Peter Becker |
| Sébastien Michel |
| Uwe Schindler |
| |
| See http://tinyurl.com/tika-0-3-contributions for more details on |
| these contributions. |
| |
| Release 0.2 - 12/04/2008 |
| ------------------------ |
| |
| 1. TIKA-109 - WordParser fails on some Word files (Dave Meikle) |
| |
| 2. TIKA-105 - Excel parser implementation based on POI's Event API |
| (Niall Pemberton) |
| |
| 3. TIKA-116 - Streaming parser for OpenDocument files (Jukka Zitting) |
| |
| 4. TIKA-117 - Drop JDOM and Jaxen dependencies (Jukka Zitting) |
| |
| 5. TIKA-115 - Tika package with all the dependencies (Jukka Zitting) |
| |
| 6. TIKA-97 - Tika GUI (Jukka Zitting) |
| |
| 7. TIKA-96 - Tika CLI (Jukka Zitting) |
| |
| 8. TIKA-112 - Use Commons IO 1.4 (Jukka Zitting) |
| |
| 9. TIKA-127 - Add support for Visio files (Jukka Zitting) |
| |
| 10. TIKA-129 - node() support for the streaming XPath utility (Jukka Zitting) |
| |
| 11. TIKA-130 - self-or-descendant axis does not match self in streaming XPath |
| (Jukka Zitting) |
| |
| 12. TIKA-131 - Lazy XHTML prefix generation (Jukka Zitting) |
| |
| 13. TIKA-128 - HTML parser should produce XHTML SAX events (Jukka Zitting) |
| |
| 14. TIKA-133 - TeeContentHandler constructor should use varargs (Jukka Zitting) |
| |
| 15. TIKA-132 - Refactor Excel extractor to parse per sheet and add |
| hyperlink support (Niall Pemberton) |
| |
| 16. TIKA-134 - mvn package does not produce packages for bin/src |
| (Karl Heinz Marbaise) |
| |
| 17. TIKA-138 - Ignore HTML style and script content (Jukka Zitting) |
| |
| 18. TIKA-113 - Metadata (such as title) should not be part of content |
| (Jukka Zitting) |
| |
| 19. TIKA-139 - Add a composite parser (Jukka Zitting) |
| |
| 20. TIKA-142 - Include application/xhtml+xml as valid mime type for XMLParser |
| (mattmann) |
| |
| 21. TIKA-143 - Add ParsingReader (Jukka Zitting) |
| |
| 22. TIKA-144 - Upgrade nekohtml dependency (Jukka Zitting) |
| |
| 23. TIKA-145 - Separate NOTICEs and LICENSEs for binary and source packages |
| (Jukka Zitting) |
| |
| 24. TIKA-146 - Upgrade to POI 3.1 (Jukka Zitting) |
| |
| 25. TIKA-99 - Support external parser programs (Jukka Zitting) |
| |
| 26. TIKA-149 - Parser for Zip files (Dave Meikle & Jukka Zitting) |
| |
| 27. TIKA-150 - Parser for tar files (Jukka Zitting) |
| |
| 28. TIKA-151 - Stream compression support (Jukka Zitting) |
| |
| 29. TIKA-156 - Some MIME magic patterns are ignored by MimeTypes |
| (Jukka Zitting) |
| |
| 30. TIKA-155 - Java class file parser (Dave Brosius & Jukka Zitting) |
| |
| 31. TIKA-108 - New Tika logos (Yongqian Li & Jukka Zitting) |
| |
| 32. TIKA-120 - Add support for retrieving ID3 tags from MP3 files |
| (Dave Meikle & Jukka Zitting) |
| |
| 33. TIKA-54 - Outlook msg parser |
| (Rida Benjelloun, Dave Meikle & Jukka Zitting) |
| |
| 34. TIKA-114 - PDFParser : Getting content of the document using |
| "writer.ToString ()" , some words are stuck together |
| (Dave Meikle) |
| |
| 35. TIKA-161 - Enable PMD reports (Jukka Zitting) |
| |
| 36. TIKA-159 - Add support for parsing basic audio types: wav, aiff, au, midi |
| (Sami Siren) |
| |
| 37. TIKA-140 - HTML parser unable to extract text |
| (Julien Nioche & Jukka Zitting) |
| |
| 38. TIKA-163 - GUI does not support drag and drop in Gnome or KDE (Dave Meikle) |
| |
| 39. TIKA-166 - Update HTMLParser to parse contents of meta tags (Dave Meikle) |
| |
| 40. TIKA-164 - Upgrade of the nekohtml dependency to 1.9.9 (Jukka Zitting) |
| |
| 41. TIKA-165 - Upgrade of the ICU4J dependency to version 3.8 (Jukka Zitting) |
| |
| 42. TIKA-172 - New Open Document Parser that emits structured XHTML content |
| (Uwe Schindler & Jukka Zitting) |
| |
| 43. TIKA-175 - Retrotranslate Tika for use in Java 1.4 environments (Jukka Zitting) |
| |
| 44. TIKA-177 - Improvements to build instruction in README (Chris Hostetter & Jukka Zitting) |
| |
| 45. TIKA-171 - New ContentHandler for plain text output that has no problem with |
| missing white space after XHTML block tags (Uwe Schindler & Jukka Zitting) |
| |
| Release 0.1-incubating - 12/27/2007 |
| ----------------------------------- |
| |
| 1. TIKA-5 - Port Metadata Framework from Nutch (mattmann) |
| |
| 2. TIKA-11 - Consolidate test classes into a src/test/java directory tree (mattmann) |
| |
| 3. TIKA-15 - Utils.print does not print a Content having no value (jukka) |
| |
| 4. TIKA-19 - org.apache.tika.TestParsers fails (bdelacretaz) |
| |
| 5. TIKA-16 - Issues with data files used for testing by TestParsers (bdelacretaz) |
| |
| 6. TIKA-14 - MimeTypeUtils.getMimeType() returns the default mime type for |
| .odt (Open Office) file (bdelacretaz) |
| |
| 7. TIKA-12 - Add URL capability to MimeTypesUtils (jukka) |
| |
| 8. TIKA-13 - Fix obsolete package names in config.xml (siren) |
| |
| 9. TIKA-10 - Remove MimeInfoException catch clauses and import from TestParsers (siren) |
| |
| 10. TIKA-8 - Replaced the jmimeinfo dependency with a trivial mime type detector (jukka) |
| |
| 11. TIKA-7 - Added the Lius Lite code. Added missing dependencies to POM (jukka) |
| |
| 12. TIKA-18 - "Office" interface should be renamed "MSOffice" (mattmann) |
| |
| 13. TIKA-23 - Decouple Parser from ParserConfig (jukka) |
| |
| 14. TIKA-6 - Port Nutch (or better) MimeType detection system into Tika (J. Charron & mattmann) |
| |
| 15. TIKA-25 - Removed hardcoded reference to C:\oo.xml in OpenOfficeParser (K. Bennett & jukka) |
| |
| 16. TIKA-17 - Need to support URLs for input resources. (K. Bennett & mattmann) |
| |
| 17. TIKA-22 - Remove @author tags from the java source (mattmann) |
| |
| 18. TIKA-21 - Simplified configuration code (jukka) |
| |
| 19. TIKA-17 - Rename all "Lius" classes to be "Tika" classes (jukka) |
| |
| 20. TIKA-30 - Added utility constructors to TikaConfig (K. Bennett & jukka) |
| |
| 21. TIKA-28 - Rename config.xml to tika-config.xml or similar (mattmann) |
| |
| 22. TIKA-26 - Use Map<String, Content> instead of List<Content> (jukka) |
| |
| 23. TIKA-31 - protected Parser.parse(InputStream stream, |
| Iterable<Content> contents) (jukka & K. Bennett) |
| |
| 24. TIKA-36 - A convenience method for getting a document's content's text |
| would be helpful (K. Bennett & mattmann) |
| |
| 25. TIKA-33 - Stateless parsers (jukka) |
| |
| 26. TIKA-38 - TXTParser adds a space to the content it reads from a file (K. Bennett & ridabenjelloun) |
| |
| 27. TIKA-35 - Extract MsOffice properties, use RereadableInputStream developed by K. Bennett (ridabenjelloun & K. Bennett) |
| |
| 28. TIKA-39 - Excel parsing improvements (siren & ridabenjelloun) |
| |
| 29. TIKA-34 - Provide a method that will return a default configuration |
| (TikaConfig) (K. Bennett & mattmann) |
| |
| 30. TIKA-42 - Content class needs (String, String, String) constructor (K. Bennett) |
| |
| 31. TIKA-43 - Parser interface (jukka) |
| |
| 32. TIKA-47 - Remove TikaLogger (jukka) |
| |
| 33. TIKA-46 - Use Metadata in Parser (jukka & mattmann) |
| |
| 34. TIKA-48 - Merge MS Extractors and Parsers (jukka) |
| |
| 35. TIKA-45 - RereadableInputStream needs to be able to read to |
| the end of the original stream on first rewind. (K. Bennett) |
| |
| 36. TIKA-41 - Resource files occur twice in jar file. (jukka) |
| |
| 37. TIKA-49 - Some files have old-style license headers, fixed (Robert Burrell Donkin & bdelacretaz) |
| |
| 38. TIKA-51 - Leftover temp files after running Tika tests, fixed (bdelacretaz) |
| |
| 39. TIKA-40 - Tika needs to support diverse character encodings (jukka) |
| |
| 40. TIKA-55 - ParseUtils.getParser() method variants should have consistent parameter orders |
| (K. Bennett) |
| |
| 41. TIKA-52 - RereadableInputStream needs to support not closing the input stream it wraps. |
| (K. Bennett via bdelacretaz) |
| |
| 42. TIKA-53 - XHTML SAX events from parsers (jukka) |
| |
| 43. TIKA-57 - Rename org.apache.tika.ms to org.apache.tika.parser.ms (jukka) |
| |
| 44. TIKA-62 - Use TikaConfig.getDefaultConfig() instead of a hardcoded |
| config path in TestParsers (jukka) |
| |
| 45. TIKA-58 - Replace jtidy html parser with nekohtml based parser (siren) |
| |
| 46. TIKA-60 - Rename Microsoft parser classes (jukka) |
| |
| 47. TIKA-63 - Avoid multiple passes over the input stream in Microsoft parsers |
| (jukka) |
| |
| 48. TIKA-66 - Use Java 5 features in org.apache.tika.mime (jukka) |
| |
| 49. TIKA-56 - Mime type detection fails with upper case file extensions such as "PDF" |
| (mattmann) |
| |
| 50. TIKA-65 - Add encoding detection support for HTML parser (siren) |
| |
| 51. TIKA-68 - Add dummy parser classes to be used as sentinels (jukka) |
| |
| 52. TIKA-67 - Add an auto-detecting Parser implementation (jukka) |
| |
| 53. TIKA-70 - Better MIME information for the Open Document formats (jukka) |
| |
| 54. TIKA-71 - Remove ParserConfig and ParserFactory (jukka) |
| |
| 55. TIKA-83 - Create an org.apache.tika.sax package for SAX utilities (jukka) |
| |
| 56. TIKA-84 - Add MimeTypes.getMimeType(InputStream) (jukka) |
| |
| 57. TIKA-85 - Add glob patterns from the ASF svn:eol-style documentation (jukka) |
| |
| 58. TIKA-100 - Structured PDF parsing (jukka) |
| |
| 59. TIKA-101 - Improve site and build (mattmann) |
| |
| 60. TIKA-102 - Parser implementations loading a large amount of content |
| into a single String could be problematic (Niall Pemberton) |
| |
| 61. TIKA-107 - Remove use of assertions for argument checking (Niall Pemberton) |
| |
| 62. TIKA-104 - Add utility methods to throw IOException with the cause |
| initialized (jukka & Niall Pemberton) |
| |
| 63. TIKA-106 - Remove dependency on Jakarta ORO - use JDK 1.4 Regex |
| (Niall Pemberton) |
| |
| 64. TIKA-111 - Missing license headers (jukka) |
| |
| 65. TIKA-112 - XMLParser improvement (ridabenjelloun) |