
News: Apache Tika 1.7 released (text content extraction toolkit)

Posted by 漂亮的石头 on 2015-01-17. Forum: Software News

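    Apache Tika 1.7 has been released. Tika is a toolkit for text extraction: it integrates POI and PDFBox and provides a unified interface for text extraction work. It also offers a convenient extension API for adding support for third-party file formats.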

    This release includes many improvements and bug fixes. The detailed list follows:

    * Fixed resource leak in OutlookPSTParser that caused TikaException
    when invoked via AutoDetectParser on Windows (TIKA-1506).

    * HTML tags are properly stripped from content by FeedParser

    * Tika Server support for selecting a single metadata key;
    wrapped MetadataEP into MetadataResource (TIKA-1499).

    * Tika Server support for JSON and XMP views of metadata (TIKA-1497).

    * Tika Parent uses dependency management to keep duplicate
    dependencies in different modules the same version (TIKA-1384).

    * Upgraded slf4j to version 1.7.7 (TIKA-1496).

    * Tika Server support for RecursiveParserWrapper's JSON output
    (endpoint=rmeta) equivalent to (TIKA-1451's) -J option
    in tika-app (TIKA-1498).

    * Tika Server support for providing the password for files on a
    per-request basis through the Password http header (TIKA-1494).

    * Simple support for the BPG (Better Portable Graphics) image format
    (TIKA-1491, TIKA-1495).

    * Prevent exceptions from being thrown for some malformed
    mp3 files (TIKA-1218).

    * Reformat pom.xml files to use two spaces per indent (TIKA-1475).

    * Fix warning of slf4j logger on Tika Server startup (TIKA-1472).

    * Tika CLI and GUI now have option to view JSON rendering of output
    of RecursiveParserWrapper (TIKA-1451).

    * Tika now integrates the Geospatial Data Abstraction Library
    (GDAL) for parsing hundreds of geospatial formats (TIKA-605,

    * ExternalParsers can now use regexes to specify dynamic keys

    * Thread safety issues in ImageMetadataExtractor were resolved

    * The ForkParser service is now registered in Activator

    * The Rome Library was upgraded to version 1.5 (TIKA-1435).

    * Add markup for files embedded in PDFs (TIKA-1427).

    * Extract files embedded in annotations in PDFs (TIKA-1433).

    * Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).

    * Add RecursiveParserWrapper (aka Jukka's and Nick's)
    RecursiveMetadataParser (TIKA-1329)

    * Add example for how to dump TikaConfig to XML (TIKA-1418).

    * Allow users to specify a tika config file for tika-app (TIKA-1426).

    * PackageParser includes the last-modified date from the archive
    in the metadata, when handling embedded entries (TIKA-1246)

    * Created a new Tesseract OCR Parser to extract text from images.
    Requires installation of Tesseract before use (TIKA-93).

    * Basic parser for older Excel formats, such as Excel 4, 5 and 95,
    which can get simple text, and metadata for Excel 5+95 (TIKA-1490)
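
Several of the Tika Server items above (selecting a single metadata key, JSON views of metadata, the rmeta endpoint, and the Password HTTP header) can be illustrated with a small request builder. This is only a sketch under stated assumptions: the changelog names the `rmeta` endpoint and the `Password` header, but the base URL `http://localhost:9998`, the `meta` and `meta/{key}` paths, the use of `PUT`, and the `Accept: application/json` header are assumptions about a typical Tika Server setup, and the helper names are hypothetical. The code only builds requests and does not contact a server.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a Tika Server listening locally; adjust to your deployment.
TIKA_SERVER = "http://localhost:9998"


def build_request(endpoint, payload, accept=None, password=None):
    """Build (but do not send) a PUT request for a Tika Server endpoint."""
    headers = {}
    if accept:
        headers["Accept"] = accept
    if password:
        # Per-request password for protected files (TIKA-1494).
        headers["Password"] = password
    return Request(
        f"{TIKA_SERVER}/{endpoint}",
        data=payload,
        headers=headers,
        method="PUT",
    )


def metadata_request(payload, key=None, as_json=True, password=None):
    """Request all metadata, or a single key (TIKA-1499), optionally as JSON (TIKA-1497)."""
    # Assumed path layout: "meta" for everything, "meta/<key>" for one key.
    endpoint = "meta" if key is None else "meta/" + quote(key, safe="")
    accept = "application/json" if as_json else None
    return build_request(endpoint, payload, accept=accept, password=password)


def recursive_metadata_request(payload, password=None):
    """Request RecursiveParserWrapper JSON output via the rmeta endpoint (TIKA-1498)."""
    return build_request("rmeta", payload, password=password)
```

To actually send one of these requests, pass it to `urllib.request.urlopen` with a server running; the response format depends on the server version, so check it against your own installation.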