Apache UIMA TikaAnnotator README file

INTRODUCTION

Apache Tika is a toolkit for detecting and extracting metadata 
and structured text content from various documents 
using existing parser libraries. 

TikaAnnotator uses Tika to generate annotations representing 
the original markup of a document, extract its text and metadata. 
It consists of three resources (see /desc):

- FileSystemCollectionReader : similar to the one in UIMA examples but uses 
  TIKA to extract the text from binary documents and generates annotations 
  to represent the markup
- MarkupAnnotator : takes the original content from a view and generates 
  a new view containing the extracted text with markup annotations
- TikaWrapper : utility class which allows to populate a CAS 
  from a binary document; used by the FileSystemCollectionReader
  
VERSION

This version wraps Tika 0.4.  In that version of Tika, the packaging 
for Tika was split into several parts.

The tika-core jar contains only the core client-visible classes and
interfaces and has zero dependencies beyond Java 5. All the actual
parser implementations and external parser dependencies are in the
tika-parsers jar. 

See http://lucene.apache.org/tika/gettingstarted.html for the full
details.

COMPILATION 

You can use the ANT script to compile the sources. 
Note that you need to add the Tika-jars in the /lib directory; 
it is recommended to use the Tika-*-standalone.jar 
which contains all the libraries used internally by Tika.

For more information on UIMA, see:
  http://incubator.apache.org/uima

For more information on Tika, see:
  http://incubator.apache.org/tika/
  
Crypto Notice
-------------

The binary distributions include cryptographic software.  The country in 
   which you currently reside may have restrictions on the import, 
   possession, use, and/or re-export to another country, of 
   encryption software.  BEFORE using any encryption software, please 
   check your country's laws, regulations and policies concerning the
   import, possession, or use, and re-export of encryption software, to 
   see if this is permitted.  See <http://www.wassenaar.org/> for more
   information.

   The U.S. Government Department of Commerce, Bureau of Industry and
   Security (BIS), has classified this software as Export Commodity 
   Control Number (ECCN) 5D002.C.1, which includes information security
   software using or performing cryptographic functions with asymmetric
   algorithms.  The form and manner of this Apache Software Foundation
   distribution makes it eligible for export under the License Exception
   ENC Technology Software Unrestricted (TSU) exception (see the BIS 
   Export Administration Regulations, Section 740.13) for both object 
   code and source code.

   The following provides more details on the included cryptographic
   software:
   
   The binary distributions include portions of Apache Tika, which, in
   turn, is classified as being controlled under ECCN 5D002.  
