Unlocking Content Diversity with Apache Tika: A Guide to Extraction and Detection

Apache Tika: 1 point Oh! Chris A. Mattmann, NASA JPL/Univ. Southern California/ASF, mattmann@apache.org, November 9, 2011

And you are? • Senior Computer Scientist at NASA JPL in Pasadena, CA USA • Software Architecture/Engineering Prof at Univ. of Southern California • Apache Member involved in • OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)

Roadmap • 1st part of the talk • Why Tika? • What is Tika? • What are the current versions of Tika? • What can it do? • 2nd part of the talk • NASA Earth Science Data Systems • Data System Needs and Requirements • How does Tika help?

The Information Landscape

Proliferation of content types available • By some accounts, 16K to 51K content types* • What to do with content types? • Parse them • How? • Extract their text and structure • Index their metadata • In an indexing technology like Lucene, Solr, or in Google Appliance • Identify what language they belong to • Ngrams *http://filext.com/

Importance of content types

Importance of content type detection

Search Engine Architecture

Goals • Identify and classify file types • MIME detection • Glob pattern • *.txt • *.pdf • URL • http://…pdf • ftp://myfile.txt • Magic bytes • Combination of the above means • Classification means reaction can be targeted
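The combination of glob patterns and magic bytes listed above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not Tika's implementation: the class name SimpleMimeDetector, its two small lookup tables, and the rule that magic bytes take precedence over the name are all hypothetical choices made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: combines name (glob) and magic-byte evidence. Not Tika's code. */
class SimpleMimeDetector {
    // Hypothetical lookup tables; a real registry holds far more entries.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    /** Compare the leading bytes with each known signature; null if none match. */
    public static String detectByMagic(byte[] header) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            byte[] sig = e.getValue();
            if (header.length >= sig.length
                    && Arrays.equals(Arrays.copyOf(header, sig.length), sig)) {
                return e.getKey();
            }
        }
        return null;
    }

    /** Match the suffix of a file name or URL (the "*.txt" style glob); null if none match. */
    public static String detectByName(String nameOrUrl) {
        String lower = nameOrUrl.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (lower.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    /** Assumed policy: magic bytes win, then the name, then a generic fallback. */
    public static String detect(byte[] header, String nameOrUrl) {
        String type = (header != null) ? detectByMagic(header) : null;
        if (type == null && nameOrUrl != null) {
            type = detectByName(nameOrUrl);
        }
        return (type != null) ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfBytes = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // A PDF saved under a misleading name is still classified by its content.
        System.out.println(detect(pdfBytes, "report.txt"));
        // With no bytes available (e.g. only a URL is known), the name decides.
        System.out.println(detect(null, "ftp://myfile.txt"));
    }
}
```

Letting content override the name is one reasonable policy, since names are easy to get wrong; other orderings or weighted combinations are equally possible.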

Apache Tika is… • A content analysis and detection toolkit • A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries • A rich Metadata API for representing different Metadata models • A command line interface to the underlying Java code • A GUI interface to the Java code

Tika’s (Brief) History • Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006 • Proposed as Lucene sub-project • Others interested, didn’t gain much traction • Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit • A Content Management System • Graduated from the Incubator to Lucene sub-project in 2008 • Graduated to Apache TLP in April 2010 • 40, 88 and 29 issues resolved in versions 1.0, 0.10, and 0.9

Community • Mailing lists • User: 125 peeps, ~70 msg/mo. • Dev: 210 peeps, ~250 msg/mo. • Committers/PMC • 13 peeps • Large majority of them active • Releases • 11 releases so far • Just pushed out 1 point OH • http://s.apache.org/N0I Credit: svnsearch.org

Use in the classroom • Have used Apache Tika for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class • Several student final projects have turned into contributions for the project and merit for the students • Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc.

Some recent 1 point oh press

Getting started rapidly…like now! • Download Tika from: • http://tika.apache.org/download.html • Grab tika-app-1.0.jar • alias tika "java -jar tika-app-1.0.jar" (csh/tcsh; in bash/zsh: alias tika="java -jar tika-app-1.0.jar") • tika < somefile.doc > extracted-text.xhtml • tika -m < somefile.doc > extracted.met • Works on Windows too (alias only on UNIX)

A quick NASA dataset • Atmospheric Infrared Sounder Mission (AIRS) • Level 2 Cloud Clear Radiance Product • Grab it from here: • ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/ • Just grab the first file • java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf • Hopefully this worked for you, if not, blame… • Windows • And Bill Gates

Detecting MIME types from Java • String type = new Tika().detect(…) • java.io.InputStream • java.io.File • java.net.URL • java.lang.String

Adding new MIME types • Got XML? • Based on freedesktop.org spec (loosely)

Many custom applications and tools • You need this: to read this:

Third-party parsing libraries • Most of the custom applications come with software libraries and tools to read/write these files • Rather than re-invent the wheel, figure out a way to take advantage of them • Parsing text and structure is a difficult problem • Not all libraries parse text in equivalent manners • Some are faster than others • Some are more reliable than others

Parsing • String content = new Tika().parseToString(…) • InputStream • File • URL

Streaming Parsing • Reader reader = new Tika().parse(…) • InputStream • File • URL

Extraction of Metadata • Important to follow common Metadata models • Dublin Core – any electronic resource • XMP – also general like Dublin Core • Word Metadata – specific to .doc, .ppt, etc. • EXIF – image related • Lots of standards and models out there • The use and extraction of common models allows for content intercomparison • They all standardize mechanisms for searching • You always know that for file type X, field Y is there and is of type String, Int, or Date

Cancer Research Example

Cancer Research Example: Attributes and Relationships (Credit: A. Hart)

Tika Sponsoring the Any23 Project • Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011) • Any23 = “Anything to Triples” • Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.) • Related to Apache Jena • Looking for synergies between 2 efforts

Metadata • Metadata met = new Metadata(); • // Dublin Core • met.set(Metadata.FORMAT, "text/html"); • // multi-valued: add() appends a value, a second set() would replace it • met.add(Metadata.FORMAT, "text/plain"); • System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT))); • Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.) • Run: tika --list-met-models
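The multi-valued idea on this slide can be illustrated without Tika on the classpath. The sketch below is a hypothetical stand-in (the class MultiValuedMetadata and the key "format" are invented for the example, and it is not Tika's Metadata class); it only shows the difference between replacing a field's value and appending another value to the same field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued metadata container; not Tika's Metadata class. */
class MultiValuedMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    /** Replace whatever is stored under the key with a single value. */
    public void set(String key, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(key, list);
    }

    /** Append a value under the key, keeping the earlier ones. */
    public void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** All values for the key, in insertion order; empty array if the key is absent. */
    public String[] getValues(String key) {
        List<String> list = values.get(key);
        return (list == null) ? new String[0] : list.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MultiValuedMetadata met = new MultiValuedMetadata();
        met.set("format", "text/html");
        met.add("format", "text/plain");
        // Arrays.toString is needed because printing a String[] directly shows only its identity.
        System.out.println(Arrays.toString(met.getValues("format")));
    }
}
```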

Methods for language identification • N-grams • Method of detecting next character or set of characters in a sequence • Useful in determining whether small snippets of text come from a particular language, or character set • Non-computational approaches • Tagging • Looking for common words or characters

Language Detection • LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename)))); • System.out.println(lang.getLanguage()); • Uses Ngram analysis included with Tika • Originating from Nutch • Can be improved
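To make the n-gram approach concrete, here is a minimal sketch in plain Java. It is not the algorithm shipped with Tika or Nutch: the class NgramLanguageGuesser, the two toy training sentences, the fixed trigram size, and the use of cosine similarity as the distance measure are all assumptions chosen to keep the example short.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character-trigram language guesser; an illustration, not Tika's implementation. */
class NgramLanguageGuesser {

    /** Count overlapping character trigrams in the lower-cased text. */
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String t = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors; 0 if either is empty. */
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Tiny hypothetical training profiles; real profiles come from large corpora. */
    public static Map<String, Map<String, Integer>> sampleProfiles() {
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog "
                + "and the cat sat on the mat with the other animals"));
        profiles.put("de", profile("der hund und die katze sitzen auf der matte "
                + "und der fuchs springt mit den anderen tieren"));
        return profiles;
    }

    /** Return the code of the most similar profile, or "unknown" if nothing overlaps. */
    public static String guess(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> p = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = similarity(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> profiles = sampleProfiles();
        System.out.println(guess("the cat and the dog", profiles));
        System.out.println(guess("die katze und der fuchs", profiles));
    }
}
```

With profiles this small the guesser is easily fooled; the point is only that short snippets can be scored against per-language n-gram statistics without any dictionary.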

Running Tika in GUI form • tika --gui <html xmlns:html=“…”><body> …</body> </html>

Integrating Tika into your App • Maven • Ant • Eclipse • It’s just a set of jars • tika-core • tika-parsers • tika-app • tika-bundle • tika-server

Some really great stuff in 1.0 NICK ALREADY TALKED ABOUT THIS!!! Thunder stolen • Super improved OSGi support • New tika-bundle module • Improved RTF parsing support, OO support, and parsing of Outlook email attachments • Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian • Improved PDF parsing (extract annotation)

Things to watch out for • Deprecated APIs->gone • Recompile code • No more JDK 1.4 version of Tika • Upgrade

Improvements to Tika • Adding more parsers for content types • Improve the JAX-RS server support • Expanding ability to handle random access file parsing • Scientific data file formats, some work on this • Leverage improvements in file representation TIKA-701, TIKA-654, TIKA-645, TIKA-153 • Geospatial parsing support through GDAL • Improving language and charset detection

Part 2 Science Data Systems at NASA Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295

NASA Ground Data Systems Credit: D. Woollard

Context • NASA develops science data processing systems for multiple earth science missions • These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research • Typical characteristics • Remote sensing instruments that orbit the Earth multiple times daily • Data are acquired constantly • Complex algorithms convert instrument measurements to geophysical quantities

The Square Kilometer Array • 1 sq. km of antennas • Never-before-seen resolution looking into the sky • 700 TB • Per second!

NASA DESDynI Mission • 16 TB/day • Geographically distributed • 10s of 1000s of jobs per day • Tier 1 Earth Science Decadal Mission

Some Considerations • Scale • Data throughput rates • # of data types • # of metadata types • # of users to send the data to • Federation • Must leave the data where it is • Socio/Economic/Political • Heterogeneity • Technology, data formats, skills!

Apache OODT • We’ve got some components to deal with these issues

How are we building these systems now? • Allow for push/pull of data over arbitrary protocols • Ingestion builds std catalog and archive • Deliver product metadata to search, portal or GIS • Plug in arbitrary met extractors

How are we building these systems now? • Separation of file management from workflow management • Allow for heterogeneous computing resources • Easily integrate PGEs • Leverages same ingestion crawler

What does this have to do with Tika? • Metadata extraction: Tika! • MIME identification: Tika!

Science Data File Formats • Hierarchical Data Format (HDF) • http://www.hdfgroup.org • Versions 4 and 5 • Lots of NASA data is in 4, newer NASA data in 5 • Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) • Custom readers/writers/APIs in many languages • C/C++, Python, Java

Science Data File Formats • network Common Data Form (netCDF) • www.unidata.ucar.edu/software/netcdf/ • Versions 3 and 4 • Heavily used in DOE, NOAA, etc. • Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) • Custom readers/writers/APIs in many languages • C/C++, Python, Java • Not a hierarchical representation: all flat

So how does it work? • Ingestion • Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format • Need to extract their met, catalog and archive them, etc. • Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability • Processing • Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive

Tool support • Entire stacks of tools written around these formats • OPeNDAP, LAS, readers, writers, custom NASA mission toolkits • OGC • WMS, WCS, etc. • Unique, one-of-a-kind software built around these data file formats • Apache can contribute strongly in this area!

Besides processing science files • …Tika also helps with • MIME identification • Useful in remote file acquisition • Useful in classification (catalog/archive) of existing content • Useful in crawling (see my Nutch talk last year: http://s.apache.org/UvU) • Language identification • Can be useful when data is coming from around the world and we need to quickly identify whether or not we can process it
