
Unlocking Content Diversity with Apache Tika: A Guide to Extraction and Detection

Discover how Apache Tika enables efficient content type detection and extraction of text and metadata. Explore the tool's capabilities, history, and community involvement. Learn how Tika can benefit data management projects and educational endeavors. Get started with Tika now and delve into its powerful features.

agillham




Presentation Transcript


  1. Apache Tika: 1 point Oh! Chris A. Mattmann, NASA JPL/Univ. Southern California/ASF, mattmann@apache.org, November 9, 2011

  2. And you are? • Senior Computer Scientist at NASA JPL in Pasadena, CA USA • Software Architecture/Engineering Prof at Univ. of Southern California • Apache Member involved in • OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)

  3. Roadmap • 1st part of the talk • Why Tika? • What is Tika? • What are the current versions of Tika? • What can it do? • 2nd part of the talk • NASA Earth Science Data Systems • Data System Needs and Requirements • How does Tika help?

  4. The Information Landscape

  5. Proliferation of content types available • By some accounts, 16K to 51K content types* • What to do with content types? • Parse them • How? • Extract their text and structure • Index their metadata • In an indexing technology like Lucene, Solr, or in Google Appliance • Identify what language they belong to • Ngrams *http://filext.com/

  6. Importance of content types

  7. Importance of content type detection

  8. Search Engine Architecture

  9. Goals • Identify and classify file types • MIME detection • Glob pattern • *.txt • *.pdf • URL • http://…pdf • ftp://myfile.txt • Magic bytes • Combination of the above means • Classification means reaction can be targeted
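The detection strategies listed on this slide (glob patterns on a name or URL, magic bytes, and a combination of both) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Tika's implementation: the class name `MimeDetectSketch`, its rule tables, and the "content first, name second" ordering are all hypothetical choices made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a tiny, hypothetical rule set, not Tika's MIME database.
class MimeDetectSketch {
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        // Glob patterns reduced to simple suffix checks (*.txt, *.pdf).
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
        // Magic bytes: leading signatures of PDF and PNG files.
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    // Returns the type whose signature prefixes the header, or null if none does.
    static String detectByMagic(byte[] header) {
        for (Map.Entry<String, byte[]> rule : MAGIC.entrySet()) {
            byte[] sig = rule.getValue();
            if (header.length < sig.length) {
                continue;
            }
            boolean match = true;
            for (int i = 0; i < sig.length; i++) {
                if (header[i] != sig[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return rule.getKey();
            }
        }
        return null;
    }

    // Works for file names and URLs alike, since only the suffix is checked.
    static String detectByName(String nameOrUrl) {
        String lower = nameOrUrl.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : GLOBS.entrySet()) {
            if (lower.endsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return null;
    }

    // Combination: try the content first, then the name, then a generic fallback.
    static String detect(byte[] header, String nameOrUrl) {
        String type = detectByMagic(header);
        if (type == null) {
            type = detectByName(nameOrUrl);
        }
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader, "mislabeled.txt"));
        System.out.println(detect(new byte[0], "ftp://myfile.txt"));
    }
}
```

Checking the content before the name is one reasonable ordering, because a file extension is easy to get wrong while a leading signature usually is not; a real detector would weigh many more rules.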

  10. Tika is… • A content analysis and detection toolkit • A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries • A rich Metadata API for representing different Metadata models • A command line interface to the underlying Java code • A GUI interface to the Java code

  11. Tika’s (Brief) History • Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006 • Proposed as Lucene sub-project • Others interested, didn’t gain much traction • Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit • A Content Management System • Graduated from the Incubator to Lucene sub-project in 2008 • Graduated to Apache TLP in April 2010 • 40, 88 and 29 issues resolved in versions 1.0, 0.10, and 0.9

  12. Community • Mailing lists • User: 125 peeps, ~70 msg/mo. • Dev: 210 peeps, ~250 msg/mo. • Committers/PMC • 13 peeps • Large majority of them active • Releases • 11 releases so far • Just pushed out 1 point OH • http://s.apache.org/N0I Credit: svnsearch.org

  13. Use in the classroom • Have used Apache Tika for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class • Several student final projects have turned into contributions for the project and merit for the students • Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc.

  14. Some recent 1 point oh press

  15. Getting started rapidly…like now! • Download Tika from: • http://tika.apache.org/download.html • Grab tika-app-1.0.jar • alias tika "java -jar tika-app-1.0.jar" (csh syntax; in bash: alias tika="java -jar tika-app-1.0.jar") • tika < somefile.doc > extracted-text.xhtml • tika -m < somefile.doc > extracted.met • Works on Windows too (alias only on UNIX)

  16. A quick NASA dataset • Atmospheric Infrared Sounder Mission (AIRS) • Level 2 Cloud Clear Radiance Product • Grab it from here: • ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/ • Just grab the first file • java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf • Hopefully this worked for you; if not, blame… • Windows • And Bill Gates

  17. Detecting MIME types from Java • String type = new Tika().detect(…) • java.io.InputStream • java.io.File • java.net.URL • java.lang.String

  18. Adding new MIME types • Got XML? • Based on freedesktop.org spec (loosely)

  19. Many custom applications and tools • You need this: to read this:

  20. Third-party parsing libraries • Most of the custom applications come with software libraries and tools to read/write these files • Rather than re-invent the wheel, figure out a way to take advantage of them • Parsing text and structure is a difficult problem • Not all libraries parse text in equivalent manners • Some are faster than others • Some are more reliable than others

  21. Parsing • String content = new Tika().parseToString(…) • InputStream • File • URL

  22. Streaming Parsing • Reader reader = new Tika().parse(…) • InputStream • File • URL

  23. Extraction of Metadata • Important to follow common Metadata models • Dublin Core – any electronic resource • XMP – also general like Dublin Core • Word Metadata – specific to .doc, .ppt, etc. • EXIF – image related • Lots of standards and models out there • The use and extraction of common models allows for content intercomparison • All standardize mechanisms for searching • You always know for X file type that field Y is there and of type String or Int or Date

  24. Cancer Research Example

  25. Cancer Research Example Attributes Credit: A. Hart Relationships

  26. Tika Sponsoring the Any23 Project • Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011) • Any23 = “Anything to Triples” • Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.) • Related to Apache Jena • Looking for synergies between 2 efforts

  27. Metadata • Metadata met = new Metadata(); • // Dublin Core • met.set(Metadata.FORMAT, "text/html"); • // multi-valued: add() appends, whereas a second set() would replace the first value • met.add(Metadata.FORMAT, "text/plain"); • System.out.println(Arrays.toString(met.getValues(Metadata.FORMAT))); • Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.) • Run: tika --list-met-models
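The difference between replacing a field's value and appending to a multi-valued field can be sketched with the standard library alone. `SimpleMetadata` below is a hypothetical stand-in written for this illustration; it is not Tika's `Metadata` class and only mirrors the set/add/getValues idea under that assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container (not Tika's Metadata).
class SimpleMetadata {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // set: discard any existing values and store exactly one.
    void set(String name, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        fields.put(name, values);
    }

    // add: keep existing values and append another one.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns a copy so callers cannot modify the stored list; empty if the field is absent.
    List<String> getValues(String name) {
        return new ArrayList<>(fields.getOrDefault(name, new ArrayList<>()));
    }

    public static void main(String[] args) {
        SimpleMetadata met = new SimpleMetadata();
        met.set("format", "text/html");
        met.add("format", "text/plain");
        System.out.println(met.getValues("format"));
    }
}
```

Keeping every field as a list, even when it usually holds one value, is what lets a single API serve both single-valued and multi-valued metadata models.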

  28. Methods for language identification • N-grams • Method of detecting next character or set of characters in a sequence • Useful in determining whether small snippets of text come from a particular language or character set • Non-computational approaches • Tagging • Looking for common words or characters
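The N-gram idea can be sketched as follows: build a character trigram frequency profile per language, then pick the language whose profile is closest to the snippet's profile. This is a toy illustration under stated assumptions, not Tika's or Nutch's algorithm; the class `NgramLanguageSketch`, the two short training sentences, and the choice of cosine similarity are all made up for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy character-trigram language guesser; training text and scoring are illustrative assumptions.
class NgramLanguageSketch {

    // Counts overlapping character trigrams after lower-casing and collapsing non-letters to spaces.
    static Map<String, Integer> profile(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim();
        String padded = " " + cleaned + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    static double norm(Map<String, Integer> p) {
        double sum = 0;
        for (int c : p.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two trigram count vectors; 0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
    }

    // Returns the best-matching language code, or null if nothing overlaps at all.
    static String identify(String snippet, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> target = profile(snippet);
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(target, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Two tiny hand-written training sentences; real profiles need far more text.
    static Map<String, Map<String, Integer>> toyProfiles() {
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other animals"));
        profiles.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und die katze sitzt auf der matte"));
        return profiles;
    }

    public static void main(String[] args) {
        System.out.println(identify("the dog and the cat", toyProfiles()));
        System.out.println(identify("der hund und die katze", toyProfiles()));
    }
}
```

With profiles this small the guesser only works on snippets that resemble the training sentences; the point is the mechanism, not the accuracy.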

  29. Language Detection • LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename)))); • System.out.println(lang.getLanguage()); • Uses Ngram analysis included with Tika • Originating from Nutch • Can be improved

  30. Running Tika in GUI form • tika --gui <html xmlns:html=“…”><body> …</body> </html>

  31. Integrating Tika into your App • Maven • Ant • Eclipse • It’s just a set of jars • tika-core • tika-parsers • tika-app • tika-bundle • tika-server

  32. Some really great stuff in 1.0 (NICK ALREADY TALKED ABOUT THIS!!! Thunder stolen) • Super improved OSGi support • New tika-bundle module • Improved RTF parsing support, OO support, and parsing of Outlook email attachments • Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian • Improved PDF parsing (extract annotation)

  33. Things to watch out for • Deprecated APIs->gone • Recompile code • No more JDK 1.4 version of Tika • Upgrade

  34. Improvements to Tika • Adding more parsers for content types • Improve the JAX-RS server support • Expanding ability to handle random access file parsing • Scientific data file formats, some work on this • Leverage improvements in file representation TIKA-701, TIKA-654, TIKA-645, TIKA-153 • Geospatial parsing support through GDAL • Improving language and charset detection

  35. Part 2 Science Data Systems at NASA Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295

  36. NASA Ground Data Systems Credit: D. Woollard

  37. Context • NASA develops science data processing systems for multiple earth science missions • These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research • Typical characteristics • Remote sensing instruments that orbit the Earth multiple times daily • Data are acquired constantly • Complex algorithms convert instrument measurements to geophysical quantities

  38. The Square Kilometer Array • 1 sq. km of antennas • Never-before-seen resolution looking into the sky • 700 TB • Per second!

  39. NASA DESDynI Mission • 16 TB/day • Geographically distributed • 10s of 1000s of jobs per day • Tier 1 Earth Science Decadal Mission

  40. Some Considerations • Scale • Data throughput rates • # of data types • # of metadata types • # of users to send the data to • Federation • Must leave the data where it is • Socio/Economic/Political • Heterogeneity • Technology, data formats, skills!

  41. Apache OODT • We’ve got some components to deal with these issues

  42. How are we building these systems now? • Allow for push/pull of data over arbitrary protocols • Ingestion builds std catalog and archive • Deliver product metadata to search, portal or GIS • Plug in arbitrary met extractors
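The "plug in arbitrary met extractors" idea can be sketched as a registry of extractors keyed by MIME type, consulted during ingestion. Everything here is hypothetical and written only for illustration: `MetExtractorSketch`, its `MetExtractor` interface, and the field names are not OODT or Tika APIs, and the `application/x-hdf` type string is simply used as an example key.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of pluggable metadata extraction during ingestion (not an OODT API).
class MetExtractorSketch {

    // A plug-in point: given a file name, return whatever metadata the extractor can find.
    interface MetExtractor {
        Map<String, String> extract(String fileName);
    }

    private final Map<String, MetExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, MetExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    // Builds a catalog record: core fields first, then anything a matching extractor adds.
    Map<String, String> ingest(String fileName, String mimeType) {
        Map<String, String> met = new LinkedHashMap<>();
        met.put("Filename", fileName);
        met.put("MimeType", mimeType);
        MetExtractor extractor = byMimeType.get(mimeType);
        if (extractor != null) {
            met.putAll(extractor.extract(fileName));
        }
        return met;
    }

    public static void main(String[] args) {
        MetExtractorSketch ingester = new MetExtractorSketch();
        // Stand-in extractor; a real one would open the file and read its attributes.
        ingester.register("application/x-hdf", name -> {
            Map<String, String> met = new HashMap<>();
            met.put("Format", "HDF");
            return met;
        });
        System.out.println(ingester.ingest("granule.hdf", "application/x-hdf"));
    }
}
```

Keying extractors on the detected MIME type is what ties the two roles together: detection decides which extractor runs, and the extractor supplies the metadata that gets cataloged.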

  43. How are we building these systems now? • Separation of file management from workflow management • Allow for heterogeneous computing resources • Easily integrate PGEs • Leverages same ingestion crawler

  44. What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA! Metadata Ext: TIKA!

  45. What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA!

  46. Science Data File Formats • Hierarchical Data Format (HDF) • http://www.hdfgroup.org • Versions 4 and 5 • Lots of NASA data is in 4, newer NASA data in 5 • Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) • Custom readers/writers/APIs in many languages • C/C++, Python, Java

  47. Science Data File Formats • network Common Data Form (netCDF) • www.unidata.ucar.edu/software/netcdf/ • Versions 3 and 4 • Heavily used in DOE, NOAA, etc. • Encapsulates • Observation (Scalars, Vectors, Matrices, NxMxZ…) • Metadata (Summary info, date/time ranges, spatial ranges) • Custom readers/writers/APIs in many languages • C/C++, Python, Java • Not Hierarchical representation: all flat

  48. So how does it work? • Ingestion • Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format • Need to extract their met, catalog and archive them, etc. • Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability • Processing • Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive

  49. Tool support • Entire stacks of tools written around these formats • OPeNDAP, LAS, readers, writers, custom NASA mission toolkits • OGC • WMS, WCS, etc. • Unique, one-of-a-kind software built around these data file formats • Apache can contribute strongly in this area!

  50. Besides processing science files • …Tika also helps with • MIME identification • Useful in remote file acquisition • Useful in classification (catalog/archive) of existing content • Useful in crawling (see my Nutch talk last year: http://s.apache.org/UvU) • Language identification • Can be useful when data is coming from around the world and we need to quickly identify whether or not we can process it
