
Content Detection and Analysis

Explore the importance of content detection, challenges, and approaches using Apache Tika for MIME type detection, language identification, and metadata extraction in search engine architecture. Learn how to integrate parsing libraries effectively.

Presentation Transcript


  1. Content Detection and Analysis CSCI 572: Information Retrieval and Search Engines

  2. Outline • The Information Landscape • Importance of Content Detection • Challenges • Approaches

  3. The Information Landscape

  4. Proliferation of content types available • By some accounts, 16K to 51K content types* • What to do with content types? • Parse them • How? • Extract their text and structure • Index their metadata • In an indexing technology like Lucene, Solr, or Compass, or in the Google Search Appliance • Identify what language they belong to • N-grams • *Source: http://filext.com/

  5. Importance of content types

  6. Importance of content type detection

  7. Search Engine Architecture

  8. Goals • Identify and classify file types • MIME detection • Glob pattern • *.txt • *.pdf • URL • http://…pdf • ftp://myfile.txt • Magic bytes • Combination of the above methods • Classification means the reaction to each file type can be targeted
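
A rough illustration of how these signals combine: the hypothetical detector below (not from the slides) checks both the file-name glob and the leading magic bytes before reporting a PDF, since either signal alone can be fooled.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;

    // Hypothetical sketch: combine a glob (extension) check with a magic-byte check.
    public class NaiveTypeDetector {

        // Glob-style check on the file name, e.g. *.pdf
        static boolean matchesPdfGlob(String name) {
            return name.toLowerCase().endsWith(".pdf");
        }

        // Magic-byte check: PDF files begin with the bytes "%PDF-"
        static boolean matchesPdfMagic(Path file) throws IOException {
            byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
            try (InputStream in = Files.newInputStream(file)) {
                byte[] head = new byte[magic.length];
                int read = in.read(head);
                return read == magic.length && Arrays.equals(head, magic);
            }
        }

        public static void main(String[] args) throws IOException {
            Path file = Paths.get(args[0]);
            boolean isPdf = matchesPdfGlob(file.getFileName().toString()) && matchesPdfMagic(file);
            // Agreement between the two signals lets the reaction be targeted with confidence.
            System.out.println(isPdf ? "application/pdf" : "unknown");
        }
    }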

  9. Goals • Parsing • Based on MIME type, in an automated fashion • Extraction of Text and Metadata • Text content can be fed into • Search engine • Machine learning/Statistical analysis • Used to subset data from a formatted document • Metadata can be used for field/faceted search
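
Once the MIME type is known, parser selection can be automated. The sketch below is a simplified, hypothetical registry (ParserRegistry and TextParser are illustrative names, not from any library) that maps a detected type to the code that extracts its text.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of MIME-type-driven parser dispatch.
    interface TextParser {
        String extractText(byte[] content) throws Exception;
    }

    class ParserRegistry {
        private final Map<String, TextParser> parsers = new HashMap<>();

        void register(String mimeType, TextParser parser) {
            parsers.put(mimeType, parser);
        }

        // Pick the parser for the detected type; its text output can then be indexed,
        // and its metadata used for field/faceted search.
        String parse(String detectedMimeType, byte[] content) throws Exception {
            TextParser parser = parsers.get(detectedMimeType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser registered for " + detectedMimeType);
            }
            return parser.extractText(content);
        }
    }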

  10. Many custom applications and tools • You need this (an application) to read this (its file format)

  11. Third-party parsing libraries • Most of the custom applications come with software libraries and tools to read/write these files • Rather than re-invent the wheel, figure out a way to take advantage of them • Parsing text and structure is a difficult problem • Not all libraries parse text in equivalent manners • Some are faster than others • Some are more reliable than others

  12. Extraction of Metadata • Important to follow common metadata models • Dublin Core • Word Metadata • XMP • EXIF • Lots of standards and models out there • The use and extraction of common models allows for content intercomparison • Also standardizes the mechanisms for searching • You always know for file type X that field Y is there and of type String or Int or Date
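
As a concrete illustration (not from the slides), the snippet below keys extracted fields by Dublin Core element names, so a downstream search engine can rely on the same field names and value types regardless of the source file type; the values are hypothetical.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch: Dublin Core element names as uniform metadata keys.
    public class CommonMetadataExample {
        public static void main(String[] args) {
            Map<String, String> met = new LinkedHashMap<>();
            met.put("dc:title", "Quarterly Report");   // hypothetical values
            met.put("dc:creator", "Jane Doe");
            met.put("dc:format", "application/pdf");
            met.put("dc:date", "2010-04-01");
            // Fixed field names and types make faceted search work the same for every file type.
            met.forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }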

  13. Cancer Research Example

  14. Cancer Research Example • Attributes • Relationships

  15. Language Identification • Hard to parse out text and metadata from different languages • French document: J’aime la classe de CS 572! • Metadata: • Publisher: L’Universitaire de Californie en Etas-Unis de Sud • English document: I love the CS 572 class! • Metadata: • Publisher: University of Southern California • How to compare these 2 extracted texts and sets of metadata when they are in different languages?

  16. Methods for language identification • N-grams • Method of detecting the next character or set of characters in a sequence • Useful in determining whether small snippets of text come from a particular language or character set • Non-computational approaches • Tagging • Looking for common words or characters
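
A toy sketch of the character n-gram idea (this is not Tika's or Nutch's implementation): count overlapping trigrams in a snippet and compare the counts against per-language profiles built the same way.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: character trigram counts for language identification.
    public class CharNgrams {

        static Map<String, Integer> trigrams(String text) {
            Map<String, Integer> counts = new HashMap<>();
            String normalized = " " + text.toLowerCase().replaceAll("\\s+", " ") + " ";
            for (int i = 0; i + 3 <= normalized.length(); i++) {
                counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            // Trigrams such as "the" and "he " dominate English text, while " la" and "de "
            // are far more common in French; matching against stored profiles picks the language.
            System.out.println(trigrams("I love the CS 572 class!"));
            System.out.println(trigrams("J'aime la classe de CS 572!"));
        }
    }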

  17. Challenges • Ability to uniformly extract and present metadata • Scale • Extract on the fly, or extract during indexing? • Utility of content detection and analysis important both prior to indexing and after • Integrating third-party parsing libraries is difficult • Many intrinsic dependencies • Non-uniform extraction interfaces • Some don’t provide the same content • Slowdown

  18. Challenges • Language and charset detection is hard!

  19. Challenges • Maintenance of MIME type database as new MIMEs are constantly being identified • Ensuring portability since content type detection and identification is becoming more and more needed even outside of the search engine • Firefox, Safari, HTTPD, etc., all must know about MIME types

  20. Wrapup • Content detection and analysis • MIME detection • Parsing and integration of parsing libraries • Language identification • Charset identification • Common Metadata models and formats • Use in a number of areas within the domain of search engines

  21. Introduction to Apache Tika CSCI 572: Information Retrieval and Search Engines

  22. Outline • What is Tika? • Where did it come from? • What are the current versions of Tika? • What can it do?

  23. Apache Tika is… • A content analysis and detection toolkit • A set of Java APIs providing MIME type detection, language identification, and integration of various parsing libraries • A rich Metadata API for representing different Metadata models • A command line interface to the underlying Java code • A GUI interface to the Java code

  24. Tika’s (Brief) History • Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006 • Proposed as Lucene sub-project • Others interested, didn’t gain much traction • Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit • A Content Management System • Graduated from the Incubator to Lucene sub-project in 2008 • Graduated to Apache TLP in 2010

  25. Getting started rapidly • Download Tika from: • http://tika.apache.org/download.html • Grab tika-app-0.7.jar • alias tika "java -jar tika-app-0.7.jar" • tika < somefile.doc > extracted-text.xhtml • tika -m < somefile.doc > extracted.met

  26. Detecting MIME types from Java • String type = Tika.detect(…) • java.io.InputStream • java.io.File • java.net.URL • java.lang.String
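
The slide's Tika.detect(…) is shorthand; in the released API the detect methods are instance methods on an org.apache.tika.Tika facade object. A minimal sketch, assuming tika-core is on the classpath:

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;

    public class DetectExample {
        public static void main(String[] args) throws IOException {
            Tika tika = new Tika();

            // A File (or InputStream/URL) is detected from its name plus its magic bytes.
            String fromFile = tika.detect(new File(args[0]));

            // A bare String name falls back to glob patterns alone.
            String fromName = tika.detect("somefile.pdf");

            System.out.println(fromFile + " / " + fromName);
        }
    }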

  27. Adding new MIME types • Got XML?
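
The XML in question is Tika's MIME type definition format (the same shape as its tika-mimetypes.xml file). A hedged sketch of what a new entry might look like; the type name, glob pattern, and magic value below are hypothetical:

    <?xml version="1.0" encoding="UTF-8"?>
    <mime-info>
      <mime-type type="application/x-example-data">
        <sub-class-of type="application/octet-stream"/>
        <glob pattern="*.exd"/>
        <magic priority="50">
          <match value="EXDATA" type="string" offset="0"/>
        </magic>
      </mime-type>
    </mime-info>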

  28. Parsing • String content = Tika.parseToString(…) • InputStream • File • URL
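
A minimal sketch of one-call text extraction through the same Tika facade, assuming tika-core and tika-parsers are on the classpath:

    import java.io.File;
    import org.apache.tika.Tika;

    public class ParseToStringExample {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // Detects the MIME type, dispatches to the matching parser, and returns plain text.
            String content = tika.parseToString(new File(args[0]));
            System.out.println(content);
        }
    }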

  29. Streaming Parsing • Reader reader = Tika.parse(…) • InputStream • File • URL
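
For large documents the streaming form avoids buffering all of the extracted text in memory: parse() hands back a Reader that produces text as the document is parsed. A minimal sketch under the same classpath assumptions:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.Reader;
    import org.apache.tika.Tika;

    public class StreamingParseExample {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            try (Reader reader = tika.parse(new File(args[0]));
                 BufferedReader buffered = new BufferedReader(reader)) {
                String line;
                while ((line = buffered.readLine()) != null) {
                    System.out.println(line);   // extracted text arrives incrementally
                }
            }
        }
    }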

  30. Language Detection • LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile(FileUtils.readFileToString(new File(filename)))); • System.out.println(lang.getLanguage()); • Uses N-gram analysis included with Tika • Originating from Nutch • Can be improved
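
LanguageIdentifier can also be handed the text directly, which is the simpler path when a full LanguageProfile is not needed. A minimal sketch, assuming the org.apache.tika.language package from the Tika versions these slides cover:

    import org.apache.tika.language.LanguageIdentifier;

    public class LanguageExample {
        public static void main(String[] args) {
            LanguageIdentifier english = new LanguageIdentifier("I love the CS 572 class!");
            LanguageIdentifier french = new LanguageIdentifier("J'aime la classe de CS 572!");
            // getLanguage() returns ISO 639 codes such as "en" or "fr";
            // very short snippets may not be identified reliably.
            System.out.println(english.getLanguage() + " / " + french.getLanguage());
        }
    }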

  31. Metadata
      Metadata met = new Metadata(); // Dublin Core
      met.set(Metadata.FORMAT, "text/html");
      met.add(Metadata.FORMAT, "text/plain"); // multi-valued: add() appends a second value
      for (String v : met.getValues(Metadata.FORMAT)) System.out.println(v); // prints text/html, then text/plain
  • Other metadata models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)

  32. Running Tika in GUI form • tika --gui • The GUI window shows the extracted XHTML, e.g. <html xmlns:html="…"> <body> … </body> </html>

  33. Integrating Tika into your App • Maven • Ant • Eclipse • It's just a set of jars • tika-core • tika-parsers • tika-app • tika-bundle
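
For Maven users, a hedged sketch of the dependency declaration; tika-parsers pulls in tika-core transitively, and the version shown matches the 0.7 release mentioned earlier in the slides:

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>0.7</version>
    </dependency>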

  34. Wrapup • Lots more information at • http://tika.apache.org • Possible projects • Adding more parsers for content types • Omnigraffle? • Expanding ability to handle random access file parsing • Scientific data file formats • Improving language and charset detection

  35. Acknowledgements • Material inspired by Jukka Zitting’s talks • http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika • http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
