Search And Text Analysis

An Introduction to Java-based Open Source Tools and Techniques
Grant Ingersoll
October 15, 2008
Charlotte JUG

Overview
• Background
  • Taming Text
  • Importance
• Foundations
  • Language Basics
  • Obtaining Text
• Tools for Search and Text Analysis
  • Concepts
  • Demos
• Resources

Taming Text
• http://www.manning.com/ingersoll
• Grant Ingersoll
  • Lucene/Solr/Mahout committer
• Tom Morton
  • OpenNLP author
• Practical aspects of Text Analysis
  • Open Source libraries
  • No math :-)
• In progress - Early Access Available
• Code:
  • Solr as platform for enabling NLP
  • Open source

Why is Search and Text Analysis Important?

Quiz
1. Can you read this?
2. How about this?
3. How many emails did you get today?
4. How many websites/articles/books did you read today?
5. How many searches did you do? (Google, Y!, local, proprietary)
6. How much time did you spend doing all of these things?
7. How did you do it? Literally, what processes did you use?

Importance: Text is Hard!
• “Numbers don’t lie!”
  • Unless you’re a politician?
• Text “lies” all the time!
  • Even people disagree on it
• Computer Expectations:
  • Be as good as people at a task (which isn’t perfect)

Importance: Info Overload
• IDC estimates:
  • We generated 161 exabytes of digital info in 2006
• Even if a lot is non-textual, we deal with it by writing about it:
  • Tags, summaries, reports, closed captioning, etc.
• Info Workers spend, in hours per week:
  • 14.5 hours reading and answering email
  • 13.3 creating documents
  • 9.6 searching for info
  • 9.5 analyzing info

Importance: Intelligent Web
• The Web is driven by text
• The present and future of the web are based on intelligence, as in:
  • Human: a.k.a. “The Masses”
    • Ratings, reviews, connections
    • Pros: dynamic, ad hoc or guided
    • Cons: sloppy, cheating, ad hoc
  • Artificial:
    • Pros: Can do a good/decent job on a lot of things, cheaper than manual
    • Cons: Can be really hard, not always as good as people
• Examples:
  • Google, Y!, Amazon, Facebook, LinkedIn, Blogs, Wikipedia, many, many startups

Importance: Text and You
• You:
  • Personal Organization
    • Email, IM, Docs
    • Importance, Prioritization, Organization
  • Career
    • Always work on hard problems
    • In-demand skill

Importance: Your Company
• Make sense of disparate sources of info to gain competitive advantage
• Reduce time/expense to understand large volumes of data
• Enhance productivity
• Mine connections/relationships

Foundations

Pieces of the Text Pie
• Characters
  • Encoding, case, punctuation, accents, numbers
• Tokens/Words
  • Segmentation, Parts of Speech, Stemming
• Multi-word and Sentences
  • Phrases, parsing, sentence detection, co-reference resolution
• Paragraphs
  • Summarization, meaning

Pieces II
• Document
  • Meaning
  • Reading Level
• Multi-document/Corpus
  • Summaries, similar docs
• You/Author
  • Beliefs, knowledge, culture, training

Fields of Interest
• Information Retrieval (IR)
• Natural Language Processing (NLP)
• Computational Linguistics
• Math/Statistics
• Artificial Intelligence
• Biology

Search and Text Analysis in the Real World
• Focus on Search
  • Most robust, but far from perfect
• Integrate others into Search platform

Obtaining Text
• It’s everywhere, but is it how we want it?
• Crawl file/web
• DBs
• CMS
• Many different file formats
  • Office, PDF, HTML, XML
• Need to extract usable content

Extracting Text
• Many open source tools exist:
  • PDFBox, POI, TextMining, SAX, DOM, StAX, NekoHTML, HTMLParser, etc.
• Use a framework instead of one-offs for each tool
  • Common API for all tools
  • Aperture: http://aperture.sourceforge.net/
    • Crawlers, extractors, RDF
  • Tika: http://incubator.apache.org/tika/
    • SAX-like plus metadata

Text Applications
• Find items that meet an information need
• Identify important people, places, things
• Fuzzy Strings
• Categorization and Classification
• Organize groups of documents
• Answer questions
• Much, much more:
  • Sentiment, Machine Translation, Summarization…

Search

Search Concepts
• User inputs one or more keywords along with some operators and expects to get back a ranked list of documents relevant to the keywords
• User sorts through the documents, reading/using those he thinks are most relevant
• The user’s relevant docs do not always equal the search engine’s

Making Content Searchable
• Search engines generally:
  • Extract Tokens from Content
  • Optionally transform said tokens depending on needs
    • Stemming
    • Expand with synonyms (usually done at query time)
    • Remove tokens (stopwords)
    • Other Text Analysis
    • Add metadata
  • Store tokens and related metadata (position, etc.) in a data structure optimized for searching
    • Called an Inverted Index
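The steps on this slide can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not how Lucene or Solr implement it: the class name InvertedIndexSketch, the six-word stopword list, and the choice to record positions after stopword removal are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical class for illustration; not part of Lucene or Solr.
class InvertedIndexSketch {
    // Tiny made-up stopword list; real analyzers use longer, configurable lists.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // term -> (docId -> positions of the term among the kept tokens of that doc)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Extract tokens: lowercase, split on anything that is not a letter or digit,
    // then drop stopwords.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOPWORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Store each token with its document id and position.
    void add(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Ids of the documents that contain the (lowercased) term.
    Set<Integer> search(String term) {
        Map<Integer, List<Integer>> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits.keySet();
    }
}
```

Looking up a term is a single map access, which is why the structure is called "inverted": it maps from terms to documents rather than from documents to terms.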

Libraries
• Apache Lucene
• Apache Solr
• Sphinx
• Minion
• Xapian

Apache Solr
• Lucene-based Search server
• HTTP-based, but many native clients
• Lucene best practices
• Replication/Distribution
• Caching
• Plug and Play extensions
• http://lucene.apache.org/solr

People, Places, Things http://news.yahoo.com/s/ap/20081013/ap_on_sp_fo_ne/fbn_cowboys_romo_10

Named Entity Recognition
• Identify people, places, things, numerical quantities
• Approaches
  • Rule-based
    • Write rules to extract
    • Lists, gazetteers, others
  • Statistical
    • Annotate data and learn stats
    • Change domains, languages, etc.
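A rough sketch of the rule-based approach, assuming a tiny hand-built gazetteer plus one regular expression for numbers. The class name GazetteerNerSketch, the gazetteer entries, and the TYPE:text output format are hypothetical choices for this example; statistical tools such as OpenNLP work very differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not the API of any NER library.
class GazetteerNerSketch {
    // Made-up gazetteer: surface form -> entity type.
    private static final Map<String, String> GAZETTEER = new LinkedHashMap<>();
    static {
        GAZETTEER.put("Grant Ingersoll", "PERSON");
        GAZETTEER.put("Tom Morton", "PERSON");
        GAZETTEER.put("Charlotte", "LOCATION");
    }

    // One hand-written rule: integers or decimals count as numerical quantities.
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+(\\.\\d+)?\\b");

    // Returns "TYPE:text" strings, gazetteer matches first, then numbers.
    static List<String> tag(String text) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, String> entry : GAZETTEER.entrySet()) {
            // Match the list entry literally, on word boundaries.
            Matcher m = Pattern.compile("\\b" + Pattern.quote(entry.getKey()) + "\\b")
                    .matcher(text);
            while (m.find()) {
                found.add(entry.getValue() + ":" + m.group());
            }
        }
        Matcher n = NUMBER.matcher(text);
        while (n.find()) {
            found.add("NUMBER:" + n.group());
        }
        return found;
    }
}
```

The weakness is visible right away: any name missing from the list is never found, which is one reason statistical approaches learn from annotated data instead.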

Libraries
• OpenNLP
• MinorThird
• Stanford NER
• MALLET
• LingPipe (dual license)
• OpenCalais (dual)

OpenNLP
• Maximum Entropy library
• Parser
• Chunker
• Sentence Detection
• NER

Fuzzy Strings
• Spell checking
• Record Matching
  • Address book merging
  • US Census
• Document/Question Similarity
• Log analysis
• De-duplication

Strings
• Algorithms
  • Edit Distance (Levenshtein)
  • Jaro-Winkler
  • Many others
• Libraries
  • Regular Expressions
  • SecondString
  • Lucene Spell Checker (contrib)
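Edit distance is small enough to show in full. The sketch below is the textbook dynamic-programming version of Levenshtein distance with unit costs for insert, delete, and substitute; the class name EditDistanceSketch is made up, and this is not the code used by the Lucene spell checker or SecondString.

```java
// Hypothetical class for illustration.
class EditDistanceSketch {
    // Minimum number of single-character insertions, deletions, and
    // substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row of the table
        int[] curr = new int[b.length() + 1]; // distances for the row being filled
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // empty prefix of a -> j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // empty prefix of b -> i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev; // reuse the two rows instead of a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A spell checker built on this would suggest dictionary words within a small distance of the input, for example treating a word one edit away as a likely correction.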

Organization: Classification http://www.dmoz.org

C & C
• Automatically label content based on one or more categories
  • Supervised
  • Unsupervised
• Useful for:
  • Spam
  • Genre (sports, business, tech, etc.)

Libraries
• Mahout
  • Naïve Bayes
  • Genetic
• OpenNLP
• libSVM
• Neural Network implementations

Organization: Clustering

Clustering
• Group similar content into clusters for easy browsing
• Types:
  • Search Results
  • Documents
  • Data
• Approaches:
  • K-Means
  • Mean-shift
  • Hierarchical
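To make K-Means concrete, here is a minimal sketch of the classic assign-then-update loop over numeric vectors. It assumes the caller supplies the initial centroids and uses squared Euclidean distance; the class and method names are hypothetical, and Mahout's implementation is distributed and considerably more involved.

```java
import java.util.Arrays;

// Hypothetical class for illustration; not Mahout's API.
class KMeansSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the cluster index for each point; centroids are updated in place.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach every point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int i = 0; i < sum.length; i++) {
                            sum[i] += points[p][i];
                        }
                        count++;
                    }
                }
                if (count > 0) { // leave an empty cluster's centroid where it is
                    for (int i = 0; i < sum.length; i++) {
                        centroids[c][i] = sum[i] / count;
                    }
                }
            }
        }
        return assignment;
    }
}
```

For documents, each vector would typically hold term weights, so that documents sharing vocabulary end up in the same cluster.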

Libraries
• Carrot2 (search results)
  • Many different approaches
• Mahout
  • k-Means
  • Mean Shift
  • Canopy
• SOLR-769
  • https://issues.apache.org/jira/browse/SOLR-769
• Various others

Q & A http://www.answers.com/who%20is%20Bobby%20Orr%3F

Question Answering
• Find the answer to a question
  • Phrase, sentence, passage, document(s)
• Combination of a lot of the previous parts
• Difficult
  • Easier:
    • Who is John Wayne?
  • Hard (impossible?):
    • What are the pros and cons of the bailout package?

Libraries
• QANDA
• OpenEphyra
• Taming Text
  • Future
  • Fact-based
  • Demo only
• Some others

Resources
• http://lucene.apache.org
  • /solr
  • /java
  • /mahout
• http://opennlp.sourceforge.net
• http://project.carrot2.org/
• gsingers@a.o
  • a.o == apache.org

Demo
• Download from
• Unzip
• cd apache-solr-1.3.0/example
• java -jar start.jar
• In another terminal, cd example/exampledocs
• java -jar post.jar *.xml

Vector Space Model
(Figure: document vector d1 and query vector q1 separated by the angle Θ)
• Goal: Identify documents that are similar to input query
• Represent each word with a weight w
• The words in the document and the query each define a Vector in an n-dimensional space
• Common weighting scheme is called TF-IDF
  • TF = Term Frequency
  • IDF = Inverse Document Freq.
  • Intuition behind TF-IDF:
    • A term that frequently occurs in a few documents relative to the collection is more important than one that occurs in a lot of documents
• $\mathrm{Sim}(q_1, d_1) = \cos\Theta$
  • $d_j = \langle w_{1,j}, w_{2,j}, \ldots, w_{n,j} \rangle$
  • $q = \langle w_{1,q}, w_{2,q}, \ldots, w_{n,q} \rangle$
  • $w$ = weight assigned to term
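The vector space model can be sketched as two small functions: one that turns a tokenized text into TF-IDF weights and one that computes the cosine of the angle between two weight vectors. The names below are hypothetical, and the weighting is an assumption chosen for the example (raw term count times a smoothed IDF of ln((1 + N) / (1 + df)) + 1); Lucene's scoring formula is different.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; not Lucene's scoring code.
class VectorSpaceSketch {
    // TF-IDF weights for one tokenized text, with document frequencies
    // taken from the tokenized collection.
    static Map<String, Double> tfIdf(List<String> tokens, List<List<String>> collection) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : tokens) {
            weights.merge(term, 1.0, Double::sum); // TF: raw count of the term
        }
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            int df = 0; // number of documents containing the term
            for (List<String> doc : collection) {
                if (doc.contains(entry.getKey())) {
                    df++;
                }
            }
            // Smoothed IDF (an assumption): rarer terms get larger weights.
            double idf = Math.log((1.0 + collection.size()) / (1.0 + df)) + 1.0;
            entry.setValue(entry.getValue() * idf);
        }
        return weights;
    }

    // Sim(q, d) = cos(theta) = (q . d) / (|q| * |d|); 0 if either vector is empty.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0;
        double qNorm = 0;
        double dNorm = 0;
        for (Map.Entry<String, Double> entry : q.entrySet()) {
            dot += entry.getValue() * d.getOrDefault(entry.getKey(), 0.0);
            qNorm += entry.getValue() * entry.getValue();
        }
        for (double w : d.values()) {
            dNorm += w * w;
        }
        return (qNorm == 0 || dNorm == 0) ? 0 : dot / Math.sqrt(qNorm * dNorm);
    }
}
```

Ranking a result list then amounts to computing the cosine between the query vector and every candidate document vector and sorting by that score, highest first.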
