Search And Text Analysis An Introduction to Java-based Open Source Tools and Techniques Grant Ingersoll October 15, 2008 Charlotte JUG
Overview • Background • Taming Text • Importance • Foundations • Language Basics • Obtaining Text • Tools for Search and Text Analysis • Concepts • Demos • Resources
Taming Text • http://www.manning.com/ingersoll • Grant Ingersoll • Lucene/Solr/Mahout committer • Tom Morton • OpenNLP author • Practical aspects of Text Analysis • Open Source libraries • No math :-) • In progress - Early Access Available • Code: • Solr as platform for enabling NLP • Open source
Quiz 1. Can you read this? 2. How about this? 3. How many emails did you get today? 4. How many websites/articles/books did you read today? 5. How many searches did you do? (Google, Y!, local, proprietary) 6. How much time did you spend doing all of these things? 7. How did you do it? Literally, what processes did you use?
Importance: Text is Hard! • “Numbers don’t lie!” • Unless you’re a politician? • Text “lies” all the time! • Even people disagree on it • Computer Expectations: • Be as good as people at a task (which isn’t perfect)
Importance: Info Overload • IDC estimates: • We generated 161 exabytes of digital info in 2006 • Even if a lot is non-textual, we deal with it by writing about it: • Tags, summaries, reports, closed captioning, etc. • Info Workers spend, in hours per week: • 14.5 hours reading and answering email • 13.3 creating documents • 9.6 searching for info • 9.5 analyzing info
Importance: Intelligent Web • The Web is driven by text • The present and future of the web are based on intelligence, as in: • Human: a.k.a. “The Masses” • Ratings, reviews, connections • Pros: dynamic, ad hoc or guided • Cons: sloppy, cheating, ad hoc • Artificial: • Pros: Can do a good/decent job on a lot of things, cheaper than manual • Cons: Can be really hard, not always as good as people • Examples: • Google, Y!, Amazon, Facebook, LinkedIn, Blogs, Wikipedia, many, many startups
Importance: Text and You • You: • Personal Organization • Email, IM, Docs • Importance, Prioritization, Organization • Career • Always work on hard problems • In-demand skill
Importance: Your Company • Make sense of disparate sources of info to gain competitive advantage • Reduce time/expense to understand large volumes of data • Enhance productivity • Mine connections/relationships
Pieces of the Text Pie • Characters • Encoding, case, punctuation, accents, numbers • Tokens/Words • Segmentation, Parts of Speech, Stemming • Multi-word and Sentences • Phrases, parsing, sentence detection, co-reference resolution • Paragraphs • Summarization, meaning
Pieces II • Document • Meaning • Reading Level • Multi-document/Corpus • Summaries, similar docs • You/Author • Beliefs, knowledge, culture, training
Fields of Interest • Information Retrieval (IR) • Natural Language Processing (NLP) • Computational Linguistics • Math/Statistics • Artificial Intelligence • Biology
Search and Text Analysis in the Real World • Focus on Search • Most robust, but far from perfect • Integrate others into Search platform
Obtaining Text • It’s everywhere, but is it how we want it? • Crawl file/web • DBs • CMS • Many different file formats • Office, PDF, HTML, XML • Need to extract usable content
Extracting Text • Many open source tools exist: • PDFBox, POI, TextMining, SAX, DOM, StAX, NekoHTML, HTMLParser, etc. • Use a framework instead of one-offs for each tool • Common API for all tools • Aperture: http://aperture.sourceforge.net/ • Crawlers, extractors, RDF • Tika: http://incubator.apache.org/tika/ • SAX-like plus metadata
Text Applications • Find items that meet an information need • Identify important people, places, things • Fuzzy Strings • Categorization and Classification • Organize groups of documents • Answer questions • Much, much more: • Sentiment, Machine Translation, Summarization…
Search Concepts • User inputs one or more keywords along with some operators and expects to get back a ranked list of documents relevant to the keywords • User sorts through the documents, reading/using those they think are most relevant • The user’s idea of which docs are relevant does not always match the search engine’s
Making Content Searchable • Search engines generally: • Extract tokens from content • Optionally transform said tokens depending on needs • Stemming • Expand with synonyms (usually done at query time) • Remove tokens (stopwords) • Other Text Analysis • Add metadata • Store tokens and related metadata (position, etc.) in a data structure optimized for searching • Called an Inverted Index
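The steps on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Lucene or Solr implement it: the class name `TinyInvertedIndex`, the stopword list, and the tokenization rule (lowercase, split on non-alphanumerics) are all assumptions made for the example, and stemming and synonym expansion are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index; real engines use far more compact structures.
class TinyInvertedIndex {
    // Assumed stopword list, chosen only for this illustration.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "is"));

    // term -> postings; each posting is {docId, token position}
    private final Map<String, List<int[]>> postings = new TreeMap<>();

    // Extract tokens: lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Drop stopwords and store each remaining token with its position.
    void add(int docId, String text) {
        int position = 0;
        for (String token : tokenize(text)) {
            if (!STOPWORDS.contains(token)) {
                postings.computeIfAbsent(token, k -> new ArrayList<>())
                        .add(new int[] {docId, position});
            }
            position++; // positions still count removed stopwords
        }
    }

    // Look up the ids of all documents that contain the term.
    Set<Integer> search(String term) {
        Set<Integer> docs = new TreeSet<>();
        List<int[]> hits = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (int[] posting : hits) {
            docs.add(posting[0]);
        }
        return docs;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "The quick brown fox");
        index.add(2, "A fox and a hound");
        System.out.println(index.search("fox")); // prints [1, 2]
    }
}
```

Lookup goes from term to documents rather than scanning every document, which is the property that makes the structure "inverted".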
Libraries • Apache Lucene • Apache Solr • Sphinx • Minion • Xapian
Apache Solr • Lucene-based Search server • HTTP-based, but many native clients • Lucene best practices • Replication/Distribution • Caching • Plug and Play extensions • http://lucene.apache.org/solr
People, Places, Things http://news.yahoo.com/s/ap/20081013/ap_on_sp_fo_ne/fbn_cowboys_romo_10
Named Entity Recognition • Identify people, places, things, numerical quantities • Approaches • Rule-based • Write rules to extract • Lists, gazetteers, others • Statistical • Annotate data and learn stats • Change domains, languages, etc.
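A minimal sketch of the rule-based approach, using a gazetteer (a list of known names) with longest-match lookup. The class `GazetteerTagger`, its entries, and the three-token window are hypothetical choices for illustration; this is not the API of OpenNLP or any other library, and a statistical tagger would replace the list with a model learned from annotated data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical gazetteer-based tagger, for illustration only.
class GazetteerTagger {
    private static final int MAX_PHRASE_TOKENS = 3; // assumed window size

    // normalized phrase -> entity type
    private final Map<String, String> gazetteer = new LinkedHashMap<>();

    void addEntry(String phrase, String type) {
        gazetteer.put(phrase.toLowerCase(Locale.ROOT), type);
    }

    // Scan left to right, preferring the longest phrase found in the gazetteer.
    List<String> tag(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            boolean matched = false;
            int longest = Math.min(MAX_PHRASE_TOKENS, tokens.length - i);
            for (int len = longest; len >= 1 && !matched; len--) {
                String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                // Strip punctuation and lowercase before the lookup.
                String key = phrase.replaceAll("[^\\p{L}\\p{Nd} ]", "")
                        .toLowerCase(Locale.ROOT);
                String type = gazetteer.get(key);
                if (type != null) {
                    found.add(type + ":" + key);
                    i += len;
                    matched = true;
                }
            }
            if (!matched) {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        GazetteerTagger tagger = new GazetteerTagger();
        tagger.addEntry("Tony Romo", "PERSON"); // hypothetical entries
        tagger.addEntry("Dallas", "PLACE");
        System.out.println(tagger.tag("Tony Romo plays for Dallas."));
        // prints [PERSON:tony romo, PLACE:dallas]
    }
}
```

The weakness of pure lists shows up quickly: names not in the gazetteer are missed and ambiguous strings are tagged blindly, which is one reason the statistical approach is listed alongside it.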
Libraries • OpenNLP • Minor Third • Stanford NER • Mallet • LingPipe (dual license) • OpenCalais (dual)
OpenNLP • Maximum Entropy library • Parser • Chunker • Sentence Detection • NER
Fuzzy Strings • Spell checking • Record Matching • Address book merging • US Census • Document/Question Similarity • Log analysis • De-duplication
Strings • Algorithms • Edit Distance (Levenshtein) • Jaro-Winkler • Many others • Libraries • Regular Expressions • SecondString • Lucene Spell Checker (contrib)
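As an illustration of the first algorithm on this slide, here is a sketch of Levenshtein edit distance using the standard dynamic-programming recurrence with two rolling rows. The class and method names are made up for the example; this is not the code used by SecondString or the Lucene spell checker.

```java
// Hypothetical helper class, for illustration only.
class EditDistance {
    // Minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

A spell checker could, for example, rank dictionary words by this distance to the misspelled input and suggest the closest ones.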
Organization: Classification http://www.dmoz.org
C & C • Automatically label content based on one or more categories • Supervised • Unsupervised • Useful for: • Spam • Genre (sports, business, tech, etc.)
Libraries • Mahout • Naïve Bayes • Genetic • OpenNLP • libSVM • Neural Network implementations
Clustering • Group similar content into clusters for easy browsing • Types: • Search Results • Documents • Data • Approaches: • K-Means • Mean-shift • Hierarchical
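To make the K-Means approach concrete, here is a small sketch that alternates the two classic steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The class name, the caller-supplied starting centroids, and the use of squared Euclidean distance are assumptions for the example; it is not Mahout's implementation, and real document clustering would run on term-weight vectors rather than 2-D points.

```java
import java.util.Arrays;

// Hypothetical k-means sketch, for illustration only.
class KMeansSketch {
    // Returns the cluster index of each point. The centroids array is
    // updated in place and holds the final cluster centers on return.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: pick the nearest centroid for every point.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += points[p][d];
                        }
                        count++;
                    }
                }
                if (count > 0) { // leave empty clusters where they are
                    for (int d = 0; d < sum.length; d++) {
                        centroids[c][d] = sum[d] / count;
                    }
                }
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double total = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            total += diff * diff;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9.5}};
        double[][] centroids = {{0, 0}, {10, 10}}; // assumed starting guesses
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
        // prints [0, 0, 1, 1]
    }
}
```

The result depends on the starting centroids and on the choice of k, which is one reason alternatives such as mean-shift and hierarchical clustering appear on the same slide.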
Libraries • Carrot2 (search results) • Many different approaches • Mahout • k-Means • Mean Shift • Canopy • SOLR-769 • https://issues.apache.org/jira/browse/SOLR-769 • Various others
Q & A http://www.answers.com/who%20is%20Bobby%20Orr%3F
Question Answering • Find the answer to a question • Phrase, sentence, passage, document(s) • Combination of a lot of the previous parts • Difficult • Easier • Who is John Wayne? • Hard (impossible?): • What are the pros and cons of the bailout package?
Libraries • QANDA • OpenEphyra • Taming Text • Future • Fact-based • Demo only • Some others
Resources • http://lucene.apache.org • /solr • /java • /mahout • http://opennlp.sourceforge.net • http://project.carrot2.org/ • gsingers@a.o • a.o ==apache.org
Demo • Download from • Unzip • cd apache-solr-1.3.0/example • java -jar start.jar • In another terminal, cd example/exampledocs • java -jar post.jar *.xml
Vector Space Model • Goal: Identify documents that are similar to input query • Represent each word with a weight w • The words in the document and the query each define a Vector in an n-dimensional space • Common weighting scheme is called TF-IDF • TF = Term Frequency • IDF = Inverse Document Freq. • Intuition behind TF-IDF: • A term that frequently occurs in a few documents relative to the collection is more important than one that occurs in a lot of documents • Sim(q1, d1) = cos Θ, where Θ is the angle between query vector q1 and document vector d1 • d_j = <w_{1,j}, w_{2,j}, …, w_{n,j}> • q = <w_{1,q}, w_{2,q}, …, w_{n,q}> • w = weight assigned to term
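The model above can be sketched as follows: weight each term by TF-IDF, then score a document against a query by the cosine of the angle between their weight vectors. The class name and the particular IDF formula (a smoothed logarithm) are assumptions for the example; the slide does not specify a formula, and Lucene's actual scoring differs in its details.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical vector space sketch, for illustration only.
class VectorSpaceSketch {
    // Builds a sparse TF-IDF vector (term -> weight) for one token list.
    static Map<String, Double> tfIdf(List<String> tokens, List<List<String>> corpus) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : tokens) {
            weights.merge(term, 1.0, Double::sum); // raw term frequency
        }
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            int docFreq = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(entry.getKey())) {
                    docFreq++;
                }
            }
            // Assumed smoothed IDF: terms found in fewer docs get larger weights.
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + docFreq)) + 1.0;
            entry.setValue(entry.getValue() * idf);
        }
        return weights;
    }

    // Sim(q, d) = cos(theta) = (q . d) / (|q| * |d|)
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0;
        double qNormSquared = 0;
        double dNormSquared = 0;
        for (Map.Entry<String, Double> entry : q.entrySet()) {
            dot += entry.getValue() * d.getOrDefault(entry.getKey(), 0.0);
            qNormSquared += entry.getValue() * entry.getValue();
        }
        for (double weight : d.values()) {
            dNormSquared += weight * weight;
        }
        if (qNormSquared == 0 || dNormSquared == 0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(qNormSquared) * Math.sqrt(dNormSquared));
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("solr", "search", "server");
        List<String> d2 = Arrays.asList("lucene", "search", "library");
        List<List<String>> corpus = Arrays.asList(d1, d2);
        Map<String, Double> query = tfIdf(Arrays.asList("solr", "search"), corpus);
        System.out.println("d1: " + cosine(query, tfIdf(d1, corpus)));
        System.out.println("d2: " + cosine(query, tfIdf(d2, corpus)));
    }
}
```

In this toy corpus d1 scores higher than d2 for the query "solr search", because it shares both query terms and "solr" is the rarer, more heavily weighted one.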