Lecture 22: NLP for IR

Prof. Ray Larson University of California, Berkeley School of Information Lecture 22: NLP for IR Principles of Information Retrieval

Today • Review • Cheshire III Design – GRID-based DLs • NLP for IR • Text Summarization Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)
• Applications: High energy physics, Chemical Engineering, Climate, Astrophysics, Cosmology, Combustion, ...
• Application Toolkits: Remote Computing, Remote Visualization, Collaboratories, Remote sensors, Data Grid, Portals, ...
• Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

Grid Architecture (ECAI/AS Grid Digital Library Workshop)
• Applications: Digital Libraries, High energy physics, Bio-Medical, Humanities computing, Astrophysics, Chemical Engineering, Climate, Cosmology, Combustion, ...
• Application Toolkits: Text Mining, Remote Computing, Remote Visualization, Search & Retrieval, Metadata management, Collaboratories, Remote sensors, Data Grid, Portals, ...
• Grid Services (Grid middleware): Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

Grid IR Issues • Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed) • Very large-scale distribution of resources is a challenge for sub-second retrieval • Unlike most typical Grid processes, IR is potentially less computing intensive and more data intensive • In many ways Grid IR replicates the process (and problems) of metasearch or distributed search

Context • Environmental Requirements: • Very Large scale information systems • Terabyte scale (Data Grid) • Computationally expensive processes (Comp. Grid) • Digital Preservation • Analysis of data, not just retrieval (Data/Text Mining) • Ease of Extensibility, Customizability (Python) • Open Source • Integrate not Re-implement • "Web 2.0" – interactivity and dynamic interfaces

Context Application Layer Digital Library Layer Data Grid Layer Data Mining Tools Text Mining Tools User Interface User Interface Orange, Weka, ... Tsujii Labs, ... Web Browser Multivalent Dedicated Client MySRB PAWN Classification Clustering Natural Language Processing Information Extraction Results Query Information System Data Grid Protocol Handler Cheshire3 Store Query SRB iRODS Apache+ Mod_Python+ Cheshire3 Search / Retrieve Index / Store Results Results Query Document Parsers Term Management Process Management Process Management Multivalent,... Termine WordNet ... Kepler iRODS rules Kepler Cheshire3 Export Parse

Cheshire3 Object Model Ingest Process Documents Server Object Document Group ConfigStore Transformer Records User Document Query Database UserStore PreParser PreParser ResultSet PreParser Query Document Index Extracter RecordStore Parser Normaliser Terms DocumentStore IndexStore Protocol Handler Record

Object Configuration • One XML 'record' per non-data object • Very simple base schema, with extensions as needed • Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases) • Allows workflows to reference by identifier but act appropriately within different contexts. • Allows multiple administrators to define objects without reference to each other

Grid • Focus on ingest, not discovery (yet) • Instantiate architecture on every node • Assign one node as master, rest as slaves. Master then divides the processing as appropriate. • Calls between slaves possible • Calls as small and simple as possible: (objectIdentifier, functionName, *arguments) • Typically: ('workflow-id', 'process', 'document-id')
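The call convention above, (objectIdentifier, functionName, *arguments), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `EchoWorkflow`, `REGISTRY`, `dispatch`, and `divide_work` are hypothetical names invented here, not part of the Cheshire3 API, and the round-robin split is just one way a master might divide the processing.

```python
# Hypothetical sketch: none of these names come from Cheshire3 itself.
class EchoWorkflow:
    """Stand-in for a configured workflow object available on every node."""

    def process(self, document_id):
        # A real workflow would fetch, parse and index the document here.
        return f"processed {document_id}"


# Every node instantiates the same objects, looked up by identifier.
REGISTRY = {"workflow-id": EchoWorkflow()}


def dispatch(call, registry=REGISTRY):
    """Resolve an (objectIdentifier, functionName, *arguments) tuple and invoke it."""
    object_id, function_name, *arguments = call
    target = registry[object_id]
    return getattr(target, function_name)(*arguments)


def divide_work(document_ids, n_slaves):
    """Master side: split document ids round-robin into one call list per slave."""
    batches = [[] for _ in range(n_slaves)]
    for i, doc_id in enumerate(document_ids):
        batches[i % n_slaves].append(("workflow-id", "process", doc_id))
    return batches


batches = divide_work(["doc-1", "doc-2", "doc-3"], n_slaves=2)
# Slave side: each slave simply dispatches the calls in its batch.
results = [[dispatch(call) for call in batch] for batch in batches]
```

Keeping each call to a small tuple of identifiers means only strings cross the network; the objects themselves are resolved locally on whichever node receives the call.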

Grid Architecture
• Master Task sends (workflow, process, document) to each of Slave Task 1 ... Slave Task N
• Each slave task fetches its document from the Data Grid
• Each slave task writes its extracted data to GPFS Temporary Storage

Grid Architecture - Phase 2
• Master Task sends (index, load) to each of Slave Task 1 ... Slave Task N
• Each slave task fetches extracted data from GPFS Temporary Storage
• Each slave task stores its index in the Data Grid

Workflow Objects • Written as XML within the configuration record. • Rewrites and compiles to Python code on object instantiation Current instructions: • object • assign • fork • for-each • break/continue • try/except/raise • return • log (= send text to default logger object) Yes, no if!

Workflow example
<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>"Loaded Record:" + input.id</log>
  </workflow>
</subConfig>

Text Mining • Integration of Natural Language Processing tools • Including: • Part of Speech taggers (noun, verb, adjective,...) • Phrase Extraction • Deep Parsing (subject, verb, object, preposition,...) • Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi) • Planned: Information Extraction tools

Data Mining • Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes • Focus on automatic classification for predefined categories rather than clustering • Algorithms integrated/implemented: • Perceptron, Neural Network (pure python) • Naïve Bayes (pure python) • SVM (libsvm integrated with python wrapper) • Classification Association Rule Mining (Java)
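To illustrate why sparse vector input matters, here is a minimal sketch of a sparse document vector that stores only the non-zero term counts. The function names and the six-term toy vocabulary are assumptions made for this example; a real collection would have a vocabulary of many thousands of terms, which makes the dense form far more wasteful.

```python
from collections import Counter


def sparse_vector(tokens, vocabulary):
    """Map a token list to {term_index: count}, keeping only non-zero entries."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[t]: c for t, c in counts.items()}


def to_dense(sparse, size):
    """Expand a sparse vector to a full-length list (mostly zeroes for text)."""
    dense = [0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


# Toy vocabulary (an assumption for illustration only).
vocabulary = {term: i for i, term in enumerate(
    ["cat", "dog", "fish", "grid", "index", "query"])}

doc = "query the index then query the grid".split()
vec = sparse_vector(doc, vocabulary)
dense = to_dense(vec, len(vocabulary))
```

A toolkit that accepts only the dense form forces every document to carry one slot per vocabulary term, which is the integration difficulty the slide refers to.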

Data Mining • Modelled as multi-stage PreParser object (training phase, prediction phase) • Plus need for AccumulatingDocumentFactory to merge document vectors together into single output for training some algorithms (e.g., SVM) • Prediction phase attaches metadata (predicted class) to document object, which can be stored in DocumentStore • Document vectors generated per index per document, so integrated NLP document normalization for free

Data Mining + Text Mining • Testing integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies. • Computational grid for distributing expensive NLP analysis • Results show better accuracy with fewer attributes.

Applications (1) Automated Collection Strength Analysis Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries. The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records. This involved very large scale processing of records to: • Deduplicate millions of records • Enrich deduplicated records against database of 45 million • Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

Applications (1) • Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems. • The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining

Applications (2) Assessing the Grade Level of NSDL Education Material • The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid. • Working with SDSC we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL. • We determined the vocabulary-based grade level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier). • This processing was done on the TeraGrid cluster at SDSC.

Applications (2)
• The formula for the Flesch Reading Ease Score: FRES = 206.835 - 1.015 * ((total words)/(total sentences)) - 84.6 * ((total syllables)/(total words))
• The Flesch-Kincaid Grade Level Formula: FKGLF = 0.39 * ((total words)/(total sentences)) + 11.8 * ((total syllables)/(total words)) - 15.59
• The Domain was determined by:
• Domains used were based upon the AAAS Benchmarks
• Samples are taken from each of the domain areas being examined, producing scored and ranked lists of vocabulary for each domain.
• Each token in a document is passed through a lookup function against this table, and tallies are calculated for the entire document.
• These tallies are then used to rank the likelihood of the document being about each topic, and a statistical pass over the results returns only those topics that are above a certain threshold.
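The two readability formulas above translate directly into code. The sketch below is an illustration, not the processing actually run on the NSDL material: the sentence splitter and the vowel-group syllable counter are crude assumptions introduced here, and real syllable counting needs a dictionary or a better heuristic.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_counts(text):
    """Return (total words, total sentences, total syllables) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words, sentences, syllables):
    """FRES = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """FKGLF = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


counts = text_counts("John runs. He is a student.")
grade = flesch_kincaid_grade(*counts)
```

Both scores depend only on average sentence length and average syllables per word, which is why they are cheap enough to run over every harvested page.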

Today • Natural Language Processing and IR • Based on Papers in Reader and on • David Lewis & Karen Sparck Jones “Natural Language Processing for Information Retrieval” Communications of the ACM, 39(1) Jan. 1996 • Text summarization: Lecture from Ed Hovy (USC)

Natural Language Processing and IR • The main approach in applying NLP to IR has been to attempt to address • Phrase usage vs individual terms • Search expansion using related terms/concepts • Attempts to automatically exploit or assign controlled vocabularies

NLP and IR • Much early research showed that (at least in the restricted test databases tested) • Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically) • Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

NLP and IR • Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods • E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches
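A small sketch of what a "bag of words" representation discards may help here. The helper names and the term-overlap score are assumptions chosen for illustration; the point is only that two sentences with opposite syntactic roles get identical representations and therefore identical scores.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text as unordered term counts; word order and roles are lost."""
    return Counter(text.lower().split())


def overlap_score(query, document):
    """Toy retrieval score (an assumption): number of shared term occurrences."""
    q, d = bag_of_words(query), bag_of_words(document)
    return sum(min(q[t], d[t]) for t in q)


doc_a = "John loves Mary"
doc_b = "Mary loves John"

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
score_a = overlap_score("who loves Mary", doc_a)
score_b = overlap_score("who loves Mary", doc_b)
```

Indexing syntactic role relations would distinguish the two documents, which is why it is intuitively plausible that it should help, even though the experiments summarized above found no improvement.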

General Framework of NLP Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

General Framework of NLP (worked example: "John runs.")
• Morphological and Lexical Processing: John runs. → John run+s. (John: P-N; run: V or N; +s: 3-pre or plu)
• Syntactic Analysis: S → NP VP, with NP → P-N (John) and VP → V (run)
• Semantic Analysis: Pred: RUN, Agent: John
• Context processing / Interpretation: John is a student. He runs.

General Framework of NLP Tokenization Morphological and Lexical Processing Part of Speech Tagging Inflection/Derivation Compounding Syntactic Analysis Term recognition (Ananiadou) Semantic Analysis Context processing Interpretation Domain Analysis Appelt:1999

Difficulties of NLP (1) Robustness: Incomplete Knowledge
• Morphological and Lexical Processing: Incomplete Lexicons: open class words; terms (term recognition); named entities (company names, locations, numerical expressions)
• Syntactic Analysis: Incomplete Grammar: syntactic coverage; domain specific constructions; ungrammatical constructions
• Semantic Analysis, Context processing, Interpretation: Incomplete Domain Knowledge; Interpretation Rules
• Predefined Aspects of Information

Difficulties of NLP (1) Robustness: Incomplete Knowledge (2) Ambiguities: Combinatorial Explosion
• Morphological and Lexical Processing: Most words in English are ambiguous in terms of their parts of speech. runs: v/3pre, n/plu; clubs: v/3pre, n/plu and two meanings
• Syntactic and Semantic Analysis: Structural Ambiguities; Predicate-argument Ambiguities
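The part-of-speech ambiguity of "runs" and "clubs" noted above already multiplies at the lexical level: the number of candidate tag sequences is the product of the number of tags per word. The sketch below uses a toy lexicon; the tags for "runs" and "clubs" follow the slide, while the entry for "john" and the function names are assumptions for illustration.

```python
from itertools import product
from math import prod

# Toy lexicon: tags for "runs" and "clubs" follow the slide; "john" is assumed.
LEXICON = {
    "john": ["P-N"],
    "runs": ["V/3pre", "N/plu"],
    "clubs": ["V/3pre", "N/plu"],
}


def tag_sequences(tokens):
    """Enumerate every combination of candidate tags for the token sequence."""
    return list(product(*(LEXICON[t.lower()] for t in tokens)))


def count_tag_sequences(tokens):
    """Count the combinations without enumerating them."""
    return prod(len(LEXICON[t.lower()]) for t in tokens)


tokens = ["John", "runs", "clubs"]
sequences = tag_sequences(tokens)
n_sequences = count_tag_sequences(tokens)
```

With two candidate tags per word, a twenty-word sentence would already have over a million tag sequences before any syntactic analysis starts, which is why later stages cannot simply try every reading.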

Structural Ambiguities
• (1) Attachment Ambiguities: John bought a car with large seats. / John bought a car with $3000. / The manager of Yaxing Benz, a Sino-German joint venture / The manager of Yaxing Benz, Mr. John Smith
• (2) Scope Ambiguities: young women and men in the room
• (3) Analytical Ambiguities: Visiting relatives can be boring.
Semantic Ambiguities (1): John bought a car with Mary. $3000 can buy a nice car.
Semantic Ambiguities (2): Every man loves a woman.
Co-reference Ambiguities

Difficulties of NLP: Combinatorial Explosion • (2) Ambiguities: Structural Ambiguities and Predicate-argument Ambiguities, arising in Syntactic and Semantic Analysis, combine to produce a combinatorial explosion of candidate analyses
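One way to get a feel for the combinatorial explosion is to count bracketings. Under the simplifying assumption that every binary bracketing of a sequence of constituents counts as a separate structural analysis, the number of analyses of n + 1 constituents is the n-th Catalan number. The sketch below only computes that count; it is not a parser, and the function name is chosen here for illustration.

```python
from math import comb


def catalan(n):
    """n-th Catalan number: distinct binary bracketings of n + 1 items."""
    return comb(2 * n, n) // (n + 1)


# Number of bracketings for sequences of 1 to 11 constituents.
growth = [catalan(n) for n in range(11)]
```

The count passes ten thousand by the time there are eleven constituents, so each additional attachable phrase makes exhaustive enumeration markedly more expensive.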

Note: Ambiguities vs Robustness • More comprehensive knowledge (big dictionaries, comprehensive grammar): more robust • More comprehensive knowledge: more ambiguities • Adaptability: Tuning, Learning

Framework of IE: IE as compromise NLP


Techniques in IE
• (1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted
• (2) Ambiguities: Ignoring irrelevant ambiguities; simpler NLP techniques
• (3) Robustness: Coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
• (4) Adaptation Techniques: Machine Learning, Trainable systems

General Framework of NLP: Morphological and Lexical Processing
• Part of Speech Tagger: FSA rules; statistic taggers (local context, statistical bias); 95 %
• Open class words: Named entity recognition (ex) Locations, Persons, Companies, Organizations, Position names; F-Value 90; Domain Dependent
• Domain specific rules: <Word><Word>, Inc.; Mr. <Cpt-L>. <Word>
• Machine Learning: HMM, Decision Trees
• Rules + Machine Learning
• Later stages: Syntactic Analysis, Semantic Analysis, Context processing, Interpretation
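The two domain specific rules on this slide can be approximated with regular expressions. This is a sketch under assumptions: `<Word>` is read here as a capitalized word and `<Cpt-L>` as a single capital letter, which is an interpretation of the slide's notation rather than something it defines, and the entity labels and function name are invented for the example.

```python
import re

# "<Word><Word>, Inc." read as two capitalized words followed by ", Inc."
COMPANY = re.compile(r"\b[A-Z][A-Za-z]+ [A-Z][A-Za-z]+, Inc\.")
# "Mr. <Cpt-L>. <Word>" read as "Mr.", a capital initial, then a capitalized word
PERSON = re.compile(r"\bMr\. [A-Z]\. [A-Z][a-z]+")


def find_entities(text):
    """Return (label, matched text) pairs for the two hand-written rules."""
    entities = [("Company", m.group()) for m in COMPANY.finditer(text)]
    entities += [("Person", m.group()) for m in PERSON.finditer(text)]
    return entities


text = "Mr. J. Smith joined Acme Widgets, Inc. last year."
entities = find_entities(text)
```

Rules like these are precise but brittle: a company written without ", Inc." or a name without an initial is missed, which is one motivation for combining rules with machine learning as the slide suggests.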

FASTUS: based on finite state automata (FSA)
• 1. Complex Words: Recognition of multi-words and proper names (Morphological and Lexical Processing)
• 2. Basic Phrases: Simple noun groups, verb groups and particles (Syntactic Analysis)
• 3. Complex phrases: Complex noun groups and verb groups
• 4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built. (Semantic Analysis)
• 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. (Context processing, Interpretation)
