Ontology-Guided Search and Text Mining for Intelligence Gathering Kurt Godden, Ph.D. MSR Lab, R&D kurt.godden@gm.com. Outline. Definitions of terms Customers (Who cares?) Finding Text – ontology-guided search Text Processing – Content extraction Text Mining Temporal Data Mining at GM
Ontology-Guided Search and Text Mining for Intelligence Gathering Kurt Godden, Ph.D. MSR Lab, R&D kurt.godden@gm.com 1
Outline • Definitions of terms • Customers (Who cares?) • Finding Text – ontology-guided search • Text Processing – • Content extraction • Text Mining • Temporal Data Mining at GM • Multi-Lingual Text Processing • Summary 2
What is Text Mining? • Data Mining: • The process of analyzing data to discover new patterns or relationships • The 1st international conference was KDD-95 • http://www-aig.jpl.nasa.gov/public/kdd95/ • Text Mining (TM) is a subfield of Data Mining • As such, ideally TM is the process of analyzing unstructured text to discover new patterns or relationships • In practice, TM often refers simply to the Content Extraction (CE) of structured data from unstructured text, usually by means of finite-state parsers. 3
Content Extraction: Structured Data from Unstructured Text • “Company XYZ is known to ship products through the port of Dubai.” • From Text to Actionable Knowledge: • Automatic multi-language scanning • Entity and relation extraction/distillation • Filtering • Result: <XYZ-Corp, exports-through, Dubai> 4
Who Cares? • Government • NSA, CIA, DIA, DHS, DARPA • Industry • Automotive • Chemical • Pharmaceutical • Legal • Consumer goods • Aerospace 5
Why do they care? • Intelligence and Security • Valdis E. Krebs was able to manually map much of the 9/11 terrorist cell from public documents. • http://vlado.fmf.uni-lj.si/pub/networks/doc/Seminar/Krebs.pdf • Industrial • Urban Legend: (Is it true?) “80% of all corporate knowledge is in text.” • Market research • Fraud detection • Root cause analysis • Document clustering and categorization • Competitive intelligence • Patent analysis • etc 6
Before Mining Must Come Text • How to find it? 7
Ontology-Guided Search (OGS) • Oft-cited definition of ontology by T.R. Gruber: • An ontology is a formal specification of a shared conceptualization. • www.vivisimo.com clusters search results according to semantic categories • OGS: use an ontology to guide the search for documents to include not only keywords of interest, but also terms that are semantically related to those keywords 8
What ontology to use? • Public • WordNet: http://wordnet.princeton.edu/ • Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations • Currently 207k word-sense pairs • <bank1, monetary institution> • <bank2, land adjacent to river> • Custom • Parts • Products • Processes • Tool: Protégé at http://protege.stanford.edu/ 9
Ontology-Guided Search (OGS) • Use ontology to search not only on keywords, but on semantically-related keywords • [Figure: query keywords expanded into related terms: avoids / avoiding / avoided; neighborhood / neighborhoods / suburb / suburbs; riot / riots / “civil unrest”; “driving through” / “drive through” / “drove through”] 10
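The keyword expansion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the talk: the `RELATED_TERMS` table, the `expand_query` function, and the boolean `AND`/`OR` query syntax are all assumptions made for the example. A real system would pull the related terms from WordNet or a custom ontology.

```python
# Hypothetical mini-ontology: each query keyword maps to related terms.
# The entries mirror the slide's example and are hand-written for illustration.
RELATED_TERMS = {
    "avoid": ["avoid", "avoids", "avoiding", "avoided"],
    "neighborhood": ["neighborhood", "neighborhoods", "suburb", "suburbs"],
    "riot": ["riot", "riots", "civil unrest"],
    "drive through": ["drive through", "driving through", "drove through"],
}


def quote(term):
    """Quote multi-word terms so a search engine would treat them as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(keywords, ontology=RELATED_TERMS):
    """OR together each keyword's related terms, then AND the groups."""
    groups = []
    for keyword in keywords:
        # Keywords missing from the ontology pass through unchanged.
        terms = ontology.get(keyword, [keyword])
        groups.append("(" + " OR ".join(quote(t) for t in terms) + ")")
    return " AND ".join(groups)


query = expand_query(["neighborhood", "riot"])
print(query)
# (neighborhood OR neighborhoods OR suburb OR suburbs) AND (riot OR riots OR "civil unrest")
```

Grouping with OR inside a keyword and AND across keywords keeps the expanded query as selective as the original while widening recall on each concept.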
Pitfalls of OGS • Beware of semantically related terms • Simulation of OGS using WordNet • Original query: • Which neighborhoods of Paris are safe? • One of several transformed queries was: • Which suburbs of Paris are condoms? 11
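The bad transformation presumably arises because "safe" also has a noun sense (a condom), and a sense-blind expansion mixes it in with the adjective. The sketch below contrasts sense-blind substitution with substitution restricted by part of speech. The `SENSES` table and both functions are hypothetical, and the query is assumed to be already lemmatized and tagged.

```python
# Hypothetical sense inventory: (lemma, part of speech) -> synonyms.
# Hand-written for illustration; not an excerpt from WordNet.
SENSES = {
    ("neighborhood", "noun"): ["suburb"],
    ("safe", "adj"): ["secure"],
    ("safe", "noun"): ["condom"],
}


def synonyms(word, pos=None):
    """Return synonyms of word; with pos=None all senses are mixed together."""
    return [
        syn
        for (lemma, sense_pos), syns in SENSES.items()
        if lemma == word and pos in (None, sense_pos)
        for syn in syns
    ]


def expand(tagged_query, use_pos):
    """Return every variant of the query with exactly one word substituted."""
    variants = []
    for i, (word, pos) in enumerate(tagged_query):
        for syn in synonyms(word, pos if use_pos else None):
            words = [w for w, _ in tagged_query]
            words[i] = syn
            variants.append(" ".join(words))
    return variants


tagged = [("neighborhood", "noun"), ("safe", "adj")]
blind = expand(tagged, use_pos=False)    # includes "neighborhood condom"
filtered = expand(tagged, use_pos=True)  # noun sense of "safe" is excluded
print(blind)
print(filtered)
```

Filtering by part of speech is only one possible guard. It would not help when two senses share a part of speech, as with the two noun senses of "bank".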
Content Extraction Technology • Regular Expressions Mapped to Semantic Templates • Regular Expression for Passives: NP1 BE TV [by NP2] • Example: “The lecture was presented by Kurt Godden” • Mapping of Match Registers to Template: <NP2:agent, TV:relation, NP1:object> gives <kg, presented, lecture> • Post-Processing Rule: if NP2 is the empty string, then use ‘someone’:agent 12
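A rough Python rendering of the passive pattern and its template mapping is sketched below. It works on raw text rather than on tagged tokens, so the definitions of NP, BE, and TV are crude assumptions: any word run counts as an NP, and a transitive verb is any word ending in "-ed". The names `PASSIVE` and `extract_passive` are hypothetical.

```python
import re

# NP1 BE TV [by NP2], approximated over raw text.
PASSIVE = re.compile(
    r"^(?P<np1>.+?)\s+"            # NP1: shortest word run before the BE form
    r"(?P<be>is|are|was|were)\s+"  # BE
    r"(?P<tv>\w+ed)"               # TV: assumed to be a regular past participle
    r"(?:\s+by\s+(?P<np2>.+?))?"   # optional agent phrase
    r"[.!?]?$",
    re.IGNORECASE,
)


def extract_passive(sentence):
    """Map match registers to an <agent, relation, object> triple, or None."""
    m = PASSIVE.match(sentence.strip())
    if m is None:
        return None
    # Post-processing rule from the slide: a missing NP2 becomes 'someone'.
    agent = m.group("np2") or "someone"
    return (agent, m.group("tv"), m.group("np1"))


print(extract_passive("The lecture was presented by Kurt Godden"))
# ('Kurt Godden', 'presented', 'The lecture')
print(extract_passive("The report was filed."))
# ('someone', 'filed', 'The report')
```

The output is not yet canonicalized: reducing "Kurt Godden" to `kg` or "The lecture" to `lecture` would be a separate step.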
Content Extraction Example • “Some 40 vehicles were torched in the Val d'Oise area NW of Paris.” http://www.breitbart.com/news/2005/11/04/D8DLFA780.html • For pattern NP1 BE TV [by NP2]: ‘vehicles’ matches NP1; ‘were’ matches BE; ‘torched’ matches TV; no match for NP2 • Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn): <someone, burn, vehicle> • Additional triples can be matched by other RegExp patterns, giving: <vehicle, count, 40>, <vehicle, located-in, val-d’oise>, <val-d’oise, near, paris> 13
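The canonicalization step and one of the additional patterns can be sketched as follows. The `CANONICAL` lookup table stands in for a domain ontology, and the `COUNT` pattern is one guess at what a count-extracting expression might look like. Neither is taken from the original system.

```python
import re

# Hypothetical ontology fragment: surface form -> canonical concept.
CANONICAL = {
    "vehicles": "vehicle",
    "torched": "burn",
}

# A number followed by a word, e.g. "40 vehicles" (illustrative pattern).
COUNT = re.compile(r"\b(?P<n>\d+)\s+(?P<noun>\w+)")


def canonicalize(token):
    """Lowercase a token and map it to its canonical concept if one is known."""
    key = token.lower()
    return CANONICAL.get(key, key)


def canonicalize_triple(triple):
    """Canonicalize every element of an <agent, relation, object> triple."""
    return tuple(canonicalize(t) for t in triple)


def extract_count(sentence):
    """Return a <noun, count, n> triple for the first number-noun pair, or None."""
    m = COUNT.search(sentence)
    if m is None:
        return None
    return (canonicalize(m.group("noun")), "count", int(m.group("n")))


sentence = "Some 40 vehicles were torched in the Val d'Oise area NW of Paris."
print(canonicalize_triple(("someone", "torched", "vehicles")))
# ('someone', 'burn', 'vehicle')
print(extract_count(sentence))
# ('vehicle', 'count', 40)
```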
Why Only Regular Expressions? • Computational Efficiency • Practical Adequacy • Workaround for lack of recursion: lots of REs! • The recursive rule NP → NP and NP becomes: NP → CN and CN; NP → CN and CN and CN; NP → NAME and NAME; NP → NAME and NAME and NAME 14
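The workaround amounts to unrolling the recursive rule to a fixed depth. The sketch below enumerates the flat rules from the slide and compiles an equivalent regular expression using bounded repetition. The `TOKEN_CLASS` patterns are crude stand-ins (lowercase word for CN, capitalized word for NAME), and the cutoff of three conjuncts is an assumption taken from the slide's examples.

```python
import re

# Hypothetical token classes; a real system would match tagged tokens instead.
TOKEN_CLASS = {
    "CN": r"[a-z]+",         # common noun: a lowercase word (crude assumption)
    "NAME": r"[A-Z][a-z]+",  # name: a capitalized word (crude assumption)
}


def coordination_rules(max_conjuncts=3):
    """Enumerate the flat rules 'X and X', 'X and X and X', ... per class."""
    return [
        " and ".join([cls] * n)
        for cls in TOKEN_CLASS
        for n in range(2, max_conjuncts + 1)
    ]


def compile_np(max_conjuncts=3):
    """Compile the unrolled rules into one regex with bounded repetition."""
    alternatives = [
        rf"(?:{pat}(?: and {pat}){{1,{max_conjuncts - 1}}})"
        for pat in TOKEN_CLASS.values()
    ]
    return re.compile("|".join(alternatives))


print(coordination_rules())
# ['CN and CN', 'CN and CN and CN', 'NAME and NAME', 'NAME and NAME and NAME']

np_pattern = compile_np()
print(bool(np_pattern.fullmatch("cars and trucks and buses")))  # True
print(bool(np_pattern.fullmatch("a and b and c and d")))        # False: beyond the bound
```

The bound is the price of staying finite-state: a coordination longer than the cutoff is not matched as a whole, which is the trade the slide calls "practical adequacy".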
After Text Must Come Mining • Temporal Data Mining research by K.P. Unnikrishnan (GM R&D) and P.S. Sastry (IISc, Bangalore) • TDMiner • Proprietary tool • Discovers frequent sequences of events from symbolic data 15
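To make "frequent sequences of events from symbolic data" concrete, here is a small generic sketch. It is not TDMiner's algorithm, which is proprietary and not described here. It counts non-overlapped occurrences of ordered event pairs in a symbol stream and keeps the pairs that meet a threshold. The function names, the pair-only candidate set, and the threshold are assumptions made for the example.

```python
from itertools import permutations


def count_non_overlapped(events, episode):
    """Count non-overlapped occurrences of an ordered episode in the stream."""
    count, pos = 0, 0
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):  # one full occurrence seen; start over
                count += 1
                pos = 0
    return count


def frequent_episodes(events, length=2, min_count=2):
    """Return ordered episodes of distinct event types with count >= min_count."""
    types = sorted(set(events))
    result = {}
    for episode in permutations(types, length):
        c = count_non_overlapped(events, episode)
        if c >= min_count:
            result[episode] = c
    return result


events = list("axbaybcab")
print(frequent_episodes(events))
# {('a', 'b'): 3, ('b', 'a'): 2}
```

Other events may occur between the members of an episode, so "a ... b" is counted even when "x" or "y" intervenes. Exhaustively testing every candidate, as done here, would not scale to large alphabets or longer episodes.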
For More Info: • 4th Workshop on Temporal Data Mining: Network Reconstruction from Dynamic Data • http://www.kdd2006.com/workshops.html • Laxman, Sastry and Unnikrishnan. “Discovering Frequent Episodes and Learning Hidden Markov Models: a Formal Connection.” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517. 2005 19
Network Reconstruction • How to determine directed acyclic graphs from sequential event data • [Figure: example graph with nodes x, z, a, n, p, g] 20
Multilingual Problem • What if source text is not in English? 21
Machine Translation (MT) • Free, web-based tools are not state of the art, e.g. http://babelfish.altavista.com/ • LanguageWeaver uses statistical MT • Spin-off of USC Information Sciences Institute • www.languageweaver.com 22
Hypothesis • Effective Content Extraction rules can be custom-developed for raw machine-translated text. 24
Summary • Text Mining Can Offer Real Value • Used Extensively by Gov’t Intel Agencies • Several COTS tools available for Content Extraction: • SAS Text Miner • AeroText (Lockheed Martin) • ClearForest • Attensity • etc. • GATE – Univ. of Sheffield, open-source • http://gate.ac.uk/ 25