Ontology-Guided Search and Text Mining for Intelligence Gathering. Kurt Godden, Ph.D., MSR Lab, R&D, kurt.godden@gm.com

Presentation Transcript


  1. Ontology-Guided Search and Text Mining for Intelligence Gathering. Kurt Godden, Ph.D., MSR Lab, R&D, kurt.godden@gm.com

  2. Outline
     • Definitions of terms
     • Customers (Who cares?)
     • Finding Text – ontology-guided search
     • Text Processing
       • Content extraction
       • Text Mining
     • Temporal Data Mining at GM
     • Multi-Lingual Text Processing
     • Summary

  3. What is Text Mining?
     • Data Mining:
       • The process of analyzing data to discover new patterns or relationships
       • The 1st International Conference was KDD-95
       • http://www-aig.jpl.nasa.gov/public/kdd95/
     • Text Mining (TM) is a subfield of Data Mining
       • As such, ideally TM is the process of analyzing unstructured text to discover new patterns or relationships
       • In practice, TM often refers simply to the Content Extraction (CE) of structured data from unstructured text, usually by means of finite-state parsers

  4. Content Extraction: Structured Data from Unstructured Text
     “Company XYZ is known to ship products through the port of Dubai.”
     • From Text to Actionable Knowledge:
       • Automatic multi-language scanning
       • Entity and relation extraction/distillation
       • Filtering
     <XYZ-Corp, exports-through, Dubai>

  5. Who Cares?
     • Government
       • NSA, CIA, DIA, DHS, DARPA
     • Industry
       • Automotive
       • Chemical
       • Pharmaceutical
       • Legal
       • Consumer goods
       • Aerospace

  6. Why do they care?
     • Intelligence and Security
       • Valdis E. Krebs was able to manually map much of the 9/11 terrorist cell from public documents.
       • http://vlado.fmf.uni-lj.si/pub/networks/doc/Seminar/Krebs.pdf
     • Industrial
       • Urban Legend (Is it true?): “80% of all corporate knowledge is in text.”
       • Market research
       • Fraud detection
       • Root cause analysis
       • Document clustering and categorization
       • Competitive intelligence
       • Patent analysis
       • etc.

  7. Before Mining Must Come Text
     • How to find it?

  8. Ontology-Guided Search (OGS)
     • Oft-cited definition of ontology by T.R. Gruber:
       • An ontology is a formal specification of a shared conceptualization.
     • www.vivisimo.com clusters search results according to semantic categories
     • OGS: use an ontology to guide the search for documents, so that it covers not only the keywords of interest but also terms that are semantically related to those keywords

  9. What ontology to use?
     • Public
       • WordNet: http://wordnet.princeton.edu/
       • Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations
       • Currently about 207k word-sense pairs, e.g.
         <bank1, monetary institution>
         <bank2, land adjacent to river>
     • Custom
       • Parts
       • Products
       • Processes
       • Tool: Protégé at http://protege.stanford.edu/

  10. Ontology-Guided Search (OGS)
      • Use ontology to search not only on keywords, but on semantically-related keywords
      • Groups of related search terms shown on the slide:
        • avoids, avoiding, avoided
        • neighborhood, neighborhoods, suburb, suburbs
        • riot, riots, “civil unrest”
        • “driving through”, “drive through”, “drove through”
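
The expansion idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RELATED` table is a hypothetical hand-written stand-in for a real ontology such as WordNet, and the OR/AND query syntax is assumed rather than taken from any particular search engine.

```python
# Hypothetical toy ontology: each keyword maps to semantically related terms.
# A real system would draw these from WordNet or a custom ontology.
RELATED = {
    "avoid": ["avoids", "avoiding", "avoided"],
    "neighborhood": ["neighborhoods", "suburb", "suburbs"],
    "riot": ["riots", "civil unrest"],
    "drive through": ["driving through", "drove through"],
}


def expand_term(term):
    """Return the term plus its related terms, quoting multi-word phrases."""
    variants = [term] + RELATED.get(term, [])
    return ['"%s"' % v if " " in v else v for v in variants]


def expand_query(keywords):
    """Build a boolean query: OR within one concept, AND across concepts."""
    groups = ["(" + " OR ".join(expand_term(k)) + ")" for k in keywords]
    return " AND ".join(groups)


print(expand_query(["avoid", "neighborhood", "riot"]))
```

Grouping the variants of each concept with OR keeps the query focused on documents that mention every concept while still tolerating different surface forms.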

  11. Pitfalls of OGS
      • Beware of semantically related terms
      • Simulation of OGS using WordNet
        • Original query:
          Which neighborhoods of Paris are safe?
        • One of several transformed queries was:
          Which suburbs of Paris are condoms?
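
One way to see how such a query arises is to pool synonyms across all senses of a word without checking which sense the query intends. The sketch below assumes a hypothetical sense inventory (`SENSES`) invented for illustration and only loosely modeled on WordNet; it is not the simulation used in the talk.

```python
# Hypothetical sense inventory: a word maps to several senses, each with its
# own synonym list. Labels and synonyms are invented for illustration.
SENSES = {
    "safe": {
        "adjective: free from danger": ["secure"],
        "noun: strongbox": ["strongbox"],
        "noun: contraceptive": ["condom"],
    },
    "neighborhood": {
        "noun: district": ["suburb", "district"],
    },
}


def naive_variants(word):
    """Sense-blind expansion: pool the synonyms of every sense of the word."""
    pooled = []
    for synonyms in SENSES.get(word, {}).values():
        pooled.extend(synonyms)
    return pooled


def sense_filtered_variants(word, part_of_speech):
    """Keep only synonyms from senses whose label starts with part_of_speech."""
    pooled = []
    for label, synonyms in SENSES.get(word, {}).items():
        if label.startswith(part_of_speech):
            pooled.extend(synonyms)
    return pooled


print(naive_variants("safe"))
print(sense_filtered_variants("safe", "adjective"))
```

Even a coarse filter on part of speech removes the noun senses here; real queries would likely need proper word-sense disambiguation.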

  12. Content Extraction Technology
      • Regular Expressions Mapped to Semantic Templates
      • Regular Expression for Passives: NP1 BE TV [by NP2]
        “The lecture was presented by Kurt Godden”
      • Mapping of Match Registers to Template:
        <NP2:agent, TV:relation, NP1:object>
        <kg, presented, lecture>
      • Post-Processing Rule: if NP2 is the empty string, then use ‘someone’:agent
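
The passive pattern and its template mapping can be approximated with Python's `re` module. The sketch makes simplifying assumptions that are not in the slides: a noun phrase is an optional determiner plus plain words, a transitive-verb participle is any word ending in "ed", and the pattern runs over raw text rather than over tagged tokens.

```python
import re

# Simplified stand-ins for NP1 BE TV [by NP2]; a real extractor would
# probably match over part-of-speech-tagged tokens instead of raw strings.
PASSIVE = re.compile(
    r"^(?:(?:the|a|an)\s+)?(?P<np1>\w+(?:\s+\w+)*?)\s+"  # NP1
    r"(?:is|are|was|were)\s+"                            # BE
    r"(?P<tv>\w+ed)"                                     # TV
    r"(?:\s+by\s+(?P<np2>\w+(?:\s+\w+)*))?"              # [by NP2]
    r"\.?$",
    re.IGNORECASE,
)


def extract_passive(sentence):
    """Map match registers to an (agent, relation, object) triple."""
    m = PASSIVE.match(sentence.strip())
    if m is None:
        return None
    # Post-processing rule from the slide: an empty NP2 becomes 'someone'.
    agent = m.group("np2") or "someone"
    return (agent, m.group("tv"), m.group("np1"))


print(extract_passive("The lecture was presented by Kurt Godden"))
print(extract_passive("The lecture was presented."))
```

Named groups play the role of the match registers, so the template mapping is a simple reordering of the captured groups.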

  13. Content Extraction Example
      “Some 40 vehicles were torched in the Val d'Oise area NW of Paris.”
      http://www.breitbart.com/news/2005/11/04/D8DLFA780.html
      • For pattern: NP1 BE TV [by NP2]
        • ‘vehicles’ matches NP1
        • ‘were’ matches BE
        • ‘torched’ matches TV
        • No match for NP2
      • Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn):
        <someone, burn, vehicle>
      • Additional triples can be matched by other RegExp patterns, giving:
        <vehicle, count, 40>
        <vehicle, located-in, val-d’oise>
        <val-d’oise, near, paris>
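
The worked example on this slide can be reproduced with a handful of patterns plus a canonicalization table. Everything named here (`CANONICAL`, `PASSIVE`, `COUNT`, `LOCATED`, `NEAR`) is a hypothetical illustration written to fit this one sentence, not the rule set used in the talk.

```python
import re

SENTENCE = "Some 40 vehicles were torched in the Val d'Oise area NW of Paris."

# Hypothetical domain ontology mapping surface tokens to canonical concepts.
CANONICAL = {"vehicles": "vehicle", "torched": "burn"}


def canon(text):
    """Lower-case, look up in the ontology, else join words with hyphens."""
    key = text.lower()
    return CANONICAL.get(key, key.replace(" ", "-"))


# One illustrative pattern per triple template.
PASSIVE = re.compile(r"(\w+) (?:was|were) (\w+ed)\b(?: by (\w+))?")
COUNT = re.compile(r"(\d+) (\w+)")
LOCATED = re.compile(r"(\w+) (?:was|were) \w+ed in the (.+?) area")
NEAR = re.compile(r"in the (.+?) area \w+ of (\w+)")


def extract(sentence):
    """Apply each pattern once and collect the resulting triples."""
    triples = []
    m = PASSIVE.search(sentence)
    if m:
        np1, tv, np2 = m.groups()
        agent = canon(np2) if np2 else "someone"
        triples.append((agent, canon(tv), canon(np1)))
    m = COUNT.search(sentence)
    if m:
        triples.append((canon(m.group(2)), "count", int(m.group(1))))
    m = LOCATED.search(sentence)
    if m:
        triples.append((canon(m.group(1)), "located-in", canon(m.group(2))))
    m = NEAR.search(sentence)
    if m:
        triples.append((canon(m.group(1)), "near", canon(m.group(2))))
    return triples


for triple in extract(SENTENCE):
    print(triple)
```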

  14. Why Only Regular Expressions?
      • Computational Efficiency
      • Practical Adequacy
      • Workaround for lack of recursion: lots of REs! The recursive rule
        NP → NP and NP
        becomes
        NP → CN and CN
        NP → CN and CN and CN
        NP → NAME and NAME
        NP → NAME and NAME and NAME
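
The enumeration workaround can be generated mechanically rather than written by hand. In this sketch the shapes of `NAME` and `CN` are assumptions (capitalized word versus lower-case word), and the cap of three conjuncts simply mirrors the rules listed on the slide.

```python
import re


def coordinated(unit, max_conjuncts=3):
    """Expand 'X and X (and X ...)' into a finite alternation, longest first."""
    alternatives = [
        r"\s+and\s+".join([unit] * n) for n in range(max_conjuncts, 1, -1)
    ]
    return "(?:" + "|".join(alternatives) + ")"


NAME = r"[A-Z][a-z]+"  # assumed shape of a proper name
CN = r"[a-z]+"         # assumed shape of a common noun

# One flat pattern standing in for the recursive rule NP -> NP and NP.
NP = re.compile(coordinated(NAME) + "|" + coordinated(CN))

print(bool(NP.fullmatch("Alice and Bob")))
print(bool(NP.fullmatch("cars and trucks and buses")))
print(bool(NP.fullmatch("Alice and Bob and Carol and Dave")))
```

The last phrase is rejected because it has four conjuncts, which shows the trade-off: coverage stops at whatever depth was enumerated.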

  15. After Text Must Come Mining
      • Temporal Data Mining research by K.P. Unnikrishnan (GM R&D) and P.S. Sastry (IISc, Bangalore)
      • TDMiner
        • Proprietary tool
        • Discovers frequent sequences of events from symbolic data
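
TDMiner itself is proprietary, so the following is only a generic sketch of what discovering frequent sequences in symbolic event data can look like. It counts, scanning left to right, occurrences of an ordered pattern that do not share events, and keeps the ordered pairs that reach a threshold; the function names and the threshold are assumptions made for illustration.

```python
from itertools import permutations


def count_non_overlapped(events, episode):
    """Count left-to-right occurrences of an ordered episode that share no events."""
    count, pos = 0, 0
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                # A full occurrence was seen; start looking for the next one.
                count += 1
                pos = 0
    return count


def frequent_pairs(events, min_count):
    """Return every ordered pair of distinct symbols seen at least min_count times."""
    symbols = sorted(set(events))
    result = {}
    for pair in permutations(symbols, 2):
        c = count_non_overlapped(events, pair)
        if c >= min_count:
            result[pair] = c
    return result


stream = list("abcabcab")
print(frequent_pairs(stream, 3))
```

Longer sequences could be built level by level from the frequent pairs, in the usual candidate-generation style.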

  19. For More Info:
      • 4th Workshop on Temporal Data Mining: Network Reconstruction from Dynamic Data
        • http://www.kdd2006.com/workshops.html
      • Laxman, Sastry and Unnikrishnan. “Discovering Frequent Episodes and Learning Hidden Markov Models: A Formal Connection.” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517, 2005.

  20. Network Reconstruction
      • How to determine directed, acyclic graphs from sequential event data
      • Figure labels: x, z, a, n, p, g

  21. Multilingual Problem
      • What if source text is not in English?

  22. Machine Translation (MT)
      • Free, web-based tools are not state-of-the-art, e.g. http://babelfish.altavista.com/
      • LanguageWeaver uses statistical MT
        • Spin-off of USC Information Sciences Institute
        • www.languageweaver.com

  24. Hypothesis
      • Effective Content Extraction rules can be custom-developed for raw machine-translated text.

  25. Summary
      • Text Mining Can Offer Real Value
      • Used Extensively by Gov’t Intel Agencies
      • Several COTS tools available for Content Extraction:
        • SAS Text Miner
        • AeroText (Lockheed Martin)
        • ClearForest
        • Attensity
        • etc.
      • GATE – Univ. of Sheffield, open-source
        • http://gate.ac.uk/
