Leveraging the Unstructured Data

Presented at ICE 2002 Edmonton AB, Canada October 22, 2002. Leveraging the Unstructured Data. Kas Kasravi EDS Fellow. Topics. Summary What is Unstructured Data? The Value in Unstructured Data Technologies Text Mining Audio Mining Image Mining Unstructured Data Management Issues


Presentation Transcript


  1. Presented at ICE 2002 Edmonton AB, Canada October 22, 2002 Leveraging the Unstructured Data Kas Kasravi EDS Fellow

  2. Topics • Summary • What is Unstructured Data? • The Value in Unstructured Data • Technologies • Text Mining • Audio Mining • Image Mining • Unstructured Data Management Issues • Examples / Demos

  3. Summary • Unstructured data consists of text, audio, images etc. • Technologies and tools exist for leveraging the value in unstructured data • Unstructured data contains significant business value • The value in unstructured data is mostly untapped • A paradigm shift is needed

  4. What is Unstructured Data? • Any data without a well-defined model for information access • Examples: • Word documents • E-mails • Examples of structured data: • Database tables • Objects • XML tags

  5. Unstructured Data Management (UDM) • The process of mining and analyzing unstructured data to capture actionable information • Market size • $100M for text mining • Much greater IT impact

  6. The Value in Unstructured Data • “Amount of text-based data alone will grow to over 800 terabytes by 2004” - Forrester • “Amount of unstructured data in large corporations doubles every 2 months” - IDC • Companies with a UDM system in place are, on average, at least 15% more productive - Basex

  7. The Value in Unstructured Data • “The average knowledge worker spends 2.5 hours per day searching for documents” - IDC – March 2002 • “80-90% of information on the net and corporate networks is unstructured” - Goldman Sachs • If only we could know what we know

  8. UDM Increases Informational Content • Structured Data (10-40%) + Unstructured Data (60-90%)

  9. UDM Complements Structured Data • Diagram labels: Structured Data, Unstructured Data, UDM, Contextual Information, Consolidated Data, Greater Value

  10. The Value in Unstructured Data • Business Value • Better information • More timely information • More relevant information • Better decision support • IT Impact • More information to store and manage • More complex analysis • Great business impact • Source: META Group, 9/20/2001

  11. UDM Technologies

  12. Text Mining • The process of extracting information from textual data, and utilizing it for better business decisions • Based on multiple technologies, e.g., • Computational linguistics • Statistics • A new business intelligence tool • Focus on semantics and not keywords • An emerging technology

  13. Computational Linguistics • Definition • Study of computer algorithms for: • Natural language understanding • Natural language generation • Objectives • Machine translation • Information retrieval • Human-Machine interface • Early work began in 1950s

  14. Syntax Analysis • Structure Determination: Generation of a parse tree using a grammar • Regularizing the syntactic structure • Restricting the large number of possible structures to a small number • Example parse tree: Sentence → Subject (Mary) + Verb Phrase; Verb Phrase → Verb (eats) + Object (cheese)
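
The "Mary eats cheese" parse tree on the slide above can be reproduced with a very small sketch. The three-word lexicon and the single rule (Sentence = Subject + Verb Phrase, Verb Phrase = Verb + Object) are assumptions made for illustration; a real grammar would be far larger and would need to handle ambiguity.

```python
# Toy lexicon (assumed for illustration): maps each word to its grammatical role.
LEXICON = {"Mary": "Subject", "eats": "Verb", "cheese": "Object"}


def parse(sentence):
    """Build a nested-tuple parse tree for a Subject-Verb-Object sentence.

    Raises ValueError when the words do not fit the toy grammar.
    """
    words = sentence.split()
    roles = [LEXICON.get(word) for word in words]
    if roles != ["Subject", "Verb", "Object"]:
        raise ValueError(f"cannot parse: {sentence!r}")
    subject, verb, obj = words
    return (
        "Sentence",
        ("Subject", subject),
        ("Verb Phrase", ("Verb", verb), ("Object", obj)),
    )


tree = parse("Mary eats cheese")
```

Restricting the grammar to one rule is what keeps the number of candidate structures at exactly one here; any other word order is rejected.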

  15. Semantic Analysis • Example of ambiguities at the syntactic level: “I saw a man in the park with a telescope” • Semantic analysis is required • Synonyms • Deep parsing • Prior knowledge

  16. Categories of Text-Mining • Feature Extraction • Entities (e.g., names, companies, places) • Events (e.g., mergers, elections, sales) • Relations among entities and events • Document Categorization • Grouping multiple articles based on their contextual similarities • Summarization • A condensed version of one or more documents • Thematic Analysis • Discovery of the theme/context within a document
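
Document categorization, as listed above, groups articles by similarity. A minimal sketch of one common similarity measure, cosine similarity over word counts, follows. This is a keyword-level simplification chosen for brevity and is not the semantic approach the presentation advocates; the sample sentences are invented.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents using raw word counts."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample documents.
bank_1 = "bank profits rise as banks report record earnings"
bank_2 = "record profits reported by the bank this season"
music = "the orchestra played a new symphony last night"

same_topic = cosine_similarity(bank_1, bank_2)
other_topic = cosine_similarity(bank_1, music)
```

Documents whose pairwise score exceeds a chosen threshold could then be placed in the same group.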

  17. Example of Feature Extraction • Document: “Profits at Canada’s six big banks topped C$6 billion ($4.4 billion) in 1996, smashing last year’s C$5.2 billion ($3.8 billion) record as Canadian Imperial Bank of Commerce and National Bank of Canada wrapped up the earnings season Thursday. The six banks each reported a double-digit jump in net income for a combined profit of C$6.26 billion ($4.6 billion) in fiscal 1996 ended Oct. 31.” • Extracted Information • Event: Profits topped C$6 billion • Country: Canada • Entity: Big banks • Organization: Canadian Imperial Bank of Commerce • Organization: National Bank of Canada • Date: Earnings season • Date: Fiscal 1996
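
A rough sketch of how some of the fields in the feature-extraction example might be pulled out of the first sentence of the news item. The regular expressions and the `KNOWN_ORGS` lookup list are assumptions for illustration only; commercial text-mining tools rely on linguistic analysis rather than hand-written patterns like these.

```python
import re

TEXT = (
    "Profits at Canada's six big banks topped C$6 billion ($4.4 billion) in 1996, "
    "smashing last year's C$5.2 billion ($3.8 billion) record as Canadian Imperial "
    "Bank of Commerce and National Bank of Canada wrapped up the earnings season Thursday."
)

# Hypothetical gazetteer of organization names to look for.
KNOWN_ORGS = ["Canadian Imperial Bank of Commerce", "National Bank of Canada"]


def extract_features(text):
    """Extract Canadian-dollar amounts, known organizations, and years."""
    amounts = [float(a) for a in re.findall(r"C\$(\d+(?:\.\d+)?) billion", text)]
    organizations = [org for org in KNOWN_ORGS if org in text]
    years = re.findall(r"\b(?:19|20)\d{2}\b", text)
    return {
        "amounts_cad_billion": amounts,
        "organizations": organizations,
        "years": years,
    }


features = extract_features(TEXT)
```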

  18. Sources of Text • Textual documents • Corporate intranets • News • Chat rooms • Web pages • E-mails • Faxes • etc.

  19. Sample Applications • News analysis for evidence gathering • Patent analysis • E-mail routing • Competitive intelligence • Warranty claims analysis • CRM • Content management • Market research • Recruiting • eLearning • Automated help-desks • Chat room monitoring • Web page monitoring • Document clustering • Legacy document conversion • Machine translation • Knowledge management • Intelligent search engines • e-Procurement

  20. Audio Mining • Analysis of audio data • Speech • Music • Other sounds • Goal: Extract information from audio • Who is the speaker • What is said • Defect detection • Music identification • Sonar object recognition • Telecommunications monitoring

  21. Audio Mining • Analysis is based on audio attributes, e.g., • Volume • Pitch • Timbre • Sources of audio for analysis • Voice recordings • Factory sounds • Telecommunications • Broadcasts • etc.
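
Two of the audio attributes named above, volume and pitch, can be estimated from raw samples with a few lines of code. The sketch below assumes a clean, single-tone signal: it measures volume as root-mean-square amplitude and estimates pitch by counting zero crossings. The 440 Hz test tone and the 8000 Hz sample rate are arbitrary choices; real recordings would need more robust methods.

```python
import math


def volume_rms(samples):
    """Root-mean-square amplitude, a simple proxy for volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def pitch_zero_crossings(samples, sample_rate):
    """Estimate pitch in Hz from the number of sign changes (two per cycle)."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2 * duration)


SAMPLE_RATE = 8000  # samples per second (assumed)
# One second of a synthetic 440 Hz tone with amplitude 0.5.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

volume = volume_rms(tone)
pitch = pitch_zero_crossings(tone, SAMPLE_RATE)
```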

  22. Sample Applications • Broadcast content management • Call center automation • CRM • Manufacturing quality control • Music retrieval (query by humming) • Security • etc.

  23. Image Mining • Analysis of digital images • Pictures • Drawings • Videos • Goal: Extract information from images • Face recognition • Defect detection • Object recognition • Action/event detection

  24. Image Mining • Analysis is based on spatial attributes e.g., • Color • Size • Texture (macro and micro) • Shapes/outlines/shadows • Sources of images • Digital photographs • Surveillance cameras • Satellite images • Broadcasts • etc.
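
As a small illustration of the spatial attributes listed above, the sketch below computes the size (bounding box) and average color of the non-background region in a tiny image. Representing the image as a nested list of RGB tuples and treating pure black as background are assumptions made to keep the example dependency-free.

```python
def describe_region(image, background=(0, 0, 0)):
    """Return width, height, and mean color of all non-background pixels."""
    coords = [
        (r, c)
        for r, row in enumerate(image)
        for c, pixel in enumerate(row)
        if pixel != background
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    pixels = [image[r][c] for r, c in coords]
    mean_color = tuple(sum(channel) / len(pixels) for channel in zip(*pixels))
    return {
        "width": max(cols) - min(cols) + 1,
        "height": max(rows) - min(rows) + 1,
        "mean_color": mean_color,
    }


BLACK, RED = (0, 0, 0), (255, 0, 0)
# A 3x4 image containing a 2x2 red square.
image = [
    [BLACK, BLACK, BLACK, BLACK],
    [BLACK, RED, RED, BLACK],
    [BLACK, RED, RED, BLACK],
]
region = describe_region(image)
```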

  25. Sample Applications • Manufacturing quality control • Broadcast content management • Remote Sensing • Security and authentication • Forensics • Video logs • Geophysics • Aerial Photogrammetry • etc.

  26. Unstructured Data Management Issues • Metrics • Commercial Tools • Related Technologies • Challenges

  27. Metrics • Accuracy • Percentage of extracted information that is correct • Thoroughness • Percentage of the facts present in the source that were extracted • Focus • Percentage of extracted information that is relevant and useful
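
The three metrics above can be computed from sets of facts. In the sketch below, accuracy and thoroughness correspond to what information retrieval usually calls precision and recall; that mapping, the function name, and the sample fact labels are assumptions for illustration.

```python
def extraction_metrics(extracted, present, relevant):
    """Compute accuracy, thoroughness, and focus as fractions between 0 and 1.

    extracted: facts the tool produced
    present:   facts actually stated in the source (ground truth)
    relevant:  facts a reviewer judged relevant and useful
    """
    extracted, present, relevant = set(extracted), set(present), set(relevant)
    correct = extracted & present
    accuracy = len(correct) / len(extracted) if extracted else 0.0
    thoroughness = len(correct) / len(present) if present else 0.0
    focus = len(extracted & relevant) / len(extracted) if extracted else 0.0
    return {"accuracy": accuracy, "thoroughness": thoroughness, "focus": focus}


# Invented example: four facts extracted, five facts present, two judged relevant.
metrics = extraction_metrics(
    extracted={"a", "b", "c", "d"},
    present={"a", "b", "c", "e", "f"},
    relevant={"a", "b"},
)
```

Here three of the four extracted facts are correct (accuracy 0.75), three of the five facts present were found (thoroughness 0.6), and two of the four extracted facts are relevant (focus 0.5).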

  28. Sample UDM Tools* • APR/Smartlogik • Autonomy • Clairvoyance • ClearForest • Entrieva • Insightful • MAMI • Megaputer * Partial list of products. The author does not recommend any products.

  29. Related Technologies • Business Intelligence • Knowledge Management • Content Management • eLearning • Collaboration • Innovation Management • Sales Force Automation • Data Mining • Visualization

  30. Sample Architecture • Diagram labels: Structured Data, Unstructured Data, Information Extraction, Rules, Data Warehouse, Analysis and Decision Support, Visualization

  31. Challenges • Paradigm Shift • Which applications? What to analyze? • Where’s the ROI? What are the risks? • Business Readiness • Technology Maturity • Ambiguity Resolution • Uniqueness • Timeliness • Context • Testing • Efficacy • Adverse Reaction

  32. Bibliography • Computational Auditory Scene Analysis, by David F. Rosenthal (Editor), Hiroshi G. Okuno (Editor) • Computational Linguistics: An Introduction, by Ralph Grishman, Cambridge University Press • Elements of Photogrammetry with Applications in GIS, by Paul R. Wolf, Bon A. Dewitt • Emerging Solutions for Managing Unstructured CRM Data, by Richard Peynot (Giga Information Group) • Foundations of Statistical Natural Language Processing, by Christopher D. Manning, Hinrich Schütze • Proceedings of ACM – SIGKDD: Knowledge Discovery and Data Mining Conference, 1999

  33. Examples / Demos

  34. Examples / Demos • EDS - Bank of Knowledge • BBC - Neon • ClearForest – Patent Analysis • Insightful – Aerial Photogrammetry* • EDS – Securities Fraud Detection* * Not included in handouts and files

  35. Bank of Knowledge EDS Global Purchasing

  36. EDS Bank of Knowledge Project • Supply chain intelligence • EDS has 35,000+ active supplier contracts • Not possible to read, understand, and utilize all the terms in those contracts • Revenue and cost reduction opportunities are lost because some contractual terms are not known and not enforced, e.g., • Discounts offer cost reductions • Refunds offer revenue generation

  37. Features • Modules • Spend Management • Compliance Management • Supplier Intelligence • Contracts Management • Seamless integration of technologies • Text-mining • Data mining • Business intelligence • Advanced visualization

  38. Sample Contract Attributes • Pricing • Discounts • Margins • Pro-rata Refunds • Levels of Support • License Clauses • Contract Amendments • Confidentiality • Warranty Information • Freight Information • BOK correlates the above attributes with other procurement functions

  39. Metrics • $4000+ average cost to manually create, execute, manage and track a single contract • Average $2,000 savings per contract reviewed/enforced • 3-5% cost savings based on addressable spend patterns • 12-15% improved productivity • ROI in about 6 months of usage
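
The payback arithmetic behind figures like these can be sketched as follows. Only the $2,000 average savings per reviewed contract comes from the slide; the system cost and the monthly review volume are hypothetical inputs, chosen only so the arithmetic is easy to follow, and are not figures from the EDS project.

```python
def payback_months(system_cost, contracts_reviewed_per_month, savings_per_contract=2000):
    """Months needed for cumulative savings to equal the system cost."""
    monthly_savings = contracts_reviewed_per_month * savings_per_contract
    if monthly_savings <= 0:
        raise ValueError("monthly savings must be positive")
    return system_cost / monthly_savings


# Hypothetical inputs: a $3,000,000 system and 250 contracts reviewed per month.
months = payback_months(system_cost=3_000_000, contracts_reviewed_per_month=250)
```

With these assumed inputs, monthly savings are 250 × $2,000 = $500,000, so the cost is recovered in 6 months.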

  40. BBC Neon APR/Smartlogik

  41. BBC Neon • NEws information Online • A BBC application developed by APR/Smartlogik • Robust online news archive solution • Concept-based news search engine • 5000 users • Provides a varied selection of news publications • Retention of the BBC's existing taxonomy combined with the flexibility to update this structure Example Courtesy APR and BBC

  42. Patent Analysis ClearForest

  43. Patent Analysis • Licensing • Development • Asset Evaluation • Recruiting opportunities • Prosecution • Litigation

  44. Example – Patent Analysis with ClearForest • Most referenced patents

  45. Most active inventors

  46. Most active companies

  47. Fuel cell inventors working together

  48. Link to any patent document, highlighting key words
