Presented at ICE 2002 Edmonton AB, Canada October 22, 2002. Leveraging the Unstructured Data. Kas Kasravi EDS Fellow. Topics. Summary What is Unstructured Data? The Value in Unstructured Data Technologies Text Mining Audio Mining Image Mining Unstructured Data Management Issues
Presented at ICE 2002 Edmonton AB, Canada October 22, 2002 Leveraging the Unstructured Data Kas Kasravi EDS Fellow
Topics • Summary • What is Unstructured Data? • The Value in Unstructured Data • Technologies • Text Mining • Audio Mining • Image Mining • Unstructured Data Management Issues • Examples / Demos
Summary • Unstructured data consists of text, audio, images, etc. • Technologies and tools exist for leveraging the value in unstructured data • Unstructured data contains significant business value • The value in unstructured data is mostly untapped • A paradigm shift is needed
What is Unstructured Data? • Any data without a well-defined model for information access • Examples: • Word documents • E-mails • Examples of what is structured: • Database tables • Objects • XML tags
Unstructured Data Management (UDM) • The process of mining and analyzing unstructured data to capture actionable information • Market size • $100M for text mining • Much greater IT impact
The Value in Unstructured Data • “Amount of text-based data alone will grow to over 800 terabytes by 2004” - Forrester • “Amount of unstructured data in large corporations doubles every 2 months” - IDC • Companies with a UDM system in place are, on average, at least 15% more productive - Basex
The Value in Unstructured Data • “The average knowledge worker spends 2.5 hours per day searching for documents” - IDC, March 2002 • “80-90% of information on the net and corporate networks is unstructured” - Goldman Sachs • If only we could know what we know
UDM Increases Informational Content • Structured Data (10-40%) + Unstructured Data (60-90%)
UDM Complements Structured Data [Diagram labels: Structured Data, Consolidated Data, Unstructured Data, UDM, Contextual Information, Greater Value]
The Value in Unstructured Data • Business Value • Better information • More timely information • More relevant information • Better decision support • IT Impact • More information to store and manage • More complex analysis • Great business impact Source: META Group, 9/20/2001
Text Mining • The process of extracting information from textual data, and utilizing it for better business decisions • Based on multiple technologies, e.g., • Computational linguistics • Statistics • A new business intelligence tool • Focus on semantics and not keywords • An emerging technology
Computational Linguistics • Definition • Study of computer algorithms for: • Natural language understanding • Natural language generation • Objectives • Machine translation • Information retrieval • Human-Machine interface • Early work began in the 1950s
Syntax Analysis • Structure Determination: Generation of a parse tree using a grammar • Regularizing the syntactic structure • Restricting the large number of possible structures to a small number • Example parse tree: Sentence → Subject (Mary) + Verb Phrase; Verb Phrase → Verb (eats) + Object (cheese)
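The parse-tree idea on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the three-word lexicon and the single rule set (Sentence → Subject + Verb Phrase, Verb Phrase → Verb + Object) are assumptions made for the "Mary eats cheese" example, not a general grammar or any tool from the talk.

```python
# Toy grammar mirroring the slide:
#   Sentence   -> Subject VerbPhrase
#   VerbPhrase -> Verb Object
# The lexicon is a hypothetical illustration, not part of the original deck.
LEXICON = {
    "mary": "Subject",
    "eats": "Verb",
    "cheese": "Object",
}


def parse(sentence):
    """Return a nested-tuple parse tree, or raise ValueError if the grammar rejects the input."""
    words = sentence.split()
    tags = [LEXICON.get(w.lower()) for w in words]
    if tags != ["Subject", "Verb", "Object"]:
        raise ValueError(f"cannot parse: {sentence!r}")
    subject, verb, obj = words
    return (
        "Sentence",
        ("Subject", subject),
        ("VerbPhrase", ("Verb", verb), ("Object", obj)),
    )


def show(tree, depth=0):
    """Render the tree as a list of indented text lines."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [f"{'  ' * depth}{label}: {children[0]}"]
    lines = [f"{'  ' * depth}{label}"]
    for child in children:
        lines.extend(show(child, depth + 1))
    return lines


tree = parse("Mary eats cheese")
print("\n".join(show(tree)))
```

A real parser would consider many candidate structures and prune them, which is the "restricting" step the slide mentions; this sketch accepts exactly one structure.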
Semantic Analysis • Example of ambiguities at the syntactic level: “I saw a man in the park with a telescope” • Semantic analysis is required • Synonyms • Deep parsing • Prior knowledge
Categories of Text-Mining • Feature Extraction • Entities (e.g., names, companies, places) • Events (e.g., mergers, elections, sales) • Relations among entities and events • Document Categorization • Grouping multiple articles based on their contextual similarities • Summarization • A condensed version of one or more documents • Thematic Analysis • Discovery of the theme/context within a document
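Document categorization, as listed above, can be approximated by grouping documents whose word counts are similar. The sketch below is an assumption-laden simplification: it uses bag-of-words cosine similarity and a greedy grouping with a made-up threshold of 0.3, whereas the text-mining tools discussed in this talk aim at semantics rather than keywords. The sample documents are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 when either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group(docs, threshold=0.3):
    """Greedy grouping: each document joins the first group whose seed is similar enough."""
    groups = []  # list of (seed_vector, [document indices])
    for i, text in enumerate(docs):
        vec = bag_of_words(text)
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            groups.append((vec, [i]))
    return [members for _, members in groups]


# Invented sample documents.
docs = [
    "bank profits rise as banks report record earnings",
    "record earnings lift bank profits",
    "new patent filed for speech recognition",
    "speech recognition patent granted",
]
print(group(docs))  # → [[0, 1], [2, 3]]
```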
Example of Feature Extraction
Document: “Profits at Canada’s six big banks topped C$6 billion ($4.4 billion) in 1996, smashing last year’s C$5.2 billion ($3.8 billion) record as Canadian Imperial Bank of Commerce and National Bank of Canada wrapped up the earnings season Thursday. The six banks each reported a double-digit jump in net income for a combined profit of C$6.26 billion ($4.6 billion) in fiscal 1996 ended Oct. 31.”
Extracted Information • Event: Profits topped C$6 • Country: Canada • Entity: Big banks • Organization: Canadian Imperial Bank of Commerce • Organization: National Bank of Canada • Date: Earnings season • Date: Fiscal 1996
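A rough sense of how such extraction might work on the sample text can be had with pattern matching. The sketch below is hypothetical: the organization list (a tiny "gazetteer") and the regular expressions are assumptions chosen for this one passage, and commercial tools rely on much richer linguistic analysis.

```python
import re

TEXT = (
    "Profits at Canada's six big banks topped C$6 billion ($4.4 billion) in 1996, "
    "smashing last year's C$5.2 billion ($3.8 billion) record as Canadian Imperial "
    "Bank of Commerce and National Bank of Canada wrapped up the earnings season "
    "Thursday."
)

# Hypothetical gazetteer of known organizations; a real system would use
# far larger lists or a trained model.
ORGANIZATIONS = ["Canadian Imperial Bank of Commerce", "National Bank of Canada"]

# Canadian-dollar amounts such as "C$5.2 billion".
MONEY = re.compile(r"C\$\d+(?:\.\d+)? (?:million|billion)")

# Four-digit years in the 1900s or 2000s.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def extract(text):
    """Return a dict of extracted organizations, amounts, and years."""
    return {
        "Organization": [org for org in ORGANIZATIONS if org in text],
        "Amount": MONEY.findall(text),
        "Year": YEAR.findall(text),
    }


for label, values in extract(TEXT).items():
    print(f"{label}: {values}")
```

Note that this only finds surface patterns; recognizing "profits topped" as an event, as in the slide's example, requires the syntactic and semantic analysis described earlier.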
Sources of Text • Textual documents • Corporate intranets • News • Chat rooms • Web pages • E-mails • Faxes • etc.
Sample Applications • News analysis for evidence gathering • Patent analysis • E-mail routing • Competitive intelligence • Warranty claims analysis • CRM • Content management • Market research • Recruiting • eLearning • Automated help-desks • Chat room monitoring • Web page monitoring • Document clustering • Legacy document conversion • Machine translation • Knowledge management • Intelligent search engines • e-Procurement
Audio Mining • Analysis of audio data • Speech • Music • Other sounds • Goal: Extract information from audio • Who is the speaker • What is said • Defect detection • Music identification • Sonar object recognition • Telecommunications monitoring
Audio Mining • Analysis is based on audio attributes, e.g., • Volume • Pitch • Timbre • Sources of audio for analysis • Voice recordings • Factory sounds • Telecommunications • Broadcasts • etc.
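Two of the attributes above, volume and pitch, can be estimated from raw samples with very simple arithmetic. The sketch below is an illustration under stated assumptions: it analyzes a synthetic sine tone, the 8000 Hz sample rate is arbitrary, volume is taken as root-mean-square amplitude, and pitch is estimated by counting zero crossings, which only works for clean single-tone signals. The third attribute listed, which concerns the shape of the spectrum, is not covered here.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this illustration)


def sine_tone(freq_hz, seconds=1.0, amplitude=0.5):
    """Generate a synthetic sine wave as a list of float samples."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def rms_volume(samples):
    """Root-mean-square amplitude, a common proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples):
    """Estimate pitch in Hz by counting sign changes; each full cycle has two."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    duration = len(samples) / SAMPLE_RATE
    return crossings / (2 * duration)


tone = sine_tone(440)
print("volume (RMS):", round(rms_volume(tone), 3))
print("pitch (Hz, approximate):", round(zero_crossing_pitch(tone), 1))
```

For speech, music, or factory sounds, which mix many frequencies, a spectral method such as a Fourier transform would normally replace the zero-crossing estimate.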
Sample Applications • Broadcast content management • Call center automation • CRM • Manufacturing quality control • Music retrieval (query by humming) • Security • etc.
Image Mining • Analysis of digital images • Pictures • Drawings • Videos • Goal: Extract information from images • Face recognition • Defect detection • Object recognition • Action/event detection
Image Mining • Analysis is based on spatial attributes e.g., • Color • Size • Texture (macro and micro) • Shapes/outlines/shadows • Sources of images • Digital photographs • Surveillance cameras • Satellite images • Broadcasts • etc.
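As a minimal sketch of what measuring spatial attributes can look like, the code below computes the size and outline (bounding box) of a bright object and the average intensity of a tiny grayscale grid. Everything here is assumed for illustration: the pixel values are made up, the threshold of 128 is arbitrary, and average brightness is only a crude stand-in for the color attribute.

```python
# A tiny 4x5 grayscale "image" (0 = dark background, 255 = bright object).
# The values are invented for this example.
IMAGE = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 0, 0],
    [0, 255, 255, 0, 0],
    [0, 0, 0, 0, 0],
]


def object_pixels(image, threshold=128):
    """Coordinates (row, col) of pixels brighter than the threshold."""
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v > threshold]


def size_and_outline(image, threshold=128):
    """Return the object's pixel count and bounding box (top, left, bottom, right)."""
    pixels = object_pixels(image, threshold)
    if not pixels:
        return 0, None
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return len(pixels), (min(rows), min(cols), max(rows), max(cols))


def mean_intensity(image):
    """Average brightness over all pixels."""
    values = [v for row in image for v in row]
    return sum(values) / len(values)


print(size_and_outline(IMAGE))  # → (4, (1, 1, 2, 2))
print(mean_intensity(IMAGE))    # → 51.0
```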
Sample Applications • Manufacturing quality control • Broadcast content management • Remote Sensing • Security and authentication • Forensics • Video logs • Geophysics • Aerial Photogrammetry • etc.
Unstructured Data Management Issues • Metrics • Commercial Tools • Related Technologies • Challenges
Metrics • Accuracy • Percentage of extracted information that is correct • Thoroughness • Percentage of the facts present in the source that were extracted • Focus • Percentage of extracted information that is relevant and useful
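These three metrics can be computed with simple set arithmetic once ground truth is available. The sketch below is one possible reading, not a definition from the talk: it treats accuracy as the share of extracted facts that are correct (what information retrieval calls precision), thoroughness as the share of facts present in the source that were extracted (recall), and focus as the share of extracted facts a user marked relevant. The function name and the sample fact sets are hypothetical.

```python
def udm_metrics(extracted, present, relevant):
    """
    extracted: set of facts the tool produced
    present:   set of facts actually in the source (ground truth)
    relevant:  set of facts a user judged relevant and useful
    Returns accuracy, thoroughness, and focus as fractions between 0 and 1.
    """
    correct = extracted & present
    return {
        "accuracy": len(correct) / len(extracted) if extracted else 0.0,
        "thoroughness": len(correct) / len(present) if present else 0.0,
        "focus": len(extracted & relevant) / len(extracted) if extracted else 0.0,
    }


# Invented example: the tool found four facts, one of which ("Toronto") is wrong.
extracted = {"Canada", "National Bank of Canada", "Fiscal 1996", "Toronto"}
present = {
    "Canada",
    "National Bank of Canada",
    "Fiscal 1996",
    "Canadian Imperial Bank of Commerce",
    "Earnings season",
}
relevant = {"National Bank of Canada", "Fiscal 1996"}

print(udm_metrics(extracted, present, relevant))
# → {'accuracy': 0.75, 'thoroughness': 0.6, 'focus': 0.5}
```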
Sample UDM Tools* • APR/Smartlogik • Autonomy • Clairvoyance • ClearForest • Entrieva • Insightful • MAMI • Megaputer * Partial list of products. The author does not recommend any products.
Related Technologies • Business Intelligence • Knowledge Management • Content Management • eLearning • Collaboration • Innovation Management • Sales Force Automation • Data Mining • Visualization
Sample Architecture [Diagram labels: Structured Data, Unstructured Data, Rules, Information Extraction, Data Warehouse, Analysis and Decision Support, Visualization]
Challenges • Paradigm Shift • Which applications? What to analyze? • Where’s the ROI? What are the risks? • Business Readiness • Technology Maturity • Ambiguity Resolution • Uniqueness • Timeliness • Context • Testing • Efficacy • Adverse Reaction
Examples / Demos • EDS - Bank of Knowledge • BBC - Neon • ClearForest – Patent Analysis • Insightful – Aerial Photogrammetry* • EDS – Securities Fraud Detection* * Not included in handouts and files
Bank of Knowledge EDS Global Purchasing
EDS Bank of Knowledge Project • Supply chain intelligence • EDS has 35,000+ active supplier contracts • Not possible to read, understand, and utilize all the terms in those contracts • Revenue and cost reduction opportunities are lost because some contractual terms are not known and not enforced, e.g., • Discounts offer cost reductions • Refunds offer revenue generation
Features • Modules • Spend Management • Compliance Management • Supplier Intelligence • Contracts Management • Seamless integration of technologies • Text-mining • Data mining • Business intelligence • Advanced visualization
Sample Contract Attributes • Pricing • Discounts • Margins • Pro-rata Refunds • Levels of Support • License Clauses • Contract Amendments • Confidentiality • Warranty Information • Freight Information BOK correlates the above attributes with other procurement functions
Metrics • $4,000+ average cost to manually create, execute, manage and track a single contract • Average $2,000 savings per contract reviewed/enforced • 3-5% cost savings based on addressable spend patterns • 12-15% improved productivity • ROI in about 6 months of usage
BBC Neon APR/Smartlogik
BBC Neon • NEws information Online • A BBC application developed by APR/Smartlogik • Robust online news archive solution • Concept-based news search engine • 5000 users • Provides a varied selection of news publications • Retention of the BBC's existing taxonomy combined with the flexibility to update this structure Example Courtesy APR and BBC
Patent Analysis ClearForest
Patent Analysis • Licensing • Development • Asset Evaluation • Recruiting opportunities • Prosecution • Litigation
Example – Patent Analysis with ClearForest • Most referenced patents