A Survey of Approaches on Mining the Structure from Unstructured Data

Dutch-Belgian Database Day 2009 (DBDBD 2009)

Introduction
• A lot of data is generated every day
• Difficult to find information that meets one’s needs
• There is a need to mine the structure of data as a first step towards understanding it
• Part of the effort to make the Web machine-understandable
• Solution: employ NLP techniques to extract knowledge from unstructured text written in natural language

Which Technique to Choose?

Statistics-Based NLP (1)
• Utilize statistics and mathematical models based on probability theory
• Refers to all non-symbolic and non-logical work on NLP, i.e., it encompasses all quantitative approaches to automated language processing, including:
  • Probabilistic modeling
  • Information theory
  • Linear algebra
• Phrases extracted from text written in an arbitrary natural language are analyzed in order to find (statistical) relations

Statistics-Based NLP (2)
• Word-based:
  • Statistics collection on words
  • Frequency counting and ranking generation (e.g., TF-IDF)
  • Collocations (cliff-hanger, eye candy, take care, profit announcement, etc.)
  • Word Sense Disambiguation (WSD)
  • Inference models: n-grams
  • Clustering
• Grammar-based:
  • Part-Of-Speech (POS) tagging
  • Stochastic Context-Free Grammars (SCFG)
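As an illustration of the frequency counting and ranking mentioned above, the following is a minimal TF-IDF sketch. It assumes whitespace tokenization, relative term frequency, and an unsmoothed logarithmic IDF; the function name `tf_idf` and the toy documents are hypothetical and not taken from the slides.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (assumption): three short "documents".
docs = [
    "company announces profit",
    "company announces collaboration",
    "company reports loss",
]
scores = tf_idf(docs)
# Rank the terms of the first document from most to least distinctive.
ranking = sorted(scores[0], key=scores[0].get, reverse=True)
print(ranking)
```

A term that occurs in every document (here "company") receives weight zero, which is why the ranking favors terms that are specific to a single document.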

Statistics-Based NLP (3)
• Advantages:
  • Not based on knowledge, thus they do not require linguistic resources, nor do they require expert knowledge
  • Issues regarding leaking grammars, inconsistencies among humans, dialects, etc. are alleviated
• Disadvantages:
  • Often need a large amount of data
  • Approaches do not deal with meaning explicitly, i.e., statistical methods discover relations in corpora without considering semantics

Statistics-Based NLP (4)
• Examples:
  • (Bannard et al., 2003) discuss several techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions:
    • Collocation-like approach, frequency counting
    • Focus on mining relations between words
  • (Taira and Soderland, 1999) implement a statistical natural language processor:
    • Based on resonance probabilities between word pairs
    • Uses word affinity knowledge from training sentences
    • Focus on acquiring knowledge from radiology reports
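To make the idea of a collocation-like, frequency-based approach more concrete, here is a small sketch that scores adjacent word pairs by pointwise mutual information (PMI). PMI is a common association measure chosen here for illustration; it is not claimed to be the measure used in either cited work, and `bigram_pmi`, `min_count`, and the sample text are assumptions.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0: the pair co-occurs more often than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy text (assumption); a real corpus would be far larger.
text = "take care of the dog and take care of the cat and take the bus"
pmi = bigram_pmi(text.split())
for pair, score in sorted(pmi.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

On a corpus this small the scores are only indicative, which reflects the slide's point that statistical approaches often need a large amount of data.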

Pattern-Based NLP (1)
• Use linguistic patterns to extract data from texts
• Patterns can be:
  • Predefined
  • Discovered (learned)
• Knowledge used:
  • Lexical knowledge
  • Syntactic knowledge
  • Semantic knowledge

Pattern-Based NLP (2)
• Lexico-syntactic patterns:
  • Combine lexical and syntactic elements with regular expressions
  • E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?” mines a corpus for information on mergers and collaborations of companies and/or persons
• Lexico-semantic patterns:
  • Enrich lexico-syntactic patterns through the addition of semantics
  • Gazetteers (simple typing):
    • Use linguistic meaning of text
    • E.g., “[sub:company] announces collaboration with [obj:company]”
  • Ontologies (complex typing):
    • Include also relationships
    • E.g., “[kb:Company] kb:collaborates [kb:Company]”
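The lexico-syntactic example above can be approximated with an ordinary regular expression over POS-tagged tokens. The sketch below assumes the input has already been tagged (the tags are written by hand here), encodes each token as `word/TAG`, and uses a simplified version of the slide's pattern. The toy gazetteer `COMPANIES` then adds the simple typing of a lexico-semantic pattern. All names and the sample sentence are hypothetical.

```python
import re

# Hand-tagged input (assumption); in practice a POS tagger would produce these pairs.
tagged = [
    ("Philips", "NNP"), (",", ","), ("Shell", "NNP"), ("and", "CC"),
    ("Unilever", "NNP"), ("announce", "VBP"), ("collaboration", "NN"),
    ("with", "IN"), ("ASML", "NNP"), (".", "."),
]

# Encode each token as "word/TAG" so that one regular expression can mix
# lexical constraints (literal words) with syntactic ones (tags).
encoded = " ".join(f"{word}/{tag}" for word, tag in tagged)

# One whole proper-noun token; the lookarounds keep matches token-aligned.
NNP = r"(?:(?<!\S)\S+/NNP(?!\S))"
pattern = re.compile(
    rf"(?P<subjects>(?:{NNP} ,/, )*{NNP}(?: ,/,)? and/CC {NNP}) "
    rf"(?P<verb>announce|discuss)/VB\w? collaboration/NN"
    rf"(?: with/IN (?P<partner>{NNP}))?"
)

match = pattern.search(encoded)  # None if the sentence does not fit the pattern
subjects = re.findall(r"(\S+)/NNP", match.group("subjects"))
partner = match.group("partner").split("/")[0] if match.group("partner") else None

# Toy gazetteer (assumption): accept the match only if every name is a known company.
COMPANIES = {"Philips", "Shell", "Unilever", "ASML"}
names = subjects + ([partner] if partner else [])
all_companies = all(name in COMPANIES for name in names)

print(subjects, match.group("verb"), partner, all_companies)
```

Keeping the pattern as a plain regular expression makes the result easy to inspect, but every new verb or construction has to be added by hand, which mirrors the maintenance burden of pattern-based approaches.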

Pattern-Based NLP (3)
• Advantages:
  • Need less training data
  • Complex expressions can be defined
  • Results are easily interpretable
• Disadvantages:
  • Lexical knowledge is required
  • Prior expert/domain knowledge might be required (for lexico-semantic patterns)
  • Defining and maintaining patterns is a cumbersome and non-trivial task

Pattern-Based NLP (4)
• Examples:
  • CAFETIERE (Black et al., 2005):
    • Employs extraction rules defined at lexico-semantic level
    • Makes use of gazetteering
    • Knowledge is stored using Narrative Knowledge Representation Language (NKRL)
    • Knowledge base lacks reasoning support
    • Focus on extracting relations from corpora
  • Hermes (Frasincar et al., 2009):
    • Patterns defined at lexico-semantic level
    • Makes use of ontologies and reasoning engines
    • Knowledge is based on an OWL domain ontology
    • Focus on the use of pattern-based NLP in building personalized news services

Hybrid NLP (1)
• Combine linguistic knowledge with statistical methods
• In practice, it is usually difficult to stay within the boundaries of a single approach
• Thus, it is convenient to combine the best of both worlds:
  • Bootstrapping lexical methods
  • Solving lack of expert knowledge by applying statistical methods
  • Statistical methods that use some present (lexical) knowledge

Hybrid NLP (2)
• Advantages:
  • Solve problems related to scaling and required expert knowledge of pattern-based approaches
  • Do not require as much data as statistical approaches
  • Inherit some of the advantages of both statistical and pattern-based approaches
• Disadvantages:
  • By combining different techniques, maintaining completeness and accuracy of the systems becomes more difficult
  • Multidisciplinary aspects
  • Inherit some of the disadvantages of both statistical and pattern-based approaches

Hybrid NLP (3)
• Examples:
  • Corpus-Based Statistics-Oriented techniques (Su et al., 1996):
    • Mainly statistical learning techniques, guided by high-level linguistic constructs
    • Applications in POS tagging, semantic analysis of corpora, machine translation, annotation, etc.
    • Focus is on extracting inductive knowledge from corpora to support building large scale NLP systems
  • PANKOW (Cimiano et al., 2004):
    • Generates instances of lexico-syntactic patterns indicating a certain semantic or ontological relation
    • Counts number of occurrences of patterns
    • Statistical distribution of instances of these patterns constitutes the collective knowledge
    • Focus is on supporting annotation
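The generate-and-count idea described for PANKOW can be sketched as follows: instantiate a few lexico-syntactic templates for an instance and each candidate concept, count how often the resulting phrases occur, and pick the concept with the most evidence. This is a toy sketch under stated assumptions, not the cited system: the templates, the naive "s" pluralization, the function `score_concepts`, and the in-memory corpus used in place of a large text collection are all hypothetical.

```python
from collections import Counter

# Illustrative templates (assumption), not the pattern set of the cited system.
TEMPLATES = [
    "{concept}s such as {instance}",
    "{instance} and other {concept}s",
    "{instance} is a {concept}",
]


def score_concepts(instance, concepts, corpus):
    """Count how often each instantiated pattern occurs in the corpus."""
    text = " ".join(corpus).lower()
    counts = Counter()
    for concept in concepts:
        for template in TEMPLATES:
            # Naive pluralization: only works for concepts that take a plain "s".
            phrase = template.format(concept=concept, instance=instance).lower()
            counts[concept] += text.count(phrase)
    return counts


# Toy corpus (assumption) standing in for a large text collection.
corpus = [
    "Towns such as Delft attract many visitors.",
    "Delft is a town in the Netherlands.",
    "Delft and other towns signed the agreement.",
    "Rivers such as the Maas cross the border.",
]
counts = score_concepts("Delft", ["town", "river"], corpus)
best_concept = counts.most_common(1)[0][0]
print(counts, best_concept)
```

The patterns supply the linguistic knowledge and the counts supply the statistics, which is what makes the approach hybrid.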

Conclusions
• Three main approaches to NLP:
  • Statistics-based
  • Pattern-based
  • Hybrid
• Which techniques to use for your NLP tasks? There is no single best approach, but consider these rough guidelines:
  • Evaluate your problem, preferences, and available resources
  • If you are less concerned with semantics and you assume that knowledge lies within statistical facts on a specific corpus, use a statistics-based approach
  • If you are concerned with the semantics of discovered information, or you want to be able to easily explain and control the results, use a pattern-based approach
  • If you need to bootstrap a pattern-based approach using statistics (e.g., insufficient knowledge available) or the other way around (e.g., need of a priori knowledge), use a hybrid approach

References
• C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65-72. Association for Computational Linguistics, 2003.
• W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi. CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester, 2005.
• P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th International Conference on World Wide Web (WWW 2004), pages 462-471. ACM, 2004.
• F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35-53, 2009.
• K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing. Computational Linguistics and Chinese Language Processing, 1(1):101-157, 1996.
• R. K. Taira and S. G. Soderland. A statistical natural language processor for medical reports. In AMIA Symposium 1999, pages 970-974. American Medical Informatics Association, 1999.
