A Survey of Approaches on Mining the Structure from Unstructured Data Dutch-Belgian Database Day 2009 (DBDBD 2009)

Introduction
• A lot of data is generated every day
• It is difficult to find information that meets one’s needs
• There is a need to mine the structure of data as a first step towards understanding it
• This is part of the effort to make the Web machine-understandable
• Solution: employ NLP techniques to extract knowledge from unstructured text written in natural language

Which Technique to Choose?

Statistics-Based NLP (1)
• Utilize statistics and mathematical models based on probability theory
• Refers to all non-symbolic and non-logical work on NLP, i.e., it encompasses all quantitative approaches to automated language processing, including:
  • Probabilistic modeling
  • Information theory
  • Linear algebra
• Phrases extracted from text written in an arbitrary natural language are analyzed in order to find (statistical) relations

Statistics-Based NLP (2)
• Word-based:
  • Statistics collection on words
  • Frequency counting and ranking generation (e.g., TF-IDF)
  • Collocations (cliff-hanger, eye candy, take care, profit announcement, etc.)
  • Word Sense Disambiguation (WSD)
  • Inference models: n-grams
  • Clustering
• Grammar-based:
  • Part-Of-Speech (POS) tagging
  • Stochastic Context-Free Grammars (SCFG)
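
To make the "frequency counting and ranking generation" item concrete, here is a minimal sketch of one common TF-IDF variant (relative term frequency times the log of inverse document frequency). The function name tf_idf, the whitespace tokenizer, and the three toy documents are assumptions made for illustration; they do not come from the slides, and real systems typically add smoothing and better tokenization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for this example.
docs = [
    "the company announces a profit",
    "the company announces a collaboration",
    "the report discusses a collaboration",
]
scores = tf_idf(docs)
# Rank the terms of the first document by descending TF-IDF score.
ranking = sorted(scores[0], key=scores[0].get, reverse=True)
```

Terms that occur in every document (here "the" and "a") get a score of zero, while a term unique to one document ("profit") ranks first for that document. No linguistic resources are involved, which is the point made about statistical approaches.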

Statistics-Based NLP (3)
• Advantages:
  • Not knowledge-based, so they require neither linguistic resources nor expert knowledge
  • Issues regarding leaking grammars, inconsistencies among humans, dialects, etc. are alleviated
• Disadvantages:
  • Often need a large amount of data
  • Approaches do not deal with meaning explicitly, i.e., statistical methods discover relations in corpora without considering semantics

Statistics-Based NLP (4)
• Examples:
  • (Bannard et al., 2003) discuss several techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions:
    • Collocation-like approach, frequency counting
    • Focus on mining relations between words
  • (Taira and Soderland, 1999) implement a statistical natural language processor:
    • Based on resonance probabilities between word pairs
    • Uses word affinity knowledge from training sentences
    • Focus on acquiring knowledge from radiology reports

Pattern-Based NLP (1)
• Use linguistic patterns to extract data from texts
• Patterns can be:
  • Predefined
  • Discovered (learned)
• Knowledge used:
  • Lexical knowledge
  • Syntactic knowledge
  • Semantic knowledge

Pattern-Based NLP (2)
• Lexico-syntactic patterns:
  • Combine lexical and syntactic elements with regular expressions
  • E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?” mines a corpus for information on mergers and collaborations of companies and/or persons
• Lexico-semantic patterns:
  • Enrich lexico-syntactic patterns through the addition of semantics
  • Gazetteers (simple typing):
    • Use linguistic meaning of text
    • E.g., “[sub:company] announces collaboration with [obj:company]”
  • Ontologies (complex typing):
    • Also include relationships
    • E.g., “[kb:Company] kb:collaborates [kb:Company]”
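
The lexico-syntactic example pattern can be approximated with an ordinary regular expression, as in the sketch below. It assumes the input has already been POS-tagged and serialized as space-separated word/TAG tokens with Penn Treebank-style tags; that representation, the names PATTERN and extract, and the company names in the sample sentence are all hypothetical choices for this illustration, not something the slides prescribe.

```python
import re

# One proper noun token in word/TAG form (assumed serialization).
NNP = r"\S+/NNP"

# Rough regex rendering of:
#   {NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?
PATTERN = re.compile(
    rf"(?P<parties>(?:{NNP} ,/, )*{NNP}(?: ,/,)? and/CC {NNP}) "
    rf"(?P<verb>announce|discuss)\w*/VB\w* "
    rf"collaboration/NN"
    rf"(?: with/IN (?P<partner>{NNP}))?"
)


def extract(tagged_sentence):
    """Return the collaborating parties found in a tagged sentence, or None."""
    match = PATTERN.search(tagged_sentence)
    if match is None:
        return None
    partner = match.group("partner")
    return {
        # Strip the /NNP tags to recover the bare names.
        "parties": re.findall(r"(\S+)/NNP", match.group("parties")),
        "verb": match.group("verb"),
        "partner": partner.split("/")[0] if partner else None,
    }


# Hand-tagged sample sentence with made-up company names.
sentence = (
    "Acme/NNP ,/, Globex/NNP and/CC Initech/NNP announce/VBP "
    "collaboration/NN with/IN Umbrella/NNP ./."
)
result = extract(sentence)
no_match = extract("Acme/NNP reports/VBZ profit/NN ./.")
```

A lexico-semantic variant would additionally check each extracted name against a gazetteer or ontology class (for example, only accept names typed as companies) before reporting a match.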

Pattern-Based NLP (3)
• Advantages:
  • Need less training data
  • Complex expressions can be defined
  • Results are easily interpretable
• Disadvantages:
  • Lexical knowledge is required
  • Prior expert/domain knowledge might be required (for lexico-semantic patterns)
  • Defining and maintaining patterns is a cumbersome and non-trivial task

Pattern-Based NLP (4)
• Examples:
  • CAFETIERE (Black et al., 2005):
    • Employs extraction rules defined at lexico-semantic level
    • Makes use of gazetteering
    • Knowledge is stored using Narrative Knowledge Representation Language (NKRL)
    • Knowledge base lacks reasoning support
    • Focus on extracting relations from corpora
  • Hermes (Frasincar et al., 2009):
    • Patterns defined at lexico-semantic level
    • Makes use of ontologies and reasoning engines
    • Knowledge is based on an OWL domain ontology
    • Focus on the use of pattern-based NLP in building personalized news services

Hybrid NLP (1)
• Combine linguistic knowledge with statistical methods
• In practice, it is often difficult to stay within the boundaries of a single approach
• Thus, it is convenient to combine the best of both worlds:
  • Bootstrapping lexical methods
  • Solving lack of expert knowledge by applying statistical methods
  • Statistical methods that use some present (lexical) knowledge

Hybrid NLP (2)
• Advantages:
  • Solve problems related to scaling and required expert knowledge of pattern-based approaches
  • Do not require as much data as statistical approaches
  • Inherit some of the advantages of both statistical and pattern-based approaches
• Disadvantages:
  • By combining different techniques, maintaining completeness and accuracy of the systems becomes more difficult
  • Multidisciplinary aspects
  • Inherit some of the disadvantages of both statistical and pattern-based approaches

Hybrid NLP (3)
• Examples:
  • Corpus-Based Statistics-Oriented techniques (Su et al., 1996):
    • Mainly statistical learning techniques, guided by high-level linguistic constructs
    • Applications in POS tagging, semantic analysis of corpora, machine translation, annotation, etc.
    • Focus is on extracting inductive knowledge from corpora to support building large scale NLP systems
  • PANKOW (Cimiano et al., 2004):
    • Generates instances of lexico-syntactic patterns indicating a certain semantic or ontological relation
    • Counts number of occurrences of patterns
    • Statistical distribution of instances of these patterns constitutes the collective knowledge
    • Focus is on supporting annotation
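
The "generate pattern instances, then count occurrences" idea can be illustrated with a toy sketch. This is not the PANKOW implementation: the three pattern templates, the naive "add s" pluralization, the function name count_evidence, and the small in-memory corpus are assumptions made for this example, and a local corpus stands in for whatever large text collection a real system would query.

```python
from collections import Counter

# Illustrative pattern templates (assumed, not taken from the slides).
PATTERNS = [
    "{concept}s such as {instance}",
    "{instance} and other {concept}s",
    "{instance} is a {concept}",
]


def count_evidence(instance, concepts, corpus):
    """Count, per candidate concept, how often its pattern instances occur."""
    text = " ".join(corpus).lower()
    evidence = Counter()
    for concept in concepts:
        for pattern in PATTERNS:
            # Instantiate the pattern for this (instance, concept) pair.
            phrase = pattern.format(concept=concept, instance=instance).lower()
            # Plain substring counting stands in for real occurrence counts.
            evidence[concept] += text.count(phrase)
    return evidence


# Toy corpus, invented for this example.
corpus = [
    "Ports such as Rotterdam handle millions of containers.",
    "Rotterdam is a port on the North Sea.",
    "Rotterdam and other ports invest in automation.",
    "Rivers such as the Rhine cross several countries.",
    "The hotel is near Rotterdam.",
]
evidence = count_evidence("Rotterdam", ["port", "river", "hotel"], corpus)
# The concept with the most supporting pattern instances is the suggestion.
best_concept, best_count = evidence.most_common(1)[0]
```

The linguistic part is the set of patterns; the statistical part is the distribution of counts over candidate concepts, which is what makes this style of approach hybrid.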

Conclusions
• Three main approaches to NLP:
  • Statistics-based
  • Pattern-based
  • Hybrid
• Which techniques to use for your NLP tasks? There is no single best approach, but consider these rough guidelines:
  • Evaluate your problem, preferences, and available resources
  • If you are less concerned with semantics and you assume that knowledge lies within statistical facts on a specific corpus, use a statistics-based approach
  • If you are concerned with the semantics of discovered information, or you want to be able to easily explain and control the results, use a pattern-based approach
  • If you need to bootstrap a pattern-based approach using statistics (e.g., insufficient knowledge available) or the other way around (e.g., need for a priori knowledge), use a hybrid approach

References
• C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65-72. Association for Computational Linguistics, 2003.
• W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi. CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester, 2005.
• P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th International Conference on World Wide Web (WWW 2004), pages 462-471. ACM, 2004.
• F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35-53, 2009.
• K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing. Computational Linguistics and Chinese Language Processing, 1(1):101-157, 1996.
• R. K. Taira and S. G. Soderland. A statistical natural language processor for medical reports. In AMIA Symposium 1999, pages 970-974. American Medical Informatics Association, 1999.