
A Survey of Approaches on Mining the Structure from Unstructured Data


  1. A Survey of Approaches on Mining the Structure from Unstructured Data
     Dutch-Belgian Database Day 2009 (DBDBD 2009)

  2. Introduction
     • A lot of data is generated every day
     • Difficult to find information that meets one’s needs
     • There is a need to mine the structure of data as a first step towards understanding it
     • Part of the effort to make the Web machine-understandable
     • Solution: employ NLP techniques to extract knowledge from unstructured text written in natural language

  3. Which Technique to Choose?

  4. Statistics-Based NLP (1)
     • Utilize statistics and mathematical models based on probability theory
     • Refers to all non-symbolic and non-logical work on NLP, i.e., it encompasses all quantitative approaches to automated language processing, including:
       • Probabilistic modeling
       • Information theory
       • Linear algebra
     • Phrases extracted from text written in an arbitrary natural language are analyzed in order to find (statistical) relations

  5. Statistics-Based NLP (2)
     • Word-based:
       • Statistics collection on words
       • Frequency counting and ranking generation (e.g., TF-IDF)
       • Collocations (cliff-hanger, eye candy, take care, profit announcement, etc.)
       • Word Sense Disambiguation (WSD)
       • Inference models: n-grams
       • Clustering
     • Grammar-based:
       • Part-Of-Speech (POS) tagging
       • Stochastic Context-Free Grammars (SCFG)
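
The frequency counting and ranking mentioned above (TF-IDF) can be sketched in a few lines. This is a minimal illustration, not a method taken from the slides: it assumes whitespace tokenization, tf = term count / document length, and idf = log(N / document frequency). The function name `tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed weighting (one common variant among several):
      tf  = term count / number of tokens in the document
      idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus.
docs = [
    "the company announces a profit",
    "the company announces a collaboration",
    "the weather is fine",
]
scores = tf_idf(docs)
# Terms of the first document, ranked from most to least distinctive.
ranking = sorted(scores[0], key=scores[0].get, reverse=True)
```

Terms that occur in every document (here "the") receive a score of zero, while a term unique to one document ranks at the top for that document.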

  6. Statistics-Based NLP (3)
     • Advantages:
       • Not knowledge-based, so they require neither linguistic resources nor expert knowledge
       • Issues regarding leaky grammars, inconsistencies among humans, dialects, etc. are alleviated
     • Disadvantages:
       • Often need a large amount of data
       • Approaches do not deal with meaning explicitly, i.e., statistical methods discover relations in corpora without considering semantics

  7. Statistics-Based NLP (4)
     • Examples:
       • (Bannard et al., 2003) discuss several techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions:
         • Collocation-like approach, frequency counting
         • Focus on mining relations between words
       • (Taira and Soderland, 1999) implement a statistical natural language processor:
         • Based on resonance probabilities between word pairs
         • Uses word affinity knowledge from training sentences
         • Focus on acquiring knowledge from radiology reports

  8. Pattern-Based NLP (1)
     • Use linguistic patterns to extract data from texts
     • Patterns can be:
       • Predefined
       • Discovered (learned)
     • Knowledge used:
       • Lexical knowledge
       • Syntactic knowledge
       • Semantic knowledge

  9. Pattern-Based NLP (2)
     • Lexico-syntactic patterns:
       • Combine lexical and syntactic elements with regular expressions
       • E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?” mines a corpus for information on mergers and collaborations of companies and/or persons
     • Lexico-semantic patterns:
       • Enrich lexico-syntactic patterns through the addition of semantics
       • Gazetteers (simple typing):
         • Use linguistic meaning of text
         • E.g., “[sub:company] announces collaboration with [obj:company]”
       • Ontologies (complex typing):
         • Also include relationships
         • E.g., “[kb:Company] kb:collaborates [kb:Company]”
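
The lexico-syntactic example above can be approximated with an ordinary regular expression over POS-tagged text. The sketch below is one possible reading of that pattern, not the notation or implementation used by any of the cited systems. It assumes the input is already tagged in a `word/TAG` format with Penn Treebank tags (the sentence is hand-tagged here; a real pipeline would obtain the tags from a POS tagger). The names `PATTERN` and `extract`, and the company names, are hypothetical.

```python
import re

# "{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?"
# rewritten over word/TAG tokens separated by single spaces.
PATTERN = re.compile(
    r"(?P<subjects>(?:\S+/NNP ,/, )*\S+/NNP(?: ,/,)? and/CC \S+/NNP) "
    r"(?:announce|discuss)\w*/VB[A-Z]? "   # any inflection of the two verbs
    r"collaboration/NN"
    r"(?: with/IN (?P<partner>\S+)/NNP)?"  # optional "with NNP"
)


def extract(tagged_sentence):
    """Return (list of subject names, partner or None), or None if no match."""
    match = PATTERN.search(tagged_sentence)
    if match is None:
        return None
    subjects = re.findall(r"(\S+)/NNP", match.group("subjects"))
    return subjects, match.group("partner")


# Hand-tagged sample sentence (an assumption for illustration).
tagged = ("Philips/NNP ,/, Shell/NNP and/CC Unilever/NNP "
          "announce/VBP collaboration/NN with/IN Heineken/NNP")
result = extract(tagged)
```

A lexico-semantic variant would replace the bare `NNP` tags with typed slots (for instance, names looked up in a gazetteer of companies), which is where the extra lexical or domain knowledge listed among the disadvantages comes in.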

  10. Pattern-Based NLP (3)
      • Advantages:
        • Need less training data
        • Complex expressions can be defined
        • Results are easily interpretable
      • Disadvantages:
        • Lexical knowledge is required
        • Prior expert/domain knowledge might be required (for lexico-semantic patterns)
        • Defining and maintaining patterns is a cumbersome and non-trivial task

  11. Pattern-Based NLP (4)
      • Examples:
        • CAFETIERE (Black et al., 2005):
          • Employs extraction rules defined at lexico-semantic level
          • Makes use of gazetteering
          • Knowledge is stored using Narrative Knowledge Representation Language (NKRL)
          • Knowledge base lacks reasoning support
          • Focus on extracting relations from corpora
        • Hermes (Frasincar et al., 2009):
          • Patterns defined at lexico-semantic level
          • Makes use of ontologies and reasoning engines
          • Knowledge is based on an OWL domain ontology
          • Focus on the use of pattern-based NLP in building personalized news services

  12. Hybrid NLP (1)
      • Combine linguistic knowledge with statistical methods
      • In practice, it is often difficult to stay within the boundaries of a single approach
      • Thus, it is convenient to combine the best of both worlds:
        • Bootstrapping lexical methods
        • Solving lack of expert knowledge by applying statistical methods
        • Statistical methods that use some existing (lexical) knowledge

  13. Hybrid NLP (2)
      • Advantages:
        • Solve problems related to scaling and required expert knowledge of pattern-based approaches
        • Do not require as much data as statistical approaches
        • Inherit some of the advantages of both statistical and pattern-based approaches
      • Disadvantages:
        • By combining different techniques, maintaining completeness and accuracy of the systems becomes more difficult
        • Multidisciplinary aspects
        • Inherit some of the disadvantages of both statistical and pattern-based approaches

  14. Hybrid NLP (3)
      • Examples:
        • Corpus-Based Statistics-Oriented techniques (Su et al., 1996):
          • Mainly statistical learning techniques, guided by high-level linguistic constructs
          • Applications in POS tagging, semantic analysis of corpora, machine translation, annotation, etc.
          • Focus is on extracting inductive knowledge from corpora to support building large scale NLP systems
        • PANKOW (Cimiano et al., 2004):
          • Generates instances of lexico-syntactic patterns indicating a certain semantic or ontological relation
          • Counts number of occurrences of patterns
          • Statistical distribution of instances of these patterns constitutes the collective knowledge
          • Focus is on supporting annotation
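
The generate-and-count idea described for PANKOW can be illustrated with a toy sketch. It is not the PANKOW implementation: the three pattern templates are illustrative choices, a small in-memory list of sentences stands in for the document collection, and counting is plain substring matching. The names `TEMPLATES`, `count_pattern_instances`, and the sample corpus are all assumptions.

```python
from collections import Counter

# Illustrative pattern templates (an assumption, not PANKOW's actual set).
# Pluralization is naive: it simply appends "s" to the concept.
TEMPLATES = [
    "{instance} is a {concept}",
    "{concept}s such as {instance}",
    "the {instance} {concept}",
]


def count_pattern_instances(corpus, instance, concepts):
    """For each candidate concept, instantiate every template with the
    instance and count how often the resulting phrases occur in the corpus."""
    text = " ".join(corpus).lower()
    counts = Counter()
    for concept in concepts:
        for template in TEMPLATES:
            phrase = template.format(instance=instance, concept=concept).lower()
            counts[concept] += text.count(phrase)
    return counts


# Hypothetical toy corpus.
corpus = [
    "Rivers such as Niagara attract many visitors.",
    "Niagara is a river in North America.",
    "We stayed at the Niagara hotel.",
]
counts = count_pattern_instances(corpus, "Niagara", ["river", "hotel", "city"])
# The concept with the most pattern occurrences is the suggested annotation.
best = counts.most_common(1)[0][0]
```

The patterns supply the linguistic knowledge and the counts supply the statistics, which is what makes the approach hybrid: the distribution of counts over candidate concepts, rather than any single match, drives the annotation decision.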

  15. Conclusions
      • Three main approaches to NLP:
        • Statistics-based
        • Pattern-based
        • Hybrid
      • Which techniques to use for your NLP tasks? There is no single best approach, but consider these rough guidelines:
        • Evaluate your problem, preferences, and available resources
        • If you are less concerned with semantics and you assume that knowledge lies within statistical facts on a specific corpus, use a statistics-based approach
        • If you are concerned with the semantics of discovered information, or you want to be able to easily explain and control the results, use a pattern-based approach
        • If you need to bootstrap a pattern-based approach using statistics (e.g., insufficient knowledge available) or the other way around (e.g., need for a priori knowledge), use a hybrid approach

  16. References
      • C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65-72. Association for Computational Linguistics, 2003.
      • W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi. CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester, 2005.
      • P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th International Conference on World Wide Web (WWW 2004), pages 462-471. ACM, 2004.
      • F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35-53, 2009.
      • K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing. Computational Linguistics and Chinese Language Processing, 1(1):101-157, 1996.
      • R. K. Taira and S. G. Soderland. A statistical natural language processor for medical reports. In AMIA Symposium 1999, pages 970-974. American Medical Informatics Association, 1999.
