On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning
Georgios Paliouras
Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications
NCSR “Demokritos”
http://www.iit.demokritos.gr/~paliourg
Kassel, 22 July 2005

Outline
• Motivation and state of the art
• SKEL research
  • Vision
  • Information integration in CROSSMARC.
  • Meta-learning for information extraction.
  • Context-free grammar learning.
  • Ontology enrichment.
  • Bootstrapping ontology evolution with multimedia information extraction.
• Open issues
Kassel, 22/07/2005 ICCS’05

Motivation
• Practical information extraction requires a conceptual description of the domain, e.g. an ontology, and a grammar.
• Manual creation and maintenance of these resources is expensive.
• Machine learning has been used to:
  • Learn ontologies based on extracted instances.
  • Learn extraction grammars, given the conceptual model.
• Goal: study how the two processes interact and whether they can be combined.

Information extraction
• Common approach: shallow parsing with regular grammars.
• Limited use of deep analysis to improve extraction accuracy (HPSGs, concept graphs).
• Linking of extraction patterns to ontologies (e.g. information extraction ontologies).
• Initial attempts to combine syntax and semantics (Systemic Functional Grammars).
• Learning simple extraction patterns (regular expressions, HMMs, tree-grammars, etc.).

Ontology learning
• Deductive approach to ontology modification: driven by linguistic rules.
• Inductive identification of new concepts/terms.
• Clustering, based on lexico-syntactic analysis of the text (subcat frames).
• Formal Concept Analysis for term clustering and concept identification.
• Clustering and merging of conceptual graphs (conceptual graph theory).
• Deductive learning of extraction grammars in parallel with the identification of concepts.

SKEL - vision
Research objective: innovative knowledge technologies for reducing the information overload on the Web.
Areas of research activity:
• Information gathering (retrieval, crawling, spidering)
• Information filtering (text and multimedia classification)
• Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
• Personalization (user stereotypes and communities)
• Ontology learning and population

CROSSMARC Objectives
Develop technology for Information Integration that can:
• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.

CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; label: Ontology]

CROSSMARC Ontology

Ontology:
  …
  <description>Laptops</description>
  <features>
    <feature id="OF-d0e5">
      <description>Processor</description>
      <attribute type="basic" id="OA-d0e7">
        <description>Processor Name</description>
        <discrete_set type="open">
          <value id="OV-d0e1041">
            <description>Intel Pentium 3</description>
          </value>
          …

Lexicon:
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>

Greek Lexicon (the synonym is the Greek term for the attribute “Processor Name”):
  <node idref="OA-d0e7">
    <synonym>Όνομα Επεξεργαστή</synonym>
  </node>
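
The link between lexicon nodes and ontology identifiers can be illustrated with a small lookup. This is a minimal sketch, not CROSSMARC code: the `<lexicon>` root element, the function names and the case-insensitive matching are assumptions made only so that the slide's fragment parses and can be queried.

```python
import xml.etree.ElementTree as ET

# Lexicon fragment from the slide, wrapped in a hypothetical <lexicon> root
# (an assumption) so that it parses as a complete document.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""

def build_synonym_index(xml_text):
    """Map each lower-cased synonym to the ontology id its node refers to."""
    index = {}
    for node in ET.fromstring(xml_text).iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index

def lookup(index, surface_form):
    """Return the ontology id for a surface form, or None if it is unknown."""
    return index.get(surface_form.strip().lower())

index = build_synonym_index(LEXICON_XML)
print(lookup(index, "pentium iii"))  # known surface form of OV-d0e1041
print(lookup(index, "Pentium 2"))    # not in the lexicon
```

A surface form that returns no identifier is a candidate for the enrichment step discussed later in the talk.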

Meta-learning for Web IE
Motivation:
• There are many different learning methods, producing different types of extraction grammar.
• In CROSSMARC we had four different approaches, with significant differences in the extracted information.
Proposed approach:
• Use meta-learning to combine the strengths of individual learning methods.
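
Meta-learning for Web IE
Stacked generalization
[Figure: stacking diagram. Training: the base-level dataset D is split into folds Dj; the learners L1…LN are trained on D \ Dj to give the classifiers C1(j)…CN(j), whose predictions on Dj form the meta-level data MDj; the meta-level learner LM is trained on the meta-level dataset MD to give the meta-level classifier CM. Run time: the classifiers C1…CN map a new vector x to a meta-level vector, from which CM predicts the class value y(x).]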
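
The stacking procedure can be sketched in a few lines. This is an illustrative sketch only: the threshold base learners, the accuracy-weighted vote used as meta-level learner, the interleaved fold split and the toy dataset are all assumptions, not the learners used in CROSSMARC. The variable names follow the slide (D, Dj, MD, CM).

```python
from collections import Counter

def threshold_learner(feature):
    """Base-level learner Li (assumed): midpoint threshold on one feature.
    Assumes every training split contains both classes."""
    def learn(data):
        pos = [x[feature] for x, y in data if y == 1]
        neg = [x[feature] for x, y in data if y == 0]
        pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
        cut = (pos_mean + neg_mean) / 2
        sign = 1 if pos_mean >= neg_mean else -1
        return lambda x: 1 if sign * (x[feature] - cut) >= 0 else 0
    return learn

def learn_weighted_vote(meta_data):
    """Meta-level learner LM (assumed): weight each base classifier by its
    accuracy on the meta-level dataset MD."""
    n = len(meta_data[0][0])
    weights = [sum(vec[i] == y for vec, y in meta_data) / len(meta_data)
               for i in range(n)]
    def CM(vec):
        score = Counter()
        for w, vote in zip(weights, vec):
            score[vote] += w
        return score.most_common(1)[0][0]
    return CM

def stack(D, learners, meta_learner, folds=4):
    """Build MD by cross-validation, train CM on it, return the stacked predictor."""
    MD = []
    for j in range(folds):
        Dj = D[j::folds]                                         # held-out fold
        rest = [ex for i, ex in enumerate(D) if i % folds != j]  # D \ Dj
        Cj = [L(rest) for L in learners]                         # C1(j)...CN(j)
        MD += [([C(x) for C in Cj], y) for x, y in Dj]           # MDj
    CM = meta_learner(MD)
    C = [L(D) for L in learners]                                 # C1...CN on all of D
    return lambda x: CM([Ci(x) for Ci in C])

# Toy data: the class depends on feature 0; feature 1 is uninformative.
D = [((0.1, 0.9), 0), ((0.9, 0.2), 1), ((0.2, 0.1), 0), ((0.8, 0.8), 1),
     ((0.3, 0.7), 0), ((0.7, 0.3), 1), ((0.0, 0.4), 0), ((1.0, 0.6), 1)]
predict = stack(D, [threshold_learner(0), threshold_learner(1)], learn_weighted_vote)
print(predict((0.95, 0.5)), predict((0.05, 0.5)))
```

The meta-level learner sees only predictions made on held-out folds, so it learns which base classifier to trust rather than how well each one memorised its own training data.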

Meta-learning for Web IE
• Information Extraction is not naturally a classification task.
• In IE we deal with text documents, paired with templates.
• Each template is filled with instances <t(s,e), f>: a text fragment t, with start and end boundaries s and e, assigned to a field f.

Meta-learning for Web IE
Combining Information Extraction systems

Meta-learning for Web IE
Creating a stacked template
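
Meta-learning for Web IE
Training in the new stacking framework
[Figure: training diagram. D = set of documents, paired with hand-filled templates. For each fold Dj, the learners L1…LN are trained on D \ Dj to give the extractors E1(j)…EN(j); their output on Dj is merged into stacked templates ST1, ST2, …, which form the meta-level data MDj. MD = set of meta-level feature vectors; the meta-level learner LM is trained on MD to give the meta-level classifier CM. The extractors E1…EN are obtained by applying L1…LN to D.]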
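
Meta-learning for Web IE
Stacking at run-time
[Figure: run-time diagram. The extractors E1, E2, …, EN process a new document d and fill the templates T1, T2, …, TN, which are merged into a stacked template; CM then classifies each instance <t(s,e), f> of the stacked template to produce the final template T.]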
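
The run-time step can be sketched as follows, under several assumptions: a fragment t(s,e) is represented by its offsets (s, e), each template is a dictionary from fragments to fields, the label "none" marks a fragment that an extractor did not extract, and a majority vote stands in for the trained meta-level classifier CM. None of these choices is taken from the slides.

```python
from collections import Counter

def stacked_template(templates):
    """Merge the templates T1..TN into one stacked template.

    Each template maps a fragment (s, e) to the field f its extractor predicted.
    The result maps every fragment extracted by at least one extractor to a
    meta-level vector with one field per extractor ("none" where the
    extractor did not extract that fragment).
    """
    fragments = sorted(set().union(*templates))
    return {frag: [t.get(frag, "none") for t in templates] for frag in fragments}

def final_template(stacked, CM):
    """Apply the meta-level classifier CM to every stacked instance and keep
    the fragments that CM assigns to a real field."""
    final = {}
    for frag, vec in stacked.items():
        field = CM(vec)
        if field != "none":
            final[frag] = field
    return final

def majority_vote(vec):
    """Stand-in for a trained CM (assumption): most frequent field in the vector."""
    return Counter(vec).most_common(1)[0][0]

# Hypothetical output of three extractors on one document.
T1 = {(10, 18): "processor", (30, 35): "ram"}
T2 = {(10, 18): "processor", (40, 44): "price"}
T3 = {(10, 18): "processor", (30, 35): "ram"}

ST = stacked_template([T1, T2, T3])
T = final_template(ST, majority_vote)
print(ST)
print(T)
```

In the stacking framework of the slides, CM would be learned from stacked templates built on held-out documents and labelled with the hand-filled templates; the vote above only keeps the example self-contained.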

Experimental results

Learning CFGs
Motivation:
• To provide more complex extraction patterns for less structured text.
• To learn more compact and human-comprehensible grammars.
• To process large corpora containing only positive examples.
Proposed approach:
• Efficient learning of context-free grammars from positive examples, guided by Minimum Description Length.

Learning CFGs
Introducing eg-GRIDS
• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation controlled through a heuristic, based on MDL.
• Two basic and three auxiliary learning operators.
• Two search strategies:
  • Beam search.
  • Genetic search.

Learning CFGs
Minimum Description Length (MDL)
• Grammar Description Length (GDL): bits required to encode the grammar G.
• Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
• Model Length (ML) = GDL + DDL
[Figure: GDL and DDL across the hypotheses, between the overly specific grammar and the overly general grammar.]
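
The trade-off between GDL and DDL can be made concrete with a toy calculation. The encoding below is an assumption chosen for simplicity (a fixed number of bits per grammar symbol, and log2 of the number of alternative rules per derivation step); it is not the encoding used by eg-GRIDS. The derivations are also written by hand rather than produced by a parser.

```python
from math import log2

def gdl(grammar):
    """Bits to encode the grammar: every symbol occurrence plus one
    end-of-rule marker per rule, at a fixed cost per unit (assumed scheme)."""
    symbols = set(grammar)
    units = 0
    for head, bodies in grammar.items():
        for body in bodies:
            symbols.update(body)
            units += 1 + len(body) + 1       # head + body + end-of-rule marker
    return units * log2(len(symbols) + 1)    # +1 for the marker symbol

def ddl(grammar, derivations):
    """Bits to encode the examples: each derivation step costs log2 of the
    number of alternative rules of the expanded non-terminal (assumed scheme)."""
    return sum(log2(len(grammar[nt])) for steps in derivations for nt in steps)

def model_length(grammar, derivations):
    """ML = GDL + DDL."""
    return gdl(grammar) + ddl(grammar, derivations)

# Training examples: "a b", "a b a b", "a b a b a b".
# Overly specific grammar: one rule per example, one derivation step each.
specific = {"S": [("a", "b"), ("a", "b", "a", "b"), ("a", "b", "a", "b", "a", "b")]}
specific_derivations = [["S"], ["S"], ["S"]]

# More general grammar S -> a b | a b S: one step per "a b" pair.
general = {"S": [("a", "b"), ("a", "b", "S")]}
general_derivations = [["S"], ["S", "S"], ["S", "S", "S"]]

for name, g, d in [("specific", specific, specific_derivations),
                   ("general", general, general_derivations)]:
    print(name, gdl(g), ddl(g, d), model_length(g, d))
```

Under these assumptions the general grammar has a smaller GDL and a slightly larger DDL than the overly specific one, and its total model length is lower, which is the behaviour the MDL heuristic rewards.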

Learning CFGs
eg-GRIDS Architecture
[Figure: eg-GRIDS architecture. Labels: Training Examples; Overly Specific Grammar; Beam of Grammars; Search Organisation (Selection, Evolutionary Algorithm, Mutation); Operator Mode; Learning Operators (MergeNTOperator, CreateNTOperator, Create Optional NT, DetectCenterEmbedding, BodySubstitution); decision "Any Inferred Grammar better than those in beam?" with branches YES and NO; Final Grammar.]

Experimental results
• The Dyck language with k=1: S → S S | ( S ) | ε
Errors of:
• Omission: failures to parse sentences generated from the “correct” grammar (longer test sentences than in the training set).
  • Indicates an overly specific grammar.
• Commission: failures of the “correct” grammar to parse sentences generated by the inferred grammar.
  • Indicates an overly general grammar.
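
The two error measures can be estimated with a short script. This is a sketch under stated assumptions: the "inferred" grammar is a hypothetical, deliberately overly general one (S → ( S | ) S | ε, which accepts every string of parentheses), the sampling depth limit is arbitrary, and sentences of the inferred grammar are enumerated up to length 4 rather than sampled.

```python
import random
from itertools import product

def is_dyck(sentence):
    """Recogniser for the correct grammar: Dyck language with k=1."""
    depth = 0
    for ch in sentence:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0

def generate(rng, depth=0, max_depth=5):
    """Sample from S -> S S | ( S ) | ε, forcing ε once max_depth is reached."""
    if depth >= max_depth:
        return ""
    choice = rng.choice(["SS", "(S)", "eps"])
    if choice == "SS":
        return generate(rng, depth + 1, max_depth) + generate(rng, depth + 1, max_depth)
    if choice == "(S)":
        return "(" + generate(rng, depth + 1, max_depth) + ")"
    return ""

def error_rate(sentences, parses):
    """Fraction of the sentences that the given recogniser fails to parse."""
    sentences = list(sentences)
    return sum(not parses(s) for s in sentences) / len(sentences)

def inferred_parses(sentence):
    """Hypothetical inferred grammar: accepts any string of parentheses."""
    return set(sentence) <= {"(", ")"}

rng = random.Random(0)
from_correct = [generate(rng) for _ in range(200)]
# All sentences of the inferred grammar with length 0 to 4.
from_inferred = ["".join(p) for n in range(5) for p in product("()", repeat=n)]

omission = error_rate(from_correct, inferred_parses)   # inferred fails on correct sentences
commission = error_rate(from_inferred, is_dyck)        # correct fails on inferred sentences
print(omission, commission)
```

An overly general grammar such as this one shows no errors of omission but a high rate of errors of commission; an overly specific grammar would show the opposite pattern.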

Experimental results
[Figure: probability of parsing a valid sentence (1 - errors of omission)]

Experimental results
[Figure: probability of generating a valid sentence (1 - errors of commission)]

Ontology Enrichment
• We concentrate on instances.
• Highly evolving domain (e.g. laptop descriptions):
  • New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
  • New surface appearances of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain.

Ontology Enrichment
[Figure: enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; machine learning; Information extraction; Additional annotations; Ontology Enrichment / Population; Validation; Domain Expert.]

Finding synonyms
• The number of instances for validation increases with the size of the corpus and the ontology.
• There is a need to support the enrichment of the ‘synonymy’ relationship.
• Automatically discover different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
  • Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
  • Orthographical: ‘Intel p3’ - ‘intell p3’
  • Lexicographical: ‘Hewlett Packard’ - ‘HP’
  • Combination: ‘IntellPentium 3’ - ‘P III’

COCLU
• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements-letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances), when adding a candidate string. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
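
One assignment step of this kind of clustering can be sketched as below. The exact definition of CCDiff and the handling of the threshold in COCLU are not given on the slide, so the code makes assumptions: the code length of a cluster is the total number of bits needed to Huffman-code all characters of its strings, CCDiff is the increase in that length when the candidate is added, and the function names and toy strings are hypothetical.

```python
import heapq
from collections import Counter

def code_length(strings):
    """Total bits needed to Huffman-code all characters of the strings."""
    freqs = list(Counter("".join(strings)).values())
    if len(freqs) <= 1:
        return sum(freqs)          # one distinct symbol: assume 1 bit per character
    heapq.heapify(freqs)
    total = 0
    while len(freqs) > 1:
        merged = heapq.heappop(freqs) + heapq.heappop(freqs)
        total += merged            # each merge adds one bit to every symbol below it
        heapq.heappush(freqs, merged)
    return total

def ccdiff(cluster, candidate):
    """Assumed CCDiff: growth of the cluster's code length when adding the candidate."""
    return code_length(cluster + [candidate]) - code_length(cluster)

def assign(clusters, candidate, threshold):
    """Add the candidate to the closest cluster, or open a new cluster
    when even the closest one exceeds the threshold."""
    if clusters:
        best = min(clusters, key=lambda c: ccdiff(c, candidate))
        if ccdiff(best, candidate) <= threshold:
            best.append(candidate)
            return clusters
    clusters.append([candidate])
    return clusters

# Toy strings standing in for surface forms such as 'Intel p3' or 'HP'.
clusters = [["abab", "abab"], ["cdcd", "cdcd"]]
print(ccdiff(clusters[0], "ab"), ccdiff(clusters[1], "ab"))
assign(clusters, "ab", threshold=5)    # shares its characters with the first cluster
assign(clusters, "xyz", threshold=5)   # far from both clusters
print(clusters)
```

A string that reuses the characters of a cluster barely changes that cluster's Huffman model, so its CCDiff is small; a string over unseen characters forces new codes and is pushed into a cluster of its own.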

Experimental results
[Figure: accuracy (%) plotted against instances removed (%)]

BOEMIE (Bootstrapping Ontology Evolution with Multimedia Information Extraction) - motivation
• Multimedia content is growing at an increasing rate on public and proprietary webs.
• Hard to provide semantic indexing of multimedia content.
• Significant advances in automatic extraction of low-level features from visual content.
• Little progress in the identification of high-level semantic features.
• Little progress in the effective combination of semantic features from different modalities.
• Great effort in producing ontologies for semantic webs.
• Hard to build and maintain domain-specific multimedia ontologies.

BOEMIE - approach
[Figure: BOEMIE architecture. Labels: content collection (crawlers, spiders, etc.); multimedia content; semantics extraction toolkit (visual extraction tools, text extraction tools, audio extraction tools, information fusion tools); semantics extraction (from visual content, from non-visual content, from fused content); semantics extraction results; ontology evolution toolkit (ontology management tool, learning tools, reasoning engine, matching tools); ontology evolution (population & enrichment, coordination); initial ontology; intermediate ontology; evolved ontology; other ontologies.]

KR issues
• Is there a common formalism to capture the necessary semantic, syntactic and lexical knowledge for IE?
• Is that better than having separate representations for different tasks?
• Do we need an intermediate formalism (e.g. grammar + CG + ontology)?
• Do we need to represent uncertainty (e.g. using probabilistic graphical models)?

ML issues
• What types and which aspects of grammars and conceptual structures can we learn?
• What training data do we need? Can we reduce the manual annotation effort?
• What background knowledge do we need and what is the role of deduction?
• What is the role of multi-strategy learning, especially if complex representations are used?

Content-type issues
• What is the role of semantically annotated content in learning, e.g. as training data?
• What is the role of hypertext as a graph?
• Can we extract information from multimedia content?
• How can ontologies and learning help improve extraction from multimedia?

Acknowledgements
• This is the research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium (NCSR “Demokritos”, Uni of Edinburgh, Uni of Roma ‘Tor Vergata’, Veltinet, Lingway).