Explore innovative knowledge technologies for reducing web information overload through information gathering, filtering, extraction, and personalization. Learn about CROSSMARC consortium's objectives and architecture for seamless information integration.
Machine Learning for Information Integration on the Web
Georgios Paliouras
Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”
http://www.iit.demokritos.gr/skel
Dagstuhl, February 15, 2005
SKEL Introduction
SKEL’s research objective: innovative knowledge technologies for reducing the information overload on the Web
Areas of research activity:
• Information gathering (retrieval, crawling, spidering)
• Information filtering (text and multimedia classification)
• Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
• Personalization (user stereotypes and communities)
Machine Learning for Information Integration
Structure of the talk
• Web Information integration in CROSSMARC
• Learning Context Free Grammars
• Meta-learning for Web Information Extraction
• Machine Learning for Ontology Maintenance
• Conclusions
CROSSMARC consortium
• National Centre for Scientific Research “Demokritos” (GR)
• University of Edinburgh (UK)
• Universita di Roma Tor Vergata (IT)
• VeltiNet A.E. (GR)
• Lingway (FR)
CROSSMARC Objectives
Develop technology for Information Integration that can:
• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.
CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; only the label "Ontology" survives in the text]
CROSSMARC Ontology
Ontology:
  <description>Laptops</description>
  <features>
    <feature id="OF-d0e5">
      <description>Processor</description>
      <attribute type="basic" id="OA-d0e7">
        <description>Processor Name</description>
        <discrete_set type="open">
          <value id="OV-d0e1041">
            <description>Intel Pentium 3</description>
          </value>
          …
Lexicon:
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
  …
Greek Lexicon:
  <node idref="OA-d0e7">
    <synonym>Όνομα Επεξεργαστή</synonym>
  </node>
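The link between ontology identifiers and lexicon synonyms can be illustrated with a minimal Python sketch. It assumes the fragments on this slide are wrapped in root elements named `<ontology>` and `<lexicon>`; those two names, and the function `build_synonym_index`, are hypothetical and not taken from CROSSMARC.

```python
import xml.etree.ElementTree as ET

# Root element names <ontology> and <lexicon> are assumptions: the slide
# shows only fragments, so they are closed here to make well-formed XML.
ONTOLOGY_XML = """
<ontology>
  <description>Laptops</description>
  <features>
    <feature id="OF-d0e5">
      <description>Processor</description>
      <attribute type="basic" id="OA-d0e7">
        <description>Processor Name</description>
        <discrete_set type="open">
          <value id="OV-d0e1041">
            <description>Intel Pentium 3</description>
          </value>
        </discrete_set>
      </attribute>
    </feature>
  </features>
</ontology>
"""

LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>
"""


def build_synonym_index(ontology_xml, lexicon_xml):
    """Map each lower-cased surface form to (ontology id, description)."""
    ontology = ET.fromstring(ontology_xml)
    lexicon = ET.fromstring(lexicon_xml)

    # Every ontology element that has an id and a <description> child
    # can be the target of a lexicon node.
    canonical = {}
    for elem in ontology.iter():
        ident = elem.get("id")
        desc = elem.find("description")
        if ident is not None and desc is not None:
            canonical[ident] = desc.text

    index = {}
    for node in lexicon.iter("node"):
        ident = node.get("idref")
        if ident not in canonical:
            continue
        entry = (ident, canonical[ident])
        index[canonical[ident].lower()] = entry
        for synonym in node.iter("synonym"):
            index[synonym.text.lower()] = entry
    return index


index = build_synonym_index(ONTOLOGY_XML, LEXICON_XML)
print(index["piii"])
```

A language-specific lexicon such as the Greek one would plug in the same way, since its nodes refer to the same ontology identifiers.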
Learning Context Free Grammars
Introducing eg-GRIDS
• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation is controlled through a heuristic based on MDL.
• Two basic and three auxiliary learning operators.
• Two search strategies:
  • Beam search.
  • Genetic search.
Learning Context Free Grammars
Minimum Description Length (MDL)
• Grammar Description Length (GDL): bits required to encode the grammar G.
• Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
• Model Length (ML) = GDL + DDL
[Figure: hypotheses between an Overly Specific Grammar and an Overly General Grammar, with GDL and DDL marked]
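The trade-off between GDL and DDL can be made concrete with a small Python sketch. The encoding below is a toy assumption, not the eg-GRIDS encoding: every symbol written in the grammar costs a fixed number of bits, and every rule choice in a derivation costs log2 of the number of alternatives for that non-terminal. The derivations are supplied by hand rather than parsed, and the example sentences are invented.

```python
import math


def gdl(grammar):
    """Bits to write the grammar down, using a toy fixed-length symbol code."""
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    # One extra code word is reserved for an end-of-rule marker.
    bits_per_symbol = math.log2(len(symbols) + 1)
    # Each rule is written as: head, body symbols, end-of-rule marker.
    written = sum(1 + len(body) + 1
                  for bodies in grammar.values() for body in bodies)
    return written * bits_per_symbol


def ddl(grammar, derivations):
    """Bits to encode the examples: log2(#alternatives) per rule choice.

    Each derivation is the list of non-terminals expanded while
    generating one training example.
    """
    return sum(math.log2(len(grammar[nt]))
               for derivation in derivations for nt in derivation)


def model_length(grammar, derivations):
    """ML = GDL + DDL."""
    return gdl(grammar) + ddl(grammar, derivations)


# Overly specific grammar: one rule per training sentence.
specific = {"S": [("the", "cat", "sleeps"), ("the", "dog", "sleeps"),
                  ("a", "cat", "sleeps"), ("a", "dog", "sleeps")]}
specific_derivations = [["S"]] * 4

# Overly general grammar: any sequence of known words.
general = {"S": [("W", "S"), ("W",)],
           "W": [("the",), ("a",), ("cat",), ("dog",), ("sleeps",)]}
general_derivations = [["S", "W"] * 3] * 4  # three words per sentence

for name, g, d in [("specific", specific, specific_derivations),
                   ("general", general, general_derivations)]:
    print(name, round(gdl(g), 2), round(ddl(g, d), 2),
          round(model_length(g, d), 2))
```

Under this toy encoding the overly general grammar pays more bits per derivation, while the overly specific grammar needs one more rule for every new sentence. A search guided by ML looks for a grammar between the two extremes.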
Learning Context Free Grammars
eg-GRIDS Architecture
[Figure: architecture diagram. Labels: Training Examples; Overly Specific Grammar; Beam of Grammars; Search Organisation Selection; Operator Mode; Learning Operators (MergeNTOperator, CreateNTOperator, Create Optional NT, DetectCenterEmbedding, BodySubstitution); Evolutionary Algorithm; Mutation; decision "Any Inferred Grammar better than those in beam?" (YES / NO); Final Grammar]
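Meta-learning for Web IE
Stacked generalization
• Training: the base-level dataset D is split into parts Dj. The learners L1…LN are trained on D \ Dj, giving classifiers C1(j)…CN(j), whose predictions on Dj form the meta-level data MDj. The meta-level learner LM is trained on the meta-level dataset MD to give the classifier CM.
• Run-time: the classifiers C1…CN, trained by L1…LN on D, classify a new vector x. Their predictions form the meta-level vector, from which CM predicts the class value y(x).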
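The stacked generalization scheme can be sketched in plain Python as follows. The three base learners, the table-lookup meta-learner, and the tiny two-class dataset are illustrative assumptions chosen to keep the example self-contained; they are not the learners used in the talk.

```python
from collections import Counter


def learn_majority(data):
    """Base learner: always predict the most frequent training class."""
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label


def learn_1nn(data):
    """Base learner: class of the nearest training example."""
    def classify(x):
        nearest = min(data, key=lambda ex: sum((a - b) ** 2
                                               for a, b in zip(ex[0], x)))
        return nearest[1]
    return classify


def learn_centroid(data):
    """Base learner: class of the nearest class centroid."""
    sums, counts = {}, Counter()
    for x, y in data:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}
    return lambda x: min(centroids, key=lambda y: sum(
        (a - b) ** 2 for a, b in zip(centroids[y], x)))


def learn_table(meta_data):
    """Meta-learner LM: majority class per meta-level vector."""
    votes = {}
    for vector, y in meta_data:
        votes.setdefault(vector, Counter())[y] += 1
    default = Counter(y for _, y in meta_data).most_common(1)[0][0]
    return lambda v: votes[v].most_common(1)[0][0] if v in votes else default


def train_stacking(data, base_learners, meta_learner, n_folds=3):
    meta_data = []                                    # MD
    for j in range(n_folds):
        held_out = data[j::n_folds]                   # Dj
        rest = [ex for i, ex in enumerate(data) if i % n_folds != j]  # D \ Dj
        classifiers = [learn(rest) for learn in base_learners]  # C1(j)..CN(j)
        meta_data += [(tuple(c(x) for c in classifiers), y)
                      for x, y in held_out]           # MDj
    meta_classifier = meta_learner(meta_data)         # CM
    final = [learn(data) for learn in base_learners]  # C1..CN trained on D
    return lambda x: meta_classifier(tuple(c(x) for c in final))


data = [((0.0, 0.1), "a"), ((0.2, 0.0), "a"), ((0.1, 0.3), "a"),
        ((1.0, 0.9), "b"), ((0.9, 1.1), "b"), ((1.2, 1.0), "b")]
predict = train_stacking(data, [learn_majority, learn_1nn, learn_centroid],
                         learn_table)
print(predict((0.1, 0.1)), predict((1.1, 1.0)))
```

The meta-level classifier only ever sees predictions made on data the base classifiers were not trained on, which is the point of the D \ Dj split.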
Meta-learning for Web IE
Information Extraction is not naturally a classification task.
• In IE we deal with text documents, paired with templates.
• Each template is filled with instances <t(s,e), f>.
Meta-learning for Web IE
Combining Information Extraction systems
Meta-learning for Web IE
Creating a stacked template
Meta-learning for Web IE
Training in the new stacking framework
• D = set of documents, paired with hand-filled templates
• MD = set of meta-level feature vectors
[Figure: training diagram. Labels: Dj; D \ Dj; learners L1…LN; extractors E1…EN and E1(j)…EN(j); stacked templates ST1, ST2, …; meta-level data MDj; meta-level learner LM; meta-level classifier CM]
Meta-learning for Web IE
Stacking at run-time
[Figure: a new document d is processed by the extractors E1, E2, …, EN, producing templates T1, T2, …, TN; these are combined into a stacked template; CM classifies its instances <t(s,e), f>, giving the final template T]
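The run-time flow above can be sketched in Python under several assumptions: a template is represented as a dictionary from a text span (s, e) to the field f its extractor assigned, the meta-level vector of a span is the tuple of fields proposed by E1…EN (None when an extractor proposed nothing), and the meta-level learner is a simple lookup table. The representation, the function names, and the field names "model", "ram", "hdd" and "price" are all hypothetical. The training templates are assumed to come from extractors trained on D \ Dj, as in the training slide.

```python
from collections import Counter


def stack_templates(templates):
    """Merge the templates T1..TN of one document into a stacked template.

    Returns {span: (field from E1, ..., field from EN)}, with None where
    an extractor made no prediction for that span.
    """
    spans = sorted(set().union(*templates))
    return {span: tuple(t.get(span) for t in templates) for span in spans}


def meta_examples(templates, gold):
    """Meta-level feature vectors, labelled from the hand-filled template."""
    return [(vector, gold.get(span))
            for span, vector in stack_templates(templates).items()]


def learn_meta(examples):
    """Stand-in for LM: most frequent true field per meta-level vector.

    Vectors never seen in training are rejected (mapped to None).
    """
    votes = {}
    for vector, field in examples:
        votes.setdefault(vector, Counter())[field] += 1
    return lambda v: votes[v].most_common(1)[0][0] if v in votes else None


def extract(templates, meta_classifier):
    """Run-time: keep the stacked instances that CM assigns to a field."""
    final = {}
    for span, vector in stack_templates(templates).items():
        field = meta_classifier(vector)
        if field is not None:
            final[span] = field
    return final


# One training document: outputs of two extractors plus the gold template.
train_e1 = {(0, 5): "model", (10, 14): "ram"}
train_e2 = {(0, 5): "model", (10, 14): "hdd", (20, 25): "price"}
gold = {(0, 5): "model", (10, 14): "ram"}
cm = learn_meta(meta_examples([train_e1, train_e2], gold))

# A new document d, processed by the same two extractors.
new_e1 = {(3, 9): "model", (30, 33): "ram"}
new_e2 = {(3, 9): "model", (30, 33): "hdd", (40, 44): "price"}
print(extract([new_e1, new_e2], cm))
```

In this toy run the meta-level classifier has learned to trust the first extractor when the two disagree on a span, and to reject spans that only the second extractor proposes.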
Ontology Enrichment
• We concentrate on instances.
• Highly evolving domain (e.g. laptop descriptions):
  • New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it does not exist in the ontology.
  • New surface appearances of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.
Ontology Enrichment
[Figure: enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; machine learning; Information extraction; Additional annotations; Ontology Enrichment / Population; Validation; Domain Expert]
Enrichment with synonyms
• There is a need for supporting the enrichment of the ‘synonymy’ relationship.
• The number of instances for validation increases with the size of the corpus and the ontology.
• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
  • Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
  • Orthographical: ‘Intel p3’ - ‘intell p3’
  • Lexicographical: ‘Hewlett Packard’ - ‘HP’
  • Combination: ‘IntellPentium 3’ - ‘P III’
Compression-based Clustering
• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, here letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when a candidate string is added. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
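The idea behind CCDiff can be sketched in Python as follows. This is a simplified reading of the slide, not the published COCLU algorithm: the code length of a cluster is taken to be the number of bits needed to encode all its characters with a Huffman code built from the cluster's own character frequencies, the cost of the tree itself is ignored, and strings are assigned in a single pass. The function names and the threshold handling are assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {symbol: 1 for symbol in freqs}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {symbol: 0}) for symbol, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level down.
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def code_length(strings):
    """Bits needed to encode all strings with the cluster's Huffman code."""
    freqs = Counter("".join(strings))
    if not freqs:
        return 0
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)


def cc_diff(cluster, candidate):
    """Growth in cluster code length caused by adding the candidate."""
    return code_length(cluster + [candidate]) - code_length(cluster)


def coclu(strings, threshold):
    """Single-pass sketch: join the closest cluster or open a new one."""
    clusters = []
    for s in strings:
        best = min(clusters, key=lambda c: cc_diff(c, s), default=None)
        if best is not None and cc_diff(best, s) <= threshold:
            best.append(s)
        else:
            clusters.append([s])
    return clusters


names = ["Intel Pentium 3", "Intel P3", "Hewlett Packard", "HP", "intell p3"]
for cluster in coclu(names, threshold=30):
    print(cluster)
```

A candidate that reuses the characters already frequent in a cluster adds few bits, so it scores a small CCDiff; a candidate that introduces new characters forces longer codes and scores a larger one. The threshold value of 30 bits is arbitrary and would need tuning on real data.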
Conclusions
• Information integration can benefit from machine learning.
• Grammar learning methods have become efficient.
• Combining IE systems improves performance.
• Ontologies can be used to annotate training examples for learning IE systems, which in turn help enrich the ontologies.
• Open question: grammar learning in parallel or in combination with ontology learning?
Acknowledgements
• This is research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium.
Announcement: IJCAI workshop
Workshop on Grammatical Inference Applications: Successes and Future Challenges
IJCAI-05, Edinburgh, Scotland, July 31, 2005
Paper submission deadline: March 19, 2005
URL: http://www.ics.mq.edu.au/~menno/IJCAI05/