Machine Learning for Information Integration on the Web

Georgios Paliouras
Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”
http://www.iit.demokritos.gr/skel
Dagstuhl, February 15, 2005

SKEL Introduction
SKEL’s research objective: innovative knowledge technologies for reducing the information overload on the Web.
• Areas of research activity:
  • Information gathering (retrieval, crawling, spidering)
  • Information filtering (text and multimedia classification)
  • Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
  • Personalization (user stereotypes and communities)
Machine Learning for Information Integration

Structure of the talk
• Web information integration in CROSSMARC
• Learning Context Free Grammars
• Meta-learning for Web Information Extraction
• Machine Learning for Ontology Maintenance
• Conclusions

CROSSMARC consortium
• National Centre for Scientific Research “Demokritos” (GR)
• University of Edinburgh (UK)
• Universita di Roma Tor Vergata (IT)
• VeltiNet A.E. (GR)
• Lingway (FR)

CROSSMARC Objectives
Develop technology for Information Integration that can:
• crawl the Web for interesting Web pages,
• extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
• process Web pages written in several languages,
• be customized semi-automatically to new domains and languages,
• deliver integrated information according to personalized profiles.

CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; surviving label: Ontology]

CROSSMARC Ontology
Lexicon:
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
  …
Ontology:
  <description>Laptops</description>
  <features>
    <feature id="OF-d0e5">
      <description>Processor</description>
      <attribute type="basic" id="OA-d0e7">
        <description>Processor Name</description>
        <discrete_set type="open">
          <value id="OV-d0e1041">
            <description>Intel Pentium 3</description>
          </value>
          …
Greek Lexicon:
  <node idref="OA-d0e7">
    <synonym>Όνομα Επεξεργαστή</synonym>
  </node>
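The lexicon fragments above link surface forms to ontology identifiers through the idref attribute. A minimal sketch of how such a lexicon might be queried is given below. The <lexicon> wrapper element, the function name and the case-insensitive matching are assumptions made for illustration; they are not taken from the CROSSMARC schema.

```python
import xml.etree.ElementTree as ET

# The two <node> fragments from the slide, wrapped in a hypothetical
# <lexicon> root so that the text parses as one XML document.
LEXICON_XML = """
<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
  <node idref="OA-d0e7">
    <synonym>Όνομα Επεξεργαστή</synonym>
  </node>
</lexicon>
"""


def build_synonym_index(lexicon_xml):
    """Map each lower-cased synonym to the ontology id its node refers to."""
    index = {}
    for node in ET.fromstring(lexicon_xml).iter("node"):
        for synonym in node.iter("synonym"):
            index[synonym.text.strip().lower()] = node.get("idref")
    return index


index = build_synonym_index(LEXICON_XML)
print(index.get("piii"))        # id of the 'Intel Pentium 3' value
print(index.get("pentium 2"))   # None: surface form not in the lexicon
```

A surface form that is missing from the index, such as 'Pentium 2' here, is the kind of case the ontology enrichment part of the talk deals with.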

Learning Context Free Grammars
Introducing eg-GRIDS
• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation is controlled through a heuristic based on MDL.
• Two basic and three auxiliary learning operators.
• Two search strategies:
  • Beam search.
  • Genetic search.

Learning Context Free Grammars
Minimum Description Length (MDL)
• Grammar Description Length (GDL): bits required to encode the grammar G.
• Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
• Model Length (ML) = GDL + DDL
[Figure: space of hypotheses between an Overly Specific Grammar and an Overly General Grammar; labels: GDL, DDL]
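The trade-off behind ML = GDL + DDL can be illustrated with a toy calculation. The sketch below is not the eg-GRIDS encoding: the bit costs (a fixed-length code per grammar symbol, log2 of the number of alternatives per derivation step), the toy corpus and the hand-written derivations are all assumptions chosen to keep the example small.

```python
import math
from itertools import product


def gdl(grammar):
    """Assumed grammar cost: every rule is written as head + body + end
    marker, and each slot costs log2(number of symbols + 1) bits."""
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    bits_per_slot = math.log2(len(symbols) + 1)
    slots = sum(1 + len(body) + 1
                for bodies in grammar.values() for body in bodies)
    return slots * bits_per_slot


def ddl(grammar, derivations):
    """Assumed derivation cost: expanding a non-terminal that has k
    alternative rules costs log2(k) bits."""
    return sum(math.log2(len(grammar[nt]))
               for derivation in derivations for nt in derivation)


# Toy corpus of 8 three-word sentences.
sentences = list(product(["the", "a"], ["cat", "dog"], ["sleeps", "runs"]))
vocab = sorted({w for s in sentences for w in s})

# Overly specific: one rule per training sentence.
specific = {"S": [tuple(s) for s in sentences]}
specific_derivs = [["S"] for _ in sentences]

# Overly general: any sequence of known words.
general = {"S": [("W", "S"), ("W",)], "W": [(w,) for w in vocab]}
general_derivs = [["S", "W"] * len(s) for s in sentences]

# In between: one rule per word class.
middle = {"S": [("D", "N", "V")],
          "D": [("the",), ("a",)],
          "N": [("cat",), ("dog",)],
          "V": [("sleeps",), ("runs",)]}
middle_derivs = [["S", "D", "N", "V"] for _ in sentences]

for name, g, d in [("specific", specific, specific_derivs),
                   ("general", general, general_derivs),
                   ("middle", middle, middle_derivs)]:
    print(f"{name}: GDL={gdl(g):.1f} DDL={ddl(g, d):.1f} "
          f"ML={gdl(g) + ddl(g, d):.1f}")
```

Under these assumed costs the overly specific grammar pays mostly in GDL, the overly general one mostly in DDL, and the grammar in between has the smallest model length.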

Learning Context Free Grammars
eg-GRIDS Architecture
[Figure: eg-GRIDS architecture diagram. Labels: Training Examples; Overly Specific Grammar; Beam of Grammars; Search Organisation; Selection; Evolutionary Algorithm; Mutation; Operator Mode; Learning Operators (MergeNTOperator, CreateNTOperator, Create Optional NT, DetectCenterEmbedding, BodySubstitution); decision "Any Inferred Grammar better than those in beam?" (YES / NO); Final Grammar]

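Meta-learning for Web IE
Stacked generalization
[Figure: stacked generalization. The base-level dataset D is split into Dj and D \ Dj. Learners L1…LN trained on D \ Dj give classifiers C1(j)…CN(j), whose predictions on Dj form the meta-level vectors MDj of the meta-level dataset MD. The meta-level learner LM induces the meta-level classifier CM from MD. For a new vector x, the classifiers C1...CN produce a meta-level vector, from which CM outputs the class value y(x).]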
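The stacked generalization scheme on this slide can be sketched in a few lines of plain Python. The base learners (1-nearest-neighbour and majority class), the lookup-table meta-learner, the fold assignment and the toy data are all assumptions chosen for brevity; only the overall D / Dj / MD / CM structure follows the slide.

```python
from collections import Counter, defaultdict


def nn_learner(X, y):
    """Base learner L1 (assumed): 1-nearest-neighbour classifier."""
    def classify(x):
        dist = lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        return min(zip(X, y), key=dist)[1]
    return classify


def majority_learner(X, y):
    """Base learner L2 (assumed): always predicts the majority class."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def table_learner(X, y):
    """Meta-level learner LM (assumed): most frequent class per
    meta-level vector, falling back to the overall majority class."""
    votes = defaultdict(Counter)
    for xi, yi in zip(X, y):
        votes[xi][yi] += 1
    default = Counter(y).most_common(1)[0][0]
    return lambda x: votes[x].most_common(1)[0][0] if x in votes else default


def stack_train(X, y, base_learners, meta_learner, n_folds):
    meta_X, meta_y = [], []                      # meta-level dataset MD
    for j in range(n_folds):
        held_out = [i for i in range(len(X)) if i % n_folds == j]   # Dj
        rest = [i for i in range(len(X)) if i % n_folds != j]       # D \ Dj
        classifiers = [L([X[i] for i in rest], [y[i] for i in rest])
                       for L in base_learners]   # C1(j)...CN(j)
        for i in held_out:
            meta_X.append(tuple(C(X[i]) for C in classifiers))
            meta_y.append(y[i])
    C_M = meta_learner(meta_X, meta_y)
    final = [L(X, y) for L in base_learners]     # C1...CN trained on all of D

    def predict(x):
        return C_M(tuple(C(x) for C in final))
    return predict


X = [(0,), (1,), (2,), (3,), (10,), (11,), (12,), (13,)]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
predict = stack_train(X, y, [nn_learner, majority_learner], table_learner, 4)
print(predict((1.5,)), predict((12.4,)))
```

The meta-level learner only ever sees predictions made on held-out data, which is what lets it learn how far each base classifier can be trusted.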

Meta-learning for Web IE
Information Extraction is not naturally a classification task
• In IE we deal with text documents, paired with templates.
• Each template is filled with instances <t(s,e), f>, where t(s,e) is a text fragment with start and end positions s and e, and f is the template field it fills.

Meta-learning for Web IE
Combining Information Extraction systems

Meta-learning for Web IE
Creating a stacked template

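Meta-learning for Web IE
Training in the new stacking framework
[Figure: training in the stacking framework for IE. D = set of documents, paired with hand-filled templates. D is split into Dj and D \ Dj. Learners L1…LN trained on D \ Dj give extractors E1(j)…EN(j), whose output on Dj is merged into stacked templates ST1, ST2, …, forming MDj. MD = set of meta-level feature vectors, from which the meta-level learner LM induces CM. Learners L1…LN trained on D give the extractors E1…EN.]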

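Meta-learning for Web IE
Stacking at run-time
[Figure: stacking at run-time. A new document d is processed by the extractors E1, E2, …, EN, which produce templates T1, T2, …, TN. These are merged into a stacked template, and CM decides on each instance <t(s,e), f>, giving the final template T.]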
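The run-time step can be sketched as follows. Each extractor's template is represented as a list of (s, e, f) triples; the stacked template groups them by text span, and a meta-level classifier decides the final field. The data layout, the function names and the voting rule used in place of a learned CM are assumptions for illustration, not the method used in the talk.

```python
from collections import Counter


def stacked_template(templates):
    """Merge templates T1...TN: one row per span (s, e), holding the field
    each extractor assigned to that span (None if it extracted nothing)."""
    spans = sorted({(s, e) for T in templates for (s, e, f) in T})
    rows = {}
    for span in spans:
        rows[span] = tuple(
            next((f for (s, e, f) in T if (s, e) == span), None)
            for T in templates)
    return rows


def majority_cm(features):
    """Stand-in for the learned meta-level classifier CM (assumed): accept
    a field only if at least two extractors agree on it, else reject."""
    counts = Counter(f for f in features if f is not None)
    if not counts:
        return None
    field, n = counts.most_common(1)[0]
    return field if n >= 2 else None


def run_time_stacking(templates, C_M):
    """Build the final template T from the stacked template."""
    final = []
    for (s, e), features in stacked_template(templates).items():
        field = C_M(features)
        if field is not None:
            final.append((s, e, field))
    return final


# Hypothetical output of three extractors on one document.
T1 = [(10, 14, "price"), (30, 41, "processor")]
T2 = [(10, 14, "price"), (50, 53, "ram")]
T3 = [(30, 41, "processor"), (50, 53, "hdd")]

final_template = run_time_stacking([T1, T2, T3], majority_cm)
print(final_template)
```

In the stacking framework the voting rule would be replaced by a classifier trained on the meta-level vectors, so that it can learn which extractor to trust for which field.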

Ontology Enrichment
• We concentrate on instances.
• Highly evolving domain (e.g. laptop descriptions).
• New instances can characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it does not already exist in the ontology.
• An instance can have a new surface appearance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
• The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.

Ontology Enrichment
[Figure: ontology enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; machine learning; Information extraction; Additional annotations; Ontology Enrichment / Population; Validation; Domain Expert]

Enrichment with synonyms
• There is a need for supporting the enrichment of the ‘synonymy’ relationship.
• The number of instances for validation increases with the size of the corpus and the ontology.
• Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
• Issues to be handled:
  • Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
  • Orthographical: ‘Intel p3’ - ‘intell p3’
  • Lexicographical: ‘Hewlett Packard’ - ‘HP’
  • Combination: ‘IntellPentium 3’ - ‘P III’

Compression-based Clustering
• COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements-letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
• CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when adding a candidate string. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
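A rough sketch of the idea follows. It assumes that the code length of a cluster is the total Huffman-coded length of its characters, that CCDiff is the increase in that length when a string is added, and that strings are assigned in a single pass. The threshold value and the toy strings are hypothetical, and the actual COCLU definitions may differ in detail.

```python
import heapq
from collections import Counter


def huffman_code_length(strings):
    """Total bits needed to encode all characters of the cluster with a
    Huffman code built from the cluster's own character frequencies."""
    freq = Counter("".join(strings))
    if not freq:
        return 0
    if len(freq) == 1:                 # single symbol: assume 1 bit each
        return sum(freq.values())
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:               # total = sum of internal node weights
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def ccdiff(cluster, candidate):
    """Assumed CCDiff: growth in code length when the candidate is added."""
    return (huffman_code_length(cluster + [candidate])
            - huffman_code_length(cluster))


def coclu(strings, threshold):
    """Single-pass sketch: add each string to the closest cluster, or open
    a new cluster when even the closest one exceeds the threshold."""
    clusters = []
    for s in strings:
        best = min(clusters, key=lambda c: ccdiff(c, s), default=None)
        if best is not None and ccdiff(best, s) <= threshold:
            best.append(s)
        else:
            clusters.append([s])
    return clusters


print(ccdiff(["aab"], "ab"), ccdiff(["aab"], "cd"))
clusters = coclu(["aab", "ab", "cd", "dc"], threshold=4)
print(clusters)
```

A string that reuses the characters of a cluster adds few bits, while one that introduces new characters lengthens the codes of the whole cluster, which is why the difference works as a typographic distance. On real instance names the strings would probably be lower-cased first and the threshold tuned on validation data.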

Conclusions
• Information integration can benefit from machine learning.
• Grammar learning methods have become efficient.
• Combining IE systems improves performance.
• Ontologies can be used to annotate examples to learn IE systems and enrich ontologies.
• Grammar learning in parallel or in combination with ontology learning?

Acknowledgements
• This is research of many current and past members of SKEL.
• CROSSMARC is joint work of the project consortium.

Announcement
IJCAI workshop: Workshop on Grammatical Inference Applications: Successes and Future Challenges
IJCAI-05, Edinburgh, Scotland, July 31, 2005
Paper submission deadline: March 19, 2005
URL: http://www.ics.mq.edu.au/~menno/IJCAI05/
