Ontology Learning and Population using Heterogeneous Sources on the Web • Victor de Boer • OLP-AIO’s Workshop, March 16th, 2005
About me • Victor de Boer • Artificial Intelligence @ UvA • Graduated in human memory modelling • AiO (PhD student) since January 1st, 2004 • Supervisors: Bob Wielinga and Maarten van Someren • MultimediaN • (Mn-9c: VU, CWI, DEN)
Outline • Introduction and Research Questions • Ontology Learning and Population Task • My approach: Redundancy-based • Case Study • Results • Further Research • Questions / Discussion
Intro and Research Questions • Backbone of the Semantic Web: • Ontologies • Content • Manual construction has its flaws and is also very time-consuming. • The Web contains a lot of knowledge: let’s use it. • My research questions: • How can we automatically construct, enrich and populate ontologies using heterogeneous sources on the Web? • And how can these ontologies help us extract more information? (bootstrapping)
OLP Task Description • Ontology Learning: • Concepts: • NERC, LSI, … • Hierarchical Structure • Hearst Patterns, … • Other relations • Ontology Population • Instances • Relation Instances • Ontology Enrichment • [Figure: example ontology with concepts C1 to C4 and instances I1 to I3]
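To make the "Hearst Patterns" bullet concrete, here is a minimal sketch of one such lexical pattern ("X such as A, B and C") for proposing is-a links. It is only an illustration of the general technique, not the method used in this talk; the regular expression, the function name `hearst_such_as` and the restriction to single-word terms are all assumptions made for brevity.

```python
import re

# One classic Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Assumption: every term is a single word; real systems match full noun phrases.
SEPARATOR = r", and |, or |, | and | or "
PATTERN = re.compile(
    r"(?P<hyper>\w+),? such as (?P<hypos>\w+(?:(?:" + SEPARATOR + r")\w+)*)"
)

def hearst_such_as(text):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hyponyms = re.split(SEPARATOR, match.group("hypos"))
        pairs.extend((hyponym, match.group("hyper")) for hyponym in hyponyms)
    return pairs

pairs = hearst_such_as("He admired painters such as Monet, Renoir and Degas.")
```

Each returned pair is only a candidate for the hierarchical structure; it would still need filtering before being added to an ontology.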
Relation Instantiation • We have: • two concepts C1 and C2, • a relation R(C1,C2) • and instances I1 of C1 and I2 of C2. • Find the instance pairs (I1, I2) for which the relation R holds. • Examples: • <Country, has_city, City> • <Movie, has_director, Director> • <Artstyle, has_artist, Artist> • Information Extraction!
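The task statement above can be written down as a small data structure plus a filter over candidate pairs. This is a sketch under assumptions: the names `Relation`, `instantiate` and the `holds` predicate are hypothetical and do not come from the slides, and `holds` stands in for whatever evidence-gathering step decides that R holds for a pair.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Relation:
    """A relation R(C1, C2), e.g. <Country, has_city, City>."""
    domain: str   # concept C1
    name: str     # relation name R
    range_: str   # concept C2

def instantiate(relation, instances, holds):
    """Return the triples (I1, R, I2) for which the predicate `holds` is true.

    `instances` maps a concept name to its known instances; `holds` is a
    placeholder for the actual information-extraction step.
    """
    subjects = instances[relation.domain]
    objects = instances[relation.range_]
    return [
        (s, relation.name, o)
        for s, o in product(subjects, objects)
        if holds(s, o)
    ]

# Toy usage with a hand-written lookup instead of a real extractor.
has_city = Relation("Country", "has_city", "City")
instances = {"Country": ["France", "Italy"], "City": ["Paris", "Rome"]}
known = {("France", "Paris"), ("Italy", "Rome")}
triples = instantiate(has_city, instances, lambda s, o: (s, o) in known)
```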
Approaches • Current approaches: • NLP-based: work well for natural-language documents • Wrapper-like: work well with (semi-)structured documents • Neither is a generic approach • My approach: • Use generic methods, applicable to heterogeneous sources, combining information to collect evidence of this relation. Redundancy of information should compensate for the loss of subtlety.
Case Study: Domain • Art and Architecture Thesaurus (AAT) • Union List of Artist Names (ULAN) • Relation: <aat:style, aua:has_artist, ulan:artist> • Find instances of this relation (has_artist)
Case Study: Method • [Figure: pipeline diagram with the components AAT, Seed list, Manual wrapper, Person Name Extractor and ULAN-check; example names: Otto Dix, S. Freud, George Grosz; Score: “George Grosz” + 0.5]
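The pipeline in the figure can be sketched roughly as follows. The slides do not give the scoring formula, so everything here is an assumption for illustration: the naive capitalised-word regex stands in for the Person Name Extractor, a plain set stands in for the ULAN-check, and each page contributes a weight equal to the fraction of seed artists it mentions. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Naive stand-in for a person name extractor: two or more capitalised words.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def extract_person_names(text):
    return set(NAME_RE.findall(text))

def score_candidates(pages, seeds, ulan):
    """Accumulate a redundancy-based score for non-seed artists.

    Assumed scoring: a page's weight is the fraction of seed artists it
    mentions, and every other ULAN name on that page gains that weight.
    """
    scores = defaultdict(float)
    for text in pages:
        names = extract_person_names(text) & ulan   # ULAN-check
        weight = len(names & seeds) / len(seeds)
        if weight == 0:
            continue                                # page gives no evidence
        for name in names - seeds:
            scores[name] += weight
    return dict(scores)

pages = [
    "Otto Dix and George Grosz painted in Berlin.",
    "Sigmund Freud wrote about George Grosz.",
]
seeds = {"Otto Dix", "Max Beckmann"}
ulan = {"Otto Dix", "Max Beckmann", "George Grosz"}
scores = score_candidates(pages, seeds, ulan)
```

In this toy run the first page mentions one of the two seeds, so George Grosz gains 0.5 from it; the second page mentions no seed and adds nothing, and Sigmund Freud is dropped because the name is not in the ULAN set.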
Case Study: Results • Impressionism: 200 pages (approx. 120 used) • Seed artists: Degas, Gauguin, Boudin, Morisot, Caillebotte, Seurat, Monet, Renoir, Manet • Ranked candidates (name ; score ; ULAN id):
sisley, alfred ; 0.08 ; ulan#19582
cassatt, mary ; 0.0780414 ; ulan#8671
cezanne, paul ; 0.0764626 ; ulan#9730
bazille, frederic ; 0.0394824 ; ulan#2147
signac, paul ; 0.0265291 ; ulan#19142
guillaumin, armand ; 0.0263668 ; ulan#11549
gustave courbet ; 0.0218521 ; ulan#12992
bonnard, pierre ; 0.0149454 ; ulan#4215
henri matisse ; 0.0134152 ; ulan#5698
camille corot ; 0.0128969 ; ulan#10536
d'orsay ; 0.0123066 ; ulan#28304
auguste rodin ; 0.0115357 ; ulan#17831
theodore rousseau ; 0.011157 ; ulan#18605
childe hassam ; 0.0107054 ; ulan#12300
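The result entries use a "name ; score ; ULAN id" layout and mix two name orders ("sisley, alfred" versus "gustave courbet"). A small parser such as the sketch below could normalise them before further processing; the function name `parse_result_line` and the choice to normalise to "first last" order are assumptions, not part of the original system.

```python
def parse_result_line(line):
    """Parse 'name ; score ; ulan#id' into (name, score, ulan_id).

    Names written as 'last, first' are rewritten to 'first last' so that
    both spellings found in the output share one form.
    """
    name, score, ulan_id = (part.strip() for part in line.split(";"))
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return name, float(score), ulan_id

first_row = parse_result_line("sisley, alfred ; 0.08 ; ulan#19582")
other_row = parse_result_line("gustave courbet ; 0.0218521 ; ulan#12992")
```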
Case Study: Results • Evaluation problems • 18 Impressionists (Gold Standard)
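One common way to evaluate a ranked candidate list against a gold standard is precision and recall at a cut-off k. The sketch below shows that calculation only; it is not necessarily the evaluation used in the talk, the function name is hypothetical, and the toy data are placeholders rather than the actual gold standard of 18 Impressionists.

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k ranked names against a gold set."""
    top = ranked[:k]
    hits = sum(1 for name in top if name in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Placeholder data: letters stand in for artist names.
ranked = ["a", "b", "c", "d"]
gold = {"a", "c", "e"}
precision, recall = precision_recall_at_k(ranked, gold, k=2)
```

Note that seed artists would have to be excluded from the gold set (or from the ranking) for the recall figure to be meaningful, since the method never returns them as new candidates.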
Assumptions, Limitations • Conclusions: • It seems to work • Evaluation is a problem • Assumptions: • The redundancy of information we extract by using multiple, heterogeneous sources compensates for what we lose by not using more ‘sophisticated’ methods • R must be a one-to-many relation (no functional properties) • C1 must be ‘googlable’ • C2 must be ‘extractable’
Further Research • Collect more results (how robust is it?) • Different domains • More heterogeneous sources (databases, offline dictionaries, …) • Use page classification / trustworthiness • Evaluation • Use ontological information