Semantically Conceptualizing and Annotating Tables Stephen Lynn & David W. Embley Data Extraction Research Group Department of Computer Science Brigham Young University Supported by the
Overview • Context • WoK: Web of Knowledge • TANGO: Table ANalysis for Generating Ontologies • MOGO: Mini-Ontology GeneratOr • Semantic Enrichment via MOGO • Implementation • Experimentation • Enhancements • Challenges & Opportunities
TANGO • TANGO repeatedly turns raw tables into conceptual mini-ontologies and integrates them into a growing ontology.
MOGO • MOGO generates mini-ontologies from interpreted tables.
MOGO Overview • Table: Interpretation; yields a canonical table • Canonical Table: Concept/Value Recognition; Relationship Discovery; Constraint Discovery; yields a semantically enriched conceptual model • Mini-ontology: Integration into a growing ontology
Sample Input and Sample Output (figures)
Concept/Value Recognition • Lexical Clues: labels as data values; data value assignment • Data Frame Clues: labels as data values; data value assignment • Default: recognize concepts and values by syntax and layout
Concept/Value Recognition: Concepts and Value Assignments • Year: 2002, 2003 • Location • Region: Northeast, Northwest • State: Delaware, Maine, Oregon, Washington • Population: 2,122,869; 817,376; 1,305,493; 9,690,665; 3,559,547; 6,131,118 • Latitude: 45, 44, 45, 43 • Longitude: -90, -93, -120, -120
Relationship Discovery • Dimension Tree Mappings • Lexical Clues • Generalization/Specialization • Aggregation • Data Frames • Ontology Fragment Merge
Constraint Discovery • Generalization/Specialization • Computed Values • Functional Relationships • Optional Participation
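Two of the constraint kinds listed above, functional relationships and computed values, can be illustrated with a minimal Python sketch. This is not MOGO's implementation: the helper names `is_functional` and `is_computed_sum` are hypothetical, and the pairing of population figures with states and of states with regions is an assumption read off the order of the sample values on the Concepts and Value Assignments slide.

```python
from typing import Hashable, Iterable, Tuple


def is_functional(pairs: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """True if no left-hand value is paired with two different right-hand values."""
    seen = {}
    for left, right in pairs:
        if left in seen and seen[left] != right:
            return False
        seen[left] = right
    return True


def is_computed_sum(total: int, parts: Iterable[int]) -> bool:
    """True if `total` equals the sum of `parts` (a candidate computed value)."""
    return total == sum(parts)


# Sample values from the slides; the state/region grouping is assumed.
state_population = {
    "Delaware": 817_376,
    "Maine": 1_305_493,
    "Oregon": 3_559_547,
    "Washington": 6_131_118,
}
region_population = {"Northeast": 2_122_869, "Northwest": 9_690_665}
region_states = {
    "Northeast": ["Delaware", "Maine"],
    "Northwest": ["Oregon", "Washington"],
}

# Functional check: does each State determine a single Population?
functional = is_functional(state_population.items())

# Computed-value check: is each Region total the sum of its States?
computed = {
    region: is_computed_sum(
        region_population[region],
        [state_population[state] for state in states],
    )
    for region, states in region_states.items()
}

print(functional, computed)
```

Under the assumed grouping, both regional totals equal the sum of their state values, so a Region population would be flagged as computable rather than stored as an independent fact.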
Validation • Concept/Value Recognition: correctly identified concepts; missed concepts; false positives; data value assignments • Relationship Discovery: valid relationship sets; invalid relationship sets; missed relationship sets • Constraint Discovery: valid constraints; invalid constraints; missed constraints
Concept Recognition • Counted: • Correct/Incorrect/Missing Concepts • Correct/Incorrect/Missing Labels • Data value assignments
Relationship Discovery • Counted: • Correct/incorrect/missing relationship sets • Correct/incorrect/missing aggregations and generalization/specializations
Constraint Discovery • Counted: • Correct/Incorrect/Missing: • Generalization/Specialization constraints • Computed value constraints • Functional constraints • Optional constraints
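The correct/incorrect/missing counts described on the three "Counted" slides map onto precision, recall, and F-measure. The sketch below uses the standard definitions, which are assumed here rather than taken from the slides; the function name `precision_recall_f` and the example counts are hypothetical and are not the experiment's data.

```python
from typing import Tuple


def precision_recall_f(correct: int, incorrect: int, missing: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and balanced F-measure from raw counts.

    correct   -- items the tool produced that are right
    incorrect -- items the tool produced that are wrong (false positives)
    missing   -- items the tool should have produced but did not
    """
    produced = correct + incorrect
    expected = correct + missing
    precision = correct / produced if produced else 0.0
    recall = correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts for one table, for illustration only.
p, r, f = precision_recall_f(correct=8, incorrect=2, missing=2)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

The same function applies unchanged to concepts, relationship sets, and constraints, since each is tallied with the same three counts.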
Concept Recognition • Successes • 98% of concepts identified • Missing label identification • 97% of values assigned to correct concept • Common problems • Finding an appropriate label • Duplicate concepts
Relationship Discovery • Recall of 92% for relationship sets • Missing aggregations and generalization/specializations (only found in label nesting) • Unnecessary relationship sets generated (they are computable)
Constraint Discovery • F-measure of 98% for functional relationship sets • Computed value discovery • Functional/non-functional lists in cells
MOGO Contributions • Tool to generate mini-ontologies • Accuracy encouraging
Opportunities & Challenges: MOGO • Enhancements • Check for inter-label relationships • Check for more complex computations • Check for lists in cells • … • Wish List • Data-frame library • Atomic knowledge components • Instance recognizers • Library of molecular components • Semi-automatic construction of a WordNet-like resource for knowledge components
Summary • MOGO • Semantic Enrichment • Encouraging Results • But More Possible • Broader Implications ~ Vision & Challenges • TANGO • WoK • Web of Data • Semantic Annotation • User-friendly Query Answering www.deg.byu.edu embley@cs.byu.edu
Opportunities & Challenges: TANGO • Table Interpretation • Transforming tables to F-logic [Pivk07] • Layout-independent table representation [Jha08] • Table interpretation by sibling tables [Tao07] • Semantic Enhancement / Ontology Generation • Naming unnamed table concepts [Pivk07] • MOGO [Lynn09] • Semi-automatic Ontology Integration • Ontology Matching [Euzenat07] • Ontology-mapping tools [Falconer07] • Direct and indirect schema mappings for TANGO [Xu06]
Opportunities & Challenges: WoK • Web of Data • “The Semantic Web is a web of data.” [W3C] • Upcoming special issue of Journal of Web Semantics • “Enabling a Web of Knowledge” [Tao09] • Information Extraction • Domain-independent IE from web tables [Gatterbauer07] • Open IE [Banko07] • …
Opportunities & Challenges: WoK • … • Semantic Annotation wrt Ontologies • Linking Data to Ontologies [Poggi08] • TISP [Tao07] • FOCIH [Tao09] • Reasoning & Query Answering • Description Logics [Baader03] • NLIDB Community • AskOntos [Ding06] • SerFR [Al-Muhammed07]
References
• [Al-Muhammed07] Al-Muhammed and Embley, “Ontology-Based Constraint Recognition for Free-Form Service Requests”, Proceedings of the 23rd International Conference on Data Engineering, 2007.
• [Baader03] Baader, Calvanese, McGuinness, Nardi and Patel-Schneider, The Description Logic Handbook, Cambridge University Press, 2003.
• [Banko07] Banko, Cafarella, Soderland, Broadhead and Etzioni, “Open Information Extraction from the Web”, Proceedings of the International Joint Conference on Artificial Intelligence, 2007.
• [Ding06] Ding, Embley and Liddle, “Automatic Creation and Simplified Querying of Semantic Web Content: An Approach Based on Information-Extraction Ontologies”, Proceedings of the First Asian Semantic Web Conference, 2006.
• [Euzenat07] Euzenat and Shvaiko, Ontology Matching, Springer Verlag, 2007.
• [Falconer07] Falconer, Noy and Storey, “Ontology Mapping - A User Survey”, Proceedings of the Second International Workshop on Ontology Mapping, 2007.
• [Gatterbauer07] Gatterbauer, Bohunsky, Herzog and Pollak, “Towards Domain-Independent Information Extraction from Web Tables”, Proceedings of the Sixteenth International World Wide Web Conference, 2007.
• [Jha08] Jha and Nagy, “Wang Notation Tool: Layout Independent Representation of Tables”, Proceedings of the 19th International Conference on Pattern Recognition, 2008.
• [Pivk07] Pivk, Sure, Cimiano, Gams, Rajkovič and Studer, “Transforming Arbitrary Tables into Logical Form with TARTAR”, Data & Knowledge Engineering, 2007.
• [Poggi08] Poggi, Lembo, Calvanese, DeGiacomo, Lenzerini and Rosati, “Linking Data to Ontologies”, Journal on Data Semantics, 2008.
• [Tao07] Tao and Embley, “Automatic Hidden-Web Table Interpretation by Sibling Page Comparison”, Proceedings of the 26th International Conference on Conceptual Modeling, 2007.
• [Tao09] Tao, Embley and Liddle, “Enabling a Web of Knowledge”, Technical Report: tango.byu.edu/papers, 2009.
• [Xu06] Xu and Embley, “A Composite Approach to Automating Direct and Indirect Schema Mappings”, Information Systems, 2006.