WoK: A Web of Knowledge David W. Embley Brigham Young University Provo, Utah, USA
A Web of Pages A Web of Facts • Birthdate of my great grandpa Orson • Price and mileage of red Nissans, 1990 or newer • Location and size of chromosome 17 • US states with property crime rates above 1%
Toward a Web of Knowledge • Fundamental questions • What is knowledge? • What are facts? • How does one know? • Philosophy • Ontology • Epistemology • Logic and reasoning
Ontology • Existence asks “What exists?” • Concepts, relationships, and constraints with formal foundation
Epistemology • The nature of knowledge asks: “What is knowledge?” and “How is knowledge acquired?” • Populated conceptual model
Logic and Reasoning • Principles of valid inference – asks: “What is known?” and “What can be inferred?” • For us, it answers: what can be inferred (in a formal sense) from conceptualized data. Find price and mileage of red Nissans, 1990 or newer
Making this Work: How? • Distill knowledge from the wealth of digital web data • Annotate web pages • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge [Figure labels: Annotation, Fact]
Turning Raw Symbols into Knowledge • Symbols: $ 11,500 117K Nissan CD AC • Data: price(11,500) mileage(117K) make(Nissan) • Conceptualized data: • Car(C123) has Price($11,500) • Car(C123) has Mileage(117,000) • Car(C123) has Make(Nissan) • Car(C123) has Feature(AC) • Knowledge • “Correct” facts • Provenance
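The progression above (symbols, data, conceptualized data, knowledge with provenance) can be sketched in a few lines. This is only an illustration under stated assumptions: the recognizer patterns, the `Fact` class, the `conceptualize` function, and the source URL are all hypothetical and are not taken from the WoK system.

```python
import re
from dataclasses import dataclass

# Hypothetical recognizers: (concept name, pattern for the raw symbol, normalizer).
RECOGNIZERS = [
    ("Price", re.compile(r"^\$\s*(\d{1,3}(?:,\d{3})*)$"),
     lambda m: int(m.group(1).replace(",", ""))),
    ("Mileage", re.compile(r"^(\d+)K$"),
     lambda m: int(m.group(1)) * 1000),
    ("Make", re.compile(r"^(Nissan|Toyota|Honda)$"),
     lambda m: m.group(1)),
    ("Feature", re.compile(r"^(AC|CD)$"),
     lambda m: m.group(1)),
]


@dataclass(frozen=True)
class Fact:
    subject: str   # non-lexical object, e.g. Car(C123)
    concept: str   # lexical object set, e.g. Price
    value: object  # normalized value
    source: str    # provenance: where the symbol was found


def conceptualize(symbols, car_id, source):
    """Turn raw symbols into facts about one car, keeping provenance."""
    facts = []
    for symbol in symbols:
        for concept, pattern, normalize in RECOGNIZERS:
            match = pattern.match(symbol)
            if match:
                facts.append(Fact(f"Car({car_id})", concept, normalize(match), source))
                break  # first matching recognizer wins
    return facts


# The URL is a placeholder standing in for the annotated page.
facts = conceptualize(["$11,500", "117K", "Nissan", "CD", "AC"],
                      "C123", "http://example.com/ad42")
for f in facts:
    print(f"{f.subject} has {f.concept}({f.value})  [from {f.source}]")
```

Keeping the source alongside each fact is what lets a user check a "correct" fact against the page it came from.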
Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.
Explanation: How it Works • Extraction Ontologies • Semantic Annotation • Free-Form Query Interpretation
Extraction Ontologies • Object sets • Relationship sets • Participation constraints • Lexical • Non-lexical • Primary object set • Aggregation • Generalization/Specialization
Extraction Ontologies: Data Frame • Internal Representation: float • Values • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})? • Left Context: $ • Key Word Phrase • Key Words: ([Pp]rice)|([Cc]ost)| … • Operators • Operator: > • Key Words: (more\s*than)|(more\s*costly)|…
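A data frame bundles an internal representation, value patterns, context keywords, and operator keywords. The sketch below shows one possible shape for a Price data frame. The class name, its method names, and the comma-aware value pattern are assumptions made for illustration; only the keyword and operator expressions are borrowed from the slide, and the slide's elided alternatives ("…") are left out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    name: str
    external_rep: str   # regex for how values appear in text
    keywords: str       # regex for phrases that signal this concept
    operators: dict = field(default_factory=dict)  # operator symbol -> keyword regex

    @staticmethod
    def to_internal(external):
        """Convert an external string such as '$11,500' to the float internal form."""
        return float(re.sub(r"[^\d.]", "", external))

    def values(self, text):
        """Return every recognized value in text, in internal representation."""
        return [self.to_internal(m.group(0))
                for m in re.finditer(self.external_rep, text)]

    def mentioned(self, text):
        """True if a keyword for this concept appears in text."""
        return re.search(self.keywords, text) is not None

    def operator(self, text):
        """Return the first operator symbol whose keywords appear in text, else None."""
        for symbol, pattern in self.operators.items():
            if re.search(pattern, text):
                return symbol
        return None


price = DataFrame(
    name="Price",
    # Assumed pattern (not the slide's): dollar sign, digit groups, optional cents.
    external_rep=r"\$\s*\d{1,3}(?:,\d{3})*(?:\.\d{2})?",
    keywords=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)

print(price.values("Asking price: $11,500 obo"))
print(price.operator("cars that cost more than $5,000"))
```

Because everything is declared as data (patterns and keyword lists), the same recognizer applies both to web pages and to free-form queries.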
Generality & Resiliency of Extraction Ontologies • Generality: assumptions about web pages • Data rich • Narrow domain • Document types • Single-record documents (hard, but doable) • Multiple-record documents (harder) • Records with scattered components (even harder) • Resiliency: declarative • Still works when web pages change • Works for new, unseen pages in the same domain • Scalable, but takes work to declare the extraction ontology
Free-Form Query Interpretation • Parse Free-Form Query (with respect to data extraction ontology) • Select Ontology • Formulate Query Expression • Run Query Over Semantically Annotated Data
Parse Free-Form Query “Find me the price and mileage of all red Nissans – I want a 1996 or newer” • Recognized: price, mileage, red, Nissan, 1996 • “or newer”: >= Operator
Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”
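One plausible way to select an ontology is to run every candidate ontology's recognizers over the query and keep the one with the most hits. The sketch below assumes that scoring rule; the two ontologies, their recognizer patterns, and the `select_ontology` function are hypothetical and are not the actual WoK selection procedure.

```python
import re

# Hypothetical extraction ontologies: object set name -> recognizer regex.
ONTOLOGIES = {
    "CarAd": {
        "Price": r"\bprice\b|\bcost\b",
        "Mileage": r"\bmileage\b|\bmiles\b",
        "Color": r"\b(red|blue|white|black)\b",
        "Make": r"\b(Nissan|Toyota|Honda)s?\b",
        "Year": r"\b(19|20)\d{2}\b",
    },
    "ApartmentRental": {
        "Price": r"\bprice\b|\brent\b",
        "Bedrooms": r"\b\d+\s*bedrooms?\b",
        "Location": r"\bnear\b|\bdowntown\b",
    },
}


def select_ontology(query):
    """Score each ontology by how many of its object sets match the query."""
    scores = {
        name: sum(1 for pattern in recognizers.values()
                  if re.search(pattern, query, re.IGNORECASE))
        for name, recognizers in ONTOLOGIES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
best, scores = select_ontology(query)
print(best, scores)
```

Counting matched object sets is the simplest possible score; weighting by how much of the query each ontology covers would be a natural refinement.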
Formulate Query Expression • Conjunctive queries and aggregate queries • Mentioned object sets are all of interest. • Values and operator keywords determine conditions. • Color = “red” • Make = “Nissan” • Year >= 1996 (>= operator from “or newer”)
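The rules on this slide can be sketched as follows: mentioned object sets become the returned columns, recognized values become conditions, and an operator keyword directly after a value replaces the default equality. All patterns, the `formulate` and `run` functions, and the sample records are assumptions made for illustration.

```python
import operator
import re

# Hypothetical recognizers taken from a car-ad extraction ontology.
OBJECT_SET_KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUE_RECOGNIZERS = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b((?:19|20)\d{2})\b",
}
OPERATOR_KEYWORDS = {">=": r"\bor\s+newer\b", "<=": r"\bor\s+older\b"}
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def formulate(query):
    """Return (object sets to report, list of (object set, operator, value))."""
    wanted = [name for name, kw in OBJECT_SET_KEYWORDS.items()
              if re.search(kw, query, re.IGNORECASE)]
    conditions = []
    for name, pattern in VALUE_RECOGNIZERS.items():
        match = re.search(pattern, query, re.IGNORECASE)
        if not match:
            continue
        raw = match.group(1)
        value = int(raw) if raw.isdigit() else raw
        op = "="  # default when no operator keyword follows the value
        tail = query[match.end():]
        for symbol, kw in OPERATOR_KEYWORDS.items():
            if re.match(r"\s*" + kw, tail, re.IGNORECASE):
                op = symbol
        conditions.append((name, op, value))
    return wanted, conditions


def run(wanted, conditions, records):
    """Evaluate the conjunctive query over already-annotated records."""
    return [{k: r[k] for k in wanted}
            for r in records
            if all(OPS[op](r[name], value) for name, op, value in conditions)]


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
wanted, conditions = formulate(query)

# Made-up annotated data, for demonstration only.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "blue", "Year": 2001, "Price": 9000, "Mileage": 80000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 3000, "Mileage": 150000},
]
print(conditions)
print(run(wanted, conditions, cars))
```

Only conjunctive conditions are handled here; aggregate queries would need an extra step after filtering.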
Formulate Query Expression • Clauses of the generated query: For, Let, Where, Return
Great! But Problems Still Need Resolution • How do we create extraction ontologies? • Manual creation requires several dozen person-hours • Semi-automatic creation • TISP (Table Interpretation by Sibling Pages) • TANGO (Table ANalysis for Generating Ontologies) • Nested Schemas with Regular Expressions • Synergistic Bootstrapping • Form-based Information Harvesting • How do we scale up? • Practicalities of technology transfer and usage • Millions of queries over zillions of facts for thousands of ontologies
Manual Creation • Library of instance recognizers • Library of lexicons
Automatic Annotation with TISP (Table Interpretation by Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations
Recognize Tables Layout Tables (discard) Data Table Nested Data Tables
Locate Table Labels Examples: Identification.Gene model(s).Protein Identification.Gene model(s).2
Locate Table Labels Examples: Identification.Gene model(s).Gene Model Identification.Gene model(s).2
Locate Table Values Value
Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
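The example association pairs a column-label path with a row-label path and maps the pair to a cell value. A minimal sketch of building such pairs for one nested table follows. The `associate` function and the dotted-path construction are assumptions, and every cell except WP:CE28918 is a placeholder rather than real data.

```python
def associate(section, row_label, column_labels, rows):
    """Map (column label path, row label path) -> cell value for a nested table.

    Paths are built as section.row_label.column and section.row_label.row_number,
    mirroring the dotted labels in the slide example.
    """
    prefix = f"{section}.{row_label}"
    pairs = {}
    for row_number, row in enumerate(rows, start=1):
        for column, value in zip(column_labels, row):
            pairs[(f"{prefix}.{column}", f"{prefix}.{row_number}")] = value
    return pairs


# Placeholder cells except the one value quoted on the slide.
pairs = associate(
    section="Identification",
    row_label="Gene model(s)",
    column_labels=["Gene Model", "Protein"],
    rows=[
        ["gene-model-1", "protein-1"],
        ["gene-model-2", "WP:CE28918"],
    ],
)

key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(key, "=", pairs[key])
```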
Interpretation Technique: Sibling Page Comparison [Figure callout: Almost Same]
Interpretation Technique: Sibling Page Comparison [Figure callouts: Different, Same]
Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (layout table: discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
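The matching step can be sketched by comparing corresponding cells of two tables taken from sibling pages: identical tables are treated as layout and discarded, partially matching tables are siblings, and within a sibling pair the cells that stay the same are taken as labels while the cells that differ are taken as values. The function names, the 0.3 threshold, and the sample tables are assumptions; the sketch also assumes both tables have already been unnested and aligned cell by cell.

```python
def cell_pairs(table_a, table_b):
    """Pair up corresponding cells of two tables given as lists of rows."""
    return [(a, b)
            for row_a, row_b in zip(table_a, table_b)
            for a, b in zip(row_a, row_b)]


def match_kind(table_a, table_b, threshold=0.3):
    """Classify a table pair as 'perfect', 'reasonable', or 'no match'."""
    pairs = cell_pairs(table_a, table_b)
    if not pairs:
        return "no match"
    same = sum(1 for a, b in pairs if a == b) / len(pairs)
    if same == 1.0:
        return "perfect"      # identical on both pages: likely layout, discard
    return "reasonable" if same >= threshold else "no match"


def classify_cells(table_a, table_b):
    """Cells that are the same across siblings are labels; cells that differ are values."""
    return [["label" if a == b else "value" for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(table_a, table_b)]


# Placeholder tables from two hypothetical sibling pages.
page1 = [["Gene", "gene-A"], ["Protein", "WP:CE28918"]]
page2 = [["Gene", "gene-B"], ["Protein", "protein-B"]]
menu = [["Home", "Search", "Help"]]

print(match_kind(menu, menu))
print(match_kind(page1, page2))
print(classify_cells(page1, page2))
```

A value that happens to be identical on both pages would be misread as a label here, which is one reason to compare more than two siblings and to adjust the pattern dynamically.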
Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies) • Recognize and normalize table information • Construct mini-ontologies from tables • Discover inter-ontology mappings • Merge mini-ontologies into a growing ontology
Recognize Table Information
Country | Population (July 2001 est.) | Religion: Albanian Orthodox | Religion: Muslim | Religion: Roman Catholic | Religion: Shi’a Muslim | Religion: Sunni Muslim | Religion: other
Afghanistan | 26,813,057 | | | | 15% | 84% | 1%
Albania | 3,510,484 | 20% | 70% | 10% | | |
Construct Mini-Ontology • Input: the Country / Population (July 2001 est.) / Religion table for Afghanistan and Albania
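A mini-ontology for this table can be sketched as two relationship sets: Country to Population, and Country with Religion to a percentage. The `build_mini_ontology` function and the output layout are hypothetical, and the placement of each percentage under a religion column is an assumed reading of the flattened table.

```python
# Table as read from the slide. Which column each percentage belongs to is an
# assumption; empty strings mark cells with no value.
header = ["Country", "Population (July 2001 est.)",
          "Albanian Orthodox", "Muslim", "Roman Catholic",
          "Shi'a Muslim", "Sunni Muslim", "other"]
rows = [
    ["Afghanistan", "26,813,057", "", "", "", "15%", "84%", "1%"],
    ["Albania", "3,510,484", "20%", "70%", "10%", "", "", ""],
]


def build_mini_ontology(header, rows, group_label="Religion", group_start=2):
    """Normalize a table with one spanning header into object and relationship sets."""
    population = {}
    percentages = []
    for row in rows:
        country = row[0]
        population[country] = int(row[1].replace(",", ""))
        # Columns from group_start onward all fall under the spanning header.
        for label, cell in zip(header[group_start:], row[group_start:]):
            if cell:
                percentages.append((country, label, int(cell.rstrip("%"))))
    return {
        "object_sets": ["Country", "Population", group_label, "Percent"],
        "Country-Population": population,
        f"Country-{group_label}-Percent": percentages,
    }


mini = build_mini_ontology(header, rows)
print(mini["Country-Population"])
for triple in mini["Country-Religion-Percent"]:
    print(triple)
```

Turning the spanning Religion header into its own object set is what later allows this mini-ontology to be merged with others that mention religions or countries.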
Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions) • Build a page-layout, pattern-based annotator • Automate layout recognition based on examples • Auto-generate examples with extraction ontologies • Synergistically run pattern-based annotator & extraction-ontology annotator
PatML Editor • Information Structure Tree • Page Source Text • Browser-Rendered Page
Synergistic Execution • Extraction Ontology • Conceptual Annotator (ontology-based annotation) • Partially Annotated Document • Pattern Generation • Document Layout Patterns • Structural Annotator (layout-driven annotation) • Annotated Document
Form-Based Information Harvesting • Forms • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Instance recognizers • Some pre-existing instance recognizers • Lexicons • Automated extraction ontology creation?