WoK: A Web of Knowledge. David W. Embley, Brigham Young University, Provo, Utah, USA. A Web of Pages, A Web of Facts: birthdate of my great grandpa Orson; price and mileage of red Nissans, 1990 or newer; location and size of chromosome 17; US states with property crime rates above 1%.
WoK: A Web of Knowledge David W. Embley Brigham Young University Provo, Utah, USA
A Web of Pages A Web of Facts • Birthdate of my great grandpa Orson • Price and mileage of red Nissans, 1990 or newer • Location and size of chromosome 17 • US states with property crime rates above 1%
Toward a Web of Knowledge • Fundamental questions • What is knowledge? • What are facts? • How does one know? • Philosophy • Ontology • Epistemology • Logic and reasoning
Ontology • Existence asks “What exists?” • Concepts, relationships, and constraints with formal foundation
Epistemology • The nature of knowledge asks: “What is knowledge?” and “How is knowledge acquired?” • Populated conceptual model
Logic and Reasoning • Principles of valid inference – asks: “What is known?” and “What can be inferred?” • For us, it answers: what can be inferred (in a formal sense) from conceptualized data. Find price and mileage of red Nissans, 1990 or newer
Making this Work: How? • Distill knowledge from the wealth of digital web data • Annotate web pages • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge [Figure: annotations on web pages yield facts]
Turning Raw Symbols into Knowledge • Symbols: $ 11,500 117K Nissan CD AC • Data: price(11,500) mileage(117K) make(Nissan) • Conceptualized data: • Car(C123) has Price($11,500) • Car(C123) has Mileage(117,000) • Car(C123) has Make(Nissan) • Car(C123) has Feature(AC) • Knowledge • “Correct” facts • Provenance
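The progression from symbols to data to conceptualized data can be sketched in a few lines of Python. This is only an illustration of the idea on the slide's own example string. The recognizer patterns, the `Fact` class, and the make and feature word lists are assumptions made for the sketch, not part of the WoK system.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str   # e.g. "Car(C123)"
    relation: str  # always "has" in this sketch
    obj: str       # e.g. "Price($11,500)"
    source: str    # provenance: the raw text the fact was drawn from


# Hypothetical recognizers, one per object set (patterns are assumptions).
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"\b(\d{1,3})K\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}


def normalize(object_set, raw):
    """Turn a matched symbol into the value shown in the conceptualized data."""
    if object_set == "Price":
        return "$" + raw
    if object_set == "Mileage":
        return f"{int(raw) * 1000:,}"  # "117" (from 117K) becomes "117,000"
    return raw


def conceptualize(car_id, text):
    """Attach every recognized value to one car object and keep provenance."""
    facts = []
    for object_set, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            value = normalize(object_set, match.group(1))
            facts.append(Fact(f"Car({car_id})", "has", f"{object_set}({value})", text))
    return facts


facts = conceptualize("C123", "$ 11,500 117K Nissan CD AC")
for fact in facts:
    print(fact.subject, fact.relation, fact.obj)
```

Keeping the source text on every fact is one simple way to carry provenance along with the extracted value.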
Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.
Explanation: How it Works • Extraction Ontologies • Semantic Annotation • Free-Form Query Interpretation
Extraction Ontologies • Object sets (lexical and non-lexical) • Relationship sets • Participation constraints • Primary object set • Aggregation • Generalization/Specialization
Extraction Ontologies: Data Frame • Internal Representation: float • Values • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})? • Left Context: $ • Key Word Phrase • Key Words: ([Pp]rice)|([Cc]ost)| … • Operators • Operator: > • Key Words: (more\s*than)|(more\s*costly)|…
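A data frame bundles a value recognizer, context keywords, and operator keywords for one object set. The sketch below shows one way such a bundle might look in Python. The class name `DataFrame`, its fields, and its methods are hypothetical, and the value pattern is an adapted version of the slide's external representation that also accepts thousands separators.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    name: str
    value_pattern: re.Pattern    # external representation of values
    keyword_pattern: re.Pattern  # context keywords that signal the object set
    operators: dict = field(default_factory=dict)  # operator symbol -> keyword pattern

    def extract_values(self, text):
        """Return recognized values in the internal representation (float)."""
        return [float(m.group(1).replace(",", ""))
                for m in self.value_pattern.finditer(text)]

    def mentions(self, text):
        """True if a context keyword for this object set appears in the text."""
        return self.keyword_pattern.search(text) is not None

    def find_operators(self, text):
        """Return the operator symbols whose keywords appear in the text."""
        return [op for op, pattern in self.operators.items() if pattern.search(text)]


# Assumed patterns: the slide elides part of its keyword lists, so only the
# alternatives it actually shows are used here.
PRICE = DataFrame(
    name="Price",
    value_pattern=re.compile(r"\$\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)"),
    keyword_pattern=re.compile(r"[Pp]rice|[Cc]ost"),
    operators={">": re.compile(r"more\s*than|more\s*costly")},
)

print(PRICE.extract_values("Asking price: $11,500 or best offer"))
print(PRICE.find_operators("anything that costs more than $5,000"))
```

Because the recognizers are declared as data rather than coded per page, the same frame can be applied to any page in the domain.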
Generality & Resiliency of Extraction Ontologies • Generality: assumptions about web pages • Data rich • Narrow domain • Document types • Single-record documents (hard, but doable) • Multiple-record documents (harder) • Records with scattered components (even harder) • Resiliency: declarative • Still works when web pages change • Works for new, unseen pages in the same domain • Scalable, but takes work to declare the extraction ontology
Free-Form Query Interpretation • Parse Free-Form Query (with respect to data extraction ontology) • Select Ontology • Formulate Query Expression • Run Query Over Semantically Annotated Data
Parse Free-Form Query “Find me the price and mileage of all red Nissans – I want a 1996 or newer” • Recognized terms: price, mileage, red, Nissan, 1996 • “or newer”: >= operator
Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Formulate Query Expression • Conjunctive queries and aggregate queries • Mentioned object sets are all of interest. • Values and operator keywords determine conditions. • Color = “red” • Make = “Nissan” • Year >= 1996 (“or newer” is recognized as the >= operator)
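The formulation step can be illustrated with a small Python sketch: keyword matches select the object sets to return, value matches become equality conditions, and an operator keyword upgrades the matching condition. All recognizer patterns, the function names `formulate` and `run`, and the sample car records are assumptions made for the example. Only the first record's price, mileage, and make come from the slides.

```python
import operator
import re

# Hypothetical recognizers for a tiny car-ads ontology.
OBJECT_SET_KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUE_RECOGNIZERS = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATOR_KEYWORDS = [(r"or\s+newer", "Year", ">=")]
OPS = {"=": operator.eq, ">=": operator.ge}


def formulate(query):
    """Return (object sets to report, conjunctive conditions) for a query."""
    requested = [name for name, keyword in OBJECT_SET_KEYWORDS.items()
                 if re.search(keyword, query, re.IGNORECASE)]
    conditions = {}
    for name, pattern in VALUE_RECOGNIZERS.items():
        match = re.search(pattern, query)
        if match:
            value = int(match.group(1)) if name == "Year" else match.group(1)
            conditions[name] = ("=", value)
    # An operator keyword replaces the default "=" on its object set.
    for keyword, name, op in OPERATOR_KEYWORDS:
        if name in conditions and re.search(keyword, query, re.IGNORECASE):
            conditions[name] = (op, conditions[name][1])
    return requested, conditions


def run(requested, conditions, cars):
    """Evaluate the conjunctive query over already-annotated records."""
    rows = []
    for car in cars:
        if all(name in car and OPS[op](car[name], value)
               for name, (op, value) in conditions.items()):
            rows.append({name: car.get(name) for name in requested})
    return rows


# Made-up annotated data; years and colors are invented for the example.
CARS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 4200, "Mileage": 160000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 9000, "Mileage": 80000},
]

query = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
requested, conditions = formulate(query)
print(conditions)
print(run(requested, conditions, CARS))
```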
Formulate Query Expression For Let Where Return
Great! But Problems Still Need Resolution • How do we create extraction ontologies? • Manual creation requires several dozen person-hours • Semi-automatic creation • TISP (Table Interpretation by Sibling Pages) • TANGO (Table ANalysis for Generating Ontologies) • Nested Schemas with Regular Expressions • Synergistic Bootstrapping • Form-based Information Harvesting • How do we scale up? • Practicalities of technology transfer and usage • Millions of queries over zillions of facts for thousands of ontologies
Manual Creation • Library of instance recognizers • Library of lexicons
Automatic Annotation with TISP (Table Interpretation by Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations
Recognize Tables Layout Tables (discard) Data Table Nested Data Tables
Locate Table Labels Examples: • Identification.Gene model(s).Protein • Identification.Gene model(s).2
Locate Table Labels Examples: • Identification.Gene model(s).Gene Model • Identification.Gene model(s).2
Locate Table Values Value
Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
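A label/value association like the one above pairs a column-label path and a row-label path with a cell. The following Python sketch shows one possible representation. The function name `associate` and every cell value except WP:CE28918 are placeholders invented for the illustration.

```python
def associate(path_prefix, header, rows):
    """Map (column label path, row label path) to each cell value.

    Rows are numbered from 1, matching the ".2" row label in the example.
    """
    associations = {}
    for row_number, row in enumerate(rows, start=1):
        for column_label, value in zip(header, row):
            key = (f"{path_prefix}.{column_label}", f"{path_prefix}.{row_number}")
            associations[key] = value
    return associations


# Placeholder cells, except WP:CE28918, which is the value from the example.
header = ["Gene Model", "Protein"]
rows = [
    ["placeholder-model-1", "placeholder-protein-1"],
    ["placeholder-model-2", "WP:CE28918"],
]

table = associate("Identification.Gene model(s)", header, rows)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(key, "=", table[key])
```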
Interpretation Technique: Sibling Page Comparison Almost Same
Interpretation Technique: Sibling Page Comparison Different Same
Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (table for layout: discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
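The core intuition of sibling page comparison (cells that stay the same across sibling pages are likely labels, cells that differ are likely values) can be sketched as below. This is a simplified illustration, not the TISP matching algorithm: it assumes both tables have the same shape, and the 0.5 threshold for a "reasonable" match is an arbitrary assumption.

```python
def classify_cells(table_a, table_b):
    """Cells equal in both sibling tables are labels; differing cells are values."""
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            (labels if cell_a == cell_b else values).append((r, c))
    return labels, values


def match_kind(table_a, table_b, threshold=0.5):
    """Classify a pair of tables as a perfect, reasonable, or failed match."""
    labels, values = classify_cells(table_a, table_b)
    total = len(labels) + len(values)
    if total == 0:
        return "no match"
    if not values:
        return "perfect"  # identical on both pages: layout table, discard
    return "reasonable" if len(labels) / total >= threshold else "no match"


# Made-up sibling tables for illustration.
page_a = [["Name", "alpha"], ["Size", "10"]]
page_b = [["Name", "beta"], ["Size", "12"]]
layout = [["Home", "About"]]

print(match_kind(page_a, page_b), classify_cells(page_a, page_b))
print(match_kind(layout, layout))
```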
Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies) • Recognize and normalize table information • Construct mini-ontologies from tables • Discover inter-ontology mappings • Merge mini-ontologies into a growing ontology
Recognize Table Information
                               Religion
             Population        Albanian          Roman     Shi’a   Sunni
Country      (July 2001 est.)  Orthodox  Muslim  Catholic  Muslim  Muslim  other
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     10%
Construct Mini-Ontology (from the same table)
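One way to picture the step from a normalized table to a mini-ontology is to split the table into the relationship sets it implies: Country has Population, and each Country and Religion pair has a Percent. The Python sketch below does this for the two rows shown. The function name `to_relations`, the data layout, and the assignment of percentages to columns (inferred from the flattened table) are assumptions.

```python
RELIGIONS = ["Albanian Orthodox", "Muslim", "Roman Catholic",
             "Shi'a Muslim", "Sunni Muslim", "other"]

# One tuple per table row: (country, population, percent per religion column).
# None marks an empty cell.
ROWS = [
    ("Afghanistan", 26_813_057, [None, None, None, 15, 84, 1]),
    ("Albania", 3_510_484, [20, 70, 10, None, None, None]),
]


def to_relations(rows, religions):
    """Split a normalized table into two relationship sets."""
    population = {}      # Country -> Population
    religion_share = {}  # (Country, Religion) -> Percent
    for country, count, shares in rows:
        population[country] = count
        for religion, percent in zip(religions, shares):
            if percent is not None:
                religion_share[(country, religion)] = percent
    return population, religion_share


population, religion_share = to_relations(ROWS, RELIGIONS)
print(population)
print(religion_share)
```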
Bootstrapping: Cost-effective and Accurate Extraction • Focus on semi-structured elements first • Bootstrap synergistically • Extract from semi-structured elements • Learn extraction ontologies • Extract from plain text
OCR
First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig,
H. Megorden, D Wynne
Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun,
Mr. Bohnsack.
Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom.
Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny-

QootLaM "leam

Captain Donald "Dude" Bakken ............... Right Half Back
LeRoy "Sonny' Johnson ..................,.... Lcft Half Back
Orley Bakken ...........,...........,.......... Quarter Back
Roger Myhrum ................................... Full Back
Bill "Schnozz" Krohg .............................. Center
Howard "Little Huby" Megorden ................ Right Guard
Royce "Shorty" Norgaard ....................... Left Guard
Eugene "Mad Russian" Easthind ............... Right Tackle
Alvin "Stuben" Hagen ......................... Left Tackle
Richard "Dick" Nienabcr ........................ Right End
James "Oakie" Wogsland .......................... Lcft End

Other lettermen were-
Glenn "Doc" Whaley
Allen "Swede" Enckson
James "Snooky" Mittun
Curtis "Curt" Paulson
Arthur "Art" Vig
Forrest "Forry" Knudson
Robert "Bobby" Roysland
Page 26
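The bootstrapping idea (extract from semi-structured elements first, learn from them, then extract from plain text) can be illustrated on a few lines taken from the OCR sample. In this Python sketch the dot-leader roster lines are treated as the semi-structured part, a surname lexicon is learned from them, and the lexicon is then used to spot names in the free-text caption. The regular expressions and function names are assumptions made for the sketch. It does not attempt to correct OCR errors such as "0. Bakken".

```python
import re

ROSTER = [
    'Captain Donald "Dude" Bakken ............... Right Half Back',
    'Roger Myhrum ................................... Full Back',
    'Bill "Schnozz" Krohg .............................. Center',
]
CAPTION = ("First row, left to right: C. Paulson, G. Whaley, E Eastlund, "
           "B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig,")

# Semi-structured line: optional title, given name, optional nickname,
# surname, dot leader, position.
ROSTER_LINE = re.compile(
    r'^(?:Captain\s+)?(?P<given>\w+)\s+(?:"[^"]*"\s+)?(?P<surname>\w+)'
    r'\s*\.{3,}[.,]*\s*(?P<position>.+)$'
)


def learn_lexicon(lines):
    """Step 1: extract surname -> position from the semi-structured lines."""
    lexicon = {}
    for line in lines:
        match = ROSTER_LINE.match(line)
        if match:
            lexicon[match.group("surname")] = match.group("position")
    return lexicon


def find_mentions(text, lexicon):
    """Step 2: use the learned surnames to find 'initial surname' in plain text."""
    surnames = "|".join(re.escape(name) for name in sorted(lexicon))
    pattern = re.compile(r"\b([A-Z])\.?\s+(" + surnames + r")\b")
    return pattern.findall(text)


lexicon = learn_lexicon(ROSTER)
mentions = find_mentions(CAPTION, lexicon)
print(lexicon)
print(mentions)
```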