WoK: A Web of Knowledge

WoK: A Web of Knowledge. David W. Embley, Brigham Young University, Provo, Utah, USA. A Web of Pages → A Web of Facts: birthdate of my great grandpa Orson; price and mileage of red Nissans, 1990 or newer; location and size of chromosome 17; US states with property crime rates above 1%.

Presentation Transcript


  1. WoK: A Web of Knowledge • David W. Embley • Brigham Young University • Provo, Utah, USA

  2. A Web of Pages → A Web of Facts • Birthdate of my great grandpa Orson • Price and mileage of red Nissans, 1990 or newer • Location and size of chromosome 17 • US states with property crime rates above 1%

  3. Toward a Web of Knowledge • Fundamental questions • What is knowledge? • What are facts? • How does one know? • Philosophy • Ontology • Epistemology • Logic and reasoning

  4. Ontology • Existence → asks: “What exists?” • Concepts, relationships, and constraints with formal foundation

  5. Epistemology • The nature of knowledge → asks: “What is knowledge?” and “How is knowledge acquired?” • Populated conceptual model

  6. Logic and Reasoning • Principles of valid inference – asks: “What is known?” and “What can be inferred?” • For us, it answers: what can be inferred (in a formal sense) from conceptualized data. • Example: Find price and mileage of red Nissans, 1990 or newer

  7. Making this Work → How? • Distill knowledge from the wealth of digital web data • Annotate web pages • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge • Diagram labels: Annotation, Fact

  8. Turning Raw Symbols into Knowledge • Symbols: $ 11,500 117K Nissan CD AC • Data: price(11,500) mileage(117K) make(Nissan) • Conceptualized data: • Car(C123) has Price($11,500) • Car(C123) has Mileage(117,000) • Car(C123) has Make(Nissan) • Car(C123) has Feature(AC) • Knowledge • “Correct” facts • Provenance
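
The progression on this slide (symbols, then data, then conceptualized data with provenance) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RECOGNIZERS` table, its patterns, and the `conceptualize` function are hypothetical and are not taken from the presentation.

```python
import re

# Hypothetical recognizers, one per object set. The patterns are
# illustrative assumptions, not the ones used in the presentation.
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"(\d+)K\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}


def conceptualize(car_id, text):
    """Turn raw symbols into facts about one car, keeping provenance."""
    facts = []
    for object_set, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            value = match.group(1)
            if object_set == "Price":
                value = int(value.replace(",", ""))   # "11,500" -> 11500
            elif object_set == "Mileage":
                value = int(value) * 1000             # "117K" -> 117000
            facts.append({
                "car": car_id,
                "object_set": object_set,
                "value": value,
                "span": match.span(),  # provenance: where the symbol was found
            })
    return facts


facts = conceptualize("C123", "$ 11,500 117K Nissan CD AC")
for fact in facts:
    print(f"Car({fact['car']}) has {fact['object_set']}({fact['value']})")
```

Keeping the character span with each fact is one simple way to record provenance, so that a fact can be traced back to the text it came from.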

  9. Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.

  10. Data Extraction Demo

  11. Semantic Annotation Demo

  12. Free-Form Query Demo

  13. Explanation: How it Works • Extraction Ontologies • Semantic Annotation • Free-Form Query Interpretation

  14. Extraction Ontologies • Object sets • Relationship sets • Participation constraints • Lexical • Non-lexical • Primary object set • Aggregation • Generalization/Specialization

  15. Extraction Ontologies • Data Frame • Internal Representation: float • Values • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})? • Left Context: $ • Key Word Phrase • Key Words: ([Pp]rice)|([Cc]ost)| … • Operators • Operator: > • Key Words: (more\s*than)|(more\s*costly)|…
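
A data frame, as outlined on this slide, bundles an internal representation, an external (textual) representation, context keywords, and operators. The Python sketch below shows one possible shape for such a bundle. The class name, field names, and `parse_price` helper are assumptions made for illustration, and the value pattern is a simplified stand-in rather than the expression shown on the slide; the keyword and operator patterns are the visible parts of the slide's expressions.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DataFrame:
    """Illustrative data frame; the structure is an assumption."""
    name: str
    value_pattern: str                    # external representation
    to_internal: Callable[[str], float]   # converts to internal representation
    keyword_pattern: str                  # context key words
    operators: Dict[str, str] = field(default_factory=dict)  # symbol -> key words

    def recognize(self, text: str) -> List[float]:
        """Return the internal value of every external match in the text."""
        return [self.to_internal(m.group(0))
                for m in re.finditer(self.value_pattern, text)]

    def mentioned_in(self, text: str) -> bool:
        """True if a context keyword for this data frame occurs in the text."""
        return re.search(self.keyword_pattern, text) is not None

    def operator_in(self, text: str) -> Optional[str]:
        """Return the first operator whose key words occur in the text."""
        for symbol, pattern in self.operators.items():
            if re.search(pattern, text):
                return symbol
        return None


def parse_price(raw: str) -> float:
    """Strip the currency sign and thousands separators, then convert."""
    return float(raw.replace("$", "").replace(",", "").strip())


price = DataFrame(
    name="Price",
    value_pattern=r"\$\s*\d{1,3}(?:,\d{3})*(?:\.\d{2})?",  # assumed pattern
    to_internal=parse_price,
    keyword_pattern=r"([Pp]rice)|([Cc]ost)",
    operators={">": r"(more\s*than)|(more\s*costly)"},
)

print(price.recognize("Asking price: $11,500 or best offer"))
print(price.operator_in("anything that costs more than $5,000"))
```

Because the recognizer is declared as data rather than written against one page layout, the same `price` object can be applied to any text in the domain.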

  16. Generality & Resiliency of Extraction Ontologies • Generality: assumptions about web pages • Data rich • Narrow domain • Document types • Single-record documents (hard, but doable) • Multiple-record documents (harder) • Records with scattered components (even harder) • Resiliency: declarative • Still works when web pages change • Works for new, unseen pages in the same domain • Scalable, but takes work to declare the extraction ontology

  17. Semantic Annotation

  18. Free-Form Query Interpretation • Parse Free-Form Query (with respect to data extraction ontology) • Select Ontology • Formulate Query Expression • Run Query Over Semantically Annotated Data

  19. Parse Free-Form Query • “Find me the price and mileage of all red Nissans – I want a 1996 or newer” • Recognized terms: price, mileage, red, Nissan, 1996 • “or newer” is recognized as the >= operator
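
Parsing a free-form query against an extraction ontology can be pictured as running the ontology's recognizers over the query text. The sketch below is a rough illustration only: the `ONTOLOGY` fragment, its patterns, and `parse_query` are hypothetical, and the operator check looks at the whole query instead of the words next to the value, which a real system would presumably refine.

```python
import re

# Hypothetical fragment of a car-ad extraction ontology.
# Every pattern here is an assumption made for illustration.
ONTOLOGY = {
    "Price": {"keywords": r"\bprice\b|\bcost\b"},
    "Mileage": {"keywords": r"\bmileage\b|\bmiles\b"},
    "Color": {"values": r"\b(red|blue|white|black)\b"},
    "Make": {"values": r"\b(Nissan|Toyota|Honda)"},
    "Year": {
        "values": r"\b(19\d{2}|20\d{2})\b",
        "operators": {">=": r"or\s+newer", "<=": r"or\s+older"},
    },
}


def parse_query(query):
    """Return (requested object sets, conditions) recognized in the query."""
    requested, conditions = [], []
    for name, frame in ONTOLOGY.items():
        # A keyword hit means the user is asking about this object set.
        if "keywords" in frame and re.search(frame["keywords"], query, re.IGNORECASE):
            requested.append(name)
        # A value hit becomes a condition; equality unless an operator phrase occurs.
        if "values" in frame:
            for match in re.finditer(frame["values"], query, re.IGNORECASE):
                op = "="
                for symbol, pattern in frame.get("operators", {}).items():
                    if re.search(pattern, query, re.IGNORECASE):
                        op = symbol
                conditions.append((name, op, match.group(1)))
    return requested, conditions


query = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
requested, conditions = parse_query(query)
print(requested)
print(conditions)
```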

  20. Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”

  21. Formulate Query Expression • Conjunctive queries and aggregate queries • Mentioned object sets are all of interest. • Values and operator keywords determine conditions. • Color = “red” • Make = “Nissan” • Year >= 1996 (>= operator)
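
The three conditions above form a conjunctive query: a record qualifies only if every condition holds, and the mentioned object sets (price and mileage) are what gets returned. The sketch below evaluates such a query as a plain Python filter. It is a stand-in for illustration, not the query language the presentation uses, and the `CARS` records are made-up sample data.

```python
import operator

# Map operator symbols to comparison functions.
OPS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

# Made-up annotated records, one dict per extracted car ad.
CARS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 2100, "Mileage": 160000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 6900, "Mileage": 88000},
]


def run_query(records, requested, conditions):
    """Return the requested object sets of records meeting all conditions."""
    results = []
    for record in records:
        if all(name in record and OPS[op](record[name], value)
               for name, op, value in conditions):
            results.append({name: record.get(name) for name in requested})
    return results


conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
matches = run_query(CARS, ["Price", "Mileage"], conditions)
print(matches)
```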

  22. Formulate Query Expression • For • Let • Where • Return

  23. Run Query Over Semantically Annotated Data

  24. Great! But Problems Still Need Resolution • How do we create extraction ontologies? • Manual creation requires several dozen person-hours • Semi-automatic creation • TISP (Table Interpretation by Sibling Pages) • TANGO (Table ANalysis for Generating Ontologies) • Nested Schemas with Regular Expressions • Synergistic Bootstrapping • Form-based Information Harvesting • How do we scale up? • Practicalities of technology transfer and usage • Millions of queries over zillions of facts for thousands of ontologies

  25. Manual Creation

  26. Manual Creation

  27. Manual Creation • Library of instance recognizers • Library of lexicons

  28. Automatic Annotation with TISP (Table Interpretation by Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations

  29. Recognize Tables • Layout Tables (discard) • Data Table • Nested Data Tables

  30. Locate Table Labels • Examples: (1) Identification.Gene model(s).Protein (2) Identification.Gene model(s).2

  31. Locate Table Labels • Examples: (1) Identification.Gene model(s).Gene Model (2) Identification.Gene model(s).2

  32. Locate Table Values Value

  33. Find Label/Value Associations • Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
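
The example above pairs a column label path with a row path and maps the pair to a cell value. A minimal Python sketch of that association step follows. It assumes the header counts as row 1, so data rows are numbered from 2; that numbering, the `associate` function, and the `GENE-MODEL-PLACEHOLDER` cell are assumptions for illustration. Only the `WP:CE28918` value and the label paths come from the slide.

```python
def associate(path, header, rows):
    """Map (column label path, row path) pairs to cell values.

    Assumption: the header counts as row 1, so data rows start at 2.
    """
    associations = {}
    for row_number, row in enumerate(rows, start=2):
        for label, value in zip(header, row):
            key = (f"{path}.{label}", f"{path}.{row_number}")
            associations[key] = value
    return associations


# The first cell is a placeholder; the second value is the one on the slide.
header = ["Gene Model", "Protein"]
rows = [["GENE-MODEL-PLACEHOLDER", "WP:CE28918"]]

associations = associate("Identification.Gene model(s)", header, rows)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(associations[key])
```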

  34. Interpretation Technique: Sibling Page Comparison

  35. Interpretation Technique: Sibling Page Comparison • Same

  36. Interpretation Technique: Sibling Page Comparison • Almost Same

  37. Interpretation Technique: Sibling Page Comparison • Different • Same

  38. Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (table for layout → discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
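
The idea behind sibling page comparison can be sketched as a cell-by-cell comparison of corresponding tables from two pages generated by the same site: cells that stay the same are likely labels, cells that differ are likely values, and a table that matches perfectly carries no data. The Python below is a deliberately simplified illustration. It assumes both tables have the same shape, reduces a "reasonable" match to "at least one differing cell", and uses made-up sample tables; the function names are hypothetical.

```python
def classify_cells(table_a, table_b):
    """Compare same-shaped tables from two sibling pages, cell by cell.

    Returns (label_positions, value_positions) as lists of (row, column).
    """
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            if cell_a == cell_b:
                labels.append((r, c))   # unchanged across pages: likely a label
            else:
                values.append((r, c))   # changes across pages: likely a value
    return labels, values


def interpret(table_a, table_b):
    """Classify a table pair as a layout table or a sibling data table."""
    _, values = classify_cells(table_a, table_b)
    if not values:
        return "layout table (discard)"   # "perfect" match
    return "sibling data table"           # simplified "reasonable" match


# Made-up tables from two sibling pages.
page_a = [["Name", "Size"], ["alpha", "10"]]
page_b = [["Name", "Size"], ["beta", "12"]]
menu_a = [["Home", "Search", "Help"]]
menu_b = [["Home", "Search", "Help"]]

print(classify_cells(page_a, page_b))
print(interpret(page_a, page_b))
print(interpret(menu_a, menu_b))
```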

  39. Generated RDF

  40. WoK Demo (via TISP)

  41. Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies) • Recognize and normalize table information • Construct mini-ontologies from tables • Discover inter-ontology mappings • Merge mini-ontologies into a growing ontology

  42. Recognize Table Information • Columns: Country; Population (July 2001 est.); Religion (Albanian Orthodox, Muslim, Roman Catholic, Shi’a Muslim, Sunni Muslim, other) • Afghanistan: 26,813,057; Shi’a Muslim 15%, Sunni Muslim 84%, other 1% • Albania: 3,510,484; Albanian Orthodox 20%, Muslim 70%, Roman Catholic 10%

  43. Construct Mini-Ontology • Source: the Country / Population (July 2001 est.) / Religion table for Afghanistan and Albania shown on slide 42

  44. Discover Mappings

  45. Merge

  46. Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions) • Build a page-layout, pattern-based annotator • Automate layout recognition based on examples • Auto-generate examples with extraction ontologies • Synergistically run pattern-based annotator & extraction-ontology annotator

  47. PatML Editor • Information Structure Tree • Page Source Text • Browser-Rendered Page

  48. Synergistic Execution • Extraction Ontology • Partially Annotated Document • Conceptual Annotator (ontology-based annotation) • Pattern Generation • Document Layout Patterns • Annotated Document • Structural Annotator (layout-driven annotation)

  49. Form-Based Information Harvesting • Forms • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Instance recognizers • Some pre-existing instance recognizers • Lexicons • Automated extraction ontology creation?
