Knowledge Acquisition on the Web
Growing the amount of available knowledge from within
Christopher Thomas
Overview • Knowledge Representation • GlycO – Complex Carbohydrates domain ontology • Information Extraction • Taxonomy creation (Doozer/Taxonom.com) • Fact Extraction (Doozer++) • Validation
Circle of knowledge on the Web (cycle diagram): Background knowledge → Suggest new propositions → Confirm new knowledge → back into Background knowledge
Goal: Harness the Wisdom of the Crowds to automatically model a domain, verify the model and give the verified knowledge back to the community
Circle of knowledge on the Web – three questions along the cycle: What is knowledge? (Background knowledge) How do we acquire knowledge? (Suggest new propositions) How do we turn propositions/beliefs into knowledge? (Confirm new knowledge)
Background Knowledge
[15] Christopher Thomas and Amit Sheth, "On the Expressiveness of the Languages for the Semantic Web – Making a Case for 'A Little More'," in Fuzzy Logic and the Semantic Web, Elie Sanchez (Ed.), Elsevier, 2006.
[11] Amit Sheth, Cartic Ramakrishnan, and Christopher Thomas, "Semantics for the Semantic Web: The Implicit, the Formal and the Powerful," International Journal on Semantic Web & Information Systems, 1(1), 2005, pp. 1–18.
Different Angles • Social construction • Large scale creation of knowledge vs. • Small communities define their domains • Normative vs. Descriptive • Top-Down vs. Bottom-Up • Formal vs. Informal • Machine-readable vs. human-readable
Community-created knowledge • Descriptive • Bottom-up • Formally less rigid • May contain false information • If a statement in the world is in conflict with the Ontology, both may be wrong or both may be right • Good for broad, shallow domains • Good for human processing and IR tasks
Wikipedia and Linked Open Data • Created by large communities • Constantly growing • Domains within the linked data are not always easily discernible • Contain few axioms and restrictions • Of little value for logic-based evaluation
Formal - Modeling deep domains • Prescriptive / Normative • Top-down • Contains “true knowledge” • If a statement in the world is in conflict with the Ontology, the statement is false • Good for scientific domains • Good for computational reasoning/inference • Usually created by small communities of experts • Usually static, little change is expected
Example: GlycO • Created in collaboration with the Complex Carbohydrate Research Center at the University of Georgia on an NCRR grant. • Deep modeling of glycan structures and metabolic pathways
[6] Christopher Thomas, Amit P. Sheth, and William S. York, "Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain," in Formal Ontology in Information Systems (FOIS 2006).
[5] Satya S. Sahoo, Christopher Thomas, Amit P. Sheth, William York, and Samir Tartir, "Knowledge Modeling and Its Application in Life Sciences: A Tale of Two Ontologies," 15th International World Wide Web Conference (WWW2006).
N-Glycosylation metabolic pathway (pathway diagram: N-glycan_beta_GlcNAc_9, N-glycan_alpha_man_4)
GNT-I attaches GlcNAc at position 2; GNT-V (N-acetyl-glucosaminyl_transferase_V) attaches GlcNAc at position 6.
Reaction: UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
In KEGG identifiers: UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021
Glycan Structures for the ontology • Import structures from heterogeneous databases • Possible connections modeled in the form of GlycoTree • Match structures to archetypes

    b-D-Manp-(1-6)+
                   |
                   b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
                   |
    b-D-Manp-(1-3)+

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235–251.
Interplay of extraction and evaluation • Errors in the source databases are propagated through various derived databases, so comparing multiple sources fails as an error-correction strategy • Even an error rate below 2% can make a database useless for the automatic validation of hypotheses • The ontology contains rules on how carbohydrate structures are known to be composed • By mapping information in databases to the ontology and analyzing how successful the mapping was, we can identify possible errors
Database Verification using GlycO

    b-D-Manp-(1-6)+
                   |
                   a-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
                   |
    b-D-Manp-(1-3)+

a-D-Manp-(1-4) is not part of the identified canonical structure for N-Glycans; hence it is likely that the database entry is incorrect.
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235–251.
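To make the verification step concrete, here is a minimal Python sketch of the idea; the triple representation, the helper check_structure, and the tiny canonical set are illustrative assumptions, not GlycO's actual data model:

    # Toy sketch: verify residues of an imported glycan entry against a
    # canonical set of allowed (residue, linkage, parent) building blocks.
    # CANONICAL is a tiny illustrative fragment, not the full GlycoTree.
    CANONICAL = {
        ("b-D-GlcpNAc", "1-4", "D-GlcNAc"),
        ("b-D-Manp",    "1-4", "b-D-GlcpNAc"),
        ("b-D-Manp",    "1-3", "b-D-Manp"),
        ("b-D-Manp",    "1-6", "b-D-Manp"),
    }

    def check_structure(residues):
        """Return the residue attachments that match no canonical
        building block and are therefore likely database errors."""
        return [r for r in residues if r not in CANONICAL]

    entry = [
        ("a-D-Manp", "1-4", "b-D-GlcpNAc"),  # the faulty residue above
        ("b-D-Manp", "1-3", "b-D-Manp"),
    ]
    print(check_structure(entry))  # -> [('a-D-Manp', '1-4', 'b-D-GlcpNAc')]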
Pathway Steps – Reaction (screenshot: evidence for this reaction from three experiments). Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia.
Summary - GlycO • The accuracy and detail found in ontologies such as GlycO could most likely not be acquired automatically • Only a small community of experts has the depth of knowledge to model such scientific ontologies
Summary - GlycO • However, the automatic population shows that a highly restrictive, expert-created rule set allows for automation or involvement of larger communities. • Frame-based population of knowledge • The formal knowledge encoded in the ontology serves to acquire new knowledge • The circle is completed
Summary Background Knowledge • Large amounts of information and knowledge are available • Some machine readable by default • Others need specific algorithms to extract information • The more available information we can use, the better the extraction of new information will be.
Circle of knowledge on the Web (Part 2): What is knowledge? (Background knowledge) How do we acquire knowledge? (Suggest new propositions) How do we turn propositions into knowledge? (Confirm new knowledge)
Knowledge Acquisition through Model Creation [1][2][3]
[1] Christopher Thomas, Pankaj Mehra, Wenbo Wang, Amit Sheth, Gerhard Weikum and Victor Chana, Automatic Domain Model Creation Using Pattern-Based Fact Extraction, Knoesis Center technical report.
[2] Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around – Improving Linked Open Data through On-Demand Model Creation, WebScience 2010.
[3] Christopher Thomas, Pankaj Mehra, Roger Brooks and Amit Sheth, Growing Fields of Interest – Using an Expand and Reduce Strategy for Domain Model Extraction, Web Intelligence 2008, pp. 496–502.
First create a domain hierarchy Example: a hierarchy for the domain of Human Performance and Cognition
Expert evaluation of facts in the ontology • 1–2: information that is overall incorrect • 3–4: information that is somewhat correct • 5–6: correct general information • 7–9: correct information not commonly known
Step 1 • Domain hierarchy creation • Input: seed terms, e.g. related to Human Performance and Cognition • Hierarchy is automatically carved from articles and categories on Wikipedia
Overview - conceptual • Expand and Reduce approach • Start with 'high recall' methods • Exploration – full-text search • Exploitation – node similarity method • Category growth • End with 'high precision' methods • Apply restrictions on the concepts found • Remove unwanted terms and categories
Expand - conceptually (diagram) • Graph-based expansion • Full-text search on article texts • Delete results with low confidence score (a sketch of the expand-and-reduce loop follows)
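A minimal sketch of the expand-and-reduce loop, assuming hypothetical helpers neighbors(node) (category/link adjacency, e.g. in Wikipedia) and score(candidate, seeds) (e.g. full-text or node similarity); this illustrates the strategy, not the actual Doozer implementation:

    def expand_and_reduce(seeds, neighbors, score, hops=2, threshold=0.5):
        """Grow a candidate set along graph edges (high recall),
        then prune it by similarity to the seeds (high precision)."""
        frontier, candidates = set(seeds), set(seeds)
        for _ in range(hops):
            # Expand: follow category/link edges outward from the frontier.
            frontier = {n for c in frontier for n in neighbors(c)} - candidates
            candidates |= frontier
        # Reduce: keep only candidates that score well against the seeds.
        return {c for c in candidates if score(c, seeds) >= threshold}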
Step 2: Pattern-Based Relationship Extraction Extracting meaningful relationships by macro-reading free text
Extracting from plain text or hypertext • Informal, human-readable presentation of information • Vast amounts of information available • Web • Scientific publications • Encyclopedias • Need sophisticated algorithms to extract information
Pattern-based Fact Extraction • Learn textual patterns that express known relationship types • Search the text corpus for occurrences of known entities (e.g. from domain hierarchy) • Semi-open • Types are known and limited • Types are automatically expanded when LOD grows • Vector-Space Model • Probabilistic representation
Training • Relationship data in the UMLS Metathesaurus or the Wikipedia Infobox-data provide a large set of facts in RDF Triple format • Limited set of relationships that can be arranged in a schema • Semi-open • Types are known and limited • Types are automatically expanded when LOD grows
Training procedure • Iterate through all facts (S->P->O triples) • Find evidence for the fact in a corpus • Wikipedia, WWW, PubMed or any other collection • If triple subject and triple object occur in close proximity in text, add the pattern in-between to the learned patterns • Combined evidence from many different patterns increases the certainty of a relationship between the entities
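A minimal sketch of this training loop, assuming a plain list of sentences as the corpus; the function name and the 60-character proximity window are illustrative choices:

    import re

    def learn_patterns(triples, sentences, max_gap=60):
        """For each known (s, p, o) fact, record the text between close
        subject/object mentions as a candidate pattern for relation p."""
        patterns = {}  # relation -> {pattern string: occurrence count}
        for s, p, o in triples:
            for sent in sentences:
                m = re.search(re.escape(s) + "(.{1,%d}?)" % max_gap + re.escape(o), sent)
                if m:
                    pat = "<Subject>" + m.group(1) + "<Object>"
                    patterns.setdefault(p, {})
                    patterns[p][pat] = patterns[p].get(pat, 0) + 1
        return patterns

    facts = [("Canberra", "capital_of", "Australia")]
    corpus = ["Canberra, capital of the Commonwealth of Australia, ..."]
    print(learn_patterns(facts, corpus))
    # {'capital_of': {'<Subject>, capital of the Commonwealth of <Object>': 1}}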
Overview – initial computations (pipeline diagram): a Fact Collection and a Text Corpus feed the matrix computations, producing the CP2P and R2P matrices and their modifications (CP2Pmod, R2Pmod); Entropy, SVD/LSI and Pertinence computations follow.
Training procedure cont'd – example occurrences and the patterns learned from them (each with count 1):
Canberra, the Australian capital city → <Subject>, the <Object> capital city
Canberra, capital of the Commonwealth of Australia → <Subject>, capital of the Commonwealth of <Object>
Canberra, the Australian capital → <Subject>, the <Object> capital
Relationship Patterns (diagram): extracted patterns are grouped via synonyms and then generalized.
Advanced Computations • LSI to determine relationship similarities • Reduces sparsity in the matrix and makes relationship rows more comparable • Allows better use of the pertinence computation • Entropy • Increases weights for more unique patterns • Pertinence • Smoothing of pattern occurrence frequencies (a sketch of the first two steps follows)
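A minimal numpy sketch of these two refinements, assuming M is a relation-by-pattern count matrix; the exponential entropy discount is an illustrative choice, not necessarily the paper's exact weighting:

    import numpy as np

    def entropy_weight(M):
        """Up-weight patterns that co-occur with few relations: columns
        with low entropy over relations are the more 'unique' patterns."""
        P = M / np.maximum(M.sum(axis=0, keepdims=True), 1e-12)
        with np.errstate(divide="ignore", invalid="ignore"):
            H = -np.where(P > 0, P * np.log(P), 0.0).sum(axis=0)
        return M * np.exp(-H)  # low-entropy (unique) patterns keep weight

    def lsi(M, k=50):
        """Rank-k SVD reconstruction: reduces sparsity and makes
        relationship rows comparable to one another."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        k = min(k, len(s))
        return (U[:, :k] * s[:k]) @ Vt[:k, :]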
Pertinence for Relations • Looking at fact extraction as a classification of concept pairs into classes of relations • Class boundaries are not clear-cut, e.g. has_physical_part vs. has_part • Don't punish the occurrence of the same pattern with relationship types that are similar (one possible realization is sketched below)
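One way to realize this idea (an illustrative reading, not necessarily the paper's exact formula): discount a pattern's off-class counts by how dissimilar the competing relation is, so that occurrences with near-synonymous relations such as has_physical_part/has_part cost nothing:

    import numpy as np

    def pertinence(M, sim):
        """M[r, p]: count of pattern p with relation r.
        sim[r, r']: relation similarity (e.g. cosine of LSI-smoothed rows).
        A pattern is penalized only for occurring with relations that
        are dissimilar to r; sim[r, r] = 1 excludes the own relation."""
        competing = (1.0 - sim) @ M           # dissimilarity-weighted counts
        return M - competing / max(M.shape[0] - 1, 1)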