530 likes | 645 Views
Wikitology: A Wikipedia Derived Knowledge Base. Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009. Outline. Introduction and Motivation Related Work Proposed Work Timeline Work Progress Conclusion. Introduction. Wikipedia Encyclopedia Developed Collaboratively
E N D
Wikitology: A Wikipedia Derived Knowledge Base Zareen Syed Advisor: Dr. Tim Finin February 6th, 2009
Outline • Introduction and Motivation • Related Work • Proposed Work • Timeline • Work Progress • Conclusion
Introduction • Wikipedia • Encyclopedia • Developed Collaboratively • Freely available online • Millions of articles • English Wikipedia (2,723,767 articles) • Multiple Languages (More than 260) • Structured and un-structured content
Introduction • Wikipedia Content and Organization • Article Text • Categories and Category Hierarchy • Inter-article Links • Info-boxes • Disambiguation Pages • Redirection Pages • Talk Pages • History Pages • Meta-data
Motivation Challenges • Human Understandable Content (not machine readable) • How to make it more structured and organized to improve machine readability • How to automatically exploit the knowledge in Wikipedia to solve some real world problems
Thesis Statement • We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as: • Concept Prediction • Information Retrieval • Information Extraction
Proposed Contributions • Developing a Novel Hybrid Knowledge Base composed of structured, semi-structured and un-structured information extracted from Wikipedia and other related sources • Developing Novel Application Specific Algorithms for exploiting the hybrid knowledge base • Task Based Evaluation of the system on common use-cases such as Concept Prediction, Information Retrieval and Information Extraction
Outline • Introduction and Motivation • Related Work • Proposed Work • Timeline • Work Progress • Conclusion
Related Work • Information Extraction • Relation extraction [35] • Co-reference resolution [25] • Named Entity Classification [52] • Natural Language Processing • Automatic word sense disambiguation [27] • Searching synonyms [28]
Related Work • Information Retrieval • Text categorization [24] • Computing semantic relatedness [30,31,32] • Predicting document topics [26] • Search Engine [69] • Semantic Web • DBPedia [46] • Semantic MediaWiki [46] • Linked Open Data Project [23] • Freebase [22]
Outline • Introduction and Motivation • Related Work • Proposed Work • Timeline • Work Progress
Proposed Work • Refining, Enriching and Exploiting Structured Content in Wikipedia • Integrating other related knowledge sources • Developing application specific algorithms • Developing a dynamic and scalable architecture
Issues • Single document in too many categories: • George W. Bush is included in about 30 categories • Links between articles belonging to very different categories • John F. Kennedy has a link for “coincidence theory” which belongs to the Mathematical Analysis/ Topology/Fixed Points. • Number of articles with in a category: • Some categories are under represented where as others have many articles • Administrative Categories • For eg: • Clean up from Sep 2006 • Articles with unsourced statements • Links to words in an article • For eg. If the word United States appears in the document then that word might be linked to the page on “United States”
Issues • Category Hierarchy: • Multiple Parents (Thesaurus) • Noisy • “Animals” category defined in the sub-tree rooted at “People” • loose subsumption • Geography-> Geography by place -> Regions-> Regions of Asia->Middle East -> Dances of Middle East • Events->Events by year->Lists of leaders by year
Refining, Enriching and Exploiting Structured Content in Wikipedia Category Hierarchy • Filtering out Administrative Categories • Algorithms for Selecting and Ranking Categories • Inferring and Labeling Semantic Relations between Categories • Refining Subsumption (Taxonomy) • Instance-of Relation • Using Information in Wikipedia Lists_of_Topics • Using Specific Administrative Categories Done
Refining, Enriching and Exploiting Structured Content in Wikipedia Inter-Article Links: • Problem: Don’t imply semantic relatedness • Links to locations, term definitions, dates, entities • Possible solutions: • Classifying Link Types • Introducing Link Weights Done
Refining, Enriching and Exploiting Structured Content in Wikipedia Redirection Pages • Disambiguation pages
Proposed Work • Exploring • Other structured content • Talk pages, user pages, history pages and meta-data • Other structured resources • Integrating structured information from other sources like DBpedia and Freebase in Wikitology • How and When to employ reasoning over the RDF triples
Proposed Work • Developing Novel Application Specific Algorithms on top of the hybrid Wikitology Knowledge Base for applications such as • Concept Prediction • Information Retrieval • Information Extraction
Proposed Work • Evaluation • Main Approaches to Evaluating Ontologies • Gold Standard Evaluation (Comparison to an existing Ontology) • Criteria based Evaluation (By humans) • Task based Evaluation (Application based) • Comparison with Source of Data (Data driven) • Using a Reasoning Engine • Our Approach to Evaluation • Task based Evaluation (Application based)
Wikitology Overview Articles IR Index Application Specific Algorithms Category Links Hierarchical Graph Wikitology Code Application Specific Algorithms Page Links Graph RDF Reasoner Application Specific Algorithms Triple Store Relational Database
Outline • Introduction and Motivation • Related Work • Proposed Work • Time Line • Work Progress • Conclusion
Outline • Introduction and Motivation • Related Work • Proposed Work • Time Line • Work Progress • Conclusion
Work Done • Case Study 1: • Concept Prediction • Case Study 2: • Document Expansion for Information Retrieval • Case Study 3: • Named Entity Classification • Case Study 4: • Co-reference Resolution • Case Study 5: • Concept Based Features for Information Retrieval In Progress
Case Study 1 Concept Prediction [2] Problem: Predict the individual document topics as well as concepts common to a set of documents Approach: • Hybrid Knowledge base: Wikitology 1.0 • Algorithms for selecting and aggregating terms
Wikitology 1.0 • Wikipedia as an Ontology • Each article is a concept in the ontology • Terms linked via Wikipedia’s category system and inter-article links • It’s a consensus ontology created, kept current and maintained by a diverse community • Overall content quality is high • Terms have unique IDs (URLs) and are “self describing” for people
Wikitology 1.0 • Structured Data • Specialized Concepts (article titles) • Generalized Concepts (category titles) • Inter-category and Inter-article links as relations between concepts • Article-Category links as relations between specialized and generalized concepts • Un-Structured Data • Article Text ( A way to map ontology terms to free text) • Algorithms • Algorithms to select, rank and aggregate concepts using the hybrid knowledge base
Method 1 Using Wikipedia Article Text and Categories to Predict Concepts Input Querydoc(s) similar to 0.8 Similar Wikipedia Articles 0.2 0.1 Cosine similarity 0.2 0.3
Method 1 Using Wikipedia Article Text and Categories to Predict Concepts Wikipedia Category Graph Input Querydoc(s) similar to 0.8 Similar Wikipedia Articles 0.2 0.1 Cosine similarity 0.2 0.3
Method 1 Using Wikipedia Article Text and Categories to Predict Concepts Output Rank Categories Links Cosine similarity Wikipedia Category Graph 0.9 3 Input Querydoc(s) similar to 0.8 Similar Wikipedia Articles 0.2 0.1 Cosine similarity 0.2 0.3
Method 2 Using Spreading Activation on Category Links Graph to get Aggregated Concepts Spreading Activation Output Ranked Concepts based on Final Activation Score Wikipedia Category Graph Input Querydoc(s) Similar to 0.8 0.2 0.1 Input Function Cosine similarity 0.2 0.3 Output Function
Method 3 Using Spreading Activation on Article Links Graph Input Threshold: Ignore Spreading Activation to articles with less than 0.4 Cosine similarity score Querydoc(s) Similar To Edge Weights: Cosine similarity between linkedarticles Wikipedia Article Links Graph Spreading Activation Node Input Function Output Node Output Function Ranked Concepts based on Final Activation Score
Wikitology 1.0 • The system was evaluated by predicting the categories and article links of existing Wikipedia articles and comparing with the ground truth • It was observed that Wikitology 1.0 system was able to predict the document topics and common concepts with high accuracy when the article concepts were well represented within Wikipedia
Case Study 2 Document Expansion with Wikipedia Derived Ontology Terms [21]* Preliminary work with TREC documents Doc: FT921-4598 (3/9/92) ... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself ... Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recusion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence IR Effectiveness Using Wikipedia Concepts * In Collaboration with Paul McNamee, John Hopkins University Applied Physics Laboratory * In Collaboration with Paul McNamee, John Hopkins University Applied Physics Laboratory
Case Study 3Named Entity Classification • Semi-automated generation of Training data • Persons, Locations and Events • Experimenting with different feature sets • Inter-article link labeling Results showing accuracy obtained using different feature sets
Case Study 4 Cross Document Entity Co-reference Resolution [21]* Problem: • To determine whether various named people, organizations or relations from different documents refer to the same object in the world. • For example, does the “Condoleezza Rice” mentioned in one document refer to the same person as the “Secretary Rice” from another? * In Collaboration with John Hopkins University Human Language Technology Center of Excellence
<DOC> <DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO> <TEXT> Webb Hubbell PER Individual NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell" NOM: "Mr . " "friend" "income" PRO: "he" "him" "his" , . abc's accountant after again ago all alleges alone also and arranged attorney avoid been b efore being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough eva sion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictme nt inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady la te law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutor s questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years </TEXT> </DOC> Entity documents capture information about entities extracted from documents, including mention strings, type and subtype, and text surrounding the mentions. Entity Document (EDOC)
Wikitology 2.0 • Enhancements • Structured Data • Specialized Concepts (article titles) • Generalized Concepts (category titles) • Inter-category and Inter-article links as relations between concepts • Article-Category links as relations between specialized and generalized concepts • YAGO types (to identify entity type) • Table with Disambiguation set (to identify highly confused entities) • Aliases using Redirect pages • Un-Structured Data • Article Text • Redirect titles (added to article text)
Wikitology 2.0 Data Structures • Lucene Index • Concept Title + Redirected Titles (field) • Article Text + Redirected Titles (field) • RDF field with Entity Type (YAGO type) • Graphs • Category links graph • Article links graph • Article-Category links • Tables • Disambiguation Set derived from disambiguation pages
Wikitology 2.0 • Custom Query Front end • The EDOC’s name mention strings • Wikitology’s title field • slightly higher weight to the longest mention, i.e., “Webb Hubbell” • The EDOC type • RDF Field: Yago Type • Name mention strings + Contextual text • Text (Wikitology Article Contents)
Article Vector for ABC19980430.1830.0091.LDC2000T44-E2 • 1.0000 Webster_Hubbell • 0.3794 Hubbell_Trading_Post_National_Historic_Site • 0.3770 United_States_v._Hubbell • 0.2263 Hubbell_Center • 0.2221 Whitewater_controversy • Category Vector for ABC19980430.1830.0091.LDC2000T44-E2 • 0.2037 Clinton_administration_controversies • 0.2037 American_political_scandals • 0.2009 Living_people • 0.1667 1949_births • 0.1667 People_from_Arkansas • 0.1667 Arkansas_politicians • 0.1667 American_tax_evaders • 0.1667 Arkansas_lawyers Each entity document is tagged by Wikitology, producing vectors of article and category tags. Note the clear match with a known person in Wikipedia. Wikitology Features
Features Derived from Wikitology 2.0 Name Range Type Description APL20WAS {0,1} sim 1 if the top article tags for the two entities are identical, 0 otherwise APL21WCS {0,1} sim 1 if the top category tags for the two entities are identical, 0 otherwise APL22WAM [0..1] sim The cosine similarity of the medium length article vectors (N=5) for the two entities APL23WcM [0..1] sim The cosine similarity of the medium length category vectors (N=4) for the two entities APL24WAL [0..1] sim The cosine similarity of the long length article vectors (N=8) for the two entities APL31WAS2 [0..1] sim match of entities top Wikitology article tag, weighted by avg(score1,score2) APL32WCS2 [0..1] sim match of entities top Wikitology category tag, weighted by avg(score1,score2) APL26WDP {0,1} dissim 1 if both entities are of type PER and their top article tags are different, 0 otherwise APL27WDD {0,1} dissim 1 if the two top article tags are members of the same disambiguation set, 0 otherwise APL28WDO {0,1} dissim 1 if both entities are of type ORG and their top article tags are different, 0 otherwise APL29WDP2 [0..1] dissim Match both entities are of type PER and their top article tags are different, weighted by 1- abs(score1-score2), 0 otherwise APL30WDP2 [0..1] dissim Match if both entities are of type ORG and their top article matches are different organizations, weighted by 1-abs(score1-score2), 0 otherwise Twelve features were computed for each pair of entities using Wikitology, seven aimed at measuring their similarity and five for measuring their dissimilarity.
Evaluation Evaluation results for cross-document entity co-reference task using Wikitology features
Case Study 5Feature Generation to Improve Information Retrieval Performance* • Incorporating Generalized Concept Features in MORAG [69] search engine * Work being done during internship at RiverGlass Company
MORAG Search Engine • Concept features generated using Wikipedia (ESA) • Feature Selection using pseudo-relevance feedback • Merged Ranking of Concept scores and BOW scores Incorporating Wikitology based features in MORAG search engine
Outline • Introduction and Motivation • Related Work • Proposed Work • Timeline • Work Progress • Conclusion
Thesis Statement • We can exploit Wikipedia and other related knowledge sources to automatically create knowledge about the world supporting a set of common use cases such as: • Concept Prediction • Information Retrieval • Information Extraction
Proposed Contributions • Developing a Novel Hybrid Knowledge base composed of structured and un-structured information extracted from Wikipedia and other related sources • Wikitology 1.0 • Wikitology 2.0
Proposed Contributions • Developing Novel Application Specific Algorithmsfor exploiting the hybrid knowledge base • Methods for Concept Prediction • Ranking methods and Spreading Activation • Co-reference Resolution • Novel Entity representation and Hybrid Querying • Information Retrieval • Document Expansion, Generalized Concept Features augmentation