
Wikitology: Wikipedia as an Ontology


Presentation Transcript


  1. Wikitology: Wikipedia as an Ontology. Tim Finin, UMBC; Zareen Syed and Anupam Joshi, University of Maryland, Baltimore County; James Mayfield, Paul McNamee and Christine Piatko, JHU Human Language Technology Center of Excellence

  2. Overview • Introduction • Wikipedia as an ontology • Applications • Discussion • Conclusion

  3. Wikis and Knowledge • Wikis are a great way to collaborate on knowledge encoding • Wikipedia is an archetype for this, but there are many examples • Ongoing research is exploring how to integrate this with structured knowledge: DBpedia, Semantic MediaWiki, Freebase, etc. • I'll describe an approach we've taken and experiments in using it • We came at this from an IR/HLT perspective

  4. Wikipedia is the new black • Wikipedia is a great resource for knowledge • It's being widely used as a source of knowledge for many systems, such as…

  5. Wikipedia data in RDF

  6. Populating Freebase KB

  7. Populating Powerset's KB

  8. AskWiki uses Wikipedia for QA

  9. With sometimes surprising results

  10. TrueKnowledge mines Wikipedia

  11. Wikipedia pages as tags

  12. Wikitology • We are exploring an approach to deriving an ontology from Wikipedia that is useful in a variety of language processing tasks

  13. Our original problem (2006) • Problem: describe what an analyst has been working on to support collaboration • Idea: track documents she reads and map these to terms in an ontology, then aggregate to produce a short list of topics • Approach: use Wikipedia articles as ontology terms, use document-article similarity for the mapping, and spreading activation for aggregation

  14. What's a document about? Two common approaches: (1) select words and phrases, using TF-IDF, that characterize the document; (2) map the document to a list of terms from a controlled vocabulary or ontology. (1) is flexible and does not require creating and maintaining an ontology; (2) can tie documents to a rich knowledge base.
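To make approach (1) concrete, here is a minimal sketch of TF-IDF term selection using scikit-learn; the background corpus and the document are illustrative placeholders, not data from these experiments.

```python
# Rank a document's words by TF-IDF against a small background corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

background = [
    "wikipedia is a free online encyclopedia",
    "an ontology defines concepts and relations between them",
    "information retrieval systems rank documents for a query",
]
doc = "wikipedia category links form a graph over encyclopedia concepts"

vectorizer = TfidfVectorizer()
vectorizer.fit(background + [doc])    # build the vocabulary and IDF weights
scores = vectorizer.transform([doc]).toarray()[0]
terms = vectorizer.get_feature_names_out()

# The top-scoring terms characterize the document without any ontology.
for term, score in sorted(zip(terms, scores), key=lambda p: -p[1])[:5]:
    print(f"{term}\t{score:.3f}")
```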

  15. Wikitology! • Using Wikipedia as an ontology offers the best of both approaches • Each article (~3M) is a concept in the ontology • Terms are linked via Wikipedia's category system (~200K categories) and inter-article links • Lots of structured and semi-structured data • It's a consensus ontology created and maintained by a diverse community • Broad coverage, multilingual, very current • Overall content quality is high

  16. Wikitology features • Terms have unique IDs (URLs) and are “self describing” for people • Underlying graphs provide structure and associations: categories, article links, disambiguation, aliases (redirects), … • Article history contains useful metadata for trust, provenance, controversy, … • External sources provide more info (e.g., Google's PageRank) • Annotated with structured data from DBpedia, Freebase, Geonames & LOD

  17. Problems as an Ontology • Treating Wikipedia as an ontology reveals many problems • Uncategorized and miscategorized articles • A single article in too many categories: George W. Bush is included in about 30 categories • Links between articles belonging to very different categories: John F. Kennedy has a link to “coincidence theory”, which belongs under Mathematical Analysis/Topology/Fixed Points

  18. Problems as an Ontology • Article links in text are not “typed” • Uneven category articulation: some categories are underrepresented whereas others have many articles • Administrative categories, e.g., “Clean up from Sep 2006” and “Articles with unsourced statements” • Over-linking, e.g., a mention of United States linked to the page United_states, or mentions of 1949 linked to the year 1949

  19. Problems as an Ontology • Wikipedia's infobox templates have great potential but also several problems • Multiple templates for the same class • Multiple attribute names for the same property, e.g., six attributes for a person's birth date • Attributes lack domains or datatypes, e.g., a value can be a string or a link

  20. Wikitology 1, 2, 3 • We've addressed some of these problems in developing Wikitology • The development has been driven by several use cases and applications

  21. Wikitology Use Cases • Identifying user context in a collaboration system from documents viewed (2006) • Improving IR accuracy by adding Wikitology tags to documents (2007) • Cross-document coreference resolution for named entities in text (2008) • Knowledge base population from text (2009) • Improving Web search engines by tagging documents and queries (2009)

  22. Wikitology 1.0 (2007) • Structured data: specialized concepts (article titles); generalized concepts (category titles); inter-category and inter-article links as relations between concepts; article-category links as relations between specialized and generalized concepts • Unstructured data: article text • Algorithms to remove useless categories and links, infer categories, and select, rank and aggregate concepts using the hybrid knowledge base [Diagram: text, graphs, human input & editing]

  23. Experiments • Goal: given one or more documents, compute a ranked list of the top Wikipedia articles and/or categories that describe it • Basic metric: document similarity between a Wikipedia article and the input document(s) • Variations: role of categories, eliminating uninteresting articles, use of spreading activation, using similarity scores, weighing links, number of spreading activation pulses, individual or set of query documents, etc.

  24. Method 1: using Wikipedia article text & categories to predict concepts [Diagram: input query document(s) matched to similar Wikipedia articles by cosine similarity, with example scores of 0.8, 0.2, 0.1 and 0.2]

  25. Method 1: using Wikipedia article text & categories to predict concepts [Diagram: as above, with the matched articles connected into the Wikipedia category graph]

  26. Method 1: using Wikipedia article text & categories to predict concepts [Diagram: as above, with article similarity scores aggregated over category links to produce ranked output categories]
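A minimal sketch of Method 1 under simplifying assumptions: articles are plain strings, the article-to-category links are given as a dictionary, and TF-IDF cosine similarity stands in for the Lucene index used in the actual system. The article texts and category links below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {  # hypothetical article title -> article text
    "Crop_rotation": "growing different crops in sequence on the same land",
    "Green_manure": "cover crops plowed under to enrich the soil with nutrients",
    "Topology": "the mathematical study of properties preserved under deformation",
}
categories = {  # hypothetical article -> category links
    "Crop_rotation": ["Agriculture"],
    "Green_manure": ["Agriculture", "Soil_science"],
    "Topology": ["Mathematics"],
}
query = "rotating legume crops improves soil fertility on farmland"

# Score each article by cosine similarity to the query document.
vec = TfidfVectorizer().fit(list(articles.values()) + [query])
sims = cosine_similarity(vec.transform([query]),
                         vec.transform(list(articles.values())))[0]

# Aggregate article scores onto their categories to rank concepts.
cat_scores = {}
for title, sim in zip(articles, sims):
    for cat in categories[title]:
        cat_scores[cat] = cat_scores.get(cat, 0.0) + sim

print(sorted(zip(articles, sims), key=lambda p: -p[1]))   # ranked articles
print(sorted(cat_scores.items(), key=lambda p: -p[1]))    # ranked categories
```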

  27. Method 2: using spreading activation on the category link graph to get aggregated concepts [Diagram: cosine similarity scores of articles matching the input query document(s) serve as the input function; activation spreads over the Wikipedia category graph via node input and output functions, and concepts are ranked by final activation score]

  28. Method 3: using spreading activation on the article link graph • Threshold: ignore spreading activation from articles with a cosine similarity score below 0.4 • Edge weights: cosine similarity between linked articles [Diagram: activation spreads over the Wikipedia article links graph via node input and output functions; concepts are ranked by final activation score]
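A minimal sketch of spreading activation as used in Methods 2 and 3; the graph, decay constant, and scores below are illustrative. Following Method 3, nodes whose activation falls below the 0.4 threshold do not spread, and edge weights would be cosine similarities between linked articles.

```python
def spread(graph, activation, pulses=1, decay=0.5, threshold=0.4):
    """graph: node -> [(neighbor, edge_weight)]; activation: node -> score."""
    for _ in range(pulses):
        nxt = dict(activation)
        for node, score in activation.items():
            if score < threshold:          # ignore weakly activated nodes
                continue
            for nbr, weight in graph.get(node, []):
                nxt[nbr] = nxt.get(nbr, 0.0) + decay * weight * score
        activation = nxt
    # Rank concepts by final activation score.
    return sorted(activation.items(), key=lambda p: -p[1])

graph = {"A": [("B", 0.9), ("C", 0.4)], "B": [("C", 0.8)]}
print(spread(graph, {"A": 0.8, "B": 0.5}, pulses=2))
```

As the evaluation slides below report, two pulses worked best for category prediction and one pulse for article prediction.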

  29. Evaluation • An initial informal evaluation compared results against our own judgments; it was used to select promising combinations of ideas and parameter settings • Formal evaluation: selected Wikipedia articles for testing and removed them from the Lucene index and graphs; for each, used the methods to predict categories and linked articles; compared the results, using precision and recall, to the known categories and linked articles

  30. Example prediction for a set of test documents • Test document titles in the set (Wikipedia articles): Crop_rotation, Permaculture, Beneficial_insects, Neem, Lady_Bird, Principles_of_Organic_Agriculture, Rhizobia, Biointensive, Intercropping, Green_manure [Figure annotation: one predicted concept is not in the category hierarchy]

  31. Category prediction evaluation • Spreading activation with two pulses worked best • Considering only articles with similarity > 0.5 was a good threshold

  32. Article prediction evaluation • Spreading activation with one pulse worked best • Considering only articles with similarity > 0.5 was a good threshold

  33. Improving IR performance (2008-09) • Improving IR performance for a collection by adding semantic terms to documents • Queries with blind relevance feedback may benefit from the semantic terms • Initial evaluation with the NIST TREC 2005 collection in collaboration with Paul McNamee, JHU HLTCOE • Ongoing: integration into the RiverGlass MORAG search engine

  34. Improving IR performance • Doc: FT921-4598 (3/9/92) ... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself ... • Added Wikitology terms: Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recursion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence

  35. Evaluation • Mixed results on the NIST evaluation • Slightly worse on mean average precision • Slightly better for precision at 10

  36. MORAG Search Engine • Concept features generated using Wikipedia via ESA (Explicit Semantic Analysis) • Feature selection using pseudo-relevance feedback • Merged ranking of concept scores and BOW (bag-of-words) scores
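One simple way to merge the two rankings is linear interpolation of normalized scores; the weight and the score tables below are illustrative assumptions, not MORAG's actual fusion method.

```python
def merge(bow_scores, concept_scores, lam=0.6):
    # lam weights the bag-of-words score against the Wikipedia concept score.
    docs = set(bow_scores) | set(concept_scores)
    merged = {d: lam * bow_scores.get(d, 0.0)
                 + (1 - lam) * concept_scores.get(d, 0.0) for d in docs}
    return sorted(merged.items(), key=lambda p: -p[1])

# Hypothetical per-document retrieval scores from the two rankers.
print(merge({"d1": 0.9, "d2": 0.4}, {"d1": 0.2, "d2": 0.8, "d3": 0.5}))
```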

  37. Information Extraction • Problem: resolve entities found by a named entity recognition system across documents to KB entries • ACE 2008, the NIST-run Automatic Content Extraction evaluation, focused on this task • We were part of a team led by the JHU Human Language Technology Center of Excellence • Use Wikitology to map document entities to KB entities

  38. Wikitology 2.0 (2008) [Diagram: hybrid knowledge base of text, graphs, and RDF, drawing on the Freebase KB, Yago, WordNet, databases, and human input & editing]

  39. Global Coreference Task • Start with entities and relations produced by a within-document extraction system • Produce 'global' clusters for PERSON and ORGANIZATION entities • Only evaluate over instances of entities with a name • Challenges: very limited development data (ACE released 49 files in English, none in Arabic; MITRE released an English ACE05 corpus, but annotation is noisy and the data has few ambiguous entities); within-document mistakes are propagated to the cross-document system; the 10K-document evaluation set required work on the scalability of our approaches • Examples: Abu Abbas aka Muhammad Zaydan aka Muhammad Abbas; William Wallace (living British Lord) vs. William Wallace (of Braveheart fame)

  40. Global Coreference Resolution Approach • SERIF for intra-document processing • Entity filtering: collect all pairs of SERIF entities, then filter entity pairs with heuristics (e.g., string similarity of mentions) to get a high-recall set of pairs significantly smaller than the n² possible pairs • Feature generation • Training: train an SVM to identify coreferent pairs • Entity clustering: cluster predicted pairs; each connected component forms a global entity • Relation identification: every pair of SERIF-identified relations whose types are identical and whose endpoints are coreferent is deemed coreferent [Diagram example: document entities E1 “Abu Abbas was arrested …”, E2 “Palestinian President Mahmoud Abbas …”, E3 “… election of Abu Mazen”, E4 “… president George Bush”; filtered pairs E1-E2 (shared word), E1-E3 (shared word), E2-E3 (known alias); features such as character overlap (E1-E2: 5, E1-E3: 3) and whether a pair maps to distinct Freebase entities; resulting entity clusters such as Abu Mazen / Mahmoud Abbas (Palestinian leader) and Abu Abbas / Muhammed Abbas (convicted terrorist)]
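A minimal sketch of the entity-filtering step: cheap heuristics keep only plausible pairs, so feature generation and SVM scoring run on far fewer than the n² possible pairs. The alias table and entity names below are illustrative.

```python
from itertools import combinations

KNOWN_ALIASES = {("abu mazen", "mahmoud abbas")}   # toy alias table

def shares_word(a, b):
    # High-recall heuristic: the two names share at least one word.
    return bool(set(a.lower().split()) & set(b.lower().split()))

def known_alias(a, b):
    return tuple(sorted((a.lower(), b.lower()))) in KNOWN_ALIASES

entities = ["Abu Abbas", "Mahmoud Abbas", "Abu Mazen", "George Bush"]
candidates = [(a, b) for a, b in combinations(entities, 2)
              if shares_word(a, b) or known_alias(a, b)]
print(candidates)   # only these pairs go on to feature generation
```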

  41. Wikitology tagging • Using SERIF's output, we produced an entity document for each entity, including the entity's name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions • We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity • We used the vectors to compute features measuring entity-pair similarity/dissimilarity

  42. Entity Document & Tags Wikitology article tag vector Webster_Hubbell 1.000 Hubbell_Trading_Post National Historic Site 0.379 United_States_v._Hubbell 0.377 Hubbell_Center 0.226 Whitewater_controversy 0.222 Wikitology category tag vector Clinton_administration_controversies 0.204 American_political_scandals 0.204 Living_people 0.201 1949_births 0.167 People_from_Arkansas 0.167 Arkansas_politicians 0.167 American_tax_evaders 0.167 Arkansas_lawyers 0.167 <DOC> <DOCNO>ABC19980430.1830.0091.LDC2000T44-E2 <DOCNO> <TEXT> Webb Hubbell PER Individual NAM: "Hubbell” "Hubbells” "Webb Hubbell” "Webb_Hubbell" NAM: "Mr . " "friend” "income" PRO: "he” "him” "his" , . abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him hi s hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years </TEXT> </DOC>

  43. Wikitology-derived features • Seven features measured entity similarity using cosine similarity of various-length article or category vectors • Five features measured entity dissimilarity: two PER entities match different Wikitology persons; two entities match Wikitology tags in a disambiguation set; two ORG entities match different Wikitology organizations; two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2); two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2)
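A minimal sketch of two of these features over sparse Wikitology tag vectors: cosine similarity of article-tag vectors, and the weighted person-mismatch dissimilarity. The vectors and top tags below are toy values.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def person_mismatch(top1, top2):
    """top: (article, score) of each entity's best person-typed tag."""
    if top1[0] != top2[0]:
        return 1 - abs(top1[1] - top2[1])   # weighted by 1-abs(score1-score2)
    return 0.0

v1 = {"Webster_Hubbell": 1.0, "Whitewater_controversy": 0.22}
v2 = {"Webster_Hubbell": 0.9, "United_States_v._Hubbell": 0.38}
print(cosine(v1, v2))
print(person_mismatch(("Webster_Hubbell", 1.0), ("Mahmoud_Abbas", 0.8)))
```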

  44. COE Features • Character-level features: exact match of NAM mentions (longest mention exact match, some mention exact match, multiple mention exact match, all mention exact match); partial match (Dice score over character bigrams, Dice score over longest-mention character bigrams, last word of longest string match); matching nominals and pronominals (exact match, multiple exact match, all match, Dice score of mention strings) • Document-level features: words (Dice score for words in document, Dice score for words around mentions, cosine score for words in document, cosine score for words around mentions); entities (Dice score for entities in document, Dice score for entities around mentions) • Metadata features: speech/text, news/non-news, same document • Social context features: heuristic, probabilistic

  45. More COE Features • KB features, instances: known alias (aliases also derived from the test collection), BBN name match, famous singleton • KB features, semantic match: entity type match, sex match, number match, occupation match, fuzzy occupation match, nationality match, spouse match, parent match, sibling match • KB features, ontology: Wikitology (top Wikitology category matches, top Wikitology article matches, different top Wikitology person, different top Wikitology organization, top Wikitology categories in disambiguation set); Reuters topics (cosine score for words in document, cosine score for words around mentions); thesaurus concepts (cosine score for words in document, cosine score for words around mentions)

  46. Clustering • Approach: assign a score to each entity pair (SVM or heuristic); eliminate pairs whose score does not exceed a threshold (0.95 for SVM runs); identify connected components in the resulting graph • Large clusters: AP (good); Clinton (bad: conflates William and Hillary) • Sources of large clusters varied: connected-components clustering, SERIF errors, insufficient features to distinguish separate entities
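A minimal sketch of the thresholding and connected-components step, using union-find; the scored pairs are illustrative, and union-find is a standard implementation choice, not necessarily the one used in the actual system.

```python
def cluster(scored_pairs, threshold=0.95):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b, score in scored_pairs:
        find(a); find(b)                    # register every entity
        if score > threshold:               # keep only high-scoring pairs
            parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())            # each component = a global entity

pairs = [("E1", "E5", 0.99), ("E5", "E9", 0.97), ("E2", "E3", 0.40)]
print(cluster(pairs))   # [{'E1', 'E5', 'E9'}, {'E2'}, {'E3'}]
```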

  47. Results on ACE08 Pilot Data • A few caveats: small data set (49 files, 5 interesting entities); trained off “ambiguated” ACE 2005 data, so probably not our best configuration (given connected-components clustering)

  48. Feature Ablation • A post hoc feature ablation evaluation showed the contribution of KB features

  49. Features with High F1 Scores • Recall that F1 = 2*P*R/(P+R) • Variants of exact name match in general, especially: a name mention in one entity exactly matches one in the other (83.1%) • Cosine similarity of the vectors of top Wikitology article matches (75.1%) • Top Wikitology article for the two entities matched (38.1%) • An entity contained a mention that was a known alias of a mention found in the other (47.5%)

  50. High Precision Features • High-precision/low-recall features are useful when applicable • Features with precision > 95% include: a name mentioned by each entity matches exactly one person in Wikipedia; the entities have the same parent; the entities have the same spouse; all name mentions have an exact match across the two entities; the longest named mention has an exact match
