Learning to Link with Wikipedia
David Milne, Ian H. Witten, CIKM'08
2009/12/18 Henrik Schmitz
Introduction
• Wikipedia
  • The largest, most visited encyclopedia
  • Densely structured: millions of links
  • Its links guide readers to unintended but related information (serendipity)
• Approach
  • Bring Wikipedia's accessibility and serendipity to all documents
  • Automatically find topics in unstructured text and link them to Wikipedia articles
Introduction
• The task: WIKIFICATION!
Introduction
• What's new
  • Wikipedia is not only a source of information
  • Wikipedia is used as training data to learn how to create links
  • Improvements in both recall and precision
• In this paper
  • A machine-learning approach to wikification
  • Two stages
    • Link disambiguation
    • Link detection
Related Work: Wikify
• The Wikify system by Mihalcea and Csomai (2007)
  • The basis for this paper
• Wikify also has two steps, but in the opposite order
  • Detection
  • Disambiguation
• One key difference: this paper's order may seem strange, but it uses disambiguation to inform detection
Related Work: Wikify
• Detection
  • Identify valuable phrases by link probability:
    link probability = (# articles using the term as a link anchor) / (# articles mentioning the term)
  • Thus: find all n-grams exceeding a threshold on this probability (see the sketch below)
  • Precision: 53%, Recall: 56%
• Disambiguation
  • Link the detected phrases to suitable Wikipedia articles, resolving ambiguity
  • Requires enormous preprocessing: the entire Wikipedia must be parsed
  • Precision: 93%, Recall: 86%
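A minimal sketch of this link-probability detection step, assuming two hypothetical precomputed counts mined from a Wikipedia dump (`anchor_count` and `mention_count`); the 6.5% default threshold is the value this paper chooses later, not Wikify's.

```python
# Minimal sketch of Wikify-style detection by link probability, assuming two
# hypothetical precomputed counts mined from a Wikipedia dump:
#   anchor_count[t]  - number of articles that use t as the anchor text of a link
#   mention_count[t] - number of articles whose text mentions t
def link_probability(term, anchor_count, mention_count):
    mentions = mention_count.get(term, 0)
    if mentions == 0:
        return 0.0
    return anchor_count.get(term, 0) / mentions

def detect_candidates(ngrams, anchor_count, mention_count, threshold=0.065):
    """Keep every n-gram whose link probability exceeds the threshold."""
    return [t for t in ngrams
            if link_probability(t, anchor_count, mention_count) > threshold]
```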
Related Work: Topic Indexing
• Medelyan et al. (2008)
  • A similar approach to wikification
  • Additionally identifies the most important topics
• This paper improves on that approach through context weighting and machine learning
Disambiguation: Algorithm
• Uses links found in Wikipedia articles as training data
  • Wikipedians create these links manually and deliberately
  • Millions of ground-truth examples to learn from
• Preparation
  • Wikipedia version with around 2 million links
  • Articles with >50 links; no lists or disambiguation pages
  • 700 articles
    • 500 for training
    • 100 for configuration
    • 100 for evaluation
Disambiguation: Algorithm
• Each link in an article yields several training instances (see the sketch below)
  • The connection from the anchor to its actual destination is a positive example
  • The remaining possible destinations for that anchor are negative examples
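A minimal sketch of turning an article's links into training instances; `senses(anchor)` is a hypothetical lookup returning every article the anchor text has ever linked to anywhere in Wikipedia.

```python
# Minimal sketch of building disambiguation training instances from one article.
# senses(anchor) is a hypothetical lookup: every article the anchor text has
# ever linked to anywhere in Wikipedia.
def training_instances(article_links, senses):
    """article_links: list of (anchor_text, true_destination) pairs from one article."""
    instances = []
    for anchor, destination in article_links:
        for candidate in senses(anchor):
            label = 1 if candidate == destination else 0  # positive vs. negative example
            instances.append((anchor, candidate, label))
    return instances
```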
Disambiguation: Algorithm
• Balance commonness (prior probability) and relatedness
  • Commonness of a sense: how often the anchor text is used as a link to that destination in Wikipedia (# times used as destination), sketched below
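A minimal sketch of commonness, assuming a hypothetical precomputed count `destination_count[anchor][article]` of how often each anchor text links to each destination article across Wikipedia.

```python
# Minimal sketch of commonness (the prior probability of a sense), assuming a
# hypothetical count destination_count[anchor][article]: how often the anchor
# text links to each destination article across Wikipedia.
def commonness(anchor, article, destination_count):
    counts = destination_count.get(anchor, {})
    total = sum(counts.values())
    return counts.get(article, 0) / total if total else 0.0
```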
Disambiguation: Algorithm
• Relatedness: compare each candidate sense with the surrounding context
• Problem: this looks cyclic, because the context terms may themselves be ambiguous
• But documents generally contain unambiguous terms that can serve as context
Disambiguation: Algorithm
• Relatedness
  • Select the sense article that has the most in common with the context articles
  • Relatedness between articles a and b:
    relatedness(a, b) = (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|)))
    where A, B are the sets of articles linking to a and b, and W is the set of all articles in Wikipedia
  • The relatedness of a candidate sense is the weighted average of its relatedness to each context article
  • (see the sketch below)
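A minimal sketch of this measure; `inlinks(article)` is a hypothetical helper returning the set of articles linking to it, and `total_articles` is |W|. The raw formula behaves like a distance (0 when a and b share all their in-links), so converting it to a 0..1 similarity here is my assumption.

```python
import math

# Minimal sketch of the relatedness measure above. inlinks(article) is a
# hypothetical helper returning the set of articles that link to it;
# total_articles is |W|, the number of articles in Wikipedia.
def relatedness(a, b, inlinks, total_articles):
    A, B = inlinks(a), inlinks(b)
    common = A & B
    if not A or not B or not common:
        return 0.0  # no shared in-links: treated as unrelated (an assumption)
    distance = ((math.log(max(len(A), len(B))) - math.log(len(common))) /
                (math.log(total_articles) - math.log(min(len(A), len(B)))))
    # The formula is 0 for identical in-link sets; map it to a 0..1 similarity.
    return max(0.0, 1.0 - distance)
```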
Disambiguation: Algorithm
• Weighting the comparisons
  • Do not consider all context terms equally; e.g. "the" should carry zero weight
  • 1. Use Wikify's link probability of the context term
  • 2. Check how related the context term is to the central topic: calculate its average semantic relatedness to the other context articles using the measure above
  • The weight is the average of 1. and 2. (see the sketch below)
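A minimal sketch of this weighting, reusing the hypothetical `link_probability()` and `relatedness()` helpers sketched earlier.

```python
# Minimal sketch of context-term weighting: each unambiguous context article is
# weighted by averaging its link probability with its mean relatedness to the
# other context articles.
def context_weights(context, anchor_count, mention_count, inlinks, total_articles):
    """context: list of (term, article) pairs for the unambiguous context terms."""
    weights = {}
    for term, article in context:
        lp = link_probability(term, anchor_count, mention_count)
        others = [a for _, a in context if a != article]
        rel = (sum(relatedness(article, o, inlinks, total_articles) for o in others)
               / len(others)) if others else 0.0
        weights[article] = (lp + rel) / 2  # average of signals 1. and 2.
    return weights
```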
Disambiguation: Algorithm
• Combining commonness and relatedness
  • Use machine learning to adjust the balance for each document
  • Homogeneous, plentiful context: prioritize relatedness
  • Ambiguous, sparse context: prioritize commonness
• Context quality
  • Sum of the weights of all context terms (already calculated above)
Disambiguation: Algorithm — Classifier
• Resulting features
  • Number of involved context terms
  • Extent of their relations to each other
  • How frequently they are used as Wikipedia links
• The classifier produces the probability that a sense is valid (see the sketch below)
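A minimal sketch of what such a classifier could look like: each candidate sense becomes a feature vector of commonness, weighted relatedness to the context, and overall context quality. The paper uses C4.5 (next slide); a scikit-learn decision tree stands in here, so both the learner and the exact feature layout are assumptions.

```python
from sklearn.tree import DecisionTreeClassifier

# Minimal sketch of a disambiguation feature vector, built from the helper
# sketches above: commonness, weighted relatedness to the context, and context
# quality (the sum of context weights).
def sense_features(anchor, candidate, context_w, destination_count,
                   inlinks, total_articles):
    comm = commonness(anchor, candidate, destination_count)
    total_w = sum(context_w.values())
    rel = (sum(w * relatedness(candidate, c, inlinks, total_articles)
               for c, w in context_w.items()) / total_w) if total_w else 0.0
    return [comm, rel, total_w]

# Training: X holds feature vectors for labelled senses, y holds 1/0 labels.
# clf = DecisionTreeClassifier().fit(X, y)
# clf.predict_proba(x)[0, 1]  -> probability that the candidate sense is valid
```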
Disambiguation: Configuration
• One parameter: the minimum sense probability below which senses are discarded
  • A higher threshold gains speed and precision, but loses recall
  • Chosen threshold: around 2%
• Classification algorithm
  • C4.5 (generates a decision tree)
Disambiguation: Evaluation
• Baseline: the heuristic approach by Medelyan et al. (difference: no machine learning and no weighting of context)
• This paper's approach
  • The 100 randomly chosen test articles contain 11,000 anchors, which were automatically disambiguated
  • Precision is always ≥88%; perfect for 45% of the articles
  • Recall is always ≥75%; perfect for 14% of the articles
  • Recall increases when all valid senses are selected, but precision gets worse
Disambiguation: Evaluation
• Advantages of the paper's approach
  • No parsing of the text required
  • Fewer resources required
  • Less training data: 500 articles versus the whole Wikipedia
• Facts
  • PC: 3 GHz dual core, 4 GB RAM
  • Disambiguator trained in 13 minutes
  • Tested in four minutes, of which three minutes were spent loading data into memory
Detection: Algorithm
• The algorithm is based on Wikify
• Key difference: Wikipedia articles are used to learn which terms should be linked and which should not, and context is taken into account
• Wikify's approach relies exclusively on link probability
  • This inevitably makes mistakes: relevant links are sometimes discarded and irrelevant ones retained
• Better: use link probability as one feature among many
Detection: Algorithm
• Gather all n-grams and retain those exceeding a link-probability threshold (configured later)
  • This discards nonsense phrases and stop words
• The remaining phrases are disambiguated using the classifier from before
• Result: a set of associations between terms and Wikipedia articles, obtained without any part-of-speech analysis (see the sketch below)
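A minimal sketch of this candidate pipeline, reusing the earlier `detect_candidates()` helper; `disambiguate(phrase)` is a hypothetical wrapper around the disambiguation classifier that returns `(article, confidence)`, or `(None, 0.0)` when no sense is accepted.

```python
# Minimal sketch of the candidate pipeline: gather n-grams, keep those above the
# link-probability threshold, then disambiguate each survivor.
def detect_topics(ngrams, anchor_count, mention_count, disambiguate,
                  threshold=0.065):
    associations = {}
    for phrase in detect_candidates(ngrams, anchor_count, mention_count, threshold):
        article, confidence = disambiguate(phrase)
        if article is not None:
            associations[phrase] = (article, confidence)
    return associations
```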
Detection: Algorithm
• Features used:
  • Link probability
  • Relatedness
  • Disambiguation confidence
  • Generality
  • Location and spread
Detection: Algorithm
• Features: link probability
  • A topic may involve several candidate link locations (e.g. "Hillary Clinton" and "Clinton"), so there are multiple link probabilities
  • These are combined into an average and a maximum (see the sketch below)
  • The average is more consistent, the maximum more indicative (e.g. "Democratic Party" vs. "Party"); information is lost when probabilities are only averaged
• Features: relatedness
  • Average relatedness between each topic and all of the other candidates
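A minimal sketch of combining the link probabilities of a topic's candidate link locations into average and maximum features, using the hypothetical `link_probability()` helper from earlier.

```python
# Minimal sketch of the average and maximum link-probability features for one topic.
def link_probability_features(mentions, anchor_count, mention_count):
    """mentions: surface forms of one topic, e.g. ["Hillary Clinton", "Clinton"]."""
    probs = [link_probability(m, anchor_count, mention_count) for m in mentions]
    if not probs:
        return 0.0, 0.0
    return sum(probs) / len(probs), max(probs)  # (average, maximum)
```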
Detection: Algorithm
• Features: disambiguation confidence
  • The disambiguation classifier gives not just a yes/no judgment but a confidence in its answer
  • Topics the classifier is more sure of get a greater chance of being linked
  • Also combined as average and maximum values
• Features: generality
  • Links to specific topics are more useful than links to general ones
  • Defined as the minimum depth at which the article is located in Wikipedia's category tree
Detection: Algorithm
• Features: location and spread (see the sketch below)
  • Computed over the n-grams from which the topic was mined
  • Frequency
  • First occurrence (e.g. whether the topic is mentioned in the introduction)
  • Last occurrence (e.g. whether it is mentioned in the conclusion)
  • Spread: distance between the first and last occurrence, i.e. how consistently the topic is used
  • All of these must be normalized by the length of the document
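A minimal sketch of these features, computed from the character offsets at which a topic's surface forms occur; how the occurrences are found is assumed to be given.

```python
# Minimal sketch of the location and spread features, normalized by document
# length as described on the slide.
def location_spread_features(occurrence_offsets, doc_length):
    if not occurrence_offsets or doc_length == 0:
        return {"frequency": 0, "first": 0.0, "last": 0.0, "spread": 0.0}
    first, last = min(occurrence_offsets), max(occurrence_offsets)
    return {
        "frequency": len(occurrence_offsets),
        "first": first / doc_length,            # near 0: mentioned early (introduction)
        "last": last / doc_length,              # near 1: mentioned late (conclusion)
        "spread": (last - first) / doc_length,  # how consistently the topic is used
    }
```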
Detection: Configuration
• Articles
  • The same 500 articles used to train the disambiguation classifier
    • Terms must first be disambiguated into the appropriate articles before being used as training instances, so fewer disambiguation errors here help
  • The same 100 articles used to configure the disambiguation classifier
• One parameter: the initial link probability threshold
  • Discards nonsense phrases and stop words
  • Trade-off between speed and precision on one side and recall on the other
  • Chosen threshold: 6.5%
Detection: Evaluation
• 100 new randomly selected articles for evaluation
  • Ground truth: their 9,300 manually linked topics
  • All markup is stripped and the link detector is run
• Recall, precision and f-measure are all around 74%
• An improvement over Wikify
Detection: Evaluation
• Facts
  • Link detector trained in 37 minutes
  • Tested in eight minutes
Wikification in the Wild
• What about documents not obtained from Wikipedia?
  • Verify with new documents and human evaluators
• Experimental data
  • 50 documents from the AQUAINT text corpus (news)
  • Random stories of about 300 words (matching readers' attention span)
  • 500 new training articles, also about 300 words long; the 50 with the highest link proportion were selected
  • The classifier identified 449 link-worthy topics, on average 9 per article
Wikification in the Wild
• Participants
  • Recruited via Amazon's crowdsourcing service Mechanical Turk
  • Allows a labor-intensive experiment without gathering people in person
  • Concern: anonymous workers
    • Low-quality responses and undesirable participants are identified and rejected
Wikification in the Wild
• Evaluating detected links
  • 449 tasks, one for each link
  • Each task shows the original text with one link highlighted
  • The participant specifies whether the link is valid or not
  • Three participants per link
Wikification in the Wild
• Identifying missing links
  • 50 tasks, one for each article
  • Each article is shown with all of its links
  • The participant reads the article and can list additional Wikipedia topics that should have been linked
  • Five participants per article
Wikification in the Wild
• Results
  • 76% of the detected links were judged correct
  • The remaining 24% were not, mostly due to incorrect candidate identification
  • Similar results as before: the algorithm works as well "in the wild" as on Wikipedia articles
Wikification in the Wild
• Wikification online
  • The evaluation results were used to correct the automatically tagged articles and generate ground truth
  • Result: a corpus with only manually verified links
  • www.nzdl.org/wikification
Example of an Application: Ontology
• A tool for building cross-referenced documents
  • Structured knowledge about any unstructured document
  • Graph representation of the discussed concepts
  • Links between topics indicate a significant relation
  • No ambiguity
  • Example: the content of this paper (showing just a few relations)
Thank you for your attention!
• Fuhaha! The Mechanical Turk (or Automaton Chess Player) was a fake chess-playing machine constructed in the late 18th century… that allowed a human chess master hiding inside to operate the machine.