
Learning to Link with Wikipedia




  1. Learning to Link with Wikipedia
  David Milne, Ian H. Witten, CIKM'08
  2009/12/18, Henrik Schmitz

  2. Introduction
  • Wikipedia
    • Largest, most visited encyclopedia
    • Densely structured: millions of links
    • Serendipity: guides readers to unintended information
  • Approach
    • Bring Wikipedia's accessibility and serendipity to all documents
    • Automatically find topics in unstructured text and link them to Wikipedia articles
  CIKM'08: Learning to Link with Wikipedia - Henrik Schmitz

  3. Introduction
  WIKIFICATION!

  4. Introduction
  • What is new
    • Wikipedia is not only a source of information, but is also used as training data to create links
    • Improvements in recall and precision
  • In this paper
    • A machine-learning approach to wikification, in two stages:
    • Link disambiguation
    • Link detection

  5. Related Work: Wikify
  • The Wikify system by Mihalcea and Csomai (2007) is this paper's basis
  • Wikify also has two steps, but in the opposite order:
    • Detection
    • Disambiguation
  • One key difference: this paper's order seems odd at first, but it uses disambiguation to inform detection

  6. Related Work: Wikify
  • Detection
    • Identify valuable phrases by their link probability:
      link probability = (# articles using the term as a link anchor) / (# articles mentioning the term)
    • Find all n-grams exceeding a threshold on this probability
    • Precision: 53%, Recall: 56%
  • Disambiguation
    • Link each detected phrase to the appropriate Wikipedia article, resolving ambiguity
    • Requires enormous preprocessing: the entire Wikipedia must be parsed
    • Precision: 93%, Recall: 86%
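The link probability above can be sketched as a simple ratio; the counts below are hypothetical and the function name is mine:

```python
def link_probability(anchor_count: int, mention_count: int) -> float:
    """Probability that a term is worth linking: the number of Wikipedia
    articles that use the term as a link anchor, divided by the number of
    articles that mention the term at all."""
    if mention_count == 0:
        return 0.0
    return anchor_count / mention_count

# hypothetical counts: a term linked in 530 of the 1000 articles mentioning it
print(link_probability(530, 1000))  # 0.53
```

Wikify keeps every n-gram whose probability exceeds a fixed threshold, which is where its detection precision/recall trade-off comes from.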

  7. Related Work: Topic Indexing
  • Medelyan et al. (2008)
    • Similar approach to wikification
    • Additionally identifies the most important topics
  • This paper improves on that approach through weighting and machine learning

  8. Disambiguation: Algorithm
  • Uses the links found in Wikipedia articles for training
    • Wikipedians make links deliberately, with effort
    • Millions of ground-truth examples to learn from
  • Preparation
    • A Wikipedia version with around 2 million links
    • Articles with >50 links; no lists or disambiguation pages
    • 700 articles: 500 for training, 100 for configuration, 100 for evaluation

  9. Disambiguation: Algorithm
  • Each link in an article yields several training instances:
    • The connection from the anchor to its actual destination is a positive example
    • The remaining possible destinations are negative examples

  10. Disambiguation: Algorithm
  • Commonness (prior probability) of a sense: how often the anchor is used as a link to that destination in Wikipedia
  • Balance commonness against relatedness
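Commonness is just the empirical prior over link destinations; a minimal sketch with hypothetical counts:

```python
from collections import Counter

def commonness(destination_counts: Counter, sense: str) -> float:
    """Prior probability of a sense: how often Wikipedia editors link this
    anchor text to that destination, out of all its uses as a link anchor."""
    total = sum(destination_counts.values())
    return destination_counts[sense] / total if total else 0.0

# hypothetical link counts for the anchor text "tree"
counts = Counter({"Tree": 90, "Tree (data structure)": 10})
print(commonness(counts, "Tree"))  # 0.9
```

Commonness alone would always pick the dominant sense; relatedness to the context is what lets the rarer sense win when the document is about, say, data structures.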

  11. Disambiguation: Algorithm
  • Relatedness: compare candidate senses with the surrounding context
  • This seems circular, since the context terms may themselves be ambiguous
  • But generally unambiguous terms exist to anchor the comparison

  12. Disambiguation: Algorithm
  • Relatedness
    • Select the sense article that has the most in common with the context articles
    • Relatedness between articles a and b:
      relatedness(a, b) = (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|)))
      where A and B are the sets of articles linking to a and b, and W is the set of all articles in Wikipedia
    • The relatedness of a candidate sense is the weighted average of its relatedness to each context article
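This measure can be implemented directly from in-link sets. A sketch, as a distance (lower means more related); the handling of zero overlap is my assumption:

```python
import math

def wlm_distance(in_links_a: set, in_links_b: set, n_articles: int) -> float:
    """Link-based distance between two articles, computed from the sets A, B
    of articles linking to them out of |W| = n_articles Wikipedia articles.
    Lower values mean more closely related."""
    overlap = len(in_links_a & in_links_b)
    if overlap == 0:
        return float("inf")  # no shared in-links: treated as unrelated (assumption)
    a, b = len(in_links_a), len(in_links_b)
    return (math.log(max(a, b)) - math.log(overlap)) / (
        math.log(n_articles) - math.log(min(a, b)))
```

Two articles with identical in-link sets get distance 0; as the overlap shrinks relative to the larger set, the distance grows.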

  13. Disambiguation: Algorithm
  • Weighting the comparisons
    • Do not consider all context terms equally; e.g. "the" has zero value
    • 1. Use Wikify's link probability
    • 2. Check the relatedness of each context term to the central topic: calculate its average semantic relatedness to the other context articles using the measure above
    • The weight of a context term is the average of 1. and 2.
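The per-term weight then averages the two signals; a minimal sketch (function and parameter names are mine):

```python
def context_weight(link_prob: float, relatedness_to_others: list) -> float:
    """Weight of a context term: the average of (1) its link probability and
    (2) its mean semantic relatedness to the other context articles."""
    avg_relatedness = sum(relatedness_to_others) / len(relatedness_to_others)
    return (link_prob + avg_relatedness) / 2.0

print(context_weight(0.5, [0.2, 0.4]))  # ≈ 0.4
```

A stop word like "the" scores near zero on both signals, so it contributes almost nothing to the weighted relatedness of a candidate sense.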

  14. Disambiguation: Algorithm
  • Combining commonness and relatedness
    • Use machine learning to adjust the balance for each document
    • Homogeneous, plentiful context: prioritize relatedness
    • Ambiguous, sparse context: prioritize commonness
  • Context quality
    • The sum of the weights of all context terms (already calculated)

  15. Disambiguation: Algorithm
  • Resulting classifier features:
    • Number of involved terms
    • Extent of their relations to each other
    • How frequently they are used as Wikipedia links
  • The classifier produces the probability that a sense is valid

  16. Disambiguation: Configuration
  • One parameter: the minimum probability a sense must have to be considered
    • A higher threshold gains speed and precision, but loses recall
    • Threshold set at around 2%
  • Classification algorithm: C4.5 (generates a decision tree)

  17. Disambiguation: Evaluation
  • The 100 randomly chosen articles contain 11,000 anchors, which were automatically disambiguated
  • This paper's approach: always ≥88% precision (45% of articles disambiguated perfectly); always ≥75% recall (14% perfect)
  • Recall increases by selecting all valid senses, but precision gets worse
  • Compared against the heuristic approach by Medelyan et al. (difference: no machine learning and no weighting of context)

  18. Disambiguation: Evaluation
  • Advantages of this paper's approach
    • No parsing of text required
    • Fewer resources required
    • Less training data: 500 articles versus the whole Wikipedia
  • Facts (PC: 3 GHz dual core, 4 GB RAM)
    • Disambiguator trained in 13 minutes
    • Tested in four minutes, three of which are spent loading data into memory

  19. Detection: Algorithm
  • The algorithm is based on Wikify
  • Key difference: Wikipedia articles are used to learn which terms should be linked and which not, and context is taken into account
  • Wikify's approach relies exclusively on link probability
    • This always makes mistakes: it sometimes discards relevant links and retains irrelevant ones
  • Better: use link probability as one feature among many

  20. Detection: Algorithm
  • Gather all n-grams and retain those exceeding a threshold (discussed later)
  • Discard nonsense phrases and stop words
  • The remaining phrases are disambiguated using the classifier from before
  • Result: a set of associations between terms and Wikipedia articles, without any part-of-speech analysis
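The gathering step can be sketched as follows; the link-probability table is hypothetical and the 6.5% default matches the threshold given in the configuration slide:

```python
def candidate_phrases(tokens, link_prob, max_n=3, threshold=0.065):
    """Collect every n-gram whose link probability exceeds the threshold;
    nonsense phrases and stop words fall below it and are discarded."""
    found = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if link_prob.get(phrase, 0.0) > threshold:
                found.append(phrase)
    return found

probs = {"hillary clinton": 0.9, "clinton": 0.3, "the": 0.0001}  # hypothetical
tokens = "the hillary clinton campaign".split()
print(candidate_phrases(tokens, probs))  # ['clinton', 'hillary clinton']
```

Note that overlapping candidates ("clinton" inside "hillary clinton") both survive this stage; the later features decide which one becomes a link.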

  21. Detection: Algorithm
  • Features used:
    • Link probability
    • Relatedness
    • Disambiguation confidence
    • Generality
    • Location and spread

  22. Detection: Algorithm
  • Feature: link probability
    • With several candidate link locations (e.g. "Hillary Clinton", "Clinton") there are multiple link probabilities
    • These are combined into an average and a maximum
    • The average is more consistent, the maximum more indicative (e.g. "Democratic Party" vs. "Party"): information is lost when probabilities are averaged
  • Feature: relatedness
    • The average relatedness between each topic and all of the other candidates

  23. Detection: Algorithm
  • Feature: disambiguation confidence
    • Not just a yes/no judgment, but a confidence in that answer
    • More confident topics get a greater chance of being linked
    • Also combined as average and maximum values
  • Feature: generality
    • Links to specific topics are more useful than links to general ones
    • Defined as the minimum depth at which the article is located in Wikipedia's category tree
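Generality can be read off the category tree with a breadth-first search upward from the article; the parent mapping and root name below are toy assumptions:

```python
from collections import deque

def min_depth(article: str, parents: dict, root: str = "Root") -> int:
    """Generality feature: the minimum depth at which an article sits in the
    category tree, found by BFS upward through its parent categories."""
    queue = deque([(article, 0)])
    seen = {article}
    while queue:
        node, depth = queue.popleft()
        if node == root:
            return depth
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return -1  # article not connected to the root category

# toy category tree (hypothetical)
parents = {"Dog": ["Mammals"], "Mammals": ["Animals"], "Animals": ["Root"]}
print(min_depth("Dog", parents))  # 3
```

BFS guarantees the *minimum* depth even when an article belongs to several categories at different depths.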

  24. Detection: Algorithm
  • Feature: location and spread
    • Computed over the n-grams from which the terms are mined
    • Frequency of occurrence
    • First occurrence: is the term mentioned in the introduction?
    • Last occurrence: is it mentioned in the conclusion?
    • Spread: the distance between first and last occurrence, i.e. how consistently the term is used
    • All of these must be normalized by the length of the document
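Location and spread reduce to a few normalized positions; a minimal sketch, taking token offsets as hypothetical input:

```python
def location_features(positions, doc_length):
    """First occurrence, last occurrence, and spread of a term, each
    normalized by document length so documents of any size are comparable."""
    first, last = min(positions), max(positions)
    return (first / doc_length, last / doc_length, (last - first) / doc_length)

# a term occurring at token offsets 10, 50, and 90 in a 100-token document
print(location_features([10, 50, 90], 100))  # (0.1, 0.9, 0.8)
```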

  25. Detection: Configuration
  • Articles
    • The same 500 articles as for training the disambiguation classifier
    • To reduce disambiguation errors, terms are disambiguated to the appropriate articles before being used as training instances
    • The same 100 articles as for configuring the disambiguation classifier
  • One parameter: the initial link probability threshold
    • Discards nonsense phrases and stop words
    • Trade-off between speed & precision on one side and recall on the other
    • Set to 6.5%

  26. Detection: Evaluation
  • 100 new randomly selected articles for evaluation
    • Ground truth: 9,300 manually linked topics
    • All markup is stripped, then the link detector is run
  • Recall, precision, and f-measure are all around 74%
  • An improvement over Wikify

  27. Detection: Evaluation
  • Facts
    • Link detector trained in 37 minutes
    • Tested in eight minutes

  28. Wikification in the Wild
  • What about documents not obtained from Wikipedia?
    • Verify with new documents and human evaluators
  • Experimental data
    • 50 documents from the AQUAINT text corpus (news)
    • Random stories around 300 words long (matching readers' attention span)
    • 500 new training articles, also around 300 words; the 50 with the highest link proportion were selected
  • The classifier identified 449 link-worthy topics, on average 9 per article

  29. Wikification in the Wild
  • Participants
    • Recruited via Amazon's crowdsourcing service Mechanical Turk
    • Enables a labor-intensive experiment without gathering people in person
  • Concern about anonymous workers
    • Identify and reject low-quality responses and undesirable participants

  30. Wikification in the Wild
  • Evaluating detected links
    • 449 tasks, one for each link
    • Each shows the original text with one link
    • The participant specifies whether the link is valid or not
    • Three participants per link

  31. Wikification in the Wild
  • Identifying missing links
    • 50 tasks, one for each article
    • Each article contains all detected links
    • The participant reads the article and can list additional Wikipedia topics
    • Five participants per article

  32. Wikification in the Wild
  • Results
    • 76% of the links were judged correct; 24% were not
    • Mistakes mostly due to incorrect candidate identification
    • Similar results as before: the algorithm works the same "in the wild" as on Wikipedia

  33. Wikification in the Wild
  • Wikification online
    • The evaluation results were used to correct the automatically tagged articles and generate ground truth
    • A corpus with only manually verified links
    • www.nzdl.org/wikification

  34. Example of an Application
  • A tool for building cross-referenced documents
    • Structured knowledge about any unstructured document, like an ontology
    • A graph representation of the discussed concepts
    • Links between topics indicate a significant relation
    • No ambiguity
  • Example built from the content of this paper (showing just a few relations)

  35. Thank you for your attention!
  Fuhaha! The Mechanical Turk, or Automaton Chess Player, was a fake chess-playing machine constructed in the late 18th century that allowed a human chess master hiding inside to operate the machine.
