
Learning to Link



Presentation Transcript


  1. David Milne | Ian H. Witten Learning to Link with Wikipedia The University of Waikato | New Zealand

  2. Motivation • Links between Wikipedia articles provide • Explanation • Investigation • Serendipity • Can we add the same links to all documents?

  3. Learning to Link with Wikipedia The University of Waikato | New Zealand David Milne | Ian H. Witten

  4. Related Work • Mihalcea, R. and Csomai, A. Wikify! Linking documents to encyclopedic knowledge. In Proceedings of CIKM’07, Lisbon, Portugal • INEX Link to the Wiki Track

  5. Algorithm A two-step process • Link Disambiguation • Link Selection

  6. Algorithm | Disambiguation For every link in Wikipedia, a human author has manually chosen the correct destination. The anchor text "napa" is linked to four different destinations: • [[ Napa, California | napa ]] → Napa, California • [[ Napa River | napa ]] → Napa River • [[ Napa County, California | napa ]] → Napa County, California • [[ NAPA | napa ]] → National Automotive Parts Association

  7. Algorithm | Disambiguation For every link in Wikipedia, a human author has manually chosen the correct destination A machine-learned approach with two main features • Commonness (or prior probability) • Relatedness to context
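Commonness can be read as the share of links with a given anchor text that point to each destination. A minimal sketch, assuming that reading; the counts for "napa" are invented for illustration and are not from the talk:

```python
from collections import Counter

# Hypothetical counts of how often the anchor text "napa" links to each
# destination. The numbers are made up for illustration only.
anchor_destinations = Counter({
    "Napa, California": 60,
    "Napa County, California": 25,
    "Napa River": 10,
    "National Automotive Parts Association": 5,
})


def commonness(destinations: Counter) -> dict[str, float]:
    """Return the fraction of links with this anchor that go to each sense."""
    total = sum(destinations.values())
    return {sense: count / total for sense, count in destinations.items()}


priors = commonness(anchor_destinations)
most_common = max(priors, key=priors.get)
```

Here `most_common` is the sense a commonness-only disambiguator would always pick, whatever the surrounding text says.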

  8. Algorithm | Disambiguation Commonness “Six central banks, including the Bank of England, have cut interest rates by half a percentage point in an effort to steady the faltering global economy.”

  9. Algorithm | Disambiguation Relatedness “Six central banks, including the Bank of England, have cut interest rates by half a percentage point in an effort to steady the faltering global economy.” “The story begins on the banks of the Rio Negro in the Central Amazon. A party of scientists is embarking on a voyage which they hope will provide answers to a five hundred year old mystery.”

  10. Algorithm | Disambiguation Relatedness • Dependency theory • Illegal immigration • Capitalism • Trade • Overnight rate • Division of labour • MasterCard • Imperialism • Colonization • Accenture • Globalization • Bank • Debit card • Corporation • Financial market • European Union • Automated teller machine • Human migration • Mixed economy • World Bank • Mergers & Acquisitions • Assets inflation
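The slides do not give a formula for relatedness. One link-based option, adapted from the Normalized Google Distance, compares the sets of articles that link to each concept. The sketch below assumes that measure; the function name, the inlink sets, and the article total are all hypothetical.

```python
import math


def relatedness(inlinks_a: set[str], inlinks_b: set[str], total_articles: int) -> float:
    """Link-based relatedness in [0, 1], modeled on Normalized Google Distance."""
    shared = inlinks_a & inlinks_b
    if not shared:
        return 0.0  # no common inlinks: treat the concepts as unrelated
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the size of both inlink sets")
    distance = (math.log(larger) - math.log(len(shared))) / (
        math.log(total_articles) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)


# Hypothetical inlink sets (articles that link to each concept).
bank = {"Financial market", "World Bank", "Debit card", "Overnight rate"}
interest_rate = {"Financial market", "World Bank", "Overnight rate"}
river_bank = {"Amazon River", "Rio Negro"}

financial_score = relatedness(bank, interest_rate, total_articles=1000)
river_score = relatedness(bank, river_bank, total_articles=1000)
```

With these toy sets the financial pair scores close to 1 and the river pair scores 0, which is the contrast the two "bank" sentences on the previous slide illustrate.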

  11. Algorithm | Disambiguation Balancing commonness and relatedness • Homogeneous, plentiful context ▲ relatedness ▼ commonness • Ambiguous, sparse context ▼ relatedness ▲ commonness • Third feature: quality of context
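A rough sketch of how the three features might be assembled for each candidate sense before being handed to a classifier. The classifier itself is left out, and the way context quality is computed here (the sum of context-term weights) is an assumption, as are all names and numbers, including the sense label "Bank (geography)".

```python
from dataclasses import dataclass


@dataclass
class SenseFeatures:
    sense: str
    commonness: float
    relatedness: float      # weighted average relatedness to the context terms
    context_quality: float  # assumed here: total weight of the context terms


def build_features(candidates, context_weights):
    """candidates maps sense -> (commonness, {context term: relatedness})."""
    quality = sum(context_weights.values())
    rows = []
    for sense, (prior, related_to) in candidates.items():
        weighted = sum(
            weight * related_to.get(term, 0.0)
            for term, weight in context_weights.items()
        )
        average = weighted / quality if quality else 0.0
        rows.append(SenseFeatures(sense, prior, average, quality))
    return rows


# Hypothetical inputs for the ambiguous term "bank".
context_weights = {"Interest rate": 0.9, "Global economy": 0.8}
candidates = {
    "Bank": (0.7, {"Interest rate": 0.8, "Global economy": 0.6}),
    "Bank (geography)": (0.1, {"Interest rate": 0.0, "Global economy": 0.1}),
}
rows = build_features(candidates, context_weights)
```

Passing context quality alongside the other two features lets a learned model decide when to trust relatedness (rich context) and when to fall back on commonness (sparse context).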

  12. Evaluation | Disambiguation Wikipedia provides ground truth as well as training data • trained on 500 articles • developed and tweaked on 100 articles • tested on 100 articles • Result: recall 96%, precision 98%
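For reference, precision and recall over a set of (anchor, destination) decisions can be computed as below. This is the standard definition, not the authors' evaluation script, and the example pairs are invented.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision = correct / predicted, recall = correct / gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical ground truth taken from an article's existing links.
gold = {
    ("napa", "Napa, California"),
    ("bank", "Bank"),
    ("interest rates", "Interest rate"),
    ("economy", "Economy"),
}
# Hypothetical system output: three correct senses, one wrong.
predicted = {
    ("napa", "Napa River"),
    ("bank", "Bank"),
    ("interest rates", "Interest rate"),
    ("economy", "Economy"),
}
precision, recall = precision_recall(predicted, gold)
```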

  13. Algorithm | Link Selection Every Wikipedia article is an example of how to cross-reference a document with Wikipedia. A machine-learned approach • Detect and disambiguate every term or phrase that might be linked. • Use features of concepts and where they are found to learn what to link.
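The detection step can be sketched as matching word n-grams against a vocabulary of known anchor texts. The vocabulary below is a tiny hand-made stand-in for the one harvested from Wikipedia's links, and `candidate_phrases` is a hypothetical helper name.

```python
import re

# Tiny hand-made stand-in for the anchor vocabulary (an assumption).
ANCHOR_VOCABULARY = {
    "central banks", "bank of england", "bank", "england", "interest",
    "interest rates", "percentage point", "percentage",
    "global economy", "economy",
}


def candidate_phrases(text: str, vocabulary: set[str], max_len: int = 3):
    """Return (word index, phrase) for every n-gram found in the vocabulary."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            phrase = " ".join(words[start:start + length])
            if phrase in vocabulary:
                found.append((start, phrase))
    return found


sentence = (
    "Six central banks, including the Bank of England, have cut interest "
    "rates by half a percentage point in an effort to steady the faltering "
    "global economy."
)
candidates = candidate_phrases(sentence, ANCHOR_VOCABULARY)
```

Overlapping matches such as "bank" and "bank of england" are all kept at this stage; deciding which ones to link is left to the later selection step.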

  14. Algorithm | Link Selection Wikipedia’s links provide a huge vocabulary of which terms correspond to concepts “Six central banks, including the Bank of England, have cut interest rates by half a percentage point in an effort to steady the faltering global economy.” • Candidate concepts shown: Six (number), Article (grammar), Property, One half • Percentages shown: 15%, 0.002%
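The slide shows two percentages, 15% and 0.002%, without saying which phrase each belongs to. Assuming they are link probabilities, a common definition is the number of articles in which a phrase is used as a link divided by the number of articles in which it appears at all. The counts and the threshold below are invented so that the two figures come out; they are not from the talk.

```python
def link_probability(linked_in: int, appears_in: int) -> float:
    """Fraction of articles mentioning a phrase that also use it as a link."""
    if appears_in == 0:
        return 0.0
    return linked_in / appears_in


# Hypothetical counts chosen to reproduce the two percentages on the slide.
frequent_anchor = link_probability(linked_in=150, appears_in=1_000)  # 15%
rare_anchor = link_probability(linked_in=2, appears_in=100_000)      # 0.002%

# A hypothetical cutoff for discarding phrases that are almost never linked.
MIN_LINK_PROBABILITY = 0.01
kept = [p for p in (frequent_anchor, rare_anchor) if p >= MIN_LINK_PROBABILITY]
```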

  15. Algorithm | Link Selection Wikipedia’s links provide a huge vocabulary of which terms correspond to concepts “Six central banks, including the Bank of England, have cut interest rates by half a percentage point in an effort to steady the faltering global economy.” • Candidate concepts shown: Bank of England, England, Central Bank, Bank, Percentage point, Percentage, Energy, Interest, Interest Rate, Global Economy, Economy

  16. Algorithm | Link Selection Features • Link Probability • Relatedness • Disambiguation Confidence • Generality • Location and Spread
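The slide only names the features. One plausible reading of "Location and Spread" is where a concept's mentions first and last occur in the document and how far apart they are, each normalized by document length. The sketch below assumes that reading; the function and key names are hypothetical.

```python
def location_and_spread(positions: list[int], doc_length: int) -> dict[str, float]:
    """Normalized first/last mention positions and the distance between them."""
    if not positions or doc_length <= 0:
        raise ValueError("need at least one mention and a positive document length")
    first = min(positions) / doc_length
    last = max(positions) / doc_length
    return {
        "first_occurrence": first,
        "last_occurrence": last,
        "spread": last - first,
    }


# Hypothetical word offsets of one concept in a 1000-word document.
features = location_and_spread([10, 250, 900], doc_length=1000)
```

A concept mentioned early and throughout the document gets a small first occurrence and a large spread, which is a hint that it is central enough to link.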

  17. Evaluation | Link Selection • On 100 randomly selected Wikipedia articles: recall 74%, precision 74% • On 50 news documents, with human judgments: recall 73%, precision 76% • 50% improvement on previous work

  18. Implications | and applications We can… • …add explanatory links to any document: augment news stories, blogs, educational materials; assist creation of new Wikipedia articles • …improve how documents are represented: Information Retrieval; Topic Indexing (Olena Medelyan); Document Clustering (Anna Huang); Multi-document Summarization (Vivi Nastase) • Example concepts shown: Plain Text, Information Retrieval, Parsing, Natural language, Computer Science, Knowledge Base, Wikipedia, Algorithm, Semantics, Data Mining, Ontology (computer science), Document Classification, Encyclopedia, New Zealand, Machine Learning, Hamilton, NZ, Clustering, University of Waikato, Support Vector Machine

  19. Thanks! | Any Questions? www.nzdl.org/wikification
