Challenge Problem: Link Mining

Challenge Problem: Link Mining Lise Getoor University of Maryland, College Park

Link Mining • Data • Structured Input: Mining graphs and networks • Structured Output: Extracting entity and relationships from unstructured data • Making use of Links • For ranking nodes • For collective classification of nodes • Discovering Links • Predicting missing links • Discovering new kinds of links and relationships

Link Mining Tasks • Node Centric • Labeling/ranking nodes (aka Collective Classification/PageRank) • Consolidating nodes (aka Entity Resolution) • Discovering hidden nodes (aka Group Discovery) • Edge Centric • Labeling/ranking edges • Predicting the existence of edges • Predicting the number of edges • Discovering new relations/paths • Graph/Subgraph Centric • Discovering frequent subpatterns • Generative models • Metadata discovery, extraction, and reformulation Reference: SigKDD Explorations Special Issue on Link Mining, December 2005.

The Link Mining Challenge • Current research mostly focus on a single task, e.g., node ranking or link prediction • In real data analysis scenarios, we need a mix of all of these capabilities • Many potential domains: • Bioinformatics • Social network analysis • Citation Analysis • Fraud detection • ….

Challenge Problem Requirements • Relevant to data mining and based on analysis of large volumes of data (including web, text, images, links, etc), preferably publicly available data. • Important and difficult so that its solution will advance the field and benefit the society • Interesting and exciting to attract researchers, public and press attention, and funding. This requires a simple and concise problem statement • The required domain knowledge should be relatively accessible. • Other groups are not actively working on this problem already

Domain Evangelists: “Goal to distribute free encyclopedia to every single person on the planet in their own language” Jimmy Wales Wikipedia founder “Disaster is not too strong a word for wikipedia… the site is infested with moonbats” Eric Raymond, Open-source movement figure Collaboratively edited user contributed encyclopedia Largest example of participatory journalism to date. Mantra: maintain a neutral point of view (NPOV) Detractors::”Wikipedia has gone from a nearly perfect anarchy to an anarchy with gang rule.” Larry Sanger Wikipedia co-founder Know It All: Can Wikipedia Conquer Expertise? Stacy Schiff, New Yorker, July 31, 2006

Task #1: Descriptive Modeling Modeling Growth of Wikipedia

Task #2: User Classification • Wiki Gnome: user that keeps a low profile, fixing typos, poor grammar and broken links • Wiki Troll: disruptive user who persistently violates the site’s guidelines vs. Gnome Troll

Task #3: Text Classification • Three Wikipedia Content Guidelines: • NPOV: represent views fairly and without bias • Verifiability • No original research

#4: Link Prediction/Completion • Identify where links should exist • As Wikipedia grows, it becomes harder for any given author to know about other relevant stuff they can/should link to from some article. • Some method that could help with this (link suggestion, auto linking, etc.) would potentially be very useful. • Evaluation: Generate a dataset by taking a given set of wikipedia pages, removing some of the existing links, and then see if a system could identify those places and suggest appropriate links.

Other Link Mining Tasks • Trust/Reputation analysis • “Gives no privilege to those who know what they are talking about”, William Connolley, climate modeler and Wikipedia admin • Social network analysis • Identification of communities • Accuracy • Nature comparison with Britannica (4-3 error ratio) • Misuse • Vandalism and self-promotion • Coverage • Which areas aren’t covered, or are poorly covered/linked?

But none of these are grand challenges… • According to wikipedia

The Wikipedia Grand Challenge • The Wikipedia Test: • Given a collection of entries constructed via participatory journalism (PJ) vs. link mining (LM), • Can you distinguish between PJ and LM? Which is better? • Evaluation: • Via a panel of human experts • Via page rank Solution will require a variety of integrated link mining capabilities

$$ Already Available… • Hutter prizehttp://prize.hutter1.net/ • 50,000 € ≈ $64,000http://en.wikipedia.org/wiki/The_64%2C000_Dollar_Question

Challenge Problem: Link Mining