Disambiguation Problems in Digital Libraries Tan Yee Fan 2006 August 11 WING Group Meeting
Introduction • Bibliographic digital libraries • DBLP, Citeseer, ACM Portal, … • Metadata records • Authors, title, venue, year, … • Inconsistencies and errors • Typographical errors • Abbreviation • Different entities sharing same name • …
Problem formulation • General disambiguation problem • Given a list of data items X • Find a function δ : X × X → {0, 1} such that • δ(x1, x2) = 1 if x1 and x2 match • δ(x1, x2) = 0 otherwise • Matching relation is not necessarily transitive • δ(“ab”, “bc”) = 1 and δ(“bc”, “cd”) = 1, but δ(“ab”, “cd”) = 0 • If transitive, it is clustering/classification
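The non-transitivity example can be reproduced with a toy matching function. The slide does not say which δ yields these values; the sketch below assumes a hypothetical rule ("two strings match if they share at least one character") chosen only because it is consistent with the three values shown.

```python
from itertools import permutations


def delta(x1: str, x2: str) -> int:
    """Toy matcher (an assumption, not the slide's definition):
    1 if the two strings share at least one character, else 0."""
    return 1 if set(x1) & set(x2) else 0


def is_transitive(items, match) -> bool:
    """Check whether match(a, b) = 1 and match(b, c) = 1 always imply match(a, c) = 1."""
    for a, b, c in permutations(items, 3):
        if match(a, b) and match(b, c) and not match(a, c):
            return False
    return True


items = ["ab", "bc", "cd"]
print(delta("ab", "bc"), delta("bc", "cd"), delta("ab", "cd"))  # 1 1 0
print(is_transitive(items, delta))  # False
```

Because this δ is not transitive, its matches do not partition the items into clean groups, which is why the general problem is broader than clustering.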
Related fields • String similarity • Edit distance, Jaro-Winkler, … • Abbreviation matching • Mostly deals with biomedical texts and in predefined formats • Data cleaning • High level architectures by database people • Social network analysis • Collaboration graphs of authors
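As an illustration of the string similarity measures listed above, here is a minimal sketch of edit distance (Levenshtein distance with unit costs for insertion, deletion and substitution). The function name and the example strings are illustrative choices, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b using a rolling DP row."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Citeseer", "CiteSeer"))  # 1
print(edit_distance("kitten", "sitting"))     # 3
```

A small distance suggests a typographical variant, but a threshold alone cannot handle abbreviations or different entities sharing the same name.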
Citation matching, author name disambiguation • Can be cast as classification/clustering • Usual information source • Coauthor information, titles and venues • i.e. within the records themselves (internal) • Models • Naïve Bayes, K-means, SVM, vector space model, graphical models, … • Some apply methods to reduce number of comparisons required
Resources • Internal resources • May contain insufficient information • Information may be difficult to extract • External resources • Web resources, ontologies • Contain additional freely available information • Objective • Combine internal and external resources
Mixed citation problem • Given an ambiguous name X (belonging to k different authors) • Given a list of citations C containing X • Which citations in C belong to which author? Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004. Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
Search engine results • For each citation c in C • Query search engine with title of c to obtain relevant URLs • Represent c by a feature vector of relevant URLs • Each URL weighted by its inverse host frequency • Cosine similarity between feature vectors • Perform clustering on C to derive k clusters
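The feature construction and similarity steps above might look roughly like the following sketch. Several things here are assumptions: the slide does not give the formula for inverse host frequency, so it is taken to be log(N / number of citations whose results include the host), by analogy with inverse document frequency; the search engine query is replaced by a hypothetical hard-coded `results` mapping; and all URLs are made up.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host(url: str) -> str:
    return urlparse(url).netloc


def ihf_weights(results: dict) -> dict:
    """Assumed IHF: log(N / number of citations whose result URLs include the host)."""
    n = len(results)
    host_df = Counter()
    for urls in results.values():
        host_df.update({host(u) for u in urls})
    return {h: math.log(n / df) for h, df in host_df.items()}


def vectorize(urls, ihf) -> dict:
    """Sparse feature vector: one dimension per URL, weighted by its host's IHF."""
    return {u: ihf[host(u)] for u in set(urls)}


def cosine(u: dict, v: dict) -> float:
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical search results: citation id -> URLs returned for its title.
results = {
    "c1": ["http://a.example/p1", "http://b.example/x"],
    "c2": ["http://a.example/p1", "http://c.example/y"],
    "c3": ["http://d.example/z"],
}
ihf = ihf_weights(results)
vectors = {c: vectorize(urls, ihf) for c, urls in results.items()}
print(cosine(vectors["c1"], vectors["c2"]))  # positive: one shared URL
print(cosine(vectors["c1"], vectors["c3"]))  # 0.0: no shared URL
```

The resulting pairwise similarities would then be passed to a clustering algorithm configured to produce k clusters; the slide does not name the algorithm, so that step is left out here.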
External coauthor network • Coauthor network from DBLP metadata • Delete the node representing X and its edges • Similarity between two author names computed as an inverse of their distance • Similarity between two citations is the pairwise sum of their author similarities • Figure labels: each node represents a name; two nodes are connected if they are coauthors in some DBLP citation
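A minimal sketch of the coauthor-network similarity described above. The slide says only "an inverse of their distance", so the form 1 / (1 + d) is an assumption (it gives 1 for identical names and is taken as 0 for unreachable ones); distance is assumed to be the shortest path length, and the author names and records are placeholders rather than real DBLP data.

```python
from collections import deque
from itertools import combinations, product


def build_network(records, ambiguous):
    """Coauthor graph: one node per name, edge if two names share a record.
    Skipping the ambiguous name is equivalent to deleting its node and edges."""
    graph = {}
    for authors in records:
        names = [a for a in authors if a != ambiguous]
        for a in names:
            graph.setdefault(a, set())
        for a, b in combinations(names, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def distance(graph, src, dst):
    """Shortest path length by breadth-first search, or None if unreachable."""
    if src == dst:
        return 0
    if src not in graph or dst not in graph:
        return None
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in graph[node]:
            if nb == dst:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None


def author_similarity(graph, a, b):
    d = distance(graph, a, b)
    return 0.0 if d is None else 1.0 / (1 + d)  # assumed inverse form


def citation_similarity(graph, authors1, authors2, ambiguous):
    """Sum of author similarities over all cross pairs, ignoring the ambiguous name."""
    o1 = [a for a in authors1 if a != ambiguous]
    o2 = [a for a in authors2 if a != ambiguous]
    return sum(author_similarity(graph, a, b) for a, b in product(o1, o2))


# Placeholder records; "X" is the ambiguous name.
records = [["X", "A", "B"], ["X", "C", "D"], ["B", "C"], ["E", "F"]]
graph = build_network(records, "X")
sim12 = citation_similarity(graph, ["X", "A", "B"], ["X", "C", "D"], "X")
sim13 = citation_similarity(graph, ["X", "A", "B"], ["X", "E"], "X")
print(sim12, sim13)
```

In this toy data the first two citations are linked through the B and C coauthorship, so they score higher than the pair with no connecting path.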
Venue name disambiguation • To determine e.g. “TREC” = “Text Retrieval Conference” • Not using other parts of the citation records • Problems • Abbreviations are extremely common • Venues change name over time • Experiments using Google in progress • Using URL features • Using Google snippets