Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra JCDL '05
Abstract • They consider the problem of ambiguous author names in bibliographic citations. • Scalable two-step framework: • Reduce the number of candidates via blocking (four methods) • Measure the distance between two names via coauthor information (seven measures)
Introduction • Citation records are important resources for academic communities. • Keeping citations correct and up-to-date has proved to be a challenging task at large scale. • They focus on the problem of ambiguous author names. • It is difficult to get the complete list of publications for some authors. • e.g., "John Doe" published 100 articles, but the DL keeps two separate purported author names, "John Doe" and "J. D. Doe", each containing 50 citations.
Problem • Problem definition: given a large collection of author names, for each name find the other name variants that refer to the same author. • The baseline approach: compare every pair of author names, which scales quadratically with the number of names.
Solution • Rather than comparing each pair of author names to find similar names, they advocate a scalable two-step name disambiguation framework. • Partition all author-name strings into blocks • Visit each block and compare all possible pairs of names within the block
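A minimal sketch of the two-step framework in Python. The names `block_key` and `distance` are placeholders for whichever blocking method and distance measure are plugged in; they are not from the paper's code.

```python
from collections import defaultdict

def disambiguate(names, block_key, distance, k=5):
    """Step 1: partition names into blocks; step 2: rank candidates per name."""
    # Step 1: blocking -- names that share a key land in the same block.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)

    # Step 2: within each block, find the top-k closest names for each name.
    results = {}
    for block in blocks.values():
        for name in block:
            candidates = [c for c in block if c != name]
            candidates.sort(key=lambda c: distance(name, c))
            results[name] = candidates[:k]
    return results
```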
Blocking (1/3) • The goal of step 1 is to put similar records into the same group according to some criterion. • They examine four representative blocking methods: • heuristics, token-based, n-gram, sampling
Blocking (2/3) • Spelling-based heuristics • Group author names based on name spellings • Heuristics: iFfL (initial of first name + full last name), iFiL (initials of first and last names), fL (full last name), and their combination • iFfL: e.g. "Jeffrey Ullman", "J. Ullman" • Token-based • Author names sharing at least one common token are grouped into the same block • e.g., "Jeffrey D. Ullman" and "Ullman, Jason" (shared token "Ullman")
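The spelling-based heuristics can be read as key functions: names with the same key fall into the same block. The sketch below assumes names written roughly as "First [Middle] Last" and simple whitespace tokenization; the exact normalization used in the paper may differ.

```python
from collections import defaultdict

def iffl_key(name):
    """iFfL: initial of first name + full last name
    ('Jeffrey Ullman' and 'J. Ullman' -> ('j', 'ullman'))."""
    tokens = name.replace(".", "").split()
    return (tokens[0][0].lower(), tokens[-1].lower())

def ifil_key(name):
    """iFiL: initials of the first and last names."""
    tokens = name.replace(".", "").split()
    return (tokens[0][0].lower(), tokens[-1][0].lower())

def fl_key(name):
    """fL: full last name only."""
    return name.replace(".", "").split()[-1].lower()

def token_blocks(names):
    """Token-based blocking: a name joins the block of every token it contains,
    so 'Jeffrey D. Ullman' and 'Ullman, Jason' meet in the 'ullman' block."""
    blocks = defaultdict(set)
    for name in names:
        for token in name.lower().replace(",", " ").replace(".", " ").split():
            blocks[token].add(name)
    return blocks
```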
Blocking (3/3) • N-gram • N = 4 • N-gram blocking puts the largest number of author names into the same block. • e.g. "David R. Johnson", "F. Barr-David" • Sampling • Sampling-based join approximation • Each token from all author names has a TF-IDF weight. • Each author name has its token weight vector. • All pairs of names with similarity of at least θ are put into the same block.
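A sketch of the remaining two methods, assuming whitespace tokenization and raw term counts for TF. The paper approximates the θ-similarity join by sampling; the version below computes it exactly over all pairs and is only meant to show the idea.

```python
import math
from collections import defaultdict
from itertools import combinations

def ngram_blocks(names, n=4):
    """Each name joins the block of every character n-gram it contains (N = 4),
    so 'David R. Johnson' and 'F. Barr-David' meet through the 4-gram 'davi'."""
    blocks = defaultdict(set)
    for name in names:
        s = name.lower().replace(" ", "")
        for i in range(len(s) - n + 1):
            blocks[s[i:i + n]].add(name)
    return blocks

def tfidf_vectors(names):
    """Assign each token a TF-IDF weight and build one normalized vector per name."""
    docs = [name.lower().replace(",", " ").split() for name in names]
    df = defaultdict(int)
    for tokens in docs:
        for t in set(tokens):
            df[t] += 1
    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        vec = {t: tokens.count(t) * math.log(n_docs / df[t]) for t in set(tokens)}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def threshold_pairs(names, theta=0.5):
    """Pair names whose TF-IDF cosine similarity is at least theta."""
    vecs = tfidf_vectors(names)
    pairs = []
    for (i, vi), (j, vj) in combinations(enumerate(vecs), 2):
        sim = sum(w * vj.get(t, 0.0) for t, w in vi.items())
        if sim >= theta:
            pairs.append((names[i], names[j]))
    return pairs
```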
Measuring Distances • The goal of step 2 is, for each block, to identify the top-k author names that are closest to each name. • Supervised methods • Naïve Bayes Model, Support Vector Machine • Unsupervised methods • String-based Distance, Vector-based Cosine Distance
Supervised Methods (1) • Naïve Bayes Model • Training: • The collection of coauthors of x is randomly split, and only one half is used for training. • They estimate each coauthor's conditional probability P(Aj|x) • Testing: the held-out coauthors are scored under each candidate name's model, and candidates are ranked by probability.
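A simplified sketch of scoring with per-coauthor conditional probabilities. The add-one smoothing and log-likelihood scoring below are assumptions, not necessarily the paper's exact estimator.

```python
import math
from collections import Counter

def train_naive_bayes(training_coauthors):
    """Estimate P(A_j | x) from the training half of x's coauthor list
    (add-one smoothing is an assumption made for this sketch)."""
    counts = Counter(training_coauthors)
    total = sum(counts.values())
    vocab = len(counts)
    return lambda coauthor: (counts.get(coauthor, 0) + 1) / (total + vocab + 1)

def score(candidate_model, test_coauthors):
    """Log-likelihood of the held-out coauthors under a candidate's model;
    the candidate with the highest score is ranked first."""
    return sum(math.log(candidate_model(a)) for a in test_coauthors)
```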
Supervised Methods (2) • Support Vector Machine • All coauthor information of an author in a block is transformed into a vector-space representation. • Author names in a block are randomly split; 50% is used for training and the other 50% for testing. • SVM creates a maximum-margin hyperplane that separates the YES and NO training examples. • In testing, the SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space. • Radial Basis Function (RBF) kernel
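A minimal, hypothetical sketch of this setup using scikit-learn (which postdates the paper and is used here only for illustration): coauthor lists become term-count vectors, and an RBF-kernel SVM separates YES and NO examples. The sample documents and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Each training example is the coauthor list of one author name, labeled
# 1 (YES, same author) or 0 (NO) -- an assumed, toy setup.
train_docs = ["ullman garcia-molina widom", "ullman aho hopcroft", "smith jones"]
train_labels = [1, 1, 0]

vectorizer = CountVectorizer()            # coauthor lists -> sparse count vectors
X_train = vectorizer.fit_transform(train_docs)

clf = SVC(kernel="rbf")                   # maximum-margin classifier, RBF kernel
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["aho ullman sethi"])
print(clf.predict(X_test))                # [1] -> classified as the same author
```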
Unsupervised Methods (1) • String-based Distance • The distance between two author names is measured by the "distance" between their coauthor lists. • Two token-based string distances • Two edit-distance-based string distances
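The paper's specific measures are not reproduced here; as stand-ins, the sketch below shows one token-based distance (Jaccard over coauthor sets) and one edit-distance computation (Levenshtein) that could be applied to coauthor strings.

```python
def jaccard_distance(coauthors_a, coauthors_b):
    """Token-based: distance between the two coauthor sets."""
    a, b = set(coauthors_a), set(coauthors_b)
    return (1.0 - len(a & b) / len(a | b)) if (a | b) else 0.0

def levenshtein(s, t):
    """Edit-distance-based: minimum number of single-character edits
    turning string s into string t (standard dynamic program)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]
```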
Unsupervised Methods(2) • Vector-based Cosine Distance • They model the coauthor lists as vectors in the vector space and compute the distances between the vectors. • They use the simple cosine distance.
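A minimal cosine-distance sketch over coauthor frequency vectors; using raw counts as weights is an assumption made for this sketch.

```python
import math
from collections import Counter

def cosine_distance(coauthors_a, coauthors_b):
    """1 - cosine similarity of the two coauthor frequency vectors."""
    va, vb = Counter(coauthors_a), Counter(coauthors_b)
    dot = sum(va[t] * vb.get(t, 0) for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return (1.0 - dot / (na * nb)) if na and nb else 1.0
```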
Data Sets • They gathered real citation data from four different domains: • DBLP, e-Print, BioMed, EconPapers • Different disciplines appear to have slightly different citation policies, and citation conventions also vary, e.g.: • Number of coauthors per article • Use of the initial of the first name instead of the full name
Artificial name variants • Given the large number of citations, it is neither possible nor practical to build a "real" solution set. • They pick the top-100 author names from Y according to their number of citations and artificially generate 100 corresponding new name variants. • e.g., for "Grzegorz Rozenberg" with 344 citations and 114 coauthors in DBLP, they create a new name like "G. Rozenberg" or "Grzegorz Rozenbergg". • The original 344 citations are split in half, so each name carries 172 citations. • They test whether the algorithm is able to find the corresponding artificial name variant in Y.
Artificial name variants • Error types, e.g. for "Ji-Woo K. Li": • Abbreviation: "J. K. Li" • Name alternation: "Li, Ji-Woo K." • Typo: "Ji-Woo K. Lee" or "Jee-Woo K. Li" • Contraction: "Jiwoo K. Li" • Omission: "Ji-Woo Li" • Combinations • The effect of each error type on the accuracy of name disambiguation is quantified.
Artificial name variants • (1) mixed error types of abbreviation (30%), alternation (30%), typo (12% each in first/last name), contraction (2%), omission (4%), and combination (10%) • (2) abbreviation of the first name (85%) and typo (15%)
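A hypothetical generator along the lines of scheme (1). The exact perturbations (e.g. which character a typo hits) are illustrative only, and the "combination" case is omitted for brevity.

```python
import random

def make_variant(name, rng=random.Random(0)):
    """Generate one artificial variant of an author name using the listed
    error types (weights roughly follow scheme (1), minus combinations)."""
    first, *middle, last = name.split()
    kind = rng.choices(
        ["abbreviation", "alternation", "typo", "contraction", "omission"],
        weights=[30, 30, 24, 2, 4])[0]
    if kind == "abbreviation":                 # "Ji-Woo K. Li" -> "J. K. Li"
        return " ".join([first[0] + "."] + middle + [last])
    if kind == "alternation":                  # -> "Li, Ji-Woo K."
        return f"{last}, {' '.join([first] + middle)}"
    if kind == "typo":                         # perturb one character of the last name
        i = rng.randrange(len(last))
        return " ".join([first] + middle + [last[:i] + "x" + last[i + 1:]])
    if kind == "contraction":                  # "Ji-Woo" -> "Jiwoo"
        return " ".join([first.replace("-", "")] + middle + [last])
    return " ".join([first, last])             # omission: drop middle name(s)
```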
Evaluation metrics • Scalability • Size of blocks generated in step 1 • Time it took to process both steps 1 and 2 • Accuracy • They measure top-k accuracy: whether the true name variant appears among the top-k candidates returned.
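Top-k accuracy can be computed directly from the framework's ranked output. Here `results` (query name → ranked candidates) and `truth` (query name → true variant) are hypothetical structures introduced for this sketch.

```python
def top_k_accuracy(results, truth, k=5):
    """Fraction of query names whose true variant appears among the
    top-k candidates returned by the framework."""
    hits = sum(1 for name, variant in truth.items()
               if variant in results.get(name, [])[:k])
    return hits / len(truth)
```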
Scalability • The average # of authors in each block • Processing time for step 1 and 2
Accuracy • Four blocking methods combined with seven distance metrics for all four data sets with k = 5. • The EconPapers data set is omitted.
Conclusion • They compared various configurations (four blocking methods in step 1, seven distance metrics using "coauthor" information in step 2) against four data sets. • A combination of token-based or N-gram blocking (step 1) and SVM as a supervised method or the cosine metric as an unsupervised method (step 2) gave the best scalability/accuracy trade-off. • The accuracy of simple name-spelling-based heuristics was shown to be quite sensitive to the error types. • Edit-distance-based metrics such as Jaro or Jaro-Winkler proved inadequate for the large-scale name disambiguation problem because of their slow processing time.