Efficient Topic-based Unsupervised Name Disambiguation
Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles
JCDL 2007
Advisor: Hsin-Hsi Chen
Reporter: Y.H Chang
2008-03-21
Outline • Introduction • Related Work • Method • Topic-based PLSA (Probabilistic Latent Semantic Analysis) • Topic-based LDA (Latent Dirichlet Allocation) • Clustering • Experiment • Conclusion
Introduction
• Name ambiguity: sharing the same name, misspellings, name abbreviations
• Searching Google for “Yang Song”: the first result page shows the home pages of five different people
• In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents.
Introduction
• Method (a code sketch follows below):
1. Learn a topic-name matrix with PLSA and LDA (feature set)
2. Disambiguate topics with an agglomerative clustering method
3. Within similar topics: generate a name-name matrix
4. Disambiguate people with another agglomerative clustering method
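To make the two-stage flow concrete, here is a minimal Python sketch. The clustering settings, the thresholds, and the name_sim function are our own placeholder assumptions, not the authors' implementation:

```python
# A minimal sketch of the two-stage disambiguation pipeline above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def two_stage_disambiguation(topic_name, names, name_sim, t_topic=0.5, t_name=0.5):
    """topic_name: (num_names x K) matrix of topic features per name."""
    # Stage 1: agglomerative clustering of names in topic space.
    topic_labels = fcluster(linkage(pdist(topic_name, metric="cosine"),
                                    method="average"),
                            t=t_topic, criterion="distance")
    clusters = {}
    for idx, lab in enumerate(topic_labels):
        clusters.setdefault(lab, []).append(idx)
    # Stage 2: within each topic cluster, cluster again on string similarity.
    people = []
    for members in clusters.values():
        if len(members) == 1:
            people.append(members)
            continue
        m = len(members)
        d = np.zeros((m, m))  # name-name distance matrix
        for i in range(m):
            for j in range(i + 1, m):
                d[i, j] = d[j, i] = 1.0 - name_sim(names[members[i]],
                                                   names[members[j]])
        cond = d[np.triu_indices(m, k=1)]  # condensed form for linkage
        labs = fcluster(linkage(cond, method="average"),
                        t=t_name, criterion="distance")
        for lab in set(labs):
            people.append([members[i] for i in range(m) if labs[i] == lab])
    return people
```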
Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion
Related Work
• [19] G. S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. 2003 (transitivity problem)
• [9] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a K-way spectral clustering method. 2005 (complexity O(N²))
• [12] J. Huang, S. Ertekin, and C. L. Giles. Efficient name disambiguation for large-scale databases. 2006
• [2] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. 2006
• The aforementioned work mainly tackled name disambiguation using the metadata records of the authors. This paper solves the problem in a novel way, by accounting for the topic distributions of the authors and adopting unsupervised methods.
Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion
PLSA
• From a statistical point of view, Hofmann (1999) presented an alternative to LSA: Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), which discovers sets of latent variables.
• The model is described as an aspect model, assuming the existence of hidden factors underlying the co-occurrences between two sets of objects.
PLSA
[Graphical model: documents d generate K latent topics z; each topic generates people’s names a and words w]
• The goal of model fitting for PLSA is to estimate the parameters P(z), P(a|z), P(z|d), P(w|z), given a set of observations (d, a, w). The standard way to estimate the probability values is the Expectation-Maximization (EM) algorithm.
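Written out, the aspect model behind these parameters takes the standard PLSA form, here extended with names (our reconstruction from the parameter list above, not an equation copied from the paper):

$$ P(d, a, w) = P(d) \sum_{z} P(z \mid d)\, P(a \mid z)\, P(w \mid z) $$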
PLSA
[Slide image: EM update equations for the model parameters]
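For reference, the standard EM updates for this aspect model are as follows (our reconstruction; the E-step below is presumably the slides' equation (4)).

E-step:

$$ P(z \mid d, a, w) = \frac{P(z \mid d)\, P(a \mid z)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(a \mid z')\, P(w \mid z')} $$

M-step:

$$ P(w \mid z) \propto \sum_{d,a} n(d,a,w)\, P(z \mid d,a,w), \quad P(a \mid z) \propto \sum_{d,w} n(d,a,w)\, P(z \mid d,a,w), \quad P(z \mid d) \propto \sum_{a,w} n(d,a,w)\, P(z \mid d,a,w) $$

where n(d, a, w) is the number of co-occurrences of document d, name a, and word w.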
PLSA: Predicting New Name Appearances
• In PLSA there is no natural way to assign probabilities to new documents.
• Therefore, to predict the topics of new documents (with potentially new names) after training, the estimated P(w|z) parameters are used to estimate P(a|z) for new names a in a test document d_new through a “folding-in” process.
• Specifically, the E-step is the same as equation (4); the M-step, however, keeps the original P(w|z) fixed and only updates P(a|z) and P(z|d).
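Below is a minimal numpy sketch of this folding-in loop. It is our own simplification: word and name tokens are treated as separate observations rather than joint (d, a, w) triples, and all array shapes and helper names are assumptions:

```python
import numpy as np

def fold_in(p_w_given_z, word_ids, name_ids, n_names, n_iters=50):
    """Fold a new document into a trained PLSA model.

    p_w_given_z: (K, V) trained word distributions P(w|z), kept fixed.
    word_ids / name_ids: integer token ids observed in the new document.
    Returns P(z|d_new) and the re-estimated P(a|z) for the new names.
    """
    K = p_w_given_z.shape[0]
    rng = np.random.default_rng(0)
    p_z_d = rng.dirichlet(np.ones(K))              # P(z | d_new)
    p_a_z = np.full((K, n_names), 1.0 / n_names)   # P(a | z)
    for _ in range(n_iters):
        # E-step: P(z|d,w) ∝ P(z|d) P(w|z) and P(z|d,a) ∝ P(z|d) P(a|z)
        post_w = p_z_d[:, None] * p_w_given_z[:, word_ids]
        post_w /= post_w.sum(axis=0, keepdims=True)
        post_a = p_z_d[:, None] * p_a_z[:, name_ids]
        post_a /= post_a.sum(axis=0, keepdims=True)
        # M-step: update only P(z|d) and P(a|z); P(w|z) stays fixed.
        p_z_d = np.concatenate([post_w, post_a], axis=1).sum(axis=1)
        p_z_d /= p_z_d.sum()
        for k in range(K):
            counts = np.bincount(name_ids, weights=post_a[k],
                                 minlength=n_names)
            p_a_z[k] = (counts + 1e-12) / (counts.sum() + 1e-12 * n_names)
    return p_z_d, p_a_z
```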
LDA
• Blei et al. (2003) introduced a Bayesian hierarchical model, Latent Dirichlet Allocation (LDA), in which each document has its own topic distribution, drawn from a conjugate Dirichlet prior that is shared by all documents in a collection.
LDA
• In our model, names (authors) and words are not directly related, i.e., each topic can generate a set of names and a set of words simultaneously with different probabilities, allowing the model more freedom in parameter estimation.
• Generative process: each document d has a multinomial topic distribution θ_d, and each topic z has a multinomial word distribution φ_z and a multinomial name distribution λ_z. For each token i of document d, draw a topic z_di from θ_d, a word w_di from φ_{z_di}, and a name a_di from λ_{z_di}.
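In symbols, the per-token probability under this model factorizes as (our paraphrase of the generative process above):

$$ P(w_{di}, a_{di}, z_{di} \mid \theta_d, \phi, \lambda) = \theta_{d,\, z_{di}} \; \phi_{z_{di},\, w_{di}} \; \lambda_{z_{di},\, a_{di}} $$

with Dirichlet priors on θ_d, φ_z, and λ_z whose hyperparameters are fixed on the next slide.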
LDA
• In the following section, we apply the Gibbs sampling framework to get around the intractability of exact parameter estimation in this model.
Gibbs Sampling for the LDA Model
• Note that in our case, we do not estimate the hyperparameters α, β and λ. For simplicity and performance, they are fixed at 50/K, 0.01 and 0.1 respectively.
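A collapsed Gibbs update for this kind of name-extended LDA typically takes the following form (our reconstruction under the generative process above, not the paper's exact equation):

$$ P(z_{di} = k \mid \mathbf{z}_{-di}, \mathbf{w}, \mathbf{a}) \;\propto\; (n_{dk}^{-di} + \alpha) \cdot \frac{n_{k w_{di}}^{-di} + \beta}{n_{k \cdot}^{-di} + V\beta} \cdot \frac{m_{k a_{di}}^{-di} + \lambda}{m_{k \cdot}^{-di} + A\lambda} $$

where n_dk counts the tokens of document d assigned to topic k, n_kw counts assignments of word w to topic k, m_ka counts assignments of name a to topic k (all excluding the current token), and V and A are the numbers of distinct words and names.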
Clustering
1. Learn the topic-name matrix with PLSA and LDA (feature set)
2. Disambiguate topics with an agglomerative clustering method
3. Within similar topics: generate a name-name matrix
• Levenshtein distance, written Le(x, y), is used as the measurement, from which the similarity between two names x and y is derived (a sketch follows below)
4. Disambiguate people with another agglomerative clustering method
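A small Python sketch of turning Le(x, y) into a name-name similarity. The normalization by the longer name's length is a common choice and an assumption here, not necessarily the paper's exact formula:

```python
def levenshtein(x: str, y: str) -> int:
    """Classic dynamic-programming edit distance Le(x, y)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cx != cy)))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(x: str, y: str) -> float:
    """Map Le(x, y) into [0, 1]; 1.0 means identical strings."""
    if not x and not y:
        return 1.0
    return 1.0 - levenshtein(x, y) / max(len(x), len(y))

print(name_similarity("Y. Song", "Yang Song"))  # ≈ 0.67
```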
Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion
Experiment
• Web appearances of person names
• 12 person names => 187 different people
• The names, including those of SRI employees and professors, are submitted as queries to the Google search engine, and the first 100 pages are retrieved for each query. Furthermore, to eliminate the bias towards longer documents, only the first 200 words are used in each example.
• Author appearances in scientific documents
• We obtained the 9 most ambiguous author names from the entire data set, each of which has at least 20 name variations. In the worst case (C. Chen), 103 authors share the same name.
Experiment
• Evaluation: pair-level pairwise F1 score F1P and cluster-level pairwise F1 score F1C (formulas below)
• F1P is the harmonic mean of pairwise precision pp and pairwise recall pr
• Likewise, F1C is the harmonic mean of cluster precision cp and cluster recall cr
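Both scores follow the standard harmonic-mean form:

$$ F1_P = \frac{2\, p_p\, p_r}{p_p + p_r}, \qquad F1_C = \frac{2\, c_p\, c_r}{c_p + c_r} $$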
[Figure: author-topic relationships in the CiteSeer data set extracted by the topic-based PLSA model]
Experiment
[Results figures]
Experiment
• As a result, we empirically tested our models on the entire CiteSeer data set of more than 750,000 documents.
• PLSA yields 418,500 unique authors in 2,570 minutes, while LDA finishes in 4,390 minutes with 418,775 authors (roughly 1-3 days).
Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion
Conclusion
• We have proposed a novel framework for unsupervised name disambiguation by leveraging graphical Bayesian models and a hierarchical clustering method.
• Although our primary focus in this paper is person name disambiguation, our general approach should be equally applicable to other entity disambiguation domains.
• Potential applications include noun phrase disambiguation, e.g., “tiger” as an animal, “tiger” as a golf player, “tiger” as a baseball team, “tiger” as an operating system, or “tiger” as the codename of a new Java version.