Author Name Disambiguation for Citations Using Topic and Web Correlation

Prior work • Supervised classification approaches: Model all authors’ patterns from a set of training data. • Unsupervised Classification approaches: Ambiguous citations are clustered into groups of distinct authors by measuring the similarities between the attributes in the citations.

Proposed Approach • Topic Correlation • Web Correlation • Pair-Wise Grouping Algorithm

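Topic Correlation • Build a topic association network 1. Use the Apriori algorithm to construct a directed graph whose edge weights are confidence values (the result is a hypergraph). 2. Use the k-way hypergraph partition algorithm to decompose the hypergraph into clusters. 3. These clusters are called the topic association network; the strength of the correlation between research topics is the distance between citations in this network.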
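The confidence weights used for the directed graph can be illustrated with a minimal sketch. It assumes each citation has already been reduced to a set of topic labels (the labels and the `min_support` cutoff below are hypothetical), and it only covers pairwise rules a -> b; the hypergraph partitioning step is not shown.

```python
from collections import Counter
from itertools import permutations

def rule_confidences(citation_topics, min_support=2):
    """Return {(a, b): confidence} for pairwise association rules a -> b.

    confidence(a -> b) = support({a, b}) / support({a}).
    Rules whose pair occurs in fewer than min_support citations are dropped.
    """
    single = Counter()
    pair = Counter()
    for topics in citation_topics:
        topics = set(topics)
        single.update(topics)
        # Count both directions (a, b) and (b, a) for every co-occurring pair.
        pair.update(permutations(sorted(topics), 2))
    return {
        (a, b): count / single[a]
        for (a, b), count in pair.items()
        if count >= min_support
    }

# Hypothetical toy data: topic labels per citation.
citations = [
    {"clustering", "web mining"},
    {"clustering", "web mining", "databases"},
    {"clustering", "databases"},
    {"web mining"},
]
edges = rule_confidences(citations)
```

Each key of `edges` is a directed edge and each value is its weight; note that the rule a -> b and the rule b -> a generally get different confidences, which is why the graph is directed.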

Web Correlation • Use each title to query a search engine. • Filter the URLs of several digital libraries. • If two citations appear in the same URL, we use them as an instance of Web correlation.
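The co-occurrence test can be sketched as follows. This assumes the search results have already been collected into a mapping from citation title to returned URLs; no real search engine is called, and the digital library hosts listed are placeholders.

```python
from itertools import combinations
from urllib.parse import urlparse

# Hypothetical hosts of digital libraries whose URLs are filtered out.
DIGITAL_LIBRARY_HOSTS = {"dl.example.org", "library.example.com"}

def web_correlated_pairs(results):
    """Return the set of title pairs that share at least one non-library URL.

    results maps a citation title to the list of URLs returned for it.
    """
    kept = {
        title: {u for u in urls if urlparse(u).netloc not in DIGITAL_LIBRARY_HOSTS}
        for title, urls in results.items()
    }
    return {
        (a, b)
        for a, b in combinations(sorted(kept), 2)
        if kept[a] & kept[b]  # non-empty intersection: same page lists both
    }
```

Filtering the digital libraries first matters because such sites list papers by many different authors with the same name, so co-occurrence there says little about identity.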

Pair-Wise Grouping Algorithm • Generate pairs of citations by using similarity metrics • Use the training data to train a binary classifier • Apply the classifier to determine whether the pairs are matched • Combine the predicted results to group the citations into appropriate clusters. • Filter out the pairs that would make the clusters sparse.
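The grouping step can be sketched with a union-find structure. This is only an illustration under stated assumptions: `is_match` is a hypothetical stand-in for the trained binary classifier, and the final filtering of sparse clusters is left out.

```python
from itertools import combinations

def group_citations(citations, is_match):
    """Group citation ids by merging every pair that is_match accepts."""
    parent = {c: c for c in citations}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(citations, 2):
        if is_match(a, b):
            parent[find(a)] = find(b)

    clusters = {}
    for c in citations:
        clusters.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical toy data: a shared coauthor stands in for the classifier.
coauthors = {
    "c1": {"Lee", "Kim"},
    "c2": {"Kim", "Park"},
    "c3": {"Park"},
    "c4": {"Garcia"},
}
clusters = group_citations(
    list(coauthors), lambda a, b: bool(coauthors[a] & coauthors[b])
)
```

Because matches are merged transitively, `c1` and `c3` end up together through `c2` even though they share no coauthor directly; this is the behaviour that makes a later filtering step necessary.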

Pair-Wise Similarity Metrics • Similarity metrics for coauthor, title, and venue: 1. CSM 2. MSF • Similarity metric for topic correlation: TSM • Similarity metric for web correlation: MNDF

Binary Classifier • A binary classifier is used to learn the distribution of pair-wise vectors. • The pairs predicted as matched are used to build citation clusters (by constructing an undirected graph).

Cluster Filter • A threshold is set for choosing which bridges should be removed. • A bridge is removed if the number of vertices in each of the two components it connects is above the given threshold.
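A minimal sketch of such a filter follows. It assumes the graph is given as a symmetric adjacency dictionary and reads the rule as "both sides of the bridge must exceed the threshold"; that reading, and the fixed edge order, are assumptions rather than details taken from the slides.

```python
def reachable(graph, start):
    """Return the set of vertices reachable from start."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def filter_bridges(graph, threshold):
    """Return a copy of graph with bridges between large components removed."""
    g = {v: set(ns) for v, ns in graph.items()}
    edges = sorted({tuple(sorted((u, v))) for u in g for v in g[u]})
    for u, v in edges:
        # Tentatively remove the edge and check whether u and v separate.
        g[u].discard(v)
        g[v].discard(u)
        side_u = reachable(g, u)
        is_bridge = v not in side_u
        if is_bridge and len(side_u) > threshold and len(reachable(g, v)) > threshold:
            continue  # leave the bridge removed
        g[u].add(v)
        g[v].add(u)
    return g
```

For clarity this recomputes reachability for every edge, which is quadratic in the worst case; a linear-time bridge-finding algorithm would be the natural replacement on large graphs.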

Detecting Ambiguous Author Names in Crowdsourced Scholarly Data

Prior Work • Name disambiguation has been cast into the problem of clustering a set of publications into profiles such that each profile corresponds to a single author.

Name Variations and Citations • Extract the name variations from a collection of publications • Sort them by number of citations • Look at the percentage of the total citations that are attributed to the top name variations. (A high percentage suggests that the name is not ambiguous.)
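The citation-share signal can be sketched in a few lines. The input format (a mapping from name variation to citation count) and the `top_k` parameter are assumptions made for illustration.

```python
def top_variation_share(citations_by_variation, top_k=1):
    """Fraction of all citations attributed to the top_k name variations."""
    counts = sorted(citations_by_variation.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:top_k]) / total if total else 0.0

# Hypothetical counts for the variations of one author name.
share = top_variation_share({"J. Smith": 90, "John Smith": 8, "J. A. Smith": 2})
```

Where to place the cutoff between "ambiguous" and "not ambiguous" is not stated on the slide, so no threshold is hard-coded here.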

Topic Consistency • Leverage the discipline tags crowdsourced from the users of the Scholarometer system • Detect different but related disciplines associated with an author name: • Map an author’s publications to topics, and measure the similarity between these topics. • Derive an author’s topic profile

A brief survey of automatic methods for author name disambiguation

Two problems • Synonyms: the same author may appear under distinct names • Polysems: distinct authors may have similar names.

Proposed taxonomy

Author Grouping Methods • Defining a similarity function: 1. Using predefined functions: the Levenshtein distance, Jaccard coefficient, cosine similarity, soft-TFIDF and others. 2. Learning a similarity function: Use the training data to produce a similarity function S from R × R (R: the set of references) to {0, 1}, where 1 means that the two references refer to the same author and 0 means that they do not. 3. Exploiting graph-based similarity functions: Create a coauthorship graph G = (V, E) for each ambiguous group. Each distinct coauthor name is represented by a single vertex, and the weight of an edge is related to the number of articles coauthored by the author names that its two vertices represent.
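Three of the predefined functions named above can be written with the standard library alone. This is a plain sketch: inputs are assumed to be strings (Levenshtein) or token sequences (Jaccard, cosine), and soft-TFIDF is omitted.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard coefficient of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def levenshtein(s, t):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def cosine(a, b):
    """Cosine similarity of two token sequences using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = math.sqrt(sum(x * x for x in va.values())) * \
        math.sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0
```

Levenshtein returns a distance (lower means more similar) while the other two return similarities in [0, 1], so they need to be put on a common scale before being combined.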

Author Grouping Methods • Clustering techniques: 1. Partitioning 2. Hierarchical agglomerative clustering 3. Density-based clustering 4. Spectral clustering
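As one example from this list, hierarchical agglomerative clustering can be sketched as below. The single-linkage rule and the stopping `threshold` are assumptions chosen to keep the example short; the surveyed methods differ in both.

```python
def agglomerative(items, similarity, threshold):
    """Single-linkage agglomerative clustering with a similarity cutoff.

    Repeatedly merges the two most similar clusters until no pair of
    clusters has similarity of at least threshold.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                s = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if s >= best:
                    best, best_pair = s, (i, j)
        if best_pair is None:
            break
        i, j = best_pair
        clusters[i] += clusters.pop(j)
    return clusters
```

In the disambiguation setting, `items` would be the references in one ambiguous group and `similarity` one of the similarity functions discussed earlier.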

Author assignment methods • Classification: Assign the references to their authors using a supervised machine learning technique. • Clustering: Use probabilistic techniques that iteratively fit a model to determine the author of each reference.

Explored evidence • Citation information: the attributes directly extracted from the citations, such as author/coauthor names, work title, publication venue title, year, and so on. • Web information: Data retrieved from the web that is used as additional information about an author publication profile. • Implicit evidence: Evidence inferred from visible elements of attributes, such as the latent topics of a citation.

Summary of characteristics-Author grouping methods

Summary of characteristics-Author assignment methods

Open challenges • Very little data in the citations • Very ambiguous cases -- ambiguous references will have coauthors who also have ambiguous names (especially Asian names) • Citations with errors • Efficiency • Different knowledge areas -- our focus is only on computer science • Incremental disambiguation • Author profile changes • New authors

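pandasearch name ambiguity research plan • Read the related papers and identify the solution best suited to the current problem. • Focus on implicit evidence and web information (especially scholars' personal homepages and CVs). • Work on both efficiency and accuracy, with the emphasis on accuracy. • Study the fundamentals of data mining and machine learning.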

pandasearch name ambiguity implementation plan • Type of approach: author grouping methods, learning a similarity function. • Explored evidence: citation information, web information, implicit evidence.
