Author Name Disambiguation for Citations Using Topic and Web Correlation
Prior work • Supervised classification approaches: Model all authors’ patterns from a set of training data. • Unsupervised classification approaches: Ambiguous citations are clustered into groups of distinct authors by measuring the similarities between the attributes in the citations.
Proposed Approach • Topic Correlation • Web Correlation • Pair-Wise Grouping Algorithm
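Topic Correlation • Build a topic association network 1. Use the Apriori algorithm to construct a directed graph whose edge weights are confidence values (the result is a hypergraph). 2. Use a k-way hypergraph partition algorithm to split the hypergraph into clusters. 3. These clusters are called the topic association network; the strength of the correlation between research topics is the distance between citations in this network.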
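The first step above can be illustrated with a minimal sketch. It assumes each citation has already been mapped to a set of topic labels, and it only covers pairwise rules (2-itemsets), not the full Apriori search or the hypergraph partitioning step. The function name `rule_confidences`, the `min_support` parameter, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rule_confidences(transactions, min_support=2):
    """Return {(a, b): confidence of rule a -> b} for frequent topic pairs.

    Assumption: confidence(a -> b) = support(a and b) / support(a),
    the usual association-rule definition.
    """
    item_count = Counter()
    pair_count = Counter()
    for topics in transactions:
        unique = sorted(set(topics))
        item_count.update(unique)
        pair_count.update(combinations(unique, 2))

    edges = {}
    for (a, b), n in pair_count.items():
        if n < min_support:
            continue  # prune infrequent pairs, as Apriori would
        edges[(a, b)] = n / item_count[a]  # directed edge a -> b
        edges[(b, a)] = n / item_count[b]  # directed edge b -> a
    return edges


# Hypothetical toy data: each set holds the topics of one citation.
citations = [
    {"clustering", "name disambiguation"},
    {"clustering", "name disambiguation", "web mining"},
    {"clustering", "web mining"},
    {"databases"},
]
edges = rule_confidences(citations)
```

The resulting dictionary can be read as a weighted directed graph: every citation about "name disambiguation" here is also about "clustering", so that edge gets confidence 1.0, while the reverse edge is weaker.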
Web Correlation • Use each title to query a search engine. • Filter the URLs of several digital libraries. • If two citations appear in the same URL, we use them as an instance of Web correlation.
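A rough sketch of the check described above, assuming the search-engine results for each title have already been collected as lists of URLs. It reads "filter" as "exclude digital-library pages", which is an interpretation of the slide; the host list and the function names are hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical hosts standing in for the digital libraries to exclude.
DIGITAL_LIBRARY_HOSTS = {"dl.example.org", "library.example.com"}


def filter_urls(urls, blocked_hosts=DIGITAL_LIBRARY_HOSTS):
    """Drop URLs hosted on the (assumed) digital-library sites."""
    return {u for u in urls if urlparse(u).netloc.lower() not in blocked_hosts}


def web_correlated(urls_a, urls_b):
    """True if two citations share at least one remaining URL."""
    return bool(filter_urls(urls_a) & filter_urls(urls_b))


# Hypothetical search results for three citation titles.
hits_a = ["http://dl.example.org/p1", "http://people.example.edu/~alice/pubs.html"]
hits_b = ["http://people.example.edu/~alice/pubs.html"]
hits_c = ["http://dl.example.org/p1"]
```

Here `hits_a` and `hits_b` share a personal publication page, so they count as an instance of Web correlation; `hits_a` and `hits_c` share only a digital-library page, which is filtered out.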
Pair-Wise Grouping Algorithm • Generate pairs of citations by using similarity metrics • Use the training data to train a binary classifier • Apply the classifier to determine whether the pairs are matched • Combine the predicted results to group the citations into appropriate clusters. • Filter out the pairs that would make the clusters sparse.
Pair-Wise Similarity Metrics • Similarity metrics for coauthor, title, and venue: 1. CSM 2. MSF • Similarity metric for topic correlation: TSM • Similarity metric for web correlation: MNDF
Binary Classifier • A binary classifier is used to learn the distribution of pair-wise vectors. • The pairs predicted as matched are used to build citation clusters (by constructing an undirected graph).
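One way to turn matched pairs into clusters is to treat each pair as an undirected edge and take connected components. The sketch below assumes that reading; the slides do not say how the graph is turned into clusters. It uses a small union-find structure, and `cluster_from_pairs` is a hypothetical name.

```python
def cluster_from_pairs(n_citations, matched_pairs):
    """Group citation indices 0..n-1 into connected components.

    matched_pairs: iterable of (i, j) index pairs predicted as "same author".
    """
    parent = list(range(n_citations))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for i in range(n_citations):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical classifier output for five citations.
clusters = cluster_from_pairs(5, [(0, 1), (1, 2), (3, 4)])
```

With these pairs, citations 0, 1, and 2 end up in one cluster and 3 and 4 in another.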
Cluster Filter • A threshold is set for choosing which bridges should be removed. • A bridge is removed if the two components it connects each contain more vertices than the given threshold.
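The bridge rule can be sketched as follows, assuming a bridge is an edge whose removal disconnects its two endpoints and that "above the threshold" means strictly greater. This brute-force version re-runs a breadth-first search per edge, so it is meant for illustration rather than efficiency; `filter_bridges` is a hypothetical name.

```python
from collections import defaultdict, deque


def _reachable(adj, start, skip_edge):
    """Vertices reachable from start without crossing skip_edge."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if {u, v} == skip_edge or v in seen:
                continue
            seen.add(v)
            queue.append(v)
    return seen


def filter_bridges(edges, threshold):
    """Return the edges kept after dropping bridges between large components."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    kept = []
    for u, v in edges:
        side_u = _reachable(adj, u, {u, v})
        if v in side_u:
            kept.append((u, v))  # not a bridge: endpoints stay connected
            continue
        side_v = _reachable(adj, v, {u, v})
        if len(side_u) > threshold and len(side_v) > threshold:
            continue  # bridge joining two large components: remove it
        kept.append((u, v))
    return kept


# Hypothetical graph: two triangles joined by the single edge (2, 3).
graph = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
pruned = filter_bridges(graph, threshold=2)
```

With `threshold=2`, the edge (2, 3) is dropped because each side has three vertices; with `threshold=3` it would be kept.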
Detecting Ambiguous Author Names in Crowdsourced Scholarly Data
Prior Work • Name disambiguation has been cast into the problem of clustering a set of publications into profiles such that each profile corresponds to a single author.
Name Variations and Citations • Extract the name variations from a collection of publications • Sort them by number of citations • Look at the percentage of the total citations that are attributed to the top name variations. (A high percentage suggests that the name is not ambiguous.)
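The citation-share signal above can be computed along these lines. This is a minimal sketch that assumes citation counts per name variation are already available; the function name, the `top_k` parameter, and the counts are hypothetical.

```python
def top_variation_share(citations_by_variation, top_k=1):
    """Fraction of all citations attributed to the top_k name variations."""
    total = sum(citations_by_variation.values())
    if total == 0:
        return 0.0
    ranked = sorted(citations_by_variation.values(), reverse=True)
    return sum(ranked[:top_k]) / total


# Hypothetical citation counts for variations of one author name.
counts = {"J. Smith": 90, "John Smith": 8, "J. A. Smith": 2}
share = top_variation_share(counts, top_k=1)
```

In this toy case the top variation accounts for 90% of the citations, which under the heuristic above would suggest the name is not ambiguous. Where to draw the line between "high" and "low" is not specified in the slides.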
Topic Consistency • Leverage the discipline tags crowdsourced from the users of the Scholarometer system • Detect different but related disciplines associated with an author name: • Map an author’s publications to topics, and measure the similarity between these topics. • Derive an author’s topic profile
A brief survey of automatic methods for author name disambiguation
Two problems • Synonyms: the same author may appear under distinct names • Polysems: distinct authors may have similar names.
Author Grouping Methods • Defining a similarity function: 1. Using predefined functions: the Levenshtein distance, Jaccard coefficient, cosine similarity, soft-TFIDF, and others. 2. Learning a similarity function: Use the training data to produce a similarity function S from R × R (R: the set of references) to {0, 1}, where 1 means that the two references refer to the same author and 0 means that they do not. 3. Exploiting graph-based similarity functions: Create a coauthorship graph G = (V, E) for each ambiguous group. Each distinct coauthor name is represented by a vertex, and the weight of an edge is related to the number of articles coauthored by the author names represented by its two vertices.
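Two of the predefined functions named above, written out as a minimal sketch. These are the textbook definitions of the Jaccard coefficient and the Levenshtein distance, not code from any of the surveyed systems; treating two empty sets as identical is an assumption.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B| for two token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical coauthor sets taken from two citations.
coauthor_sim = jaccard({"A. Gupta", "B. Chen"}, {"B. Chen", "C. Silva"})
name_dist = levenshtein("kitten", "sitting")
```

Jaccard fits set-valued attributes such as coauthor lists, while Levenshtein fits short strings such as name spellings.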
Author Grouping Methods • Clustering techniques: 1. Partitioning 2. Hierarchical agglomerative clustering 3. Density-based clustering 4. Spectral clustering
Author assignment methods • Classification: Assign the references to their authors using a supervised machine learning technique. • Clustering: Use probabilistic techniques to determine the author in an iterative way to fit the model.
Explored evidence • Citation information: the attributes directly extracted from the citations, such as author/coauthor names, work title, publication venue title, year, and so on. • Web information: Data retrieved from the web that is used as additional information about an author publication profile. • Implicit evidence: Evidence inferred from visible elements of attributes, such as the latent topics of a citation.
Open challenges • Very little data in the citations • Very ambiguous cases -- ambiguous references will have coauthors who also have ambiguous names (especially Asian names) • Citations with errors • Efficiency • Different knowledge areas -- our focus is only on computer science • Incremental disambiguation • Author profile changes • New authors
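pandasearch name ambiguity research plan • Read the relevant papers and identify the solution that best fits the current problem. • Focus on implicit evidence and web information (especially scholars' personal homepages and CVs). • Work on both efficiency and accuracy, with the emphasis on accuracy. • Learn the fundamentals of data mining and machine learning.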
pandasearch name ambiguity implementation plan • Type of approach: author grouping methods (learning a similarity function). • Explored evidence: citation information, web information, implicit evidence.