Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation
AKT DTA Colloquium, January 23, 2006
Duncan McRae-Spencer
Also By The Same Author
• Name ambiguity is a problem for automated information extraction.
• Two problems:
  • Same name, different object: David L. Harris (Harvey Mudd College, formerly Stanford and MIT) and David L. Harris (Sandia Labs, Albuquerque).
  • Different name, same object: Professor Nick Jennings, Nicholas Jennings, N. R. Jennings.
• Existing Solutions:
  • By-hand disambiguation (e.g. DBLP).
    • Problem: slow, labour-intensive.
  • Text and context processing: Li et al. (2005).
    • Problem: deals with names within text, not document authors.
  • Metadata machine-learning techniques: Han et al. (2004, 2005).
    • Problem: requires a known ‘canonical’ set, and 50% of the data is used in training.
• AKTiveAuthor: linking together paper authors using metadata analysis.
• Based on the following observation:
  • People cite their own work. When they cite an author with a similar name, 95-98% of the time it is the same person.
• Step one: initial clustering on last name.
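The initial clustering step can be sketched as below. This is a minimal illustration, not the AKTiveAuthor implementation: it assumes author names are written given-name-first, and the paper ids and the `cluster_by_last_name` helper are hypothetical.

```python
from collections import defaultdict


def last_name_key(author: str) -> str:
    """Return a lower-cased last name, assuming 'Given [Middle] Last' order."""
    return author.strip().split()[-1].lower()


def cluster_by_last_name(papers):
    """Group (paper_id, author_name) pairs into name-clusters keyed by last name."""
    clusters = defaultdict(list)
    for paper_id, author in papers:
        clusters[last_name_key(author)].append((paper_id, author))
    return dict(clusters)


# Hypothetical paper ids; the names are the ones used on the slides.
papers = [
    ("p1", "Nicholas Jennings"),
    ("p2", "N. R. Jennings"),
    ("p3", "David L. Harris"),
]
clusters = cluster_by_last_name(papers)
```

Both Jennings variants land in one name-cluster, which the later stages then split or join into real-world authors.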
• Self-citation analysis:
  • Within a name-cluster, test papers against each other.
  • Does paper A appear in the bibliography of paper B, or vice versa?
  • Iteratively use this approach to build groups of papers, each representing one real-world author.
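One way to sketch the self-citation grouping is with a union-find (disjoint-set) structure. The slides do not say how the iteration is implemented, so union-find is a substitute chosen here for brevity. The input format, a mapping from hypothetical paper ids to the ids they cite, is also an assumption, and the sanity check is left out.

```python
def group_by_self_citation(bibliographies):
    """Group papers in one name-cluster that are linked by citations.

    bibliographies maps each paper id to the set of paper ids it cites.
    Returns a list of sets, each a candidate single-author group.
    """
    parent = {paper: paper for paper in bibliographies}

    def find(paper):
        # Follow parent links to the group representative, halving the path.
        while parent[paper] != paper:
            parent[paper] = parent[parent[paper]]
            paper = parent[paper]
        return paper

    for paper, cited in bibliographies.items():
        for other in cited:
            if other in parent:  # ignore citations that leave the name-cluster
                parent[find(paper)] = find(other)

    groups = {}
    for paper in parent:
        groups.setdefault(find(paper), set()).add(paper)
    return list(groups.values())


# Hypothetical cluster: "a" cites "b", "c" cites "a", "d" is unconnected.
bibs = {"a": {"b"}, "b": set(), "c": {"a"}, "d": set()}
groups = group_by_self_citation(bibs)
```

Because the link is symmetric ("A cites B, or vice versa"), chains of citations collapse into a single group, while uncited and non-citing papers stay in singleton groups.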
• Co-authorship analysis:
  • Standard approach in disambiguation (Han et al.) and social network analysis (AKT Ontocopi).
  • Use co-authorship relationships to further match the groups created in the self-citation stage.
• Source URL analysis:
  • Extra linking provided using the ‘source URL’ metadata field.
  • Links papers by the same author on different subjects across one time period.
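Both of these stages amount to merging existing groups that share some piece of metadata. The sketch below assumes that an exact match on a co-author name or on a source URL is enough to propose a join. The slides do not specify the matching rule, and the `merge_groups` helper, paper ids, co-author names and URLs are all hypothetical.

```python
def merge_groups(groups, keys_by_paper):
    """Merge groups of paper ids that share at least one metadata key.

    keys_by_paper maps a paper id to a set of keys, e.g. co-author names
    or source URLs. Merging repeats until no two groups share a key.
    """
    merged = [set(g) for g in groups]

    def keys_of(group):
        return set().union(*(keys_by_paper.get(p, set()) for p in group))

    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if keys_of(merged[i]) & keys_of(merged[j]):
                    merged[i] |= merged.pop(j)
                    changed = True
                    break
            if changed:
                break
    return merged


# Groups as they might look after the self-citation stage.
groups = [{"p1", "p2"}, {"p3"}, {"p4"}, {"p5"}]
coauthors = {"p1": {"Coauthor X"}, "p3": {"Coauthor X"}, "p4": {"Coauthor Y"}}
source_urls = {"p4": {"http://example.org/pubs"}, "p5": {"http://example.org/pubs"}}

after_coauthors = merge_groups(groups, coauthors)
after_urls = merge_groups(after_coauthors, source_urls)
```

Running the co-authorship pass first and the source URL pass second mirrors the order of the slides. In a fuller version each proposed join would first go through the sanity check.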
• Sanity check:
  • Before committing to a ‘join’ at any of the three stages, check whether the two names obviously belong to different people.
  • E.g. Norman L. Johnson and David E. Johnson: rejected (self-citation match).
  • E.g. Earl and Erik Johnson: rejected (co-authorship match).
  • E.g. Nicholas Jennings and N. Jennings: allowed.
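A possible form of this check compares only the first given name. The rule below is an assumption that reproduces the three examples, not the rule used in AKTiveAuthor: two full given names must match exactly, and an initial is compatible with any given name starting with that letter. It expects names without titles and does not handle nicknames such as Nick for Nicholas.

```python
def first_given_name(name: str) -> str:
    """Return the first token of a name, lower-cased, without a trailing dot."""
    return name.split()[0].rstrip(".").lower()


def plausibly_same_person(name_a: str, name_b: str) -> bool:
    """Return False when the given names obviously belong to different people."""
    a, b = first_given_name(name_a), first_given_name(name_b)
    if len(a) == 1 or len(b) == 1:
        # At least one side is only an initial: compare first letters.
        return a[0] == b[0]
    # Both sides are full given names: require an exact match.
    return a == b


checks = [
    plausibly_same_person("Norman L. Johnson", "David E. Johnson"),
    plausibly_same_person("Earl Johnson", "Erik Johnson"),
    plausibly_same_person("Nicholas Jennings", "N. Jennings"),
]
```

A strict check like this trades recall for precision: it blocks some genuine joins, but it rarely lets a wrong one through.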
• Metrics:
  • Essentially an information retrieval exercise.
  • Three measures, each per individual paper:
    • Precision: (number of relevant docs retrieved) / (number of docs retrieved).
    • Recall: (number of relevant docs retrieved) / (number of relevant docs overall).
    • F-measure: harmonic mean of Precision and Recall, used as a generic measure of IR success.
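The three measures can be computed per paper as in the sketch below. It assumes that "retrieved" means the group the algorithm assigned to a paper and "relevant" means the hand-disambiguated set of papers by the same author. The paper ids are hypothetical.

```python
def precision_recall_f(retrieved: set, relevant: set):
    """Return (precision, recall, F-measure) for one paper's group."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# The algorithm found three of the author's four papers and no wrong ones.
retrieved = {"p1", "p2", "p3"}
relevant = {"p1", "p2", "p3", "p4"}
p, r, f = precision_recall_f(retrieved, relevant)
```

In this example precision is perfect but recall is 0.75, the same pattern of high precision and lower recall that a conservative joining strategy tends to produce.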
• Results:
  • Tested eight name-clusters, checking against by-hand disambiguated results.
  • Precision ranged from 0.991 to 1.000 (mean 0.997).
  • Recall ranged from 0.705 to 0.935 (mean 0.818).
  • F-measure ranged from 0.824 to 0.965 (mean 0.899).
• Analysis / Conclusions:
  • Precision is higher than recall, mainly due to the sanity check.
  • All three methods (self-citation, co-authorship and source URL analysis) are needed for best results.
  • Heavily dominated name-clusters give the best results (e.g. Giles: 81.6% C. Lee Giles).
  • Large and small name-clusters perform equally well.
• Future Work:
  • Original purpose: citation graph services, e.g. ‘view my papers’, ‘count my citations’, ‘calculate my impact’.
  • Improving the disambiguation algorithm: institutional affiliation data, tightening up co-authorship, better initial clustering.
• Questions?