Matching Bibliographic Data from Publication Lists with Large Databases using N-Grams • Mehmet Ali Abdulhayoglu, Bart Thijs
OUTLINE • Introduction • Methodology • N-gram Notion • Levenshtein (Edit) Distance Based on N-grams • Kernel Discriminant Analysis • Results • Conclusion and Discussions
Introduction • CVs and publication lists of authors, applicants or institutions are used in evaluative bibliometrics (job promotion, institutional or macro-level assessments). • These publications generally need to be identified in large databases such as Web of Science or Scopus, a process that requires a lot of manual work. • Automating the identification of publications in large databases saves time and frees up resources for manual cleaning.
Introduction • The main issue is dealing with different reference standards such as APA (American Psychological Association), MLA (Modern Language Association), etc. • Standards may order the components differently: co-author names may be placed at the very beginning of the reference in one standard and at the end in another. Some standards also abbreviate author or journal names.
Introduction • Besides diverse standards, matching is complicated by incomplete, erroneous or censored data in the publication list, erroneous indexing in the database, and changes to the publication (title, number or sequence of co-authors, publication year…). • To capture the textual similarity between the CV references and the indexed publications in bibliometric databases, we applied the notion of N-grams.
N-Grams Example: The diffusion of H-related literature • Word N-grams: adjacent sequences of n words from a given string: (the diffusion of) (diffusion of h-related) (of h-related literature) • Character N-grams: adjacent sequences of n characters from a given string, with "_" marking a space or padding: (__t) (_th) (the) (he_) (e_d) (_di) (dif) and so on
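The two decompositions on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function names are ours, and the choices of lower-casing, prefix padding with n-1 symbols, and writing spaces as "_" are assumptions read off the example rather than details stated in the slides.

```python
def word_ngrams(text, n):
    """Return all adjacent sequences of n words (lower-cased)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n, pad="_"):
    """Return all adjacent sequences of n characters.

    Assumptions: text is lower-cased, spaces are shown as "_", and the
    string is prefixed with n-1 padding symbols so that the first
    character also starts an N-gram.
    """
    padded = pad * (n - 1) + text.lower().replace(" ", "_")
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


example = "The diffusion of H-related literature"
print(word_ngrams(example, 3))
print(char_ngrams("The diffusion", 3)[:7])
```

With prefix padding, a string of k characters yields exactly k character N-grams, which is convenient when the N-grams are later aligned position by position.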
N-grams • Word N-grams are suitable for full-text studies but not for this study. • Character N-grams are powerful and handy, especially for short texts, and require no stemming. • 3-grams are chosen considering the lengths of the components (e.g. author names, publication year, page). • Similarity is measured with the method of Kondrak (2005): Levenshtein (edit) distance based on character N-grams.
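Modified Levenshtein Distance • Levenshtein distance: the minimum number of single-character edits needed to change one string into another (Levenshtein, 1966). • Operations: Add, Remove, Change • Kondrak (2005) extended this notion by using N-grams instead of single characters. • For the N-gram based edit distance between strings x and y, a matrix $D$ is constructed where $D_{i,j}$ is the minimum number of edit operations needed to match $x_{1 \ldots i}$ to $y_{1 \ldots j}$: $D_{i,j} = \min\{\, D_{i-1,j} + 1 \ (\text{Remove}),\ D_{i,j-1} + 1 \ (\text{Add}),\ D_{i-1,j-1} + d_n(i,j) \ (\text{Change}) \,\}$, where $d_n(i,j)$ is the cost of substituting the $i$-th N-gram of x with the $j$-th N-gram of y.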
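The N-gram based edit distance can be sketched as a standard dynamic programme. The version below is an illustration under stated assumptions, not necessarily the authors' implementation: it uses prefix padding, a positional substitution cost (the fraction of differing characters between two N-grams), and normalisation by the longer string, which is one of the variants described by Kondrak (2005). The function names are ours.

```python
def ngram_distance(x, y, n=3, pad="\0"):
    """Normalised N-gram edit distance between x and y, in [0, 1].

    Assumptions: prefix padding with n-1 symbols, positional substitution
    cost, normalisation by the length of the longer string.
    """
    if not x and not y:
        return 0.0
    if not x or not y:
        return 1.0
    px, py = pad * (n - 1) + x, pad * (n - 1) + y
    # With prefix padding, each character ends exactly one N-gram.
    gx = [px[i:i + n] for i in range(len(x))]
    gy = [py[j:j + n] for j in range(len(y))]

    prev = [float(j) for j in range(len(y) + 1)]  # row for the empty prefix of x
    for i in range(1, len(x) + 1):
        curr = [float(i)] + [0.0] * len(y)
        for j in range(1, len(y) + 1):
            # Change: fraction of positions at which the two N-grams differ.
            change = sum(a != b for a, b in zip(gx[i - 1], gy[j - 1])) / n
            curr[j] = min(prev[j] + 1,          # Remove
                          curr[j - 1] + 1,      # Add
                          prev[j - 1] + change) # Change
        prev = curr
    return prev[-1] / max(len(x), len(y))


def ngram_similarity(x, y, n=3):
    """Similarity score as 1 minus the normalised distance."""
    return 1.0 - ngram_distance(x, y, n)


print(ngram_similarity("the diffusion", "the diff."))
print(ngram_similarity("the diffusion", "diff. the"))
```

Under these assumptions, "the diffusion" vs. "the diff." scores two thirds, and the reordered "diff. the" scores noticeably lower, which illustrates how sensitive the measure is to the order of the components.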
Features of the Approach • Ordering is crucial. • For example, similarity between: the diffusion vs. the diff.: 0.66; the diffusion vs. diff. the: 0.31 • On the other hand, two strings (e.g. Xanex and Nexan) can have exactly the same N-gram decomposition, so a measure that ignores ordering would give them maximum similarity.
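The Xanex/Nexan case can be checked directly. The sketch below compares unordered bags of character N-grams with a Dice coefficient; the choice of unpadded, case-insensitive 2-grams and of the Dice coefficient are our assumptions for illustration, since the slide does not say which decomposition it has in mind.

```python
from collections import Counter


def char_ngram_bag(text, n):
    """Multiset of unpadded, lower-cased character N-grams (order is discarded)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient on N-gram bags: 2 * shared / (size of both bags)."""
    bag_a, bag_b = char_ngram_bag(a, n), char_ngram_bag(b, n)
    shared = sum((bag_a & bag_b).values())
    total = sum(bag_a.values()) + sum(bag_b.values())
    return 2 * shared / total if total else 1.0


print(char_ngram_bag("Xanex", 2) == char_ngram_bag("Nexan", 2))
print(dice_similarity("Xanex", "Nexan", n=2))
```

Both strings decompose into the 2-grams xa, an, ne and ex, so an order-blind comparison cannot tell them apart, whereas an edit distance computed over the N-gram sequence can.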
Application • As can be expected, publication lists provide detailed bibliographic information about each publication, such as its title, the journal title, the names of the author and co-author(s), publication year, volume, and first and last page.
Application Using these scores, we would like to decide whether the publication is indexed in the given database. Discriminant analysis is a convenient tool for this purpose.
Kernel Discriminant Analysis • Since the assumptions of classical discriminant analysis are not met, this non-parametric method is applied. • Exploiting a (normal) kernel function, it handles a non-linear mapping as a linear mapping in a feature space. • As a result, it is based on estimating a non-parametric density function for the observations. • A smoothing parameter ‘r’ determines the degree of irregularity in the estimated density function. As suggested in Khattree and Naik (2000), we tried several values of ‘r’ to reach the optimal solution.
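The idea of classifying by kernel density estimates can be sketched in plain Python. This is a simplified illustration, not the procedure used in the study: it assumes a spherical normal kernel (no covariance scaling), priors proportional to group sizes, and made-up two-dimensional score vectors. All names are ours.

```python
import math


def normal_kernel_density(x, sample, r):
    """Kernel density estimate at x from the observations in sample,
    using a spherical normal kernel with smoothing parameter r."""
    d = len(x)
    norm = (2 * math.pi) ** (d / 2) * r ** d
    total = 0.0
    for z in sample:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        total += math.exp(-0.5 * sq_dist / r ** 2)
    return total / (len(sample) * norm)


def classify(x, groups, r):
    """Assign x to the group with the largest prior * estimated density.

    groups maps a label to its training vectors; priors are taken
    proportional to group sizes (an assumption of this sketch).
    """
    n_total = sum(len(sample) for sample in groups.values())
    best_label, best_score = None, -1.0
    for label, sample in groups.items():
        score = len(sample) / n_total * normal_kernel_density(x, sample, r)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up similarity-score vectors, for illustration only.
groups = {
    1: [(0.90, 0.80), (0.85, 0.90), (0.95, 0.70)],
    0: [(0.20, 0.10), (0.30, 0.25), (0.15, 0.30), (0.25, 0.20)],
}
for r in (0.1, 0.2, 0.5):  # trying several smoothing parameters
    print(r, classify((0.88, 0.82), groups, r))
```

A small r makes the estimated density spiky around the training points, while a large r smooths it out, which is why several values are usually compared on held-out data.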
Kernel Discriminant Analysis • First set: SCORE1, SCORE2, SCORE3, SCORE4, SCORE5 and SCORE6 • Second set: SCORE9, SCORE5 and SCORE8 • The first set is chosen to examine variables that all include the “Title” component and its variations; the second is chosen as a relatively more independent set with “Maximum”, “Title” and “Journal Name”.
Data • Training set: 6525 real pairs from applicants’ CVs (manually verified correct matches) (Group 1); 3 × 6525 randomly generated non-matching pairs (Group 0) for the same publications as in Group 1 • Test set: 2570 new pairs to be classified, entirely different from those in the training set • The publications are queried against a sample of WoS data of size 7387
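One way to build such a training set is to keep each verified pair as Group 1 and draw three wrong database records for the same reference as Group 0. The sketch below is an assumption about how the random non-matching pairs could be generated; the slides do not describe the sampling procedure, and the identifiers are hypothetical.

```python
import random


def build_training_pairs(matched, database_ids, k=3, seed=0):
    """Return (reference_id, record_id, group) triples.

    matched: list of (reference_id, true_record_id) pairs -> Group 1.
    For each reference, k records other than the true one are drawn
    at random from database_ids -> Group 0.
    """
    rng = random.Random(seed)
    pairs = []
    for ref_id, true_id in matched:
        pairs.append((ref_id, true_id, 1))
        candidates = [rec for rec in database_ids if rec != true_id]
        for wrong_id in rng.sample(candidates, k):
            pairs.append((ref_id, wrong_id, 0))
    return pairs


# Hypothetical identifiers, for illustration only.
matched = [("cv-ref-1", "rec-1"), ("cv-ref-2", "rec-2")]
database_ids = ["rec-1", "rec-2", "rec-3", "rec-4", "rec-5"]
training = build_training_pairs(matched, database_ids)
print(len(training))
```

With k = 3 the two groups keep the 1:3 ratio described on the slide.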
Results • false positive: wrongly assigned to Group 1 • false negative: wrongly assigned to Group 0
Results • false positives: 36 • false negatives: 95 • 94.3% of the publications in Group 1 are classified correctly • 97.8% of the publications assigned to Group 1 are classified correctly
Results • Even though the vast majority of the publications in Group 1 are classified correctly: • Group 1 pairs with similarity scores between 0.30 and 0.45 tend to become false negatives • Group 0 pairs with similarity scores higher than 0.45 tend to become false positives
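The two percentages on this slide can be read as the recall and the precision for Group 1. In the usual notation, with $TP$ the number of true matches assigned to Group 1 (a count not given on the slide, so the symbol is ours):

```latex
\text{recall} = \frac{TP}{TP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
FP = 36,\quad FN = 95
```

Recall is lowered by false negatives and precision by false positives, which is why the two figures differ even though they describe the same classification.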
Conclusion • With the proposed model, 95% correct classification is achieved. • Matches with a similarity score of 0.6267 or higher indicate a precise, correct match. • For bibliometric evaluation processes, this helps reduce manual work. However…
Points not to be overlooked! • Works only for papers in English • “False positives” remain an issue to be solved • Tolerance for “false positives” depends on the application (micro- vs. macro-level studies)