
Matching Bibliographic Data from Publication Lists with Large Databases using N-Grams



Presentation Transcript


  1. Matching Bibliographic Data from Publication Lists with Large Databases using N-Grams Mehmet Ali Abdulhayoglu Bart Thijs

  2. OUTLINE • Introduction • Methodology • N-gram Notion • Levenshtein (Edit) Distance Based on N-grams • Kernel Discriminant Analysis • Results • Conclusion and Discussions

  3. Introduction • CVs and publication lists of authors, applicants or institutions are used in evaluative bibliometrics (job promotion, institutional or macro-level assessments) • These publications generally need to be identified in large databases such as Web of Science or Scopus, a process that requires a lot of manual work • Automating the identification of publications in large databases saves time and frees up resources for manual cleaning

  4. Introduction • The main issue is dealing with the existence of different reference standards such as APA (American Psychological Association), MLA (Modern Language Association), etc. • These standards may sequence the components differently: while some place co-author names at the very beginning of the reference, others put them at the end. Some also use abbreviations for author or journal names.

  5. Introduction • Besides diverse standards, there may be incomplete, erroneous or censored data in the publication list, erroneous indexing in the database, or changes to the publication (title, number or sequence of co-authors, publication year…) • To capture the textual similarity between CV references and the publications indexed in bibliometric databases, we apply the notion of N-grams

  6. N-Grams Example: The diffusion of H-related literature • Word N-grams – an adjacent sequence of n words from a given string: (the diffusion of) (diffusion of h-related) (of h-related literature) • Character N-grams – an adjacent sequence of n characters from a given string: (__t) (_th) (the) (he_) (e_d) (_di) (dif) and so on
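
A minimal sketch of how such N-grams can be generated; the function names and the underscore padding are illustrative assumptions based on the example above.

```python
def word_ngrams(text, n=3):
    """Adjacent sequences of n words from a given string."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n=3, pad="_"):
    """Adjacent sequences of n characters, with leading padding and
    spaces replaced by the pad character, as in the example above."""
    s = pad * (n - 1) + text.lower().replace(" ", pad)
    return [s[i:i + n] for i in range(len(s) - n + 1)]


print(word_ngrams("The diffusion of H-related literature"))
# ['the diffusion of', 'diffusion of h-related', 'of h-related literature']
print(char_ngrams("the diffusion")[:7])
# ['__t', '_th', 'the', 'he_', 'e_d', '_di', 'dif']
```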

  7. N-grams • Word N-grams are suitable for full-text studies and not convenient for this study… • Character N-grams are very powerful and handy, especially for short texts, and require no stemming! • 3-grams are chosen considering the components’ lengths (e.g. author names, publication year, pages) • Kondrak’s (2005) method is used for the similarity measure: Levenshtein (edit) distance based on character N-grams

  8. Modified Levenshtein Distance • The minimum number of single-character edits that have to be made in order to change one string into another (Levenshtein, 1966). • Operations: Add, Remove, Change • Kondrak (2005) improved this notion by using N-grams instead of single characters • For the N-gram based edit distance between strings x and y, a matrix D is constructed in which D(i, j) is the minimum number of edit operations needed to match the prefix x(1..i) to the prefix y(1..j)

  9. Levenshtein Distance
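
As a rough illustration of slides 8–9, the sketch below computes an edit distance over character 3-grams in the spirit of Kondrak (2005) and turns it into a similarity in [0, 1]. The substitution cost used here (the fraction of differing characters between two N-grams) and the normalisation are assumptions, not necessarily the authors' exact implementation; char_ngrams is the helper from the sketch above.

```python
def ngram_edit_similarity(x, y, n=3, pad="_"):
    """Levenshtein-style edit distance over character N-grams
    (cf. Kondrak, 2005), returned as a similarity between 0 and 1."""
    gx, gy = char_ngrams(x, n, pad), char_ngrams(y, n, pad)
    if not gx or not gy:
        return 0.0

    def sub_cost(a, b):
        # Cost of a 'Change': fraction of positions at which the N-grams differ.
        return sum(ca != cb for ca, cb in zip(a, b)) / n

    # D[i][j] = minimum cost of matching the first i N-grams of x
    #           to the first j N-grams of y.
    D = [[0.0] * (len(gy) + 1) for _ in range(len(gx) + 1)]
    for i in range(1, len(gx) + 1):
        D[i][0] = float(i)
    for j in range(1, len(gy) + 1):
        D[0][j] = float(j)
    for i in range(1, len(gx) + 1):
        for j in range(1, len(gy) + 1):
            D[i][j] = min(D[i - 1][j] + 1,                                   # Remove
                          D[i][j - 1] + 1,                                   # Add
                          D[i - 1][j - 1] + sub_cost(gx[i - 1], gy[j - 1]))  # Change

    return 1.0 - D[-1][-1] / max(len(gx), len(gy))
```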

  10. Features of the Approach • Ordering is crucial… • For example, the similarity between: the diffusion vs. the diff. : 0.66; the diffusion vs. diff. the : 0.31 • Also, one can find two strings (Xanex and Nexan) having exactly the same N-gram decompositions, which would give maximum similarity under an unordered comparison.
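
Using the sketch above, the ordering effect can be checked directly; the exact values depend on implementation details (n, padding, normalisation), so they need not reproduce the 0.66 / 0.31 figures on the slide.

```python
print(ngram_edit_similarity("the diffusion", "the diff."))  # high: same ordering
print(ngram_edit_similarity("the diffusion", "diff. the"))  # lower: words reordered
print(ngram_edit_similarity("xanex", "nexan"))  # < 1: ordering keeps the
                                                # Xanex / Nexan pair distinct
```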

  11. Application • As can be expected, publication lists provide detailed bibliographic information about the publications, such as the title, the journal title, the names of the author and co-author(s), the publication year, the volume, and the first and last page.

  12. Application Using these scores, we would like to decide whether the publication is indexed in the given database. Discriminant analysis is a convenient tool for this purpose.
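
The slides do not spell out how the SCORE variables (slide 14) are defined; purely as an illustration, one might score each bibliographic component separately and feed the resulting vector to the discriminant analysis. The field names below are hypothetical.

```python
FIELDS = ["title", "journal", "authors", "year", "pages"]  # hypothetical component names


def component_scores(cv_ref, wos_rec, n=3):
    """Per-component N-gram similarity between a CV reference and a
    database record, both given as dicts keyed by the fields above."""
    return {f: ngram_edit_similarity(cv_ref.get(f, ""), wos_rec.get(f, ""), n)
            for f in FIELDS}
```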

  13. Kernel Discriminant Analysis • Since the assumptions of parametric discriminant analysis do not hold here, this non-parametric method is applied • Exploiting a (normal) kernel function, it handles a non-linear mapping as a linear mapping in a feature space • As a result, it is based on estimating a non-parametric density function for the observations • There exists a smoothing parameter ‘r’ which determines the degree of irregularity in the estimated density function. As suggested in Khattree and Naik (2000), we tried several values of ‘r’ and reached the optimal solution
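
A rough analogue of the normal-kernel discriminant described above (in the spirit of the non-parametric method in Khattree and Naik, 2000), sketched here with scikit-learn's KernelDensity; this is an assumed reconstruction, not the authors' actual setup.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


class KernelDiscriminant:
    """Classify into Group 0 / Group 1 by the larger prior-weighted
    Gaussian kernel density estimate (smoothing parameter r)."""

    def __init__(self, r=0.1):
        self.r = r  # the smoothing parameter 'r' from the slide

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.kdes_, self.log_priors_ = [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.kdes_.append(
                KernelDensity(kernel="gaussian", bandwidth=self.r).fit(Xc))
            self.log_priors_.append(np.log(len(Xc) / len(X)))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        log_post = np.column_stack(
            [kde.score_samples(X) + lp
             for kde, lp in zip(self.kdes_, self.log_priors_)])
        return self.classes_[np.argmax(log_post, axis=1)]
```

Several values of r can then be compared on held-out pairs, in line with the slide's remark about trying several ‘r’.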

  14. Kernel Discriminant Analysis • Two variable sets are considered: SCORE1, SCORE2, SCORE3, SCORE4, SCORE5 and SCORE6; and SCORE9, SCORE5 and SCORE8 • While the former set is chosen to examine the variables that all include the “Title” component and its variations, the latter is chosen as a relatively more independent set with “Maximum”, “Title” and “Journal Name”.

  15. Data • Training Set: 6,525 real pairs from applicants’ CVs (pairs matched correctly by hand) (Group 1); 3 × 6,525 randomly unmatched pairs (Group 0) for the same publications as in Group 1 • Test Set: 2,570 new pairs to be classified, completely different from those in the training set; the publications are queried against a sample of WoS data containing 7,387 records
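
A sketch of how such a training set could be assembled, assuming matched_pairs holds the manually verified (CV reference, WoS record) pairs; the 3:1 negative sampling follows the slide, everything else is illustrative.

```python
import random


def build_training_pairs(matched_pairs, wos_records, neg_ratio=3, seed=42):
    """Return ((cv_ref, wos_rec), label) tuples: label 1 for the manually
    verified matches, label 0 for randomly drawn non-matching records."""
    rng = random.Random(seed)
    data = [((cv, rec), 1) for cv, rec in matched_pairs]
    for cv, true_rec in matched_pairs:
        candidates = [r for r in wos_records if r is not true_rec]
        for rec in rng.sample(candidates, neg_ratio):
            data.append(((cv, rec), 0))
    return data
```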

  16. Results • false positive: wrongly assigned to Group 1 • false negative: wrongly assigned to Group 0

  17. Results • false positives: 36 • false negatives: 95 • 94.3% of the publications in Group 1 are classified correctly • 97.8% of the publications assigned to Group 1 are classified correctly
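
The two percentages correspond to recall and precision for Group 1. The slide gives the false positive and false negative counts but not the true positive count, so the value below is only a placeholder.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


tp = 1600  # hypothetical true positive count, not given on the slide
precision, recall = precision_recall(tp, fp=36, fn=95)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```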

  18. Results • Even though the vast majority of the publications in Group 1 are classified correctly: • similarity scores between 0.30 and 0.45 for Group 1 pairs lead to false negatives • similarity scores higher than 0.45 for Group 0 pairs lead to false positives

  19. Conclusion • By means of the proposed model, 95% correct classification is achieved… • Matches with a similarity score of 0.6267 or higher indicate a precise correct match… • For bibliometric evaluation processes, this will be useful in decreasing the manual work. However…

  20. Parts not to be ruled out! • The approach works only for papers in English • “False positives” remain an issue to be solved • Tolerance to “false positives” depends on the application (micro vs. macro level studies)

  21. Thank you!
