
Combining Full-text analysis & Bibliometric Indicators: a pilot study

Combining Full-text analysis & Bibliometric Indicators: a pilot study. Patrick Glenisson (1), Wolfgang Glänzel (1,2), Olle Persson (3). (1) Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium); (2) Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary); (3) Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)


Presentation Transcript


  1. Combining Full-text analysis & Bibliometric Indicators: a pilot study • Patrick Glenisson (1), Wolfgang Glänzel (1,2), Olle Persson (3) • (1) Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium) • (2) Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary) • (3) Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)

  2. Introduction • Goal: mapping of scientific processes • Map of scientific papers • Characterization of emerging clusters • Extraction of new search keys • Using bibliometric as well as lexical indicators of ‘relatedness’ • Full-text analysis

  3. Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion

  4. Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion

  5. Data source • 19 full-text papers from: Scientometrics, Vol 30, Issue 3 (2004) • Special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China) • Validation setup • Manual assignment to various classes

  6. Data source

  7. Research questions • Comparison of text-based mapping vs. expert classification • Extracted keywords • Comparison with bibliometric mapping

  8. Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion

  9. Methodology • Given a set of documents,

  10. Methodology • Given a set of documents, • compute a representation, called index  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>

  11. Methodology • Given a set of documents, • compute a representation, called index • to retrieve, summarize, classify or cluster them  <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>

  12. Methodology • Document processing • Remove punctuation & grammatical structure (‘bag of words’) • Define a vocabulary • Identify multi-word terms, e.g., tumor suppressor (phrases) • Eliminate low-content words, e.g., and, thus (stopwords) • Map words with the same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighting scheme and/or transformations (tf-idf, SVD, ...)
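
The document-processing steps on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the stopword list, the crude plural-stripping rule, the toy sentences and the names `tokenize` and `tfidf_index` are all assumptions made for the example, and multi-word terms and synonym mapping are left out.

```python
import math
import re
from collections import Counter

# Assumed toy stopword list; a real study would use a much longer one.
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "to", "is", "on"}

def tokenize(text):
    """Lowercase, drop punctuation, remove stopwords, strip a trailing plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    # Very crude stand-in for stemming (assumption): only strips a final 's'.
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in kept]

def tfidf_index(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    bags = [Counter(tokenize(d)) for d in docs]
    n_docs = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in bag.items()}
        for bag in bags
    ]

# Hypothetical mini-corpus, not taken from the 19 papers.
docs = [
    "Citations and patents in science policy.",
    "Web links and citations on the web.",
    "Thus, a model of citation ageing.",
]
index = tfidf_index(docs)
```

A term that occurs in every document (here "citation") receives weight zero, which is how tf-idf downplays words that do not discriminate between papers.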

  13. Methodology • Compute index of textual resources: documents as vectors over the vocabulary (figure axes T1, T2, T3) • Similarity between documents: Salton’s cosine
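
Salton's cosine is the dot product of two document vectors divided by the product of their Euclidean norms. A minimal sketch, using the three binary index vectors shown on slides 10 and 11; the function name `salton_cosine` and the choice of 1 minus cosine as a distance are assumptions for illustration.

```python
import math

def salton_cosine(x, y):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_x * norm_y)

# Binary index vectors from the Methodology slides.
d1 = [1, 0, 0, 1, 0, 1]
d2 = [1, 1, 0, 0, 0, 1]
d3 = [0, 0, 0, 1, 1, 0]
vectors = [d1, d2, d3]

# Pairwise similarity matrix and a derived distance matrix (1 - cosine).
similarity = [[salton_cosine(a, b) for b in vectors] for a in vectors]
distance = [[1.0 - s for s in row] for row in similarity]
```

Here d1 and d2 share two of their three terms, so their cosine is 2/3, while d2 and d3 share none and get a cosine of 0.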

  14. Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion

  15. Results – Term statistics • 19 papers • 3610 retained terms (including ~400 bigrams) • Distance matrix (19x19) • Apply MDS • Apply clustering

  16. Results – MDS

  17. Results – MDS • Groups labelled on the map: Policy, Mathematical approaches, Webometrics

  18. Results – Clustering • Hierarchical clustering, Ward method • Cut-off k=4 ? • Optimal parameters ? • ‘Stability-based method’ • Quantified correspondence with expert assignments ? • ‘Rand index’ ..
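
To make the clustering step concrete, here is a small pure-Python sketch of agglomerative clustering with Ward's criterion, implemented through the Lance-Williams update on squared Euclidean distances and stopped at a chosen cut-off k. It is an illustration under assumptions: the function name `ward_clusters` and the toy 2-D points are hypothetical, and the slides do not say which implementation or input coordinates were used.

```python
def ward_clusters(points, k):
    """Merge clusters by Ward's criterion until k clusters remain."""
    clusters = {i: [i] for i in range(len(points))}
    # Squared Euclidean distances between singletons, keyed by (small id, large id).
    dist = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist[(i, j)] = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    next_id = len(points)
    while len(clusters) > k:
        i, j = min(dist, key=dist.get)  # closest pair under the Ward distance
        d_ij = dist[(i, j)]
        n_i, n_j = len(clusters[i]), len(clusters[j])
        updated = {}
        for m, members in clusters.items():
            if m in (i, j):
                continue
            n_m = len(members)
            d_im = dist[(min(i, m), max(i, m))]
            d_jm = dist[(min(j, m), max(j, m))]
            # Lance-Williams update for Ward's method.
            updated[(m, next_id)] = (
                (n_i + n_m) * d_im + (n_j + n_m) * d_jm - n_m * d_ij
            ) / (n_i + n_j + n_m)
        merged = clusters.pop(i) + clusters.pop(j)
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(updated)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical 2-D coordinates, e.g. positions of five papers on an MDS map.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
groups = ward_clusters(points, k=3)
```

Varying `k` and checking how stable the resulting groups are is the question the slide raises with "Cut-off k=4 ?"; the stability-based method itself is not sketched here.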

  19. Results – Peer evaluation • Rand index = 0.778 • p-value (w.r.t. permuted data) < 10^-3; significant • Expert classes shown: Webometrics, Mathematical approaches, Policy
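
The Rand index is the share of document pairs on which two partitions agree (both put the pair together, or both keep it apart). A p-value "with respect to permuted data" can be estimated by shuffling one labelling many times and counting how often the shuffled Rand index reaches the observed one. The sketch below assumes this simple permutation scheme; the slides do not specify the exact procedure, and the label vectors are invented.

```python
import random
from itertools import combinations

def rand_index(a, b):
    """Fraction of item pairs treated the same way by labellings a and b."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Estimate how often a random relabelling matches a at least as well as b."""
    rng = random.Random(seed)
    observed = rand_index(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rand_index(a, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical expert classes and cluster assignments for nine papers.
expert = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster = [0, 0, 0, 1, 1, 2, 2, 2, 2]
ri = rand_index(expert, cluster)
p = permutation_p_value(expert, cluster)
```

Because the index only looks at pairs, it does not depend on how the clusters are numbered, which makes it suitable for comparing an unsupervised clustering with a manual classification.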

  20. Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion

  21. Results – Reference age Histograms per paper

  22. Results – Reference age Histograms aggregated by expert class

  23. Results – Ref Age vs. % Serial • Scatter plot of expert classes: Mean Reference Age vs. Percentage of Serials
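
The two bibliometric indicators on this plot are straightforward to compute once each reference is reduced to a cited year and a serial/non-serial flag. The sketch assumes that reference age is the publication year minus the cited year and that the percentage of serials is the share of references pointing to serials; the data layout and function names are hypothetical.

```python
from statistics import mean

def mean_reference_age(pub_year, references):
    """Average of (publication year - cited year) over a paper's references."""
    return mean(pub_year - year for year, _ in references)

def percent_serials(references):
    """Share of references that point to serials, in percent."""
    serial_count = sum(1 for _, is_serial in references if is_serial)
    return 100.0 * serial_count / len(references)

# Hypothetical reference list: (cited year, is the source a serial?).
refs = [(2000, True), (1994, True), (2002, False), (2004, True)]
age = mean_reference_age(2004, refs)
share = percent_serials(refs)
```

Averaging these two values over the papers of each expert class gives one point per class, as in the scatter plot described on the slide.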

  24. Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion

  25. Results – Term extraction • Calculation of seminal keywords for each article • Using TF-IDF weighting scheme • Normalized to norm 1 to account for document length
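
Normalizing each document's weight vector to unit length makes scores comparable between long and short papers, after which the highest-weighted terms can be read off as candidate keywords. A minimal sketch, assuming the Euclidean (L2) norm is meant by "norm 1" and using made-up weights; `unit_normalize` and `top_keywords` are hypothetical names.

```python
import math

def unit_normalize(weights):
    """Scale a {term: weight} dict so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # nothing to scale in an all-zero vector
    return {term: w / norm for term, w in weights.items()}

def top_keywords(weights, n=3):
    """Return the n terms with the largest normalized weight."""
    normalized = unit_normalize(weights)
    ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# Hypothetical tf-idf weights for one article.
paper = {"webometrics": 4.0, "link": 3.0, "citation": 0.0, "model": 1.0}
keywords = top_keywords(paper, n=2)
```

Normalization does not change the ranking within a single paper; it matters when weights from different papers are compared or fed into the cosine-based mapping.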

  26. Results – Full-text vs Abstract • Is a full-text analysis warranted • for term extraction ? • for mapping purposes ?

  27. Results – Full-text vs Abstract • Less structure • Less overlap with expert classes: Rand index = 0.6257, p-value = 0.464; not significant • Full-text is an interesting source for additional keywords and improved mapping

  28. Conclusion • Keyword approach may be naïve • But applied in a systematic framework in combination with ‘right’ algorithms, it provides interesting clues • Complementary to bibliometric approaches • Weak indications towards benefits of using full-text articles • Future: extension of this pilot to larger samples

  29. References • Bibliometrics; homepage Wolfgang Glänzel • http://www.steunpuntoos.be/wg.html • Bibliometrics; homepage Olle Persson • http://www.umu.se/inforsk/Staff/olle.htm • Text & Data mining; PhD thesis Patrick Glenisson • ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf • Optimal k in clustering; Stability method
