Combining Full-text analysis & Bibliometric Indicators: a pilot study • Patrick Glenisson (1), Wolfgang Glänzel (1,2), Olle Persson (3) • (1) Steunpunt O&O Statistieken, Katholieke Universiteit Leuven, Leuven (Belgium) • (2) Institute for Research Organisation, Hungarian Academy of Sciences, Budapest (Hungary) • (3) Inforsk, Department of Sociology, Umeå University, Umeå (Sweden)
Introduction • Goal: mapping of scientific processes • Map of scientific papers • Characterization of emerging clusters • Extraction of new search keys • Using bibliometric as well as lexical indicators of ‘relatedness’ • Full-text analysis
Overview • Data sources and Questions asked • Text mining Ingredients • Text-based relational analysis of documents • Contrasts with bibliometric analysis • Term extraction from full-text • Conclusion
Data source • 19 full-text papers from Scientometrics, Vol 30, Issue 3 (2004) • Special issue on the 9th International Conference on Scientometrics and Informetrics (Beijing, China) • Validation setup • Manual assignment of the papers to various classes
Research questions • Comparison of text-based mapping vs. expert classification • Extracted keywords • Comparison with bibliometric mapping
Methodology • Given a set of documents, • compute a representation, called an index, • to retrieve, summarize, classify or cluster them • Example index vectors: <1 0 0 1 0 1> <1 1 0 0 0 1> <0 0 0 1 1 0>
Methodology • Document processing • Remove punctuation & grammatical structure (‘bag of words’) • Define a vocabulary • Identify multi-word terms, e.g., tumor suppressor (phrases) • Eliminate low-content words, e.g., and, thus, ... (stopwords) • Map words with the same meaning (synonyms) • Strip plurals, conjugations, ... (stemming) • Define weighting scheme and/or transformations (TF-IDF, SVD, ...)
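The document-processing steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stopword list, the phrase list, the plural-stripping `stem` helper and the two sample sentences are all hypothetical, and the slides do not say which tools or word lists were actually used. Synonym mapping and weighting are left out.

```python
import re
from collections import Counter

# Illustrative stopword and phrase lists; both are assumptions, not from the slides.
STOPWORDS = {"and", "thus", "the", "of", "a", "in", "is", "are"}
PHRASES = {("tumor", "suppressor")}

def stem(word):
    """Very crude stemmer: strips one trailing plural 's' (not a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokenize(text):
    """Lowercase, drop punctuation and stopwords, stem, then merge known phrases."""
    words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in PHRASES:
            tokens.append(words[i] + "_" + words[i + 1])  # multi-word term
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

def build_index(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    bags = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[t] for t in vocab] for bag in bags]

# Hypothetical two-document corpus.
docs = [
    "Tumor suppressors and citations.",
    "Citations of the tumor suppressor genes.",
]
vocab, index = build_index(docs)
print(vocab)
print(index)
```

Each row of `index` is one document vector over the shared vocabulary, which is the kind of representation the following slides build on.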
Methodology • Compute index of textual resources over the vocabulary • [Figure: vocabulary term axes T1, T2, T3] • Similarity between documents: Salton’s cosine
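Salton's cosine between two document vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch, applied to the three example index vectors shown on the earlier Methodology slide (the names `d1`, `d2`, `d3` are introduced here for illustration only):

```python
import math

def cosine(u, v):
    """Salton's cosine: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Example index vectors from the Methodology slide.
d1 = [1, 0, 0, 1, 0, 1]
d2 = [1, 1, 0, 0, 0, 1]
d3 = [0, 0, 0, 1, 1, 0]

print(cosine(d1, d2))  # two shared terms out of three each: 2/3
print(cosine(d2, d3))  # no shared terms: 0.0
```

A value of 1 means the two documents use terms in identical proportions, and 0 means they share no vocabulary terms.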
Results – Term statistics • 19 papers • 3610 retained terms (including ~400 bigrams) • Distance matrix (19x19) • Apply MDS • Apply clustering
Results – MDS • [Figure: MDS map of the papers; labelled regions: Policy, Mathematical approaches, Webometrics]
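The slides do not state which MDS variant produced this map. As a hedged illustration, the sketch below implements classical (Torgerson) MDS in plain Python: it double-centres the squared distance matrix and extracts the leading eigenvectors by power iteration. The function name `classical_mds`, the iteration count and the four toy points are assumptions made for this example; for text data one might use 1 minus the cosine similarity as the distance, which is also an assumption.

```python
import math

def classical_mds(D, dims=2, iters=500):
    """Embed n points in `dims` dimensions from a symmetric distance matrix D."""
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (D symmetric, so column means = row means).
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        start = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
        v = [(i + 1) / start for i in range(n)]  # fixed, normalised start vector
        for _ in range(iters):  # power iteration for the dominant eigenvector
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * Bv[i] for i in range(n))  # Rayleigh quotient (v has unit length)
        if lam <= 1e-12:
            break  # no further positive eigenvalue to use
        for i in range(n):
            coords[i][d] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Toy check: four corners of a 3 x 4 rectangle (hypothetical data).
pts = [(0, 0), (3, 0), (0, 4), (3, 4)]
D = [[math.dist(p, q) for q in pts] for p in pts]
Y = classical_mds(D)
print(Y)
```

For points that really are Euclidean, as in the toy rectangle, the embedding reproduces the original pairwise distances up to rotation and reflection; for a 19x19 text-based distance matrix the two-dimensional map is only an approximation.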
Results – Clustering • Hierarchical clustering, Ward method • Cut-off k=4 ? • Optimal parameters ? • ‘Stability-based method’ • Quantified correspondence with expert assignments ? • ‘Rand index’
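To make the Ward step concrete, here is a small pure-Python sketch of agglomerative clustering with Ward's criterion, using the Lance-Williams update on squared distances and stopping once k clusters remain. It is an illustration only: the function name `ward_clusters` and the one-dimensional toy data are assumptions, it expects k >= 1, and it does not implement the stability-based choice of k mentioned on the slide.

```python
def ward_clusters(D, k):
    """Merge clusters by Ward's criterion until k remain; return one label per item."""
    n = len(D)
    members = {i: [i] for i in range(n)}
    # Squared distances between current clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): D[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        a, b = min(d2, key=d2.get)  # closest pair under Ward's criterion
        na, nb = len(members[a]), len(members[b])
        dab = d2.pop((a, b))
        updated = {}
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            dac = d2.pop((min(a, c), max(a, c)))
            dbc = d2.pop((min(b, c), max(b, c)))
            # Lance-Williams update for Ward's method (squared distances).
            updated[c] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        members[next_id] = members.pop(a) + members.pop(b)
        for c, value in updated.items():
            d2[(c, next_id)] = value  # next_id is larger than every existing id
        next_id += 1
    labels = [0] * n
    for label, ids in enumerate(sorted(members.values(), key=min)):
        for i in ids:
            labels[i] = label
    return labels

# Hypothetical 1-D example: three well-separated groups.
xs = [0, 1, 10, 11, 50]
D = [[abs(p - q) for q in xs] for p in xs]
labels = ward_clusters(D, k=3)
print(labels)
```

Cutting at a different k only changes where the merging loop stops, so the same routine can be rerun for several values of k when comparing against expert classes.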
Results – Peer evaluation • Rand index = 0.778 • p-value (w.r.t. permuted data) < 10^-3; significant • [Figure labels: Webometrics, Mathematical approaches, Policy]
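The Rand index is the fraction of document pairs on which two partitions agree: both place the pair in the same group, or both place it in different groups. The sketch below computes it and estimates a permutation p-value by shuffling one labelling. The permutation scheme, the function names and the toy labels are assumptions; the slide does not describe how the permuted data were generated.

```python
import random
from itertools import combinations

def rand_index(a, b):
    """Fraction of item pairs treated the same way (together/apart) by both labellings."""
    agree = total = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Share of random relabellings of b that match a at least as well as b does."""
    rng = random.Random(seed)
    observed = rand_index(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rand_index(a, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids reporting exactly 0

# Hypothetical cluster labels vs. expert classes for eight papers.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
experts = [0, 0, 0, 1, 1, 2, 2, 2]
print(rand_index(clusters, experts))
print(permutation_p_value(clusters, experts))
```

The Rand index does not depend on how the groups are numbered, so cluster labels and expert class labels do not need to be aligned beforehand.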
Results – Reference age • Histograms per paper
Results – Reference age • Histograms aggregated by expert class
Results – Ref Age vs. % Serials • Scatter plot of expert classes: mean reference age vs. percentage of serials
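The two bibliometric indicators in this scatter plot can be computed from a paper's reference list. The sketch assumes the usual definitions: reference age is the citing paper's publication year minus the cited item's publication year, and the percentage of serials is the share of references that point to serial publications such as journals. The slides do not spell these out, and the function name and sample references are hypothetical.

```python
def reference_indicators(pub_year, references):
    """Return (mean reference age, percentage of serials) for one paper.

    `references` is a list of (cited_year, is_serial) tuples.
    """
    ages = [pub_year - cited_year for cited_year, _ in references]
    mean_age = sum(ages) / len(ages)
    pct_serials = 100.0 * sum(1 for _, is_serial in references if is_serial) / len(references)
    return mean_age, pct_serials

# Hypothetical reference list of a paper published in 2004.
refs = [(2000, True), (1998, True), (2003, False), (1995, True)]
mean_age, pct_serials = reference_indicators(2004, refs)
print(mean_age, pct_serials)
```

Averaging these per-paper values within each expert class would give one point per class, as in the scatter plot.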
Results – Term extraction • Calculation of seminal keywords for each article • Using the TF-IDF weighting scheme • Document vectors normalized to norm 1 to account for document length
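A minimal sketch of keyword extraction along these lines: weight each term by TF-IDF, scale each document vector to unit Euclidean length, and keep the highest-weighted terms. The IDF variant used here, log(N / df), the function name and the tiny token lists are assumptions; the slides do not specify the exact formula.

```python
import math
from collections import Counter

def tfidf_keywords(token_lists, top_n=3):
    """Return, per document, the top_n (term, weight) pairs under unit-norm TF-IDF."""
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    result = []
    for tokens in token_lists:
        tf = Counter(tokens)
        weights = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}  # scale to norm 1
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append(ranked[:top_n])
    return result

# Hypothetical, already tokenized documents.
token_lists = [
    ["citation", "citation", "web", "link"],
    ["citation", "policy", "policy", "funding"],
    ["citation", "model", "model", "model"],
]
keywords = tfidf_keywords(token_lists)
for doc_keywords in keywords:
    print(doc_keywords)
```

Terms that occur in every document, like "citation" in this toy corpus, receive weight 0 and therefore never rank as keywords, which is the intended effect of the IDF factor.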
Results – Full-text vs Abstract • Is a full-text analysis warranted • for term extraction ? • for mapping purposes ?
Results – Full-text vs Abstract • Abstract-based mapping shows less structure • Less overlap with expert classes: Rand index = 0.6257, p-value = 0.464; not significant • Full-text is an interesting source for additional keywords and improved mapping
Conclusion • Keyword approach may be naïve • But applied in a systematic framework in combination with ‘right’ algorithms, it provides interesting clues • Complementary to bibliometric approaches • Weak indications towards benefits of using full-text articles • Future: extension of this pilot to larger samples
References • Bibliometrics; homepage Wolfgang Glänzel • http://www.steunpuntoos.be/wg.html • Bibliometrics; homepage Olle Persson • http://www.umu.se/inforsk/Staff/olle.htm • Text & Data mining; PhD thesis Patrick Glenisson • ftp://ftp.esat.kuleuven.ac.be/pub/sista/glenisson/reports/phd.pdf • Optimal k in clustering; Stability method