Understanding the Field of Bioinformatics through Text Mining Techniques (Detecting Bioinformatics by Text Mining Techniques)

Min Song, PhD • Associate Professor • Department of Library and Information Science • Yonsei University

Outline • Introduction and Background • Research Problem • Methods • Data Processing • Topic Modeling • Citation Analysis • Identification of Important Articles by PageRank • Visualization • Results & Discussion • Summary & Future Work

Introduction • Bioinformatics has grown into a cross-disciplinary field and spread into new areas of the life sciences • 400,000 biological researchers worldwide • Sequencing industry projected to grow from $1.5B to $100B in 20 years (NextGen Informatics, 2011) • Increasing number of biological databases, including PubMed and PubMed Central • Understanding the trends in and the structure of Bioinformatics is increasingly important • Bibliometric analysis has been applied to Bioinformatics for this purpose (Glänzel et al., 2009; Bansard et al., 2007; Huang et al., 2011)

Research Problem • Bibliometric analysis uses quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh, 1996) • Problems of Current Approaches • Current bibliometric analysis relies primarily on Thomson’s Web of Science product, which results in the following problems: • Citation data are processed manually • Coverage is incomplete • Only citation analysis is used • Big data cannot be handled

Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Goal • Detecting the trends in and the structure of the field of Bioinformatics • We introduce novel techniques to detect the knowledge structure of and trends in Bioinformatics by Text Mining techniques and automated citation analysis • Mining PubMed Central full-text with • topic modeling • word co-occurrence • named entity recognition • MeSH • Author co-citation analysis • Visualization

What is PubMed Central? • PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of biomedical and life sciences journal literature • Provides free and unrestricted access (XML format) • Integrates journal literature with other valuable information resources in the NCBI database family (e.g., PubMed, Nucleotide, Protein) • Launched in February 2000 • 383 journals, 1,512,652 articles, 4.3 million unique visitors in April 2008

Citation Analysis • Citation graphs • Link-based algorithms • PageRank: representative publications • Quantifying similarities between documents • Text-based: topic modeling, term co-occurrence • Citation-based: bibliographic coupling (BC), co-citation • Combine text-based and citation-based similarities
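The two citation-based similarity measures named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual definitions (bibliographic coupling strength is the number of references two papers share; a co-citation count is the number of papers that cite both of two references), and the `cites` dictionary and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bibliographic_coupling(cites):
    """Count shared references for every pair of citing papers.

    cites: dict mapping a citing paper ID to the references it cites
    (hypothetical input format, not taken from the slides).
    """
    strength = {}
    for a, b in combinations(sorted(cites), 2):
        shared = len(set(cites[a]) & set(cites[b]))
        if shared:
            strength[(a, b)] = shared
    return strength


def co_citation(cites):
    """Count how many papers cite each pair of references together."""
    counts = Counter()
    for refs in cites.values():
        # Sort so that each pair is stored under one canonical key.
        for x, y in combinations(sorted(set(refs)), 2):
            counts[(x, y)] += 1
    return counts


# Toy data: three papers and the references each one cites.
toy = {"p1": {"a", "b", "c"}, "p2": {"b", "c"}, "p3": {"c", "d"}}
coupling = bibliographic_coupling(toy)
cocited = co_citation(toy)
```

In the toy data, p1 and p2 share two references, so they are more strongly coupled than either is with p3; references b and c are cited together by two papers.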

Methods – Data Collection • A total of 20,869 articles from 47 journals

Overall Procedure of Our Approach

MeSH = Medical Subject Headings

Word co-occurrence analysis and MeSH term frequency • Identifying important concepts by word co-occurrence • The most widely used measure of co-occurrence is mutual information (MI) • We use the log-likelihood ratio (LLR) because it handles a mixture of high-frequency and low-frequency bigrams more appropriately than MI • Identifying important concepts by MeSH terms • Counting the MeSH terms assigned to each article • MeSH terms are not assigned to PubMed Central articles • We map each PubMed Central article to its PubMed record and then extract the MeSH terms
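A log-likelihood ratio score for bigrams can be sketched as below. This assumes the common G² formulation over a 2x2 contingency table (Dunning's statistic); the slides do not give the exact formula or the tokenization used, so the function names and the toy sentence are illustrative only.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: bigrams (w1, w2); k12: (w1, not w2);
    k21: (not w1, w2);     k22: (not w1, not w2).
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def score_bigrams(tokens):
    """Score every adjacent word pair in a token list by LLR."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter(tokens[:-1])   # counts of words in first position
    second = Counter(tokens[1:])   # counts of words in second position
    n = len(tokens) - 1            # total number of bigrams
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11
        k21 = second[w2] - k11
        k22 = n - k11 - k12 - k21
        scores[(w1, w2)] = llr(k11, k12, k21, k22)
    return scores


tokens = "gene expression data gene expression profile protein structure".split()
scores = score_bigrams(tokens)
```

On this toy input the repeated pair ("gene", "expression") scores higher than a pair seen once, such as ("expression", "data"), which is the behavior wanted when ranking candidate concepts.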

Database Schema for a PubMed Central Citation DB

Topic Modeling • Topic Modeling by LDA • We explore the salient topics in the core literature of Bioinformatics • We use Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), for topic model generation • LDA is a generative model in which sets of observations are accounted for by unobserved groups, which explains the similarity of documents in the collection • In LDA, each document is described as a random mixture over latent topics, where each topic is a discrete distribution over the vocabulary of the collection
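The generative process behind this description can be written out as follows. This is the commonly presented smoothed form of LDA; the symbols (K topics, Dirichlet parameters α and β, topic proportions θ, topic-word distributions φ) follow standard notation and are an assumption here, since the slides do not define them.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K
  && \text{(word distribution of topic } k\text{)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D
  && \text{(topic mixture of document } d\text{)} \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d), & n &= 1, \dots, N_d
  && \text{(topic of the } n\text{-th word in } d\text{)} \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) & &
  && \text{(observed word)}
\end{align*}
```

Only the words w are observed; topic modeling infers θ and φ from them.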

D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

NER-based Detection of Organization and Country • We apply a Named Entity Recognition (NER) technique to identify country and organization from the text

Citation Analysis • Build a Citation Network from the Datasets • 990,000 citation nodes from about 20,000 papers • Apply the PageRank algorithm to the network to identify the important articles Citation Network (Complexity and Social Networks, 2012)

PageRank - definition • $u$: a web page • $F_u$: set of pages $u$ points to • $B_u$: set of pages that point to $u$ • $N_u = |F_u|$: the number of links from $u$ • $c$: a factor used for normalization • $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$ • The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges • The definition corresponds to the probability distribution of a random walk on the web graph
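The iterate-until-convergence computation described here can be sketched on a small citation graph. This is a minimal sketch, not the authors' code: it adds a damping factor and spreads the rank of papers without outgoing citations uniformly, both of which are assumptions beyond the simplified recursive definition on the slide. The `links` dictionary and the function name are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    links: dict mapping each node u to the list of nodes it points to (F_u).
    damping: assumed damping factor; not specified in the slides.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # any starting ranks would do
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n
                    for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                # Each node passes rank[u] / N_u to every node it points to.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy citation graph: papers A and B both cite paper C.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
```

In the toy graph the cited paper C ends up with the highest rank, and the ranks sum to 1, matching the random-walk interpretation.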

Results and Discussion • Term Co-occurrence Analysis Keywords with High Ranked Word Co-occurrence

Results and Discussion (Cont’d) Top Ranked Word Pairs by LLR

Results and Discussion (Cont’d) • Out of 20,869 documents, 19,954 have corresponding MEDLINE records (95.6% matching). Of these 19,954 documents, 8,412 have MeSH terms (42.2%)

Results and Discussion (Cont’d) Topic Modeling

Results and Discussion (Cont’d) Relationship between a paper and its citation

Results and Discussion (Cont’d) Publication productivity by year

Results and Discussion (Cont’d) Relationship between an author and the number of citations received

Results and Discussion (Cont’d) • Important Articles Identified by PageRank

Results and Discussion (Cont’d) Research productivity by country

Results and Discussion (Cont’d) Research Productivity by Institute

Visualization of author co-citation analysis All author-based co-citation analysis

First author-based co-citation analysis

Summary and Future Work • We analyzed the field of Bioinformatics by mining the full-text articles available in PubMed Central with Text Mining techniques • We found that Bioinformatics has grown very fast and that collaboration among authors spreads widely across disciplines. Future work • Identify research trends over time • Combine community detection and topic modeling • Author co-citation analysis vs. author collaboration analysis • All author-based vs. first author-based vs. important contributor-based • Compare to Web of Science data

References • Nagarajan M., Mohamed Idhris L., Chellappandi P., Kumaravel J.P.S. and Premalatha V. Information Use by Scholars in Bioinformatics: A Bibliometric View, 2011 International Conference on Information Communication and Management, IPCSIT vol. 16 (2011) • Church, K., and Hanks, P., Word Association Norms, Mutual Information and Lexicography, Computational Linguistics, Vol. 16:1, pp. 22-29 (1990). • Patra, S. K., Mishra, S. (2006) Bibliometric study of bioinformatics literature, Scientometrics, 67: 477–489. • Zhao, D. (2006) Towards All-Author Co-Citation Analysis, Information Processing and Management, 42: 1578-1591. • Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007. • Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005. • Brusic, V. (2007) The growth of bioinformatics, Briefings in Bioinformatics, Vol. 8, No. 2: 69-70.

References • Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D, Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux J L: Medical informatics and bioinformatics: a bibliometric study. IEEE Transactions on Information Technology in Biomedicine 2007, 11(3): 237-243. • Perez-Iratxeta C, Andrade-Navarro M A, Wren J D: Evolving research trends in bioinformatics. Briefings in Bioinformatics 2007, 8(2): 88-95. • Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79: 109-129. • Blei D, Ng A, Jordan M: Latent Dirichlet allocation. Journal of Machine Learning Research 2003, 3: 993–1022. • Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. Journal of the American Society for Information Science and Technology 2011, doi: 10.1002/asi.21707

Questions? • Thank you!
