Introduction to Research Informatics - Data Mining and Visualization, BMI 5740, Spring 2013

Yang Xiang, Ph.D. (yxiang@bmi.osu.edu), Department of Biomedical Informatics, The Ohio State University

Outline
• Data Mining Background
• Association Rule Learning
  • Basic concepts
  • Frequent itemset mining, Apriori principle
  • Closed itemsets, maximal frequent itemsets
• Graph/Network Mining
  • Basic concepts
  • Bipartite graph mining
  • Transactional Data Summarization and Visualization
• Knowledge Discovery and Data Mining in Research Informatics
  • Transactional Data Transformation/Evaluation
  • Mining (genomic) network data
  • Indexing the Unified Medical Language System for knowledge discovery

Data Mining Background: Biomedical Data Types and Sources
• Biomedical Data
  • Genomic Data
  • Clinical Data
  • Electronic Health Record …
• Data sources
  • Gene Expression Omnibus (GEO)
  • The Cancer Genome Atlas (TCGA)
  • Unified Medical Language System (UMLS) …
Many of these data are transactional data or other types of network data.

Data Mining Workflow
Data Selection and Pre-processing → Preprocessed data → Data Mining → raw results → Post-processing → results → cross-validation and evaluation → Knowledge

Data Mining Methods
• Association Rule Learning/Frequent Itemset Mining → Handling transactional data (which is essentially a type of graph data)
• Graph/Network Mining → Handling various graph data
• Clustering
• Typical Machine Learning Approaches
  • Artificial Neural Networks
  • Decision Trees
  • Support Vector Machines
  • K-Nearest Neighbor
  • Bayesian Methods …

Association Rule Learning
[Table: Grocery Store transactions]
Many transactional data exist in the biomedical field, such as: (1) Electronic Health Records (2) Gene-phenotype associations (3) Drug side-effect associations …
Some rules observed: {Bread, Beer} → {Coke}, {Diaper} → {Eggs}
Reference: Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Third Edition, 2011

Important Concepts and Definitions
• Item/Transaction
• Itemset
• Support of an itemset: the fraction of transactions that contain the itemset, e.g., sup({bread, beer}) = 2/5
• Frequent Itemset: an itemset whose support is greater than or equal to a minimum support threshold
• Confidence of an association rule X → Y: measures how often Y appears in transactions containing X, i.e., sup(X ∪ Y)/sup(X)
Questions:
1. Given minimum support 0.6, is {bread, coke} a frequent itemset?
2. What is the confidence of the rule {Bread, Beer} → {Coke}?
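
The two measures can be computed directly from their definitions. The sketch below uses a small hypothetical grocery database (an assumption for illustration, not the table shown on the slide); the names `transactions`, `support`, and `confidence` are likewise illustrative.

```python
from fractions import Fraction

# Hypothetical toy database (NOT the table from the slide).
transactions = [
    {"bread", "milk"},
    {"bread", "beer", "coke", "diaper"},
    {"milk", "diaper", "eggs"},
    {"bread", "beer", "coke", "milk"},
    {"bread", "coke", "diaper"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))

def confidence(x, y, db):
    """sup(X ∪ Y) / sup(X): how often Y appears in transactions containing X.

    Assumes X occurs in at least one transaction.
    """
    return support(set(x) | set(y), db) / support(x, db)

print(support({"bread", "beer"}, transactions))               # 2/5
print(confidence({"bread", "beer"}, {"coke"}, transactions))  # 1
```

Exact fractions are used so that comparisons against a threshold such as 0.6 = 3/5 are not affected by floating-point rounding.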

Association Rule Mining
• Goal: find rules with high support and high confidence
• Approaches:
  • Brute-force (computationally prohibitive)
  • Two-step approach:
    • Frequent Itemset Generation
    • Rule Generation
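
The second step of the two-step approach, rule generation, can be sketched as follows: given one frequent itemset, try every nonempty proper subset as the left-hand side and keep the rules whose confidence meets a threshold. The database, the function name `rules_from_itemset`, and the 0.8 threshold are assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy database (NOT the table from the slide).
transactions = [
    {"bread", "milk"},
    {"bread", "beer", "coke", "diaper"},
    {"milk", "diaper", "eggs"},
    {"bread", "beer", "coke", "milk"},
    {"bread", "coke", "diaper"},
]

def rules_from_itemset(itemset, db, min_confidence):
    """Return (lhs, rhs, confidence) for every rule lhs -> rhs built from
    `itemset` whose confidence is at least `min_confidence`."""
    def count(s):
        return sum(s <= t for t in db)

    itemset = frozenset(itemset)
    whole = count(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            lhs_count = count(lhs)
            if lhs_count and whole / lhs_count >= min_confidence:
                rules.append((lhs, itemset - lhs, whole / lhs_count))
    return rules

for lhs, rhs, conf in rules_from_itemset({"bread", "beer", "coke"}, transactions, 0.8):
    print(sorted(lhs), "->", sorted(rhs), conf)
```

All rules derived from one itemset share the same support, so only the confidence has to be checked in this step.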

Frequent Itemset Generation/Mining
• Brute-force approach (computationally prohibitive)
• Apriori principle:
  • For any two itemsets X and Y, if X ⊆ Y, then support(X) ≥ support(Y); hence every superset of an infrequent itemset is also infrequent
• Pruning the search tree (branch and bound) using the Apriori principle
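
A minimal level-wise sketch of how the Apriori principle prunes the search: a candidate of size k is counted only if all of its (k-1)-subsets were found frequent. The database and the function name `apriori` are assumptions for illustration; production miners such as those listed later in the deck use far more efficient data structures.

```python
from itertools import combinations

# Hypothetical toy database (NOT the table from the slide).
transactions = [
    {"bread", "milk"},
    {"bread", "beer", "coke", "diaper"},
    {"milk", "diaper", "eggs"},
    {"bread", "beer", "coke", "milk"},
    {"bread", "coke", "diaper"},
]

def apriori(db, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    n = len(db)
    items = sorted({i for t in db for i in t})
    # Level 1: frequent single items.
    level = {}
    for i in items:
        c = sum(i in t for t in db)
        if c / n >= min_support:
            level[frozenset([i])] = c
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune step: skip candidates with an infrequent (k-1)-subset.
            if any(frozenset(s) not in level for s in combinations(cand, k - 1)):
                continue
            c = sum(cand <= t for t in db)
            if c / n >= min_support:
                next_level[cand] = c
        frequent.update(next_level)
        level = next_level
        k += 1
    return frequent

frequent = apriori(transactions, 0.4)
print(len(frequent))  # 13
```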

Limitation of Frequent Itemset Mining
Let minimum support be 0.4; then {Bread, Beer, Coke} is a frequent itemset. Consequently, {Bread, Beer}, {Bread, Coke}, {Beer, Coke}, {Bread}, {Beer}, {Coke} are frequent itemsets.
Observation: All nonempty subsets of a frequent itemset are frequent, so the number of frequent itemsets reported can be very large.
• Solutions:
  • Limit the length of frequent itemsets generated;
  • Mining Closed Itemsets;
  • Mining Maximal Frequent Itemsets.

Closed Itemset and Maximal Frequent Itemset
• An itemset X is closed if none of its immediate supersets has the same support as X.
• An itemset X is maximal frequent if X is frequent and none of its immediate supersets is frequent.
Questions: Give a closed itemset in the example. Give a maximal frequent itemset in the example, assuming minimum support is 0.6.
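
Both definitions can be checked mechanically by testing every immediate superset (the itemset plus one more item). The sketch below assumes a hypothetical toy database rather than the slide's example; the function names are illustrative.

```python
# Hypothetical toy database (NOT the table from the slide).
transactions = [
    {"bread", "milk"},
    {"bread", "beer", "coke", "diaper"},
    {"milk", "diaper", "eggs"},
    {"bread", "beer", "coke", "milk"},
    {"bread", "coke", "diaper"},
]

def support_count(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db)

def is_closed(itemset, db):
    """True if no immediate superset has the same support."""
    itemset = frozenset(itemset)
    all_items = {i for t in db for i in t}
    own = support_count(itemset, db)
    return all(support_count(itemset | {i}, db) < own for i in all_items - itemset)

def is_maximal_frequent(itemset, db, min_support):
    """True if the itemset is frequent and no immediate superset is frequent."""
    itemset = frozenset(itemset)
    all_items = {i for t in db for i in t}
    n = len(db)
    if support_count(itemset, db) / n < min_support:
        return False
    return all(support_count(itemset | {i}, db) / n < min_support
               for i in all_items - itemset)

print(is_closed({"bread", "beer"}, transactions))          # False
print(is_closed({"bread", "beer", "coke"}, transactions))  # True
```

In this toy database {bread, beer} is not closed because adding coke leaves the support unchanged at 2 of 5 transactions.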

Closed Itemset and Maximal Frequent Itemset
[Figure: nested sets] Maximal Frequent Itemsets ⊆ Frequent Closed Itemsets ⊆ Frequent Itemsets; closed itemsets that are not frequent lie outside the set of frequent itemsets.

Association Rule Learning Software • Free Software • MAFIA (http://himalaya-tools.sourceforge.net/Mafia/) • WEKA (http://www.cs.waikato.ac.nz/ml/weka/) … • Commercial Software • KXEN (http://www.kxen.com/) • STATISTICA (http://www.statsoft.com) …

Graph/Network Mining
[Figure: three example graphs: an undirected graph, a directed graph, and a bipartite graph]

Transactional data{0,1}-matrix  Bipartite Graph Bread 1 Milk 2 Diaper 3 Beer 4 Eggs 5 Coke Transactional Database Apples {0,1}-matrix

Mining Bipartite Graphs
• Biclique (Cartesian Product, Hyperrectangle, Block, Tile, etc.)
• Maximal Biclique
Figure source: Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan, Summarizing transactional databases with overlapped hyperrectangles, Data Mining and Knowledge Discovery, (2011) 23:215-251, Springer.

Mining Bipartite Graphs Associate Rule Learning Bread 1 Milk 2 Diaper 3 Beer 4 Eggs 5 Coke Transactional Database Apples Closed Itemsets Maximal Bicliques

Research: Transactional Data Summarization and Visualization
• Motivation
  • Too many itemsets are generated by frequent itemset mining
  • Maximal frequent itemsets are still not succinct enough in many cases
• Our work: Succinct transactional data summarization
[Figure: a transactional database over transactions t1-t8 and items i1-i9, summarized by three hyperrectangles: {t1,t2,t7,t8}×{i1,i2,i8,i9}, {t4,t5}×{i4,i5,i6}, and {t2,t3,t6,t7}×{i2,i3,i7,i8}]
Total Covering Cost = 8+5+8 = 21
Reference: Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan, Summarizing transactional databases with overlapped hyperrectangles, Data Mining and Knowledge Discovery, (2011) 23:215-251, Springer.
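
The total of 21 is consistent with charging each hyperrectangle the number of transactions plus the number of items it lists. The sketch below reproduces that arithmetic for the three hyperrectangles named on the slide; treating the cost as |T| + |I| is an assumption inferred from the 8+5+8 breakdown, and the cited paper remains the authority on the exact definition.

```python
# The three hyperrectangles from the slide, as (transactions, items) pairs.
hyperrectangles = [
    ({"t1", "t2", "t7", "t8"}, {"i1", "i2", "i8", "i9"}),
    ({"t4", "t5"}, {"i4", "i5", "i6"}),
    ({"t2", "t3", "t6", "t7"}, {"i2", "i3", "i7", "i8"}),
]

def cost(rect):
    """Assumed cost of one hyperrectangle: |T| + |I|."""
    tset, iset = rect
    return len(tset) + len(iset)

costs = [cost(r) for r in hyperrectangles]
total_cost = sum(costs)

# Matrix cells described by the summary (overlapping cells counted once).
covered = {(t, i) for tset, iset in hyperrectangles for t in tset for i in iset}

print(costs, total_cost)  # [8, 5, 8] 21
print(len(covered))       # 34
```

Listing 21 identifiers is enough to describe 34 cells, which is the sense in which the summary is succinct.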

Research: Transactional Data Summarization and Visualization • Given a set of discovered submatrices of a (0,1) matrix, how can we reorder the rows and columns of the data matrix to best display these submatrices and their relationship? Figure Source: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

Research: Transactional Data Summarization and Visualization
• Relationship between matrix visualization cost and hypergraph cost
[Figure: a (0,1)-matrix over transactions t1-t8 and items i1-i9, together with two hypergraphs HG1 and HG2 derived from it]
Reference: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

Visualization effects Figure Source: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

Visualization effects (continued) Figure Source: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

Knowledge Discovery and Data Mining in Research Informatics: Transactional Data Transformation/Evaluation
A question from matrix completion: how to complete or evaluate a (0,1)-matrix? Consider each transaction to be a customer. What is each customer’s attitude towards un-purchased items (i.e., 0 entries)?

Support Pattern Measurement used in this work
Biomedical Informatics question: How to efficiently transform M into F defined above, such that F can unbiasedly predict the unknown gene-phenotype relationships?
Figure Source and Reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

Find support patterns and calculate F(i,j) for one entry
• Find support patterns for the magenta entry (4,d)
• Find the maximum edge biclique: F(4,d) = 6
[Figure: a (0,1)-matrix with rows 1-8 and columns a-j, shown before and after reordering, with the support patterns of entry (4,d) highlighted]

IndEvi Algorithm in a Nutshell
• Assume the input is a set of maximal cliques of the original (0,1)-matrix.
• Project each maximal clique horizontally and vertically. Let C be the maximal clique shown by the shaded area. Can you figure out how to calculate F_C(i,j) for an entry (i,j)?
• Each entry remembers the largest F_C(i,j) with respect to all Cs. Please refer to the paper for the algorithm details.
Figure Source and Reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

Application in Prioritizing Human Disease Genes
• Transactional data: gene-to-phenotype (G2P) dataset from http://human-phenotype-ontology.org (10/03/2010)
• Closed itemset generator: MAFIA (http://himalaya-tools.sourceforge.net/Mafia/)
• Platform: Linux, C++, STL
• Cross-validation platform (10/04/2010): www.geneanswers.com (GACOM)

Results
• Among all 34503 (=|E|) known gene-phenotype relations, 4598 (=|E’|, i.e., x = 13.3264% of them) have the gene ranked among the top 0.1107% (=y%) of the 1807 candidate genes for that relation, achieving a 120.4-fold enrichment (x/y = 13.3264/0.1107).
[Figure: results plotted against rank cutoff]
Figure Source and Reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.
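
The fold-enrichment figure can be rechecked from the counts on the slide. The sketch below only redoes the arithmetic; the variable names are illustrative, and the remark that 0.1107% of 1807 genes corresponds to roughly the top 2 candidates is an inference from the numbers, not something the slide states.

```python
total_relations = 34503   # |E|: all known gene-phenotype relations
top_ranked = 4598         # |E'|: relations whose gene ranks within the cutoff
candidate_genes = 1807
y_percent = 0.1107        # cutoff as a percentage of candidates (from the slide)

# x: percentage of known relations recovered within the cutoff.
x_percent = 100 * top_ranked / total_relations
fold_enrichment = x_percent / y_percent

print(round(x_percent, 4))        # 13.3264
print(round(fold_enrichment, 1))  # 120.4

# Inference only: a cutoff of 2 genes out of 1807 gives the same percentage.
print(round(100 * 2 / candidate_genes, 4))  # 0.1107
```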

Case Study: Colon Cancer Table Source: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

Case Study: Breast Cancer Table Source: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

Case Study: Osteoarthritis
• Supporting pattern (by IndEviRe) for TNXB: {COL3A1, COL5A1, COL5A2, TNXB}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, JOINT DISLOCATION, MITRAL VALVE PROLAPSE, SOFT SKIN, OSTEOARTHRITIS}
• Supporting pattern (by IndEviRe) for VWF: {COL3A1, COL5A1, COL5A2, TNXB, VWF}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, MITRAL VALVE PROLAPSE, OSTEOARTHRITIS}
Table Source and reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

Knowledge Discovery and Data Mining in Research Informatics: Mining (genomic) network data
• Gene Expression
[Figure: a gene-by-sample expression matrix (genes gene1, gene2, gene3, …; samples S1, S2, S3, …) is turned into a gene-by-gene correlation matrix]
• Pearson Correlation
• Spearman Correlation
• Distance Correlation, 2009
• Maximal Information Coefficient (MIC), 2011
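
The first two measures are easy to compute by hand, which helps show how they differ: Pearson measures linear association, while Spearman is Pearson applied to ranks and therefore captures any monotone relationship. The expression values below are invented for illustration, and the pure-Python functions are a sketch rather than a replacement for a statistics library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical expression profiles of three genes across five samples.
gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene2 = [1.0, 4.0, 9.0, 16.0, 25.0]   # monotone but nonlinear in gene1
gene3 = [5.0, 4.0, 3.0, 2.0, 1.0]     # perfectly anti-correlated with gene1

print(pearson(gene1, gene2), spearman(gene1, gene2))
print(pearson(gene1, gene3))
```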

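Finding Highly Correlated Gene Clusters
[Figure: the gene-by-gene correlation matrix (gene1, gene2, gene3, …) shown next to the corresponding gene network, in which highly correlated genes are connected]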

Dense Subgraph Mining: Clique Enumeration and Merging
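
As a rough illustration of the enumeration step, the sketch below thresholds a handful of invented pairwise correlations into a gene network and lists its maximal cliques with the classic Bron-Kerbosch recursion. The correlation values, the 0.9 threshold, and the choice of Bron-Kerbosch are assumptions made for this example; the slide does not say which enumeration or merging procedure the course uses, and the merging step is not shown here.

```python
# Hypothetical pairwise correlations between five genes.
correlations = {
    ("g1", "g2"): 0.95, ("g1", "g3"): 0.91, ("g2", "g3"): 0.93,
    ("g3", "g4"): 0.92, ("g1", "g4"): 0.30, ("g2", "g4"): 0.25,
    ("g4", "g5"): 0.20, ("g1", "g5"): 0.10,
}
THRESHOLD = 0.9  # assumed cutoff for calling two genes highly correlated

# Build an undirected graph: connect gene pairs at or above the threshold.
genes = sorted({g for pair in correlations for g in pair})
adjacency = {g: set() for g in genes}
for (a, b), r in correlations.items():
    if r >= THRESHOLD:
        adjacency[a].add(b)
        adjacency[b].add(a)

def maximal_cliques(adj):
    """Bron-Kerbosch (no pivoting). adj maps each vertex to its neighbour set."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques

for clique in sorted(maximal_cliques(adjacency), key=sorted):
    print(sorted(clique))
```

An isolated gene such as g5 is reported as a clique of size one; in practice such trivial cliques would be filtered out before any merging.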

Study and Validation of Discovered Gene Networks
• Gene enrichment test: http://toppgene.cchmc.org/
• Survival test (logrank test), e.g. http://www.biomedcentral.com/1471-2105/13/S2/S12/
• Ingenuity Pathway Analysis: http://www.ingenuity.com/
• Genomic locations: http://genome.ucsc.edu/
• Wet lab experiments: http://parvinlab.bmi.ohio-state.edu

Knowledge Discovery and Data Mining in Research Informatics: Indexing the UMLS for Knowledge Discovery
• Unified Medical Language System (UMLS): A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains:
  • Metathesaurus
  • Semantic Network
  • SPECIALIST Lexicon
• Maintained by the US National Library of Medicine
• Website: http://www.nlm.nih.gov/research/umls/

UMLS - Metathesaurus
• Number of biomedical concepts > 1 million
• Drawn from over 100 incorporated controlled source vocabularies, including:
  • ICD (International Statistical Classification of Diseases and Related Health Problems)
  • MeSH (Medical Subject Headings)
  • SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms)
  • HUGO
  • OMIM (Online Mendelian Inheritance in Man) …
http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

Detailed Data of UMLS Reference: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

Distance
The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v?
Query: d_G(1, 11) = 3
[Figure: example directed graph on vertices 1-15]
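
Without any index, a distance query on an unweighted directed graph is answered by breadth-first search. The sketch below shows that baseline; the edge list is hypothetical (the slide's figure is not recoverable from the text) and was merely chosen so that the query from 1 to 11 also returns 3. It is not the indexing method of the cited paper, which is designed to avoid running such a search for every query on a graph of UMLS scale.

```python
from collections import deque

# Hypothetical directed graph: vertex -> list of successors.
graph = {
    1: [3, 4],
    2: [5],
    3: [6],
    4: [7],
    6: [11],
    7: [10],
    10: [11],
}

def distance(graph, source, target):
    """Number of edges on a shortest directed path, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for successor in graph.get(node, ()):
            if successor not in dist:
                dist[successor] = dist[node] + 1
                queue.append(successor)
    return None

print(distance(graph, 1, 11))  # 3
print(distance(graph, 11, 1))  # None
```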

Path
The problem: Given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v?
Example: find a path from 1 to 11
[Figure: example directed graph on vertices 1-15]
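
A path query can be answered by the same breadth-first search if each visited vertex remembers the vertex it was reached from. The graph below is again a hypothetical stand-in for the slide's figure, and `find_path` is an illustrative name; the function returns one shortest path, not all paths.

```python
from collections import deque

# Hypothetical directed graph: vertex -> list of successors.
graph = {
    1: [3, 4],
    2: [5],
    3: [6],
    4: [7],
    6: [11],
    7: [10],
    10: [11],
}

def find_path(graph, source, target):
    """Return one shortest directed path as a list of vertices, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for successor in graph.get(node, ()):
            if successor not in parent:
                parent[successor] = node
                queue.append(successor)
    return None

print(find_path(graph, 1, 11))  # [1, 3, 6, 11]
```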

Degree Distribution in the UMLS Graph Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

Decentralization Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

Application: Disease Gene Prioritization
• 8,134 disease concepts from OMIM (Online Mendelian Inheritance in Man), selected by restricting the semantic type to “Disease or Syndrome” or “Neoplastic Process”.
• 29,333 genes from HUGO (Human Genome Organisation)
Reference: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

closeness measure and fold enrichment Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

Recall Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

Chronic Lymphocytic Leukemia (CLL) Table source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

Breast Carcinoma Table source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

Thanks. Questions?
