
Introduction to Research Informatics - Data Mining and Visualization, BMI 5740, Spring 2013

Yang Xiang, Ph.D. (yxiang@bmi.osu.edu), Department of Biomedical Informatics, The Ohio State University


Presentation Transcript


  1. Introduction to Research Informatics - Data Mining and Visualization, BMI 5740, Spring 2013. Yang Xiang, Ph.D., yxiang@bmi.osu.edu, Department of Biomedical Informatics, The Ohio State University

  2. Outline • Data Mining Background • Association Rule Learning • Basic concepts • Frequent itemset mining, Apriori principle • Closed itemsets, maximal frequent itemsets • Graph/Network Mining • Basic concepts • Bipartite graph mining • Transactional Data Summarization and Visualization • Knowledge Discovery and Data Mining in Research Informatics • Transactional Data Transformation/Evaluation • Mining (genomic) network data • Indexing the Unified Medical Language System for knowledge discovery

  3. Data Mining Background: Biomedical Data Types and Sources • Biomedical Data • Genomic Data • Clinical Data • Electronic Health Record … • Data sources • Gene Expression Omnibus (GEO) • The Cancer Genome Atlas (TCGA) • Unified Medical Language System (UMLS) … Many of these data are transactional data or other types of network data

  4. Data Mining Workflow: Data Selection and Pre-processing → Preprocessed data → Data Mining → Raw results → Post-processing → Results → Cross-validation and evaluation → Knowledge

  5. Data Mining Methods • Association Rule Learning/Frequent Itemset Mining → Handling transactional data (which is essentially a type of graph data) • Graph/Network Mining → Handling various graph data • Clustering • Typical Machine Learning Approaches • Artificial Neural Networks • Decision Trees • Support Vector Machines • K-Nearest Neighbor • Bayesian Methods …

  6. Association Rule Learning: Grocery store transactions. Many transactional data exist in the biomedical field, such as: (1) Electronic Health Records (2) Gene-phenotype associations (3) Drug side-effect associations … Some rules observed: {Bread, Beer} → {Coke}; {Diaper} → {Eggs}. Reference: Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Third Edition, 2011

  7. Important Concepts and Definitions • Item/Transaction • Itemset • Support of an itemset: the fraction of transactions that contain the itemset, e.g. sup({bread, beer})=2/5 • Frequent Itemset: an itemset whose support is greater than or equal to a minimum support threshold • Confidence of an association rule X → Y: measures how often Y appears in transactions containing X, i.e. sup(X ∪ Y)/sup(X) Questions: 1. Given minimum support 0.6, is {bread, coke} a frequent itemset? 2. What is the confidence of the rule {Bread, Beer} → {Coke}?
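
The definitions above can be sketched in a few lines of Python. The transaction table from the slide is not part of this transcript, so `TRANSACTIONS` below is a hypothetical toy database that only reuses the item names; the function names are likewise illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical toy database (NOT the table shown on the slide).
TRANSACTIONS = [
    {"Bread", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer", "Coke"},
    {"Milk", "Diaper", "Eggs"},
    {"Bread", "Coke", "Apples"},
    {"Milk", "Diaper", "Eggs", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """sup(X union Y) / sup(X) for the rule X -> Y.

    Raises ZeroDivisionError if X never occurs in the database.
    """
    x = set(antecedent)
    xy = x | set(consequent)
    return support(xy, transactions) / support(x, transactions)


def is_frequent(itemset, transactions, min_support):
    """True if the support of `itemset` reaches the minimum support threshold."""
    return support(itemset, transactions) >= min_support


print(support({"Bread", "Beer"}, TRANSACTIONS))
print(confidence({"Bread", "Beer"}, {"Coke"}, TRANSACTIONS))
print(is_frequent({"Bread", "Coke"}, TRANSACTIONS, Fraction(3, 5)))
```

Exact fractions are used instead of floats so that a threshold such as 0.6 (3/5) is compared without rounding error.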

  8. Association Rule Mining • Goal: Searching for high-confidence and high-support rules • Approaches: • Brute-force (computationally prohibitive) • Two-step approach: • Frequent Itemset Generation • Rule Generation

  9. Frequent Itemset Generation/Mining • Brute-force approach (computationally prohibitive) • Apriori principle: • For any two itemsets X and Y, if X ⊆ Y, then support(X) ≥ support(Y) • Pruning the search tree (branch and bound) using the Apriori principle
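
A minimal level-wise sketch of frequent itemset mining with Apriori pruning follows. It is an illustration under assumptions, not the course's implementation: the toy database is hypothetical, and the threshold is given as an absolute count (`min_count`, roughly minimum support times the number of transactions) rather than as a fraction.

```python
from itertools import combinations

# Hypothetical toy database (NOT the table shown on the slide).
TRANSACTIONS = [
    {"Bread", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer", "Coke"},
    {"Milk", "Diaper", "Eggs"},
    {"Bread", "Coke", "Apples"},
    {"Milk", "Diaper", "Eggs", "Coke"},
]


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: merge two frequent (k-1)-itemsets into a k-itemset.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step (Apriori principle): a candidate can only be
            # frequent if every one of its (k-1)-subsets is frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count the surviving candidates against the database.
        current = {}
        for cand in candidates:
            n = sum(1 for t in transactions if cand <= t)
            if n >= min_count:
                current[cand] = n
        frequent.update(current)
        k += 1
    return frequent


frequent = apriori(TRANSACTIONS, min_count=2)  # minimum support 0.4 on 5 transactions
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is where the Apriori principle pays off: a candidate with any infrequent subset is discarded without scanning the database.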

  10. Limitation of Frequent Itemset Mining Let minimum support be 0.4, then {Bread, Beer, Coke} is a Frequent Itemset. Consequently, {Bread, Beer}, {Bread, Coke}, {Beer, Coke}, {Bread}, {Beer}, {Coke} are Frequent Itemsets. Observation: All nonempty subsets of a frequent itemset are frequent, so a frequent itemset with k items implies 2^k − 1 frequent itemsets. • Solutions: • Limit the length of frequent itemsets generated; • Mining Closed Itemsets; • Mining Maximal Itemsets.

  11. Closed Itemset and Maximal Frequent Itemset • An Itemset X is closed if none of its immediate supersets has the same support as X. • An Itemset X is maximal frequent if none of its immediate supersets is frequent Questions: Give a closed itemset in the example. Give a maximal frequent itemset in the example, assuming minimum support is 0.6
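
The two definitions can be checked by brute force on a small database. The sketch below uses a hypothetical toy database (the slide's example table is not in this transcript) and enumerates every itemset that occurs at least once, which is only feasible for tiny inputs.

```python
from itertools import combinations

# Hypothetical toy database (NOT the table shown on the slide).
TRANSACTIONS = [
    {"Bread", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer", "Coke"},
    {"Milk", "Diaper", "Eggs"},
    {"Bread", "Coke", "Apples"},
    {"Milk", "Diaper", "Eggs", "Coke"},
]


def all_supports(transactions):
    """Support count of every nonempty itemset contained in some transaction."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def closed_itemsets(counts, all_items):
    """Itemsets with no immediate superset (one extra item) of equal support."""
    return {
        x for x, c in counts.items()
        if all(counts.get(x | {i}, 0) != c for i in all_items - x)
    }


def maximal_frequent_itemsets(counts, all_items, min_count):
    """Frequent itemsets with no frequent immediate superset."""
    return {
        x for x, c in counts.items()
        if c >= min_count
        and all(counts.get(x | {i}, 0) < min_count for i in all_items - x)
    }


ALL_ITEMS = set().union(*TRANSACTIONS)
COUNTS = all_supports(TRANSACTIONS)
CLOSED = closed_itemsets(COUNTS, ALL_ITEMS)
MAXIMAL = maximal_frequent_itemsets(COUNTS, ALL_ITEMS, min_count=3)  # support 0.6

print(sorted(sorted(x) for x in MAXIMAL))
```

On this toy data every maximal frequent itemset is also closed, which matches the general containment between the two families.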

  12. Closed Itemset and Maximal Frequent Itemset (Venn diagram: Maximal Frequent Itemsets ⊆ Frequent Closed Itemsets ⊆ Frequent Itemsets; Closed Itemsets that are not frequent lie outside the Frequent Itemsets)

  13. Association Rule Learning Software • Free Software • MAFIA (http://himalaya-tools.sourceforge.net/Mafia/) • WEKA (http://www.cs.waikato.ac.nz/ml/weka/) … • Commercial Software • KXEN (http://www.kxen.com/) • STATISTICA (http://www.statsoft.com) …

  14. Graph/Network Mining (figure: examples of an undirected graph, a directed graph, and a bipartite graph)

  15. Transactional data ↔ {0,1}-matrix ↔ Bipartite Graph (figure: a transactional database with transactions 1-5 over the items Bread, Milk, Diaper, Beer, Eggs, Coke, Apples, shown as a {0,1}-matrix and as a bipartite graph)
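
The three views of the same data can be converted into one another mechanically. The sketch below uses a hypothetical toy database with the item names from the slide; the actual table is not in this transcript.

```python
# Hypothetical toy database (NOT the table shown on the slide).
TRANSACTIONS = [
    {"Bread", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer", "Coke"},
    {"Milk", "Diaper", "Eggs"},
    {"Bread", "Coke", "Apples"},
    {"Milk", "Diaper", "Eggs", "Coke"},
]


def to_matrix(transactions):
    """Return (column labels, {0,1}-matrix) with one row per transaction."""
    items = sorted(set().union(*transactions))
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, matrix


def to_bipartite_edges(transactions):
    """Return edges (transaction id, item); transaction ids start at 1."""
    return [
        (tid, item)
        for tid, t in enumerate(transactions, start=1)
        for item in sorted(t)
    ]


items, matrix = to_matrix(TRANSACTIONS)
print(items)
for row in matrix:
    print(row)
print(to_bipartite_edges(TRANSACTIONS)[:3])
```

Each 1 in the matrix corresponds to exactly one edge between a transaction vertex and an item vertex.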

  16. Mining Bipartite Graphs • Biclique (Cartesian Product, Hyperrectangle, Block, Tile, etc.) • Maximal Biclique Figure source: Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan, Summarizing transactional databases with overlapped hyperrectangles, Data Mining and Knowledge Discovery, (2011) 23:215-251, Springer.

  17. Mining Bipartite Graphs and Association Rule Learning (figure: the transactional database with transactions 1-5 over the items Bread, Milk, Diaper, Beer, Eggs, Coke, Apples, drawn as a bipartite graph) Closed Itemsets correspond to Maximal Bicliques
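
The link between closed itemsets and maximal bicliques can be illustrated by brute force: for an itemset X, take all transactions containing X, then take all items shared by those transactions. The resulting (transaction set, itemset) pair cannot be extended on either side. The sketch assumes a hypothetical toy database and is only practical for a handful of items.

```python
from itertools import combinations

# Hypothetical toy database (NOT the table shown on the slide).
TRANSACTIONS = [
    {"Bread", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer", "Coke"},
    {"Milk", "Diaper", "Eggs"},
    {"Bread", "Coke", "Apples"},
    {"Milk", "Diaper", "Eggs", "Coke"},
]


def tidset(itemset, transactions):
    """Ids (starting at 1) of the transactions containing every item of `itemset`."""
    return frozenset(
        tid for tid, t in enumerate(transactions, start=1) if itemset <= t
    )


def common_items(tids, transactions):
    """Items shared by all transactions whose ids are in `tids` (nonempty)."""
    sets = [set(transactions[tid - 1]) for tid in tids]
    return frozenset(set.intersection(*sets))


def maximal_bicliques(transactions):
    """Enumerate (transaction ids, closed itemset) pairs by brute force."""
    items = sorted(set().union(*transactions))
    result = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            tids = tidset(frozenset(combo), transactions)
            if tids:  # skip itemsets that occur in no transaction
                result.add((tids, common_items(tids, transactions)))
    return result


for tids, itemset in sorted(maximal_bicliques(TRANSACTIONS),
                            key=lambda p: (sorted(p[0]), sorted(p[1]))):
    print(sorted(tids), sorted(itemset))
```

Many different starting itemsets collapse onto the same pair, which is why the closed itemsets are a more compact description than all frequent itemsets.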

  18. Research: Transactional Data Summarization and Visualization • Motivation • Too many itemsets are generated by frequent itemset mining • Maximal frequent itemsets are still not succinct enough in many cases • Our work: Succinct transactional data summarization (figure: a database over transactions t1-t8 and items i1-i9 summarized by three hyperrectangles: {t1,t2,t7,t8}×{i1,i2,i8,i9}, {t4,t5}×{i4,i5,i6}, {t2,t3,t6,t7}×{i2,i3,i7,i8}) Total Covering Cost=8+5+8=21 Reference: Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan, Summarizing transactional databases with overlapped hyperrectangles, Data Mining and Knowledge Discovery, (2011) 23:215-251, Springer.
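
The covering cost on this slide can be reproduced with a short sketch. It assumes, based on the arithmetic 8+5+8, that the cost of one hyperrectangle is the number of its transactions plus the number of its items; the cited paper should be consulted for the exact definition.

```python
# The three hyperrectangles listed on the slide, as (transactions, items) pairs.
SUMMARY = [
    ({"t1", "t2", "t7", "t8"}, {"i1", "i2", "i8", "i9"}),
    ({"t4", "t5"}, {"i4", "i5", "i6"}),
    ({"t2", "t3", "t6", "t7"}, {"i2", "i3", "i7", "i8"}),
]


def covering_cost(hyperrectangles):
    """Sum of |T| + |I| over all hyperrectangles (assumed cost model)."""
    return sum(len(ts) + len(its) for ts, its in hyperrectangles)


def covered_cells(hyperrectangles):
    """Set of (transaction, item) cells covered by at least one hyperrectangle."""
    return {(t, i) for ts, its in hyperrectangles for t in ts for i in its}


print(covering_cost(SUMMARY))       # 8 + 5 + 8
print(len(covered_cells(SUMMARY)))  # overlapping cells are counted once
```

Because the first and third hyperrectangles overlap, the number of covered cells is smaller than the sum of the three areas.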

  19. Research: Transactional Data Summarization and Visualization • Given a set of discovered submatrices of a (0,1) matrix, how can we reorder the rows and columns of the data matrix to best display these submatrices and their relationship? Figure Source: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

  20. Research: Transactional Data Summarization and Visualization • Relationship between matrix visualization cost and hypergraph cost (figure: a matrix over transactions t1-t8 and items i1-i9, with hypergraph HG1 drawn on the items and hypergraph HG2 drawn on the transactions) Reference: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

  21. Visualization effects Figure Source: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

  22. Visualization effects (continued) Figure Source: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

  23. Visualization effects (continued) Figure Source: Ruoming Jin, Yang Xiang, David Fuhry, Feodor F. Dragan, Overlapping matrix pattern visualization: A hypergraph approach, ICDM, 2008, 313-322.

  24. Knowledge Discovery and Data Mining in Research Informatics: Transactional Data Transformation/Evaluation A question from matrix completion: how to complete or evaluate a (0,1)-matrix? Consider each transaction as a customer. What is each customer’s attitude towards un-purchased items (i.e., 0 entries)?

  25. Support Pattern: Measurement used in this work (figure) Biomedical Informatics question: How to efficiently transform M into F defined above, such that F can predict the unknown gene-phenotype relationships without bias? Figure Source and Reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

  26. Find support patterns and calculate F(i,j) for one entry (figure: a (0,1)-matrix with rows 1-8 and columns a-j, shown before and after reordering) Find support patterns for the magenta entry (4,d). Find the maximum edge biclique: F(4,d)=6

  27. IndEvi Algorithm in a Nutshell • Assume the input is a set of maximal cliques of the original (0,1)-matrix. • Project each maximal clique horizontally and vertically. Let C be the maximal clique shown by the shaded area. Can you figure out how to calculate F_C(i,j) for an entry (i,j)? • Each entry will remember the largest F_C(i,j) with respect to all Cs. Please refer to the paper for the algorithm details. Figure Source and Reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

  28. Application in Prioritizing Human Disease Genes • Transactional data: gene-to-phenotype (G2P) dataset from http://human-phenotype-ontology.org (10/03/2010) • Closed itemset generator: MAFIA (http://himalaya-tools.sourceforge.net/Mafia/) • Platform: Linux, C++, STL • Cross-validation platform (10/04/2010): www.geneanswers.com (GACOM)

  29. Results • Among all 34503 (=|E|) known gene-phenotype relations, 4598 (=|E’|) have their gene ranked among the top 0.1107% (=y%) of the 1807 candidate genes for the phenotype. These 4598 relations are x% = 13.3264% of |E|, which gives a 120.4-fold enrichment (x/y = 13.3264/0.1107). (figure axis: Rank Cutoff) Figure Source and Reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.
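
The fold-enrichment figure can be re-derived from the counts on the slide. In the sketch below, the rank cutoff of 2 genes is an inference (0.1107% of 1807 candidates is about 2), not a value stated in the transcript, and the function name is illustrative.

```python
def fold_enrichment(num_hits, num_relations, rank_cutoff, num_candidates):
    """Return (x, y, x / y), with x and y expressed in percent.

    x: share of known relations whose gene is ranked within the cutoff.
    y: share of the candidate list covered by the cutoff.
    """
    x = 100.0 * num_hits / num_relations
    y = 100.0 * rank_cutoff / num_candidates
    return x, y, x / y


# 4598 of 34503 relations recovered; cutoff of 2 genes (inferred) out of 1807.
x, y, fold = fold_enrichment(4598, 34503, 2, 1807)
print(round(x, 4), round(y, 4), round(fold, 1))
```

A fold enrichment of 1 would mean the ranking does no better than picking candidate genes at random.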

  30. Case Study: Colon Cancer Table Source: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

  31. Case Study: Breast Cancer Table Source: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

  32. Case Study: Osteoarthritis • Supporting pattern (by IndEviRe) for TNXB: {COL3A1, COL5A1, COL5A2, TNXB}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, JOINT DISLOCATION, MITRAL VALVE PROLAPSE, SOFT SKIN, OSTEOARTHRITIS} • Supporting pattern (by IndEviRe) for VWF: {COL3A1, COL5A1, COL5A2, TNXB, VWF}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, , MITRAL VALVE PROLAPSE, OSTEOARTHRITIS} Table Source and reference: Yang Xiang, Philip R.O. Payne, Kun Huang, Transactional database transformation and its application in prioritizing human disease genes, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(1), 294-304.

  33. Knowledge Discovery and Data Mining in Research Informatics: Mining (genomic) network data • Gene Expression (figure: a gene-by-sample expression matrix, gene1, gene2, gene3, … by S1, S2, S3, …, turned into a gene-by-gene correlation matrix) • Pearson Correlation • Spearman Correlation • Distance Correlation, 2009 • Maximal Information Coefficient (MIC), 2011
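
The first two correlation measures are easy to sketch in plain Python. The expression values below are made up for illustration, and Spearman correlation is computed as the Pearson correlation of average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression profiles over five samples S1..S5.
EXPRESSION = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "gene3": [1.0, 4.0, 9.0, 16.0, 25.0],
}

for a in EXPRESSION:
    print(a, [round(pearson(EXPRESSION[a], EXPRESSION[b]), 3) for b in EXPRESSION])
```

On this toy data gene1 and gene3 are monotonically but not linearly related, so their Spearman correlation is 1 while their Pearson correlation is lower.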

  34. Finding Highly Correlated Gene Clusters (figure: the gene-by-gene correlation matrix and the corresponding network over gene1, gene2, gene3, …)

  35. Dense Subgraph Mining: Clique Enumeration and Merging
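
Clique enumeration can be sketched with the standard Bron-Kerbosch algorithm with pivoting. This is a textbook technique chosen for illustration; the transcript does not say which enumeration or merging method the course used, and the merging step is not shown. The gene graph below is hypothetical.

```python
def graph_from_edges(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already processed.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical graph: an edge joins two genes whose correlation passed a threshold.
EDGES = [
    ("gene1", "gene2"), ("gene1", "gene3"), ("gene2", "gene3"),
    ("gene3", "gene4"), ("gene4", "gene5"),
]

for clique in maximal_cliques(graph_from_edges(EDGES)):
    print(sorted(clique))
```

The number of maximal cliques can be exponential in the number of vertices, so this sketch is meant for small thresholded networks.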

  36. Study and Validation of Discovered Gene Networks • Gene enrichment test: http://toppgene.cchmc.org/ • Survival test (logrank test), e.g. http://www.biomedcentral.com/1471-2105/13/S2/S12/ • Ingenuity Pathway Analysis: http://www.ingenuity.com/ • Genomic locations: http://genome.ucsc.edu/ • Wet lab experiments: http://parvinlab.bmi.ohio-state.edu

  37. Knowledge Discovery and Data Mining in Research Informatics: Indexing the UMLS for Knowledge Discovery • Unified Medical Language System (UMLS): A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains: • Metathesaurus • Semantic Network • SPECIALIST Lexicon • Maintained by the US National Library of Medicine • Website: http://www.nlm.nih.gov/research/umls/

  38. UMLS - Metathesaurus • Number of biomedical concepts > 1 million • Concepts stem from over 100 incorporated controlled source vocabularies: • ICD (International Statistical Classification of Diseases and Related Health Problems) • MeSH (Medical Subject Headings) • SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) • HUGO • OMIM (Online Mendelian Inheritance in Man) … http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

  39. Detailed Data of UMLS Reference: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  40. Distance The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v? Query: d_G(1, 11) = 3 (figure: example directed graph on vertices 1-15)
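
On an unweighted directed graph, a distance query can be answered with breadth-first search. The edge list below is hypothetical: the edges of the slide's figure are not recoverable from this transcript, so the graph was simply chosen to have vertices 1-15. This is a plain baseline, not the indexing scheme of the cited paper.

```python
from collections import deque

# Hypothetical directed graph on vertices 1..15 (NOT the slide's figure).
GRAPH = {
    1: [3, 4], 2: [4, 5], 3: [6, 7], 4: [7, 8], 5: [9],
    6: [10], 7: [11], 8: [12], 9: [12], 10: [14],
    11: [13], 12: [13], 13: [15], 14: [15], 15: [],
}


def distance(graph, source, target):
    """Number of edges on a shortest directed path, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in graph.get(u, ()):
            if v not in dist:  # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


print(distance(GRAPH, 1, 11))
print(distance(GRAPH, 11, 1))
```

Each query costs time linear in the size of the graph, which is what motivates precomputed indexes on a graph as large as the UMLS.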

  41. Path The problem: Given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v? Example: find a path from 1 to 11 (figure: example directed graph on vertices 1-15)
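
A path query can be answered by breadth-first search that records, for each visited vertex, the vertex it was reached from. The graph below is hypothetical (the slide's edges are not in this transcript), and the sketch returns a single shortest path rather than all paths.

```python
from collections import deque

# Hypothetical directed graph on vertices 1..15 (NOT the slide's figure).
GRAPH = {
    1: [3, 4], 2: [4, 5], 3: [6, 7], 4: [7, 8], 5: [9],
    6: [10], 7: [11], 8: [12], 9: [12], 10: [14],
    11: [13], 12: [13], 13: [15], 14: [15], 15: [],
}


def shortest_path(graph, source, target):
    """Return one shortest directed path as a list of vertices, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk the parent pointers back to the source.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


print(shortest_path(GRAPH, 1, 11))
```

Which of several equally short paths is returned depends on the order of the neighbour lists.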

  42. Degree Distribution in the UMLS Graph Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  43. Decentralization Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  44. Application: Disease Gene Prioritization • 8,134 disease concepts from OMIM (Online Mendelian Inheritance in Man), selected by semantic type “Disease or Syndrome” or “Neoplastic Process”. • 29,333 genes from HUGO (Human Genome Organisation) Reference: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  45. closeness measure and fold enrichment Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  46. Recall Figure source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  47. Chronic Lymphocytic Leukemia (CLL) Table source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  48. Breast Carcinoma Table source: Yang Xiang, Kewei Lu, Stephen L James, Tara B Borlawsky, Kun Huang, Philip R.O. Payne, k-Neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery, Journal of Biomedical Informatics, 2012, 45(2), pp 323-336.

  49. Thanks. Questions?