Explore the connections between disease and genomic data using data mining techniques. Discover patterns and associations that can lead to advancements in medical research and treatment.
Data Mining for Finding Connections of Disease and Medical and Genomic Characteristics Vipin Kumar William Norris Professor and Head, Department of Computer Science kumar@cs.umn.edu www.cs.umn.edu/~kumar Michael Steinbach Department of Computer Science
Increasing Amounts of Medical and Genomic Data • Electronic medical records are becoming increasingly common • Automated analysis of patient information is now possible • Obtaining genomic information is increasingly affordable • SNPs (single nucleotide polymorphisms) offer the potential of tests for disease or susceptibility to disease
Various Techniques Currently Used to Analyze the Relationship Between SNPs and Disease • Statistical Association Analysis • Chi-Square and Odds ratio • Logistic Regression • Multifactor Dimensionality Reduction • Attribute selection and construction followed by classification
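As a minimal sketch of the first technique listed above, the chi-square statistic and odds ratio for a single SNP can be computed from a 2x2 table of carriers versus non-carriers and cases versus controls. The counts and function names below are hypothetical, chosen only for illustration; the statistic is the plain Pearson chi-square without continuity correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a nonzero total")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]]."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts (not from the slides):
# rows = SNP carrier / non-carrier, columns = case / control
carrier_case, carrier_control = 30, 10
noncarrier_case, noncarrier_control = 20, 40

chi2 = chi_square_2x2(carrier_case, carrier_control,
                      noncarrier_case, noncarrier_control)
ratio = odds_ratio(carrier_case, carrier_control,
                   noncarrier_case, noncarrier_control)
print(f"chi-square = {chi2:.2f}, odds ratio = {ratio:.1f}")
```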
Challenges Facing Current Techniques • Statistical Association Analysis • Often lacks power, even for single SNP association • Combinatorial explosion when used for screening pairs, triples, etc. • Logistic Regression • Does not work well when there are many attributes • Multifactor Dimensionality Reduction • Impractical for many attributes due to its combinatorial nature • General Challenges • Disease or susceptibility to disease often results from interactions among many genetic, phenotypic, and environmental factors • Noise and nonlinear interactions • The Challenges of Whole-Genome Approaches to Common Diseases, Moore and Ritchie, JAMA, 2004
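The combinatorial explosion mentioned above is easy to quantify: screening all groups of k SNPs out of n means testing "n choose k" candidates. The sketch below uses a hypothetical panel size of 500,000 SNPs, which is an assumption for illustration and not a figure from the slides.

```python
from math import comb


def candidate_count(num_snps, order):
    """Number of distinct SNP groups of the given size (singles, pairs, triples, ...)."""
    return comb(num_snps, order)


num_snps = 500_000  # hypothetical panel size, not taken from the slides
for order in (1, 2, 3):
    print(f"groups of size {order}: {candidate_count(num_snps, order):,}")
```

Even at size 2 the number of candidates is already in the hundreds of billions for this panel size, which is why exhaustive screening of pairs and triples quickly becomes impractical.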
General Approach Using Data Mining Techniques • Create a data set that records the presence and absence of • Phenotypic characteristics • Genetic characteristics (SNPs) • Disease • Apply association analysis to find groups of phenotypic and genetic characteristics that are highly associated with disease • Uses characteristics of the patterns to prune the search space • Clustering and classification can also be applied
Traditional Association Analysis • Association analysis: analyzes relationships among items (attributes) in binary transaction data • Example data: market basket data • Data can be represented as a binary matrix • Applications in business and science • Two types of patterns • Itemsets: collections of items • Example: {Milk, Diaper} • Association rules: X → Y, where X and Y are itemsets • Example: {Milk} → {Diaper} [Figures: Set-Based Representation of Data; Binary Matrix Representation of Data]
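A minimal sketch of the two basic measures behind these patterns, support and confidence, on a small market basket. The five transactions below are hypothetical illustration data, and the function names are not from the slides.

```python
# Hypothetical market-basket transactions, each a set of items
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    x, y = set(x), set(y)
    support_x = support_count(x, transactions)
    if support_x == 0:
        raise ValueError("the antecedent never occurs")
    return support_count(x | y, transactions) / support_x


print(support_count({"Milk", "Diaper"}, transactions))  # itemset support
print(confidence({"Milk"}, {"Diaper"}, transactions))   # rule confidence
```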
The Need for Error-Tolerant Itemsets • An error-tolerant itemset (ETI) can have a fraction of the items missing in each transaction. Example: see the data in the table • Let the tolerated fraction of missing items be 1/4. In other words, each transaction needs to have at least 3/4 (75%) of the items. • X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4. • Algorithms to find ETIs are still in development • You can think of these ETIs as blocks in the data matrix
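The row-wise condition described above can be sketched as follows. The table from the slide is not available here, so the eight transactions are a hypothetical reconstruction in which every transaction is missing exactly one item of X or of Y; the name `eti_support` and the parameter `eps` (the tolerated fraction of missing items) are assumptions for illustration.

```python
def eti_support(itemset, transactions, eps):
    """Count transactions holding at least (1 - eps) of the itemset's items."""
    itemset = set(itemset)
    needed = (1 - eps) * len(itemset)
    return sum(1 for t in transactions if len(itemset & t) >= needed)


X = {"i1", "i2", "i3", "i4"}
Y = {"i5", "i6", "i7", "i8"}

# Hypothetical data: each transaction lacks exactly one item of X or of Y
transactions = [X - {item} for item in sorted(X)] + [Y - {item} for item in sorted(Y)]

print(eti_support(X, transactions, eps=0.25))  # tolerant support of X
print(eti_support(Y, transactions, eps=0.25))  # tolerant support of Y
print(eti_support(X, transactions, eps=0.0))   # strict (traditional) support of X
```

With this data neither X nor Y is supported by any transaction in the traditional sense, while both reach a support of 4 once a quarter of the items may be missing.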
ETIs for Finding Patterns in Phenotypic and Genomic Data • ETIs consist of • A set of patients and • A set of attributes such that • The block is relatively dense • These blocks identify sets of patients that are highly associated with certain sets of attributes and vice versa • If most of these patients share a disease, then these attributes (genetic and/or phenotypic) are candidate markers for the disease [Figure labels: X: set of patients; Y: set of attributes, i.e., SNPs, medical characteristics]
Example: Using ETIs to Find Rules • Can use ETIs to find better association rules • Example: Mushroom data set • Classifies 8124 mushrooms as poisonous (3916) or not (4208) • 117 other attributes, such as color, odor, etc. • Comparison of rules based on frequent itemsets or ETIs • One rule is {29, 48, 90} → p • Individual correlations are 0.62, 0.54, 0.18 • With traditional association analysis, this rule has a confidence of 1 and a support of 576. • This corresponds to a correlation of 0.18 • If we require only two of the three items, we still have a confidence of 1, but the support is 3312. • This corresponds to a correlation of 0.86 • Maximum single item correlation is 0.78
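The correlation of a rule with the class can be computed from counts alone. The sketch below assumes that "correlation" means the phi coefficient (the Pearson correlation of two binary variables), which the slide does not state, and takes the total number of records as 3916 + 4208 = 8124. The function and argument names are illustrative.

```python
from math import sqrt


def phi_from_counts(n, n_antecedent, n_class, n_both):
    """Phi coefficient between 'antecedent holds' and 'record is in the class'.

    n            total number of records
    n_antecedent records where the rule antecedent holds
    n_class      records belonging to the class
    n_both       records where both hold (the rule's support count)
    """
    n11 = n_both
    n10 = n_antecedent - n_both
    n01 = n_class - n_both
    n00 = n - n_antecedent - n_class + n_both
    denom = n_antecedent * (n - n_antecedent) * n_class * (n - n_class)
    if denom == 0:
        raise ValueError("phi is undefined when a margin is empty")
    return (n11 * n00 - n10 * n01) / sqrt(denom)


# Relaxed rule from the slide: support 3312 with confidence 1, so every record
# matching the antecedent is poisonous (n_antecedent == n_both).
phi_relaxed = phi_from_counts(n=8124, n_antecedent=3312, n_class=3916, n_both=3312)
print(round(phi_relaxed, 2))
```

Under these assumptions the relaxed rule comes out at about 0.86, in line with the figure on the slide.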
Generalizing ETIs to Blocks in a Data Matrix • Dense blocks consist of • A set of objects and • A set of attributes such that • The block is relatively dense • These blocks identify sets of objects that are highly associated with certain sets of attributes and vice versa • For gene expression data, this identifies transcription modules • Groups of genes that are coexpressed under a set of conditions [Figure labels: X: genes; Y: conditions under which genes are expressed; entries of the matrix are the level of a gene's expression under a condition]
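"Relatively dense" can be made concrete by comparing the fraction of ones inside a candidate block with the fraction in the whole matrix. The sketch below assumes a binary matrix (for expression data the levels would first have to be thresholded, which is not shown); the matrix and the name `block_density` are hypothetical.

```python
def block_density(matrix, rows, cols):
    """Fraction of entries equal to 1 in the sub-matrix picked by rows and cols."""
    rows, cols = list(rows), list(cols)
    if not rows or not cols:
        raise ValueError("a block needs at least one row and one column")
    ones = sum(matrix[r][c] for r in rows for c in cols)
    return ones / (len(rows) * len(cols))


# Hypothetical binary matrix: rows are objects, columns are attributes
matrix = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

block = block_density(matrix, rows=[0, 1, 2], cols=[0, 1, 2])
overall = block_density(matrix, rows=range(4), cols=range(5))
print(f"block density = {block:.2f}, overall density = {overall:.2f}")
```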
Techniques for Finding Relatively Dense Blocks in a Data Matrix • Algorithms for finding Error-Tolerant Itemsets • More work needed to develop algorithms • We are currently using support envelopes • Support Envelopes: A Technique for Exploring the Structure of Association Patterns, Steinbach, Tan, Kumar, KDD, 2004 • Subspace clustering • Similar in spirit to association analysis • Subspace Clustering for High Dimensional Data: A Review, Parsons, Haque, Liu, SIGKDD Explorations, 2004 • Co-clustering • Information-Theoretic Co-Clustering, Dhillon, Mallela, Modha, KDD, 2003 • We are currently exploring approaches to extend co-clustering for ETIs • Variety of other approaches • Matrix factorization • Graph partitioning
Support Envelopes • A support envelope contains all association patterns involving m or more transactions and n or more items • By association patterns we mean • Itemsets and variants (frequent, maximal, closed) • Error-Tolerant Itemsets (ETIs) An example of a support envelope involving characteristics of mushrooms. One of the attributes is 'gill-color:buff', which occurs in 1728 records, every one of which occurs with 13 other items (one of which is the attribute 'poisonous')
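One way to obtain a region with this property, assumed here for illustration rather than taken from the slides, is to prune iteratively: drop items that occur in fewer than m of the remaining transactions, drop transactions left with fewer than n items, and repeat until nothing changes. The data and the name `support_envelope` are hypothetical.

```python
def support_envelope(transactions, m, n):
    """Iteratively prune to transactions with >= n items and items in >= m transactions.

    transactions maps a transaction id to an iterable of items. The result maps
    the surviving transaction ids to their surviving items.
    """
    rows = {tid: set(items) for tid, items in transactions.items()}
    while True:
        # Count how many remaining transactions contain each item
        counts = {}
        for items in rows.values():
            for item in items:
                counts[item] = counts.get(item, 0) + 1
        frequent = {item for item, c in counts.items() if c >= m}
        # Remove infrequent items, then transactions that became too short
        pruned = {tid: items & frequent for tid, items in rows.items()}
        pruned = {tid: items for tid, items in pruned.items() if len(items) >= n}
        if pruned == rows:  # fixed point: nothing left to remove
            return pruned
        rows = pruned


# Hypothetical transactions
data = {
    "t1": ["a", "b", "c"],
    "t2": ["a", "b", "c"],
    "t3": ["a", "b", "d"],
    "t4": ["d", "e"],
}
print(support_envelope(data, m=2, n=2))
```

Any pattern with at least m supporting transactions and at least n items survives each pruning step, so it must lie inside the returned region.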
Visualizing Support Envelopes for Mushroom • One of the support envelopes (576, 23) is denser than its surrounding neighbors.
Using Weak Associations to Find Patterns • Apply the following principle: If most pairs of attributes or objects in a set have pairwise connections, then there is likely to be a strong association among them even if the pairwise associations are weak. • The hyperclique pattern uses this principle • Hyperclique Pattern Discovery, Xiong, Tan, and Kumar, to appear in DMKD, 2006 • Good for removing noise • Enhancing Data Analysis with Noise Removal, Xiong, Pandey, Steinbach, and Kumar, TKDE, 2006 • We have recently developed more general pairwise patterns • Custom Itemset Patterns, Steinbach and Kumar, submitted to KDD 2006
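The slide does not define the hyperclique measure. The sketch below uses h-confidence as it is usually defined in the hyperclique literature: the support of the itemset divided by the largest support of any single item in it, so a high value means every item strongly implies the rest. The transactions and function names are hypothetical.

```python
def support_count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def h_confidence(itemset, transactions):
    """support(itemset) / max over items of support({item})."""
    itemset = set(itemset)
    if not itemset:
        raise ValueError("itemset must not be empty")
    max_single = max(support_count({item}, transactions) for item in itemset)
    if max_single == 0:
        raise ValueError("no item of the itemset occurs in the data")
    return support_count(itemset, transactions) / max_single


# Hypothetical transactions
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "d"},
    {"d"},
]

for candidate in ({"a", "b"}, {"a", "b", "c"}, {"a", "d"}):
    print(sorted(candidate), h_confidence(candidate, transactions))
```

An itemset is then reported as a hyperclique pattern when its h-confidence reaches a user-chosen threshold.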
Application of Classification Techniques • Techniques must work with noisy, sparse, high-dimensional data • Success of multifactor dimensionality reduction indicates the usefulness of attribute selection and attribute creation • ETI and related patterns offer an alternative for feature extraction • Classification based on association analysis • SVM with the proper kernel