
Data Mining for Finding Connections of Disease and Medical and Genomic Characteristics

Explore the connections between disease and genomic data using data mining techniques. Discover patterns and associations that can lead to advancements in medical research and treatment.



Presentation Transcript

  1. Data Mining for Finding Connections of Disease and Medical and Genomic Characteristics • Vipin Kumar, William Norris Professor and Head, Department of Computer Science, kumar@cs.umn.edu, www.cs.umn.edu/~kumar • Michael Steinbach, Department of Computer Science

  2. Increasing Amounts of Medical and Genomic Data • Electronic medical records are becoming increasingly common • Automated analysis of patient information is now possible • Obtaining genomic information is increasingly affordable • SNPs offer the potential of tests for disease or susceptibility to disease

  3. Various Techniques Currently Used to Analyze the Relationship Between SNPs and Disease • Statistical Association Analysis • Chi-Square and Odds ratio • Logistic Regression • Multifactor Dimensionality Reduction • Attribute selection and construction followed by classification
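
For a single SNP, the chi-square statistic and odds ratio named on this slide both come from a 2x2 contingency table. The sketch below is illustrative only: the counts are hypothetical, and the function names are chosen here rather than taken from the presentation.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

    Table layout:        disease   no disease
      allele present        a          b
      allele absent         c          d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def odds_ratio(a, b, c, d):
    """Odds of disease with the allele divided by odds of disease without it."""
    return (a * d) / (b * c)


# Hypothetical counts for one SNP in a case-control sample.
a, b, c, d = 30, 20, 20, 30
chi2 = chi_square_2x2(a, b, c, d)
ratio = odds_ratio(a, b, c, d)
print(f"chi-square = {chi2:.2f}, odds ratio = {ratio:.2f}")
```

With these made-up counts the statistic is 4.0 and the odds ratio is 2.25. A real analysis would also need a significance threshold and a correction for testing many SNPs.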

  4. Challenges Facing Current Techniques • Statistical Association Analysis • Often lacks power, even for single SNP association • Combinatorial explosion when used for screening pairs, triples, etc. • Logistic Regression • Does not work well when there are many attributes • Multifactor Dimensionality Reduction • Impractical for many attributes due to its combinatorial nature • General Challenges • Disease or susceptibility to disease often results from interactions among many genetic, phenotypic, and environmental factors • Noise and nonlinear interactions • The Challenges of Whole-Genome Approaches to Common Diseases, Moore and Ritchie, JAMA, 2004
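
The combinatorial explosion mentioned for pairs and triples can be made concrete by counting how many SNP combinations an exhaustive screen would have to test. This is a small sketch; the panel size of 500,000 SNPs is a hypothetical figure, not one taken from the slides.

```python
from math import comb


def candidate_sets(n_snps, k):
    """Number of distinct k-SNP combinations an exhaustive screen must test."""
    return comb(n_snps, k)


n_snps = 500_000  # hypothetical panel size, not a figure from the presentation
for k in (1, 2, 3):
    print(f"{k}-SNP combinations: {candidate_sets(n_snps, k):,}")
```

Each additional SNP in the combination multiplies the number of tests by roughly the panel size, which is why exhaustive screening becomes impractical so quickly.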

  5. General Approach Using Data Mining Techniques • Create a data set that records the presence and absence of • Phenotypic characteristics • Genetic characteristics (SNPs) • Disease • Apply association analysis to find groups of phenotypic and genetic characteristics that are highly associated with disease • Uses characteristics of the patterns to prune the search space • Clustering and classification can also be applied

  6. Traditional Association Analysis • Association analysis: Analyzes relationships among items (attributes) in binary transaction data • Example data: market basket data • Data can be represented as a binary matrix • Applications in business and science • Two types of patterns • Itemsets: Collection of items • Example: {Milk, Diaper} • Association Rules: X → Y, where X and Y are itemsets • Example: Milk → Diaper [Figures: Set-Based Representation of Data; Binary Matrix Representation of Data]
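
The two representations and the two pattern types on this slide can be sketched in a few lines. The transactions below are a made-up market basket, not the table from the original slide, and `support_count` and `confidence` are names chosen for this example.

```python
# Set-based representation: one set of items per transaction (hypothetical data).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Binary matrix representation: one row per transaction, one column per item.
items = sorted(set().union(*transactions))
matrix = [[int(item in t) for item in items] for t in transactions]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    return support_count(x | y, transactions) / support_count(x, transactions)


print(items)
for row in matrix:
    print(row)
print("support({Milk, Diaper}) =", support_count({"Milk", "Diaper"}, transactions))
print("confidence(Milk -> Diaper) =", confidence({"Milk"}, {"Diaper"}, transactions))
```

In this toy data the itemset {Milk, Diaper} has a support of 3, and the rule Milk → Diaper has a confidence of 0.75 because Milk appears in 4 transactions.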

  7. The Need for Error-Tolerant Itemsets • An error-tolerant itemset (ETI) can have a fraction ε of the items missing in each transaction. Example: see the data in the table • Let ε = 1/4. In other words, each transaction needs to have at least 3/4 (75%) of the items. • X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4. • Algorithms to find ETIs are still in development • You can think of these ETIs as blocks in the data matrix
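
The definition above can be checked on a small data set. The slide's table did not survive in this transcript, so the eight transactions below are a hypothetical stand-in built to behave the same way: each one is missing exactly one item from its block. The function name `eti_support` and the parameter `eps` (the tolerated fraction of missing items) are assumptions made for this sketch.

```python
from fractions import Fraction


def eti_support(itemset, transactions, eps):
    """Count transactions holding at least a (1 - eps) fraction of the itemset."""
    required = len(itemset) * (1 - eps)
    return sum(1 for t in transactions if len(itemset & t) >= required)


# Hypothetical data: each transaction misses one item of its four-item block.
transactions = [
    {"i2", "i3", "i4"},
    {"i1", "i3", "i4"},
    {"i1", "i2", "i4"},
    {"i1", "i2", "i3"},
    {"i6", "i7", "i8"},
    {"i5", "i7", "i8"},
    {"i5", "i6", "i8"},
    {"i5", "i6", "i7"},
]

X = {"i1", "i2", "i3", "i4"}
Y = {"i5", "i6", "i7", "i8"}
eps = Fraction(1, 4)  # exact arithmetic avoids floating-point edge cases

print("ETI support of X:", eti_support(X, transactions, eps))
print("ETI support of Y:", eti_support(Y, transactions, eps))
print("Exact support of X:", eti_support(X, transactions, Fraction(0)))
```

With a tolerance of 1/4, both X and Y reach a support of 4, while a traditional itemset (tolerance 0) finds no supporting transaction at all. This is the gap that motivates ETIs.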

  8. ETIs for Finding Patterns in Phenotypic and Genomic Data • ETIs consist of • A set of patients and • A set of attributes such that • The block is relatively dense • These blocks identify sets of patients that are highly associated with certain sets of attributes and vice-versa • If most of these patients share a disease, then these attributes (genetic and/or phenotypic) are candidate markers for the disease [Figure labels: X: Set of patients; Y: Set of attributes, i.e., SNPs, medical characteristics]

  9. Example: Using ETIs to Find Rules • Can use ETIs to find better association rules • Example: Mushroom data set • Classifies 8124 mushrooms as poisonous (3916) or not (4208) • 117 other attributes, such as color, odor, etc. • Comparison of rules based on frequent itemsets or ETIs • One rule is {29, 48, 90} → p • Individual correlations are 0.62, 0.54, 0.18 • With traditional association analysis, this rule has a confidence of 1 and a support of 576. • This corresponds to a correlation of 0.18 • If we require only two of the three items, we still have a confidence of 1, but the support is 3312. • This corresponds to a correlation of 0.86 • Maximum single item correlation is 0.78
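
The correlation quoted for the relaxed rule can be reproduced from the counts on the slide, under the assumption that "correlation" means the phi coefficient of the 2x2 table relating "rule antecedent matches" to "poisonous". The slide does not name the measure, so this is one plausible reading rather than the authors' stated method.

```python
from math import sqrt


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 table.

    n11: antecedent matches and poisonous    n10: matches, not poisonous
    n01: no match but poisonous              n00: no match, not poisonous
    """
    numerator = n11 * n00 - n10 * n01
    denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator


poisonous = 3916  # class counts from the slide
edible = 4208
covered = 3312    # records with at least two of the three items (slide's support)

# Confidence of 1 means every covered record is poisonous, so n10 = 0.
phi = phi_coefficient(covered, 0, poisonous - covered, edible)
print(f"phi = {phi:.2f}")  # prints phi = 0.86
```

Under this reading, the value of 0.86 follows directly from the support of 3312 and the class sizes.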

  10. Generalizing ETIs to Blocks in a Data Matrix • Dense blocks consist of • A set of objects and • A set of attributes such that • The block is relatively dense • These blocks identify sets of objects that are highly associated with certain sets of attributes and vice-versa • For gene expression data, this identifies transcription modules • Group of genes that are coexpressed under a set of conditions [Figure labels: X: Genes; Y: Conditions under which genes are expressed; entries of the matrix are the level of a gene's expression under a condition]

  11. Techniques for Finding Relatively Dense Blocks in a Data Matrix • Algorithms for finding Error-Tolerant Itemsets • More work needed to develop algorithms • We are currently using support envelopes • Support Envelopes: A Technique for Exploring the Structure of Association Patterns, Steinbach, Tan, Kumar, KDD, 2004 • Subspace clustering • Similar in spirit to association analysis • Subspace clustering for high dimensional data: A review, Parsons, Haque, Liu, SIGKDD Explorations, 2004 • Co-clustering • Information-Theoretic Co-Clustering, Dhillon, Mallela, Modha, KDD, 2003 • We are currently exploring approaches to extend co-clustering for ETIs • Variety of other approaches • Matrix factorization • Graph-partitioning

  12. Support Envelopes • A support envelope contains all association patterns involving m or more transactions and n or more items • By association patterns we mean • Itemsets and variants (frequent, maximal, closed) • Error Tolerant Itemsets (ETIs) [Figure: An example of a support envelope involving characteristics of mushrooms. One of the attributes is 'gill-color:buff', which occurs in 1728 records, every one of which occurs with 13 other items (one of which is the attribute 'poisonous').]
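
One way to compute a region consistent with this definition is iterative pruning: repeatedly drop transactions with fewer than n remaining items and items that occur in fewer than m remaining transactions, until nothing changes. Any exact itemset with at least n items and at least m supporting transactions survives this pruning, so it lies inside the result. The sketch below assumes this procedure and uses made-up data. It is not taken from the slides, and the function name `support_envelope` is chosen here.

```python
def support_envelope(transactions, m, n):
    """Iteratively prune a {transaction id: set of items} mapping.

    Keeps only transactions with >= n remaining items and items that occur
    in >= m remaining transactions, repeating until the data is stable.
    """
    rows = {tid: set(items) for tid, items in transactions.items()}
    while True:
        # Drop transactions that no longer have enough items.
        rows = {tid: items for tid, items in rows.items() if len(items) >= n}
        # Count item occurrences among the surviving transactions.
        counts = {}
        for items in rows.values():
            for item in items:
                counts[item] = counts.get(item, 0) + 1
        keep = {item for item, count in counts.items() if count >= m}
        pruned = {tid: items & keep for tid, items in rows.items()}
        if pruned == rows:  # no item was removed, so the result is stable
            return rows
        rows = pruned


# Hypothetical data set.
data = {
    "t1": {"a", "b", "c"},
    "t2": {"a", "b", "c"},
    "t3": {"a", "b"},
    "t4": {"c", "d"},
    "t5": {"d"},
}
envelope = support_envelope(data, m=2, n=2)
for tid in sorted(envelope):
    print(tid, sorted(envelope[tid]))
```

Here item d and transactions t4 and t5 are pruned away for m = 2 and n = 2, leaving a denser core made of t1, t2, and t3 over items a, b, and c.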

  13. Visualizing Support Envelopes for Mushroom • One of the support envelopes (576, 23) is denser than its surrounding neighbors.

  14. Using Weak Associations to Find Patterns • Apply the following principle: If most pairs of attributes or objects in a set have pairwise connections, then there is likely to be a strong association among them even if the pairwise associations are weak. • The hyperclique pattern uses this principle • Hyperclique Pattern Discovery, Xiong, Tan, and Kumar, to appear in DMKD, 2006 • Good for removing noise • Enhancing Data Analysis with Noise Removal, Hui Xiong, Gaurav Pandey, Michael Steinbach, Vipin Kumar, TKDE, 2006 • We have recently developed more general pairwise patterns • Custom Itemset Patterns, Steinbach and Kumar, submitted to KDD 2006
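
The slide does not define how a hyperclique is scored. In the cited hyperclique work the measure is h-confidence: the support of the itemset divided by the largest support of any single item in it. The sketch below assumes that definition and uses hypothetical transactions. The function names are chosen for this example.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def h_confidence(itemset, transactions):
    """Support of the itemset divided by the largest single-item support.

    A value near 1 means that whenever any one item appears, the rest of
    the itemset almost always appears with it.
    """
    joint = support_count(itemset, transactions)
    largest = max(support_count({item}, transactions) for item in itemset)
    return joint / largest if largest else 0.0


# Hypothetical transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"c", "d"},
    {"d"},
    {"d"},
]

print("h-confidence({a, b, c}) =", round(h_confidence({"a", "b", "c"}, transactions), 3))
print("h-confidence({c, d})    =", round(h_confidence({"c", "d"}, transactions), 3))
```

An itemset would be reported as a hyperclique when its h-confidence meets a user-chosen threshold. In this toy data {a, b, c} scores 2/3 while the loosely connected pair {c, d} scores only 1/3.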

  15. Application of Classification Techniques • Techniques must work with noisy, sparse, high-dimensional data • Success of multifactor dimensionality reduction indicates the usefulness of attribute selection and attribute creation • ETI and related patterns offer an alternative for feature extraction • Classification based on association analysis • SVM with the proper kernel
