Association Analysis Techniques for Bioinformatics Problems

Association Analysis Techniques forBioinformatics Problems Vipin Kumar University of Minnesota kumar@cs.umn.edu www.cs.umn.edu/~kumar Group members: Gaurav Pandey, Gowtham Atluri, Gang Fang, Rohit Gupta, Michael Steinbach www.cs.umn.edu/~kumar/dmbio BICoB 2009 April 8th, 2009

Data Mining for Biomedical Informatics Protein Interaction Network • Recent technological advances are helping to generate large amounts of genomic data • Gene and protein sequences • Gene-expression data • Biological networks and phylogenetic profiles • Single Nucleotides Polymorphisms (SNPs) • Data mining offers potential solution for analysis of large-scale data • Prediction of the functions of anonymous genes • Identification and summarizarion of functional modules • Associations between genotypes and phenotype of interest

Association Analysis Set-Based Representation of Data • Association analysis: Analyzes relationships among items (attributes) in a binary transaction data • Example data: market basket data • Applications in business and science • Marketing and Sales Promotion • Identification of functional modules from protein complexes • Noise removal from protein interaction data • Two types of patterns • Itemsets:Collection of items • Example: {Milk, Diaper} • Association Rules:X  Y, where X and Y are itemsets. • Example: Milk  Diaper

Association Analysis • Process of finding interesting patterns: • Find frequent itemsets using a support threshold • Find association rules for frequent itemsets • Sort association rules according to confidence • Support filtering is necessary • To eliminate spurious patterns • To avoid exponential search • Support has anti-monotone property:X  Y implies (Y) ≤ (X) • Confidence is used because of its interpretation as conditional probability • Has well-known limitations • More advanced measures: H-confidence Given d items, there are 2d possible candidate itemsets

Different Association patterns • Traditional frequent pattern • Hyperclique pattern • Error-tolerant frequent pattern • Discriminative pattern Case Controls

Association Analysis for Identification of Functional Modules from Protein Complex Data • A protein complex is a group of two or more associated proteins formed by protein-protein interaction that is stable over time • Gavin et al (2002)’s data set contains 232 complexes covering 1361 proteins • Can be represented as a market-basket matrix • Hypercliques derived from this data can be used to discover frequently occurring groups of proteins in several complexes • Likely to constitute functional modules Xiong et al, “Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery”, Proc. Pac. Symp. Biocomput. , pp 540-549, 2005.

Function Coherence of Hypercliques GO enrichment of {PRE2 PRE4 PRE5 PRE6 PRE8 PRE9 PUP3 SCL1}.

Association Analysis-based Pre-processing of Protein Interaction Networks Protein interaction data faces significant data quality challenges Significant noise and incomplete [Hart et al, 2006] Adverse impact on accuracy of functional inferences [Deng et al, 2003] • Approach: Transform graph G=(V,E,W) into G’=(V,E’,W’) • Tries to meet three objectives: • Addition of potentially biologically valid edges • Removal of potentially noisy edges • Assignment of weights to the resultant set of edges that indicate their reliability Transformed PPI graph where Pi and Pj are connected if (Pi,Pj) is a hyperclique pattern Input PPI graph Pandey et al, “Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study”, ACM SIGKDD 2007, pp 540-549 (Also selected for a Highlight talk at ISMB 2008).

Results on several S. cerevisiae interaction data sets Accuracy of top-k predictions Combined from 3 interaction data sets (initial edge reliabilities from expression data). DIPCore (Highly reliable subset of DIP database). Results on simulated data sets show that this transformation is able to reduce the effect of noise on function prediction significantly.

Constant value biclusters • Constant row • biclusters • Constant evolution • biclusters • Constant value biclusters An Association Analysis Approach for Biclustering Hystad et. al. 2007 • Bicluster: Group of genes that show coherent expression across a subset of conditions in a geneexpression data set. • Shown to be more effective than hierarchical clustering for finding functional modules [Prelic et al, 2006] • Four general classes of biclusters studied [Madeira et al, 2004]

Biclustering Algorithms and Associated Issues • Several biclustering algorithms have been proposed • Cheng & Church (2000) (CC): First approach for biclustering microarray data • ISA [Bergmann et al, 2003] • SAMBA [Tanay et al, 2004], Co-clustering [Dhillon et al, 2003] • xMotifs [Murali & Kasif, 2003], OPSM [Ben-Dor et al, 2003] • Typically these approaches work in one of the two ways: • Start with all rows and columns and work in a top-down fashion by eliminating rows and columns based on some greedy strategy (CC). • Start with a random seed of rows and columns and iteratively zero in on the final set of genes and conditions (ISA). • Face several challenges: • Cannot find biclusters exhaustively due to heuristic nature. • Typically find large biclusters due to the objective function being achieved early and can miss small interesting biclusters.

Association Analysis for Biclustering • Association patterns are biclusters! • Each pattern supported by a subset of transactions • Can formulate a framework for directly extracting such patterns from real-valued data (microarray). • Advantages: Exhaustive and efficient discovery of biclusters • Disadvantages: Need to binarize/discretize the original real-valued data set for association mining [Becquet et al, 2002; Creighton et al, 2003; McIntosh et al, 2007], which causes a loss of information.

Range Support Patterns (RAP) • Need an anti-monotonic measure to compute the significance of patterns discovered directly from real-valued data. • We propose Range Support: Sum of RS over all conditions. • Constraints imposed: • Consistency of expression values • Same direction of expression • These conditions satisfied over substantial number of conditions • Range support is anti-monotonic! • Can be used within an Apriori-like framework [Agrawal et al 1993; 1994] to find patterns. • Implementation available at http://www.cs.umn.edu/vk/gaurav/rap. Pandey et al, “Association Analysis for Real-valued Data: Definitions and Application to Microarray Data”, TR 08-007, Deptt of Comp Sc & Engg, University of Minnesota.

Experimental Evaluation • Patterns derived at various thresholds from Hughes et al (2000)’s microarray compendium. • Compared with ISA and CC biclusters. • RAP3: RangeSupport=8, α=0.5; RAP5: RangeSupport=12, α=1. • Evaluation using bicluster coherence and GO annotations • Coherence measured using Mean Squared Error (Cheng & Church, 2000). • Functional enrichment computed for each set of biclusters using GOTermFinder’s methodology (Boyle et al, 2004). • Fraction of patterns enriched measured at various p-value thresholds. • Classes separated into big (31-500 members) and small (1-30 members). • Evaluation restricted to 100 least overlapping patterns, to reduce the effect of redundancy (Prelic et al, 2006).

Bicluster coherence using MSE Cheng and Church, 2000

Examples of biclusters found CC1 bicluster with the lowest MSE ISA2 bicluster with the lowest MSE RAP3 bicluster with maximum number of genes RAP3 bicluster with highest MSE

GO enrichment results Functional enrichment for large classes (31-500 members) Functional enrichment for small classes (1-30 members) Results from randomized patterns show that the results at the strictest p-value thresholds are the most statistically significant

GO enrichment results for several sets of small functional classes Fraction of patterns (biclusters) enriched by several groups of small classes at p-value 1x10-5 Fraction of class covered by patterns (biclusters) among several groups of small classes at p-value 1x10-5

Complementarity of RAP and ISA Biclusters • {YAR010C,YBL005W-A,YBR012WA, YJR026W,YJR028W,YML040W,YMR046C,YMR051C} • Enriched by RNA-mediated transposition class (GO:0032197; p-value=4.64×10−12). • Not found or part of a large (138) ISA bicluster! • {YDR158W,YER052C,YOR130C,YPR145W} • Enriched by small classes GO:0009088 (6 members), GO:0009090 (3 members) and GO:0009092 (7 members). • Embedded within large ISA biclusters of sizes 211 and 104. • None of these classes were found to enrich these large biclusters! • RAP patterns are complementary to ISA biclusters in terms of finding much smaller functionally enriched biclusters.

Summary • RAP patterns are coherent, as measured by their MSE scores, among different sets of biclusters. • RAP patterns are significantly more enriched by small and specific functional classes as compared to ISA and CC biclusters. • Gene groups discovered by RAP are enriched for functions not covered by biclusters discovered by standard biclustering algorithms.

Discriminative Pattern Mining for Biomarker Discovery Goal: Searching for group of genes that collectively show differential expression in a binarized case-control gene expression data set. The absolute difference of the relative supports of itemset α in A and B is defined as • Motivation: • Genes (or SNPs) may NOT be predictive individually, but as a group they could be. Initially proposed by [Dong et al. 1999] and [Bay and Pazzani, 2001]

Previous Work on Discriminative Pattern Mining Two-step approaches Find all frequent patterns and then use class labels to select discriminative patterns. [Wang and Karypis 2005], [Cong et al 2005], [Deshpande et al. 2005], [Cheng et al. 2007], etc. Disadvantage: Not suitable for dense and high dimensional data Approaches that make limited use of class labels during pattern mining Loose upper bounds of measures of discriminative power are derived, e.g. DiffSup, Information Gain, etc. (Emerging Patterns [Dong et al. 1999], [Bay, Pazzani 2001] (CSET), [Cheng et al 2008], [Fan et al. 2008] etc.) Disadvantage: More scalable than 2-step approaches, but still not suitable for dense and high dimensional data, especially for discovering low-supportpatterns.

Our Focus Dense & high dimensionaldata, such as SNP and gene expression data. Discover interesting patterns with relatively low support that can not be discovered by traditional approaches. Traditional approaches (2-step & CSET) Our focus: Scalable discovery of low support patterns

SupMaxPair: a New Measure for Discriminative Pattern Mining Make use of the class labels more effectively Interesting low-support patterns that may NOT be discovered by existing algorithms Uninteresting patterns that can NOT be pruned by existing algorithms Low support patterns that can be pruned by most algorithms High support patterns that can be discovered by most algorithms Fang et al., Discriminative Pattern Mining from Dense and High Dimensional Data, TR 09-011, CS Department, Univ of Minnesota, 2009. Software available at http://vk.cs.umn.edu/SMK/

Gene Expression Dataset • 25000 genes in 295 breast cancer patients • 78 patients developed metastasis within 5 years of surgery. • 217 patients did not develop metastasis within 5 years of surgery • 5981 genes are selected using the pre-processing methodologies suggested by [Vijver et al. NEJM 2001] • Binarization • Each column pertaining to the expression of a single gene is split into two binary columns (+0.2, -0.2) • 11962 binary columns

Patterns Discovered by CSET and SupMaxPair Next, we will demonstrate these are statistically and biologically significant More low-support and interesting patterns are discovered by SupMaxPair beyond CSET

Permutation Test Statistically significant Part (a): 1000 highest p-value pattern generated from 1000 sets of randomized labels. Part (b): Top-300 p-value patterns discovered by SupMaxPair using the true class label.

Validation using Known Cancer Genes 28 size-3 patterns have 100% precision (all 3 are cancer genes), and they are only discovered by SupMaxPair but not CSET. E.g., all genes in the pattern {BIRC5, LAD1, TK1} are known cancer genes. Biologically significant • A list of ~2400 human genes (Higgins et al, 2007) known to be involved in the induction, progression and suppression of various types of cancers • 611 genes overlap with the dataset used in the study • For each pattern, a precision is computed w.r.t. the 611 cancer genes

Validation using Known Cancer Genes CSET SupMaxPair can recall additional biologically significant cancer genes beyond CSET, by dicovering low-support patterns Varying p-value cutoffs SupMaxPair For the genes covered by a set of patterns, a global precision and recall measure can be computed w.r.t the 611 cancer genes.

Pattern-based Classification SupMaxPair can complement traditional algorithms (e.g. CSET), for better classification. • CSET Patterns vs. (CSET + SupMaxPair) Patterns • Patterns used as features for classification using SVM • Data set split 80-20 for training-testing (500 iterations)

Summary • On dense and high dimensional data: • SupMaxPair can complement traditional approaches by discovering statistically and biologically significant discriminative patterns with relatively low support. • Can be useful for discovering gene group biomarkers for diseases. Fang et al., Discriminative Pattern Mining from Dense and High Dimensional Data, TR 09-011, CS Department, Univ of Minnesota, 2009. Software available at http://vk.cs.umn.edu/SMK/

Thank you! www.cs.umn.edu/~kumar/dmbio • Group Members: Gaurav Pandey Gowtham Atluri Gang Fang Rohit Gupta Michael Steinbach Vipin Kumar University of Minnesota kumar@cs.umn.edu www.cs.umn.edu/~kumar

Association Analysis Techniques for Bioinformatics Problems

Association Analysis Techniques for Bioinformatics Problems

Presentation Transcript

Approximation Techniques for Coloring Problems

Bioinformatics for Genome data analysis

Association Analysis

5 Open Problems in Bioinformatics

Similarity Evaluation Techniques for Filtering Problems

Matching Problems in Bioinformatics

Techniques For Troubleshooting FCC Regenerator Problems

Bioinformatics Cluster Analysis

EECS 395/495 Algorithmic Techniques for Bioinformatics

Analysis Techniques

Techniques That solve Business problems Grid Analysis

Fast Modeling Techniques for C4ISR Problems

Approximation Techniques for Coloring Problems

Techniques For Troubleshooting FCC Regenerator Problems

Bioinformatics for Genome data analysis

Course Sequence Analysis for Bioinformatics Master’s

Analysis Techniques

Association Analysis

Developing Bioinformatics Tools for Genome Analysis

Association Pattern Analysis – Applications in Bioinformatics