TRICLUSTER: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data

TRICLUSTER: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data • Mohammed J. Zaki & Lizhuang Zhao • Department of Computer Science, Rensselaer Polytechnic Institute (RPI), Troy, NY • {zhaol2, zaki}@cs.rpi.edu

Microarray Data • Essential source of information about the Gene Expression within a cell • Typically 2D: Genes x Samples (Genes x Time) • Measure the expression level of genes in different samples • Labeled samples: Classification (cancer vs. non-cancer) • Non-labeled samples: Clustering (Bi-clusters) • Goal: Identify the “expression” patterns, providing clues to the gene regulatory networks within a cell

Why Biclustering? • Some genes are similarly expressed in only some samples • [Figure: two views of a genes (g1 to g5) x samples (s1 to s5) matrix, contrasting the full-space cluster on genes (g2, g4, g5) with the bicluster (g2, g4, g5)×(s2, s3, s5)]

Different “Homogeneity” or Similarity Criteria • Constant: column, row, all • Scaling/Shifting, which is more general (figure example: Shift=0.4, Scale=1.4) • Order Preserving (figure example: order 2 1 3) • Note: small noise ε is allowed in all expression values

Why TriCluster? • Typical microarray data is 2D (gene x sample) • Temporal expression is a very important tool • How does gene expression evolve in time? • Find clusters over genes x samples x time • Spatial expression is also of interest • How does gene expression differ in space (e.g., different regions of mouse brain)? • Find clusters over gene x samples x space • Combine temporal and spatial expression • Find clusters over gene x time x space, etc. • There is an emerging need to mine 3D data

TriCluster: Our Contributions • First algorithm to mine tri-clusters in 3D microarray data • Complete and deterministic • Mine maximal clusters satisfying given homogeneity criteria • Constant: column, row, all • Scaling & Shifting • Clusters can be overlapping; optionally delete/merge clusters having large overlap • Propose a set of metrics for cluster evaluation • Use Gene Ontology (GO) to assess biological significance

Definitions • G is a set of genes {g0, g1, …, gn-1} • S is a set of samples {s0, s1, …, sm-1} • T is a set of time courses {t0, t1, …, tl-1} • 3D real-valued dataset D = {dijk} over G x S x T • dijk is the expression value of gene gi in sample sj at time tk • A triCluster is a maximal submatrix of D that satisfies some homogeneity conditions • C = X x Y x Z = {cijk} • X ⊆ G, Y ⊆ S, Z ⊆ T • Given homogeneity conditions

Scaling triCluster Example • [Figure: a scaling tricluster drawn over the Genes, Samples, and Time axes, annotated with the ratios between its slices] • Note: small noise ε is allowed

TriCluster Concepts • C = X x Y x Z = {cijk} is a triCluster iff • C is maximal (no C’ ⊃ C) • C has sufficient size: |X| ≥ mg, |Y| ≥ ms, |Z| ≥ mt • Noise/error threshold ε is satisfied for any C22 • C22 is an arbitrary 2x2 submatrix of C, with rows i, j and columns a, b • Let ri = |cia/cib| and rj = |cja/cjb| • Max(ri, rj) / Min(ri, rj) – 1 ≤ ε • Range threshold δa is satisfied for each dim a • δ = |cijk – cxyz| • If j=y, k=z, then δ ≤ δg (similarly define δs, δt)
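The noise condition on 2x2 submatrices can be checked directly. The following is a minimal sketch, assuming a single 2D slice stored as a list of rows with nonzero values; the function names `ratios_agree` and `slice_is_coherent` are illustrative and do not come from the slides.

```python
from itertools import combinations

def ratios_agree(c_ia, c_ib, c_ja, c_jb, eps):
    """Check one 2x2 submatrix: max(ri, rj) / min(ri, rj) - 1 <= eps."""
    r_i = abs(c_ia / c_ib)  # ratio of columns a and b in row i
    r_j = abs(c_ja / c_jb)  # ratio of columns a and b in row j
    return max(r_i, r_j) / min(r_i, r_j) - 1 <= eps

def slice_is_coherent(m, eps):
    """True if every 2x2 submatrix of the 2D slice m passes the ratio test.

    Assumption: all values are nonzero, so every ratio is defined.
    """
    rows, cols = range(len(m)), range(len(m[0]))
    return all(
        ratios_agree(m[i][a], m[i][b], m[j][a], m[j][b], eps)
        for i, j in combinations(rows, 2)
        for a, b in combinations(cols, 2)
    )

exact = [[1.0, 2.0, 4.0], [3.0, 6.0, 12.0]]  # second row = 3 x first row
noisy = [[1.0, 2.0], [3.0, 6.1]]             # small deviation in one cell

exact_ok = slice_is_coherent(exact, 0.0)
noisy_ok_loose = slice_is_coherent(noisy, 0.05)
noisy_ok_tight = slice_is_coherent(noisy, 0.01)
```

The exhaustive check is only meant to make the definition concrete; it visits every pair of rows and columns and is not how a mining algorithm would search for clusters.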

TriCluster Flexibility • Cluster definition is symmetric • Any ordering of dimensions allowed • A/C≈B/D ↔ A/B≈C/D ↔ AD≈BC • Can mine several types of clusters • Typically ε ≈ 0 to allow small noise/error • Approx constant cluster: δg ≈ 0 and δs ≈ 0 and δt ≈ 0 • Approx single dim constant: δg ≈ 0 or δs ≈ 0 or δt ≈ 0 • Approx two dim constant: (δg ≈ 0 and δs ≈ 0) or (δg ≈ 0 and δt ≈ 0) or (δs ≈ 0 and δt ≈ 0) • Scaling cluster: δg and δs and δt are unconstrained • Shifting cluster: if e^C is a scaling cluster, then C is a shifting cluster
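The link between shifting and scaling clusters follows from e^(x+s) = e^x · e^s: an additive shift becomes a multiplicative scale after exponentiation. A small sketch of this, using a made-up 2x2 block whose second row is the first row shifted by 0.4 (the helper name `is_scaling_2x2` and the tolerance are illustrative):

```python
import math

def is_scaling_2x2(a, b, c, d, eps):
    """For the block [[a, b], [c, d]], test whether the row ratios a/b and c/d agree within eps."""
    r1, r2 = abs(a / b), abs(c / d)
    return max(r1, r2) / min(r1, r2) - 1 <= eps

# Hypothetical shifting block: second row = first row + 0.4.
shifting = [[1.0, 2.0], [1.4, 2.4]]

# Exponentiate every cell; the shift of 0.4 becomes a scale factor of e^0.4.
exp_block = [[math.exp(v) for v in row] for row in shifting]

tol = 1e-9  # allowance for floating-point rounding only
raw_is_scaling = is_scaling_2x2(*shifting[0], *shifting[1], tol)
exp_is_scaling = is_scaling_2x2(*exp_block[0], *exp_block[1], tol)
```

Here the raw block fails the scaling test while its exponentiated version passes, so a scaling miner run on e^C would recover the shifting pattern in C.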

TriCluster Algorithm • Compute maximal biclusters on G x S for each time slice t ∈ T • Construct range multigraph • Find maximal cliques • Compute triclusters from biclusters • Construct new multigraph (T x biclusters) • Find maximal cliques • Merge/Prune overlapping clusters

Maximal Biclusters • Mine each GxS time-slice for maximal biclusters • For each pair of samples, get the valid ratio ranges within ε and their gene-sets • Construct a Range Multigraph • Mine maximal cliques • Each clique/cluster can contribute to some valid tricluster

Valid Ratio Ranges: Each Column Pair • [Figure: range example showing the original data and the data after row/column permutation] • Take the ratio of s0 and s6 and construct valid ranges: • A range contains at least mg values within ε (noise threshold) • ε=0.05, mg=3, then 3.0×(1+ε)=3.15 → range = [3, 3.15] • Other ranges = [3.3, 3.465], and so on • Construct gene-sets: [3, 3.15] has genes {g1, g4, g8}
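One way to picture the range construction is a sliding window over the sorted ratios of a sample pair. The sketch below is a simplified reading of the slide, not the authors' implementation: it assumes positive ratios, builds each range as [low, low×(1+ε)], and keeps only windows that are not contained in the previous one. The ratio values are invented for illustration; only ε=0.05, mg=3 and the resulting ranges mirror the slide.

```python
def valid_ratio_ranges(ratios, eps, mg):
    """Return (low, high, gene_set) for each maximal window with >= mg genes.

    ratios maps a gene name to its expression ratio for one sample pair.
    Assumption: all ratios are positive.
    """
    items = sorted(ratios.items(), key=lambda kv: kv[1])
    ranges = []
    end = 0        # exclusive end of the current window
    prev_end = 0   # exclusive end of the previous window
    for start, (_, low) in enumerate(items):
        high = low * (1 + eps)
        end = max(end, start)
        while end < len(items) and items[end][1] <= high:
            end += 1
        # A window that does not reach past the previous one is contained in it.
        if end > prev_end and end - start >= mg:
            ranges.append((low, high, {g for g, _ in items[start:end]}))
        prev_end = end
    return ranges

# Hypothetical ratios for one sample pair.
ratios = {"g0": 1.0, "g2": 2.0, "g1": 3.0, "g8": 3.0, "g4": 3.1,
          "g3": 3.3, "g5": 3.4, "g7": 3.45}

ranges = valid_ratio_ranges(ratios, eps=0.05, mg=3)
gene_sets = [genes for _, _, genes in ranges]
```

With these inputs the first range starts at 3.0 and ends at about 3.15, matching the arithmetic on the slide, and the second starts at 3.3.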

Range Multigraph: Pair of Samples • Construct valid ratios & gene-sets for s1/s4 • Ratio = 1/1, gene-set = {g2, g6, g0, g9, g7} • Ratio = 5/4, gene-set = {g4, g8, g1} • Construct ratios/gene-sets for other pairs • [Figure: the resulting multigraph]

Range Multigraph: complete • Construct ratios/gene-sets for all sample pairs

Maximal Clique Mining • [Figure: range multigraph over the sample vertices s0 to s6] • Perform recursive depth-first search • Maintain valid gene-sets for each node • Intersect gene-sets with each outgoing edge • {g2, g6, g0, g9, g7} ∩ {g2, g6, g0, g9} = {g2, g6, g0, g9} • Prune if various criteria are not met (size, dim range)
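The depth-first search with gene-set intersection might be sketched as follows. This is a simplified illustration under several assumptions: the multigraph is a dictionary `edges` mapping an ordered sample pair to the gene-sets of its valid ranges, only the size thresholds mg and ms are used for pruning (the dim-range check is omitted), and maximality is enforced by a final containment filter. The function name and the edge data are hypothetical, apart from the two gene-sets quoted on the slide.

```python
from itertools import product

def mine_biclusters(samples, edges, all_genes, mg, ms):
    """Depth-first search over sample sets, intersecting edge gene-sets."""
    found = []

    def extend(chosen, genes, start):
        if len(chosen) >= ms:
            found.append((frozenset(genes), tuple(chosen)))
        for idx in range(start, len(samples)):
            s = samples[idx]
            # One list of candidate gene-sets per edge from a chosen sample to s.
            options = [edges.get((c, s), []) for c in chosen]
            for combo in product(*options):
                new_genes = genes.intersection(*combo)
                if len(new_genes) >= mg:  # prune clusters with too few genes
                    extend(chosen + [s], new_genes, idx + 1)

    extend([], set(all_genes), 0)
    unique = list(dict.fromkeys(found))
    # Keep only clusters not contained in another found cluster.
    return [c for c in unique
            if not any(c != o and c[0] <= o[0] and set(c[1]) <= set(o[1])
                       for o in unique)]

genes = {f"g{i}" for i in range(10)}
common = {"g2", "g6", "g0", "g9"}
edges = {
    ("s1", "s4"): [{"g2", "g6", "g0", "g9", "g7"}, {"g4", "g8", "g1"}],
    ("s1", "s6"): [set(common)],   # hypothetical edge
    ("s4", "s6"): [set(common)],   # hypothetical edge
}

clusters = mine_biclusters(["s1", "s4", "s6"], edges, genes, mg=3, ms=3)
```

In this toy graph the only cluster meeting both size thresholds is {g0, g2, g6, g9} over (s1, s4, s6); the branch through the 5/4 range dies because its gene-set has no overlap with the other edges.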

Mine triClusters • Let Bt be the set of maximal biclusters for time slice t • Construct new multigraph • Each time point is a vertex • Each pair of highly overlapping biclusters (gene-set, samples) forms an edge between time ti and tj • Call maximal clique mining to obtain maximal triclusters

Constructing triClusters

Constructing triClusters • [Figure: time slices ti, tj, tk]

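Prune and Merge • [Figure: overlapping clusters A and B, and a cluster B covered by several clusters Ai, Aj] • Merge A & B if |L(A+B)-A-B| / |L(A+B)| is below the merge threshold • Prune B if |LB-A| / |LB| is below the prune threshold • Prune B if |LB-∪Ai| / |LB| is below the prune threshold • Cluster Span: • LC = {(i,j,k) | gi, sj, tk ∈ C} • LA∪B = LA ∪ LB • LA-B = LA – LB • LA+B = (LA – LB) ∪ (LB – LA) ∪ (LA ∩ LB)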
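The two prune tests compare the part of B's span that other clusters leave uncovered against B's full span. A minimal sketch of that ratio, assuming each cluster is given as a triple of index collections (genes, samples, times); the helper names and the example clusters are illustrative, and the merge test is left out.

```python
from itertools import product

def span(cluster):
    """Set of (i, j, k) cells covered by a cluster (genes, samples, times)."""
    genes, samples, times = cluster
    return set(product(genes, samples, times))

def uncovered_fraction(b, others):
    """|L_B minus the union of the other spans| / |L_B|."""
    covered = set().union(*(span(a) for a in others))
    l_b = span(b)
    return len(l_b - covered) / len(l_b)

# Hypothetical clusters over gene, sample, and time indices.
a1 = (range(0, 4), range(0, 2), range(0, 2))   # genes 0..3, span 16
a2 = (range(4, 5), range(0, 2), range(0, 2))   # gene 4 only, span 4
b = (range(2, 5), range(0, 2), range(0, 2))    # genes 2..4, span 12

frac_single = uncovered_fraction(b, [a1])       # one third of B lies outside A1
frac_multi = uncovered_fraction(b, [a1, a2])    # B is fully covered by A1 and A2
```

B would be pruned when the computed fraction falls below the user-chosen threshold; in this example B survives a comparison with A1 alone for any threshold up to one third, but is removed once A2 is also considered.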

Metrics for Measuring Clustering Quality • NumClusters: number of clusters • Span: Span(X×Y×Z) = |X|×|Y|×|Z| • ElementSum: sum of all cluster spans (shared cells counted multiple times) • Coverage: size of the union of all cluster spans (each cell counted once) • Overlap: (ElementSum - Coverage) / Coverage • We want high coverage with small overlap
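These metrics are simple to compute once each cluster is expanded into its set of cells. A short sketch, assuming clusters are triples of index collections (genes, samples, times); the function name `clustering_metrics` and the two example clusters are made up for illustration.

```python
from itertools import product

def clustering_metrics(clusters):
    """Compute NumClusters, ElementSum, Coverage, and Overlap."""
    spans = [set(product(*c)) for c in clusters]     # cells of each cluster
    element_sum = sum(len(s) for s in spans)          # shared cells counted repeatedly
    coverage = len(set().union(*spans))               # each cell counted once
    overlap = (element_sum - coverage) / coverage if coverage else 0.0
    return {
        "NumClusters": len(clusters),
        "ElementSum": element_sum,
        "Coverage": coverage,
        "Overlap": overlap,
    }

# Two hypothetical clusters sharing genes 2 and 3.
a = (range(0, 4), range(0, 2), range(0, 2))   # span 4 x 2 x 2 = 16
b = (range(2, 5), range(0, 2), range(0, 2))   # span 3 x 2 x 2 = 12

result = clustering_metrics([a, b])
```

The two clusters share 8 cells, so ElementSum is 28, Coverage is 20, and Overlap is 8/20 = 0.4.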

Synthetic Data Generation • Experiments: 1.4 GHz, 448 MB, Linux/VMware • Synthetic data for parameter evaluation • Input parameters: • |G|=4000, |S|=30, |T|=20 • Number of clusters to embed = 10 • Overlap % among clusters = 20% • Noise for expression values = 3% • Cluster size range = 150x6x4 (some variation) • Generate clusters with values within some range • Fill rest of cells with random noise • Do random permutations along each dimension • We vary one parameter and keep others fixed
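The generation steps can be illustrated with a scaled-down sketch. Several choices here are assumptions rather than details from the slides: a single embedded cluster instead of ten, no overlap, a multiplicative scaling pattern (gene factor × sample factor × time factor), and uniform background values in [1, 100]. The function name `make_dataset` is hypothetical.

```python
import random

def make_dataset(n_genes, n_samples, n_times, shape, noise=0.03, seed=0):
    """Embed one scaling cluster of the given shape, then permute each axis."""
    rng = random.Random(seed)
    cg, cs, ct = shape
    # Background: random values everywhere.
    data = [[[rng.uniform(1.0, 100.0) for _ in range(n_times)]
             for _ in range(n_samples)] for _ in range(n_genes)]
    # Cluster cells: product of per-axis factors, with relative noise.
    gf = [rng.uniform(1.0, 5.0) for _ in range(cg)]
    sf = [rng.uniform(1.0, 5.0) for _ in range(cs)]
    tf = [rng.uniform(1.0, 5.0) for _ in range(ct)]
    for i in range(cg):
        for j in range(cs):
            for k in range(ct):
                jitter = 1.0 + rng.uniform(-noise, noise)
                data[i][j][k] = gf[i] * sf[j] * tf[k] * jitter
    # Random permutation along each dimension: perm[new_index] = old_index.
    gp = rng.sample(range(n_genes), n_genes)
    sp = rng.sample(range(n_samples), n_samples)
    tp = rng.sample(range(n_times), n_times)
    shuffled = [[[data[gp[i]][sp[j]][tp[k]] for k in range(n_times)]
                 for j in range(n_samples)] for i in range(n_genes)]
    # Where the embedded cluster ended up after the permutations.
    cluster = ([i for i, old in enumerate(gp) if old < cg],
               [j for j, old in enumerate(sp) if old < cs],
               [k for k, old in enumerate(tp) if old < ct])
    return shuffled, cluster

data, (cl_genes, cl_samples, cl_times) = make_dataset(
    40, 6, 5, shape=(8, 3, 2), noise=0.0, seed=1)
```

Returning the permuted cluster indices alongside the data makes it possible to check afterwards whether a miner recovers the embedded cluster.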

Results on Synthetic Datasets • [Figure: six plots of running time (sec) versus number of genes, number of samples, number of time-points, number of clusters, variation (%), and overlap (%)]

Results on Yeast Cell Cycle Dataset • http://genome-www.stanford.edu/cellcycle • Elutriation experiment • 7679 genes • 14 time points (0 to 390 min at 30 min intervals) • No real samples: use raw expression values of 13 attributes as samples (Cyc3, Cyc5, ratios, etc.) • GxSxT = 7679 x 13 x 14 • Note: actual 3D data will become publicly available soon (e.g. Mouse Brain Atlas: genes x space x time) • Run TriCluster: mg=50, ms=4, mt=5, ε=0.03 • Found 5 clusters in 28 s, overlap=0, coverage=6250 • 2D view of cluster C0 (51x4x5) shown next

2D Views of Cluster C0 on Yeast Data • [Figure: three panels of expression values: sample curves and time curves plotted against genes, and gene curves plotted against time points; legend entries are samples s=CH2I, CH2D, CH2IN, CH2DN and time points t=120, 210, 270, 330, 390]

Results on Yeast Cell Cycle Dataset: Gene Ontology • [Table: significant (p-value < 0.01) shared Gene Ontology (GO) terms (process, function, location) for genes in different clusters]

Results on Yeast Cell Cycle: Specific Cluster • Different clusters show different shared terms • Results could potentially be biologically significant

Summary • Contributions • First algorithm to mine triclusters from 3D microarrays • Complete, deterministic • Allows small noise • Flexible: constant, single/two dim, scaling, shifting • Allows arbitrary overlap (merge/prune) • Potentially biologically significant clusters (GO)! • Future Work • Extend from 3-D to k-D datasets • Allow different pattern types along different axes (scaling along GxS, shifting along T, etc.) • Enhance clique mining step from multigraphs
