TRICLUSTER: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data
Mohammed J. Zaki & Lizhuang Zhao
Department of Computer Science, Rensselaer Polytechnic Institute (RPI), Troy, NY
{zhaol2, zaki}@cs.rpi.edu

Microarray Data
• Essential source of information about gene expression within a cell
• Typically 2D: Genes × Samples (or Genes × Time)
• Measures the expression level of genes in different samples
  • Labeled samples: classification (cancer vs. non-cancer)
  • Unlabeled samples: clustering (biclusters)
• Goal: identify the “expression” patterns, providing clues to the gene regulatory networks within a cell

Why Biclustering?
• Some genes are similarly expressed in only some samples
[Figure: two grids over genes g1 to g5 and samples s1 to s5. Bicluster: (g2, g4, g5) × (s2, s3, s5). Full-space cluster: (g2, g4, g5) over all samples.]
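
Different “Homogeneity” or Similarity Criteria
[Figure: example matrices for each criterion, ordered from least to most general]
• Constant: column, row, or all
• Scaling/Shifting (example: shift = 0.4, scale = 1.4)
• Order preserving (example order: 2 1 3)
• Note: small noise is allowed in all expression values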

Why TriCluster?
• Typical microarray data is 2D (gene × sample)
• Temporal expression is a very important tool
  • How does gene expression evolve in time?
  • Find clusters over genes × samples × time
• Spatial expression is also of interest
  • How does gene expression differ in space (e.g., different regions of the mouse brain)?
  • Find clusters over genes × samples × space
• Combine temporal and spatial expression
  • Find clusters over genes × time × space, etc.
• There is an emerging need to mine 3D data

TriCluster: Our Contributions
• First algorithm to mine triclusters in 3D microarray data
• Complete and deterministic
• Mines maximal clusters satisfying given homogeneity criteria
  • Constant: column, row, all
  • Scaling and shifting
• Clusters can be overlapping; optionally delete/merge clusters having large overlap
• Proposes a set of metrics for cluster evaluation
• Uses Gene Ontology (GO) to assess biological significance

Definitions
• G is a set of genes {g_0, g_1, …, g_(n−1)}
• S is a set of samples {s_0, s_1, …, s_(m−1)}
• T is a set of time courses {t_0, t_1, …, t_(l−1)}
• 3D real-valued dataset D = G × S × T = {d_ijk}
• d_ijk is the expression value of gene g_i in sample s_j at time t_k
• A triCluster is a maximal submatrix of D that satisfies some homogeneity conditions
  • C = X × Y × Z = {c_ijk}
  • X ⊆ G, Y ⊆ S, Z ⊆ T
  • C satisfies the given homogeneity conditions

Scaling triCluster Example
[Figure: 3D block with axes Genes, Samples and Time, annotated with the scaling ratios between slices]
• Note: small noise is allowed

TriCluster Concepts
• C = X × Y × Z = {c_ijk} is a triCluster iff
  • C is maximal (there is no valid C' ⊃ C)
  • C has sufficient size: |X| ≥ mg, |Y| ≥ ms, |Z| ≥ mt
  • The noise/error threshold ε is satisfied for any C22
    • C22 = [c_ia c_ib; c_ja c_jb] is an arbitrary 2×2 submatrix of C
    • Let r_i = |c_ia / c_ib| and r_j = |c_ja / c_jb|
    • max(r_i, r_j) / min(r_i, r_j) − 1 ≤ ε
  • The range threshold δ^a is satisfied for each dimension a
    • δ = |c_ijk − c_xyz|
    • If j = y and k = z, then δ ≤ δ^g (δ^s and δ^t are defined similarly)
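
The noise condition can be checked mechanically. The following is a minimal sketch, assuming the condition reads max(r_i, r_j) / min(r_i, r_j) − 1 ≤ ε and that it is applied to one 2D slice of a candidate cluster with non-zero values. The function names `ratio_noise` and `is_scaling_coherent` are illustrative and not part of TriCluster.

```python
from itertools import combinations

def ratio_noise(c_ia, c_ib, c_ja, c_jb):
    """Return max(r_i, r_j) / min(r_i, r_j) - 1 for one 2x2 submatrix."""
    r_i = abs(c_ia / c_ib)
    r_j = abs(c_ja / c_jb)
    return max(r_i, r_j) / min(r_i, r_j) - 1.0

def is_scaling_coherent(matrix, eps):
    """Check every 2x2 submatrix of a 2D list of non-zero values against eps."""
    rows = range(len(matrix))
    cols = range(len(matrix[0]))
    for i, j in combinations(rows, 2):
        for a, b in combinations(cols, 2):
            noise = ratio_noise(matrix[i][a], matrix[i][b],
                                matrix[j][a], matrix[j][b])
            if noise > eps:
                return False
    return True

# Hypothetical slice: the second row is 3x the first, with one noisy cell.
noisy_slice = [[1.0, 2.0], [3.0, 6.1]]
```

With these made-up values the slice passes at ε = 0.03 but fails at ε = 0.01, since the single 2×2 submatrix has a ratio deviation of about 0.017.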

TriCluster Flexibility
• Cluster definition is symmetric
  • Any ordering of dimensions is allowed
  • A/C ≈ B/D ↔ A/B ≈ C/D ↔ AD ≈ BC
• Can mine several types of clusters
  • Typically ε ≈ 0 to allow small noise/error
  • Approximately constant cluster: δ^g ≈ 0 and δ^s ≈ 0 and δ^t ≈ 0
  • Approximately constant in a single dimension: δ^g ≈ 0 or δ^s ≈ 0 or δ^t ≈ 0
  • Approximately constant in two dimensions: (δ^g ≈ 0 and δ^s ≈ 0) or (δ^g ≈ 0 and δ^t ≈ 0) or (δ^s ≈ 0 and δ^t ≈ 0)
  • Scaling cluster: δ^g, δ^s and δ^t are unconstrained
  • Shifting cluster: if e^C is a scaling cluster, then C is a shifting cluster
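
The link between shifting and scaling clusters follows from e^(x + c) = e^x · e^c: an additive offset between rows becomes a constant multiplicative factor after exponentiation. A small sketch with made-up values (the helper `exponentiate` is illustrative only):

```python
import math

def exponentiate(cluster):
    """Apply e^x to every cell of a 2D list."""
    return [[math.exp(v) for v in row] for row in cluster]

# Hypothetical shifting cluster: row 1 = row 0 + 0.4 in every column.
shifting = [[1.0, 2.0, 4.0],
            [1.4, 2.4, 4.4]]
scaling = exponentiate(shifting)

# After exponentiation, row 1 / row 0 is the same factor in every column.
ratios = [scaling[1][k] / scaling[0][k] for k in range(3)]
```

Every entry of `ratios` equals e^0.4 up to floating-point error, so a miner for scaling clusters can be reused for shifting clusters by transforming the data first.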

TriCluster Algorithm
• Compute maximal biclusters on G × S for each time slice t ∈ T
  • Construct range multigraph
  • Find maximal cliques
• Compute triclusters from biclusters
  • Construct new multigraph (T × biclusters)
  • Find maximal cliques
• Merge/prune overlapping clusters

Maximal Biclusters
• Mine each G × S time slice for maximal biclusters
  • For each pair of samples, get the valid ratio ranges within ε and their gene-sets
  • Construct a range multigraph
  • Mine maximal cliques
• Each clique/cluster can contribute to some valid tricluster

Valid Ratio Ranges: Each Column Pair
[Figure: range example showing the original data and the data after row/column permutation]
• Take the ratio of s0 and s6 and construct valid ranges:
  • A range must contain at least mg values within ε (noise threshold)
  • ε = 0.05, mg = 3; then 3.0 × (1 + ε) = 3.15, so range = [3, 3.15]
  • Other ranges: [3.3, 3.465], and so on
• Construct gene-sets: [3, 3.15] has genes {g1, g4, g8}
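
The range construction for one column pair can be sketched as a scan over sorted ratios. This is a simplified illustration, not the authors' exact procedure: it assumes positive values, uses each observed ratio as a candidate lower bound, and keeps a range only if its gene-set is not contained in one already kept. The function name and the sample values are hypothetical.

```python
def valid_ratio_ranges(col_a, col_b, eps, mg):
    """Return (low, high, gene_set) triples for the ratio col_a / col_b."""
    # One (ratio, gene index) pair per gene, sorted by ratio.
    ratios = sorted((a / b, g) for g, (a, b) in enumerate(zip(col_a, col_b)))
    ranges = []
    for low, _ in ratios:
        high = low * (1 + eps)
        genes = frozenset(g for r, g in ratios if low <= r <= high)
        # Keep the range if it is large enough and adds a new gene-set.
        if len(genes) >= mg and not any(genes <= kept for _, _, kept in ranges):
            ranges.append((low, high, genes))
    return ranges

# Hypothetical expression values of five genes in two samples.
sample_a = [3.0, 6.2, 9.3, 1.0, 5.0]
sample_b = [1.0, 2.0, 3.0, 1.0, 1.0]
found = valid_ratio_ranges(sample_a, sample_b, eps=0.05, mg=3)
```

With ε = 0.05 and mg = 3, only the range starting at ratio 3.0 qualifies here, and its gene-set is the first three genes.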

Range Multigraph: Pair of Samples
• Construct valid ratios and gene-sets for s1/s4
  • Ratio = 1/1, gene-set = {g2, g6, g0, g9, g7}
  • Ratio = 5/4, gene-set = {g4, g8, g1}
• Construct ratios/gene-sets for other pairs
[Figure: multigraph]

Range Multigraph: Complete
• Construct ratios/gene-sets for all sample pairs

Maximal Clique Mining
[Figure: range multigraph over sample nodes s0 to s6]
• Perform recursive depth-first search
• Maintain valid gene-sets for each node
• Intersect gene-sets with each outgoing edge
  • {g2, g6, g0, g9, g7} ∩ {g2, g6, g0, g9} = {g2, g6, g0, g9}
• Prune if various criteria are not met (size, dimension range)
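
The depth-first search can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the multigraph is assumed to be a dictionary `edges[(a, b)]` (with a < b) holding one gene-set per valid ratio range, only the size criteria mg and ms are used for pruning, and maximality is enforced by a final filter.

```python
def mine_biclusters(n_samples, edges, mg, ms):
    """Return maximal (gene_set, samples) pairs found by depth-first search."""
    found = set()

    def extend(samples, genes):
        if len(samples) >= ms:
            found.add((genes, tuple(samples)))
        for s_new in range(samples[-1] + 1, n_samples):
            # Pick one edge per existing sample and intersect the gene-sets.
            candidates = {genes}
            for s_old in samples:
                candidates = {c & e for c in candidates
                              for e in edges.get((s_old, s_new), [])}
                candidates = {c for c in candidates if len(c) >= mg}
            for c in candidates:
                extend(samples + [s_new], c)

    all_genes = frozenset().union(*(e for es in edges.values() for e in es))
    for s in range(n_samples):
        extend([s], all_genes)
    # Drop clusters contained in another cluster on both axes.
    return [(g, s) for g, s in found
            if not any((g, s) != (g2, s2) and g <= g2 and set(s) <= set(s2)
                       for g2, s2 in found)]

# Hypothetical multigraph over three samples.
example_edges = {
    (0, 1): [frozenset({0, 1, 2, 3}), frozenset({4, 5, 6})],
    (0, 2): [frozenset({0, 1, 2})],
    (1, 2): [frozenset({0, 1, 2, 3})],
}
biclusters = mine_biclusters(3, example_edges, mg=3, ms=2)
```

In this example the cluster on samples (0, 2) with genes {0, 1, 2} is discarded because it is contained in the cluster on samples (0, 1, 2) with the same genes.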

Mine triClusters
• Let B_t be the set of maximal biclusters for time slice t
• Construct a new multigraph
  • Each time point is a vertex
  • Each pair of highly overlapping biclusters (gene-set, samples) forms an edge between times t_i and t_j
• Call maximal clique mining to obtain maximal triclusters
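
A rough sketch of this second stage, under simplifying assumptions: biclusters are given per time slice as (gene-set, sample-set) pairs, "highly overlapping" is approximated by requiring the intersections to stay at least mg × ms in size, and coherence along the time axis is not checked. The function name and data are hypothetical.

```python
def mine_triclusters(biclusters_by_time, mg, ms, mt):
    """Return maximal (genes, samples, times) triples from per-slice biclusters."""
    n_times = len(biclusters_by_time)
    found = set()

    def extend(times, genes, samples):
        if len(times) >= mt:
            found.add((genes, samples, tuple(times)))
        for t_new in range(times[-1] + 1, n_times):
            for g, s in biclusters_by_time[t_new]:
                g2, s2 = genes & g, samples & s
                # Follow the edge only if the overlap is still large enough.
                if len(g2) >= mg and len(s2) >= ms:
                    extend(times + [t_new], g2, s2)

    for t, slice_biclusters in enumerate(biclusters_by_time):
        for g, s in slice_biclusters:
            if len(g) >= mg and len(s) >= ms:
                extend([t], g, s)
    # Drop clusters contained in another cluster on all three axes.
    return [c for c in found
            if not any(c != d and c[0] <= d[0] and c[1] <= d[1]
                       and set(c[2]) <= set(d[2]) for d in found)]

# Hypothetical biclusters for three time slices.
slices = [
    [(frozenset({0, 1, 2}), frozenset({0, 1}))],
    [(frozenset({0, 1, 2, 3}), frozenset({0, 1, 2}))],
    [(frozenset({1, 2, 5}), frozenset({0, 1}))],
]
triclusters = mine_triclusters(slices, mg=2, ms=2, mt=2)
```

Two maximal triclusters survive here: genes {0, 1, 2} over times (0, 1), and genes {1, 2} over all three times, both on samples {0, 1}.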

Constructing triClusters
[Figure: time slices t_i, t_j and t_k]
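
Prune and Merge
[Figure: three overlap cases: clusters A and B merged; B overlapped by a single cluster A; B overlapped by clusters A_i and A_j]
• Merge A and B: |L_((A+B)−A−B)| / |L_(A+B)| < threshold
• Prune B (overlap with a single cluster A): |L_(B−A)| / |L_B| < threshold
• Prune B (overlap with clusters A_i and A_j): |L_(B−A)| / |L_B| < threshold
• Cluster span:
  • L_C = {(i, j, k) | g_i, s_j, t_k ∈ C}
  • L_AB = L_A L_B
  • L_(A−B) = L_A − L_B
  • L_(A+B) = (L_A − L_B) (L_B − L_A) (L_A L_B)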
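
The pruning test can be sketched with plain sets of (gene, sample, time) index triples. This is a minimal illustration, assuming the prune ratio is the fraction of B's span not covered by the other cluster(s); treating several covering clusters through the union of their spans is an assumption, and the names `span` and `uncovered_fraction` are hypothetical.

```python
from itertools import product

def span(cluster):
    """Set of (gene, sample, time) index triples of a cluster X x Y x Z."""
    X, Y, Z = cluster
    return set(product(X, Y, Z))

def uncovered_fraction(b, covering):
    """Fraction of B's span that lies outside all clusters in `covering`."""
    l_b = span(b)
    covered = set().union(*(span(a) for a in covering))
    return len(l_b - covered) / len(l_b)

# Hypothetical clusters given as (genes, samples, times).
A = ({0, 1, 2, 3}, {0, 1}, {0, 1})
A2 = ({4}, {0, 1}, {0, 1})
B = ({2, 3, 4}, {0, 1}, {0, 1})
fraction_single = uncovered_fraction(B, [A])
fraction_multi = uncovered_fraction(B, [A, A2])
```

Here one third of B lies outside A, so B would be pruned only for a threshold above 1/3, while A and A2 together cover B completely.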

Metrics for Measuring Clustering Quality
• NumClusters: number of clusters
• Span: Span(X × Y × Z) = |X| × |Y| × |Z|
• ElementSum: sum of all cluster spans (shared cells counted multiple times)
• Coverage: union of all cluster spans (each cell counted once)
• Overlap: (ElementSum − Coverage) / Coverage
• We want high coverage with small overlap
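
These metrics translate directly into set operations. A small sketch, assuming each cluster is given as a triple of index sets (genes, samples, times); the function name and the example clusters are illustrative.

```python
from itertools import product

def clustering_metrics(clusters):
    """Compute NumClusters, ElementSum, Coverage and Overlap."""
    spans = [set(product(X, Y, Z)) for X, Y, Z in clusters]
    element_sum = sum(len(s) for s in spans)   # shared cells counted repeatedly
    coverage = len(set().union(*spans))        # each cell counted once
    overlap = (element_sum - coverage) / coverage if coverage else 0.0
    return {
        "NumClusters": len(clusters),
        "ElementSum": element_sum,
        "Coverage": coverage,
        "Overlap": overlap,
    }

# Two hypothetical 2x2x1 clusters sharing gene 1.
metrics = clustering_metrics([
    ({0, 1}, {0, 1}, {0}),
    ({1, 2}, {0, 1}, {0}),
])
```

For the two clusters above, ElementSum is 8 and Coverage is 6, giving an Overlap of 2/6.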

Synthetic Data Generation
• Experiments: 1.4 GHz, 448 MB, Linux/VMware
• Synthetic data for parameter evaluation
• Input parameters:
  • |G| = 4000, |S| = 30, |T| = 20
  • Number of clusters to embed = 10
  • Overlap among clusters = 20%
  • Noise for expression values = 3%
  • Cluster size = 150 × 6 × 4 (with some variation)
• Generate clusters with values within some range
• Fill the rest of the cells with random noise
• Do random permutations along each dimension
• We vary one parameter and keep the others fixed
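
A simplified version of such a generator is sketched below. It embeds a single scaling cluster rather than ten overlapping ones, and the value ranges, the multiplicative noise model and the function name `make_synthetic` are assumptions made for illustration.

```python
import random

def make_synthetic(n_genes, n_samples, n_times, cluster_shape, noise=0.03, seed=0):
    """Return (data, cluster); cluster lists the embedded indices per axis."""
    rng = random.Random(seed)
    # Background: random values in every cell.
    data = [[[rng.uniform(1.0, 10.0) for _ in range(n_times)]
             for _ in range(n_samples)] for _ in range(n_genes)]
    # Embed one scaling cluster in the leading corner: each value is a
    # product of per-axis factors, perturbed by up to +/- noise.
    cg, cs, ct = cluster_shape
    gf = [rng.uniform(1.0, 3.0) for _ in range(cg)]
    sf = [rng.uniform(1.0, 3.0) for _ in range(cs)]
    tf = [rng.uniform(1.0, 3.0) for _ in range(ct)]
    for i in range(cg):
        for j in range(cs):
            for k in range(ct):
                data[i][j][k] = gf[i] * sf[j] * tf[k] * (1 + rng.uniform(-noise, noise))
    # A random permutation along each dimension hides the cluster.
    gp = rng.sample(range(n_genes), n_genes)
    sp = rng.sample(range(n_samples), n_samples)
    tp = rng.sample(range(n_times), n_times)
    shuffled = [[[data[gp[i]][sp[j]][tp[k]] for k in range(n_times)]
                 for j in range(n_samples)] for i in range(n_genes)]
    cluster = ([i for i in range(n_genes) if gp[i] < cg],
               [j for j in range(n_samples) if sp[j] < cs],
               [k for k in range(n_times) if tp[k] < ct])
    return shuffled, cluster

# Small example; the slide's setting would be (4000, 30, 20) with (150, 6, 4).
data, (genes, samples, times) = make_synthetic(40, 6, 5, (5, 3, 2))
```

The returned index lists give the ground-truth location of the embedded cluster after permutation, which is what a mining run would be scored against.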

Results on Synthetic Datasets
[Figure: six plots of running time (sec) versus number of genes, number of samples, number of time points, number of clusters, variation (%) and overlap (%)]

Results on Yeast Cell Cycle Dataset
• http://genome-www.stanford.edu/cellcycle
• Elutriation experiment
  • 7679 genes
  • 14 time points (0 to 390 min at 30-min intervals)
  • No real samples: use raw expression values of 13 attributes as samples (Cyc3, Cyc5, ratios, etc.)
  • G × S × T = 7679 × 13 × 14
• Note: actual 3D data will become publicly available soon (e.g., Mouse Brain Atlas: genes × space × time)
• Run TriCluster: mg = 50, ms = 4, mt = 5, ε = 0.03
• Found 5 clusters in 28 s, overlap = 0, coverage = 6250
• 2D view of cluster C0 (51 × 4 × 5) shown next

2D Views of Cluster C0 on Yeast Data
[Figure: three panels of expression values: sample curves (x-axis: genes), time curves (x-axis: genes) and gene curves (x-axis: time points); samples s = CH2I, CH2D, CH2IN, CH2DN; time points t = 120, 210, 270, 330, 390]

Results on Yeast Cell Cycle Dataset: Gene Ontology
[Table: significant (p-value < 0.01) shared Gene Ontology (GO) terms (process, function, location) for genes in different clusters]

Results on Yeast Cell Cycle: Specific Cluster
• Different clusters show different shared terms
• Results are potentially biologically significant

Summary
• Contributions
  • First algorithm to mine triclusters from 3D microarrays
  • Complete, deterministic
  • Allows small noise
  • Flexible: constant, single/two dimension, scaling, shifting
  • Allows arbitrary overlap (merge/prune)
  • Potentially biologically significant clusters (GO)
• Future Work
  • Extend from 3D to k-D datasets
  • Allow different pattern types along different axes (scaling along G × S, shifting along T, etc.)
  • Enhance the clique mining step on multigraphs