Pattern-based Clustering

Pattern-based Clustering • How to cluster the five objects? • Hard to define a global similarity measure

What Is Pattern-based Clustering? • A cluster: a set of objects following the same pattern in a subset of dimensions (Wang et al, 2002)

Challenges • Most clustering approaches do not address the temporal variations in time series gene expression data, which is an important feature and affect the performance. • Previous approaches try to find coherent patterns and clusters w.r.t. the entire set of attributes • Patterns may be embedded in sub attribute spaces • Only a subset of genes participate in any cellular processes of interest • Any cellular process may take place only in a subset of experiment conditions. a) raw data b) shifting patterns c) scaling patterns

Gene-Sample-Time (GST) Microarray Data A collection of samples 2D time-series data • The GST microarray data consist of three dimensions • The samples often exhibit various phenotypes, e.g., cancer vs. control 3D gene-sample-time data

Challenges of Mining GST Data • Most clustering algorithms were designed for 2D data, and cannot be directly extended for 3D data.

Coherent Gene Cluster A coherent gene cluster The 2D representation A 3D GST data set • The group of samples (sj1, sj2, sj3 ) may exhibit the same phenotype • The group of genes (gi1,gi2,gi3) may be strongly correlated to the phenotype shared by (sj1, sj2, sj3 )

Sample A Sample B Sample C Sample D Sample E Sample F Sample G Sample H Results from a Real Data Set • The Multiple Sclerosis (MS) data consist of • 4324 genes • 13 MS patients • 10 time points before and after IFN- treatment • 25 coherent gene clusters were reported An example of coherent gene clusters (107 genes, 8 samples)

Other Types of Coherent Clusters

Problem Definition • Given a GST microarray data matrix M, a maximal coherent gene cluster C=(GS) is a combination of a subset of genes G and a subset of samples S such that: • Coherent: the subset of genes G are coherent across the subset of samples S; • Significant: |G|≥ming, |S|≥mins, where ming and mins are user-specified parameters; • Maximal: any insertion of gG or sS will make C not coherent. • The problem of mining coherent gene clusters is to find the complete set of maximal coherent gene clusters in M.

Coherence Measure • Various coherence measures exist. • Measure selection is application dependent. • A general coherence model • Given a coherence measure sim(•) and a user-specified threshold , • A gene ga is coherent on samples siand sj, if sim(pai,paj)≥ . • Coherent gene matrix (G1,S1): if every gene gi  G1 is coherent across samples in S1. • Trivial coherent gene matrix: ({gi}, {sj}), (G, {sj}) • We choose the Person’s correlation coefficient. • Other coherence measures are also applicable.

Related Work • Clustering algorithms on Gene-Sample or Gene-Time microarray data • The cluster model is completely different • Subspace clustering • Find subsets of objects coherent with subsets of attributes • Frequent pattern mining • Find subsets of items frequently appearing in transaction databases

Algorithm Outline • Phase 1 (Pre-processing) : For each gene g, find the complete set of maximal coherent sample sets of gene g. • Phase 2: Compute the complete set of maximal coherent gene clusters based on pre-processing results.

Coherent Sample Sets • Given a gene g, a maximal coherent sample set of g is a subset of samples Si such that: • coherent : g is coherent across Si; • significant : |Si|  mins; • maximal : there exists no superset S’Si such that g is also coherent with S’. • (g Si ) is a building block for coherent gene clusters including g.

s5 s4 s3 s4 s6 s3 s1 s5 s6 s2 Preprocessing Phase Suppose mins = 3 {s3,s4,s5,s6} is a coherent sample set of gene g The coherence matrix of gene g The coherence graph of gene g

Sample-gene Search • Set enumeration tree • Enumerate all subsets of samples systematically. • Each node on the tree corresponds to a subset of samples. • For each node S • Find the maximal set of genes Gs which is coherent with S

{} {a} {b} {c} {d} {a,b} {a,c} {a,d} {b,c} {b,d} {c,d} {a,b,c} {a,b,d} {a,c,d} {b,c,d} {a,b,c,d} Set Enumeration Tree The set enumeration tree for {a,b,c,d}

Find the Maximal Coherent Subset of Genes • After the pre-processing phase: • Given a subset of samples S, how to find the maximal coherent set of genes GS? • Expensive approach: scan the table once For each S, Gs can be derived by a single scan of the maximal coherent samples of all genes. If S  Sj, g  Gs. • Efficient approach: use the inverted list.

The Inverted List g2.b1 g2.b2 The table of maximal coherent sample sets for genes The table of inverted lists for samples

Intersection Instead of Scanning • Given a subset of samples S={si1,…,sik}, intersect the inverted lists of si1,…,sik. • For example, given S={s1,s2,s3}, Ls1^Ls2^Ls3={g1.b1,g3.b1,g4.b1}, so Gs={g1,g3,g4}. • Suppose the parent of S is S’={si1,…,sik-1}, then LS=LS’Lsik.

Anti-monotonic Property • Given a combination (GS), • if G is not coherent on S, • then for any superset S’S, G cannot be coherent on S’. • For any descendant S’ of S on the tree • let GS be the maximal coherent gene set of S, • let GS’ be the maximal coherent gene sets of S’, • since S’S, we have GS’ GS.

Pruning Irrelevant Samples • Given a subset of samples S={si1,…,sik}, a sample sjtails, if • j > ik • there exists at least ming genes g such that g is coherent with S{sj} • Samples sltails(irrelevant samples) cannot be used to extend S.

Pruning Unpromising Nodes • Given a subset of samples S={si1,…,sik}, • if |S|+|tails|< mins, then prune the subtree of S. • let the maximal coherent subset of genes of S be Gs, • if there exists (G’S’) such that • (Stails)  S’ • GsG’, • the prune the subtree of S

Determination of Maximal Coherent Gene Clusters • The depth-first search strategy: • For any superset S’ of S, S’ is • visited before S; • or a child of S. • To determine whether a coherent gene cluster (GsS) is maximal, • check (GsS) after visiting all its children, • report (GsS) if it is not subsumed.

{ } {s2} {s3,s4} {s1} {s2,s3,s4,s5} {s3} {} {s4} {} {s1,s4} {} {g1.b1, g2.b1, g3.b1} {s2,s3} {} {g1.b1, g3.b1, g4.b1} {s2,s4} {} {g1.b1, g2.b1, g3.b1} {s1,s2} {s3,s4} {g1.b1, g2.b1, g3.b1, g4.b1} {s1,s3} {} {g1.b1, g3.b1, g4.b1} {s1,s2,s3} {} {g1.b1,g3.b1,g4.b1} {s1,s2,s4} {} {g1.b1,g2.b1,g3.b1}

Mining Coherent Gene Clusters • Systematic enumeration of genes and samples • Sample-Gene Search • Gene-Sample Search • Pruning rules • Determination of whether a coherent gene cluster (GS) is maximal

Gene-sample Search

Experiment Data Sets • Real-world gene expression data • 4324 genes • 13 multiple sclerosis (MS) patients • before and at 1,2,4,8,24,48,120 and 168 hours after IFN- treatment • Synthetic data • Given the number of genes NG, samples NS and coherent gene clusters NC • Simulate the pre-processing results • Embed NC maximal coherent gene clusters (GS)

A Coherent Gene Cluster from Real Data

Effect of Parameters Number of clusters vs. ming (mins=3,=0.8) Number of clusters vs. mins (ming=10, =0.8) Number of clusters vs.  (ming=10,mins=3)

Scalability Scalability w.r.t. number of genes (number of samples: 30) Scalability w.r.t. number of samples (number of genes: 3,000) Scalability of phase 1

Conclusion • We define the new problem of mining coherent gene clusters from the novel gene-sample-time microarray data. • We propose two approaches: the sample-gene search and the gene-sample search. • We conduct an extensive empirical evaluation on both real and synthetic data sets.

Future Work • New problems from the gene-sample-time microarray data: • Coherent sample clusters (GS) • for each sS, any pair of genes gi, gjG has coherent patterns. • Coherent gene-sample clusters (GS), • both a coherent gene cluster and a coherent sample cluster.

Pattern-based Clustering