Alberto Bertoni, Giorgio Valentini

Model order selection for clustered bio-molecular data
Alberto Bertoni, Giorgio Valentini
{bertoni,valentini}@dsi.unimi.it
http://homes.dsi.unimi.it/~valenti
DSI - Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano

Motivations and objectives

Bio-medical motivations:
• Finding “robust” subclasses of pathologies using bio-molecular data.
• Discovering reliable structures in high-dimensional bio-molecular data.

More general motivations:
• Assessing the reliability of clusterings discovered in high-dimensional data.
• Estimating the significance of the discovered clusterings.

Objectives:
• Developing stability-based methods designed to discover structures in high-dimensional bio-molecular data.
• Developing methods to find multiple and hierarchical structures in the data.
• Assessing the significance of the solutions through statistical tests in the context of unsupervised model order selection problems.

Model order selection through stability-based procedures
• In this conceptual framework, multiple clusterings are obtained by introducing perturbations (e.g. subsampling, Ben-Hur et al., 2002; noise injection, McShane et al., 2003) into the original data. A clustering is considered reliable if it is approximately maintained across multiple perturbations.
• A general stability-based procedure to estimate the reliability of a given clustering:
1. Randomly perturb the data many times according to a given perturbation procedure.
2. Apply a given clustering algorithm to the perturbed data.
3. Apply a given clustering similarity measure (e.g. Jaccard similarity) to multiple pairs of k-clusterings obtained according to steps 1 and 2.
4. Use the similarity measures to assess the stability of a given clustering.
5. Repeat steps 1 to 4 for multiple values of k and select the most stable clustering(s) as the most reliable.
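The clustering similarity used in the procedure above can be illustrated with a pair-counting Jaccard index. This is a minimal sketch, assuming both clusterings are given as label lists over the same examples in the same order. The function name and the convention of returning 1.0 when no pair is co-clustered in either clustering are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations


def jaccard_clustering_similarity(labels_a, labels_b):
    """Pair-counting Jaccard similarity between two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels of example i.
    Label values themselves do not matter, only which examples share a label.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same examples")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1        # pair co-clustered in both clusterings
        elif same_a:
            only_a += 1      # pair co-clustered only in the first
        elif same_b:
            only_b += 1      # pair co-clustered only in the second
    denom = both + only_a + only_b
    # Assumption: two clusterings made only of singletons count as identical.
    return both / denom if denom else 1.0
```

The index is 1 when the two clusterings group the examples identically (up to a relabeling of the clusters) and decreases toward 0 as fewer pairs are co-clustered in both.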

A stability-based method based on random projections (1)
• Data perturbation through a randomized mapping $\mu$, such that for every pair of examples: [formula not recovered]
• An example of a randomized mapping (Plus-Minus-One randomized map, Achlioptas, 2001): [definition not recovered]
• In (Bertoni and Valentini, 2006) we proposed to choose $d'$ according to the Johnson-Lindenstrauss (JL) lemma (1984):
• Given a data set $D$ with $|D| = n$ examples, there exists an $\varepsilon$-distortion embedding into $\mathbb{R}^{d'}$ with $d' = c \log n / \varepsilon^2$, where $c$ is a suitable constant.
• Using randomized maps that obey the JL lemma, we may perturb the data by introducing only bounded distortions, approximately preserving the structure of the original data.
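A minimal NumPy sketch of a Plus-Minus-One projection may help make the perturbation concrete. It assumes the usual construction attributed to Achlioptas: a random matrix with entries +1 or -1, each with probability 1/2, scaled by $1/\sqrt{d'}$ so that squared distances are preserved in expectation. The function names, the natural logarithm in `jl_dimension`, and the caller-supplied constant `c` are assumptions made for illustration; the slides do not give the value of `c`.

```python
import numpy as np


def jl_dimension(n_examples, eps, c):
    """Target dimension d' = c * log(n) / eps^2, rounded up.

    c is the unspecified 'suitable constant' and must be chosen by the caller.
    """
    return int(np.ceil(c * np.log(n_examples) / eps ** 2))


def pmo_projection(X, d_new, rng):
    """Project the rows of X (n x d) into d_new dimensions with a PMO map."""
    d = X.shape[1]
    # Entries are +1 or -1 with equal probability.
    R = rng.choice([-1.0, 1.0], size=(d, d_new))
    # Scaling by 1/sqrt(d_new) keeps squared norms unchanged in expectation.
    return X @ R / np.sqrt(d_new)
```

Each call with a fresh random matrix yields a different perturbed copy of the data, so two calls give the pair of projections that the stability procedure compares.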

A stability-based method based on random projections (2): the MOSRAM algorithm
MOSRAM (Model Order Selection by Randomized Maps):
Input:
  D: a dataset;
  kmax: max number of clusters;
  n: number of pairs of random projections;
  μ: a randomized map;
  Clust: a clustering algorithm;
  sim: a clustering similarity measure.
Output:
  M(i,k): list of similarity measures for each k (1 ≤ i ≤ n, 2 ≤ k ≤ kmax)
begin
  for k := 2 to kmax do
    for i := 1 to n do
      proja := μ(D)
      projb := μ(D)
      Ca := Clust(proja, k)
      Cb := Clust(projb, k)
      M(i,k) := sim(Ca, Cb)
    endfor
  endfor
end.
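The pseudocode translates almost line by line into Python. The sketch below is one possible rendering, not the authors' implementation: it assumes that `mu`, `clust` and `sim` are caller-supplied callables and stores the similarities in a dictionary keyed by k rather than in a matrix M(i,k).

```python
def mosram(D, kmax, n, mu, clust, sim):
    """Collect n similarity values for every k in 2..kmax.

    D     : the dataset (any object accepted by mu)
    kmax  : maximum number of clusters
    n     : number of pairs of random projections
    mu    : callable, D -> randomly perturbed copy of D
    clust : callable, (data, k) -> clustering
    sim   : callable, (clustering, clustering) -> similarity in [0, 1]
    """
    M = {}
    for k in range(2, kmax + 1):
        M[k] = []
        for _ in range(n):
            # Two independent perturbations of the same data.
            proj_a = mu(D)
            proj_b = mu(D)
            c_a = clust(proj_a, k)
            c_b = clust(proj_b, k)
            M[k].append(sim(c_a, c_b))
    return M
```

Passing the components as callables keeps the loop independent of the specific randomized map, clustering algorithm and similarity measure.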

Using the distribution of the similarities to estimate the stability
• $S_k$ ($0 \le S_k \le 1$) is the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm to pairs of randomly and independently perturbed data. The intuitive idea is that if $S_k$ is concentrated close to 1, the corresponding clustering is stable with respect to a given controlled perturbation, and hence it is reliable.
• $f_k(s)$ is the density function of $S_k$. We have: [formula not recovered]
• $g(k)$ is a parameter of concentration (Ben-Hur et al., 2002).
• We may observe the following facts: $E[S_k]$ can be used as a good index of the reliability of the k-clusterings.
• $E[S_k]$ may be estimated through the empirical means $\xi_k$: [formula not recovered], where $\rho$ is a randomized perturbation procedure.
• Note that we use the overall distribution of the similarity measures to assess the stability of the k-clusterings.
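As a small illustration of these quantities, the sketch below computes the empirical mean, the variance and the empirical cumulative distribution of a list of similarity values collected for one k. The function names and the use of the unbiased (n - 1) variance are assumptions; the slides do not state which variance estimator was used.

```python
def empirical_summary(similarities):
    """Return (mean, variance) of the similarity values observed for one k."""
    n = len(similarities)
    mean = sum(similarities) / n
    # Assumption: unbiased sample variance; defined as 0.0 for a single value.
    var = sum((s - mean) ** 2 for s in similarities) / (n - 1) if n > 1 else 0.0
    return mean, var


def ecdf(similarities, s):
    """Empirical cumulative distribution: fraction of values <= s."""
    return sum(1 for v in similarities if v <= s) / len(similarities)
```

A stable k-clustering shows a mean close to 1 and an empirical cumulative distribution that stays near 0 until s approaches 1.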

A $\chi^2$-based method to estimate the significance of the discovered clusterings (1)
• We may sort the empirical means $\xi_k$: $\pi$ is the index permutation such that $\xi_{\pi(1)} \ge \xi_{\pi(2)} \ge \dots$
• For each k-clustering, we consider two groups of pairwise clustering similarity values separated by a threshold $t^o$. Thus we may obtain: $P(S_k > t^o) = 1 - F(S_k = t^o)$
• $x_k = P(S_k > t^o)\,n$ is the number of times the similarity values are larger than $t^o$, where $n$ is the number of repeated similarity measurements. Hence $x_k$ may be interpreted as the number of successes from a binomial population with parameter $\theta_k$.
• Setting $X_k$ as a random variable that counts how many times $S_k > t^o$, we have: [formula not recovered]
• The unknown $\theta_k$ is estimated through its pooled estimate: [formula not recovered]. We can compute the following statistic: [formula not recovered]

A $\chi^2$-based method to estimate the significance of the discovered clusterings (2)
• Using the previous $Y$ statistic we can test the following alternative hypotheses:
  - $H_0$: all the $\theta_k$ are equal to $\theta$ (the considered k-clusterings are equally reliable)
  - $H_a$: the $\theta_k$ are not all equal (the considered k-clusterings are not equally reliable)
• If the null hypothesis can be rejected at significance level $\alpha$, we may conclude that with probability $1-\alpha$ the considered proportions are different, and hence that at least one k-clustering significantly differs from the others.
• The test is iterated until no significant difference of the similarities between the k-clusterings is detected. We start by considering all the k-clusterings. If the statistical test registers a difference at significance level $\alpha$, we exclude the last clustering (according to the sorting of $\xi_k$) and repeat the test with the remaining k-clusterings. This process is iterated until no significant difference is detected: the remaining (top sorted) k-clusterings represent the estimated set of stable numbers of clusters (at significance level $\alpha$).
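The iterative test can be sketched as follows. Because the slide formulas are not preserved in this transcript, the sketch assumes the standard $\chi^2$ statistic for equality of binomial proportions, $Y = \sum_{k \in K} (x_k - n\hat\theta)^2 / (n\hat\theta(1-\hat\theta))$ with $|K| - 1$ degrees of freedom, where $\hat\theta$ is the pooled proportion over the set $K$ of k values under test. The function names, the input layout (a dictionary from k to a list of n similarity values) and the handling of the degenerate case $\hat\theta \in \{0, 1\}$ are illustrative assumptions.

```python
import math


def chi2_sf(y, df):
    """Upper tail P(Y >= y) of a chi-square distribution with integer df >= 1."""
    if y <= 0:
        return 1.0
    h = y / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # regularized Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # regularized Q(1/2, h)
    while a < df / 2.0:
        # Recurrence Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1)
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def select_stable_k(M, t0, alpha):
    """Return the k values kept by the iterated test, sorted by decreasing mean.

    M maps each k to a list of n similarity values (same n for every k).
    """
    n = len(next(iter(M.values())))
    mean = {k: sum(v) / len(v) for k, v in M.items()}
    x = {k: sum(1 for s in v if s > t0) for k, v in M.items()}  # successes
    K = sorted(M, key=lambda k: -mean[k])
    while len(K) > 1:
        theta = sum(x[k] for k in K) / (len(K) * n)  # pooled estimate
        if theta in (0.0, 1.0):
            break  # assumption: identical counts, no detectable difference
        Y = sum((x[k] - n * theta) ** 2 for k in K) / (n * theta * (1 - theta))
        if chi2_sf(Y, len(K) - 1) >= alpha:
            break  # no significant difference among the remaining clusterings
        K.pop()    # exclude the last (least stable) clustering and retest
    return K
```

The threshold `t0` and the significance level `alpha` are left to the caller.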

Experiments with high dimensional synthetic data (I)
Histograms of the similarity measures obtained by applying PAM clustering to 100 pairs of PMO projections from 1000-dimensional to 471-dimensional subspaces ($\varepsilon = 0.2$):
• 1000-dimensional synthetic data
• Data distributed according to a multivariate Gaussian distribution
• 2 or 6 clusters of data (as highlighted by the PCA projection onto the first two principal components)

Experiments with high dimensional synthetic data (II)
Empirical cumulative distribution of the similarity measures for different k-clusterings.

Similarity mean and variance for each k, sorted according to the means:

k     p-value    mean      variance
2     ----       1.0000    0.0000
6     1.0000     1.0000    0.0000
7     0.0000     0.9217    0.0016
8     0.0000     0.8711    0.0033
9     0.0000     0.8132    0.0042
5     0.0000     0.8090    0.0104
3     0.0000     0.8072    0.0157
10    0.0000     0.7715    0.0056
4     0.0000     0.7642    0.0158

2 and 6 clusters are selected at the 0.01 significance level.

Detection of multiple structures
Empirical cumulative distribution of the similarity measures for different k-clusterings.

k     p-value       mean      variance
3     --------      1.0000    0.0000e+00
6     1.0000e+00    0.9979    1.6185e-05
12    1.0000e+00    0.9907    8.0657e-05
13    6.9792e-03    0.9809    2.8658e-04
14    2.2928e-06    0.9754    3.3594e-04
15    0.0000e+00    0.9580    6.8150e-04
7     0.0000e+00    0.9435    2.3055e-03
8     0.0000e+00    0.8954    4.6829e-03
5     0.0000e+00    0.8947    1.5433e-02
11    0.0000e+00    0.8897    3.2340e-03
9     0.0000e+00    0.8706    6.9421e-03
10    0.0000e+00    0.8691    5.0763e-03
4     0.0000e+00    0.8609    9.3463e-03
2     0.0000e+00    0.8532    2.3234e-02

3, 6 and 12 clusters are selected at the 0.01 significance level.

Discovering significant structures in bio-molecular data (Leukemia data, Golub et al., 1999)
Empirical cumulative distribution of the similarity measures for different k-clusterings.

Similarity mean and variance for each k:

k     p-value       mean      variance
2     ---------     0.8285    0.0077
3     7.3280e-01    0.8060    0.0124
4     2.3279e-06    0.6589    0.0060
5     9.5199e-11    0.6012    0.0073
6     6.3282e-15    0.5424    0.0057
7     0.0000e+00    0.5160    0.0062
8     0.0000e+00    0.4865    0.0050
9     0.0000e+00    0.4819    0.0060
10    0.0000e+00    0.4744    0.0049

2 and 3 clusters are selected at the 0.01 significance level.

Comparison with other methods
* Note that the subdivision of Lymphoma samples into 3 classes (DLBCL, CLL and FL) is performed on a histopathological and morphological basis, and this classification does not seem to correspond to the bio-molecular classification (Alizadeh et al., 2000).

Conclusions
• The proposed stability method based on random projections is well suited to discovering structures in high-dimensional bio-medical data.
• The reliability of the discovered k-clusterings may be estimated by exploiting the distribution of the pairwise clustering similarities and a $\chi^2$-based statistical test tailored to unsupervised model order selection.
• The $\chi^2$-based test assumes that the random variables are normally distributed. We are developing a new distribution-independent approach based on the Bernstein inequality to assess the significance of the discovered k-clusterings.
