Essential Statistics in Biology: Getting the Numbers Right
Raphael Gottardo
Clinical Research Institute of Montreal (IRCM)
raphael.gottardo@ircm.qc.ca
http://www.rglab.org
Outline
• Exploratory Data Analysis
• 1-2 sample t-tests, multiple testing
• Clustering
• SVD/PCA
• Frequentists vs. Bayesians
Outline
• Basics of clustering
• Hierarchical clustering
• K-means
• Model-based clustering
What is it?
Clustering is the grouping of similar objects: partition a data set into subsets (clusters) so that the data in each subset are "close" to one another, usually according to some defined distance measure.
Examples: clustering web pages, clustering genes by expression.
Hierarchical clustering
Given N items and a distance metric:
• Assign each item to its own cluster; initialize the distance matrix between clusters as the distances between items
• Find the closest pair of clusters and merge them into a single cluster
• Compute the new distances between clusters
• Repeat steps 2-3 until all items are merged into a single cluster
A minimal sketch of these steps follows below.
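The sketch uses only base R on simulated data; the toy matrix and the cut at k = 3 are illustrative choices, not part of the original example.
# Toy illustration of agglomerative clustering with base R
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)     # 10 items in 2 dimensions
D <- dist(x, method = "euclidean")   # step 1: pairwise distances between items
hc <- hclust(D, method = "single")   # steps 2-3, repeated until one cluster remains
plot(hc)                             # the dendrogram records the merge history
cutree(hc, k = 3)                    # recover a partition by cutting the tree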
Single linkage
The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.
[Figure: two clusters joined by the shortest pairwise distance d]
Complete linkage
The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.
[Figure: two clusters joined by the greatest pairwise distance d]
Average linkage
The distance between clusters is defined as the average of all pairwise distances between members of one cluster and members of the other cluster.
[Figure: two clusters; d = average of all distances]
The sketch below computes all three linkage distances on a toy example.
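An illustrative check (simulated clusters, base R only) that computes each linkage distance directly from the cross-cluster pairwise distances:
set.seed(1)
c1 <- matrix(rnorm(10), ncol = 2)               # cluster 1: 5 points
c2 <- matrix(rnorm(10, mean = 3), ncol = 2)     # cluster 2: 5 points
d <- as.matrix(dist(rbind(c1, c2)))[1:5, 6:10]  # 5 x 5 cross-cluster distances
min(d)    # single linkage
max(d)    # complete linkage
mean(d)   # average linkage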
Example
Cell-cycle dataset (Cho et al. 1998): expression levels of ~6000 genes during the cell cycle, measured at 17 time points (2 cell cycles).
Example
# Load the first 50 genes; columns 3-19 hold the 17 time points
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")

# Single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
plot(hc.single)
rect.hclust(hc.single, k = 4)
class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix() rather than t(): presumably this cluster holds a single gene,
# so the subset would otherwise drop to a vector
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Complete linkage
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.complete == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Average linkage
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# K-means with 4 clusters
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example: single-linkage dendrogram of the 50 genes [figure]
Example: single-linkage dendrogram cut into k=2 clusters [figure]
Example: single-linkage dendrogram cut into k=3 clusters [figure]
Example: single-linkage dendrogram cut into k=4 clusters [figure]
Example: single-linkage dendrogram cut into k=5 clusters [figure]
Example: single-linkage dendrogram cut into k=25 clusters [figure]
Example: expression profiles of clusters 1-4 from single linkage with k=4 [2x2 panel figure]
Example: complete-linkage dendrogram cut into k=4 clusters [figure]
Example: expression profiles of clusters 1-4 from complete linkage with k=4 [2x2 panel figure]
K-means
N items, assume K clusters. The goal is to minimize
$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$
over the possible assignments $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$; $\mu_k$ represents the location of cluster k.
The sketch below checks this objective against the output of kmeans().
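This objective is exactly what kmeans() reports as tot.withinss; a quick illustrative check on simulated data:
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
km <- kmeans(x, centers = 3)
# Recompute the objective: squared distances from each point to its assigned centroid
obj <- sum((x - km$centers[km$cluster, ])^2)
all.equal(obj, km$tot.withinss)   # TRUE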
K-means - algorithm
• Divide the data into K initial clusters; initialize the centroids as the means of those clusters
• Assign each item to the cluster with the closest centroid
• When all items have been assigned, recalculate the centroids (cluster means)
• Repeat steps 2-3 until the centroids no longer move
A from-scratch sketch of this loop follows below.
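The steps translate almost directly into code. The following from-scratch version is a bare-bones sketch (random initialization, no empty-cluster handling), not how kmeans() is implemented internally:
my.kmeans <- function(x, K, max.iter = 50) {
  mu <- x[sample(nrow(x), K), , drop = FALSE]                   # step 1: initial centroids
  for (it in 1:max.iter) {
    # step 2: assign each item to the closest centroid
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, mu[k, ])^2))
    cl <- max.col(-d2)
    # step 3: recompute each centroid as its cluster's mean (assumes no cluster goes empty)
    mu.new <- apply(x, 2, function(col) tapply(col, cl, mean))
    if (all(abs(mu.new - mu) < 1e-8)) break                     # step 4: centroids stopped moving
    mu <- mu.new
  }
  list(cluster = cl, centers = mu)
}
Running my.kmeans(x, 2) on the two-group data simulated in the demo below should recover centroids near (0, 0) and (1, 1).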
K-means - algorithm
# Simulate two well-separated 2-D groups of 50 points each
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Re-run k-means with identical starting centers but an increasing
# iteration cap, to watch the centroids converge step by step
for (i in 1:4) {
  set.seed(100)  # same runif() draws, hence the same starting centers each time
  cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)  # pause so each step is visible
}
Example: panels 1-4, successive k-means iterations from the demo above. Why? [2x2 panel figure]
Example: expression profiles of clusters 1-4 from k-means with K=4 [2x2 panel figure]
Summary
• K-means and hierarchical clustering methods are useful techniques
• Fast and easy to implement, but beware of the memory requirements of hierarchical clustering (the full N x N distance matrix)
• A bit "ad hoc":
• Number of clusters?
• Distance metric?
• What counts as a good clustering?
Model-based clustering
• Based on probability models (e.g. normal mixture models)
• Gives a principled way to talk about a good clustering
• Compare several models
• Estimate the number of clusters!
Model-based clustering
Yeung et al. (2001)
Multivariate observations $y_1, \dots, y_N$, K clusters. Assume observation i belongs to cluster k; then
$y_i \mid z_i = k \sim N(\mu_k, \Sigma_k)$
that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$. (A small density illustration follows below.)
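A small illustration of this cluster density, using the mvtnorm package (an assumption here; the slides themselves use only base R and mclust):
library(mvtnorm)                       # provides dmvnorm()
y     <- c(0.5, -0.2)                  # one 2-dimensional observation
mu    <- c(0, 0)                       # cluster mean
Sigma <- diag(2)                       # cluster covariance
dmvnorm(y, mean = mu, sigma = Sigma)   # density of y under N(mu, Sigma)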
Model-based clustering
Banfield and Raftery (1993)
Eigenvalue decomposition of the cluster covariance:
$\Sigma_k = \lambda_k D_k A_k D_k^T$
where $\lambda_k$ controls the volume, $D_k$ the orientation, and $A_k$ the shape of cluster k.
Model-based clustering: common parametrizations
• EII: $\Sigma_k = \lambda I$ (equal volume, spherical)
• VII: $\Sigma_k = \lambda_k I$ (unequal volume, spherical)
• EEE: $\Sigma_k = \lambda D A D^T$ (equal volume, shape, and orientation)
• VVV: $\Sigma_k = \lambda_k D_k A_k D_k^T$ (unconstrained)
These forms are constructed explicitly in the sketch below.
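A brief illustrative construction of these forms in p = 2 dimensions, with hypothetical values for the volume, shape, and orientation factors:
p <- 2
lambda <- 2                                    # volume
A <- diag(c(1.5, 1 / 1.5))                     # shape (unit determinant)
theta <- pi / 6                                # rotation angle
D <- matrix(c(cos(theta), sin(theta),          # orientation (rotation matrix)
              -sin(theta), cos(theta)), p, p)
Sigma.EII <- lambda * diag(p)                  # equal volume, spherical
Sigma.VII <- 3 * diag(p)                       # cluster-specific volume (lambda_k = 3), spherical
Sigma.EEE <- lambda * D %*% A %*% t(D)         # same ellipsoid shared by all clusters
# VVV: lambda_k, D_k and A_k all vary freely from cluster to cluster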
Estimation
Likelihood of the mixture model:
$L(\theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \phi(y_i; \mu_k, \Sigma_k)$
where the $\pi_k$ are the mixing proportions and $\phi$ is the multivariate normal density. Given the number of clusters and the covariance structure, the EM algorithm can be used to maximize it. The Mclust R package is available from CRAN. (A toy EM loop follows below.)
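For intuition, a compact illustrative EM loop for a one-dimensional two-component normal mixture; Mclust handles the multivariate case and the covariance constraints for you:
set.seed(1)
y <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))   # simulated two-component data
p1 <- 0.5; mu <- c(-1, 5); s <- c(1, 1)      # crude starting values
for (it in 1:50) {
  # E-step: posterior probability that each observation came from component 1
  d1 <- p1 * dnorm(y, mu[1], s[1])
  d2 <- (1 - p1) * dnorm(y, mu[2], s[2])
  w  <- d1 / (d1 + d2)
  # M-step: update mixing proportion, means, and standard deviations
  p1 <- mean(w)
  mu <- c(weighted.mean(y, w), weighted.mean(y, 1 - w))
  s  <- c(sqrt(sum(w * (y - mu[1])^2) / sum(w)),
          sqrt(sum((1 - w) * (y - mu[2])^2) / sum(1 - w)))
}
round(c(p1, mu, s), 2)   # should approach 0.5, 0, 4, 1, 1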
Model selection
Which model is appropriate?
• Which covariance structure?
• How many clusters?
Compare the different models using BIC.
Model selection
We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$ respectively. Given the observed data D, define the integrated likelihood
$p(D \mid M_k) = \int p(D \mid \theta_k, M_k) \, p(\theta_k \mid M_k) \, d\theta_k$
the probability of observing the data given model $M_k$. NB: $\theta_1$ and $\theta_2$ might have different dimensions.
Model selection
To compare the two models, use the integrated likelihoods. The integral is difficult to compute! The Bayesian information criterion approximates it:
$\mathrm{BIC}_k = 2 \log p(D \mid \hat\theta_k, M_k) - \nu_k \log N$
where $p(D \mid \hat\theta_k, M_k)$ is the maximized likelihood and $\nu_k$ is the number of parameters in model $M_k$.
Model selection
Bayesian information criterion:
$\mathrm{BIC}_k = \underbrace{2 \log p(D \mid \hat\theta_k, M_k)}_{\text{measure of fit}} - \underbrace{\nu_k \log N}_{\text{penalty term}}$
A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parametrization (Mclust).
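In recent versions of mclust, the EMclust() call used on the next slides has been replaced by Mclust(). A sketch of BIC-based selection with the current interface, assuming cho.data.std is the standardized data used below:
library(mclust)
fit <- Mclust(cho.data.std, G = 2:9, modelNames = c("EII", "EEI"))
plot(fit, what = "BIC")    # BIC trace per model and number of clusters
fit$modelName; fit$G       # parametrization and cluster count with the best BIC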
Example revisited
[Figure: BIC against the number of clusters; panel 1 EII, panel 2 EEI]
library(mclust)
# cho.data.std: standardized version of cho.data (assumed; not defined on the slides)
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)                           # BIC traces for both models
cho.mclust <- EMclust(cho.data.std, 4, "EII")  # EII model with 4 clusters
sum.cho <- summary(cho.mclust, cho.data.std)   # extract the classification
Example revisited
# Expression profiles of the four clusters found by the EII model
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example revisited: expression profiles of clusters 1-4, EII model with 4 clusters [2x2 panel figure]
Example revisited
# EEI model with 3 clusters
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
Example revisited: expression profiles of clusters 1-3, EEI model with 3 clusters [figure]
Summary
• Model-based clustering is a nice alternative to heuristic clustering algorithms
• BIC can be used to choose both the covariance structure and the number of clusters
Conclusion
• We have seen a few clustering algorithms; there are many others:
• Two-way clustering
• The plaid model ...
• Clustering is a useful tool and ... a dangerous weapon
• To be consumed with moderation!