Essential Statistics in Biology: Getting the Numbers Right
Raphael Gottardo
Clinical Research Institute of Montreal (IRCM)
raphael.gottardo@ircm.qc.ca
http://www.rglab.org
Outline
• Exploratory Data Analysis
• 1-2 sample t-tests, multiple testing
• Clustering
• SVD/PCA
• Frequentists vs. Bayesians
Outline
• Basics of clustering
• Hierarchical clustering
• K-means
• Model-based clustering
What is it?
Clustering is the grouping of similar objects: partition a data set into subsets (clusters) so that the data in each subset are "close" to one another, usually according to some defined distance measure.
Examples: clustering web pages, clustering genes by expression.
Hierarchical clustering
Given N items and a distance metric:
• Assign each item to its own cluster; initialize the distance matrix between clusters as the distances between items
• Find the closest pair of clusters and merge them into a single cluster
• Compute the new distances between clusters
• Repeat steps 2-3 until all items are merged into a single cluster
A minimal sketch of these steps follows below.
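The sketch uses only base R on simulated data; the toy matrix and the cut at k = 3 are illustrative choices, not part of the original example.
# Toy illustration of agglomerative clustering with base R
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)     # 10 items in 2 dimensions
D <- dist(x, method = "euclidean")   # step 1: pairwise distances between items
hc <- hclust(D, method = "single")   # steps 2-3, repeated until one cluster remains
plot(hc)                             # the dendrogram records the merge history
cutree(hc, k = 3)                    # recover a partition by cutting the tree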
Single linkage
The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.
[Figure: two clusters joined by the shortest pairwise distance d]
Complete linkage
The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.
[Figure: two clusters joined by the greatest pairwise distance d]
Average linkage
The distance between clusters is defined as the average of all pairwise distances between members of one cluster and members of the other cluster.
[Figure: two clusters; d = average of all distances]
The sketch below computes all three linkage distances on a toy example.
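An illustrative check (simulated clusters, base R only) that computes each linkage distance directly from the cross-cluster pairwise distances:
set.seed(1)
c1 <- matrix(rnorm(10), ncol = 2)               # cluster 1: 5 points
c2 <- matrix(rnorm(10, mean = 3), ncol = 2)     # cluster 2: 5 points
d <- as.matrix(dist(rbind(c1, c2)))[1:5, 6:10]  # 5 x 5 cross-cluster distances
min(d)    # single linkage
max(d)    # complete linkage
mean(d)   # average linkage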
Example
Cell-cycle dataset (Cho et al. 1998): expression levels of ~6000 genes during the cell cycle, measured at 17 time points (2 cell cycles).
Example
# Load the first 50 genes; columns 3-19 hold the 17 time points
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")

# Single linkage
hc.single <- hclust(D.cho, method = "single", members = NULL)
plot(hc.single)
rect.hclust(hc.single, k = 4)
class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix() rather than t(): presumably this cluster holds a single gene,
# so the subset would otherwise drop to a vector
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Complete linkage
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.complete == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Average linkage
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# K-means with 4 clusters
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example: single-linkage dendrogram of the 50 genes [figure]
Example: single-linkage dendrogram cut into k=2 clusters [figure]
Example: single-linkage dendrogram cut into k=3 clusters [figure]
Example: single-linkage dendrogram cut into k=4 clusters [figure]
Example: single-linkage dendrogram cut into k=5 clusters [figure]
Example: single-linkage dendrogram cut into k=25 clusters [figure]
Example: expression profiles of clusters 1-4 from single linkage with k=4 [2x2 panel figure]
Example: complete-linkage dendrogram cut into k=4 clusters [figure]
Example: expression profiles of clusters 1-4 from complete linkage with k=4 [2x2 panel figure]
K-means
N items, assume K clusters. The goal is to minimize
$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$
over the possible assignments $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$; $\mu_k$ represents the location of cluster k.
The sketch below checks this objective against the output of kmeans().
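This objective is exactly what kmeans() reports as tot.withinss; a quick illustrative check on simulated data:
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
km <- kmeans(x, centers = 3)
# Recompute the objective: squared distances from each point to its assigned centroid
obj <- sum((x - km$centers[km$cluster, ])^2)
all.equal(obj, km$tot.withinss)   # TRUE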
K-means - algorithm
• Divide the data into K initial clusters; initialize the centroids as the means of those clusters
• Assign each item to the cluster with the closest centroid
• When all items have been assigned, recalculate the centroids (cluster means)
• Repeat steps 2-3 until the centroids no longer move
A from-scratch sketch of this loop follows below.
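The steps translate almost directly into code. The following from-scratch version is a bare-bones sketch (random initialization, no empty-cluster handling), not how kmeans() is implemented internally:
my.kmeans <- function(x, K, max.iter = 50) {
  mu <- x[sample(nrow(x), K), , drop = FALSE]                   # step 1: initial centroids
  for (it in 1:max.iter) {
    # step 2: assign each item to the closest centroid
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, mu[k, ])^2))
    cl <- max.col(-d2)
    # step 3: recompute each centroid as its cluster's mean (assumes no cluster goes empty)
    mu.new <- apply(x, 2, function(col) tapply(col, cl, mean))
    if (all(abs(mu.new - mu) < 1e-8)) break                     # step 4: centroids stopped moving
    mu <- mu.new
  }
  list(cluster = cl, centers = mu)
}
Running my.kmeans(x, 2) on the two-group data simulated in the demo below should recover centroids near (0, 0) and (1, 1).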
K-means - algorithm
# Simulate two well-separated 2-D groups of 50 points each
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Re-run k-means with identical starting centers but an increasing
# iteration cap, to watch the centroids converge step by step
for (i in 1:4) {
  set.seed(100)  # same runif() draws, hence the same starting centers each time
  cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)  # pause so each step is visible
}
Example: panels 1-4, successive k-means iterations from the demo above. Why? [2x2 panel figure]
Example: expression profiles of clusters 1-4 from k-means with K=4 [2x2 panel figure]
Summary
• K-means and hierarchical clustering methods are useful techniques
• Fast and easy to implement, but beware of the memory requirements of hierarchical clustering (the full N x N distance matrix)
• A bit "ad hoc":
• Number of clusters?
• Distance metric?
• What counts as a good clustering?
Model-based clustering
• Based on probability models (e.g. normal mixture models)
• Gives a principled way to talk about a good clustering
• Compare several models
• Estimate the number of clusters!
Model-based clustering
Yeung et al. (2001)
Multivariate observations $y_1, \dots, y_N$, K clusters. Assume observation i belongs to cluster k; then
$y_i \mid z_i = k \sim N(\mu_k, \Sigma_k)$
that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$. (A small density illustration follows below.)
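A small illustration of this cluster density, using the mvtnorm package (an assumption here; the slides themselves use only base R and mclust):
library(mvtnorm)                       # provides dmvnorm()
y     <- c(0.5, -0.2)                  # one 2-dimensional observation
mu    <- c(0, 0)                       # cluster mean
Sigma <- diag(2)                       # cluster covariance
dmvnorm(y, mean = mu, sigma = Sigma)   # density of y under N(mu, Sigma)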
Model-based clustering
Banfield and Raftery (1993)
Eigenvalue decomposition of the cluster covariance:
$\Sigma_k = \lambda_k D_k A_k D_k^T$
where $\lambda_k$ controls the volume, $D_k$ the orientation, and $A_k$ the shape of cluster k.
Model-based clustering: common parametrizations
• EII: $\Sigma_k = \lambda I$ (equal volume, spherical)
• VII: $\Sigma_k = \lambda_k I$ (unequal volume, spherical)
• EEE: $\Sigma_k = \lambda D A D^T$ (equal volume, shape, and orientation)
• VVV: $\Sigma_k = \lambda_k D_k A_k D_k^T$ (unconstrained)
These forms are constructed explicitly in the sketch below.
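A brief illustrative construction of these forms in p = 2 dimensions, with hypothetical values for the volume, shape, and orientation factors:
p <- 2
lambda <- 2                                    # volume
A <- diag(c(1.5, 1 / 1.5))                     # shape (unit determinant)
theta <- pi / 6                                # rotation angle
D <- matrix(c(cos(theta), sin(theta),          # orientation (rotation matrix)
              -sin(theta), cos(theta)), p, p)
Sigma.EII <- lambda * diag(p)                  # equal volume, spherical
Sigma.VII <- 3 * diag(p)                       # cluster-specific volume (lambda_k = 3), spherical
Sigma.EEE <- lambda * D %*% A %*% t(D)         # same ellipsoid shared by all clusters
# VVV: lambda_k, D_k and A_k all vary freely from cluster to cluster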
Estimation
Likelihood of the mixture model:
$L(\theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \phi(y_i; \mu_k, \Sigma_k)$
where the $\pi_k$ are the mixing proportions and $\phi$ is the multivariate normal density. Given the number of clusters and the covariance structure, the EM algorithm can be used to maximize it. The Mclust R package is available from CRAN. (A toy EM loop follows below.)
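For intuition, a compact illustrative EM loop for a one-dimensional two-component normal mixture; Mclust handles the multivariate case and the covariance constraints for you:
set.seed(1)
y <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))   # simulated two-component data
p1 <- 0.5; mu <- c(-1, 5); s <- c(1, 1)      # crude starting values
for (it in 1:50) {
  # E-step: posterior probability that each observation came from component 1
  d1 <- p1 * dnorm(y, mu[1], s[1])
  d2 <- (1 - p1) * dnorm(y, mu[2], s[2])
  w  <- d1 / (d1 + d2)
  # M-step: update mixing proportion, means, and standard deviations
  p1 <- mean(w)
  mu <- c(weighted.mean(y, w), weighted.mean(y, 1 - w))
  s  <- c(sqrt(sum(w * (y - mu[1])^2) / sum(w)),
          sqrt(sum((1 - w) * (y - mu[2])^2) / sum(1 - w)))
}
round(c(p1, mu, s), 2)   # should approach 0.5, 0, 4, 1, 1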
Model selection
Which model is appropriate?
• Which covariance structure?
• How many clusters?
Compare the different models using BIC.
Model selection
We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$ respectively. Given the observed data D, define the integrated likelihood
$p(D \mid M_k) = \int p(D \mid \theta_k, M_k) \, p(\theta_k \mid M_k) \, d\theta_k$
the probability of observing the data given model $M_k$. NB: $\theta_1$ and $\theta_2$ might have different dimensions.
Model selection
To compare the two models, use the integrated likelihoods. The integral is difficult to compute! The Bayesian information criterion approximates it:
$\mathrm{BIC}_k = 2 \log p(D \mid \hat\theta_k, M_k) - \nu_k \log N$
where $p(D \mid \hat\theta_k, M_k)$ is the maximized likelihood and $\nu_k$ is the number of parameters in model $M_k$.
Model selection
Bayesian information criterion:
$\mathrm{BIC}_k = \underbrace{2 \log p(D \mid \hat\theta_k, M_k)}_{\text{measure of fit}} - \underbrace{\nu_k \log N}_{\text{penalty term}}$
A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parametrization (Mclust).
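In recent versions of mclust, the EMclust() call used on the next slides has been replaced by Mclust(). A sketch of BIC-based selection with the current interface, assuming cho.data.std is the standardized data used below:
library(mclust)
fit <- Mclust(cho.data.std, G = 2:9, modelNames = c("EII", "EEI"))
plot(fit, what = "BIC")    # BIC trace per model and number of clusters
fit$modelName; fit$G       # parametrization and cluster count with the best BIC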
Example revisited
[Figure: BIC against the number of clusters; panel 1 EII, panel 2 EEI]
library(mclust)
# cho.data.std: standardized version of cho.data (assumed; not defined on the slides)
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)                           # BIC traces for both models
cho.mclust <- EMclust(cho.data.std, 4, "EII")  # EII model with 4 clusters
sum.cho <- summary(cho.mclust, cho.data.std)   # extract the classification
Example revisited
# Expression profiles of the four clusters found by the EII model
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example revisited: expression profiles of clusters 1-4, EII model with 4 clusters [2x2 panel figure]
Example revisited
# EEI model with 3 clusters
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
Example revisited: expression profiles of clusters 1-3, EEI model with 3 clusters [figure]
Summary
• Model-based clustering is a nice alternative to heuristic clustering algorithms
• BIC can be used to choose both the covariance structure and the number of clusters
Conclusion
• We have seen a few clustering algorithms; there are many others:
• Two-way clustering
• The plaid model ...
• Clustering is a useful tool and ... a dangerous weapon
• To be consumed with moderation!