Microarray Data Analysis • Data preprocessing and visualization • Supervised learning • Machine learning approaches • Unsupervised learning • Clustering and pattern detection • Prediction of gene regulatory regions based on co-regulated genes • Linkage between gene expression data and gene sequence/function databases • …
Unsupervised learning • Supervised methods • Can only validate or reject hypotheses • Cannot lead to discovery of unexpected partitions • Unsupervised learning • No prior knowledge is used • Explore structure of data on the basis of correlations and similarities
DEFINITION OF THE CLUSTERING PROBLEM Eytan Domany
CLUSTER ANALYSIS YIELDS DENDROGRAM T (RESOLUTION) Eytan Domany
BUT WHAT ABOUT THE OKAPI? Eytan Domany
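Centroid methods – K-means • Data points at $X_i$, $i = 1, \dots, N$ • Centroids at $Y_\nu$, $\nu = 1, \dots, K$ • Assign data point $i$ to centroid $\nu$: $S_i = \nu$ • Cost $E$: $E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \lVert X_i - Y_{S_i} \rVert^2$ • Minimize $E$ over $S_i$, $Y_\nu$ Eytan Domany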
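As a minimal sketch of the cost E, assuming the standard K-means form (sum of squared Euclidean distances from each data point to its assigned centroid): the function name `kmeans_cost` and the sample points are hypothetical and chosen only for illustration.

```python
def kmeans_cost(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its assigned centroid.

    points     -- list of coordinate tuples (the data points X_i)
    centroids  -- list of coordinate tuples (the centroids Y)
    assignment -- assignment[i] is the index of the centroid assigned to point i (S_i)
    """
    total = 0.0
    for x, s in zip(points, assignment):
        y = centroids[s]
        total += sum((xj - yj) ** 2 for xj, yj in zip(x, y))
    return total


# Hypothetical toy data: two points near the first centroid, one on the second.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centroids = [(0.0, 1.0), (10.0, 0.0)]
assignment = [0, 0, 1]
cost = kmeans_cost(points, centroids, assignment)
```

Each of the first two points lies at distance 1 from its centroid and the third sits exactly on its centroid, so the cost here is 1 + 1 + 0 = 2.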
K-means • “Guess” K=3 Eytan Domany
K-means • Start with random positions of centroids. Iteration = 0 Eytan Domany
K-means • Start with random positions of centroids. • Assign each data point to closest centroid. Iteration = 1 Eytan Domany
K-means • Start with random positions of centroids. • Assign each data point to closest centroid. • Move centroids to center of assigned points Iteration = 2 Eytan Domany
K-means • Start with random positions of centroids. • Assign each data point to closest centroid. • Move centroids to center of assigned points • Iterate till minimal cost Iteration = 3 Eytan Domany
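The four steps can be sketched in plain Python as follows. This is an illustrative sketch, not the presenter's implementation: it assumes Euclidean distance, picks the starting centroids from the data points themselves, and stops when assignments no longer change. The helper names and the sample data are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting centroids (assumption: k distinct data points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments are unchanged; the cost cannot drop further.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # a centroid with no points keeps its old position
                centroids[c] = mean_point(members)
    return centroids, assignment


# Hypothetical data: three loose groups, with the "guess" K = 3.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0, 10), (0, 11), (1, 10)]
centroids, assignment = kmeans(points, k=3)
```

Changing `seed` changes the starting centroids and can change the final partition, which is the dependence on initial positions that K-means is known for.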
K-means - Summary • Fast algorithm: compute distances from data points to centroids • Result depends on initial centroids’ position • Must preset K • Fails for “non-spherical” distributions
Agglomerative Hierarchical Clustering • Initially, each point is its own cluster • At each step, merge the pair of nearest clusters • Need to define the distance between the new cluster and the other clusters • Single Linkage: distance between closest pair • Complete Linkage: distance between farthest pair • Average Linkage: average distance between all pairs, or distance between cluster centers • The dendrogram induces a linear ordering of the data points • [Figure: dendrogram of points 1 to 5; axis: distance between joined clusters] Eytan Domany
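A minimal sketch of agglomerative clustering with the three linkage rules, assuming Euclidean distance between points and using the "average distance between all pairs" variant of average linkage. The function names and the one-dimensional sample points are hypothetical; the returned merge heights are the values a dendrogram would plot.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def linkage_distance(cluster_a, cluster_b, points, method):
    """Distance between two clusters, each given as a tuple of point indices."""
    pair_distances = [
        euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b
    ]
    if method == "single":      # closest pair
        return min(pair_distances)
    if method == "complete":    # farthest pair
        return max(pair_distances)
    if method == "average":     # average over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, method="average"):
    # Initially, each point is its own cluster.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Greedy step: find and merge the pair of nearest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], points, method),
        )
        height = linkage_distance(a, b, points, method)
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b, height))
    return merges


# Hypothetical one-dimensional data, to keep the distances easy to check by hand.
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
single_heights = [h for _, _, h in agglomerate(points, "single")]
complete_heights = [h for _, _, h in agglomerate(points, "complete")]
average_heights = [h for _, _, h in agglomerate(points, "average")]
```

On this data the merge order is the same for all three rules, but the merge heights differ (for example, the second merge happens at 2 for single, 3 for complete, and 2.5 for average linkage), which illustrates how results depend on the distance update method.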
Hierarchical Clustering - Summary • Results depend on distance update method • Greedy iterative process • NOT robust against noise • No inherent measure to identify stable clusters • Average Linkage – the most widely used clustering method in gene expression analysis
Heat map • Breast cancer data (Nature, 2002)
Cluster both genes and samples • Samples should cluster together according to the experimental design • Often a way to catch labelling errors or heterogeneity in samples
Heat map • Correlation coefficient • Normalized across each gene
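One common reading of "normalized across each gene" is a per-gene z-score: subtract the gene's mean over the samples and divide by its standard deviation. The sketch below assumes that reading, which the slide does not spell out. The function name and expression values are hypothetical.

```python
import statistics


def normalize_gene(profile):
    """Rescale one gene's expression values to mean 0 and standard deviation 1."""
    mean = statistics.mean(profile)
    sd = statistics.pstdev(profile)
    if sd == 0:
        # A flat profile carries no pattern; map it to all zeros.
        return [0.0 for _ in profile]
    return [(value - mean) / sd for value in profile]


# Hypothetical profiles: same shape across samples, very different magnitudes.
low_gene = [1.0, 2.0, 3.0]
high_gene = [10.0, 20.0, 30.0]
z_low = normalize_gene(low_gene)
z_high = normalize_gene(high_gene)
```

After normalization the two profiles coincide, so they would receive the same colors in a heat map even though their raw magnitudes differ tenfold.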
Distance Issues • Euclidean distance • Pearson distance • [Figure: expression profiles of genes g1, g2, g3, g4]
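The two distances can disagree about which profiles are "close". The sketch below assumes Pearson distance is defined as 1 minus the Pearson correlation coefficient, a common convention. The three profiles are hypothetical and are not the genes from the figure.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for mirrored shapes."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - covariance / math.sqrt(var_a * var_b)


# Hypothetical expression profiles over four samples.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [11.0, 12.0, 13.0, 14.0]   # same shape as a, shifted upward
profile_c = [4.0, 3.0, 2.0, 1.0]       # similar magnitude to a, opposite trend

euclid_ab = euclidean_distance(profile_a, profile_b)
euclid_ac = euclidean_distance(profile_a, profile_c)
pearson_ab = pearson_distance(profile_a, profile_b)
pearson_ac = pearson_distance(profile_a, profile_c)
```

Euclidean distance puts `profile_a` closer to `profile_c` (similar magnitude), while Pearson distance puts it closer to `profile_b` (same shape), so the choice of metric decides which genes end up clustered together.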
Exercise • Use Average Linkage Algorithm and Manhattan distance.
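As a hedged helper for working through the exercise by hand, the sketch below computes Manhattan distance and the average-linkage distance between two clusters. The points are hypothetical and are not the exercise data.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def average_linkage(cluster_a, cluster_b):
    """Average Manhattan distance over all pairs with one point from each cluster."""
    total = sum(manhattan(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical clusters, each given as a list of points.
cluster_a = [(0, 0), (1, 1)]
cluster_b = [(4, 0), (5, 3)]
distance_ab = average_linkage(cluster_a, cluster_b)
```

The four pairwise distances here are 4, 8, 4, and 6, so the average-linkage distance is 22 / 4 = 5.5.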
Issues in Cluster Analysis • A lot of clustering algorithms • A lot of distance/similarity metrics • Which clustering algorithm runs faster and uses less memory? • How many clusters after all? • Are the clusters stable? • Are the clusters meaningful?
Which Clustering Method Should I Use? • What is the biological question? • Do I have a preconceived notion of how many clusters there should be? • How strict do I want to be? Split or join? • Can a gene be in multiple clusters? • Hard or soft boundaries between clusters?
The End • Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it. • We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have. • We wish you all a wonderful summer break!