Microarray Data Analysis

Microarray Data Analysis • Data preprocessing and visualization • Supervised learning • Machine learning approaches • Unsupervised learning • Clustering and pattern detection • Prediction of gene regulatory regions based on co-regulated genes • Linkage between gene expression data and gene sequence/function databases • …

Unsupervised learning • Supervised methods • Can only validate or reject hypotheses • Cannot lead to the discovery of unexpected partitions • Unsupervised learning • No prior knowledge is used • Explores the structure of the data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM Eytan Domany

CLUSTER ANALYSIS YIELDS DENDROGRAM T (RESOLUTION) Eytan Domany

BUT WHAT ABOUT THE OKAPI? Eytan Domany

Centroid methods – K-means • Data points at $X_i$, $i = 1, \dots, N$ • Centroids at $Y_\nu$, $\nu = 1, \dots, K$ • Assign data point $i$ to centroid $\nu$: $S_i = \nu$ • Cost: $E(S_1, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \lVert X_i - Y_{S_i} \rVert^2$ • Minimize $E$ over $S_i$, $Y_\nu$ Eytan Domany

K-means • “Guess” K=3 Eytan Domany

K-means • Start with random positions of centroids. Iteration = 0 Eytan Domany

K-means • Start with random positions of centroids. • Assign each data point to closest centroid. Iteration = 1 Eytan Domany

K-means • Start with random positions of centroids. • Assign each data point to closest centroid. • Move centroids to center of assigned points Iteration = 2 Eytan Domany

K-means • Start with random positions of centroids. • Assign each data point to closest centroid. • Move centroids to center of assigned points • Iterate till minimal cost Iteration = 3 Eytan Domany

K-means - Summary • Fast algorithm: compute distances from data points to centroids • Result depends on initial centroids’ position • Must preset K • Fails for “non-spherical” distributions
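The iteration described on the preceding slides can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation: the function names `sq_dist` and `kmeans` are hypothetical, and the initial centroids are drawn from the data points themselves (one common way to realize "random positions"). The cost is the sum of squared distances from each point to its assigned centroid.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Hypothetical helper: returns (assignment, centroids, cost)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Step 1: assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so the cost cannot drop further
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, cost


# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids, cost = kmeans(points, k=2)
print(assignment, centroids, cost)
```

Running the sketch with different `seed` values on less cleanly separated data is a quick way to see the dependence on the initial centroid positions noted in the summary.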

Agglomerative Hierarchical Clustering • Initially, each point is its own cluster • At each step, merge the pair of nearest clusters • Need to define the distance between the new cluster and the other clusters • Single Linkage: distance between the closest pair • Complete Linkage: distance between the farthest pair • Average Linkage: average distance between all pairs, or distance between cluster centers • The dendrogram induces a linear ordering of the data points [Figure: five example points labeled 1 to 5 and their dendrogram; the vertical axis is the distance between joined clusters] Eytan Domany

Hierarchical Clustering -Summary • Results depend on distance update method • Greedy iterative process • NOT robust against noise • No inherent measure to identify stable clusters • Average Linkage – the most widely used clustering method in gene expression analysis
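The agglomerative procedure and the three linkage rules can be sketched as below. This is a naive illustration under stated assumptions: the names `euclidean`, `manhattan`, `linkage_distance`, and `agglomerative` are hypothetical, the example points are made up, and "average" here means the average over all cross-cluster pairs (not the cluster-center variant). It recomputes every inter-cluster distance at each step, so it is meant for small examples only.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def linkage_distance(c1, c2, points, dist, method):
    """Distance between two clusters, each a tuple of point indices."""
    d = [dist(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":      # closest pair
        return min(d)
    if method == "complete":    # farthest pair
        return max(d)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerative(points, dist=euclidean, method="average"):
    """Return the merge history as a list of (cluster_a, cluster_b, height)."""
    clusters = [(i,) for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > 1:
        # Greedy step: merge the pair of nearest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], points, dist, method),
        )
        height = linkage_distance(a, b, points, dist, method)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


# Made-up one-dimensional points; heights are the dendrogram join levels.
points = [(0,), (1,), (5,), (7,), (20,)]
for method in ("single", "complete", "average"):
    print(method, [h for _, _, h in agglomerative(points, method=method)])
```

Passing `dist=manhattan` swaps in the Manhattan distance without changing the rest of the procedure. The printed heights differ between linkage rules even on this tiny example, which is the "results depend on distance update method" point from the summary.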

Heat map: breast cancer (Nature, 2002)

Cluster both genes and samples • Samples should cluster together based on experimental design • Often a way to catch labelling errors or heterogeneity in samples

Epinephrine Treated Rat Fibroblast Cell

Correlation coefficient • Heat map • Normalized across each gene

Distance Issues • Euclidean distance • Pearson distance [Figure: expression profiles of genes g1, g2, g3, g4]
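The choice between the two distances matters because they answer different questions: Euclidean distance compares magnitudes, while Pearson distance (taken here as one minus the Pearson correlation coefficient, an assumed but common definition) compares the shape of the profiles. The sketch below uses made-up profiles named `gene_a`, `gene_b`, and `gene_c`; they are not the g1 to g4 profiles from the slide.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)


# Hypothetical expression profiles over four conditions.
gene_a = [1, 2, 3, 4]       # rising, low magnitude
gene_b = [10, 20, 30, 40]   # same rising shape, ten times the magnitude
gene_c = [4, 3, 2, 1]       # falling, same magnitude as gene_a

print("a vs b:", euclidean(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print("a vs c:", euclidean(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

By Euclidean distance `gene_a` is much closer to the oppositely regulated `gene_c` than to the co-regulated `gene_b`; by Pearson distance the ranking is reversed.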

Exercise • Use Average Linkage Algorithm and Manhattan distance.

Exercise

Issues in Cluster Analysis • A lot of clustering algorithms • A lot of distance/similarity metrics • Which clustering algorithm runs faster and uses less memory? • How many clusters after all? • Are the clusters stable? • Are the clusters meaningful?

Which Clustering Method Should I Use? • What is the biological question? • Do I have a preconceived notion of how many clusters there should be? • How strict do I want to be? Split or join? • Can a gene be in multiple clusters? • Hard or soft boundaries between clusters?

The End • Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it. • We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have. • We wish you all a wonderful summer break!
