
Clustering Algorithms

CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel: 6874-6877 Email: yzchen@cz3.nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS. Clustering Algorithms.



  1. CZ5225: Modeling and Simulation in Biology • Lecture 3: Clustering Analysis for Microarray Data I • Prof. Chen Yu Zong • Tel: 6874-6877 • Email: yzchen@cz3.nus.edu.sg • http://xin.cz3.nus.edu.sg • Room 07-24, level 7, SOC1, NUS

  2. Clustering Algorithms • Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it. • Anything will cluster! Garbage in means garbage out.

  3. Supervised vs. Unsupervised Learning • Supervised: there is a teacher; class labels are known (e.g., support vector machines, backpropagation neural networks) • Unsupervised: no teacher; class labels are unknown (e.g., clustering, self-organizing maps)

  4. Gene Expression Data • Gene expression data on p genes (rows) for n mRNA samples (columns) • Entry (i, j) is the expression level of gene i in mRNA sample j, e.g., Log(Red intensity / Green intensity) or Log(Avg. PM - Avg. MM)

     Gene   sample1  sample2  sample3  sample4  sample5  ...
     1        0.46     0.30     0.80     1.51     0.90   ...
     2       -0.10     0.49     0.24     0.06     0.46   ...
     3        0.15     0.74     0.04     0.10     0.20   ...
     4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
     5       -0.06     1.06     1.35     1.09    -1.09   ...

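  5. Expression Vectors • Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types. • Example numeric vector: [1.5, -0.8, 1.8, 0.5, -0.4, -1.3, 1.5, 0.8] • The same vector can be displayed as a numeric vector, a line graph, or a heatmap (color scale from -2 to 2)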

  6. Expression Vectors As Points in ‘Expression Space’ (figure: genes plotted as points on axes Experiment 1, Experiment 2, Experiment 3, with a "Similar Expression" annotation)

     Gene    t1     t2     t3
     G1     -0.8   -0.3   -0.7
     G2     -0.7   -0.8   -0.4
     G3     -0.4   -0.6   -0.8
     G4      0.9    1.2    1.3
     G5      1.3    0.9   -0.6

  7. Cluster Analysis • Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.

  8. How can we do this? • What is closely related? • Distance or similarity metric • What is close? • Clustering algorithm • How do we minimize distance between objects in a group while maximizing distances between groups?

  9. Distance Metrics • Euclidean distance measures the straight-line distance between two points • Manhattan (city block) distance sums the absolute differences in each dimension • Correlation measures difference with respect to linear trends (figure: points (3.5, 4) and (5.5, 6) plotted on axes Gene Expression 1 and Gene Expression 2)

  10. Clustering Gene Expression Data • Cluster across the rows: group together genes that behave similarly across different conditions. • Cluster across the columns: group together conditions that behave similarly across most genes. (figure: matrix of expression measurements, genes indexed by i and conditions by j)

  11. Clustering Time Series Data • Measure gene expression on consecutive days • Gene Measurement matrix • G1= [1.2 4.0 5.0 1.0] • G2= [2.0 2.5 5.5 6.0] • G3= [4.5 3.0 2.5 1.0] • G4= [3.5 1.5 1.2 1.5]

  12. Euclidean Distance • Distance is the square root of the sum of the squared differences between coordinates
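The definition above can be tried on the four time-series vectors G1 to G4 from the previous slide. This is a minimal sketch in plain Python; the function and variable names (`euclidean`, `genes`, `distances`) are chosen here for illustration and do not come from the lecture.

```python
import math
from itertools import combinations

# Gene measurement vectors from the time-series slide.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance for every unordered pair of genes.
distances = {
    (g, h): euclidean(genes[g], genes[h]) for g, h in combinations(genes, 2)
}

for (g, h), d in distances.items():
    print(f"{g}-{h}: {d:.2f}")
```

Under this metric G3 and G4 come out as the closest pair, at a distance of about 2.28.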

  13. City Block or Manhattan Distance • G1= [1.2 4.0 5.0 1.0] • G2= [2.0 2.5 5.5 6.0] • G3= [4.5 3.0 2.5 1.0] • G4= [3.5 1.5 1.2 1.5] • Distance is the sum of the absolute differences between coordinates
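A short sketch of the city block calculation on the same four vectors follows. The helper name `manhattan` and the dictionary layout are illustrative assumptions, not part of the slides.

```python
from itertools import combinations

# Gene measurement vectors listed on this slide.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Distance for every unordered pair of genes.
distances = {
    (g, h): manhattan(genes[g], genes[h]) for g, h in combinations(genes, 2)
}

for (g, h), d in distances.items():
    print(f"{g}-{h}: {d:.1f}")
```

For example, the G1-G2 distance is 0.8 + 1.5 + 0.5 + 5.0 = 7.8, and G3-G4 is again the closest pair at 4.3.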

  14. Correlation Distance • Pearson correlation measures the degree of linear relationship between variables, with values in [-1, 1] • Distance is 1 - (Pearson correlation), with a range of [0, 2]
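The following sketch computes the correlation distance in plain Python and checks the two ends of its range. The function names are illustrative, and the rescaled and negated copies of G1 are constructed here only to show the extremes.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """1 - Pearson correlation, so values fall in [0, 2]."""
    return 1.0 - pearson(x, y)


g1 = [1.2, 4.0, 5.0, 1.0]
g3 = [4.5, 3.0, 2.5, 1.0]

# A rescaled, shifted copy has the same linear trend: distance near 0.
same_trend = correlation_distance(g1, [2 * v + 1 for v in g1])
# A negated copy has the opposite trend: distance near 2.
opposite_trend = correlation_distance(g1, [-v for v in g1])
print(same_trend, opposite_trend, correlation_distance(g1, g3))
```

Note that the distance ignores the magnitude of the profiles and responds only to their shape.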

  15. Similarity Measurements • Pearson correlation between two profiles (vectors): +1 ≥ Pearson correlation ≥ -1
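For reference, the standard definition of the Pearson correlation between two profiles can be written as below. The symbols are notation assumed here, since the slide's own formula did not survive in the transcript: x and y are the two profiles, n is the number of conditions, and a bar denotes the mean over conditions.

```latex
r(x, y) =
  \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
        \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r(x, y) \le +1
```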

  16. Similarity Measurements • Cosine correlation: +1 ≥ cosine correlation ≥ -1
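A small sketch may help contrast the two similarity measures. It assumes the usual definition of cosine correlation (dot product divided by the product of vector lengths, without mean-centering); the function names and the shifted profile are illustrative.

```python
import math


def cosine(x, y):
    """Uncentered correlation: cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centered profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])


g1 = [1.2, 4.0, 5.0, 1.0]
shifted = [v + 10.0 for v in g1]   # same shape, constant offset
scaled = [2.0 * v for v in g1]     # same direction, larger magnitude

print(pearson(g1, shifted), cosine(g1, shifted), cosine(g1, scaled))
```

Pearson correlation is unchanged by the constant offset, while the cosine measure drops below 1; both are unchanged by positive rescaling.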

  17. Hierarchical Clustering • IDEA: Iteratively combine genes into groups based on similar patterns of observed expression • By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships. • Display the data as a heatmap and dendrogram • Cluster genes, samples or both (HCL-1)

  18. Hierarchical Clustering (figures: Venn diagram of clustered data; dendrogram)

  19. Hierarchical clustering • Merging (agglomerative): start with every measurement as a separate cluster then combine • Splitting: make one large cluster, then split up into smaller pieces • What is the distance between two clusters?

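  20. Distance between clusters • Single link: distance is the shortest distance from any member of one cluster to any member of the other cluster • Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster • Average: distance is the average of the distances between every member of one cluster and every member of the other (taking the distance between the cluster averages instead gives the related centroid method) • Ward: merges the pair of clusters whose combination gives the smallest increase in the within-cluster sum of squares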
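The first three linkage rules can be sketched directly from their definitions. The example below uses Euclidean distance on the G1 to G4 vectors from the time-series slide; the grouping into {G1, G2} and {G3, G4} is a hypothetical choice made only to have two clusters to compare, and all function names are illustrative.

```python
import math

genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [euclidean(genes[a], genes[b]) for a in c1 for b in c2]


def single_link(c1, c2):
    return min(pairwise(c1, c2))      # closest pair of members


def complete_link(c1, c2):
    return max(pairwise(c1, c2))      # farthest pair of members


def average_link(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)            # mean over all member pairs


# Hypothetical grouping, chosen only for illustration.
left, right = ["G1", "G2"], ["G3", "G4"]
print(single_link(left, right), average_link(left, right),
      complete_link(left, right))
```

For any two clusters the single-link distance is never larger than the average-link distance, which in turn is never larger than the complete-link distance.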

  21. Hierarchical Clustering: Merging • Euclidean distance • Average linking (figure axes: gene expression time series; distance between clusters when combined)
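The merging procedure with Euclidean distance and average linking can be sketched in a few lines of plain Python. This is a naive illustration on the G1 to G4 time-series vectors, not the implementation used to draw the slide's figure; the function names and the tuple representation of clusters are assumptions made here.

```python
import math
from itertools import combinations

genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2):
    """Mean Euclidean distance over all cross-cluster member pairs."""
    total = sum(euclidean(genes[a], genes[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(names):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(name,) for name in names]   # start: one cluster per gene
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(*pair))
        merges.append((a, b, average_linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = agglomerate(sorted(genes))
for a, b, height in merges:
    print(f"merge {a} + {b} at distance {height:.2f}")
```

On these four vectors the sketch merges G3 with G4 first, then adds G1, and joins G2 last. The recorded merge distances are the heights at which branches would join in a dendrogram.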

  22. Manhattan Distance • Average linking (figure axes: gene expression time series; distance between clusters when combined)

  23. Correlation Distance

  24. Data Standardization • Data points are normalized with respect to mean and variance, “sphering” the data • After sphering, Euclidean and correlation distance are equivalent • Standardization makes sense if you are not interested in the size of the effects, but in the effect itself • Results are misleading for noisy data
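The claimed equivalence can be checked numerically. The sketch below assumes that each gene profile is standardized across its conditions to mean 0 and variance 1 (using the population standard deviation); under that assumption the squared Euclidean distance between two standardized profiles equals 2n(1 - r), where r is their Pearson correlation and n is the number of conditions. The function names are illustrative.

```python
from statistics import mean, pstdev


def standardize(v):
    """Shift to mean 0 and scale to unit (population) variance."""
    m, s = mean(v), pstdev(v)
    return [(x - m) / s for x in v]


def pearson(x, y):
    """Pearson correlation as the mean product of standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


g1 = [1.2, 4.0, 5.0, 1.0]
g2 = [2.0, 2.5, 5.5, 6.0]

z1, z2 = standardize(g1), standardize(g2)
n = len(g1)
sq_euclid = sum((a - b) ** 2 for a, b in zip(z1, z2))
r = pearson(g1, g2)

# Difference between the two sides of the identity; should be ~0.
identity_gap = abs(sq_euclid - 2 * n * (1 - r))
print(sq_euclid, 2 * n * (1 - r), identity_gap)
```

Because the squared Euclidean distance is a fixed multiple of the correlation distance after standardization, both metrics rank pairs of genes in the same order.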

  25. Distance Comments • Every clustering method is based SOLELY on the measure of distance or similarity • E.G. Correlation: measures linear association between two genes • What if data are not properly transformed? • What about outliers? • What about saturation effects? • Even good data can be ruined with the wrong choice of distance metric

  26-35. Hierarchical Clustering, Single Linkage (figure sequence over items A, B, C, D): initial data items with their distance matrix, then the current clusters and distance matrix at each step (the values 2, 3, and 10 appear on successive slides), then the final result.

  36-43. Hierarchical Clustering (figure sequence over Gene 1 through Gene 8)

  44. Hierarchical Clustering (figure with scale labels H and L)

  45. Hierarchical Clustering: The Leaf Ordering Problem • Find the ‘optimal’ layout of branches for a given dendrogram architecture • There are 2^(N-1) possible orderings of the branches for N leaves • For a small microarray dataset of 500 genes, there are about 1.6 × 10^150 branch configurations (figure axes: Samples, Genes)
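The count quoted above follows from the fact that a binary dendrogram with N leaves has N - 1 internal nodes, each of which can be flipped independently, giving 2^(N-1) orderings. A quick check with Python's arbitrary-precision integers (the function name is illustrative):

```python
def leaf_orderings(n_leaves):
    """Number of branch orderings: one binary flip per internal node."""
    return 2 ** (n_leaves - 1)


count = leaf_orderings(500)
print(f"{count:.1e}")   # scientific notation for the 500-gene case
```

The printed value agrees with the figure of roughly 1.6 × 10^150 given on the slide.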

  46. Hierarchical Clustering: The Leaf Ordering Problem (figure)

  47. Hierarchical Clustering • Pros: • Commonly used algorithm • Simple and quick to calculate • Cons: • Real genes probably do not have a hierarchical organization

  48. Using Hierarchical Clustering • Choose what samples and genes to use in your analysis • Choose similarity/distance metric • Choose clustering direction • Choose linkage method • Calculate the dendrogram • Choose height/number of clusters for interpretation • Assess results • Interpret cluster structure
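The step of choosing a height to obtain clusters can be sketched as follows. The merge history below is a hypothetical, hand-written example (gene names and heights are illustrative values, not results from the lecture); each entry joins the clusters containing the two named genes at the given height.

```python
# Hypothetical merge history: (member of one cluster, member of the other,
# height at which the two clusters join). Values are illustrative only.
merges = [("G3", "G4", 2.3), ("G1", "G3", 4.7), ("G2", "G1", 6.0)]


def cut_tree(leaves, merges, height):
    """Apply only the merges at or below `height`; return the clusters."""
    parent = {leaf: leaf for leaf in leaves}

    def find(x):
        # Follow parent links up to the representative of x's cluster.
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b, h in merges:
        if h <= height:
            parent[find(a)] = find(b)

    groups = {}
    for leaf in leaves:
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(sorted(group) for group in groups.values())


leaves = ["G1", "G2", "G3", "G4"]
clusters = cut_tree(leaves, merges, height=5.0)
print(clusters)
```

Lowering the cut height yields more, smaller clusters; raising it above the last merge height returns a single cluster containing every gene.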

  49. Choose what samples/genes to include • Very important step • Do you want to include housekeeping genes or genes that didn’t change in your results? • How do you handle replicates from the same sample? • Noisy samples? • Dendrogram is a mess if everything is included in large datasets • Gene screening
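One common form of gene screening is to drop genes whose expression barely changes across samples. The sketch below applies a variance filter to the five example genes from the gene expression data slide; the threshold of 0.1 and the names `expression`, `screen_genes`, and `min_variance` are hypothetical choices for illustration, and the lecture does not specify which screening rule it uses.

```python
from statistics import pvariance

# First five samples for the five genes shown on the data slide.
expression = {
    "gene1": [0.46, 0.30, 0.80, 1.51, 0.90],
    "gene2": [-0.10, 0.49, 0.24, 0.06, 0.46],
    "gene3": [0.15, 0.74, 0.04, 0.10, 0.20],
    "gene4": [-0.45, -1.03, -0.79, -0.56, -0.32],
    "gene5": [-0.06, 1.06, 1.35, 1.09, -1.09],
}


def screen_genes(data, min_variance):
    """Keep genes whose variance across samples reaches the threshold."""
    return [name for name, values in data.items()
            if pvariance(values) >= min_variance]


# Hypothetical threshold; in practice it depends on the dataset.
kept = screen_genes(expression, min_variance=0.1)
print(kept)
```

With this threshold only gene1 and gene5 are retained; the other three genes vary too little across the five samples to pass the filter.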

  50. No Filtering
