
Cluster analysis  Function

Cluster analysis  Function. Places genes with similar expression patterns in groups. Sometimes genes of unknown function will be grouped with genes of known function. The functions that are known allow the investigator to hypothesize regarding the functions of genes not yet characterized.


Presentation Transcript


  1. Cluster analysis  Function • Places genes with similar expression patterns in groups. • Sometimes genes of unknown function will be grouped with genes of known function. • The functions that are known allow the investigator to hypothesize regarding the functions of genes not yet characterized. • Examples: • Identify genes important in cell cycle regulation • Identify genes that participate in a biosynthetic pathway • Identify genes involved in a drug response • Identify genes involved in a disease response

  2. FUNCTIONAL GENOMICS CLUSTER ANALYSIS OF MICROARRAY DATA Cluster analysis software is developed to group genes with similar patterns of expression. In this example, the columns represent different timepoints, and the rows represent the results for a single gene. The products of the genes expressed in a single cluster may have related or similar functions. (11)

  3. CLUSTER ANALYSIS

  4. OUTLINE OF TALK

  5. Clustering - Goal • Partition of the genes in the dataset into distinct sets (clusters), according to similarity in their expression profiles across the probed conditions

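  6. MATRIX(genes, conditions) = Expression dataset • Rows are genes; columns are conditions (timepoints or tissues):
     $$\begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn} \end{pmatrix}$$
     • The first gene vector = (x11, x12, x13, x14 … x1n) • The leftmost condition vector = (x11, x21, x31 … xm1)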
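
The row and column views of the expression matrix can be illustrated with a small sketch. The values and the helper names `gene_vector` and `condition_vector` are made up for illustration and do not come from the presentation.

```python
# Hypothetical toy dataset: 3 genes (rows) x 4 conditions (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],   # gene 1: x11 .. x14
    [2.0, 2.5, 2.0, 1.5],   # gene 2: x21 .. x24
    [0.5, 1.0, 1.5, 2.0],   # gene 3: x31 .. x34
]

def gene_vector(matrix, i):
    """Row i: the expression profile of one gene across all conditions."""
    return list(matrix[i])

def condition_vector(matrix, j):
    """Column j: the expression of every gene in one condition."""
    return [row[j] for row in matrix]

first_gene = gene_vector(expression, 0)
leftmost_condition = condition_vector(expression, 0)
```

Clustering genes compares rows with each other; clustering conditions would compare columns instead.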

  7. Clustering yeast cell cycle dataset VS gene tree ordering

  8. Clustering – why? • Reduce the dimensionality of the problem – identify the major patterns in the dataset • Similar expression profiles suggest functional relationship • Functional annotation of ESTs • Links among pathways • Related functions suggest coordinated regulatory control • Dissection of regulatory networks

  9. Similarity measures • Clustering identifies groups of genes with “similar” expression profiles • How is similarity measured? • Euclidean distance • Correlation coefficient • Others

  10. In an experiment with 10 conditions, the gene expression profiles for two genes X and Y would have this form: X = (x1, x2, x3, …, x10) Y = (y1, y2, y3, …, y10)

  11. Similarity measure - Euclidean distance In general, if there are M experiments: X = (x1, x2, x3, …, xM) Y = (y1, y2, y3, …, yM)
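
The formula on this slide did not survive in the transcript. The standard definition, which is presumably what was shown, is D(X, Y) = sqrt(sum over i of (x_i - y_i)^2), with i running over the M experiments. A minimal sketch, where the function name and the sample profiles are illustrative:

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences over all experiments."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical profiles measured in M = 3 experiments.
gene_x = [1.0, 2.0, 3.0]
gene_y = [4.0, 6.0, 3.0]
d = euclidean_distance(gene_x, gene_y)   # sqrt(9 + 16 + 0)
```

A distance of 0 means identical profiles; larger values mean less similar genes.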

  12. Similarity measure – Correlation Coefficient X = (x1, x2, x3, …, xm) Y = (y1, y2, y3, …, ym) -1 ≤ S(X,Y) ≤ 1
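
The formula itself is missing from the transcript. Given the stated range of -1 to 1, S(X, Y) is most likely the Pearson correlation coefficient, and the sketch below assumes this. The function name and sample profiles are illustrative.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two expression profiles."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

rising = [1.0, 2.0, 3.0, 4.0]
falling = [8.0, 6.0, 4.0, 2.0]
s_same = pearson_correlation(rising, rising)       # identical trend
s_opposite = pearson_correlation(rising, falling)  # mirrored trend
```

Values near 1 indicate profiles that rise and fall together, values near -1 indicate opposite trends, and values near 0 indicate no linear relationship.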

  13. Euclidean vs Correlation • Euclidean distance – takes into account the magnitude of the expression • Correlation distance – insensitive to the amplitude of expression; takes into account the trends of the change • Common trends are considered biologically relevant, while the magnitude is considered less important → correlation (Figure: expression profiles of Gene X and Gene Y)
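
The contrast can be made concrete with three made-up profiles. This sketch assumes that "correlation distance" means 1 minus the Pearson correlation, a common convention that the slide does not spell out.

```python
import math

def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    r = sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))
    return 1.0 - r

gene_x = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_y = [10.0, 20.0, 30.0, 20.0, 10.0]  # same trend as X, ten times the amplitude
gene_z = [3.0, 2.0, 1.0, 2.0, 3.0]       # similar magnitude to X, opposite trend

euclid_xy = math.dist(gene_x, gene_y)
euclid_xz = math.dist(gene_x, gene_z)
corr_xy = correlation_distance(gene_x, gene_y)
corr_xz = correlation_distance(gene_x, gene_z)
```

Euclidean distance places X closer to Z than to Y, while correlation distance treats X and Y as essentially identical and X and Z as opposites.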

  14. What Euclidean distance sees • What correlation distance sees

  15. PCA • A technique for projecting the expression dataset onto a reduced (2- or 3-dimensional), easily visualized space • Dataset: thousands of genes probed in 5 conditions • The expression profile of each gene is represented by the vector of its expression levels: X = (X1, X2, X3, X4, X5) • Imagine each gene X as a point in a 5-dimensional space • Each direction/axis corresponds to a specific condition • Genes with similar profiles are close to each other in this space • PCA – project this dataset to 2 dimensions, preserving as much information as possible
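
The projection can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the method used by any tool named in the talk: it centers each condition, builds the covariance matrix, and finds the top two principal components by power iteration with deflation. All function names and the six 5-condition profiles are made up.

```python
import math
import random

def center_columns(data):
    """Subtract each condition's mean so every column has mean zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance_matrix(centered):
    """Sample covariance between conditions (columns)."""
    n, dims = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(dims)] for a in range(dims)]

def leading_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    dims = len(matrix)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(dims)]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm < 1e-12:          # no variance left in any direction
            break
        vec = [v / norm for v in nxt]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    value = sum(vec[a] * matrix[a][b] * vec[b]
                for a in range(dims) for b in range(dims))
    return vec, value

def pca_project(data, n_components=2):
    """Project each gene onto the first n_components principal axes."""
    centered = center_columns(data)
    cov = covariance_matrix(centered)
    dims = len(cov)
    components = []
    for _ in range(n_components):
        vec, value = leading_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(dims)]
               for a in range(dims)]
    return [[sum(r * c for r, c in zip(row, comp)) for comp in components]
            for row in centered]

# Hypothetical genes: three rising and three falling profiles over 5 conditions.
genes = [
    [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [5, 4, 3, 2, 1],
    [6, 5, 4, 3, 2], [1, 2, 3, 4, 6], [5, 4, 3, 2, 0],
]
projected = pca_project(genes)
```

In this toy dataset the rising and falling genes land on opposite sides of the first component, which is the kind of separation a 2-D PCA plot makes visible.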

  16. PCA transformation of a microarray dataset Visual estimation of the number of clusters in the data

  17. Clustering Algorithms • K–means • SOMs • Hierarchical clustering

  18. K-MEANS • The user sets the number of clusters, k • Initialization: each gene is randomly assigned to one of the k clusters • Average expression vector is calculated for each cluster (cluster’s profile) • Iterate over the genes: • For each gene, compute its similarity to the cluster profiles. • Move the gene to the cluster it is most similar to. • Recalculate cluster profiles. • Score current partition: sum of distances between genes and the profile of the cluster they are assigned to (homogeneity of the solution). • Stop criterion: further shuffling of genes results in only minor improvement in the clustering score
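
The steps above can be sketched as follows. This is a simplified batch variant under stated assumptions: all genes are reassigned before the profiles are recalculated, Euclidean distance is the dissimilarity, and an empty cluster is re-seeded with a random gene. None of these choices is specified on the slide, and the names `k_means` and `mean_profile` are illustrative.

```python
import math
import random

def mean_profile(profiles):
    """Average expression vector of a set of genes (the cluster's profile)."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def k_means(genes, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Initialization: each gene is randomly assigned to one of the k clusters.
    labels = [rng.randrange(k) for _ in genes]
    previous_score = math.inf
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random gene.
            centers.append(mean_profile(members) if members
                           else list(rng.choice(genes)))
        # Move every gene to the cluster whose profile it is closest to.
        labels = [min(range(k), key=lambda c: math.dist(g, centers[c]))
                  for g in genes]
        # Score: sum of distances between genes and their cluster profile.
        score = sum(math.dist(g, centers[lab]) for g, lab in zip(genes, labels))
        if previous_score - score < tol:   # no meaningful improvement: stop
            break
        previous_score = score
    return labels, centers, score

# Hypothetical data: two well-separated groups of three genes each.
genes = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
         [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
labels, centers, score = k_means(genes, k=2)
```

Because the starting assignment is random, results on real data can differ between runs; trying several seeds and keeping the best-scoring partition is a common precaution.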

  19. How to choose the number of clusters needed to informatively partition the data • Try several parameters (number of desired clusters, distance metric) and compare the clustering solutions • Criteria for comparison: Homogeneity vs Separation • Use PCA (Principal Component Analysis)

  20. K-MEANS example: 4 clusters (figure shows the mean profile of each cluster and the standard deviation in each condition)

  21. Evaluating K-means (figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, with mis-classified genes marked)

  22. K-means example: 3 clusters

  23. Too few clusters: K=2

  24. SOMs (Self-Organizing Maps) • User sets the number of clusters in a form of a rectangular grid (e.g., 3x2) – ‘map nodes’ • Imagine genes as points in (M-dimensional) space • Initialization: map nodes are randomly placed in the data space

  25. Genes – data points Clusters – map nodes

  26. SOM - Scheme • Randomly choose a data point (gene). • Find its closest map node • Move this map node towards the data point • Move the neighbor map nodes towards this point, but to a lesser extent (thinner arrows show weaker shift) • Iterate over data points

  27. Each successive gene profile (black dot) has less of an influence on the displacement of the nodes. • Iterate through all profiles several times (10-100) • When positions of the cluster nodes have stabilized, assign each gene to its closest map node (cluster)
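
The SOM scheme in the preceding slides can be sketched in a few lines. The Gaussian neighborhood function, the linearly decaying learning rate, the default parameter values, and the function names are all assumptions for illustration. The slides only say that neighbors move less than the winning node and that later profiles have less influence.

```python
import math
import random

def train_som(genes, rows=3, cols=2, epochs=50,
              learning_rate=0.5, radius=1.0, seed=0):
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*genes)]
    highs = [max(col) for col in zip(*genes)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialization: map nodes are placed at random inside the data space.
    nodes = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in grid]
    total_steps = epochs * len(genes)
    step = 0
    for _ in range(epochs):
        order = list(genes)
        rng.shuffle(order)                      # visit genes in random order
        for gene in order:
            decay = 1.0 - step / total_steps    # later genes shift nodes less
            step += 1
            winner = min(range(len(nodes)),
                         key=lambda i: math.dist(gene, nodes[i]))
            for i, pos in enumerate(grid):
                # Nodes far from the winner on the grid receive a weaker shift.
                grid_dist = math.dist(pos, grid[winner])
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                rate = learning_rate * decay * influence
                nodes[i] = [n + rate * (g - n) for n, g in zip(nodes[i], gene)]
    return nodes

def assign_to_nodes(genes, nodes):
    """Assign each gene to its closest map node (its cluster)."""
    return [min(range(len(nodes)), key=lambda i: math.dist(g, nodes[i]))
            for g in genes]

# Hypothetical 2-condition profiles, mapped onto a 3x2 grid of nodes.
genes = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
         [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
nodes = train_som(genes, rows=3, cols=2)
labels = assign_to_nodes(genes, nodes)
```

Some map nodes may end up with no genes assigned, so the grid size is an upper bound on the number of clusters rather than an exact count.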

  28. Hierarchical Clustering • Goal #1: Organize the genes in a structure of a hierarchical tree • 1) Initial step: each gene is regarded as a cluster with one item • 2) Find the 2 most similar clusters and merge them into a common node (red dot) • 3) Merge successive nodes until all genes are contained in a single cluster • Goal #2: Collapse branches to group genes into distinct clusters • Figure: tree over genes g1–g5 with nodes {1,2}, {1,2,3}, {4,5} and {1,2,3,4,5}
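
A minimal sketch of the merging procedure follows. It assumes Euclidean distance and average linkage, where the distance between two clusters is the mean distance over all cross-cluster gene pairs. The slide does not say how similarity between clusters is measured. The five toy profiles are chosen so that the merges mirror the tree in the figure.

```python
import math

def average_linkage(a, b, genes):
    """Mean pairwise distance between the genes of two clusters."""
    return sum(math.dist(genes[i], genes[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(genes, n_clusters=1):
    # Initial step: each gene is a cluster with one item.
    clusters = [[i] for i in range(len(genes))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], genes)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the two most similar clusters into a common node.
        merges.append((clusters[a], clusters[b], d))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical profiles for g1..g5 (indices 0..4).
genes = [[1.0, 1.0], [1.1, 1.0], [2.0, 2.0], [6.0, 6.0], [6.5, 6.0]]
full_tree, all_merges = agglomerate(genes, n_clusters=1)   # Goal #1: whole tree
two_clusters, _ = agglomerate(genes, n_clusters=2)         # Goal #2: cut the tree
```

Stopping the merging at `n_clusters=2` is one simple way to collapse branches; cutting the tree at a distance threshold is another.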

  29. Mathematical evaluation of clustering solution Merits of a ‘good’ clustering solution: • Homogeneity: • Genes inside a cluster are highly similar to each other. • Average similarity between a gene and the center (average profile) of its cluster. • Separation: • Genes from different clusters have low similarity to each other. • Weighted average similarity between centers of clusters. • These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation
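
Both scores can be computed directly from a labeled dataset. The sketch below assumes Pearson correlation as the similarity and weights each pair of cluster centers by the product of the two cluster sizes. The slide says only "weighted average", so the weighting is an assumption, as are the function names and toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant profile (an assumption)."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def homogeneity_and_separation(genes, labels):
    clusters = {}
    for gene, lab in zip(genes, labels):
        clusters.setdefault(lab, []).append(gene)
    centers = {lab: [sum(col) / len(members) for col in zip(*members)]
               for lab, members in clusters.items()}
    # Homogeneity: average similarity between a gene and its cluster center.
    homogeneity = sum(pearson(g, centers[lab])
                      for g, lab in zip(genes, labels)) / len(genes)
    # Separation: size-weighted average similarity between cluster centers.
    weighted_sum, total_weight = 0.0, 0
    keys = sorted(clusters)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            weight = len(clusters[a]) * len(clusters[b])
            weighted_sum += weight * pearson(centers[a], centers[b])
            total_weight += weight
    separation = weighted_sum / total_weight if total_weight else float("nan")
    return homogeneity, separation

# Hypothetical data: two rising and two falling genes.
genes = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 2.5]]
h_good, s_good = homogeneity_and_separation(genes, [0, 0, 1, 1])
h_bad, s_bad = homogeneity_and_separation(genes, [0, 1, 0, 1])
```

Because both scores are similarities here, a good solution has high homogeneity and a low separation value, meaning the cluster centers resemble each other little.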

  30. Performance on Yeast Cell Cycle Data • 698 genes, 72 conditions (Spellman et al. 1998). Each algorithm was run by its authors in a “blind” test. • Figure: Homogeneity and Separation for CAST*, CLICK, GeneCluster, K-means and the “True” solution. *Ben-Dor, Shamir, Yakhini 1999

  31. Overall strategy: PCA-transformation Clustering and evaluation of clustering Check for bio-significance

  32. Which genes to cluster? • Apply filtering prior to clustering – focus the analysis on the ‘responding genes’ • Applying controlled statistical tests to identify ‘responding genes’ usually ends up with too few genes to allow a global characterization of the response. • Fold change: choose genes that changed by at least M-fold in at least L conditions • Variance: filter out genes that do not vary greatly among the conditions of the experiment. • Try various filtering schemes to find the setting that gives the best biological results
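
The two filters can be sketched as below. This assumes each value is an expression ratio relative to a reference sample, so an M-fold change means a ratio of at least M or at most 1/M. The gene names, values, and thresholds are made up.

```python
import statistics

def passes_fold_change(ratios, min_fold=2.0, min_conditions=2):
    """True if the gene changed at least min_fold-fold in enough conditions."""
    changed = sum(1 for r in ratios if r >= min_fold or r <= 1.0 / min_fold)
    return changed >= min_conditions

def passes_variance(profile, min_variance):
    """True if the gene varies enough across the conditions."""
    return statistics.pvariance(profile) >= min_variance

# Hypothetical expression ratios (treated / reference) in 4 conditions.
ratios = {
    "geneA": [1.0, 2.5, 3.0, 1.1],   # induced in two conditions
    "geneB": [1.0, 1.1, 0.9, 1.2],   # essentially unchanged
    "geneC": [0.4, 0.3, 1.0, 1.0],   # repressed in two conditions
}
responding = [name for name, r in ratios.items() if passes_fold_change(r)]
variable = [name for name, r in ratios.items() if passes_variance(r, 0.05)]
```

Raising `min_fold` or `min_conditions` shrinks the gene list, which is the setting the slide suggests tuning against the biological results.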

  33. Clustering – Tools • Cluster (Eisen) – hierarchical • GeneCluster (Tamayo) – SOM • TIGR MeV – K-Means, SOM, hierarchical, QTC, CAST • Expander – CLICK, SOM, K-means, hierarchical • Many others (e.g. GeneSpring)

  34. CSLA Workshop, Day2 Presentation created by Rani Elkon and posted at: http://www.tau.ac.il/lifesci/bioinfo/teaching/2002-2003/DNA_microarray_winter_2003.html

  35. Ascribing Biological Meaning to Clusters • Identify over-represented functional categories in the clusters (i.e., a cluster contains many more genes of a specific biological process than expected by chance) • Requirements for systematic analysis: • Controlled vocabulary for describing biological processes (protein biosynthesis\translation, apoptosis\programmed cell death) • Standard assignment of genes into functional categories

  36. Gene Ontology (GO) project • Purpose: 1) Define controlled terms (ontologies) for description of gene products from 3 aspects: • Biological process (DNA repair, mitosis) • Molecular function (protein serine/threonine kinase activity, transcription factor activity) • Cellular component (nucleus, ribosome) 2) Establish a unified framework for organism-independent gene annotation • Characteristics: 1) A gene can have multiple associations in each ontology 2) GO terms are organized in hierarchical structures called directed acyclic graphs (DAGs) - The most general classifications are at top levels of the graph - More specialized classifications at lower levels

  37. Hierarchical classification scheme for proteins that function in M-phase of mitosis Each gene can be a member of more than one GO classification

  38. Online Databases that annotate genes by GO • Human • Entrez http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene • GOA http://www.ebi.ac.uk/GOA/ • Mouse – Mouse Genome Informatics (MGI) • http://www.informatics.jax.org/ • Rat – Rat Genome Database • http://rgd.mcw.edu/ • Fly – FlyBase • http://flybase.bio.indiana.edu/ • Arabidopsis – TAIR • http://www.arabidopsis.org/ • Yeast – Saccharomyces Genome Database • http://www.yeastgenome.org/ • Affymetrix chips – Netaffx • http://www.affymetrix.com

  39. Example: Cluster 3, 95 genes

  40. Identifying enriched GO categories in clusters • In the previous example: • Total number of chip’s genes with annotation = 5000 • Total number of chip’s genes associated with metabolism GO category = 3,600 (72%) • Number of annotated genes in cluster 3 = 73 • Number of metabolic genes in cluster 3 = 50 (68%) • Statistical tests are essential to determine whether enrichment of a certain class of proteins is significant
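
The slide does not name a specific test. One commonly used choice is a one-sided hypergeometric test, which asks how likely it is to draw at least the observed number of category genes when sampling the cluster at random from the annotated chip. The sketch below assumes that test, and the function name is illustrative.

```python
from math import comb

def enrichment_p_value(population, category_total, sample, category_in_sample):
    """P(X >= category_in_sample) when `sample` genes are drawn at random
    from `population` genes, of which `category_total` are in the category."""
    upper = min(sample, category_total)
    favourable = sum(
        comb(category_total, i) * comb(population - category_total, sample - i)
        for i in range(category_in_sample, upper + 1)
    )
    return favourable / comb(population, sample)

# Numbers from the slide: 5000 annotated genes on the chip, 3600 of them
# metabolic; cluster 3 has 73 annotated genes, 50 of them metabolic.
p = enrichment_p_value(5000, 3600, 73, 50)
```

With these numbers the cluster's 68% metabolic genes is slightly below the chip-wide 72%, so the p-value comes out above 0.5 and gives no evidence of enrichment, even though metabolic genes form the majority of the cluster.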

  41. Acknowledgements • SOM figures in this presentation were taken from a presentation by Benedikt Brors
