
Unsupervised Analysis

Learn how to find groups of genes with correlated expression profiles and divide conditions based on gene expression using clustering methods like K-means and Fuzzy K-means. Explore the applications and challenges of gene clustering and two-way clustering.


Presentation Transcript


  1. Unsupervised Analysis • Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated. • Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression. Clustering Methods

  2. K-means: The Algorithm • Given a set of numeric points in d-dimensional space and an integer k • The algorithm generates k (or fewer) clusters as follows: • 1. Assign all points to a cluster at random • 2. Compute the centroid of each cluster • 3. Reassign each point to its nearest centroid • 4. If any centroid changed, go back to step 2 (a sketch of this loop follows below)
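
A minimal NumPy sketch of the loop above (function and variable names are my own, not from the slides):

import numpy as np

def kmeans(points, k, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    n = len(points)
    # 1. assign all points to a cluster at random
    labels = rng.integers(0, k, size=n)
    centroids = None
    for _ in range(max_iter):
        # 2. compute the centroid of each cluster (reseed empty clusters)
        centroids = np.array([points[labels == i].mean(axis=0)
                              if np.any(labels == i)
                              else points[rng.integers(n)]
                              for i in range(k)])
        # 3. reassign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4. stop once the assignments (and hence centroids) no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids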

  3. K-means: Example, k = 3 • Step 1: Make random assignments and compute centroids (big dots) • Step 2: Assign points to the nearest centroids • Step 3: Re-compute centroids (in this example, the solution is now stable)

  4. Fuzzy K-means • The clusters produced by the K-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster. • The fuzzy K-means procedure allows each feature vector x to have a degree of membership in each cluster i.

  5. Fuzzy K-means Algorithm • Make initial guesses for the means m1, m2, ..., mk • Until there are no changes in any mean: • Use the estimated means to find the degree of membership u(j,i) of xj in cluster i; for example, if dist(j,i) = exp(-||xj - mi||^2), one might use u(j,i) = dist(j,i) / Σi dist(j,i), so that the memberships of each xj sum to 1 over the k clusters • For i from 1 to k • Replace mi with the fuzzy (membership-weighted) mean of all of the examples for cluster i • end_for • end_until
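
A sketch of the same loop in NumPy, using the exponential weights from the slide; normalizing the memberships over the clusters is my reading of the formula above:

import numpy as np

def fuzzy_kmeans(points, k, max_iter=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    # initial guesses for the means m1, ..., mk: k distinct data points
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # degree of membership u(j, i) of x_j in cluster i
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        weights = np.exp(-sq_dist)                         # dist(j, i) = exp(-||xj - mi||^2)
        u = weights / weights.sum(axis=1, keepdims=True)   # memberships of xj sum to 1
        # replace each mean with the fuzzy (membership-weighted) mean of all examples
        new_means = (u.T @ points) / u.sum(axis=0)[:, None]
        if np.abs(new_means - means).max() < tol:          # no change in any mean
            break
        means = new_means
    return u, means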

  6. Time course experiment

  7. K-means: Sample Application • Gene clustering. • Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line. • Normalization allows comparisons across microarrays. • Produce clusters of genes which vary in similar ways over time. • Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway. (Figure: sample array, with rows as genes and columns as time points, and a cluster of co-regulated genes.)
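
A hypothetical scikit-learn version of this application; `expr` stands in for a normalized genes-by-time-points matrix, and the per-gene standardization and choice of 10 clusters are illustrative:

import numpy as np
from sklearn.cluster import KMeans

expr = np.random.default_rng(0).normal(size=(500, 12))    # placeholder: 500 genes x 12 time points
expr = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(expr)
for c in range(10):
    genes_in_c = np.where(km.labels_ == c)[0]
    # genes in the same cluster vary similarly over time: candidates for
    # co-regulation or shared pathway membership
    print(f"cluster {c}: {len(genes_in_c)} genes")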

  8. Centroid Methods - K-means • Start with random positions for the K centroids. • Iterate until the centroids are stable: • Assign points to the nearest centroid • Move each centroid to the center of its assigned points (Figure: state after iteration 3)

  9. Application of K-means to time course experiments

  10. Agglomerative Hierarchical Clustering • Results depend on the distance-update method • Single linkage: elongated clusters • Complete linkage: sphere-like clusters • Greedy iterative process • Not robust against noise • No inherent measure for choosing the number of clusters
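
The linkage effects listed above can be seen with SciPy; the data here is a random placeholder:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).normal(size=(200, 5))

Z_single = linkage(X, method="single")      # tends to chain into elongated clusters
Z_complete = linkage(X, method="complete")  # tends to give compact, sphere-like clusters

# cut each tree into 4 clusters; the method itself offers no inherent
# rule for choosing this number, which is one of the weaknesses above
labels_single = fcluster(Z_single, t=4, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=4, criterion="maxclust")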

  11. Gene Expression Data • Cluster genes and conditions • Two independent clusterings: • Genes are represented as vectors of their expression across all conditions • Conditions are represented as vectors of the expression of all genes

  12. First Clustering - Experiments • 1. Identify tissue classes (tumor/normal)

  13. Second Clustering - Genes • 2. Find differentiating and correlated genes (figure labels: ribosomal proteins, cytochrome C metabolism, HLA2)

  14. Two-way Clustering

  15. Coupled Two-way Clustering (CTWC) • Motivation: Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players. Likewise, only a subset of the samples exhibits the expression patterns of interest. • New goal: Use subsets of genes to study subsets of samples (and vice versa). • A non-trivial task, since the number of subsets is exponential. • CTWC is a heuristic for solving this problem (a rough sketch follows below).
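
A rough sketch of the coupled iteration under simplifying assumptions: cluster the genes, then re-cluster the samples using only the genes of each sufficiently large gene cluster. The actual CTWC algorithm uses superparamagnetic clustering and a cluster-stability criterion; K-means and a minimum cluster size are stand-ins here, and all names are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def ctwc_like(expr, k_genes=8, k_samples=3, min_size=20):
    """expr: genes x samples matrix; returns one sample clustering per retained gene cluster."""
    gene_labels = KMeans(n_clusters=k_genes, n_init=10).fit_predict(expr)
    sample_clusterings = {}
    for g in range(k_genes):
        genes = np.where(gene_labels == g)[0]
        if len(genes) < min_size:            # skip small / unstable gene clusters
            continue
        sub = expr[genes, :].T               # samples described only by this gene subset
        sample_clusterings[g] = KMeans(n_clusters=k_samples, n_init=10).fit_predict(sub)
    return sample_clusterings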

  16. CTWC of Colon Cancer Data (Figure: panel A, samples separate into tumor vs. normal; panel B, samples separate by protocol A vs. protocol B)

  17. Multiple Testing Problem • Simultaneously test m null hypotheses, one for each gene j: Hj: no association between the expression measure of gene j and the response • Because microarray experiments simultaneously monitor the expression levels of thousands of genes, there is a large multiplicity issue • Increased chance of false positives (see the illustration below)
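
A small illustration of the multiplicity issue: with m independent tests at level 0.05, the chance of at least one false positive under the complete null is 1 - (1 - 0.05)^m:

alpha = 0.05
for m in (1, 10, 100, 1000, 10000):
    print(m, 1 - (1 - alpha) ** m)
# already at m = 100 the family-wise error rate is about 0.994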

  18. Hypothesis Truth vs. Decision • Null true, not rejected: correct decision (true negative) • Null true, rejected: Type I error (false positive) • Null false, not rejected: Type II error (false negative) • Null false, rejected: correct decision (true positive)

  19. Strong Vs. Weak Control • All probabilities are conditional on which hypotheses are true • Strong control refers to control of the Type I error rate under any combination of true and false nulls • Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true) • In general, weak control without other safeguards is unsatisfactory

  20. Adjusted p-values (p*) • The test level (e.g. 0.05) does not need to be determined in advance • Some procedures are most easily described in terms of their adjusted p-values • Usually easily estimated using resampling • Procedures can be readily compared based on the corresponding adjusted p-values

  21. A Little Notation • For hypothesis Hj, j = 1, …, m: observed test statistic tj, observed unadjusted p-value pj • Ordering of the observed |tj|: {rj} such that |tr1| ≥ |tr2| ≥ … ≥ |trm| • Ordering of the observed pj: {rj} such that pr1 ≤ pr2 ≤ … ≤ prm • Denote the corresponding random variables by upper case letters (Tj, Pj)

  22. Control of the type I errors • Bonferroni single-step adjusted p-values: pj* = min(m pj, 1) • Sidak single-step (SS) adjusted p-values: pj* = 1 - (1 - pj)^m • Sidak free step-down (SD) adjusted p-values: p(j)* = 1 - (1 - p(j))^(m - j + 1)
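
A sketch of the two single-step adjustments above, applied to a vector of raw p-values:

import numpy as np

def bonferroni(p):
    p = np.asarray(p, dtype=float)
    return np.minimum(len(p) * p, 1.0)          # pj* = min(m * pj, 1)

def sidak_single_step(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - (1.0 - p) ** len(p)            # pj* = 1 - (1 - pj)^m

raw = [0.0001, 0.004, 0.03, 0.2, 0.7]
print(bonferroni(raw))
print(sidak_single_step(raw))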

  23. Control of the type I errors • Holm (1979) step-down adjusted p-values: prj* = max k=1…j { min((m - k + 1) prk, 1) } • Intuitive explanation: once H(1) is rejected by Bonferroni, there are only m - 1 remaining hypotheses that might still be true (then apply Bonferroni again, etc.) • Hochberg (1988) step-up adjusted p-values (Simes inequality): prj* = min k=j…m { min((m - k + 1) prk, 1) }
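
A sketch of the Holm and Hochberg adjustments above; both sort the p-values, apply the (m - k + 1) multipliers, enforce monotonicity in the step-down or step-up direction, and restore the original order:

import numpy as np

def holm(p):
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # r1, ..., rm
    adj = np.minimum((m - np.arange(m)) * p[order], 1.0)
    adj = np.maximum.accumulate(adj)               # max over k = 1..j (step-down)
    out = np.empty(m)
    out[order] = adj
    return out

def hochberg(p):
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.minimum((m - np.arange(m)) * p[order], 1.0)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # min over k = j..m (step-up)
    out = np.empty(m)
    out[order] = adj
    return out

raw = [0.0001, 0.004, 0.03, 0.2, 0.7]
print(holm(raw))
print(hochberg(raw))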

  24. Control of the type I errors • Westfall & Young (1993) step-down minP adjusted p-values: prj* = max k=1…j { Pr( min l∈{rk, …, rm} Pl ≤ prk | H0C ) } • Westfall & Young (1993) step-down maxT adjusted p-values: prj* = max k=1…j { Pr( max l∈{rk, …, rm} |Tl| ≥ |trk| | H0C ) }

  25. Westfall & Young (1993) Adjusted p-values • Step-down procedures: successively smaller adjustments at each step • Take into account the joint distribution of the test statistics • Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values • Can be estimated by resampling but computer-intensive (especially for minP)
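
A permutation sketch of the step-down maxT procedure for a two-group comparison (e.g. tumor vs. normal); `expr` is a genes-by-samples matrix, `y` a binary group label, and the difference-of-means statistic is only a stand-in for whatever test statistic is actually used:

import numpy as np

def maxT_adjusted(expr, y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    stat = lambda e, g: np.abs(e[:, g == 1].mean(axis=1) - e[:, g == 0].mean(axis=1))
    t_obs = stat(expr, y)
    order = np.argsort(-t_obs)                     # |t_r1| >= |t_r2| >= ...
    m = len(t_obs)
    exceed = np.zeros(m)
    for _ in range(n_perm):
        t_perm = stat(expr, rng.permutation(y))[order]
        # successive maxima over the remaining (less significant) hypotheses
        succ_max = np.maximum.accumulate(t_perm[::-1])[::-1]
        exceed += succ_max >= t_obs[order]
    adj = np.maximum.accumulate(exceed / n_perm)   # enforce step-down monotonicity
    out = np.empty(m)
    out[order] = adj
    return out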
