
Anindya Bhattacharya and Rajat K. De Bioinformatics, 2008

Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles.


Presentation Transcript


  1. Anindya Bhattacharya and Rajat K. De, Bioinformatics, 2008. Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles

  2. Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions

  3. Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions

  4. Introduction • Correlation Clustering

  5. Correlation Clustering • Correlation clustering was proposed by Bansal et al. in Machine Learning, 2004. • It is based on the notion of graph partitioning.

  6. Correlation Clustering • How to construct the graph? • Nodes: genes. • Edges: correlation between the genes. • Two types of edges: • Positive edge. • Negative edge.

  7. Correlation Clustering • For example: a positive correlation coefficient between X and Y gives a positive edge; a negative correlation coefficient between X and Y gives a negative edge. [Figure: graph construction over genes A to H, followed by graph partitioning into Cluster 1 and Cluster 2.]

  8. Correlation Clustering • How to measure the quality of clusters? • The number of agreements. • The number of disagreements. • The number of agreements: the number of genes that are in correct clusters. • The number of disagreements: the number of genes wrongly clustered.

  9. Correlation Clustering • For example: The measure of agreements is the sum of: (1) the number of positive edges in the same clusters and (2) the number of negative edges in different clusters: 4 + 4 = 8. • The measure of disagreements is the sum of: (1) the number of negative edges in the same clusters and (2) the number of positive edges in different clusters: 0 + 2 = 2. [Figure: genes A, B, C, D and E partitioned into Cluster 1 and Cluster 2.]
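The counting rule on this slide can be sketched in a few lines of Python. The edge signs and the cluster membership below are hypothetical, chosen only so that the totals match the slide (8 agreements, 2 disagreements); the transcript does not preserve the actual figure.

```python
from itertools import combinations

def count_agreements(signs, cluster_of):
    """Return (agreements, disagreements) for a signed graph and a partition.

    signs: dict mapping frozenset({u, v}) to +1 (positive edge) or -1 (negative edge).
    cluster_of: dict mapping each node to its cluster id.
    """
    agreements = disagreements = 0
    for edge, sign in signs.items():
        u, v = tuple(edge)
        same_cluster = cluster_of[u] == cluster_of[v]
        # Agreement: positive edge inside a cluster, or negative edge across clusters.
        if (sign > 0) == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements

# Assumed partition and edge signs (not taken from the original figure).
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
positive = {frozenset(p) for p in
            [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("C", "D"), ("B", "E")]}
signs = {frozenset(p): (1 if frozenset(p) in positive else -1)
         for p in combinations(cluster_of, 2)}

print(count_agreements(signs, cluster_of))  # (8, 2)
```

Every edge is either an agreement or a disagreement, so minimizing one count is the same as maximizing the other.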

  10. Correlation Clustering • The goal is to minimize disagreements or, equivalently, to maximize agreements. • However, Bansal et al. (2004) proved that this problem is NP-complete. • Another problem is that the magnitude of the correlation coefficients is not taken into account.

  11. Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions

  12. Divisive Correlation Clustering Algorithm • Pearson correlation coefficient • Terms and measurements used in DCCA • Divisive Correlation Clustering Algorithm

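  13. Pearson correlation coefficient • Consider a set of $n$ genes, $X = \{x_1, \ldots, x_n\}$, for each of which $m$ expression values are given. • The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as: $Corr(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2 \, \sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$, where $x_{il}$ is the $l$th sample value of gene $x_i$ and $\bar{x}_i$ is the mean value of gene $x_i$ from $m$ samples.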

  14. Pearson correlation coefficient • $Corr(x_i, x_j) > 0$: $x_i$ and $x_j$ are positively correlated with the degree of correlation as its magnitude. • $Corr(x_i, x_j) < 0$: $x_i$ and $x_j$ are negatively correlated with value $|Corr(x_i, x_j)|$.
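As a minimal sketch, the Pearson correlation coefficient between two expression profiles can be computed directly from its standard definition. The function name `pearson` and the sample values are illustrative and not taken from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # raises ZeroDivisionError for a constant profile

gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # same pattern as gene_a, scaled
gene_c = [3.0, 2.0, 1.0]   # opposite pattern

print(pearson(gene_a, gene_b))  # positive: the two genes vary together
print(pearson(gene_a, gene_c))  # negative: the two genes vary in opposite directions
```

The sign tells whether two genes vary together or in opposite directions, and the magnitude gives the strength of that relation.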

  15. Terms and measurements used in DCCA • We define some terms and measurements used in DCCA: • Attraction • Repulsion • Attraction/Repulsion value • Average correlation value

  16. Terms and measurements used in DCCA • Attraction: There is an attraction between $x_i$ and $x_j$ if $Corr(x_i, x_j) > 0$. • Repulsion: There is a repulsion between $x_i$ and $x_j$ if $Corr(x_i, x_j) < 0$. • Attraction/Repulsion value: The magnitude of $Corr(x_i, x_j)$ is the strength of attraction or repulsion.

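  17. Terms and measurements used in DCCA • The genes will be grouped into $K$ disjoint clusters $C_1, C_2, \ldots, C_K$. • Average correlation value: The average correlation value for a gene $x_i$ with respect to cluster $C_p$ is defined as: $AVGC_{ip} = \frac{1}{|C_p|} \sum_{x_j \in C_p} Corr(x_i, x_j)$, where $|C_p|$ is the number of data points in $C_p$.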

  18. Divisive Correlation Clustering Algorithm • $AVGC_{ip}$ indicates the average correlation of a gene $x_i$ with the other genes inside the cluster $C_p$. • The average correlation value reflects the degree of inclusion of $x_i$ in cluster $C_p$.
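The average correlation value is simply the mean of a gene's correlations with the members of a cluster. A minimal sketch follows, assuming the pairwise correlations are already available in a nested dictionary `corr` (a hypothetical layout with made-up values). Whether a gene's correlation with itself is counted when it already belongs to the cluster is not clear from the transcript; this sketch averages over whatever members are passed in.

```python
def average_correlation(gene, cluster, corr):
    """Mean correlation of `gene` with every member of `cluster`.

    corr: nested dict, corr[a][b] is the correlation between genes a and b.
    """
    return sum(corr[gene][other] for other in cluster) / len(cluster)

# Hypothetical correlation values for three genes.
corr = {
    "g1": {"g1": 1.0, "g2": 0.8, "g3": -0.6},
    "g2": {"g1": 0.8, "g2": 1.0, "g3": -0.5},
    "g3": {"g1": -0.6, "g2": -0.5, "g3": 1.0},
}

print(average_correlation("g1", ["g2"], corr))        # strong inclusion
print(average_correlation("g1", ["g3"], corr))        # repulsed by this cluster
print(average_correlation("g1", ["g2", "g3"], corr))  # mixed cluster
```

A gene fits best in the cluster for which this value is highest.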

  19. Divisive Correlation Clustering Algorithm • [Figure: DCCA takes as input n genes X1, ..., Xn, each with m sample values, and produces K disjoint clusters C1, C2, ..., CK.]

  20. Divisive Correlation Clustering Algorithm • Step 1: • Step 2: for each iteration, do: • Step 2-i:

  21. Divisive Correlation Clustering Algorithm • Step 2: • Step 2-ii: • Step 2-iii: Which cluster has the largest repulsion value? [Figure: among the clusters C1, C2, ..., Cp, the selected one is called cluster C.]

  22. Divisive Correlation Clustering Algorithm • Step 2-iv: [Figure: cluster C is split into two clusters, Cp around gene xi and Cq around gene xj; each remaining gene xk of C is placed in one of the two.]

  23. Divisive Correlation Clustering Algorithm • Step 2-v: Place a copy of xk in the cluster, among C1, C2, ..., CK, for which it has the highest average correlation value. The resulting clusters are the new clusters CNEW.

  24. Divisive Correlation Clustering Algorithm • Step 2-vi: Is there any change between the clusters C1, C2, ..., CK and the new clusters CNEW?
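Steps 2-iii to 2-vi can be read as: pick the cluster with the strongest repulsion, split it in two, move each gene to the cluster with which it has the highest average correlation, and repeat the reassignment until nothing changes. The sketch below is one interpretation of those slides, not the authors' implementation. The slide formulas are missing from this transcript, so the cluster selection rule (most negative pairwise correlation), the seed choice for the split, and the tie handling are all assumptions, and the correlation values are made up.

```python
from itertools import combinations

def average_correlation(gene, cluster, corr):
    """Mean correlation of `gene` with every member of `cluster`."""
    return sum(corr[gene][other] for other in cluster) / len(cluster)

def split_and_reassign(clusters, corr, max_rounds=100):
    """One divisive pass: split the most repulsed cluster, then reassign genes."""
    # Assumed reading of step 2-iii: find the most negatively correlated pair.
    worst = None  # (correlation, cluster index, gene i, gene j)
    for idx, cluster in enumerate(clusters):
        for a, b in combinations(cluster, 2):
            if corr[a][b] < 0 and (worst is None or corr[a][b] < worst[0]):
                worst = (corr[a][b], idx, a, b)
    if worst is None:
        return [list(c) for c in clusters]  # no repulsion left, nothing to split
    _, idx, seed_i, seed_j = worst

    # Assumed reading of step 2-iv: each gene joins the seed it correlates with more.
    part_p = [g for g in clusters[idx] if corr[g][seed_i] >= corr[g][seed_j]]
    part_q = [g for g in clusters[idx] if corr[g][seed_i] < corr[g][seed_j]]
    current = [list(c) for i, c in enumerate(clusters) if i != idx] + [part_p, part_q]
    genes = [g for c in current for g in c]

    # Steps 2-v and 2-vi: reassign by highest average correlation until stable.
    for _ in range(max_rounds):
        updated = [[] for _ in current]
        for g in genes:
            best = max(range(len(current)),
                       key=lambda p: average_correlation(g, current[p], corr))
            updated[best].append(g)
        updated = [c for c in updated if c]  # drop clusters that became empty
        if {frozenset(c) for c in updated} == {frozenset(c) for c in current}:
            break
        current = updated
    return current

# Hypothetical correlations: g1/g2 rise together, g3/g4 fall together.
corr = {
    "g1": {"g1": 1.0, "g2": 0.99, "g3": -1.0, "g4": -0.99},
    "g2": {"g1": 0.99, "g2": 1.0, "g3": -0.99, "g4": -0.98},
    "g3": {"g1": -1.0, "g2": -0.99, "g3": 1.0, "g4": 0.99},
    "g4": {"g1": -0.99, "g2": -0.98, "g3": 0.99, "g4": 1.0},
}
result = split_and_reassign([["g1", "g2", "g3", "g4"]], corr)
print(result)  # [['g1', 'g2'], ['g3', 'g4']]
```

Calling the function repeatedly until it returns its input unchanged would give a full divisive loop under the same assumptions, which is why the number of clusters does not have to be supplied in advance.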

  25. Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions

  26. Results • Performance comparison • A synthetic dataset ADS • Nine gene expression datasets

  27. Performance comparison • A synthetic dataset ADS: three groups.

  28. Performance comparison • Experimental results: Clustering correctly.

  29. Performance comparison • Experimental results: undesired clusters.

  30. Performance comparison • Five yeast datasets: • Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al. • Four mammalian datasets: • GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.

  31. Performance comparison • The z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in the cluster. • Mutual information: $MI = \sum_{i=1}^{N_A} \left[ H(C) + H(A_i) - H(C, A_i) \right]$, where $H(C, A_i)$ are the entropies for each cluster-attribute pair, $H(A_i)$ are the entropies for each of the $N_A$ attributes independent of clusters, and $H(C)$ is the entropy for the clustering result independent of attributes.
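For a single attribute, the mutual information between a clustering and that attribute follows from the three entropies named on this slide via the standard identity MI = H(C) + H(A) - H(C, A). The sketch below illustrates that identity only; the label lists are hypothetical, and how the original tool combines the N_A attributes is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of hashable labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def mutual_information(cluster_labels, attribute_labels):
    """MI between a clustering and one attribute: H(C) + H(A) - H(C, A)."""
    joint = list(zip(cluster_labels, attribute_labels))  # cluster-attribute pairs
    return entropy(cluster_labels) + entropy(attribute_labels) - entropy(joint)

clusters = [0, 0, 1, 1]      # cluster id of four genes (hypothetical)
annotated = [1, 1, 0, 0]     # attribute fully determined by the cluster
unrelated = [0, 1, 0, 1]     # attribute independent of the cluster

print(mutual_information(clusters, annotated))  # 1.0
print(mutual_information(clusters, unrelated))  # 0.0
```

An attribute that lines up with the clusters contributes a high MI, while one spread evenly across clusters contributes nothing.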

  32. Performance comparison • The z-score is defined as: $z = \frac{MI_{real} - MI_{random}}{s_{random}}$, where $MI_{real}$ is the computed MI for the clustered data, using the attribute database. • $MI_{random}$ is obtained by computing MI for a clustering obtained by randomly assigning genes to clusters of uniform size and repeating until a distribution of values is obtained; the mean of these MI-values is used in the formula. • $s_{random}$ is the standard deviation of these MI-values.

  33. Performance comparison • A higher value of z indicates that genes are better clustered by function, indicating a more biologically relevant clustering result. • Gibbons' ClusterJudge tool is used to calculate the z-score for the five yeast datasets.
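The z-score compares the MI of the real clustering with the distribution of MI values from random clusterings. A minimal sketch, assuming the sample standard deviation is intended (the transcript does not say which estimator is used) and using made-up MI values; the helper `random_uniform_clustering` is a hypothetical way to draw one random clustering of near-uniform cluster size, and this is not the ClusterJudge implementation.

```python
import random
import statistics

def z_score(mi_real, mi_random_values):
    """(MI of real clustering - mean of random MIs) / std. dev. of random MIs."""
    mean_random = statistics.mean(mi_random_values)
    std_random = statistics.stdev(mi_random_values)  # sample std. dev. (assumption)
    return (mi_real - mean_random) / std_random

def random_uniform_clustering(genes, k, rng):
    """Randomly assign genes to k clusters of (nearly) equal size."""
    shuffled = list(genes)
    rng.shuffle(shuffled)
    return {gene: position % k for position, gene in enumerate(shuffled)}

# Made-up MI values, for illustration only.
print(z_score(5.0, [1.0, 2.0, 3.0]))  # 3.0

assignment = random_uniform_clustering(["g1", "g2", "g3", "g4"], 2, random.Random(0))
print(sorted(assignment.values()))    # [0, 0, 1, 1]
```

In a full evaluation, the MI of each random assignment would be computed against the same annotation and collected into `mi_random_values`.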

  34. Performance comparison • Experimental results:

  35. Performance comparison • Experimental results:

  36. Performance comparison • Experimental results:

  37. Performance comparison • Experimental results:

  38. Performance comparison • Experimental results:

  39. Outline • Introduction • Divisive Correlation Clustering Algorithm • Results • Conclusions

  40. Conclusions • Pros: • DCCA obtains clustering solutions with high biological significance from gene-expression datasets. • DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.

  41. Conclusions • Cons: • The computational cost of repairing any misplacement that occurs in the clustering step is high. • DCCA will not work if the dataset contains fewer than 3 samples, because the correlation value will then always be either +1 or -1.
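The limitation on the number of samples can be checked numerically: with only two samples per gene, any two non-constant profiles lie on a straight line, so their Pearson correlation is +1 or -1 regardless of the actual values. The sketch below uses a hypothetical `pearson` helper and arbitrary numbers.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Two samples per gene: only the direction of change survives.
rising = pearson([1.0, 3.0], [2.0, 5.0])    # both genes increase
opposed = pearson([1.0, 3.0], [5.0, 2.0])   # one increases, one decreases

print(round(rising, 6), round(opposed, 6))  # 1.0 -1.0
```

With every pair at full attraction or full repulsion, the magnitude of the correlation carries no information for the algorithm to use.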
