1 / 29

Seeing the forest for the trees : using the Gene Ontology to restructure hierarchical clustering

Seeing the forest for the trees : using the Gene Ontology to restructure hierarchical clustering. Dikla Dotan -Cohen, Simon Kasif and Avraham A. Melkman Vol. 25 14 2009, pages 1789-1795 BIOINFORMATICS. Outline. Introduction Method Results Discussion. Introduction.

mills
Download Presentation

Seeing the forest for the trees : using the Gene Ontology to restructure hierarchical clustering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Seeing the forest for the trees : using the Gene Ontology to restructure hierarchical clustering DiklaDotan-Cohen, Simon Kasif and Avraham A. Melkman Vol. 25 14 2009, pages 1789-1795BIOINFORMATICS

  2. Outline • Introduction • Method • Results • Discussion

  3. Introduction • Incorporating prior knowledge such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance. • The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity.

  4. Introduction • The paper propose a novel algorithm that integrates • Semantic similarities(derived from the ontology structure) • Expression-based hierarchical clustering.

  5. Introduction • Methods in the past few years: • Distance-based methods: • Hanischet al. (2002) • Constructed a distance function which combines information from expression data and the proximity of the protein in a metabolic pathway network. • Huang and Pan (2006) • Propose shrinking the distance between pairs of genes that share a common annotation • Shrinkage parameter • Speer et al. (2004) • Modified the similarity between two genes to be a linear combination of the similarity of their expression profiles and their functional similarity.

  6. Introduction • D.Dotan-Cohen et al. • Modify the hierarchical clustering method by “snipping” the hierarchical tree to obtain clusters that are as consistent as possible with the known annotation. • Most of these clustering method utilize only the annotations provided by the GO. • A further aspect of the ontology : its hierarchical structure

  7. Introduction • Motivating considerations for integrating semantic similarity measures into clustering method: • Enhance performance of the clustering • Good correlation between semantic similarity and gene coexpression. • It enables the analysis to retain all the annotations of genes • By definition, all genes are annotated with the most general GO term (the root of the hierarchy, e.g. the GO term ‘biological process’) • Majority of genes have multiple annotations.

  8. Introduction • Several algorithm, such as Dotan-Cohen et al. or Huang and Pan, ask only whether a gene does or does not have a give annotation. • Such algorithms can only be used when none of the terms that genes are annotated with is a descendant term of another. • Make no sense to scrutinize a cluster for annotations that are shared by the genes in the cluster.

  9. Methods - Expression similarity • The expression similarity between two genes g1 and g2 was taken to be • is the Pearson correlation between the expression profiles of g1 and g2.

  10. Methods - Semantic similarity measures • Semantic similarity between two GO terms • P(t) : the probability of term t to occur, which is estimated as the number of gene products annotated with the term or any more specific term in the database, divided by the total number of annotated genes. • Relevance measure of Schilickeret al. (2006) • S(t1, t2) is the set of common ancestor of term t1 and t2. • the relevance measure is sensitive to the level.

  11. Methods - Semantic similarity measures • Another definition given by Lin(1998): • It concentrates on the closeness of the terms to their common ancestor. • Insensitive to the level of detail. • Two general terms are given the same similarity score as two specific terms provided the two pairs are equally close to their lowest common ancestors.

  12. Methods - Semantic similarity measures • Semantic similarity between a gene and a GO term • Difficulty : a gene g is usually annotated with a list of GO terms, GO(g), because it participates in more than one biological process • Similarity equal to the maximum possible similarity between the term and any of the annotations of the gene:

  13. Methods - Semantic similarity measures • Semantic similarity between two genes • The semantic similarity between two genes could be set equal to the maximum similarity between any pair of term and . • This is the kind of definition adopted by Resnik (1999) for the similarity between words with multiple meanings in a lexical database, on the premise that a word has usually a single meaning in a given context. • This definition may not be appropriate because of the co-occurrence of several process.

  14. Methods - Semantic similarity measures • We followed Schlickeret al. (2006) • Defining the similarity between gene g1 and g2 as the average of the following row and column score:

  15. Methods - Unannotated genes • Unannotatedgenes • To deal with the sizable percentage of genes that are currently unannotated. • One option ,Speer et al. (2004), is to exclude them from the clustering. • Another option, Kustra and Zagdanski (2007), is to base the semantic similarity between an unannotated gene and an annotated one on the expression similarity between them. • In this paper, give such genes the most general annotation, the root of the ontology hierarchy(e.g. ‘biological process’, GO:0000004) • Advantage : not drawing a distinction between a gene whose annotation is completely unknown and one that has a known annotation that is not very specific.

  16. Methods - Algorithm • Standard hierarchical clustering • Hierarchical agglomerative clustering algorithm to create a binary tree • Each leaf corresponds to a different gene • Created by recursively and greedily joining the nodes that are closest to each other according to some criterion • K clusters are produced by making a horizontal cut • Assigning all genes in a subtree to the same cluster

  17. Methods - Algorithm • Minimum misclassification clustering • Snip selected edges at varying heights to take advantage of the partially available annotation of genes. • Maximally consistent with the partially available background knowledge. • Assign each resulting cluster c a lable, label(c), from among a set of GO terms. • A gene g belonging to a cluster c is said to be misclassified if it is not annotated with label(c). • The problem is to find, among all possible K-partitions and all possible labeling of the resulting clusters, one for which the total number of misclassified genes is minimized.

  18. Methods - Algorithm • Dynamic programming Minimum misclassification algorithm • Traverse the tree in a bottom-up fashion, while computing for each node , functional label and number of snips the value equals the minimal number of misclassified leaves

  19. Methods - Algorithm • Minimum discrepancy clustering • Using a measure of the extent of misclassification of a gene g with respect to a label l, its discrepancy denoted which varies between 0 and 1. • equal to the minimum of • Find a partitionP of the tree into a set of K clusters, and to find a labeling ‘label’ that assigns each cluster c in P a cluster label label(c), such that similarity discrepancy is minimize.

  20. Methods - Algorithm • The balance algorithm • Any k-partition P of the dendrogram is obtained by snipping an edge to a child at k-1 nodes. Call these nodes snipping nodes, and denote their set by S(P) • is the largest distance between the expression patterns of two genes clustered in the subtree rooted at . • Sum of the heights of the snipping nodes

  21. Methods - Algorithm • Intuitively, it is desirable to maximize . • Another , possibly conflicting, desirable goal is to minimize the similarity discrepancy.Develop an algorithm that strikes a balance between minimizing SD and maximizing H. • Minimizing the function • Minimum discrepancy algorithm. Standard hierarchical clustering

  22. Methods - Performance evaluation • Evaluated in two different ways • protein-protein interaction (PPI) • How many of the potential interactions between genes in a cluster actually appear in the database of PPI. • Interacting genes are strongly constrained. • For each cluster of n genes, we computed the percentage of the number of interactions found in the PPI data, relative to the total number of possible interaction .

  23. Methods - Performance evaluation • A series of cross-validation tests in which a biological process (BP) annotation was determined for each cluster • Predict a label for each cluster with a GO term. • Strict accuracy : the ration of the number of predictions that are true to the total number of predictions. • Similarity weighted (sw-) accuracy measure : the semantic similarity between actual annotation of the gene and the predicted process term. • The algorithm are not designed for functional prediction, so that the cross-validation served only to evaluate the quality of the clusterings.

  24. ResultsFirst experiment : using all the GO terms • These Figure presents the percentage of PPI pairs out of all possible pairs

  25. Results First experiment : using all the GO terms • The cross-validation tests • Accuracy measure is the ratio of the number of true predictions to the total number of predictions, averaged over the five subsets.

  26. Results Second experiment : 10 Go terms • Use 10 randomly chosen functional annotations to serve as labels of to the genes.

  27. Results Second experiment : 10 Go terms

  28. Discussion • Semantic similarity measures extract a significant part of the biological knowledge captured by the GO. • A further advantage of integrating semantic similarities into the clustering procedure is that it enables the retention and exploitation of multiple annotations.

  29. Discussion • Superior to the competitor approaches • Semantic similarity is brought into play after the genes have already been clustered by expression similarity. Semantic similarity plays a significant role only in borderline cases. • It treats the annotations in a uniform and consistent manner.

More Related