CISC 841 - BIOINFORMATICS

CISC 841 - BIOINFORMATICS GENE ANNOTATION AND NETWORK INFERENCE BY PHYLOGENETIC PROFILING Authors : Jie Wu, Zhenjun Hu and Charles DeLisi Boston University Presented by, Rajesh Ponnurangam

MOTIVATION • The need for effective decision rule to use for correlation • Inefficiencies of current methods • Effectiveness of Phylogenetic analysis • Need for improved performance at various levels of Resolution • New Decision Rule – Correlation Enrichment

OVERVIEW • Introduction Concepts • What’s wrong with existing technologies of decision making? • Comparison of Decision Rules • Comparison with other published methods • Identifying functional and evolutionary modules • Standard Guilt by Association (SGA) • Correlation Enrichment (CE) • How correlation enrichment (CE) proves to be more effective?

INTRODUCTION CONCEPTS • Gene Annotation • Network inference • Phylogenetic profiling • Correlation Enrichment (CE) • Standard guilt by association (SGA) • KEGG Pathways • COG Ontology

INTRODUCTION CONCEPTS • Gene Annotation – The process of attaching biological information to sequences • identifying elements on the genome (gene finding), and • attaching biological information to these elements • Network Inference – Knowing the topology of a biological network like transcriptional regulatory networks, metabolite networks etc. • Phylogentic Profiling – Used to infer the function of a gene by finding another gene of known function with an identical pattern of presence and absence across a set of distributed genomes

INTRODUCTION CONCEPTS • Correlation Enrichment – A new decision rule for assigning genes to functional categories at various levels of resolution • Standard Guilt by Association (SGA) – Simple decision rule, which assigns an unannotated gene to all known categories of an annotated gene if the phylogenetic profiles exceed some specific correlation threshold • KEGG – Kyoto Encyclopedia of Genes and Genomes, connects known information on molecular interaction networks • COG Ontology – Cluster of Orthologous Genes, source of a conserved domain datasource

EXISTING METHOD AND PHYLOGENETIC PROFILING • Current methods like SGA, perform at a level well below what is possible, largely because the performance of an effective decision rule to use the correlate deteriorates rapidly as coverage increases • Phylogenetic profiling provides restricted profiling, requiring full profile identity, while accurate, has low coverage • Phylogenetic profiling of a gene is a binary string • Presence – 1 • Absence – 0

PHYLOGENETIC PROFILING • N -> Number of genomes over which profiles are defined • with gene X occurring in x genomes and gene Y occurring in y genomes and both occurring in z genomes, the probability of observing z co-occurrences purely by chance, given N,x and y is, • MI(X,Y) - p(i,j), (i=0,1; j=0,1), fraction of genomes in which gene X is in state i and gene Y is in state j • p(1,1) – fraction of genomes in which both are present • p(1,0) – fraction of genomes in which X is present and Y is absent

PHYLOGENETIC PROFILING Also Then the relation between MI and eq(1) is The paper defines a new measure of correlation between two binary strings 0 ≤ C ≤ 1 (3b)

COMPARISON OF SGA & CE • SGA assigns an unannotated gene to all known categories of an annotated gene if profiles exceed some correlation threshold. • CE assigns an unannotated gene by ranking each category (pathway) with a score reflecting • The number of genes (annotated) within a category, whose profile correlation with that of the unannotated gene exceeds a pre-specified threshold • The magnitude of these correlations • CE substantially outperforms SGA in allocating genes to functional categories • SGA, for C*=0.35, links 1025 of 2918 unannotated orthologs to one pathway • CE was able to assign all 2918 KEGG unannotated orthologs to pathways and all COG unannotated orthologs to COG categories

PATHWAY ALLOCATION PERFORMANCE

COMPARISON OF DECISION RULES

COMPARISON OF DECISION RULES • SGA – assignment based on profile identity • For inferences based on identity only 5.4% of unannotated orthologs are assignable to KEGG pathways • When C*=0.2 to achieve a coverage of 90% requires accepting a PPV of 6% • For inferences based on CE, PPV is markedly increased at high coverage, exceeding its SGA value approximately 6 fold • The two decision rules perform similarly at coverages below 20% • PPV estimates are conservative • CE performs superior than SGA

COMPARISON OF DECISION RULES • At C*=0.4 where SGA and CE curves for PPV have reached about half their maximum divergence, CE performs substantially better than SGA at GO specificity levels.

COMPARISON WITH OTHER PUBLISHED MODELS • Different methods to draw functional inferences like “majority vote” and Markov Random Field can assign function based on the network context of unannotated genes • Predictive reliability can be increased by combining them using one or another statistical framework such as support vector machines, Bayesian inferences and Markov Random field. • Using SGA to assign genes to GO categories, fraction of genes assigned to at least one category decreases from 0.98 to ~0.10 as functional specificity increases with coverage fixed at 40% • Using CE, the fraction correctly assigned to at least one category is 0.95 at the lowest specificity level and remains 0.78 at all specificity levels

INFERENCES BASED ON COG ONTOLOGY • COG functional categories provide a low resolution, but fully resolved annotation • 1 gene to 1 functional category mapping • Profiling by CE of the full set of 4826 genes, at C*=0.55 returns a 926 genes linked to at least one annotated gene • Each of the 926 genes, including 249 unannotated are assignable to COG category • Performance estimation – 68% (463/677)

INFERENCES BASED ON COG ONTOLOGY

INFERENCES BASED ON COG ONTOLOGY • A more detailed version of the category H TP set reveals two strikingly dense clusters – one with 7 orthologs, the other with 11

PHYLOGENETIC PROFILES OF THE 11-MEMBER CLUSTER Phlogenetic profiles of the 11-member cluster of orthologs across 66 genomes uncovered by CE. Green represents absence and red, presence of an ortholog

CLIQUES, CLUSTERS & INFERENCE QUALITY • As the threshold decreases from its most stringent value (C*=0.91), the number of clusters containing more than 3 nodes increases, peaking at C*=0.66 and then declines as the nodes coalesce into increasingly larger clusters

CLIQUES, CLUSTERS & INFERENCE QUALITY

METHODS • Dataset – COG database. Accuracy is evaluated against KEGG • Assessment • Positive Predictive Value – By definition, the population averaged positive predictive value is,

METHODS • PPV as a product of two factors • Related Metrics • SPE-ACC • SEN-A0

STANDARD GUILT BY ASSOCIATION let i be the number of the categories that contain the gene I let J(I, J) be the set of categories that contain a gene J whose profile correlation with I meets the threshold C*, j(I, J) is its size let K(I, J) denote the set of common categories and k(I, J) is its size; where 0 ≤ k(I, J) ≤ min(i, j). The unannotated gene is therefore correctly assigned to TP = k categories and incorrectly assigned to the remaining FP = j - k categories. Also TN = T - i - j + k and FN = i - k, where T = 133 is the total number of pathways. Consequently, the PPVI(J) with which gene I is assigned using linked gene J is

STANDARD GUILT BY ASSOCIATION Maximum PPVI(J) is not necessarily 1, but min(i,j)/j For j<I, PPVI(J)<1, whereas when i>j, PPVI(J) can become 1 when the pathways of J are a subset of I. The positive predictive value for gene I is obtained by taking sums over all genes to which its is correlated Where G(I) is the number of genes correlated with gene I and Nc(I) is the subset of genes in G(I) that share at least one category with gene I

CORRELATION ENRICHMENT Suppose an unannotated gene is correlated with in total g other genes (C > C*) from r categories, and let m1, m2, ..., mrbe the number of correlated genes in categories k1, k2, ..., kr, where r ≤ g, the equality holding only when each gene is in one category. Further, let denote the categories the gene is in. For each of the r categories that have 1 or more genes meeting the correlation threshold with I, define a weighted sum score, Sv α is positive adjustable integer which gives disproportionately high weights to strong correlations

CORRELATION ENRICHMENT FP = r0 – TP FN = T1 – TP TN = T – r0 – T1 + TP

REFERENCES • Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19(12):1524-1530 • Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.Proc Natl Acad Sci U S A 1999, 96(8):4285-4288 • Aravind L: Guilt by association: contextual information in genome analysis.Genome Res 2000, 10(8):1074-1077 • Nariai N, Tamada Y, Imoto S, Miyano S: Estimating gene regulatory networks and protein-protein interactions of Saccharomyces cerevisiae from multiple genome-wide data.Bioinformatics 2005, 21 Suppl 2:ii206-ii212.

Thank you

CISC 841 - BIOINFORMATICS

CISC 841 - BIOINFORMATICS

Presentation Transcript

CISC Processor

A1 (594 x 841 mm)

CISC 667 Intro to Bioinformatics (Fall 2005) Phylogenetic Trees (I)

CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (II)

CISC 467/667 Intro to Bioinformatics (Fall 2005) Protein Structure Prediction

CISC 667 Intro to Bioinformatics (Spring 2007) Whole genome sequencing

CISC 667 Intro to Bioinformatics (Spring 2007) Sequence pairwise alignment

CISC 841 Bio Informatics

CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation

CISC 841 Bioinformatics (Fall 2007) Hidden Markov Models

CISC 467/667 Intro to Bioinformatics (Spring 2007) RNA secondary structure

CISC 667 Intro to Bioinformatics (Fall 2005) Pairwise sequence alignment

CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search

CISC 841 Bioinformatics (Fall 2008) Hidden Markov Models

CISC 467/667 Intro to Bioinformatics (Spring 2007) Protein Structure Prediction

CISC 667 Intro to Bioinformatics (Spring 2007) Phylogenetic Trees (III)

CISC 841 Bioinformatics (Fall 2008) Hidden Markov Models

CISC 841 Bioinformatics (Fall 2008) Review Session

CISC 841 Bioinformatics Combining HMMs with SVMs

TWR 841 Characteristics

CISC 841 - BIOINFORMATICS

CISC 841 Bio Informatics