Complementarity of network and sequence information in homologous proteins

Complementarity of network and sequence information in homologous proteins Vesna Memišević2, Tijana Milenković2, and Nataša Pržulj1 1Department of Computing, Imperial College London, London, UK 2Department of Computer Science, University of California, Irvine, USA International Symposium on Integrative Bioinformatics March, 2010

Motivation • Genetic sequences – revolutionized understanding of biology • Non-sequence based data of importance, e.g.: • secondary & tertiary structure of RNA have the dominant role in RNA function (tRNA: Gautheret et al., Comput. Appl. Biosci., 1990) (rRNA: Woese et al., Microbiological Reviews, 1983) • Secondary structure-based approach – more effective at finding new functional RNAs than sequence-based alignments (Webb et al., Science, 2009) • What about patterns of interconnections in PPI networks? • Can they complement the knowledge learned from genomic sequence? • Wiring patterns of duplicated proteins in PPI net – insights into evol. dist.? • Does the information about homologues captured by PPI network topology differ from that captured by their sequence? Nataša Pržulj natasha@imperial.ac.uk

Background • Homologs– descend from a common ancestor: • Paralogs: in the same species, evolve through gene duplication events • Orthologs: in different species, evolve through speciation events Nataša Pržulj natasha@imperial.ac.uk

Background 4 4 • Sequence-based homology data from: • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

5 5 5 Background • Sequence-based homology data from: • Clusters of Orthologous Groups – COG[1] • Proteins in different genomes – sequence compared for the best hits (BeTs) • The graph of BeTs constructed • KEGG Orthology System[2] [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

6 6 6 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • Proteins in different genomes – sequence compared for the best hits (BeTs) • The graph of BeTs constructed • KEGG Orthology System[2] 1 1’ 5 2 3 7 6 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

Background 7 7 • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • Proteins in different genomes – sequence compared for the best hits (BeTs) • The graph of BeTs constructed • Triangles in it found • KEGG Orthology System[2] 1 1’ 5 2 3 7 6 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

Background 8 8 8 • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • Proteins in different genomes – sequence compared for the best hits (BeTs) • The graph of BeTs constructed • Triangles in it found • KEGG Orthology System[2] 1 1’ 2 3 7 6 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

Background 9 9 9 • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • Proteins in different genomes – sequence compared for the best hits (BeTs) • The graph of BeTs constructed • Triangles in it found • Triangles sharing a side merged into the groups of orthologs and paralogs • KEGG Orthology System[2] 1 1’ 2 3 7 6 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

Background 10 10 10 10 • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • Proteins in different genomes – sequence compared for the best hits (BeTs) • The graph of BeTs constructed • Triangles in it found • Triangles sharing a side merged into the groups of orthologs and paralogs • KEGG Orthology System[2] 1 1’ 2 3 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

Background 11 11 11 11 • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • Proteins in different genomes – sequence compared for the best hits (BeTs) • The graph of BeTs constructed • Triangles in it found • Triangles sharing a side merged into the groups of orthologs and paralogs • No dependence on the absolute level of similarity between compared proteins • KEGG Orthology System[2] 1 1’ 2 3 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

12 12 12 12 12 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] • Sequences aligned • If alignment score < 10-8 then 1 assigned as “similarity bit” • Otherwise, 0 assigned as “similarity bit” • “Bit vectors” constructed for a protein, over all proteins • Graph constructed with nodes protein sequences and edges correlation coefficients of bit vectors of nodes • Cliques found in the graph = orthology groups [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

14 14 14 14 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] • Sequences aligned • If alignment score < 10-8 then 1 assigned as “similarity bit” • Otherwise, 0 assigned as “similarity bit” • “Bit vectors” constructed for a protein, over all proteins • Graph constructed with nodes protein sequences and edges correlation coefficients of bit vectors of nodes • Cliques found in the graph = orthology groups 1 1’ 5 2 3 7 6 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

15 15 15 15 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] • Sequences aligned • If alignment score < 10-8 then 1 assigned as “similarity bit” • Otherwise, 0 assigned as “similarity bit” • “Bit vectors” constructed for a protein, over all proteins • Graph constructed with nodes protein sequences and edges correlation coefficients of bit vectors of nodes • Cliques found in the graph = orthology groups • Again, no dependence on absolute level of similarity 1 1’ 5 2 3 7 6 4 [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

16 16 16 16 16 16 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] • We examine yeast proteins only: • Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs” • There are 9,643 of unique such pairs • What are their topological similarities within the PPI network? [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

17 17 17 17 17 17 17 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] • We examine yeast proteins only: • Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs” • There are 9,643 of unique such pairs • What are their topological similarities within the PPI network? [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

18 18 18 18 18 18 18 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] • We examine yeast proteins only: • Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs” • There are 9,643 of unique such pairs • What are their topological similarities within the PPI network? [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

19 19 19 19 19 19 19 19 Background • Sequence-based homology data from : • Clusters of Orthologous Groups – COG[1] • KEGG Orthology System[2] • Previous network-topology assisted approaches: • Network-alignment-based (ISORank) • Yosef, Sharan & Noble, Bioinformatics, 2008 (hybrid Rankprop) • Rely heavily on sequence information • Use only limited amount of network topology [1] Tatusov et al., BMC Bioinformatics, 4(41), 2003. [2] Kanehisa et al., Nucleic Acids Res., 28:27–30, 2000. Nataša Pržulj natasha@imperial.ac.uk

20 20 20 20 20 20 20 Our Method • We examine yeast proteins only: • Extract all possible pairs of them in COG and KEGG groups = “orthologous pairs” • There are 9,643 of unique such pairs • What are their topological similarities within the PPI network? • PPI networks are noisy • We analyze the high-confidence part of yeast PPI network by Collins et al.[3]: 9,074 edges amongst 1,621 proteins • Focus on proteins with degree > 3 to avoid noisy PPIs • There are 175 orthologous pairs amongst 181 proteins [3] Collins et al., Molecular and Cellular Proteomics, 6(3):439–450, 2008 Nataša Pržulj natasha@imperial.ac.uk

Our Method • Does PPI network topology contain homology information? • Are similarly wired proteins homologous? • Does homology information obtained from network topology differ from that obtained from sequence? Nataša Pržulj natasha@imperial.ac.uk

22 Our Method N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004. Nataša Pržulj natasha@imperial.ac.uk

23 23 Our Method • Induced • Of any frequency N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004. Nataša Pržulj natasha@imperial.ac.uk

Generalize node degree 24 24 24 Our Method N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007. Nataša Pržulj natasha@imperial.ac.uk

25 25 25 25 Our Method N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007. Nataša Pržulj natasha@imperial.ac.uk

26 26 26 26 26 Our Method N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007. Nataša Pržulj natasha@imperial.ac.uk

Our Method Graphlet Degree (GD) vectors, or “node signatures” T.Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 4, pg. 257-273, 2008. Nataša Pržulj natasha@imperial.ac.uk

28 Our Method Similarity measure between nodes’ Graphlet Degree vectors T.Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 4, pg. 257-273, 2008. Nataša Pržulj natasha@imperial.ac.uk

29 29 Our Method Signature Similarity Measure T.Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 4, pg. 257-273, 2008. Nataša Pržulj natasha@imperial.ac.uk

30 30 Our Method • For the 181 proteins in 175 orthologous pairs, we find: • Graphlet degree vectors (GDVs) in the entire PPI network • GDV-similarities (GDS) = topological similarities • Sequence identities using Smith-Waterman local alignment with BLOSUM50 substitution matrix as the scoring scheme • We compare the GDV-similarity vs. sequence identity • topology vs. sequence Nataša Pržulj natasha@imperial.ac.uk

Results Network Topology • Orthologous pairs often perform the same or similar function. • Does GD vector similarity (GDS) imply shared biological function? • Note: most GO annotations were obtained from sequences • Similar topology ~ similar sequence ~ similar function Nataša Pržulj natasha@imperial.ac.uk

32 Results Network Topology • Orthologous proteins have high GD vector similarities Nataša Pržulj natasha@imperial.ac.uk

33 33 Results Network Topology • Orthologous proteins have high GD vector similarities p-value < 0.05 85% Nataša Pržulj natasha@imperial.ac.uk

34 34 34 Results Network Topology • Orthologous proteins have high GD vector similarities > 20% of orthologous pairs have GDS > 85% p-value < 0.05 85% Nataša Pržulj natasha@imperial.ac.uk

35 35 35 35 Results Network Topology – Robustness • PPI networks are noisy • Random edge additions, deletions and rewirings in the PPI net Nataša Pržulj natasha@imperial.ac.uk

36 36 36 36 36 Results Network Topology – Robustness • PPI networks are noisy • Random edge additions, deletions and rewirings in the PPI net Nataša Pržulj natasha@imperial.ac.uk

37 37 37 37 37 Results Network Topology – Robustness • PPI networks are noisy • Random edge additions, deletions and rewirings in the PPI net Nataša Pržulj natasha@imperial.ac.uk

38 38 38 38 38 38 Results Sequence • Sequence identities for the 175 orthologous pairs Nataša Pržulj natasha@imperial.ac.uk

39 39 39 39 39 39 39 Results Sequence • Sequence identities for the 175 orthologous pairs ~70% orth. pairs have seq. identity < 35% 35% Nataša Pržulj natasha@imperial.ac.uk

40 40 40 40 40 40 40 40 Results Sequence • Sequence identities for the 175 orthologous pairs ~20% orth. pairs have seq. identity > 90% 90% Nataša Pržulj natasha@imperial.ac.uk

41 41 41 41 41 41 41 41 41 Results Sequence • Sequence identities for the 175 orthologous pairs “Twilight zone” for homology ~70% orth. pairs have seq. identity < 35% • No dependence on the absolute similarity COG • & KEGG, but triangles in the graph of best matches 20-35% Nataša Pržulj natasha@imperial.ac.uk

Results Comparison: 20% 35% 85% ~20% of orthologous pairs have signature similarities above 85% (35 pairs) ~30% of orthologous pairs have sequence identities above 35% (53 pairs) • Overlap: 22 pairs (~60% of the smaller set) • Sequence and network topology • somewhat complementary slices of homology information Nataša Pržulj natasha@imperial.ac.uk

43 43 43 43 43 43 43 Results Examples • 59 of the yeast ribosomal proteins – retained two genomic copies • Are duplicated proteins functionally redundant? • No: have different genetic requirements for their assembly and localization so are functionally distinct • Also note: avg sequence identity of struct. similar prots ~8-10% • Two pairs with identical sequence: 100% sequence identity 50% signature similarity Degrees 25 and 5 Nataša Pržulj natasha@imperial.ac.uk

44 44 44 44 44 44 44 44 Results Examples • 59 of the yeast ribosomal proteins – retained two genomic copies • Are duplicated proteins functionally redundant? • No: have different genetic requirements for their assembly and localization so are functionally distinct • Also note: avg sequence identity of struct. similar prots ~8-10% • Two pairs with identical sequence: 100% sequence identity 65% signature similarity Degrees 54 and 9 Nataša Pržulj natasha@imperial.ac.uk

Conclusions • Homology information captured by PPI network topology differs from that captured by sequence • Complementary sources for identifying homologs • Future work: • Could topological similarity be used to identify orthologs from best-hits graph analysis as done for sequences?

Acknowledgements This project was supported by the NSF CAREER IIS-0644424 grant Nataša Pržulj natasha@imperial.ac.uk

Complementarity of network and sequence information in homologous proteins