Comparative genome analysis Hard data and soft interpretations? Peer Bork EMBL & MDC Heidelberg & Berlin bork@embl-heidelberg.de http://www.bork.embl-heidelberg.de/
Sequenced eukaryotic genomes (Bork and Copley, Nature 409 (2001) 818)
Sources of uncertainties (human genome draft)
• Sequence coverage
• Assembly accuracy
• Polymorphism
• Sequence accuracy
• Annotation accuracy
Comparative genome analysis
• Prediction of genes and pseudogenes
• Homology-based function prediction
• Context-based function prediction 1. Co-occurrence of genes
• Context-based function prediction 2. Gene neighbourhood
Number of human genes in time
[Figure: number of human genes in thousands (axis 0 to 120) from Feb 00 to Apr 01, shown against the NEMAX50 index (2T to 10T). Labelled sources: HGS, Incyte and co; HGS; textbooks, public opinion; others; Celera; HGP. Estimates marked in the plot: 52, 39, 38, 32, 27, 24, 22. Annotation: basis for Feb 01 publications.]
Hunting for pseudogenes: Homology search of all human intergenic regions
HUMAN GENOME: 3.3·10^9 nucleotides
- Masking for repetitive elements and ENSEMBL sequences → 1.4·10^6 DNA fragments
- BLASTX vs nr95 prot. db. (cutoff E < e-8) → 4.4·10^4 DNA fragments
- Filtering of query and database for Low Complexity Regions → 3.6·10^4 DNA fragments
- BLASTX vs nr95 prot. db. (cutoff E < e-8) → 2.3·10^4 DNA fragments
- Merging and extension of fragments
- Construction of gene structure
- BLASTX vs ENSEMBL database
- Removal of all virus derived sequences
Result: 12526 elements (pseudogenes or genes) with sequence similarity to known proteins
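
The E-value cutoff used in the BLASTX steps above can be sketched as follows. This is a minimal illustration only: it assumes BLAST tabular output with the standard 12 columns (E-value in column 11) and reads "E < e-8" as E < 1e-8. The function name `fragments_with_hits` and the sample lines are hypothetical; the original pipeline may have used different tooling.

```python
def fragments_with_hits(tabular_lines, max_evalue=1e-8):
    """Return the IDs of query fragments with at least one hit below the cutoff.

    Assumes 12-column tab-separated BLAST output; column 11 is the E-value.
    """
    kept = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.add(query_id)
    return kept


# Hypothetical sample hits (not data from the talk).
sample = [
    "frag1\tprotA\t85.0\t120\t18\t0\t1\t360\t10\t129\t1e-40\t160",
    "frag2\tprotB\t35.0\t60\t39\t0\t1\t180\t5\t64\t0.002\t38",
    "frag3\tprotC\t60.0\t90\t36\t0\t1\t270\t1\t90\t3e-9\t62",
]
print(sorted(fragments_with_hits(sample)))
```

Only fragments that survive the cutoff move on to the next step, which is why each stage of the funnel reports fewer DNA fragments than the one before.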
Synonymous/non-synonymous (dS/dN) substitution rates of functional and pseudogenic human sequences
[Figure: histogram of % of sequences (0 to 30) against log (dS/dN) (0 to >1.2) for three sets: Pseudogenes reference set (856 seq.), SWISSPROT (1935 seq.), RefSeq (1103 seq.)]
Synonymous/non-synonymous (dS/dN) substitution rates of unannotated regions with homology to known genes
[Figure: histogram of % of sequences (0 to 16) against log (dS/dN) (0 to >1.2), divided into three classes: PSEUDOGENES, UNCERTAIN, GENES]
Analyzed: 693 (19%) + 1858 (50%) + 1161 (31%) = 3712
Total: 4321 + 8205 = 12526
693 novel genes detected; >4300 expected in our set
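
The idea behind the three classes can be sketched in a few lines. A sequence evolving without constraint accumulates synonymous and non-synonymous changes at similar rates, so log (dS/dN) sits near 0, whereas a sequence under purifying selection has dS well above dN. The sketch below assumes a base-10 logarithm, and the two thresholds are hypothetical values chosen for illustration; neither is taken from the slides.

```python
import math


def log_ds_dn(ds, dn):
    """Base-10 log of dS/dN (the log base is an assumption)."""
    if ds <= 0 or dn <= 0:
        raise ValueError("dS and dN must be positive to take the log ratio")
    return math.log10(ds / dn)


def classify(ds, dn, pseudo_max=0.3, gene_min=0.6):
    """Assign a class from log(dS/dN).

    pseudo_max and gene_min are hypothetical thresholds, not from the talk.
    """
    value = log_ds_dn(ds, dn)
    if value < pseudo_max:
        return "pseudogene"
    if value > gene_min:
        return "gene"
    return "uncertain"


print(classify(0.20, 0.20))  # dS/dN = 1
print(classify(0.50, 0.05))  # dS/dN = 10
print(classify(0.30, 0.10))  # dS/dN = 3
```

Keeping an explicit "uncertain" band between the two thresholds mirrors the slide, where a sizeable share of the analyzed sequences is left unassigned.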
E value distribution of pseudogenic, uncertain and functional exons (BLASTX vs nr95 database)
[Figure: counts (0 to 300) against E value (from e-8 down to < e-180) for the 3712 sequences, split into pseudogenes, uncertain and functional]
Comparative genome analysis: Homology-based function prediction
Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence (Henikoff et al. 1997, Science 278, 609)
SMART: Blast-like input
- ID or AC sufficient
- Access to different databases
- Domain annotation
www.smart.embl-heidelberg.de
SMART: Digested output
- signal sequence
- transmembrane regions
- comparison of domain context
Domain organization of TAP
[Figure: TAP (619 aa) with four LRR repeats, an NTF2-like domain and a UBA domain; functional regions labelled RNA-binding, p15-binding and np-bind. p15 (100 aa) with an NTF2-like domain. Sites from random mutagenesis and directed mutagenesis are marked.]
Collaboration with Elisa Izaurralde
Directed mutagenesis confirms predicted TAP/p15 interaction
- Red: loss of binding
- Gray: alanine scan
- Blue: no effect on binding
Top 10 domains* in human
Species              man            fly     worm    yeast   cress
Total no genes       26500(26500)   13300   18200   6100    25700
Immunoglobulin       765(381)       140     64      0       1
C2H2 zinc finger     706(607)       357     151     48      115
Protein kinase       575(501)       319     437     121     1049
Rhod.-like GPCR      569(616)       97      358     0       16
P-loop NTPase        433            198     183     97      331
Rev. transcriptase   350            10      50      6       80
RRM (RNA-binding)    300(224)       157     96      54      255
WD40 (G-protein)     277(136)       162     102     91      210
Ankyrin repeat       276(145)       105     107     19      120
Homeobox             267(160)       148     109     9       118
*Only no of genes given, no of domains higher; note that only around 90% is sequenced
Nature 409 (01) 860; Science 291 (01) 1304
Comparative genome analysis: Context-based function prediction 1. Co-occurrence of genes
Function prediction via genomic context information
Gene context:
- Gene fusion as distinct neighborhood subset
- Conserved gene neighborhood in genomes
- Conserved co-occurrence of genes in species (‘phylogenetic profile’, ‘COG pattern’)
- Surrounding and shared regulatory elements
Knowledge-based context:
- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction / localisation
- Scientific literature
Context methods in Mycoplasma: Fusion, neighborhood, co-occurrence
[Figure: overlap diagram of three methods (Fusion, Co-occurrence in genomes, Conserved neighborhood). MG total: 480 genes. Presence in conserved operons: 213. Values shown in the diagram: 27, 54, 178.]
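Orthology vs paralogy … within homology
[Figure: Genome A carries gene A1 and gene A2; Genome B carries gene B1 and gene B2. Genes within a genome are related by paralogy, corresponding genes across genomes by orthology. History: an ancestral gene gives rise to gene 1 (leading to gene A1 and gene B1) and gene 2 (leading to gene A2 and gene B2).]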
Exploiting the absence of genes (Huynen et al., 1998, FEBS Lett 426, 1-5)
Predicting functional interactions between proteins by the co-occurrence of their genes in genomes
Distribution of four M. genitalium genes among 25 genomes:
MG299 (pta)   0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK)  0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
Using the mutual information between genes as a scoring heuristic for their co-occurrence:
M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
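
The scores on this slide can be reproduced from the four presence/absence profiles. The sketch below uses the textbook definition of mutual information between two binary vectors with natural logarithms; the choice of log base is an assumption, made because it gives values that round to the ones quoted. The function names are illustrative.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length profiles."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def profile(text):
    """Parse a space-separated 0/1 string into a list of ints."""
    return [int(token) for token in text.split()]


# Presence (1) / absence (0) of each gene in the 25 genomes, from the slide.
pta = profile("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
ackA = profile("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
dnaJ = profile("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")
dnaK = profile("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")

print(round(mutual_information(pta, ackA), 2))   # 0.69
print(round(mutual_information(dnaJ, dnaK), 2))  # 0.55
print(round(mutual_information(dnaJ, ackA), 2))  # 0.19
```

Identical profiles score their own entropy, which is why pta/ackA (present in 12 of 25 genomes) scores higher than dnaJ/dnaK (present in 19 of 25): a profile close to half presence, half absence carries more information than one that is nearly universal.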
The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1.
[Figure: Phylogenetic distribution of iron-sulfur cluster assembly proteins across sequenced genomes (among them H. pylori, E. coli and D. melanogaster). Rows: cyaY / Yfh1 (frataxin), hscB / Jac1, hscA / ssq1, iscS / Nfs1, iscU / Isu1-2, iscA / Isa1-2, fdx / Yah1, ORF1, ORF2, ORF3, Nfu1, Arh1.]
Huynen et al. Hum.Mol.Genet 2001
Comparative genome analysis: Context-based function prediction 2. Gene neighbourhood
Genome alignment
Conservation of gene neighborhood: pairwise comparison of 20 prokaryotic genomes
[Figure: scatter plot with axes labelled (log) and (time), two point series; marked genome pairs: MG-MP, EC-HI]
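
One simple way to put a number on gene neighborhood conservation for a genome pair is the fraction of adjacent gene pairs in one genome that are also adjacent in the other. The sketch below is an assumption about how such a measure could be computed, not the method behind the figure: gene names stand in for orthologous groups, strand and chromosome circularity are ignored, and the two gene orders are invented toy data.

```python
def adjacent_pairs(gene_order):
    """Unordered pairs of neighbouring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_neighbour_fraction(order_a, order_b):
    """Fraction of neighbouring pairs in genome A that are also neighbours in B.

    Only pairs whose genes both occur in genome B are counted.
    """
    pairs_a = adjacent_pairs(order_a)
    pairs_b = adjacent_pairs(order_b)
    shared_genes = set(order_a) & set(order_b)
    comparable = {pair for pair in pairs_a if pair <= shared_genes}
    if not comparable:
        return 0.0
    return len(comparable & pairs_b) / len(comparable)


# Toy gene orders (hypothetical, for illustration only).
genome_a = ["trpE", "trpG", "trpD", "trpC", "aroB", "hemK"]
genome_b = ["hemK", "trpD", "trpG", "trpE", "aroB", "trpC"]

print(conserved_neighbour_fraction(genome_a, genome_b))
```

Treating pairs as unordered means an inverted segment still counts as conserved, which suits a measure meant to capture neighborhood rather than exact gene order.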
Nucleotide salvage/degradation pathway in gram-positive bacteria
STRING server for context retrieval: www.bork.embl-heidelberg.de/STRING (Tryptophan biosynthesis)
Gene neighborhood reflects connections between Tryptophan and Shikimate biosynthesis
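Modularity in “genomic association space”
[Figure: network of genes linked by conserved gene neighborhood, with clusters labelled Shikimate pathway and Tryptophan synthesis pathway. Genes shown: tyrA, asd, aroB, truA, aroC, aroE, hemK, hyp, trpF, trpC, trpE, trpG, trpA, trpD, trpB, hyp, 2c-rr.]
Networks based on conserved gene neighborhood reveal ‘natural’ subsystems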
(pseudo)genes: Yan Yuan, Mikita Suyama, David Torrents
SMART: Ivica Letunic, Rich Copley
*Frank (D), Yan (C), Peer (D), *Martijn (NL), Tobias (D), *Gert (D), Richard (UK), Shamil (RU), *Luis (E), *Vassily (RU), *Birgit (D), Mikita (J), Miguel (E), *Jörg (D), Warren (US), Berend (NL), David (E), Ivica (Hr), Caroline (E), Steffen (D), Francesca (I), Jan (D), Parantu (In), Christian (D)
*left EMBL