
bork@embl.de bork.embl-heidelberg.de/

Proteome analysis in silico. Peer Bork, EMBL & MDC, Heidelberg & Berlin. bork@embl.de http://www.bork.embl-heidelberg.de/



  1. Proteome analysis in silico Peer Bork EMBL & MDC Heidelberg & Berlin bork@embl.de http://www.bork.embl-heidelberg.de/

  2. ‘omes: use and misuse. Original intention, exemplified by the genome: ‘ome: the entirety of biomolecular objects (ALL genes etc.); ‘omics: research on an entirety of biomolecular objects; Proteomics: research on the entirety of proteins (so far in an organism), coined at the beginning of the 90s. Common practice: ‘omics is used to describe large-scale approaches (whereby large is sometimes 1); Proteomics is used for research on many proteins (whereby many might mean 3).

  3. Originally two main directions: protein profiling and interaction proteomics. Protein profiling: establishment of protein inventories under controlled conditions (organelles, tissues, organisms). Interaction proteomics: identification of temporally and spatially defined functional modules formed by proteins. Bioinformatics analysis is essential in both areas.

  4. Proteome analysis in silico. Part I: Protein detection and annotation by homology and orthology (function in 1D). Part II: Protein interactions and protein networks (function in 2D); temporal and spatial considerations (function in 3D+4D).

  5. Bork et al. J Mol Biol 1998 • Genome annotation • Alternative splicing • Domain analysis • Protein networks • Literature mining coupled to genomic data

  6. 70% prediction accuracy is great!

  7. Concepts in function prediction. Homology-based (intrinsic molecular features): sequence and domain DBs (Blast, Pfam, SMART); function transfer by orthology. Gene context (functional associations): gene neighbourhood, fusion, co-occurrence; shared regulatory elements. Other (residue level, functional class): correlated mutations; interaction threading; feature analysis.
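
A minimal sketch of the gene co-occurrence idea listed under gene context: genes whose presence/absence patterns across genomes are similar are candidate functional associates. The gene names, species names, Jaccard measure and 0.8 cutoff are assumptions made for illustration and are not taken from the talk.

```python
from itertools import combinations


def profile_similarity(a, b):
    """Jaccard similarity between two presence/absence profiles (sets of genomes)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurring_pairs(profiles, min_similarity=0.8):
    """Return (gene1, gene2, similarity) for pairs with similar profiles."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        s = profile_similarity(profiles[g1], profiles[g2])
        if s >= min_similarity:
            pairs.append((g1, g2, s))
    return pairs


# Toy, made-up profiles: the set of genomes in which each gene has an ortholog.
profiles = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3"},
}
print(cooccurring_pairs(profiles))  # [('geneA', 'geneB', 1.0)]
```

A real analysis would need to correct for phylogenetic relatedness of the genomes, since closely related species inflate profile similarity.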

  8. I. Homology-based protein annotation • Homology detection and domain annotation • Metazoan genome annotation: the dark side… • Metazoan proteome analysis: human vs chicken • Evolution of protein function. www.bork.embl-heidelberg.de

  9. Status of homology-based function prediction: many homologues, an increasing number of predictable folds, but tough times for automatic function prediction

  10. Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence. Henikoff et al. 1997 Science 278, 609

  11. History of signaling domain discovery. Systematic discovery by 1) searching ‘in between’ regions, 2) starting with repeats. Doerks et al. 2002 Genome Res.; Ponting et al. 2001 Genome Res.

  12. Domain discovery in disease genes

  13. SMART: Blast-like input • Access to different databases • Domain annotation & architecture • Alerting. Collaboration with Chris Ponting. www.smart.embl-heidelberg.de

  14. SMART: digested output • Signal sequence, coiled coil and TM • Pfam integrated • Comparison of domain context. www.smart.embl-heidelberg.de

  15. MIT: a putative transport-associated microtubule-binding domain. Unifying disorders associated with hereditary spastic paraplegia? MIT-containing proteins shown: RSK-like protein • Similar to ribosomal protein S6 kinase • Calpain7 • CG8866 • Spartin (marked: mutation) • Plant-related MIT • Spastin • SKD1 protein • VPS4p ATPase (vacuolar protein sorting factor 4A and 4B) • Tobacco mosaic virus helicase domain-binding protein • Sorting nexin 15. Patel, H. et al. Nat Genet 31(02)347; Ciccarelli, F. D., et al. Genomics 81(03)437

  16. I. Homology-based genome annotation • Homology detection and domain annotation • Metazoan genome annotation: the dark side… • Metazoan proteome analysis: human vs chicken • Evolution of protein function. www.bork.embl-heidelberg.de

  17. Number of human genes in time. [Figure: estimated number of human genes (in thousands, axis 0-120) from Feb00 to Jan05 (ticks: Feb00, Aug00, Oct00, Dec00, Feb01, Apr01, Jan05), with estimates attributed to HGS, Incyte and co; textbooks and public opinion; Celera; HGP; others; annotated "Basis for Feb 01 publications"; plotted values include 52, 39, 38, 32, 27, 24, 22, 21; overlaid with the TecDAX and NEMAX50 indices.]

  18. Improvement of gene cluster predictions. Mouse chr4:94-94.6 Mb p450 (CYP2J) region: 8 genes / 11 pseudogenic fragments (comparison performed in 2004). [Figure: gene labels cyp2j13, cyp2j6, cyp2j9, cyp2j5; tracks: Known genes, ESTs, Manual (8 genes), Twinscan (1 gene), GeneID (3 genes), fgenesh++ (13 genes), ENSEMBL (9 genes).]

  19. BLAST2GENE finds independent gene copies. BLAST of cyp2j13 protein vs. mouse chr4:94-94.6 Mb: ~150 alignments. [Figure: BLAST alignments along the region and the gene copies resolved by BLAST2GENE; per-copy labels and coordinates not recoverable.] Hundreds of often considerable differences to current gene prediction pipelines!

  20. Annotation of pseudogenes changes gene numbers. 1. Similarity search in intergenic regions: masking of known repeats and already predicted genes • BLASTX vs nr protein db, E-value < 0.001 • Exclusion of transposon- and virus-derived sequence • 1.5-2 million fragments with significant sequence similarity • Merging of fragments of the same element • Regions containing independent elements • Closest known protein (first BLAST hit) • GENEWISE • Ka/Ks functionality check. Ca. 20,000 detectable pseudogenes in each of human, mouse, rat. Torrents, Suyama, Bork Genome Res. 13(2003)2550
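
The filtering and merging steps on this slide can be sketched roughly as follows. This is an illustrative simplification, not the published pipeline: the `Hit` record, the `max_gap` distance and the toy coordinates are assumptions, strand is ignored, and only the E-value threshold of 0.001 comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    start: int      # genomic start of the BLASTX fragment
    end: int        # genomic end of the fragment
    protein: str    # closest known protein (first BLAST hit)
    evalue: float


def merge_fragments(hits, max_evalue=1e-3, max_gap=5000):
    """Keep significant hits and merge nearby fragments matching the same protein."""
    kept = sorted(
        (h for h in hits if h.evalue < max_evalue),
        key=lambda h: (h.protein, h.start),
    )
    regions = []
    for h in kept:
        last = regions[-1] if regions else None
        if last and last["protein"] == h.protein and h.start - last["end"] <= max_gap:
            # Same protein and close by: treat as another fragment of one element.
            last["end"] = max(last["end"], h.end)
            last["fragments"] += 1
        else:
            # Different protein or too far away: start an independent element.
            regions.append(
                {"protein": h.protein, "start": h.start, "end": h.end, "fragments": 1}
            )
    return regions


# Toy, made-up hits.
hits = [
    Hit(100, 400, "P1", 1e-10),
    Hit(900, 1200, "P1", 1e-5),
    Hit(50000, 50300, "P1", 1e-8),
    Hit(2000, 2300, "P2", 0.5),   # not significant, dropped
]
regions = merge_fragments(hits)
print(regions)
```

With these toy hits, the first two P1 fragments merge into one region and the distant third fragment becomes an independent element; each region would then go on to the GENEWISE and Ka/Ks steps.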

  21. Annotation of pseudogenes changes gene numbers. 2. Consistency check of gene predictions. [Figure: predicted gene at Mm chr1:7608644-7681026 (80 kb, exons e1-e6) shown with two processed pseudogenes from Genewise predictions using sptrembl|Q9HBM5 and SwissProt|RS2_RAT; stop codons or frameshifts marked.] Still >3000 pseudogenes among the predicted human genes in mid 2004 (build 34). Arrays, chips et al.: 20% off?
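
As a rough illustration of the kind of consistency check meant here, the sketch below flags predicted coding sequences that contain an in-frame premature stop codon or whose length is not a multiple of three. It is a crude stand-in under stated assumptions, not what Genewise does (Genewise aligns a protein to genomic DNA and models frameshifts explicitly); the function name and test sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def disruptions(cds):
    """List simple signs that a predicted CDS may be a pseudogene."""
    cds = cds.upper()
    issues = []
    if len(cds) % 3 != 0:
        issues.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is allowed to be a stop; any earlier stop is premature.
    for idx, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            issues.append(f"premature stop codon {codon} at codon {idx + 1}")
    return issues


print(disruptions("ATGAAATAA"))      # intact toy ORF: []
print(disruptions("ATGTGAAAATAA"))   # premature stop at codon 2
```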

  22. What do we count? 20-40k genes, >100k transcripts, >1000k proteins? Protein diversity

  23. Rate of detectable alternative splicing depends on EST coverage and library range. [Figure: AS per mRNA (x), axis range 2.0-2.8.] Brett et al. Nature Genet. 30(2002)29

  24. www.bork.embl-heidelberg.de Boue et al. Bioessays 03

  25. Homology-based predictions of exons and alternative transcripts (www.smart.embl-heidelberg.de) SMART domain DB links to genomes

  26. Top 10 domains* in human: 30% diff.! Nature 409 (01)860; Science 291(01)1304
      Domain               human         fly     worm
      Total no. of genes   26500 (26500) 13300   18200
      Immunoglobulin       765 (381)     140     64
      C2H2 zinc finger     706 (607)     357     151
      Protein kinase       575 (501)     319     437
      Rhod.-like GPCR      569 (616)     97      358
      P-loop NTPase        433           198     183
      Rev. transcriptase   350           10      50
      RRM (RNA-binding)    300 (224)     157     96
      WD40 (G-protein)     277 (136)     162     102
      Ankyrin repeat       276 (145)     105     107
      Homeobox             267 (160)     148     109
      *Only the number of genes is given; the number of domains is higher. Note that only around 90% is sequenced.

  27. Metazoan genome annotation is an ongoing process and far from complete • >2000 pseudogenes in mammalian gene sets: only now are they about to be included in prediction pipelines • Ca. 150 retro-related genes in mammalian gene sets (>1000 in 2004), but true human genes are sometimes suppressed • Annotation of gene clusters needs considerable improvement • Alternative splicing is still a major unknown • Considerable human factor in annotation

  28. I. Homology-based genome annotation • Homology detection and domain annotation • Metazoan genome annotation: the dark side… • Metazoan proteome analysis: human vs chicken • Evolution of protein function. www.bork.embl-heidelberg.de

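  29. [Figure: species tree with approximate divergence times; labels in figure order: human, 5, chimp, 75, mouse, 40, rat, 310 MY, chicken, 450 MY, fugu, 600-1200 MY?, C. elegans, D. melanogaster, ?, 250 MY, mosquito.] Genome publications: Human: Nature Feb 2001; Mosquito: Science Oct 2002; Mouse: Nature Dec 2002; Rat: Nature Apr 2004; Chicken: Nature Dec 2004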

  30. Chicken genome analysis. Hillier et al. Nature 04; Zdobnov et al. Science 02. [Figure labels: 15%, 45%]

  31. Chicken genome analysis: orthology and cellular processes. 75.4% identity (median) between chicken and human 1:1 orthologs. Immune response evolves fastest
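
For readers unfamiliar with how a figure like the median identity is derived, here is a minimal sketch: percent identity is computed per pairwise alignment of a 1:1 ortholog pair, and the median is then taken across pairs. The toy alignments are made up, and ignoring gapped columns is one common convention assumed here; the talk does not state which definition was used.

```python
from statistics import median


def percent_identity(aln_a, aln_b):
    """Percent identical residues over aligned (non-gap) columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aln_a, aln_b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy, made-up pairwise alignments of hypothetical 1:1 ortholog pairs.
ortholog_alignments = [
    ("MKT-LV", "MRTALV"),   # 4 of 5 aligned columns identical
    ("ACDE", "ACDE"),       # identical
    ("AC", "AD"),           # 1 of 2 identical
]
identities = [percent_identity(a, b) for a, b in ortholog_alignments]
print(identities, median(identities))
```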

  32. Chicken genome analysis: Innovation and Expansion of domain families www.bork.embl-heidelberg.de

  33. Orthology analysis reveals more subtle functional changes

  34. Evolution by duplication: burst of an olfactory receptor family … thought to recognize MHC diversity … 221 copies in chicken … given ca. 300 ORs in chicken and 450 in human. [Figure labels: chicken, human]

  35. Chicken genome analysis: Evolution of function by domain accretion Scavenger receptor cysteine-rich domain acquired by a fibrinogen-domain containing protein (identified and displayed by SMART)

  36. I. Homology-based genome annotation • Homology detection and domain annotation • Metazoan genome annotation: the dark side… • Metazoan proteome analysis: human vs chicken • Evolution of protein function. www.bork.embl-heidelberg.de

  37. Phylogenetic Distribution of orthologs - Losses
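
One simple way to read candidate losses off a phylogenetic distribution of orthologs is sketched below: an orthologous group is flagged when it is absent from every species of a clade but present in several species outside it. The group names, species sets and the `min_outside` threshold are hypothetical, and this presence/absence rule is an assumption for illustration; it cannot distinguish true loss from missed gene predictions.

```python
def candidate_losses(presence, clade, min_outside=2):
    """Groups absent from the whole clade but present in >= min_outside other species."""
    clade = set(clade)
    losses = []
    for group, species in sorted(presence.items()):
        species = set(species)
        absent_in_clade = not (species & clade)
        if absent_in_clade and len(species - clade) >= min_outside:
            losses.append(group)
    return losses


# Toy, made-up orthologous groups and the species in which each is found.
presence = {
    "OG1": {"human", "mouse", "worm"},
    "OG2": {"human", "fly", "mosquito"},
    "OG3": {"worm"},
}
diptera = {"fly", "mosquito"}
print(candidate_losses(presence, diptera))  # ['OG1']
```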

  38. Gene loss in Diptera. [Figure labels: D A P Y W H M]

  39. Functional changes at evolutionary time scales Orthologs mapped onto metazoan phylogeny

  40. Summary (homology-based function prediction) • Emphasis in homology-based genome annotation shifts from sensitivity (e.g. domain identification) to selectivity issues (orthology assignment for 1:1 function transfer) • Metazoan genome annotation is far from complete, and caution is needed when using incomplete and partially erroneous parts lists (e.g. when predicting networks) • Yet, with the growing number of metazoan genomes, our understanding of functional diversification at the protein level will increase dramatically ... although the proteome remains far from being deciphered
