
Computational Analysis of Genome Sequences

Computational Analysis of Genome Sequences. Steven Salzberg The Institute for Genomic Research (TIGR) and The Johns Hopkins University. 1995: 1st genome ( H. influenzae , TIGR) 1996: 1st eukaryote ( S. cerevisiae ) 2000: 29 complete microbial genomes 22 in progress at TIGR





Presentation Transcript


  1. Computational Analysis of Genome Sequences Steven Salzberg The Institute for Genomic Research (TIGR) and The Johns Hopkins University

  2. The Genomics Revolution • 1995: 1st genome (H. influenzae, TIGR) • 1996: 1st eukaryote (S. cerevisiae) • 2000: 29 complete microbial genomes • 22 in progress at TIGR • 50+ in progress worldwide • 3 complete eukaryotes: yeast, nematode, fruit fly • 2 major projects in 2000: Human (3.3 billion bp) and Arabidopsis thaliana (125 million bp)

  3. Genomes Completed at TIGR
     Organism (genome size): Reference
     Haemophilus influenzae (1.83 Mb): Fleischmann et al., Science 269, 496-512 (1995).
     Mycoplasma genitalium (0.58 Mb): Fraser et al., Science 270, 397-403 (1995).
     Methanococcus jannaschii (1.7 Mb): Bult et al., Science 273, 1058-73 (1996).
     Helicobacter pylori (1.6 Mb): Tomb et al., Nature 388, 539-47 (1997).
     Archaeoglobus fulgidus (2.1 Mb): Klenk et al., Nature 390, 364-70 (1997).
     Borrelia burgdorferi (1.5 Mb): Fraser et al., Nature 390, 580-6 (1997).
     Treponema pallidum (1.1 Mb): Fraser et al., Science 281, 375-88 (1998).
     Plasmodium falciparum chr2 (1 Mb): Gardner et al., Science 282, 1126-32 (1998).
     Thermotoga maritima (1.8 Mb): Nelson et al., Nature 399, 323-9 (1999).
     Deinococcus radiodurans (3.3 Mb): White et al., Science 286, 1571-7 (1999).
     Arabidopsis thaliana chr2 (19 Mb): Lin et al., Nature 402, 761-8 (1999).
     Neisseria meningitidis (2.3 Mb): Tettelin et al., Science 287, 1809-15 (2000).
     Chlamydia pneumoniae (1.2 Mb): Read et al., Nucleic Acids Res 28, 1397-406 (2000).
     Chlamydia trachomatis (1.0 Mb): Read et al., Nucleic Acids Res 28, 1397-406 (2000).
     Vibrio cholerae (4.0 Mb): Heidelberg et al., Nature, in press.
     Mycobacterium tuberculosis (4.4 Mb): Fleischmann et al., manuscript in preparation
     Streptococcus pneumoniae (2.2 Mb): Tettelin et al., manuscript in preparation
     Caulobacter crescentus (4.0 Mb): Nierman et al., manuscript in preparation
     Chlorobium tepidum (2.1 Mb): Eisen et al., manuscript in preparation
     Porphyromonas gingivalis (2.2 Mb): Fleischmann et al., manuscript in preparation

  4. Genomes in progress at TIGR
     Organism (genome size): Funding source
     Plasmodium falciparum chr 14 (3.4 Mb): BWF/DoD
     Plasmodium falciparum chr 10,11 (4 Mb): NIAID/DoD
     Trypanosoma brucei chr 2 (1 Mb): NIAID
     Enterococcus faecalis (3.0 Mb): NIAID
     Mycobacterium avium (4.4 Mb): NIAID
     Pseudomonas putida (6.2 Mb): DOE
     Shewanella putrefaciens (4.5 Mb): DOE
     Staphylococcus aureus (2.8 Mb): NIAID, MGRI
     Dehalococcoides ethenogenes (1.5 Mb): DOE
     Desulfovibrio vulgaris (3.2 Mb): DOE
     Thiobacillus ferrooxidans (2.9 Mb): DOE
     Chlamydia psittaci GPIC (1.2 Mb): NIAID
     Bacillus anthracis (5.0 Mb): ONR/DOE/NIAID
     Treponema denticola (3.0 Mb): NIDR
     C. hydrogenoformans (2.0 Mb): DOE
     Methylococcus capsulatus (4.6 Mb): DOE
     Geobacter sulfurreducens (4.0 Mb): DOE
     Wolbachia sp (Drosophila) (1.4 Mb): NIH
     Colwellia sp (1.0 Mb): DOE
     Mycobacterium smegmatis (4.0 Mb): NIAID
     Staphylococcus epidermidis (2.5 Mb): NIAID
     Theileria parva (10 Mb): ILRI/TIGR

  5. A Microbial Genome Sequencing Project (www.tigr.org) [Flow diagram] Stages: Random sequencing, Genome Assembly, Annotation, Data Release, Publication. Steps shown in the diagram: Library construction, Colony picking, Template preparation, Sequencing reactions, Base calling, Sequence files, Sample tracking, TIGR Assembler, Genome scaffold, Ordered contig set, Combinatorial PCR, POMP, Gap closure, Sequence editing, Re-assembly (ONE ASSEMBLY!), Gene finding, Homology searches, Initial role assignments, Metabolic pathways, Gene families, Comparative genomics, Transcriptional/translational regulatory elements, Repetitive sequences

  6. Gene Finding • Gene finding plays an ever-larger role in high-speed DNA sequencing projects • There’s no time for much else! • 1000’s of genes generated each month at a high-throughput sequencing facility • Separate gene finders are needed for every organism • Training on organism X, finding genes on Y, generates inferior results • Bootstrapping problem: training data is hard to find

  7. Open Reading Frames: 6 possibilities (identical sequence, read in three frames on each of the two strands)
     Frame 1: TCG TAC GTA GCT AGC TAG CTA
              AGC ATG CAT CGA TCG ATC GAT
     Frame 2: T CGT ACG TAG CTA GCT AGC TA
              A GCA TGC ATC GAT CGA TCG AT
     Frame 3: TC GTA CGT AGC TAG CTA GCT A
              AG CAT GCA TCG ATC GAT CGA T
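The six readings on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original talk; the function names are made up for illustration. Note that the slide prints the second strand as the plain complement (read left to right under the top strand), whereas the code below reads the reverse complement 5' to 3', which is how a gene finder would scan it.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Return the codon lists for the three forward and three reverse frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(seq, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames


top_strand = "TCGTACGTAGCTAGCTAGCTA"  # top strand from the slide
frames = six_frames(top_strand)
for name in sorted(frames):
    print(name, " ".join(frames[name]))
```

Frame +1 contains the stop codon TAG as its sixth triplet, so an open reading frame in that frame would end there.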

  8. GLIMMER: A Microbial Gene Finder • GLIMMER 2.0: released late 1999 • > 200 site licenses worldwide • Works on bacteria, archaea, viruses too • Malaria (eukaryotic) version: GLIMMERM • Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999 • Web site and code: http://www.tigr.org/

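  9. Uniform Markov Models • Use the conditional probability of a sequence position given the previous k positions in the sequence. • Fixed, kth-order model: bigger k's yield better models (as long as data is sufficient). • Probability (score) of sequence s1 s2 s3 … sn is: $P(s) = \prod_{i=1}^{n} P(s_i \mid s_{i-k} \ldots s_{i-1})$, with the context truncated at the start of the sequence for the first k positions.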

  10. Uniform Markov Models • Advantages: • Easy to train. Count frequencies of (k+1)mers in training data. • Easy to assign a score to a sequence. • Disadvantages: • (k+1)mers can be undersampled; i.e., occur too infrequently in training data. • Models sequence as fixed-length chunks, which may not be the best model of biology.
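Training by counting (k+1)-mers and scoring a sequence can be sketched as follows. This is an illustrative sketch only: the function names and the add-one pseudocount (used so unseen (k+1)-mers do not get probability zero) are assumptions, not details from the slides. The first k bases of a sequence are skipped when scoring because they lack a full context.

```python
import math
from collections import Counter


def train(sequences, k):
    """Count every k-base context and every (k+1)-mer in the training sequences."""
    context_counts = Counter()
    kmer_counts = Counter()
    for s in sequences:
        for i in range(k, len(s)):
            context_counts[s[i - k:i]] += 1
            kmer_counts[s[i - k:i + 1]] += 1
    return context_counts, kmer_counts


def prob(model, context, base, pseudo=1.0):
    """Estimate P(base | context); the pseudocount is an assumption of this sketch."""
    context_counts, kmer_counts = model
    return (kmer_counts[context + base] + pseudo) / (context_counts[context] + 4 * pseudo)


def log_score(model, seq, k):
    """Sum of log conditional probabilities over positions that have a full context."""
    return sum(math.log(prob(model, seq[i - k:i], seq[i])) for i in range(k, len(seq)))


model = train(["ACGACGACG"], k=2)
print(prob(model, "AC", "G"))          # context "AC" was always followed by "G"
print(log_score(model, "ACGACG", 2))   # resembles the training data
print(log_score(model, "ACTACT", 2))   # does not, so it scores lower
```

The undersampling problem from the slide is visible here: with k = 2 there are only 64 possible 3-mers, but the number of (k+1)-mers grows fourfold with every increase in k, so counts for long contexts quickly become too sparse to trust.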

  11. Interpolated Markov Models • Use a linear combination of Markov chains of different orders (here, orders 0 through 8); for example: • c8 P(g|atcagtta) + c7 P(g|tcagtta) + … • + c1 P(g|a) + c0 P(g) • where c0 + c1 + … + c8 = 1 • Equivalent to interpolating the results of multiple Markov chains • Score of a sequence is the product of interpolated probabilities of bases in the sequence
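The interpolation itself is a weighted sum, and the sequence score is the product of those sums (computed in log space here to avoid underflow). The sketch below assumes the per-order probabilities have already been estimated; the function names and the example numbers are hypothetical.

```python
import math


def interpolate(probs, weights):
    """Combine P(base | context) estimates from several orders into one probability.

    probs[i] and weights[i] belong to the same model order; weights must sum to 1.
    """
    if len(probs) != len(weights):
        raise ValueError("need one weight per model order")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(c * p for c, p in zip(weights, probs))


def sequence_log_score(per_base_probs, weights):
    """Log of the product of interpolated probabilities, one entry per base."""
    return sum(math.log(interpolate(p, weights)) for p in per_base_probs)


# Hypothetical estimates for one base from order-0, order-1 and order-2 models.
p = interpolate([0.25, 0.40, 0.70], [0.1, 0.3, 0.6])
print(p)
```

This version uses one fixed weight per order for simplicity; in an interpolated model the weights can differ from context to context.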

  12. IMM’s vs. Fixed-Order Models • Performance: • IMM should always do at least as well as fixed-order. • E.g., even if kth-order model is correct, it can be simulated by (k+1)st-order • Our results support this. • IMM result can be used as fixed-order model. • IMM slightly harder to train and uses more memory.

  13. IMM Training • Problem: How to determine the weights of all the thousands of k-mers? • Traditionally done with E-M algorithm using cross-validation (deleted estimation). • Slow. • Overtraining can be a problem.

  14. GLIMMER IMM Training • Our approach assumes: • Longer context is always better • Only reason not to use it is undersampling in training data. • If sequence occurs frequently enough in training data, use it, i.e., λ = 1 • Otherwise, use frequency and χ² significance to set λ.
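One way to read the rule on this slide is sketched below. The slide gives no numbers, so everything numeric here is an assumption made for illustration: the count threshold of 400, the 0.5 confidence cutoff, and the linear scaling of λ. The χ² confidence (how strongly the longer context's base distribution differs from the shorter context's prediction) is taken as an input rather than computed. The function names are hypothetical.

```python
def context_weight(count, chi2_confidence, threshold=400):
    """Choose lambda for one context.

    count           -- how often the context occurs in the training data
    chi2_confidence -- confidence (0..1) that this context predicts differently
                       from the next-shorter context
    threshold       -- ASSUMED count at which the context is fully trusted
    """
    if count >= threshold:
        return 1.0                      # frequent enough: use the long context as-is
    if chi2_confidence < 0.5:           # ASSUMED cutoff: no evidence it adds information
        return 0.0
    return chi2_confidence * count / threshold  # ASSUMED scaling by frequency


def imm_prob(levels):
    """Blend estimates from the shortest context upward.

    levels is a list of (probability, lambda) pairs ordered from order 0 to the
    longest context; the lambda of the first entry is ignored.
    """
    p = levels[0][0]
    for prob_k, lam in levels[1:]:
        p = lam * prob_k + (1.0 - lam) * p
    return p


print(context_weight(500, 0.1))   # 1.0: well sampled
print(context_weight(100, 0.3))   # 0.0: rare and not significantly different
print(imm_prob([(0.25, 0.0), (0.40, 0.5)]))
```

With this scheme a rare long context contributes only in proportion to how often it was seen and how confidently it differs from the shorter context, which matches the slide's point that undersampling is the only reason to back off.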

  15. How GLIMMER Works • Three separate programs: • long-orfs: automatically extract long open reading frames that do not overlap other long orfs. • IMM model builder. Takes any kind of sequence data. • Gene predictor. Takes genome sequence and finds all the genes.

  16. Gene Predictor • Finds & scores entire ORF’s. • Uses 7 competing models: 6 reading frames plus “random” model. • Score for an ORF is the probability that the “right” model generated it. • 3-periodic Markov model • High-scoring ORF’s are then checked for overlaps.
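The score described here, the probability that the "right" model generated the ORF, can be sketched as a normalization over the seven competing models. The sketch assumes equal prior probability for each model and takes the per-model log-likelihoods as given; the model names and numbers are hypothetical.

```python
import math


def model_posteriors(log_likelihoods):
    """Turn per-model log-likelihoods into probabilities that sum to 1.

    Assumes every model is equally likely a priori. Subtracting the maximum
    before exponentiating keeps the computation numerically stable.
    """
    top = max(log_likelihoods.values())
    scaled = {name: math.exp(v - top) for name, v in log_likelihoods.items()}
    total = sum(scaled.values())
    return {name: v / total for name, v in scaled.items()}


# Hypothetical log-likelihoods of one ORF under six frame models and a random model.
scores = {
    "+1": -100.0, "+2": -112.0, "+3": -115.0,
    "-1": -118.0, "-2": -120.0, "-3": -117.0,
    "random": -110.0,
}
posteriors = model_posteriors(scores)
print(round(posteriors["+1"], 4))
```

An ORF would be reported as a candidate gene when the model for its own frame receives a high share of the total.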

  17. Glimmer 2.0 IMM design [Diagram: a context tree, 8 levels deep, over a 12 bp window; each node selects a context position (Pos -1, Pos -2, Pos -3, Pos -4, ...) and branches on a, t, c, g; example sequence ATGCATGATCGAG]

  18. Better Overlap Resolution

  19. Better Overlap Resolution

  20. GLIMMER 2.0’s Performance
      Organism         Genes Annotated   Genes Found      Additional Genes
      H. influenzae    1738              1720 (99.0%)     250 (14%)
      M. genitalium    483               480 (99.4%)      81 (17%)
      M. jannaschii    1727              1721 (99.7%)     221 (13%)
      H. pylori        1590              1550 (97.5%)     293 (18%)
      E. coli          4269              4158 (97.4%)     824 (19%)
      B. subtilis      4100              4030 (98.3%)     586 (14%)
      A. fulgidus      2437              2404 (98.6%)     274 (11%)
      B. burgdorferi   853               843 (99.3%)      62 (7%)
      T. pallidum      1039              1014 (97.6%)     180 (17%)
      T. maritima      1877              1854 (98.8%)     190 (10%)

  21. GLIMMER 2.0 on known genes
      Organism         Genes Annotated   Known Genes   Correct Predictions
      H. influenzae    1738              1501          1496 (99.7%)
      M. genitalium    483               478           476 (99.6%)
      M. jannaschii    1727              1259          1256 (99.8%)
      H. pylori        1590              1092          1084 (99.3%)
      E. coli          4269              2656          2632 (99.1%)
      B. subtilis      4100              1249          1231 (98.6%)
      A. fulgidus      2437              1799          1786 (99.3%)
      B. burgdorferi   853               601           600 (99.8%)
      T. pallidum      1039              755           747 (98.9%)
      T. maritima      1877              1504          1493 (99.3%)
      Average                                          (99.3%)

  22. Speed • Training for 2 Mb genome: < 1 minute (on a Pentium-450) • Find all genes in 2 Mb genome: < 1 minute • Impact: GLIMMER was used for: • B. burgdorferi (Lyme disease), T. pallidum (syphilis) (TIGR) • C. trachomatis (blindness, STD) (Berkeley/Stanford) • C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF) • T. maritima, D. radiodurans, M. tuberculosis, V. cholerae, S. pneumoniae, C. trachomatis, C. pneumoniae, N. meningitidis (TIGR) • X. fastidiosa (Brazilian consortium) • Plasmodium falciparum (malaria) [GlimmerM] • Arabidopsis thaliana (model plant) [GlimmerM] • Others: viruses, simple eukaryotes, more bacteria

  23. Self-Similarity Scans • Idea: analyze a whole genome by counting 3-mers in all 6 frames • Analyze small windows (2000 bp, 10000 bp) using the same statistic • Algorithm: • Build model of entire sequence • Apply the χ² statistic to compare windows to the genome itself
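A compact sketch of such a scan follows. It is an interpretation of the slide rather than the original code: overlapping 3-mers are counted on both strands (which covers all six frames), and each window is compared to the whole genome with a standard Pearson χ² statistic. Function names, window size and the toy genome are illustrative assumptions.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trimer_counts(seq):
    """Count overlapping 3-mers on both strands, i.e. in all six frames."""
    reverse = seq.translate(COMPLEMENT)[::-1]
    counts = Counter()
    for strand in (seq, reverse):
        for i in range(len(strand) - 2):
            counts[strand[i:i + 3]] += 1
    return counts


def chi_square(window_counts, genome_counts):
    """Pearson chi-square of a window's 3-mer counts against genome-wide frequencies."""
    n_window = sum(window_counts.values())
    n_genome = sum(genome_counts.values())
    stat = 0.0
    for kmer, genome_count in genome_counts.items():
        expected = n_window * genome_count / n_genome
        stat += (window_counts[kmer] - expected) ** 2 / expected
    return stat


def scan(genome, window, step):
    """Return (start, chi-square) for each window along the genome."""
    genome_counts = trimer_counts(genome)
    return [
        (start, chi_square(trimer_counts(genome[start:start + window]), genome_counts))
        for start in range(0, len(genome) - window + 1, step)
    ]


# Toy genome: a uniform background with an atypical 40 bp island in the middle.
genome = "ACGT" * 50 + "A" * 40 + "ACGT" * 50
results = scan(genome, window=40, step=40)
print(max(results, key=lambda r: r[1]))  # the island stands out
```

Windows whose composition differs sharply from the genome as a whole produce peaks in the statistic, which is the signal used to flag candidate horizontally acquired regions.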

  24. Haemophilus influenzae (meningitis) [Plot: χ² and GC%]

  25. Thermotoga maritima (hyperthermophile)

  26. Vibrio cholerae (cholera)

  27. On the other side of CTXφ prophage is a region encoding an RTX toxin (rtxA) and its activator (rtxC) and transporters (rtxBD). A third transporter gene has been identified that is a paralog of rtxB, and is transcribed in the same direction as rtxBD. Downstream of this gene are two genes encoding a sensor histidine kinase and response regulator. Trinucleotide composition analysis suggests that the RTX region was horizontally acquired along with the sensor histidine kinase/response regulator, suggesting these regulators effect expression of the closely linked RTX transcriptional units. --Heidelberg et al., Nature, in press.

  28. MUMmer • Aligns 2 complete genomes • Maximal Unique Matches • Suffix trees • Very fast alignment of very long DNA sequences • Ref: Delcher et al., Nucl. Acids Res., 1999 • Software at: http://www.tigr.org/softlab

  29. The Problem • Efficiently compute alignments between long sequences to identify biologically interesting features. • E.g., two strains of M. tuberculosis, each ~4.4 Mb • E.g., two versions of a genome at different stages of closure • Compute alignment in less than 2 minutes

  30. Maximal Unique Sequences • Sequences in genomes A and B that: • Occur exactly once in A and in B • Are not contained in any larger such sequence
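The definition can be checked directly on short strings with a brute-force search. This is a sketch for illustration only: it is far too slow for genomes, and it is not the suffix-tree method the talk describes. The function names and the minimum length parameter are assumptions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left: not maximal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            word = a[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, word) == 1 and count_occurrences(b, word) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("GATTACAGGG", "CCGATTACAT"))
```

Here the only match of length 3 or more that is maximal and unique in both strings is GATTACA, starting at position 0 in the first string and position 2 in the second.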

  31. Select the longest consistent set of MUMs • Occur in the same order in A and B
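Selecting MUMs that occur in the same order in both genomes is a longest-increasing-subsequence problem. The sketch below is one simple reading of "longest": it maximizes the total matched length with a quadratic dynamic program and ignores overlaps between neighbouring MUMs. Both choices are assumptions of this sketch, as are the function name and the example coordinates.

```python
def longest_consistent_chain(mums):
    """Pick MUMs that increase in both genomes, maximizing total matched length.

    mums is a list of (pos_in_a, pos_in_b, length) tuples.
    """
    if not mums:
        return []
    ordered = sorted(mums)                  # sort by position in genome A
    best = [m[2] for m in ordered]          # best chain weight ending at each MUM
    prev = [-1] * len(ordered)              # back-pointers for reconstruction
    for i, (a_i, b_i, len_i) in enumerate(ordered):
        for j in range(i):
            a_j, b_j, _ = ordered[j]
            if a_j < a_i and b_j < b_i and best[j] + len_i > best[i]:
                best[i] = best[j] + len_i
                prev[i] = j
    # Walk back from the heaviest chain end.
    i = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


# Hypothetical MUMs: the second and third cross each other, so only one can be kept.
mums = [(0, 0, 10), (20, 50, 8), (30, 20, 12), (60, 70, 5)]
print(longest_consistent_chain(mums))
```

The MUMs left out of the chain are the "inconsistent" ones; they are candidates for rearrangements or repeats rather than noise.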

  32. Suffix Trees • A tree with edges labelled by strings • Labels of child edges of a node begin with distinct letters • Each leaf L represents a sequence: the labels on the path to L from the root • Holds all suffixes of a set of sequences • A suffix is a subsequence that extends to the end of its sequence • The suffix tree for sequences A and B: • Contains less than 2(|A| + |B|) nodes. • Can be constructed in O(|A| + |B|) time! • Still need lots of RAM • All the analyses here were run on a desktop PC

  33. Analyze the gaps between adjacent MUMs • Small gaps can be aligned with Smith-Waterman algorithm • Large gaps can be aligned recursively • Large inserts can be searched for separately. Many will be inconsistent MUMs • Overlapping MUMs indicate variation in copy number of small repeats
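For the small gaps, the Smith-Waterman recurrence is short enough to show in full. This sketch computes only the best local alignment score, not the alignment itself, and the scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions rather than parameters from the talk.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTT", "ACTT"))  # one gap bridges the extra G
```

Because the table has one cell per pair of positions, the cost grows with the product of the two gap lengths, which is why this step is reserved for small gaps and larger ones are handled recursively.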

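  34. M. tuberculosis CSU93 vs. H37Rv [Figure label: a MUM] Counts by base pair (the slide does not say which strain is the row and which the column; the diagonal is empty):
           A     C     G     T
      A    -    66   164     9
      C   48     -    81   169
      G  164    89     -    44
      T   11   159    61     -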

  35. M. genitalium vs. M. pneumoniae

  36. H. pylori 26695 vs. J99

  37. V. cholerae (forward) vs. E. coli Origin

  38. V. cholerae (reverse) vs. E. coli

  39. V. cholerae (both strands) vs. E. coli: a puzzle?

  40. V. cholerae vs. itself

  41. S. pyogenes vs. S. pneumoniae

  42. S. pyogenes vs. itself

  43. M. leprae vs M. tuberculosis M. tuberculosis M. leprae

  44. X-alignments: how? [Diagram: genome segments numbered 1 to 6 arranged around the origin (Ori), shown in several different orders]

  45. Chr 2 vs. Chr 4 of Arabidopsis thaliana: discovery of a 4 Mb duplication • 1100 genes • 430 (39%) duplicated

  46. Acknowledgements • GLIMMER, GLIMMERM • Arthur Delcher, Simon Kasif, Owen White, Mihaela Pertea • MUMmer • Arthur Delcher, Simon Kasif, Jeremy Peterson, Rob Fleischmann, Owen White • Analyses • Numerous TIGR faculty and staff, including: Jonathan Eisen, Owen White, Rob Fleischmann, Hervé Tettelin, Tim Read, Maria Ermolaeva, John Heidelberg, Ian Paulsen, Malcolm Gardner, Claire Fraser, Clyde Hutchison, ... • Supported by: • National Institutes of Health (NHGRI, NLM) • National Science Foundation (CISE, BIO)
