500 likes | 658 Views
From First Assembly Towards a New Cyberpharmaceutical Computing Paradigm. Sorin Istrail Senior Director, Informatics Research. Stan Ulam’s Vision. “Don’t ask what Mathematics can do for Biology, ask what Biology can do for Mathematics.”. The Gene Counting Problem.
E N D
From First Assembly Towards a NewCyberpharmaceutical Computing Paradigm Sorin Istrail Senior Director, Informatics Research
Stan Ulam’s Vision “Don’t ask what Mathematics can do for Biology, ask what Biology can do for Mathematics.”
The Gene Counting Problem “The more I see the less I know for sure.” John Lennon
The O.J. Simpson Problem No matter how good the data look, it is full of errors! David Botstein, Albuquerque, 1994
Science and Technology • Genomics • Comparative Genomics • Proteomics • Pharmacogenomics • Structural Genomics • Drug and Vaccine Design • DNA Expression Chips • Animal Models
There is Nothing More Important than the Assembly! Gene Myers and Granger Sutton and the Assembly TeamCelera Human Assembly:Largest Non-Defence Computation20,000 CPU hours -- done 5 times now160 Processors -- CompaqArchitecture
Aq 1997 1998 Completed Microbial Genomes Escherichia coli Helicobacter pylori Bacillus subtilis Chlamydia trachomatis (Mycobacterium tuberculosis CSU#93) Methanobacterium thermoautotropicum Saccharomyces cerevisiae Archaeoglobus fulgidus Mycoplasma pneumoniae (Deinococcus radiodurans) Aquifex aeolicus Haemophilus influenzae 1999 1996 Mycoplasma genitalium Borrelia burgdorferi (Thermotoga maritima) (Rickettsia prowazekii) Pyrococcus horikoshii Synechocystis sp. Methanococcus jannaschii Mycobacterium tuberculosis H37Rv Treponema pallidum
GENOMES SEQUENCED AT TIGR and CELERA Pathogens *Haemophilus influenzae Rd *Mycoplasma genitalium *Helicobacter pylori *Borrelia burgdorferi *Treponema pallidum *Plasmodium flaciparum *Neisseria meningitidis *Chlamydia trachomatis *Chlamydia pneumoniae *Vibro cholerae *Streptococcus pneumoniae *Mycobacterium tuberculosis *Porphyromonas gingivalis *Trypanosoma brucei *Staphylococcus aureus *Enterococcus faecalis *Porphyromonas gingivalis *Chlamydia psittaci Environment *Methanococcus jannaschii *Archaeoglobus fulgidus *Thermotoga maritima *Deinococcus radiodurans *Chlorobium tepidum *Caulobacter crescentus *Shewanella putrafaciens *Desulfovibrio vulgaris *Pseudomonas putida Plants *Arabidopsis thaliana Insects **Drosophila melanogaster Mammals **Human **Mouse * The Institute for Genomic Research ** Celera Genomics
Genesis of Celera August 1998 • New 3700 automated DNA Sequencer changed the sequencing possibility • Combined with TIGR Whole Genome Sequencing Strategy • And 64bit computing
Celera Supercomputing Facility • Celera’s system is one of the most powerful civilian super-computing facilities in the world • Currently over 1.5 teraflop of computing power in a virtual compute farm of Compaq processors with 100 terabytes storage • Next phase a 100 teraflop computer
Obstacles to Genome Sequencing • Sequencing reactions produce short reads (~550bp). Sequence read ~550 bases Human Genome ~3 billion bases GCATTA...GACCGT • The human genome is repeat-rich. CGGATAGACATAAC CGGATAGACATAAC CGGATAGACATAAC CAGCAGCAGCAGCA CAGCAGCAGCAGCA CAGCAGCAGCAGCA Many short reads look identical to each other.
Comparison of Sequencing Strategies Lab-Intense(SLOW) • Mapping and Walking • Mapping and Clone by Clone Shotgun • Whole Genome Shotgun with Mate Pairs Compute-Intense (FAST)
Clone by Clone Shotgun sequencing 1) Replicate mapped spans of DNA. Chromosome Mapped span (BAC) 35,000 2) Shear the replicates randomly and sequence the pieces. cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc cgattc Computed overlaps 3) Assemble reads by overlap matching. Infer the original sequence by consensus. Computed sequence cgattcggattctcgattctacgaa • Mapping and Shotgun
Mate-Pair Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs e.g., 10Kbp ± 8% std.dev. CLONE & END SEQUENCE 590bp 10,000bp
Whole Genome Sequencing Approaches ~27 million reads for Human. • Whole Genome Shotgun Sequencing: 10Kbp 2Kbp 50Kbp • Collect 10-15x BAC inserts and end sequence: ~300K pairs for Human. • Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously. BAC 3’ BAC 5’
Building Scaffolds Mated reads 1 in 1015 that a confirmed pair is in error. Confirmed if at least 2 join the same unitigs and one of them is a U-unitig. 1. 2k-10k Scaffolds: Compute all “unitigs” in graph of U-unitigs connected by confirmed mate links. 2. BAC Scaffolds: Compute all “unitigs” in graph of 2/10K scaffolds connected by confirmed BAC links. Scaffold Sequence or Repeat Gaps (with estimated distances)
The Gene Counting Problem The number probably will be never known exactly. Current estimates: 30,000-40,000 Other estimates: 120,000 • sequence analysis • motif recognition • matches to mRNA • computational predictions • mouse data matches • experimental validation Gene discovery:
Gene Counting Random ESTs from tissues CpG islands (55% of known genes are in CpG islands) Complexity of EST data sets -- sampling biased on tissue and depth of collection Underrepresented in data bases: Low abundance genes, in inaccessible tissues or developmental stages Overrepresented: EST data sets are composed of incomplete sequences of mRNA, and non-overlapping pieces of same mRNA
Drosophila Functional Assignment using Gene Ontology 13,601 Genes
Haemophilus vs. Drosophila • HfluDrosophila X • Genome Size (Mbp) 1.8 120 67 • Sequences 26,000 3,100,000 116 • Months in sequencing 4 4 1 • Sequencing Staff 24 50 2.1 • Assembly Group Staff 1 10 10
Human Genome Sequence from 5 Humans (3 females-2 males) completed • Human sequencing started 9/8/99 • Over 39X coverage of the genome in paired plasmid reads • First Assembly announced June 26 2.9 billion bp • Published in Science, February 16, 2001
Validation Against STS-map • Scaffolds were aligned against the BDGP STS-content map • All scaffolds with spanning 2 or more STSs were checked for order discrepancies. • 16 STS sites out of 2175 (.73%) were out of order, well within the estimated error rate of the STS map. 10 have been determined to be incorrect. 4 3L 3R BDGP STS Order 2L 2R X Celera Scaffold and STS Order
Order & Orientation is Essential to Finding Genes Exon 1 Exon 2 Exon 3 Exon 4 But if contigs are not correctly put together: 1 3 reversed 2 4 Exons are shuffled and unoriented, significantly impacting the ability of gene finding programs to make a correct prediction. Users consistently report finding genes that they can’t find elsewhere.
Contactin-associated protein gene (CNTNAP2)Comparison of genomic DNA sequences retrieved from the public working draft and the Celera database Working draft Celera Genomics 73, 108-112, 2001. http://www.idealibrary.com
Scaffold Sizes 25 Mouse WGA Human WGA 20 Human CSA Scaffold Length (Mbp) 15 scaffold length 10 5 0 % of genome percent of genome coverage
Scaffold Sizes Human WGA 80 Mouse WGA Human CSA 60 Drosophila WGA Scaffold Length (Mbp) Celera-only WGA 40 20 0 % of genome
Human and Mouse Assemblies All Mouse WGA 2,446 19,778212 265,000 96.8 95.5 Mouse WGA 2,367 1,779193 242,000 All C-only WGA 2,781 6,500134 *182,000 99.0 98.7 30K C-only WGA 2,754 537118 174,000 AllSpan (Mbp) Scaffolds Gap (Mbp) Gaps 30K % 100K % WGA 2,847 119,000 261 221,000 90.4 88.6 CSA 2,905 53,000 252 170,000 94.6 92.9 30K Span (Mbp) Scaffolds Gap (Mbp) Gaps WGA 2,574 2,507 240 99,000 CSA 2,748 2,845 224 112,000
The Human Genome is NOT THE Book of Life The Blueprint of Humanity The Language of God The Parts List of Humanity
BLAST, FASTA, and SIM4 Sorin Istrail Celera Genomics
BLAST (BasicLocalAlignment SearchTool) • A suite of sequence comparison algorithms optimized for speed used to search sequence databases for optimal local alignments to a protein or nucleotide query Altschul, Gish, Miller, Myers, Lipman “Basic Local Alignment Search Tool”, J.Mol.Biol. 215(3):403-10 (1990) Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, NAR25(17):3389-402 (1997) (and references therein)
The BLAST algorithm • Detect all word hits (exact, or nearly identical matches) of a given length between the two sequences • k=10 for nucleotide sequences (exact word matches) • k=3 for protein sequences (nearly identical word matches) • Extend the word hits in both directions to high-scoring gap-free segment pairs (HSPs) • retain only HSPs that score above a threshold • start from the center of the HSP (original BLAST, 1990), or from the center of a pair of HSPs located close to each other on the same diagonal (gapped BLAST, 1997) • Extend the HSPs in both directions allowing for gaps • use dynamic programming, and stop when the alignment score falls more than a threshold X below the best score yet seen • Report all statistically significant local alignments • E-value (starting with BLAST 2.0) is used to measure the statistical significance • E-value = the number of alignments with score equal to or higher than s one would expect to find by chance when searching the database
FASTA • A program for rapid alignment of pairs of protein and DNA sequences, building a local alignment from matching sequence patterns, or words • Algorithm for comparing a query to a database of sequencesFor each database sequence: • Identify the 10 diagonal regions having the largest number of perfect word matches of a given length • word size: k=1,2 for protein, and k=6-10 for nucleotide searches • Re-score these regions using a given scoring matrix (e.g., PAM250), and trim them to form (gap-free) maximal scoring initial regions • Join (non-overlapping) initial regions from adjacent diagonals to generate longer regions, allowing for gaps • Re-score these based on the initial regions’ scores, assessing a penalty for each joining Align the query sequence to each of the sequences in the search set having the highest overall scores • Pearson and Lipman, “Improved tools for biological sequence comparison”, Proc. Natl. Acad. Sci. • USA 85; 2444-2448 (1988).
Sim4 cDNA Intron Intron genomic sequence GT AG 3’ GT AG 5’ Exon 3 Exon 1 Exon 2 • Aligns an expressed DNA (EST, cDNA, mRNA) sequence with a genomic sequence for that gene, allowing for introns and sequencing errors Florea, Hartzell, Zhang, Rubin, Miller, “Acomputer program for aligning expressed DNA and genomic sequences”, Genome Res 8(9):967:74 (1998)
Stages and algorithmic techniques • Detect basic homology blocks • Determine gap-free matches (HSPs) using a ‘blast’-like homology search • Detect all exact word matches of length k (e.g., k=12) • Extend the word hits in both directions, by substitutions, to gap-free high-scoring segment pairs (HSPs) • Retain only HSPs scoring above a threshold • Connect the HSPs to form larger blocks (‘exon cores’) using sparse dynamic programming • Extend or trim the exon cores to eliminate gaps or overlaps in the cDNA sequence • Extend the similarity blocks using fast greedy sequence comparison algorithms • Detect new exon cores with the ‘blast’-like homology search tuned for higher sensitivity • Refine the introns • Predict the locations of splice junctions using a combined measure of the accuracy of alignment and the intensity of splice signals at the ends of each intron • Generate the spliced alignment • Align the sequences within individual exons using greedy alignment algorithms • Connect the chain of exon alignments by gaps (introns)