Announcements

Announcements • Lab tomorrow morning at 9:30 • For Tuesday: read wooly mammoth genome paper and bring a printed version of three questions/topics for discussion • Discussion leader selection

Sequencing and Assembly The Relic, 1997 GEN875, Genomics and Proteomics, Fall 2007

Questions • What do we sequence? • Why do we sequence it? • How do we sequence it?

Explosive Growth in Sequencing 8/22/2005 Press Release: INSD (GenBank, EMBL, DDBJ) reaches 100 Gigabase milestone

What do we sequence? • Genomes (de novo, resequencing) • Metagenomes or complex samples • Transcripts

NCBI Genomes http://www.ncbi.nlm.nih.gov/Genomes/
Comparison of data from 9/5/07, 9/4/06 and 8/31/05
Category                          9/5/07   9/4/06   8/31/05
Eukaryotic genomes, complete          25       22        20
Eukaryotic genomes, assembly         162      109        72
Eukaryotic genomes, in progress      235      299       166
Prokaryotic genomes, complete        567      371       254
Prokaryotic genomes, in progress     841      615       433

Why do we sequence genomes?

Why do we sequence genomes? • To catalog all the genes present in one organism. • To compare the gene content of one organism to another organism. • To study features other than genes. • To study genome evolution. • To study organismal evolution. • As a foundation for future experimentation.

Complete vs. Draft genomes? • To catalog all the genes present in one organism. • To compare the gene content of one organism to another organism. • To study features other than genes. • To study genome evolution. • To study organismal evolution. • As a foundation for future experimentation.

DNA sample → Sequencing Center → GATCGATCGATC… + Annotation • Sequencing centers: JCVI/TIGR, Broad Institute, JGI (Joint Genome Institute), Washington University, Baylor College of Medicine, Sanger Centre, etc.

Strategies • Clone and sequence (select or random) clones using Sanger sequencing • Massively parallel pyrosequencing (454) • “proprietary Clonal Single Molecule Array technology and novel reversible terminator-based sequencing” (Solexa)

Method Comparison

Library construction and sequencing • High-throughput Steps • Workflow, in order: • Bacterial culture • Isolate DNA • Physical fragmentation • Size selection • Ligate randomly into vectors • Transformation • Plate on agar • Pick and grow individual colonies • Isolate cloned constructs • Cycle sequencing

Library construction and sequencing • High-throughput Steps • Workflow, in order: bacterial culture, isolate DNA, physical fragmentation, size selection, ligate randomly into vectors, transformation, plate on agar, pick and grow individual colonies, isolate cloned constructs, cycle sequencing • Questions to ask along the way: • What if you can’t grow it? • Is it homogeneous? • Right organism? • Is it random? • Vector choice? • Number of libraries? • Is it toxic? • Success rate?

Cloning Vectors and Insert Size Ranges • M13 and some plasmids hold inserts of 0.5 to 3 Kb • Lambda phage, cosmids, and some plasmids can hold inserts up to about 40 Kb • BAC (bacterial artificial chromosome) vectors can hold inserts of 50-200 Kb • Figure labels: sequencing primer; sequencing “read” of ~800 bp

Sequencing Reaction Components Screenshot from NHGRI Educational Tool Kit Application on DNA Sequencing

Fragments terminated at each base Screenshot from NHGRI Educational Tool Kit Application on DNA Sequencing

Sequence Detection - + Screenshot from NHGRI Educational Tool Kit Application on DNA Sequencing

A DNA Sequence – maybe 800 bp long Screenshot from NHGRI Educational Tool Kit Application on DNA Sequencing

Figure 20.0 from Biology, Sixth Edition by Campbell and Reece

Phred Scores
Phred score   Probability of incorrect base call   Base call accuracy
10            1 in 10                              90%
20            1 in 100                             99%
30            1 in 1000                            99.9%
40            1 in 10000                           99.99%
50            1 in 100000                          99.999%
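The table follows the standard Phred relationship Q = -10 log10(P), where P is the probability that the base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative, not taken from the Phred software):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Print the error rate and accuracy for the scores listed on the slide.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p):,} wrong, accuracy {100 * (1 - p):.3f}%")
```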

Assembly of the Individual Sequences Individual sequencing reads are compared to each other and where they overlap can be assembled to create contigs

Assembly of the Individual Sequences Keep adding individual sequencing reads to build larger and fewer contigs

Assembly of the Individual Sequences Eventually all contigs merge to a single consensus sequence for each chromosome.

Assemblers • Greedy Assemblers – compare all reads to each other then join them in order of overlap size Figure 8. Greedy assembly of four reads.
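A toy illustration of the greedy idea, assuming error-free reads from a single strand (real assemblers also handle reverse complements, sequencing errors, and quality values). The function names and the `min_len` threshold are hypothetical choices for this sketch:

```python
def overlap_len(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != s and r in s for s in contigs)]
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_len(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # no remaining overlaps: these are the final contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG", "TTAGGCAT"]
print(greedy_assemble(reads))
```

Because the largest overlap is always taken first, a repeat longer than a read can cause a wrong join, which is one reason the graph-based approaches exist.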

Assemblers • Overlap Graph Assemblers – make a graph where each node represents a read and edges between them represent overlaps. Figure 9. Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)
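A small sketch of how such a graph could be represented: a dictionary in which each read maps to the reads it overlaps, with the overlap length as the edge weight. It assumes exact, same-strand overlaps and a hypothetical `min_len` cutoff; finding the correct path through the graph (the hard part) is not attempted here.

```python
from itertools import permutations

def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of a that matches a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; graph[a][b] = overlap length for each directed edge a -> b."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph

reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG", "TTAGGCAT"]
for read, edges in build_overlap_graph(reads).items():
    print(read, "->", edges)
```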

Assemblers • Eulerian Path Assemblers - also use graph theory but break each read into component “k-mers” • Comparative Assemblers – use a closely related genome to infer order and orientation of reads • Many assemblers now have options to skip or delay assembly of repetitive sequences
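To make the k-mer idea concrete, the sketch below shows only the first step of an Eulerian path approach: breaking reads into k-mers and recording, for each k-mer, an edge from its first k-1 bases to its last k-1 bases. The traversal that recovers the sequence is omitted, and the function names are illustrative.

```python
from collections import defaultdict

def kmers(read: str, k: int):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def kmer_graph(reads, k: int):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

print(kmers("ATGCG", 3))
print(dict(kmer_graph(["ATGCG", "GCGTA"], 3)))
```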

Popular Assemblers • TIGR Assembler (TIGR) • Phrap (Wash U) • Celera Assembler (Celera, TIGR) • Arachne (MIT Broad) • Phusion (Sanger – uses Phrap) • Atlas (Baylor HGSC)

How many clones should we sequence? According to the work of Lander and Waterman (1988), the number of “islands” or contigs formed from randomly collected sequences depends on: • G = Genome Length • L = Sequence Read Length • N = Number of Sequences Collected • T = Number of Basepairs of Overlap Needed
$$\#\,\text{Islands} = N e^{-\frac{LN}{G}\left(1 - \frac{T}{L}\right)}$$

5 Mbp Genome, 500 bp reads, 25 bp overlap
# reads    coverage   % sequenced   # contigs
2500       0.25       22.12         1971
5000       0.5        39.35         3109
10000      1          63.21         3867
20000      2          86.47         2991
30000      3          95.02         1735
40000      4          98.17         895
50000      5          99.33         433
60000      6          99.75         201
70000      7          99.91         91
80000      8          99.97         40
90000      9          99.99         17
100000     10         100.00        7
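The table can be reproduced from the island formula. In the sketch below, coverage is computed as LN/G, and the "% sequenced" column is taken to be 1 - e^(-coverage); that second expression is an inference that matches the tabulated values rather than something stated on the slide. The function name is illustrative.

```python
import math

def lander_waterman(n_reads, genome_len, read_len, min_overlap):
    """Return (coverage, fraction of genome sequenced, expected number of islands)."""
    coverage = n_reads * read_len / genome_len
    frac_sequenced = 1 - math.exp(-coverage)  # assumed form, consistent with the table
    islands = n_reads * math.exp(-coverage * (1 - min_overlap / read_len))
    return coverage, frac_sequenced, islands

# Same parameters as the slide: 5 Mbp genome, 500 bp reads, 25 bp overlap.
for n in (2500, 5000, 10000, 20000, 50000, 100000):
    c, f, i = lander_waterman(n, 5_000_000, 500, 25)
    print(f"{n:>7} reads  coverage {c:>5.2f}  {100 * f:6.2f}% sequenced  {i:7.0f} contigs")
```

Note that the number of contigs rises until about 1x coverage and only then falls, because early reads mostly land in unsequenced regions and start new islands.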

Graph of previous data

Genome size as predicted from the assembly • Figure: number of non-singleton contigs (y-axis) vs. number of sequences (x-axis) • Curves: observed # non-singletons, compared with shotgun sequencing model predictions for a 5.5 Mb genome and a 3.7 Mb genome

Dual Ended Sequencing Can Provide Information to Link Contigs • Sequencing with primers that begin in the vector on either side of the insert yields about 800 bp of DNA sequence from each end of the insert • The middle of the insert is never sequenced for most clones used in the project • Figure labels: 5 Kb insert, Primer A, Primer B

Assembly with dual-ended sequencing • Sequence assembly: contigs joined by overlaps • Contigs linked by a spanning clone • Scaffold – two or more linked contigs
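One way to picture scaffolding is as grouping contigs that are connected by spanning clones. The sketch below assumes two hypothetical inputs, a mapping from read name to the contig it was assembled into and a list of (read A, read B) pairs from the same clone, and it ignores read orientation and insert size, which a real scaffolder would use to order and space the contigs.

```python
from collections import defaultdict

def build_scaffolds(read_to_contig, clone_ends):
    """Group contigs into scaffolds: sets of contigs linked by spanning clones."""
    links = defaultdict(set)
    for contig in set(read_to_contig.values()):
        links[contig]  # make sure unlinked contigs still appear
    for read_a, read_b in clone_ends:
        ca, cb = read_to_contig.get(read_a), read_to_contig.get(read_b)
        if ca is not None and cb is not None and ca != cb:
            links[ca].add(cb)  # the clone spans a gap between two contigs
            links[cb].add(ca)
    seen, scaffolds = set(), []
    for start in sorted(links):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over linked contigs
            contig = stack.pop()
            if contig not in component:
                component.add(contig)
                stack.extend(links[contig] - component)
        seen |= component
        scaffolds.append(sorted(component))
    return scaffolds

read_to_contig = {"c1.A": "contig1", "c1.B": "contig2",
                  "c2.A": "contig2", "c2.B": "contig2",
                  "c3.A": "contig3", "c3.B": "contig3"}
clone_ends = [("c1.A", "c1.B"), ("c2.A", "c2.B"), ("c3.A", "c3.B")]
print(build_scaffolds(read_to_contig, clone_ends))
```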

Gap Closure Strategies • Primer walk to sequence the rest of linking clones that span a scaffold gap • Primer walk off clones at the ends of contigs for which there is no linking information • PCR based on your best guess at contig order (comparison to other closely related genomes, predicted genes at the end of genomes, anything else you can come up with) • Combinatorial PCR with primers designed at the end of each contig

Three Case Studies Involving “Random” WGS • What if it isn’t really random? Pantoea stewartii • What if the sequence is highly repetitive? Escherichia coli O157:H7 island; Human Genome Project segmental duplications

Pantoea stewartii subspecies stewartii • Enterobacteria similar to E. coli • Important pathogen of sweet corn and maize • Approximately 5 Mb genome

Pantoea stewartii Genome Assembly • Reads used in the assembly = 84,277 • Coverage = 12x • Contigs > 2,000 bp = 479 • Average contig size = 9,277 bp • A handful of contigs have very high coverage (100-fold plus) • What went wrong?
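A rough back-of-the-envelope check shows why these numbers are surprising. The sketch takes the read count and coverage from the slide and the approximate 5 Mb genome size, then assumes the same 25 bp minimum overlap used in the earlier Lander-Waterman example (an assumption, not a reported parameter of this project). Under purely random sampling, 12x coverage should leave very few contigs.

```python
import math

n_reads = 84_277          # from the slide
coverage = 12             # from the slide
genome_len = 5_000_000    # "approximately 5 Mb"
min_overlap = 25          # assumed, borrowed from the earlier worked example

# Mean read length implied by coverage = N * L / G.
read_len = coverage * genome_len / n_reads

# Expected number of contigs if reads were sampled uniformly at random.
expected_contigs = n_reads * math.exp(-coverage * (1 - min_overlap / read_len))

print(f"implied mean read length: {read_len:.0f} bp")
print(f"expected contigs under random sampling: {expected_contigs:.2f}")
```

The gap between this expectation and the 479 contigs actually observed suggests that sampling was not uniform across the genome.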

Pantoea stewartii Genome Structure • Approximately 5 Mb genome • Circular chromosome • 10-13 plasmids ranging from 5-100 kb • Copy number per cell varies from 1 to ~100

E. coli O157:H7 genome Lingering gap in the genome sequence

E. coli O157:H7 genome assembly problem: large repetitive sequence • Two exact copies of an 89 Kb “island” • Lingering gap in the genome sequence

Optical Restriction Mapping of O157:H7

Resolution of the O157:H7 Assembly • WGS sequencing and assembly • Optical restriction maps for two different enzymes • Optical maps conflicted in this region • No contigs left in the assembly had either restriction pattern • A region already assembled had a predicted restriction pattern that matched one of the optical maps in two locations
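The comparison in the last bullet can be sketched as an in silico digest: cut the assembled sequence at every recognition site, list the fragment sizes, and look for the run of sizes seen in the optical map. The enzymes used in the actual study are not named here, so the EcoRI site (GAATTC, cut after the first base) is purely an illustrative assumption, as are the function names and the 10% size tolerance.

```python
def digest(sequence: str, site: str, cut_offset: int = 0):
    """Fragment lengths from cutting a linear sequence at each occurrence of site."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

def find_pattern(predicted, observed, tolerance=0.1):
    """Indices where the observed run of fragment sizes matches the predicted list."""
    hits = []
    for i in range(len(predicted) - len(observed) + 1):
        window = predicted[i:i + len(observed)]
        if all(abs(p - o) <= tolerance * o for p, o in zip(window, observed)):
            hits.append(i)
    return hits

toy_sequence = "AA" + "GAATTC" + "AAAA" + "GAATTC" + "AA"
print(digest(toy_sequence, "GAATTC", cut_offset=1))

# A pattern that appears twice hints at a collapsed repeat in the assembly.
print(find_pattern([3, 10, 7, 10, 7, 5], [10, 7]))
```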

Overall Genome Project Strategies • Celera Human Genome Project, Whole Genome Shotgun Sequencing (WGS): construct clone libraries to generate templates for sequencing • Consortium Human Genome Project, Directed Strategy: create large insert clone libraries; map the position of each clone relative to others; choose an optimal tiling path; subclone chosen large insert clones to generate templates for sequencing

History of Human Genome Sequencing • 1990 – Public Consortium Human Genome Project began • 1999 – Celera begins human WGS sequencing • 2001 – Both the Consortium and Celera publish analyses of human genome drafts • 2003 – Consortium declares sequencing more complete • Oct. 2004 – Consortium declares “completion” again U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003

Nature, Oct. 21 2004

Comparison of HGP Assemblies: Draft (2001) and Near-Complete (2004) Sequences • Draft: 150,000 gaps; missing ~10% of euchromatin • Near-Complete: 341 gaps; ~99% of euchromatin covered; error rate = 1 in 100,000; 2.85 billion nucleotides

Dotplot of versions of chromosome 7
