Today • Please read… Science 291: 1304-1315
Human Genome Project Dissenters: My Brush with Greatness? • 1992: Two years into the HGP, two of the project’s biggest critics were… • Sydney Brenner: believed that the HGP should focus on human EST collections and on sequencing the genome of a simple vertebrate (Fugu). • Craig Venter: believed that the clone-by-clone approach was not the most efficient way to proceed, and suggested that shotgun approaches, even a whole-genome approach, were feasible. …they were both right.
Sydney Brenner 2002 Nobel Prize (Physiology or Medicine) • Sydney Brenner and John E. Sulston, Britain; H. Robert Horvitz, United States • for discoveries concerning how genes regulate organ development and a process of programmed cell death.
Brenner was right… Expressed Sequence Tags (ESTs) End-sequenced cDNAs (complementary DNA) • cDNA: synthetic DNA reverse-transcribed from an mRNA template, • through the action of an RNA-dependent DNA polymerase called reverse transcriptase. Online Primer: est.html
Still Sequencing cDNAs, • first and easiest look into any genome, • useful in understanding genomic sequence (gene finding), • helps determine splice site variants, • shorter than genomic clones, fits in plasmids, • etc.
Used for microarrays… …an array of DNA that can be hybridized with probes to study patterns of gene expression. …tissue specific ESTs are very useful.
Venter was right… J. Craig Venter Whole Genome Assembly • 1995: 1.8 Mbp Haemophilus influenzae genome sequenced, • 1996 on: Mycoplasma, E. coli and others*, • 1999: Chromosome 2 of Arabidopsis, • 2000: Drosophila (120 Mbp) genome, …Human, Mosquito, etc… • Lots of genomes, several applications... *WGA of bacterial, viral populations...
1 year, 120 megabases, • Assembly algorithms could generate accurate genomic sequences, • Interim assemblies (or mapping) were not necessary. 24 MARCH 2000 VOL 287 SCIENCE
Think About This… …the plasmid library construction is the first critical step in WGA sequencing, • “if the DNA libraries are not uniform in size, non-chimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence.” • “We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence).”
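The totals quoted above can be sanity-checked with a little arithmetic. The sketch below derives the mean read length and a rough fold coverage from the 27.3 million reads and 14.9 billion bp in the quote. The genome size of 2.9 Gbp is an assumption added here for illustration and does not come from the slide, so the coverage value it yields covers these reads only and shifts with whatever genome size is assumed.

```python
def mean_read_length(total_bases, total_reads):
    """Average number of bases per sequence read."""
    return total_bases / total_reads


def fold_coverage(total_bases, genome_size_bp):
    """Sequence coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


# Figures quoted on the slide.
TOTAL_READS = 27_300_000
TOTAL_BASES = 14_900_000_000

# Assumption for illustration only (not stated on the slide).
ASSUMED_GENOME_SIZE_BP = 2_900_000_000

if __name__ == "__main__":
    print(round(mean_read_length(TOTAL_BASES, TOTAL_READS), 1))
    print(round(fold_coverage(TOTAL_BASES, ASSUMED_GENOME_SIZE_BP), 2))
```

The mean read length computed this way lands within a few bp of the 543 bp average given elsewhere in the deck, which is the expected result of rounding in the quoted totals.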
Whose DNA? • 21 enrolled donors, • age, sex, ethnographic group, • one African-American, • one Asian-Chinese, • one Hispanic-Mexican, • two Caucasians*.
Whose, Mostly? J. Craig Venter
…back to humans… What to know? Individuals, Libraries, Sequence coverage, Clone coverage, Other? 543 bp average sequence read; 8 September 1999 - 25 June 2000
Online Primer: snps.html WGA Outline
DNA in sized libraries… DNA sequence in mate-pairs… [Figure: cartoon of a vector carrying an insert, with sequencing primers at both ends. The two sequenced ends (~543 bp each, e.g. 5’- actgtacgtgtagctgaca… - 3’ and 5’- tagcgtagttattttgc… - 3’) flank an unsequenced insert of approximately known size.]
What does Shredder Do? Why? Whole Genome Assembly 1. Screener 2. Overlapper 3. Unitigger/Discriminator, 4. Scaffolder, 5. Repeat Resolver.
Screener ...finds and “masks” microsatellite repeats, known repeated regions and ribosomal DNA, • “masked” regions not used to make contigs, • “marks” the rest for overlapping. read: atgacttacttactgcatatttatttatttatttatttatttatttatttatttatttatttatttatttatttatttgacgtgtacgtgtacgtgtagctgtacgtgtacgtgacgggccgcattatcgtgatgctacgtgtacgttatatctgatcgtgcatgtga [Figure: the same read repeated with its masked and marked regions highlighted.]
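The masking step for microsatellites can be sketched with a regular expression. This is a minimal illustration and not the Screener itself: it only handles simple tandem repeats, and the function name, the unit-length limit, the minimum copy number and the mask character `n` are all assumptions chosen for the example. Known repeat families and ribosomal DNA would need a reference library, which is not shown.

```python
import re


def mask_tandem_repeats(read, max_unit=6, min_copies=4, mask_char="n"):
    """Replace runs of a short repeated unit (e.g. tttatttattta...) with mask_char.

    max_unit, min_copies and mask_char are illustrative assumptions.
    """
    # A unit of 1..max_unit bases (shortest unit tried first), followed by
    # at least (min_copies - 1) further copies of that same unit.
    pattern = re.compile(
        r"([acgt]{1," + str(max_unit) + r"}?)\1{" + str(min_copies - 1) + r",}",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: mask_char * len(m.group(0)), read)


if __name__ == "__main__":
    example = "acgg" + "ttta" * 5 + "gcat"
    # The 20-base ttta run is masked; the flanking bases are kept.
    print(mask_tandem_repeats(example))
```

Masking in place, instead of deleting the repeat, keeps read coordinates unchanged, so the unmasked ("marked") portions can still be located for overlapping.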
Overlapper ...looks for end-to-end overlaps of at least 40 bp with no more than 6% mismatches. <--tactgtacgtagctgtgatgttcctcggatatagcgggcatatttattacgctattgtacgtgt-3’ 5’- gttcctcggatatagcgggcatatttattacgctattgtacgtgtaaagtatcgt--> (> 40 bp, < 6% mismatch) What’s the significance? ...a one in 10^17 event, …given perfect randomness.
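The overlap criterion (at least 40 bp, no more than 6% mismatch) can be sketched as a brute-force suffix/prefix comparison. This is an illustrative simplification, not the Overlapper's actual algorithm: it assumes both reads are on the same strand, counts substitutions only (no insertions or deletions), and the function name is hypothetical.

```python
def best_end_overlap(left, right, min_overlap=40, max_mismatch_rate=0.06):
    """Find the longest suffix of `left` that matches a prefix of `right`.

    Returns (overlap_length, mismatches), or None if no overlap of at least
    min_overlap bases stays within max_mismatch_rate.
    Assumes same-strand reads and substitution-only differences.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink towards min_overlap.
    for length in range(longest, min_overlap - 1, -1):
        suffix = left[-length:]
        prefix = right[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_mismatch_rate * length:
            return length, mismatches
    return None


if __name__ == "__main__":
    # The two reads drawn on the slide.
    left = "tactgtacgtagctgtgatgttcctcggatatagcgggcatatttattacgctattgtacgtgt"
    right = "gttcctcggatatagcgggcatatttattacgctattgtacgtgtaaagtatcgt"
    print(best_end_overlap(left, right))
```

A real overlapper would avoid this quadratic comparison, typically by indexing short exact matches first and only then checking candidate pairs in detail.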
Good News ... uniquely assembled contigs (unitigs) are readily identifiable, • all of the assembled sequences match over all of the known sequence, - and - ...are consistent with an 8x sequence coverage.
Unitigs ...contig cluster is consistent with expected size (+8), ...no dissimilar sequences between any members. But(t): • ...the Screener doesn’t include all of the “low frequency” level repeats, • ...so, a majority of the Overlapper outputs turned out to be bogus.
What Now? • “over-collapsed” assemblies are identified and broken down into unitigs when possible... • …these “too-large” contig sets are sent to the Unitigger/Discriminator.
Unitigger ...differentiates between a true overlap and an overlap that includes more than one locus. ...in a world where real data matches expected data, each locus would have 8X coverage, ...if there are genomic repeats, then sequences would be “over-represented”: on average, 8 more per repeat, per contig. ...over-collapsed.
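The coverage argument above suggests a simple check: if a contig holds far more sequence than 8X coverage predicts, it may have collapsed two or more repeat copies. The sketch below uses a plain depth threshold, which is a stand-in chosen for illustration and not the statistic the Unitigger actually computes. The function names and the `tolerance` factor are assumptions.

```python
def mean_depth(read_lengths, contig_length):
    """Average number of reads covering each base of the contig."""
    return sum(read_lengths) / contig_length


def looks_over_collapsed(read_lengths, contig_length,
                         expected_depth=8.0, tolerance=1.5):
    """Flag a contig whose depth exceeds expected_depth * tolerance.

    expected_depth follows the 8X figure on the slide; tolerance is an
    arbitrary illustrative cut-off, not a published parameter.
    """
    return mean_depth(read_lengths, contig_length) > expected_depth * tolerance


if __name__ == "__main__":
    unique_locus = [543] * 150   # about 8X over 10 kb
    two_copies = [543] * 300     # about 16X over 10 kb
    print(looks_over_collapsed(unique_locus, 10_000))
    print(looks_over_collapsed(two_copies, 10_000))
```

A fixed cut-off like this ignores the random variation in read placement, which is why a probabilistic score is the more defensible choice in practice.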
Discriminator ...parses the “over-collapsed” contig by using sequence outside of the overlap region.
Discriminator ...may yield u-unitigs. Unitigger/Discriminator Output: correctly assembled contigs covering 73.6% of the genome.
Scaffolder ...contigs the contigs, • uses mate-pair information, • two or more consistent mate-pair matches yield 1 in 10^10 odds of being chance.
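The "two or more mate-pair matches" rule can be sketched as a count of mate pairs whose two ends fall in different contigs. This is a simplified illustration: the function and argument names are hypothetical, and it checks only the number of links, not whether their orientation and implied distance agree, which "consistent" on the slide also requires.

```python
from collections import Counter


def confident_links(mate_pairs, read_to_contig, min_links=2):
    """Count mate pairs linking two different contigs; keep well-supported links.

    mate_pairs: iterable of (read_id, read_id) tuples.
    read_to_contig: dict mapping a read id to the contig it was placed in.
    Orientation and insert-size consistency are NOT checked in this sketch.
    """
    counts = Counter()
    for read_a, read_b in mate_pairs:
        contig_a = read_to_contig.get(read_a)
        contig_b = read_to_contig.get(read_b)
        # Skip unplaced reads and pairs that fall within a single contig.
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue
        counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_links}


if __name__ == "__main__":
    placement = {"r1": "c1", "r2": "c2", "r3": "c1", "r4": "c2",
                 "r5": "c1", "r6": "c3"}
    pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
    print(confident_links(pairs, placement))
```

Requiring a second, independent link is what drives the odds of a chance join so low: one stray mate pair can mislead, but two agreeing by accident is far less likely.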
Repeat Resolver ...most of the remaining gaps were due to repeats. • “Rocks” • Use “low Discriminator Value” contig sets to fill gaps, • - find two or more mate pairs with unambiguous matches in the scaffold near the gap (2 kb, 10 kb or 50 kb), (1 in 10^7), • “Stones” • - find mate pair matches 2 kb, 10 kb, and 50 kb from gap, place the mate in the gap, check to see if it’s consistent with other “placed” sequences. ...confirm matches.
If that Doesn’t Work ...find a mate-pair that spans the gap, and sequence it. Chromosome Walking ...make sequencing primer from BES...
Today/Friday • Questions about WGA, • CSA, • Comparisons, • Quality Control, etc.