Hybrid error correction and de novo assembly of single-molecule sequencing reads. Presented by George Roberts III.
Hybrid error correction and de novo assembly of single-molecule sequencing reads • Presented by George Roberts III • Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis & Adam M Phillippy • Nature Biotechnology, Volume 30, Number 7, July 2012
Human Language • Double articulation • Complex expressions can be broken down into morphemes and words • Source code can be tokenized
Vocalization in Chimpanzees • Limited vocalization • Pant hoot (excitement) • Food anticipation • Males, females – most common in α-males • Difficult and expensive to study http://www.cjclandandseaphoto.com/etanzania11.htm
Cast of Characters • Erich Jarvis • Duke neuroscientist • Uses the zebra finch and budgie as a “simple” model of vocalization • Birds are small and easy to breed • Sergey Koren • Celera Assembler • AMOS, metAMOS: assembly • pacBioToCA (correction and assembly pipeline) • Adam Phillippy • Assemblathon
A Genomic Panorama • Phage λ • Escherichia coli • Budgerigar • Yeast • Corn (RNA-seq) • Zebra Finch (RNA-seq) • Image credits: Bob Duda, University of Pittsburgh; Dreamtime; Dennis Kunkel; Britannica.com; Wikipedia
Parakeets • Small to medium-sized parrots (order Psittaciformes) • One of few vocal species • Crows (Corvidae) also intelligent • Cavity nesters • Southern hemisphere & tropics • Image: macaw (not a parakeet), credit: Luc Viatour
Melopsittacus undulatus - Budgerigar • undulatus [L.]: wavy pattern • Native to Australia • Little sexual dimorphism • Both parents care for young • Mating pairs allopreen • Males have blue ceres • 1.23 Gb (www.genomesize.com) • 2.8 Gb assembly??? (Pre!Ensembl – Jarvis) – database error!
Taeniopygia guttata – Zebra Finch - Passeriformes • Taenio [L.] means striped, guttata [L.] means spotted or dappled • Jarvis Lab intramural volleyball team – TeanyPyggies • Introduced to Portugal, Puerto Rico, Brazil, US. • Sons learn their fathers' songs with little variation (females do not sing) • Songs may change during puberty, but are locked in place thereafter • 1.2 Gb Sanger assembly (Warren et al. 2010) • (Warren, Clayton, Ellegren & Arnold + Jarvis, Mardis)
PBcR Read Correction and Assembly • Resolve repeats through careful alignment • Eliminate spurious mappings (white) • Use top alignments to correct errors in PacBio reads • Errors remain where short reads have the same error as PacBio reads
Longer Read Length Improves Assembly • Simulated data based on: • Even coverage and average read lengths from actual data • Error correction rate of 99.9% (76-bp reads) • 10x PacBio coverage produces optimal assembly
Illumina Paired-End Sequencing • Inserts of 200-500bp • Sequence with SP1 • Sequence with SP2
Circular Consensus Sequence (CCS) • Read length ∝ 1 / coverage (longer inserts get fewer passes) • Makes use of φ29 rolling-circle replication • http://smrt.med.cornell.edu/
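The read-length versus coverage trade-off on this slide can be illustrated with a rough calculation. This is only a sketch under a simplifying assumption that is not in the slides: the polymerase reads a fixed number of bases around the circular template, adapters are ignored, and each trip across the insert counts as one pass. The function name `ccs_passes` and the example lengths are hypothetical.

```python
def ccs_passes(polymerase_read_len, insert_len):
    """Approximate per-base coverage of an insert in a circular consensus read.

    Assumes (for illustration only) a fixed polymerase read length and
    ignores adapter sequence, so coverage is simply the ratio of the two.
    """
    if insert_len <= 0:
        raise ValueError("insert_len must be positive")
    return polymerase_read_len / insert_len


if __name__ == "__main__":
    # Hypothetical numbers: the same polymerase read length spread over
    # inserts of increasing size yields fewer passes per base.
    for insert in (250, 500, 1000, 3000):
        print(insert, ccs_passes(3000, insert))
```

Doubling the insert length halves the number of passes, which is the inverse relationship the slide states.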
S288C Coverage By Chromosome • Figure: coverage across chromosomes I to XVI and the mitochondrial genome • Top 10 alignments > 1kb were mapped with BLASR • Depth tallied by 1kb bins • Claim: spikes caused by mapping artifacts • PacBio removes amplification bias and reduces G+C bias.
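Tallying depth in 1kb bins can be sketched as follows. This is an illustrative helper, not the presenter's script: it assumes alignments are already available as half-open `(start, end)` reference intervals on one chromosome, and the name `binned_depth` is hypothetical.

```python
from collections import Counter

BIN_SIZE = 1000  # 1 kb bins, as on the slide


def binned_depth(alignments, bin_size=BIN_SIZE):
    """Return mean depth per bin from half-open (start, end) intervals."""
    covered = Counter()  # bin index -> number of aligned bases in that bin
    for start, end in alignments:
        pos = start
        while pos < end:
            b = pos // bin_size
            bin_end = (b + 1) * bin_size
            # Credit only the part of the alignment that lies in this bin.
            covered[b] += min(end, bin_end) - pos
            pos = bin_end
    return {b: bases / bin_size for b, bases in sorted(covered.items())}


if __name__ == "__main__":
    example = [(0, 1500), (500, 2500), (1000, 2000)]
    print(binned_depth(example))
```

Bins with no aligned bases are simply absent from the result, so a zero-depth region shows up as a missing key rather than a 0.0 entry.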
Sequencing Depth Histogram • Poisson λ = 12.5 • Fatter tails • Bias? • mapping artifact?
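One way to make the "fatter tails" observation concrete is to compare observed depths with a Poisson model at λ = 12.5. The sketch below uses only the standard library; the variance-to-mean ratio is a common dispersion check that I am assuming here, not something stated on the slide, and the sample depths are made up.

```python
import math

LAMBDA = 12.5  # mean depth from the slide


def poisson_pmf(k, lam=LAMBDA):
    """Probability of observing depth k under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def variance_to_mean(depths):
    """Index of dispersion: about 1 for Poisson data, above 1 for fatter tails."""
    mean = sum(depths) / len(depths)
    var = sum((d - mean) ** 2 for d in depths) / len(depths)
    return var / mean


if __name__ == "__main__":
    for k in (0, 5, 12, 25, 40):
        print(k, poisson_pmf(k))
    # Hypothetical depths, only to show the call.
    print(variance_to_mean([5, 20, 12, 13, 30, 2]))
```

A ratio well above 1 on real per-bin depths would be consistent with either bias or mapping artifacts; the ratio alone cannot tell the two apart.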
k-mer Size to detect Overlaps: E. coli Simulated Uncorrected PacBio
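Why k-mer size matters for uncorrected reads can be seen with a back-of-the-envelope model. The sketch assumes independent, uniformly distributed errors at a per-base rate `error_rate` in each read; this is a simplification of my own and not the simulation behind the slide. A k-mer seeds an overlap only if it is error-free in both reads.

```python
def shared_kmer_probability(k, error_rate):
    """Chance that a given k-mer is error-free in both overlapping reads."""
    return (1.0 - error_rate) ** (2 * k)


def expected_shared_kmers(overlap_len, k, error_rate):
    """Expected count of error-free k-mers shared across an overlap."""
    if overlap_len < k:
        return 0.0
    return (overlap_len - k + 1) * shared_kmer_probability(k, error_rate)


if __name__ == "__main__":
    # 0.16 is an illustrative per-read error rate; 1000 bp is a hypothetical overlap.
    for k in (8, 10, 12, 14, 16, 20):
        print(k, expected_shared_kmers(1000, k, 0.16))
```

Under this model the expected number of usable seeds falls off geometrically with k, which is why seed sizes tuned for accurate short reads miss most overlaps between noisy long reads.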
CDF of Correct Overlaps vs. Mismatch Tolerance • Axes: cumulative % overlaps vs. % overlap error • E = 0.16 + 0.16 − 0.16² ≈ 29.44% (the slide gives 31.55%, which matches a per-read error closer to 17.3%)
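The expression on this slide has the form e₁ + e₂ − e₁e₂: a position in an overlap disagrees if either read is wrong there, with the product term removing the double-counted case where both are wrong. A minimal sketch, assuming errors in the two reads are independent (the function name `overlap_error` is hypothetical):

```python
def overlap_error(e1, e2):
    """Expected per-base error of an overlap between two reads.

    Assumes independent errors: P(either wrong) = e1 + e2 - e1 * e2.
    """
    return e1 + e2 - e1 * e2


if __name__ == "__main__":
    for rate in (0.01, 0.05, 0.16):
        print(rate, overlap_error(rate, rate))
```

The same function covers mixed overlaps, for example a noisy long read against an accurate short read, by passing two different rates.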
454 vs. PacBio Overlaps • Axes: cumulative % overlaps vs. % overlap error
FLX+ • 1kb reads (700bp mode) • Consensus accuracy 99.997% (15x coverage) • 1,000,000 reads per run • GS FLX Titanium chemistry
Illumina Coverage vs. N50 • Figure annotation: “Sweet spot” • 200x data point: • random Illumina errors are common enough to align with PacBio errors • 4.86% drop in uncorrected N50 corresponds to a 20% drop in corrected N50 • Not recommended! • Aggressive trimming (Quake) reduces the chimera rate to 1.86%, eliminating the drop • % Chimera increases with coverage
Contiguity is Correlated with Read Length • Axes: N50 normalized to genome size vs. average read length • Figure annotations: low complexity, low coverage
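For readers unfamiliar with the metric plotted here, N50 is the contig length at which contigs of that size or larger hold at least half of the assembled bases. The sketch below computes it and divides by genome size, as the axis label describes; the contig lengths and genome size in the example are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def normalized_n50(lengths, genome_size):
    """N50 expressed as a fraction of the genome size."""
    return n50(lengths) / genome_size


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6]  # toy contig lengths
    print(n50(contigs), normalized_n50(contigs, 100))
```

Normalizing by genome size lets assemblies of a phage, a bacterium, and a bird be placed on the same axis.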
PacBio Coverage vs. Correction Methods • De Bruijn thrives on high coverage, OLC can be hindered by high coverage
Melopsittacus undulatus assembly • A hybrid assembly of the 454 and Illumina data was not possible because Celera Assembler does not support high-coverage Illumina data and ALLPATHS-LG does not support 454. • ALLPATHS-LG assembles smaller contigs, but its scaffolds contain an additional 1-2% of transcript bases (makes excellent use of short reads)
Melopsittacus undulatus assembly • 40% of [zebra finch] transcripts in the unstimulated auditory forebrain are noncoding and derive from intronic or intergenic loci • 92% of 454-PBcR-Illumina closed gaps are outside of coding regions • 18% within introns • 74% between “gene models”
Sequence is from opposite strands and in opposite directions
Illumina-Corrected PacBio Assembled by 10kb Illumina mate pairs
PBcR Join Lengths Agree with Scaffold Estimates • 33,881 scaffold gaps • 16,251 (48%) closed by 454-PBcR • 17,290 (51%) closed by 454-PBcR-Illumina • 11,804 (35%) closed by both • 21,737 (64%) closed by at least one • 12,144 (36%) not closed by either!
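The gap counts on this slide can be cross-checked with inclusion-exclusion: gaps closed by at least one assembly equal the sum of the two individual counts minus the overlap, and the remainder were closed by neither. The counts below are taken from the slide; the function name `closure_summary` is hypothetical.

```python
TOTAL_GAPS = 33881
CLOSED_454_PBCR = 16251
CLOSED_454_PBCR_ILLUMINA = 17290
CLOSED_BY_BOTH = 11804


def closure_summary(total, closed_a, closed_b, closed_both):
    """Inclusion-exclusion over two sets of closed gaps."""
    either = closed_a + closed_b - closed_both
    neither = total - either
    return {
        "either": either,
        "neither": neither,
        "either_pct": 100 * either / total,
        "neither_pct": 100 * neither / total,
    }


if __name__ == "__main__":
    summary = closure_summary(
        TOTAL_GAPS, CLOSED_454_PBCR, CLOSED_454_PBCR_ILLUMINA, CLOSED_BY_BOTH
    )
    for key, value in summary.items():
        print(key, round(value, 1))
```

With these counts, roughly a third of the scaffold gaps remain open in both assemblies.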
Taeniopygia guttata mRNA-CDS Mapping • 15,275 zebra finch mRNA from NCBI • 81, 83, 86 and 85 hybrid mappings respectively
Assembly – Gap Statistics • Vast majority of gaps are outside of exons
Avian Vocalization Regions • Area X: basal ganglia • RA: premotor nucleus • NXIIts: hypoglossal • HVC: high vocal center • DLM: dorsolateral division of the medial thalamus • LMAN: lateral part magnocellular nucleus
Genomics of Vocalization • Large involvement of ncRNA (Mattick 2004, Warren 2010)
Forkhead Box P2 - FOXP2 • DNA-binding protein • Poly [Q] (activation) • Zinc-finger (DNA-Binding) • Leucine-zipper (dimerization) • Required for proper brain and lung development
Forkhead Box P2 - FOXP2 • Knockout mouse pups exhibit less vocalization • Abnormalities in Purkinje layer • Death ~ 21 days (lung development) • 400kb • Bat echolocation • Extremely diverse (conserved in all other mammals) • Upregulated in T. guttata vocalization regions • Mutations in humans cause severe speech disorders despite adequate intelligence • Underactivation of putamen & Broca’s area • http://vanat.cvm.umn.edu/neurHistAtls/pages/cns9.html
FOXP2 Human - Chimp • N303 and S325 • N303 unique to humans • Relatively few intronic mutations (recent sweep) • Zhang et al. 2002 Genetics 162:1825-1835 positive selection
FOXP2 • Knockdown impairs song imitation • Sequence differences do not affect learning • Canaries relearn their songs each year • Order Passeriformes (finches) • FoxP2 levels increase in late summer and early fall • Fisher and Scharff (2009) TIGS 25:166-177
Zebra Finch Genome - 2010 • 2nd avian genome (after chicken) • Erich Jarvis, Elaine Mardis (Warren 2010) Singing-correlated gene expression