170 likes | 337 Views
Delon Toh. Pitfalls of 2 nd Gen. Amplification of cDNA Artifacts Biased coverage Short reads Medium ~100bp for Illumina 700bp for 454. 3 rd Gen: PacBIO RS. Real-time single molecule sequencer No amplification = no bias Produces long reads Median ~2246 Max~23000
E N D
Pitfalls of 2nd Gen • Amplification of cDNA • Artifacts • Biased coverage • Short reads • Medium ~100bp for Illumina • 700bp for 454
3rd Gen: PacBIO RS • Real-time single molecule sequencer • No amplification = no bias • Produces long reads • Median ~2246 • Max~23000 ** Resolve complex repeats and span entire gene transcript (No need complex computational assembly)
3rd Gen Problem • 82.1%-84.6% nucleotide accuracy • Point mutations + deletions Results in: • Pairwise differences between two reads is approximately twice their individual error rate • >5-10% error rate what most genome assemblers can tolerate
Length of single-molecule PacBio reads = size distribution of most transcipts • PacBio reads will represent full-length/near full-length transcript • No need complex algorithms (e.g Trinity) for short reads, in order to detect spliced isoforms • Predominance of indel errors makes analysis of raw read problematic
Solution for high error rate • Generate short, high-accuracy sequences • Correct the error inherent in long, single molecule sequences • Assemble corrected ‘hybrid’ reads
Black: Errors Error between 2 Long pink reads: 16% each Error between 1 Pink and 1 blue bar: 16% x ~1% Rare sequencing error will propagate when only sequencing error co-occurs Trimming and splitting long reads (high indel %) whenever there is a gap in short read tiling
Correction accuracy/performance • Using short reads from Illumina to correct Pacbio reads for each reference organism • Lambda NENB3011 • E. coli • S. cerevisiae S228c • ~85% to 99% accuracy of long reads • PBcR PacBio Corrected Reads
Correction and Throughput • Reads may be discarded (low abundance RNA?) • Low quality • Short length • % of reads that are successfully corrected = Throughput • ~60%
Single-molecule RNA-seq correction: Zea mays • 0.06% insertion, 0.02% deletion rate • 99.1% aligned to reference genome by BLAT at >90% sequence identity • 50130 PacBio reads • Median size: 817 • 11.6% aligned to reference genome by BLAT at >90% sequence identity Correction
Many PacBio reads represented close to full-length transcripts • Post-correction sequences have virtually no errors and precisely identified splicing junctions
Summary • 2nd Gen Sequencing • Short fragment, • Accurate • Requires complex algorithm (eg, Cufflink, Trinity) to piece the short fragments into meaning full transcript • 3rd Gen Sequencing • Long fragment • Inaccurate • Combine the long reads from 3rd gene and accuracy from 2nd gene • Short fragment used to correct long fragment Long and accurate read • Computational input used to correct rather than assemble
Future • Correction was optimized for genome assembly and applied for RNA-seq • Direct RNA-seqby changing polymerase to RNA-polymerase?