Current Sequencing Technologies and Data Generation. Corbin Jones & Piotr Mieczkowski Department of Biology, College of Arts and Sciences, Carolina Center for Genome Sciences Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill.
Next-generation Sequencing (Deep Sequencing) Platforms • Short reads • Genome Analyzer IIx (GAIIx), HiSeq 2000, HiSeq 2500, MiSeq – Illumina • SOLiD 5500xl System – Applied Biosystems • HeliScope™ Single Molecule Sequencer – Helicos • Long reads • Genome Sequencer FLX System (454) – Roche • PacBio RS – Pacific Biosciences • Personal Genome Machine, Ion Proton – Ion Torrent • GridION – Oxford Nanopore • Mapping sequences to large DNA fragments • NABsys • BioNanomatrix
UNC – HTSF • 9 HiSeq 2000/2500 • 1 GA II • PacBio • Ion Torrent • MiSeq (Jeff Dangl) Liz Buda and Donghui Tan Also on campus: 454 (Microbiome) 454 jr. (Viral genomics) MiSeq – Kevin Weeks
What type of sequencing should I choose for an Illumina sequencing project? • HiSeq 2000/2500 – 100-160 million single-end sequencing reads per lane. • - ChIPseq – Single End 50 cycles (2-3 human samples per lane) • - RNAseq – Single End 50 cycles (2-3 human samples per lane) • If you are interested in splicing variants and fusion genes, both Single End 100 cycles and Paired End 2x50 cycles are better options. • Whole Genome Sequencing – Paired End 2x100 cycles (2-3 lanes per genome) • Exome Capture – Paired End 2x100 cycles (4 samples per lane) • MiSeq – 3-7 million single-end sequencing reads per lane. Custom projects, fast turnaround. • Metagenomics – 16S profile – Paired End 2x150 cycles, up to 24 samples per lane. • Whole Microbial Genome Sequencing – Paired End 2x150 cycles
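The lane recommendations above can be sanity-checked with a back-of-the-envelope coverage calculation. The sketch below is illustrative only and rests on assumptions that are not on the slide: the "reads per lane" figure is treated as the number of clusters (so a paired-end run yields two reads per cluster), the human genome size is taken as roughly 3.1 Gb, and losses to duplicates, filtering, and unmapped reads are ignored. The function names are hypothetical.

```python
import math

def coverage_per_lane(clusters_per_lane, read_length, paired=True, genome_size=3.1e9):
    """Raw fold coverage contributed by one lane.

    Assumes every cluster yields one read (single end) or two reads
    (paired end) of `read_length` bases; genome_size defaults to an
    assumed ~3.1 Gb human genome.
    """
    reads_per_cluster = 2 if paired else 1
    bases_per_lane = clusters_per_lane * read_length * reads_per_cluster
    return bases_per_lane / genome_size

def lanes_needed(target_coverage, clusters_per_lane, read_length, paired=True, genome_size=3.1e9):
    """Smallest whole number of lanes that reaches the target raw coverage."""
    per_lane = coverage_per_lane(clusters_per_lane, read_length, paired, genome_size)
    return math.ceil(target_coverage / per_lane)

if __name__ == "__main__":
    # Low and high ends of the 100-160 million range quoted for HiSeq lanes.
    for clusters in (100e6, 160e6):
        per_lane = coverage_per_lane(clusters, read_length=100)
        lanes = lanes_needed(30, clusters, read_length=100)
        print(f"{clusters / 1e6:.0f}M clusters/lane: {per_lane:.1f}x per lane, {lanes} lanes for 30x")
```

At the high end of the quoted range, a 2x100 lane gives roughly 10x raw coverage under these assumptions, so about three lanes are needed for 30x; at the low end the estimate rises, which is why real projects usually budget from measured lane yield rather than nominal figures.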
SHORT READ PLATFORMS at UNC • HiSeq 2000 – initially capable of up to 600 Gb per run in 13 days. Current cost of resequencing one human genome (30x coverage): about $6,000 for UNC PIs, about $9,000 for users outside UNC. • HiSeq 2500 – initially capable of up to 100 Gb per run in 27 hours. Cost per genome – ???
MiSeq • Small capacity system. PE 2x150cycles in 27hours. • PE 2 x 250bp coming soon – error rate for read 1 – less than 1%; read 2 about 1.2%. • In preparation – PE 2 x 400bp – error rate for read1 about 2%; read 2 about 4%. • In preparation – Longer insert size possible 1.5kb
PacBio RS • Single molecule resolution in real time • Short waiting time for result and simple workflow • Generate basecalls in <1 day • Polymerase speed ≥1 base per second • No amplification required • Bias not introduced • More uniform coverage • Direct observation • Distinguish heterogeneous samples • Simultaneous kinetic measurements • Long reads • Identify repeats and structural variants • Less coverage required • Information content • One assay, multiple applications • Genetic variation (SVs to SNPs) • Methylation • Enzymology • C2 chemistry – installed March 2012 • Long reads 6-10 kb • Median size of molecules 3 kb • Still 15% error rate • No strobe sequencing • Software focuses on: • De novo assembly • High-quality CCS consensus reads • In preparation • Loading of long molecules by magnetic beads • Detection of modified nucleotides
PacBio RS – two sequencing modes (figure panels: Standard Sample Preparation; Circular Consensus) • LS – long sequencing reads • Large insert sizes (2 kb-10 kb) • Generates one pass on each molecule sequenced • CCS – high quality sequencing reads • Small insert sizes (500 bp) • Generates multiple passes on each molecule sequenced
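The difference between the two modes comes down to how many times the polymerase can traverse the insert before the read ends. A minimal sketch of that arithmetic follows; it is a simplification that assumes a fixed polymerase read length and, by default, ignores the adapter sequence. The function name and the `adapter_bp` parameter are hypothetical, not part of any PacBio software.

```python
def full_passes(polymerase_read_bp, insert_bp, adapter_bp=0):
    """Number of complete passes over the insert within one polymerase read.

    Simplified model: each pass costs insert_bp + adapter_bp bases, and
    partial passes are not counted.
    """
    if insert_bp <= 0:
        raise ValueError("insert_bp must be positive")
    return polymerase_read_bp // (insert_bp + adapter_bp)

if __name__ == "__main__":
    # A ~3 kb polymerase read on a 500 bp CCS insert versus a 2 kb LS insert.
    print("CCS, 500 bp insert:", full_passes(3000, 500), "passes")
    print("LS, 2 kb insert:", full_passes(3000, 2000), "pass")
```

Under this model a 3 kb read covers a 500 bp insert several times, which is what allows a consensus to be built, while the same read covers a 2 kb insert only once.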
Example Data: 1 SMRT Cell • Pre-Filter # of Bases: 180,320,136 bp • Post-Filter # of Bases: 165,424,592 bp • Pre-Filter # of Reads: 75,153 • Post-Filter # of Reads: 52,801 • Pre-Filter Mean Read Length: 2,399 bp • Post-Filter Mean Read Length: 3,133 bp • Pre-Filter Mean Read Quality: 0.624 • Post-Filter Mean Read Quality: 0.827 • % Adapter Dimer (0-10 bp): 1.94% • % Short Insert (11-100 bp): 0.47%
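The figures on this slide are internally consistent, and a few lines of arithmetic show what filtering actually did. The sketch below only recomputes values from the numbers above; the dictionary layout and function names are illustrative choices.

```python
# Base and read counts copied from the example data above.
pre_filter = {"bases": 180_320_136, "reads": 75_153}
post_filter = {"bases": 165_424_592, "reads": 52_801}

def mean_read_length(stats):
    """Mean read length in bp: total bases divided by number of reads."""
    return stats["bases"] / stats["reads"]

def retained_fraction(before, after, key):
    """Fraction of bases or reads that survived filtering."""
    return after[key] / before[key]

if __name__ == "__main__":
    print(f"Mean read length pre-filter:  {mean_read_length(pre_filter):.0f} bp")
    print(f"Mean read length post-filter: {mean_read_length(post_filter):.0f} bp")
    print(f"Reads retained: {retained_fraction(pre_filter, post_filter, 'reads'):.1%}")
    print(f"Bases retained: {retained_fraction(pre_filter, post_filter, 'bases'):.1%}")
```

Filtering removes about 30% of the reads but only about 8% of the bases, so the discarded reads are mostly short ones, which is why the mean read length rises after filtering.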
Personal Genome Machine – Ion Torrent (Life Technologies) • Three types of semiconductor chips: 314 – 20 Mb, 316 – 200 Mb, 318 – 1 Gb • Read length depends on base composition: 200-250 bp (200 cycles) • System is enabled for Paired End 2x100 cycles • The fastest sequencing system on the market. • How it works: an H+ ion is released during base incorporation. Individual polymerases attached to beads are positioned in tiny wells that rest on a tiny pH meter. • Recommendation: • Resequencing applications which require fast turnaround of samples • - Amplicons (PCR products) • Small and medium size genomes • Custom DNA capture applications
PGM/Ion Torrent Data 316 chip Thr. • Total Number of Bases [Mbp]: 77.65 ‣ Number of Q17 Bases [Mbp]: 36.11 ‣ Number of Q20 Bases [Mbp]: 27.33 • Total Number of Reads: 368,860 • Mean Length [bp]: 211 • Longest Read [bp]: 380
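To read the Q17 and Q20 rows, it helps to recall the standard Phred definition, Q = -10 log10(p), where p is the per-base error probability. The sketch below converts between the two and recomputes the mean read length and quality fractions from the run summary above. It treats "Q17 bases" and "Q20 bases" as base counts at or above those thresholds, which is an assumption about how the report defines them; variable and function names are illustrative.

```python
import math

def phred_to_error(q):
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score for a per-base error probability p."""
    return -10 * math.log10(p)

# Values copied from the 316 chip run summary above.
total_mbp = 77.65
q17_mbp = 36.11
q20_mbp = 27.33
total_reads = 368_860

# Mean read length in bp, derived from total bases and read count.
mean_length_bp = total_mbp * 1e6 / total_reads

if __name__ == "__main__":
    print(f"Q17 corresponds to an error rate of {phred_to_error(17):.2%}")
    print(f"Q20 corresponds to an error rate of {phred_to_error(20):.2%}")
    print(f"Mean read length: {mean_length_bp:.0f} bp")
    print(f"Fraction of bases counted as Q17: {q17_mbp / total_mbp:.1%}")
    print(f"Fraction of bases counted as Q20: {q20_mbp / total_mbp:.1%}")
```

The recomputed mean length agrees with the 211 bp reported on the slide, and the fractions show that fewer than half of the bases in this run reach the Q17 level.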
Library Preparation from Low Quantities of DNA or RNA Microfluidics stationary and portable systems Mondrian SP System – NuGEN Technologies • Human libraries from 5ng of total DNA. Only 10-15% of duplicate reads. • Ultralow DNA library systems • Soon: • Ultralow RNA library systems • Libraries from total RNA with rRNA depletion. Advanced Liquid Logic from RTP
Emerging Sequencing Technologies Semiconductor sequencing chip Nanopore / Nanochannel sequencing
Ion Proton System • Human genome in one day • Cost of reagents $1000 per run • Error rate around 1.2% • Human Genome, RNAseq, ChIPseq Ion Proton Chip I – 10Gb (Whole Exome capture experiments) Ion Proton Chip II – 100Gb Whole human Genome resequencing
Oxford Nanopore – new view on sequencing Hemolysin – pore - inner diameter of 1nm, about 100,000 times smaller than that of a human hair.
Oxford Nanopore DNA sequencing Error rate 4%, prediction for end of the year 0.1 – 2%.
Oxford Nanopore – new concepts MinION • - 150Mb per run • - Tested 48kb read length • $900 per instrument • 500 pores per device GridION • - XXXMb per run • - Tested 48kb read length • $XXX per instrument • 2000 pores per device, soon 8000 pores • Cost per human genome $1500.
Oxford Nanopore – applications • DNA sequencing • Protein detection • Protein DNA interaction • Small molecule detection • 96 well plates for 96 samples • Controlled time of sequencing
Intelligent BioSystems Mini20 System (manufactured by Azco Biotech) • Amplification by rolony method • Sequencing by Synthesis with announced 100 base reads, but expect to compete with Sanger down the road • Designed for clinical labs • 20 independent flow cells, no queue for loading, run asynchronously • 20M reads/flow cell, 4 GB/ flow cell • Potential problems with repeats • System cost $120K, $150 flow cell (disposable), full costs per sample not clear yet. • Entering early access now, expect commercial shipping late 2012
Genia Technologies • Very early stage announcement – Backed by Life Technologies (at least 1 year away) • Describe system as a cross between Ion Torrent and Oxford Nanopore • Electronic “Active Control” technology enables highly efficient nanopore-membrane assembly and control of DNA movement through the channel • Initially used α-Hemolysin and claimed 98% raw accuracy with that but now are using an undisclosed pore for further development. • Claim sensitivity 1-2 orders of magnitude greater than Oxford Nanopore. • Ramping up pore density to 100K pores/chip by end of 2012. • Plan to market a mobile reader for <$1K and per sample costs <$100 • Plan early access in late 2012, commercial shipment 2013
Basic RNAseq • Type 1: Description of transcriptome • Assembly of transcripts/isoforms • Annotation of genes • Type 2: “Paired” e.g. treatment vs control • Differential expression • Differential transcription • Type 3: Population • Elements of 1 and 2, but “random effects” • TCGA roughly falls into this category
Strand Specific RNAseq • Perkins et al 2009, Levin et al 2010 • Goal: To mark the RNA molecules in order to know the direction of transcription. • Differentiate anti-sense transcripts, lncRNAs, mRNAs etc. • Many methods; dUTP may be best; Illumina has a kit
End tagged RNAseq • GOAL: Identify ends of transcripts by attaching adaptors to ends of mRNAs • can be used in strand specific protocols • can be used in annotation and assembly protocols (Figure labels: mG, AAAAAA)
Normalized RNAseq • GOAL: To even the distribution of transcripts sequenced • Reduce the representation of high abundance transcripts and increase sensitivity to low abundance
Normalized RNAseq 2 • Methods • Kinetic (Patanjali et al 1991, Bonaldo et al 1996) • dsDNA nuclease (Zhulidov 2004) • Cap-Trapper (Carninci et al 2000) • Results • Abundant transcripts reduced proportional to freq • Coverage still proportional to expression • Problems: bias, contamination w/ ncRNA
Total RNAseq • Goal: Sequence every RNA molecule in the cell • Observe: unspliced RNAs, small RNAs, non-coding RNAs, tRNAs • Must remove rRNA! • Variants: Nuclear only, cytoplasmic only, mRNA removal
small RNA • GOAL: Sequence small RNAs, which are important for gene regulation, synthesis, splicing, and immunity (miRNA/miR, snRNA, snoRNAs, scaRNAs) • Several protocols (e.g. Illumina, Morin et al 2010) • All involve size selection, which can lead to bias • Produce short sequences that are then mapped back to the genome. • Aside: small RNA counts seem more Poisson-like than other counts
RIPseq/CLIPseq/HITS-CLIP • GOAL: Identify the sites on the RNA where RNA binding proteins are bound. • e.g. Components of the spliceosome • protocol is similar to ChIPseq except there is a random hexamer ds-cDNA synthesis step • refs: Khalil et al 2009, Sanford 2009, Licatalosi 2008, Zhang and Darnell 2011