Next-generation sequencing: informatics & software aspects. Gabor T. Marth, Boston College Biology Department. Harvard Nanocourse, October 7, 2009.
New sequencing technologies offer vast throughput and many applications [Figure: bases per machine run (1 Mb to 100 Gb) vs. read length (10 bp to 1,000 bp)] • ABI capillary sequencer • Roche/454 pyrosequencer (100 Mb-1 Gb in 200-450 bp reads) • Illumina/Solexa, AB/SOLiD sequencers (10-50 Gb in 25-100 bp reads)
Genome resequencing for variation discovery: SNPs, short INDELs, structural variations • the most immediate application area
Genome resequencing for mutational profiling [Figure label: organismal reference sequence] • likely to change “classical genetics” and mutational analysis
De novo genome sequencing (Lander et al. Nature 2001) • difficult problem with short reads • promising, especially as reads get longer
Identification of protein-bound DNA • Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007) • Transcription binding sites (Robertson et al. Nature Methods, 2007) • DNA methylation (Meissner et al. Nature 2008) • natural applications for next-gen. sequencers
Transcriptome sequencing: transcript discovery (Mortazavi et al. Nature Methods 2008; Ruby et al. Cell, 2006) • high-throughput, but short reads pose challenges
Transcriptome sequencing: expression profiling (Cloonan et al. Nature Methods, 2008; Jones-Rhoades et al. PLoS Genetics, 2007) • high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays
The re-sequencing informatics pipeline • (i) base calling • (ii) read mapping • (iii) SNP and short INDEL calling • (iv) SV calling • (v) data viewing, hypothesis generation [Figure labels: REF, IND]
The variation discovery “toolbox” • base callers • read mappers • SNP callers • SV callers • assembly viewers
Base error characteristics vary [Figure panels: 454, Illumina]
Read lengths vary [Figure: read-length ranges by platform on a 0-400 bp axis, read length [bp]] • 25-60 (variable) • 25-50 (fixed) • 25-100 (fixed) • ~200-450 (variable)
Sequence traces are machine-specific • Base calling is increasingly left to machine manufacturers
2. Read mapping • Read mapping is like a jigsaw • …you get the pieces… • … and they give you the picture on the box • Unique pieces are easier to place than others…
Multiply-mapping reads • “Traditional” repeat masking does not capture repeats at the scale of the read length • Reads from repeats cannot be uniquely mapped back to their true region of origin
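The point about repeats at the scale of the read length can be illustrated with a toy example. The sketch below is not how any particular mapper works; it simply marks which start positions in a reference yield a read-length substring that occurs exactly once. The function name `uniquely_mappable` and the toy reference are hypothetical, only the forward strand is considered, and exact matching is assumed.

```python
from collections import Counter


def uniquely_mappable(reference, read_length):
    """Flag each start position whose read-length substring is unique.

    Assumptions (illustrative only): exact matches, forward strand only.
    """
    last_start = len(reference) - read_length
    kmers = [reference[i:i + read_length] for i in range(last_start + 1)]
    counts = Counter(kmers)
    # A read drawn from position i maps uniquely only if its sequence
    # occurs exactly once in the reference.
    return [counts[kmer] == 1 for kmer in kmers]


# Toy reference in which "ACGT" occurs twice (a short repeat).
toy_reference = "ACGTACGTTTGA"
for length in (4, 5):
    flags = uniquely_mappable(toy_reference, length)
    print(length, sum(flags), "of", len(flags), "positions are unique")
```

In this toy case reads of length 4 that start inside the repeat are ambiguous, while reads of length 5 are all unique, which matches the intuition that longer reads resolve more repeats.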
Paired-end (PE) reads • fragment length: 1 – 10 kb • fragment length: 100 – 600 bp • PE reads are now the standard for whole-genome short-read sequencing • Korbel et al. Science 2007
The MOSAIK read mapper (Michael Strömberg) • gapped mapper • option to report multiple map locations • aligns 454, Illumina, SOLiD, Helicos reads • works with standard file formats (SRF, FASTQ, SAM/BAM)
Alignment post-processing • quality value re-calibration • duplicate fragment removal
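Duplicate fragment removal can be sketched as follows. This is a minimal illustration, not the procedure of any specific tool: fragments are assumed to be duplicates when they share chromosome, start, end and strand, and the copy with the highest mapping quality is kept. The dictionary keys (`chrom`, `start`, `end`, `strand`, `mapq`) are hypothetical field names chosen for this example.

```python
def remove_duplicate_fragments(alignments):
    """Keep one alignment per (chrom, start, end, strand) coordinate set.

    Assumption: identical coordinates and strand indicate duplicate
    fragments; the copy with the highest mapping quality is retained.
    """
    best = {}
    for aln in alignments:
        key = (aln["chrom"], aln["start"], aln["end"], aln["strand"])
        if key not in best or aln["mapq"] > best[key]["mapq"]:
            best[key] = aln
    # Return the survivors in coordinate order.
    return sorted(best.values(), key=lambda a: (a["chrom"], a["start"]))


alignments = [
    {"chrom": "chr1", "start": 100, "end": 400, "strand": "+", "mapq": 30},
    {"chrom": "chr1", "start": 100, "end": 400, "strand": "+", "mapq": 60},
    {"chrom": "chr1", "start": 150, "end": 450, "strand": "-", "mapq": 20},
]
for aln in remove_duplicate_fragments(alignments):
    print(aln)
```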
Alignment visualization • too much data – indexed browsing • too much detail – color coding, show/hide
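Indexed browsing means a viewer fetches only the reads that overlap the window on screen rather than scanning the whole file. A minimal sketch of the idea, assuming reads are held in memory as half-open `(start, end)` tuples sorted by start and that a maximum read length is known; the function name and parameters are hypothetical, and real formats use more elaborate binning indexes.

```python
from bisect import bisect_left


def reads_in_region(reads, region_start, region_end, max_read_length):
    """Return reads overlapping [region_start, region_end).

    Assumes `reads` is a list of half-open (start, end) tuples sorted by
    start, and that no read is longer than `max_read_length`.
    """
    starts = [start for start, _ in reads]
    # A read overlapping the region cannot start earlier than this.
    lo = bisect_left(starts, region_start - max_read_length)
    # Reads starting at or after region_end cannot overlap it.
    hi = bisect_left(starts, region_end)
    return [read for read in reads[lo:hi] if read[1] > region_start]


reads = [(0, 50), (40, 90), (100, 150), (120, 170), (300, 350)]
print(reads_in_region(reads, 130, 160, max_read_length=50))
```

In practice the list of start coordinates would be built once and reused across queries; it is rebuilt here only to keep the sketch short.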
SNP calling: old problem, new data • sequencing error • polymorphism
Allele calling in next-gen data • New technologies are well suited to accurate SNP calling, and some also to short-INDEL detection [Figure labels: SNP, INS]
SNP calling in multi-sample read sets
• Base observations per sample: B1=aacc (reads a, a, c, c); Bi=aaaac (reads a, a, a, a, c); Bn=cccc (reads c, c, c, c)
• “genotype likelihoods”: P(B1=aacc|G1=aa), P(B1=aacc|G1=cc), P(B1=aacc|G1=ac); P(Bi=aaaac|Gi=aa), P(Bi=aaaac|Gi=cc), P(Bi=aaaac|Gi=ac); P(Bn=cccc|Gn=aa), P(Bn=cccc|Gn=cc), P(Bn=cccc|Gn=ac)
• Prior(G1,..,Gi,..,Gn)
• “genotype probabilities”: P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc); likewise P(Gi=aa|…), P(Gi=cc|…), P(Gi=ac|…) and P(Gn=aa|…), P(Gn=cc|…), P(Gn=ac|…), each conditioned on B1=aacc; Bi=aaaac; Bn=cccc
• P(SNP)
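The flow from genotype likelihoods through a joint prior to genotype probabilities and P(SNP) can be sketched numerically. The code below is an illustration under stated assumptions, not the caller used in the talk: the site is biallelic (a/c), reads are independent, every base has the same error rate (`error_rate`, hypothetical), and the joint prior is a toy one that gives total mass `theta` (hypothetical) to all polymorphic genotype configurations combined. P(SNP) is taken here to mean the posterior probability that the samples are not all the same homozygote.

```python
from itertools import product
from math import prod

GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihood(bases, genotype, error_rate=0.01):
    """P(bases | genotype), assuming independent reads and a biallelic site."""
    likelihood = 1.0
    for base in bases:
        # Each read comes from either chromosome with probability 1/2.
        likelihood *= sum(
            0.5 * ((1.0 - error_rate) if base == allele else error_rate)
            for allele in genotype
        )
    return likelihood


def call_site(samples, theta=0.001, error_rate=0.01):
    """Return per-sample genotype probabilities and P(SNP) for one site."""
    n = len(samples)
    likelihoods = [
        {g: genotype_likelihood(bases, g, error_rate) for g in GENOTYPES}
        for bases in samples
    ]
    configs = list(product(GENOTYPES, repeat=n))
    monomorphic = {("aa",) * n, ("cc",) * n}
    n_polymorphic = len(configs) - len(monomorphic)

    def prior(config):
        # Toy joint prior (an assumption): monomorphic sites are favoured.
        if config in monomorphic:
            return (1.0 - theta) / 2.0
        return theta / n_polymorphic

    joint = {
        c: prior(c) * prod(likelihoods[i][g] for i, g in enumerate(c))
        for c in configs
    }
    total = sum(joint.values())
    posterior = {c: p / total for c, p in joint.items()}
    # Marginalise the joint posterior to get each sample's genotype probabilities.
    marginals = [
        {g: sum(p for c, p in posterior.items() if c[i] == g) for g in GENOTYPES}
        for i in range(n)
    ]
    p_snp = 1.0 - sum(posterior[c] for c in monomorphic)
    return marginals, p_snp


marginals, p_snp = call_site(["aacc", "aaaac", "cccc"])
print(marginals[0], p_snp)
```

Enumerating every genotype configuration scales as 3^n and is only practical for a handful of samples; it is used here because it makes the role of the joint prior explicit.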
Trio sequencing • the child inherits one chromosome from each parent • there is a small probability for a de novo (germ-line or somatic) mutation in the child
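The two trio statements above translate naturally into a prior on the child's genotype given the parents' genotypes. The sketch below assumes a biallelic site (a/c) and models a de novo event simply as the transmitted allele flipping with a small probability `mu`; both the function names and the default value of `mu` are hypothetical, chosen only for illustration.

```python
def transmit(parent_genotype, allele, mu):
    """P(parent passes on `allele`), allowing a de novo change with prob. mu."""
    # Without mutation, each of the parent's two alleles is passed with prob. 1/2.
    p = sum(0.5 for a in parent_genotype if a == allele)
    return p * (1.0 - mu) + (1.0 - p) * mu


def child_prior(mother, father, mu=1e-8):
    """P(child genotype | mother, father) at a biallelic a/c site."""
    probs = {}
    for genotype in ("aa", "ac", "cc"):
        x, y = genotype
        p = transmit(mother, x, mu) * transmit(father, y, mu)
        if x != y:
            # A heterozygous child can receive either allele from either parent.
            p += transmit(mother, y, mu) * transmit(father, x, mu)
        probs[genotype] = p
    return probs


print(child_prior("ac", "ac"))
print(child_prior("aa", "aa"))
```

With two homozygous aa parents, a heterozygous child is possible only through the mutation term, so its prior probability is on the order of `mu`.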
Standard data formats • SRF/FASTQ • SAM/BAM • GLF/VCF
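As a small illustration of what these formats contain, the sketch below parses FASTQ text: four lines per read, with the quality string decoded into Phred scores. It assumes the Sanger-style ASCII offset of 33 (exposed as the `offset` parameter, since some Illumina pipelines of the period used 64) and unwrapped, single-line sequences; `parse_fastq` is a hypothetical helper, not part of any library.

```python
def parse_fastq(text, offset=33):
    """Yield (name, sequence, qualities) from FASTQ-formatted text.

    Assumes four lines per record and a Phred quality offset of 33.
    """
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Each quality character encodes Q = ord(char) - offset.
        yield header[1:], sequence, [ord(ch) - offset for ch in quality]


example = "@read1\nACGT\n+\nII5!\n"
for name, sequence, qualities in parse_fastq(example):
    print(name, sequence, qualities)
```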
Human genome polymorphism projects common SNPs
The 1000 Genomes Project • Pilot 1 • Pilot 2 • Pilot 3
1000G Pilot 3 – exon sequencing • Targets: 1K genes / 10K targets • Capture: solid / liquid phase • Sequencing: 454 / Illumina; SE / PE • Data producers: Baylor, Broad, Sanger, Wash. U. • Informatics methods: multiple read mapping & SNP calling programs
On/off target capture [Figure labels: target region; SNP (outside target region)] • ref allele*: 54% • non-ref allele*: 45%
Reference allele bias • ref allele*: 54% • non-ref allele*: 45% • (*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346
SNP calling findings • based on a method comparison / testing exercise • 80 samples drawn from the 4 Centers • read mapping / SNP calling by the Baylor pipeline (BCM/454 data); the Broad and the BC pipelines (all 80 samples)
Overlap between call sets [Venn diagram: Broad calls, BC calls; values for the three regions listed in source order]
# SNP calls: 413 | 452 | 2,296
# dbSNPs: 24 | 172 | 1,862
% dbSNPs: 5.81% | 38.05% | 81.10%
Ts/Tv ratio: 0.23 | 1.21 | 3.40
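The two quality metrics in this comparison are simple to compute. The sketch below shows one way to do it: Ts/Tv counts transitions (A↔G, C↔T) against transversions (all other substitutions), and the dbSNP percentage is the share of calls already present in dbSNP. The function names are hypothetical; the only figures used are the ones reported above.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Transition/transversion ratio for a list of (ref, alt) base pairs."""
    transitions = sum(1 for pair in snps if pair in TRANSITIONS)
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions; Ts/Tv is undefined")
    return transitions / transversions


def percent_in_dbsnp(n_calls, n_in_dbsnp):
    """Percentage of SNP calls that are already in dbSNP."""
    return 100.0 * n_in_dbsnp / n_calls


# Recompute the percentages from the counts reported for the three regions.
for n_calls, n_known in ((413, 24), (452, 172), (2296, 1862)):
    print(n_calls, n_known, round(percent_in_dbsnp(n_calls, n_known), 2))

# Toy call set (made up for illustration): two transitions, one transversion.
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```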
Structural variation detection (Feuk et al. Nature Reviews Genetics, 2006)
SV detection – resolution [Figure: relative numbers of events vs. CNV event length [bp]; labels: expected CNVs, karyotype, micro-array, sequencing]
Detection approaches • Read depth: good for big CNVs • Paired-end: all types of SV • Split-reads: good break-point resolution • De novo assembly: ~ the future [Figure labels: Reference, Sample, Lmap, read, contig] • SV slides courtesy of Chip Stewart, Boston College
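The paired-end approach relies on comparing each pair's mapped distance (Lmap in the figure) with the expected fragment length. A minimal sketch, assuming the expected length is summarised by the median of the observed distances and that a pair is discordant when it lies more than `n_mads` median absolute deviations away; the threshold, the function name and the example values are hypothetical. A mapped distance that is too long suggests a deletion in the sample, one that is too short suggests an insertion.

```python
from statistics import median


def classify_pairs(mapped_distances, n_mads=5.0):
    """Label each read pair by comparing its mapped distance to the median.

    Assumptions: most pairs are concordant, so the median approximates the
    true fragment length; the cut-off of n_mads MADs is arbitrary.
    """
    center = median(mapped_distances)
    mad = median([abs(d - center) for d in mapped_distances])
    tolerance = n_mads * max(mad, 1.0)  # guard against a zero MAD
    labels = []
    for distance in mapped_distances:
        if distance > center + tolerance:
            labels.append("deletion candidate")
        elif distance < center - tolerance:
            labels.append("insertion candidate")
        else:
            labels.append("concordant")
    return labels


# Made-up mapped distances for illustration.
distances = [300, 310, 295, 305, 290, 300, 1500, 302, 298, 50]
for distance, label in zip(distances, classify_pairs(distances)):
    print(distance, label)
```

Other SV types (inversions, translocations) show up as unexpected read orientation or mates on different chromosomes, which this distance-only sketch does not cover.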
Coverage bias • GC bias • Single molecule sequencing?
RD resolution [Figure (Illumina): density (log10) of observed read counts (per kb) vs. expected read counts (per kb), with bands for CN = 1, CN = 2, CN = 3]
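The relationship between observed and expected read counts that separates the CN bands can be written as a one-line estimate. The sketch below assumes a diploid baseline, so a window whose observed count equals the expected count has CN = 2, and it ignores the GC and coverage biases mentioned earlier; `estimate_copy_number` and the example counts are hypothetical.

```python
def estimate_copy_number(observed, expected, baseline_copies=2):
    """Estimate integer copy number from read counts in one window.

    Assumes read depth scales linearly with copy number and that
    `expected` is the count for a region present in `baseline_copies`.
    """
    if expected <= 0:
        raise ValueError("expected read count must be positive")
    return round(baseline_copies * observed / expected)


# Made-up per-kb counts for illustration.
for observed in (50, 100, 150):
    print(observed, estimate_copy_number(observed, expected=100))
```

Because read counts are noisy, resolution improves with window size and coverage, which is consistent with read depth being best suited to large CNVs.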