Informatics for next-generation sequence analysis – SNP calling
Gabor T. Marth, Boston College Biology Department
PSB 2008, January 4-8, 2008
Read length and throughput
[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]
• Illumina/Solexa, AB/SOLiD short-read sequencers: 1-4 Gb in 25-50 bp reads
• 454 pyrosequencer: 20-100 Mb in 100-250 bp reads
• ABI capillary sequencer
Current and future application areas
• Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
[Figure: reads aligned to a reference genome, showing a deletion (DEL) and a SNP]
• De novo genome sequencing
• Short-read sequencing will be (at least) an alternative to micro-arrays for:
  • DNA-protein interaction analysis (ChIP-Seq)
  • novel transcript discovery
  • quantification of gene expression
  • epigenetic analysis (methylation profiling)
Fundamental informatics challenges (I)
1. Interpreting machine readouts – base calling, base error estimation
2. Dealing with non-uniqueness in the genome: resequenceability
3. Alignment of billions of reads
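One rough way to think about challenge 2 is to ask what fraction of read-length substrings (k-mers) occur exactly once in the genome. The sketch below is only an illustration of that idea, not the measure used by the presenter: the function name `unique_kmer_fraction` and the toy sequence are hypothetical, and the reverse strand and sequencing errors are ignored.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of start positions whose k-mer occurs exactly once in `genome`.

    Simplifying assumptions: forward strand only, exact matches only.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and len(genome)")
    # Every substring of length k, one per start position.
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


# Hypothetical toy sequence containing a short repeat (ACGT occurs twice).
toy_genome = "ACGTACGTTT"
print(unique_kmer_fraction(toy_genome, 4))
print(unique_kmer_fraction(toy_genome, 6))
```

Longer k-mers are unique more often, which is why read length matters for how much of a genome can be resequenced unambiguously.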
Informatics challenges (II)
4. SNP, short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management
Resequencing-based SNP discovery
[Figure: pipeline against the genome reference sequence: read mapping, read alignment, paralog identification, SNP detection + inspection]
SNP calling workflow
• read alignment
• SNP detection
• visual checking
Bayesian detection algorithm
[Figure: allele combinations over A, C, G, T, each labeled as a polymorphic or a monomorphic combination]
• Inputs: base call + base quality, polymorphism rate (prior), base composition, depth of coverage
• Output: Bayesian posterior probability, i.e. the SNP score
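As a minimal sketch of the idea on this slide (not the presenter's actual implementation), a SNP score can be computed by enumerating every combination of true alleles across the aligned sequences and summing the posterior weight of the polymorphic ones. The assumptions here are mine: each observation is an independent haploid sequence, a miscalled base is equally likely to be any of the other three bases, and the prior is split evenly within the polymorphic and monomorphic groups (base composition is not modeled). The name `snp_score` and the default prior of 0.001 are illustrative.

```python
from itertools import product

BASES = "ACGT"


def snp_score(calls, prior_poly=0.001):
    """Posterior probability that a site is polymorphic.

    calls: list of (called_base, error_probability), one per aligned sequence.
    prior_poly: assumed prior probability that the site is polymorphic.
    """
    n = len(calls)
    if n < 2:
        raise ValueError("need at least two sequences to see a polymorphism")
    n_mono = len(BASES)                # AAAA..., CCCC..., GGGG..., TTTT...
    n_poly = len(BASES) ** n - n_mono  # every other allele combination
    poly_weight = 0.0
    total_weight = 0.0
    for combo in product(BASES, repeat=n):
        is_mono = len(set(combo)) == 1
        prior = (1.0 - prior_poly) / n_mono if is_mono else prior_poly / n_poly
        # Likelihood of the observed calls given this true-allele combination.
        likelihood = 1.0
        for true_base, (called, err) in zip(combo, calls):
            likelihood *= (1.0 - err) if called == true_base else err / 3.0
        weight = prior * likelihood
        total_weight += weight
        if not is_mono:
            poly_weight += weight
    return poly_weight / total_weight


# Three confident A calls and three confident C calls: strong SNP evidence.
print(snp_score([("A", 0.001)] * 3 + [("C", 0.001)] * 3))
# Five confident A calls and one low-quality C call: likely a sequencing error.
print(snp_score([("A", 0.001)] * 5 + [("C", 0.2)]))
```

Enumeration grows as 4^n, so this form is only practical for a handful of sequences; it is meant to show how base quality, the prior, and depth interact in the score.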
Base quality values for SNP calling
• base quality values help us decide if mismatches are true polymorphisms or sequencing errors
• accurate base qualities are crucial, especially in lower coverage
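For reference, base qualities on the standard Phred scale relate to error probability as Q = -10 log10(p). The sketch below assumes the qualities in question are Phred-scaled; the function names are illustrative.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred-scaled base quality."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error(q))
```

A mismatch supported by a Q30 base is therefore far more likely to be a real difference than one supported by a Q10 base, which matters most when only a few reads cover the site.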
Priors for specific resequencing scenarios
[Figure: aligned reads (AACGTTAGCATA / AACGTTCGCATA) grouped either by individual (individuals 1-3) or by strain (strains 1-3)]
Consensus sequence generation (genotyping)
[Figure: aligned reads (AACGTTAGCATA / AACGTTCGCATA) grouped by individual or by strain, with a consensus call per sample: A, C, A for strains 1-3 and A/C, C/C, A/A for individuals 1-3]
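Genotype calls such as A/C or C/C for a diploid individual can be illustrated with a simple maximum-likelihood sketch. This is an assumption-laden toy, not the method behind the slide: it assumes each read samples either chromosome with probability 0.5, errors are spread evenly over the three other bases, error probabilities lie strictly between 0 and 1, and all genotypes are equally likely a priori. The name `call_genotype` is hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def call_genotype(calls):
    """Most likely diploid genotype at one site.

    calls: list of (called_base, error_probability) with 0 < error < 1.
    Returns a string such as "A/C".
    """
    best_genotype, best_loglik = None, -math.inf
    for a1, a2 in combinations_with_replacement(BASES, 2):
        loglik = 0.0
        for called, err in calls:
            # Probability of the call if the read came from allele a1 or a2.
            p1 = (1.0 - err) if called == a1 else err / 3.0
            p2 = (1.0 - err) if called == a2 else err / 3.0
            loglik += math.log(0.5 * p1 + 0.5 * p2)
        if loglik > best_loglik:
            best_genotype, best_loglik = f"{a1}/{a2}", loglik
    return best_genotype


print(call_genotype([("A", 0.01)] * 4 + [("C", 0.01)] * 4))
print(call_genotype([("C", 0.01)] * 8))
```

For inbred strains, where a single allele per sample is expected, the same loop could be restricted to the four homozygous cases.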
SNP calling in low 454 coverage
• DNA courtesy of Chuck Langley, UC Davis
• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?
[Figure: iso-1 reference aligned with a 46-2 454 read and 46-2 ABI reads (2 fwd + 2 rev)]
• 92.9% validation rate (1,342 / 1,443)
• 2.0% missed SNP rate (25 / 1,247)
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
[Figure: alignment examples of a SNP and an insertion (INS)]
• SNP calling error rate very low:
  • Validation rate = 97.8% (224/229)
  • Conversion rate = 92.6% (224/242)
  • Missed SNP rate = 3.75% (26/693)
• INDEL candidates validate and convert at similar rates to SNPs:
  • Validation rate = 89.3% (193/216)
  • Conversion rate = 87.3% (193/221)
SNP calling in AB/SOLiD color-space reads
[Figure: color-space reads compared against the sequence A C G G T C G T C G T G T G C G T]
• No change: A C G G T C G T C G T G T G C G T
• SNP: A C G G T C G C C G T G T G C G T
• Measurement error: A C G G T C G T C G T G T G C G T
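The distinction between a SNP and a measurement error can be seen by encoding the two sequences from this slide in color space. In SOLiD two-base encoding each color describes a pair of adjacent bases, so a single base change alters two adjacent colors, while an isolated color difference is more likely a measurement error. The sketch below is a simplification (it omits the leading primer base that real reads carry), and the function names are illustrative.

```python
# Two-bit base codes; the color of a dinucleotide is the XOR of its two codes,
# which reproduces the standard SOLiD color table (e.g. AA=0, AC=1, AG=2, AT=3).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq: str) -> list:
    """Encode a base sequence as one color per adjacent base pair."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def color_differences(seq_a: str, seq_b: str) -> list:
    """Indices at which the color encodings of two equal-length sequences differ."""
    pairs = zip(to_colors(seq_a), to_colors(seq_b))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


reference = "ACGGTCGTCGTGTGCGT"  # sequence shown on the slide
with_snp = "ACGGTCGCCGTGTGCGT"   # same sequence with one base changed (T to C)

print(to_colors(reference))
print(to_colors(with_snp))
print(color_differences(reference, with_snp))
```

A caller working in color space can use this property as a filter: candidate SNPs should be supported by two adjacent color changes that are consistent with a single base substitution.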
Mutational profiling: deep 454/Illumina/SOLiD data
[Figure: Pichia stipitis reference sequence; image from JGI web site]
• collaboration with Doug Smith at Agencourt
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• determine where the mutations were that caused this phenotype
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
• at about 15X nominal coverage, each technology can find every point mutation with essentially no false positives
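Nominal coverage is total sequenced bases divided by genome size, so the 15X figure for a 15 Mb genome corresponds to about 225 Mb of sequence. The sketch below turns that into an approximate read count; the read lengths passed in are illustrative assumptions, not values reported for this experiment, and `reads_needed` is a hypothetical helper.

```python
def reads_needed(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Smallest number of reads whose total length reaches the nominal coverage."""
    if genome_size_bp <= 0 or coverage <= 0 or read_length_bp <= 0:
        raise ValueError("all arguments must be positive")
    total_bases = genome_size_bp * coverage
    # Ceiling division using integers only.
    return -(-total_bases // read_length_bp)


GENOME_SIZE = 15_000_000  # 15 Mb genome, from the slide
COVERAGE = 15             # about 15X nominal coverage, from the slide

# Read lengths below are assumed for illustration only.
for label, read_length in (("short reads, 35 bp", 35), ("long reads, 250 bp", 250)):
    print(label, reads_needed(GENOME_SIZE, COVERAGE, read_length))
```

Nominal coverage ignores reads that fail to align or align ambiguously, so the usable depth at any given site is typically lower.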
Our software is available for testing http://bioinformatics.bc.edu/marthlab/Beta_Release
Credits
Elaine Mardis (Washington University), Andy Clark (Cornell University), Doug Smith (Agencourt)
Michael Stromberg, Chip Stewart, Michele Busby, Aaron Quinlan, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang
Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
http://bioinformatics.bc.edu/marthlab