Informatics for next-generation sequence analysis – SNP calling

Gabor T. Marth, Boston College Biology Department
PSB 2008, January 4-8, 2008

Read length and throughput
[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]
• Illumina/Solexa, AB/SOLiD short-read sequencers: 1-4 Gb in 25-50 bp reads
• 454 pyrosequencer: 20-100 Mb in 100-250 bp reads
• ABI capillary sequencer

Current and future application areas
• Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
[Figure: reference genome with a deletion (DEL) and a SNP marked]
• De novo genome sequencing
• Short-read sequencing will be (at least) an alternative to micro-arrays for:
  • DNA-protein interaction analysis (ChIP-Seq)
  • novel transcript discovery
  • quantification of gene expression
  • epigenetic analysis (methylation profiling)

Fundamental informatics challenges (I)
1. Interpreting machine readouts: base calling, base error estimation
2. Dealing with non-uniqueness in the genome: resequenceability
3. Alignment of billions of reads

Informatics challenges (II)
4. SNP, short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management

Resequencing-based SNP discovery
[Figure: workflow starting from the genome reference sequence]
• Read mapping
• Read alignment
• Paralog identification
• SNP detection + inspection

SNP calling workflow
• read alignment
• SNP detection
• visual checking

Bayesian detection algorithm
Inputs:
• Base call + base quality
• Polymorphism rate (prior)
• Base composition
• Depth of coverage
Output: Bayesian posterior probability, i.e. the SNP score
[Figure: per-site base combinations across reads, contrasting polymorphic and monomorphic combinations]
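As a rough illustration of how base calls, base qualities, and a polymorphism-rate prior can combine into a posterior SNP score, here is a minimal sketch for a single diploid sample. It is not the algorithm from the talk: the function names, the flat split of the prior `theta` over non-reference genotypes, and the equal-thirds error model are all assumptions made for this example, and base composition is ignored.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred-scaled base quality to an error probability."""
    return 10 ** (-q / 10)

def base_likelihood(observed, allele, q):
    """P(observed base | true allele), assuming errors are spread evenly
    over the three other bases (an assumption of this sketch)."""
    e = phred_to_error(q)
    return 1 - e if observed == allele else e / 3

def genotype_likelihood(pileup, genotype):
    """P(all reads at the site | diploid genotype); each read is assumed to
    sample either allele with probability 1/2, independently."""
    a1, a2 = genotype
    lik = 1.0
    for base, q in pileup:
        lik *= 0.5 * base_likelihood(base, a1, q) + 0.5 * base_likelihood(base, a2, q)
    return lik

def genotype_priors(ref, theta):
    """Hypothetical prior: homozygous reference gets 1 - theta, and theta is
    shared equally among the nine other genotypes."""
    genotypes = list(combinations_with_replacement(BASES, 2))
    non_ref = [g for g in genotypes if g != (ref, ref)]
    priors = {g: theta / len(non_ref) for g in non_ref}
    priors[(ref, ref)] = 1 - theta
    return priors

def snp_score(pileup, ref, theta=0.001):
    """Posterior probability that the genotype is not homozygous reference."""
    priors = genotype_priors(ref, theta)
    joint = {g: p * genotype_likelihood(pileup, g) for g, p in priors.items()}
    total = sum(joint.values())
    return 1 - joint[(ref, ref)] / total

# pileup = list of (base call, base quality) pairs covering one site
clean_site = [("A", 30)] * 10
het_site = [("A", 30)] * 5 + [("C", 30)] * 5
low_q_mismatch = [("A", 30)] * 9 + [("C", 5)]
```

In this sketch a single low-quality mismatch barely moves the score, while a balanced mix of two high-quality alleles pushes it close to 1.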

Base quality values for SNP calling
• base quality values help us decide if mismatches are true polymorphisms or sequencing errors
• accurate base qualities are crucial, especially in lower coverage
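To make the role of base qualities concrete, the sketch below assumes Phred-scaled qualities, where Q = -10 log10(P_error). The slides do not state which quality scale each platform's base caller reports, so the scale is an assumption, as are the helper names.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality value for a given error probability."""
    return -10 * math.log10(p)

def expected_errors(qualities):
    """Expected number of miscalled bases among the reads covering one site."""
    return sum(phred_to_error_prob(q) for q in qualities)

# Q20 means 1 error in 100 calls; Q30 means 1 in 1,000.
q20_error = phred_to_error_prob(20)
q30_error = phred_to_error_prob(30)
```

With only a few reads over a site, one mismatch at Q10 is far more likely to be a sequencing error than one at Q30, which is why miscalibrated qualities hurt most at low coverage.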

Priors for specific resequencing scenarios
[Figure: reads AACGTTAGCATA and AACGTTCGCATA, differing at one site, grouped by source: individuals 1-3 and strains 1-3]

Consensus sequence generation (genotyping)
[Figure: reads AACGTTAGCATA and AACGTTCGCATA grouped by source (strains 1-3, individuals 1-3), with consensus calls A, A/C, C, C/C, A, A/A]
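A consensus call of this kind can be sketched as a maximum-likelihood choice among candidate genotypes at one site. The example below is an illustrative simplification, not the method from the talk: it uses allele counts rather than per-base qualities, a fixed per-read error rate, and a hypothetical `ploidy` switch to separate haploid or inbred strains from diploid individuals. The read counts are assumptions loosely based on the slide's example.

```python
def call_genotype(n_ref, n_alt, ref="A", alt="C", ploidy=2, error=0.01):
    """Return the most likely call given counts of reads showing each allele.

    The binomial coefficient is the same for every candidate genotype,
    so it is left out of the likelihood.
    """
    # Probability that one read shows the alt allele under each candidate call.
    if ploidy == 1:
        p_alt = {ref: error, alt: 1 - error}
    elif ploidy == 2:
        p_alt = {
            f"{ref}/{ref}": error,
            f"{ref}/{alt}": 0.5,
            f"{alt}/{alt}": 1 - error,
        }
    else:
        raise ValueError("only ploidy 1 or 2 is handled in this sketch")
    likelihood = {g: (p ** n_alt) * ((1 - p) ** n_ref) for g, p in p_alt.items()}
    return max(likelihood, key=likelihood.get)

# Illustrative counts: (reads showing A, reads showing C)
individual_1 = call_genotype(5, 2)        # mixed alleles
individual_2 = call_genotype(0, 6)        # all C
individual_3 = call_genotype(5, 0)        # all A
strain_2 = call_genotype(0, 2, ploidy=1)  # haploid or inbred strain
```

Restricting strains to homozygous calls is one way to encode a scenario-specific prior: a heterozygous call is simply not a candidate for them.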

SNP calling in Roche/454 pyrosequences

SNP calling in low 454 coverage
• DNA courtesy of Chuck Langley, UC Davis
• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American D. melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?
[Figure: iso-1 reference aligned with a 46-2 454 read and 46-2 ABI reads (2 fwd + 2 rev)]
• 92.9% validation rate (1,342 / 1,443)
• 2.0% missed SNP rate (25 / 1,247)

SNP calling in Illumina/Solexa short-reads

SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain) compared with Pasadena, CB4858 (1 ½ machine runs)
• SNP calling error rate very low:
  • Validation rate = 97.8% (224/229)
  • Conversion rate = 92.6% (224/242)
  • Missed SNP rate = 3.75% (26/693)
• INDEL candidates validate and convert at similar rates to SNPs:
  • Validation rate = 89.3% (193/216)
  • Conversion rate = 87.3% (193/221)
[Figure: example SNP and insertion (INS) calls]

SNP calling in AB/SOLiD color-space reads
[Figure: a reference sequence compared with reads in three cases]
Reference: A C G G T C G T C G T G T G C G T
• No change: A C G G T C G T C G T G T G C G T
• SNP: A C G G T C G C C G T G T G C G T
• Measurement error: A C G G T C G T C G T G T G C G T
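The distinction between a SNP and a measurement error can be illustrated with the commonly described SOLiD two-base encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). Under that encoding a single base substitution changes two adjacent colors by the same XOR offset, while an isolated single-color change is not consistent with any substitution. The sketch below is a simplification: the function names are hypothetical, and the known primer base that anchors real color-space reads is left out.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Encode a base sequence as colors, one per adjacent base pair."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def classify(ref_colors, read_colors):
    """Classify a read against the reference using color differences only."""
    diffs = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    if not diffs:
        return "no change"
    if len(diffs) == 1:
        return "measurement error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # A substitution shifts both colors by the same XOR offset.
        if ref_colors[i] ^ read_colors[i] == ref_colors[j] ^ read_colors[j]:
            return "SNP candidate"
    return "other"

reference = "ACGGTCGTCGTGTGCGT"
snp_read = "ACGGTCGCCGTGTGCGT"  # T -> C substitution, as on the slide

ref_colors = to_colors(reference)
snp_colors = to_colors(snp_read)

# Simulate a single miscalled color in an otherwise matching read.
error_colors = list(ref_colors)
error_colors[7] ^= 1
```

Here `classify(ref_colors, snp_colors)` flags the substitution because two neighboring colors change consistently, whereas `classify(ref_colors, error_colors)` reports a measurement error.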

Mutational profiling: deep 454/Illumina/SOLiD data
[Figure: Pichia stipitis reference sequence; image from JGI web site]
• collaboration with Doug Smith at Agencourt
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• goal: determine which mutations caused this phenotype
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
• at about 15X nominal coverage, each technology can find every point mutation with essentially no false positives
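For a sense of scale, nominal coverage is simply the total number of sequenced bases divided by the genome size. The sketch below applies that to the 15 Mb genome and 15X figure quoted above. The Poisson estimate of uncovered bases assumes reads land uniformly at random, which real data only approximate, and the helper names are hypothetical.

```python
import math

def nominal_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size

def bases_needed(coverage, genome_size):
    """Total sequenced bases required to reach a target nominal coverage."""
    return coverage * genome_size

def fraction_uncovered(coverage):
    """Expected fraction of positions with zero reads, under a Poisson model
    (an idealization: uniform, independent read placement)."""
    return math.exp(-coverage)

GENOME_SIZE = 15_000_000  # 15 Mb, as stated on the slide
TARGET = 15               # about 15X nominal coverage

required = bases_needed(TARGET, GENOME_SIZE)  # 225,000,000 bases
```

Under these assumptions, 15X over a 15 Mb genome corresponds to about 225 Mb of sequence.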

Our software is available for testing http://bioinformatics.bc.edu/marthlab/Beta_Release

Credits
Michael Stromberg, Chip Stewart, Michele Busby, Aaron Quinlan, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang
Elaine Mardis (Washington University), Andy Clark (Cornell University), Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
http://bioinformatics.bc.edu/marthlab
