From calling bases to calling variants: Experiences with Illumina data

From calling bases to calling variants:Experiences with Illumina data Gerton Lunter Wellcome Trust Centre for Human Genetics

This talk • Refresher: Illumina sequencing • QC • What can go wrong • Useful QC statistics • Read mapping • Comparison of popular read mappers • Stampy • Indel and SNP calling • (Some results: 1000 Genomes indel calls)

7x Illumina GA-II 2x Roche 454 1x Illumina HiSeq 2000

1. Refresher: Illumina sequencing

Illumina sequencing

Illumina sequencing • 8 lanes… x 120 tiles x108 bp x 2 reads… = about 48 Gb raw bp

2. QC

Quality issues • Bases are identified by their fluorescent tag • Overlapping emission spectra • Single base per cycle: reversible terminator chemistry • Not perfect: fraction lags, fraction runs ahead: “Phasing” • Limits read length • Optimizing yield: cluster density • Higher densities mean more errors • Above an optimum, yield decreases • Partly signal processing issue: software improvements • Low amounts of initial DNA • Linker-linker hybrids; duplicated reads

Overlapping fluorescence spectra • C/A and G/T overlap • (Most common mutations are transitions, A-G and C-T) Rougemont et al, 2008

Refresher: Phred scores • Phred score = 10 log10( probability of error ) • 10: 10% error probability • 20: 1% error probability • 30: 0.1% error probability (one in 1,000) • 3 = 50%, 7 = 20% • 13 = 5%, 17 = 2% • 23 = 0.5%, 27 = 0.2% • 33 = 0.05%, 37 = 0.02%

Phasing June 2010 August 2009

Cluster density & other improvements August 2009: June 2010:

Library complexity, duplicate reads • Some sequences are read several times: • Low amount of initial material, many PCR copies • Optical duplicates; secondary cluster seeding • Problem for variant calling • Any PCR error will be seen twice: evidence for variant • Rate of duplicates is rarely >5% • Criterion: both ends of a PE read map to matching location • Can occur by chance, but low probability, except for very high coverage • Post processing: duplicate removal • Standard processing step (e.g. Samtools, Picard) • Useful statistic: • Duplicate fraction is approximately additive across lanes (same library) • 2x duplication fraction ≈ fraction of the library that was sequenced

Library complexity, duplicate reads Fraction α of all molecules is sequenced Number of times a PCR copy is sequenced: Poisson(α) Expected fraction of duplicates: e-α-1+α As a fraction of all reads sequenced: (e-α-1+α)/α = ½ α + …

Sequencing QC

QC statistics

QC statistics - coverage

QC statistics – quality scores

GATK recalibration tool

3. Read mapping

Read mapping • First processing step after sequencing: • Read mapping (most times) • Assembly (no reference sequence; specialized analyses) • Quality of mapping determines downstream results • Accessible genome • Biases (ref vs. variant) • Sensitivity (divergent reference; snps, indels, SV) • Specificity (calibration of mapping quality)

Read mapper comparison • Read mappers: • Maq • BWA • Eland • Novoalign • Stampy • Criteria: • Sensitivity (overall; divergent reference; variants) • Specificity (mapping quality calibration) • Speed

Sensitivity

Sensitivity - indels

Sensitivity – Divergent reference

Specificity – ROC curves ROC - indels

Performance on real data Proportion mapped to within 10kb of mate

Efficiency

Stampy – first part of algorithm read 4 bytes x 229 entry (2 Gb) hash table 15 bp subsequence candidate positions Remove rev-comp symmetry 29 bit word open addressing,cache-friendly

Second part: Fast candidate alignment Single-instruction-multiple-data (SIMD), parallel execution Affine gap penalties. Linear-time and constant memory algorithm: DP table in registers. Maximum indel size 15 bp.

Third part: Modeling mapping failures • Pseudo-bayesian posterior (using candidates, rather than all mapping positions) • Failure to find the correct candidate (2 or more mismatches in every 15bp subsequence) • Sequence not in reference (is sequence match better than expected best random match?)

4. SNP and indel calling

SNP calling • General idea: • Works quite well! Some caveats: • Include mapping quality: P(read|g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map) • Mapping errors are dependent: don’t include mapQ<10 • Base errors are not uniform (A/C/G/T): assume worst case (all identical) • Assumes no anomalies (segdups; alignments; indel/SV; …) • Hard problem: be conservative • Expected SNP rate (human): 10-3/nt. FPR of 10-5 required for 1% FDR • Filtering is required to achieve good FDR – or all data features must be adequately modeled

Indel calling • General idea: • Differences with SNP calling: • Pseudo-Bayes: cannot consider all possible variants/genotypesGenerate large set of candidatesFilter using goodness-of-fit test • Illumina reads do not have an explicit indel error model

Indel error model Homopolymer run length

Wrap up • GA-II produces large amounts of good data • Artefacts do occur, keep a look at QC statistics • Choice of mapper influences yield and quality • Variant calling: • Bayesian approaches work well • Some assumptions (independence) not met, hard to model • Filtering remains necessary

From calling bases to calling variants: Experiences with Illumina data