KmerStream

KmerStream Streaming algorithms for k-mer abundance estimation Páll Melsted @pmelsted joint work with Bjarni V. Halldórsson

Error rates vs. Quality values • What error rates can we expect from NGS • Specifically whole genome sequencing with Illumina sequencing technology • How informative are quality values • Rubbish? • Worth using for analysis?

Quality values • A probability estimate that the basecall is correct • @SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>C • Phred scale, • Pr[base call incorrect] ~ 10-Q/10 !=33, bad basecall

Error rates • What percentage of basecalls are correct • How to estimate • Align reads to a reference • Count mismatches and non-alignments • Correct for snps and variants. • Reference free • Whole genome assembly?

K-mer counting • Count k-mers, want k large, say ~31. GATTTGGGGTTCAAAGCAGTA GAT ATT TTT TTG TGG ... GAT 2 ATT 3 TTT 2 TTG 3 TGG 2 ... ATTTGGGGTTGATT ATT TTT TTG TGG GGG GGT GTT TTG TGA GAT ATT

Errors and k-mers • Basecall errors impact many k-mers GATTTGGGGTTCAAAGCAGTA GATTT ATTTG TTTGG TTGGG TGGGG GGGGT ... AAGCAG AGCAGT GCAGTA GATTTGGGGTTCAAAGCAGTA GATTT ATTTG TTTGG TTGGG TGGGG GGGGT GGGTT GGTTC

Errors and k-mers • Basecall errors are not independent • Multiple errors more likely • Ends of reads contain more errors • K-mer error rate underestimates true basecall error rate • Discounts reads with many errors or errors at the ends • Can be off by a factor of 2

Frequency histograms • Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1

Naïve method • Assumptions: • Sampling from a genome of size G • Poisson distribution, Poi(λ), of coverage of each position • Each k-mer sampled is an error with probε independently. • When we sample an error k-mer, it is replaced by a single nucleotide substitution at random

Naïve model Sample random position • Probability that a k-mer has coverage 1 • εPr[error k-mer has cov 1]+ (1-ε) Pr[true k-mer has cov 1] Genome length G TGAC Introduce one error ε 1-ε Produce correct k-mer TGGC TGAC

Frequency moments • From the frequency histogram we define • fi = number of k-mers with coverage i • f1 = number of singletons • F0 = number of distinct k-mers = Σ fi • F1 = number of all k-mers = Σi fi

Fitting the model • 3 unknown parameters G, λ, ε • 3 k-mer frequency statistics, f1, F1, F0

Computing the moments • Count all k-mers? – very memory intensive • Sample k-mers (àla KmerGenie) • Streaming algorithm, KmerStream • Estimates f1, F0, F1 directly without storing any k-mers • Accuracy can be specified (default ~2%)

KmerStream • Very fast, 5-10s per million reads • Low memory overhead, ~11M • One pass over the dataset • Uses hashing to sample k-mers adaptively • Lossy counting similar to Bloom filter • Does not keep track of k-mers • 2-3x faster than KmerGenie, 10x better memory

Validation • Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times. KmerStream estimates True kmer counts

Real data • Sequenced at deCODE genetics, 2656 individuals, sequenced at 10x to 30x coverage. • KmerStream run for all samples, model fit to estimate k-mer error rates for k=31

K-mer error rates

Quality cutoff • Keep only k-mers in reads where quality is above q. • Run for q = 0, 13, 20, 30. • Should correspond to upper bound on error of1.0, 0.05, 0.01, 0.001

K-mer error rates Moving from q0 to q13 huge improvement q20 to q30 not recommended, 50% samples increased error rate

Wrap up • Quality values are informative • Can get speed up by prioritizing processing based on quality values e.g. alignment • Error rates are highly variable • Quality value cutoffs can be done on a case by case basis with minimal overhead.

Thank you • Paper on bioRxiv • Code on github.com/pmelsted/KmerStream • Ph.D. position available “Streaming algorithms for whole genome assembly.”

KmerStream

KmerStream

Presentation Transcript