1 / 21

KmerStream

KmerStream. Streaming algorithms for k- mer abundance estimation P áll Melsted @ pmelsted joint work with Bjarni V. Halld órsson. Error rates vs. Quality values . What error rates can we expect from NGS Specifically whole genome sequencing with Illumina sequencing technology

izzy
Download Presentation

KmerStream

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. KmerStream Streaming algorithms for k-mer abundance estimation Páll Melsted @pmelsted joint work with Bjarni V. Halldórsson

  2. Error rates vs. Quality values • What error rates can we expect from NGS • Specifically whole genome sequencing with Illumina sequencing technology • How informative are quality values • Rubbish? • Worth using for analysis?

  3. Quality values • A probability estimate that the basecall is correct • @SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>C • Phred scale, • Pr[base call incorrect] ~ 10-Q/10 !=33, bad basecall

  4. Error rates • What percentage of basecalls are correct • How to estimate • Align reads to a reference • Count mismatches and non-alignments • Correct for snps and variants. • Reference free • Whole genome assembly?

  5. K-mer counting • Count k-mers, want k large, say ~31. GATTTGGGGTTCAAAGCAGTA GAT ATT TTT TTG TGG ... GAT 2 ATT 3 TTT 2 TTG 3 TGG 2 ... ATTTGGGGTTGATT ATT TTT TTG TGG GGG GGT GTT TTG TGA GAT ATT

  6. Errors and k-mers • Basecall errors impact many k-mers GATTTGGGGTTCAAAGCAGTA GATTT ATTTG TTTGG TTGGG TGGGG GGGGT ... AAGCAG AGCAGT GCAGTA GATTTGGGGTTCAAAGCAGTA GATTT ATTTG TTTGG TTGGG TGGGG GGGGT GGGTT GGTTC

  7. Errors and k-mers • Basecall errors are not independent • Multiple errors more likely • Ends of reads contain more errors • K-mer error rate underestimates true basecall error rate • Discounts reads with many errors or errors at the ends • Can be off by a factor of 2

  8. Frequency histograms • Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1

  9. Naïve method • Assumptions: • Sampling from a genome of size G • Poisson distribution, Poi(λ), of coverage of each position • Each k-mer sampled is an error with probε independently. • When we sample an error k-mer, it is replaced by a single nucleotide substitution at random

  10. Naïve model Sample random position • Probability that a k-mer has coverage 1 • εPr[error k-mer has cov 1]+ (1-ε) Pr[true k-mer has cov 1] Genome length G TGAC Introduce one error ε 1-ε Produce correct k-mer TGGC TGAC

  11. Frequency moments • From the frequency histogram we define • fi = number of k-mers with coverage i • f1 = number of singletons • F0 = number of distinct k-mers = Σ fi • F1 = number of all k-mers = Σi fi

  12. Fitting the model • 3 unknown parameters G, λ, ε • 3 k-mer frequency statistics, f1, F1, F0

  13. Computing the moments • Count all k-mers? – very memory intensive • Sample k-mers (àla KmerGenie) • Streaming algorithm, KmerStream • Estimates f1, F0, F1 directly without storing any k-mers • Accuracy can be specified (default ~2%)

  14. KmerStream • Very fast, 5-10s per million reads • Low memory overhead, ~11M • One pass over the dataset • Uses hashing to sample k-mers adaptively • Lossy counting similar to Bloom filter • Does not keep track of k-mers • 2-3x faster than KmerGenie, 10x better memory

  15. Validation • Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times. KmerStream estimates True kmer counts

  16. Real data • Sequenced at deCODE genetics, 2656 individuals, sequenced at 10x to 30x coverage. • KmerStream run for all samples, model fit to estimate k-mer error rates for k=31

  17. K-mer error rates

  18. Quality cutoff • Keep only k-mers in reads where quality is above q. • Run for q = 0, 13, 20, 30. • Should correspond to upper bound on error of1.0, 0.05, 0.01, 0.001

  19. K-mer error rates Moving from q0 to q13 huge improvement q20 to q30 not recommended, 50% samples increased error rate

  20. Wrap up • Quality values are informative • Can get speed up by prioritizing processing based on quality values e.g. alignment • Error rates are highly variable • Quality value cutoffs can be done on a case by case basis with minimal overhead.

  21. Thank you • Paper on bioRxiv • Code on github.com/pmelsted/KmerStream • Ph.D. position available “Streaming algorithms for whole genome assembly.”

More Related