
CRAM: reference-based compression format developed by Vadim Zalunin



Presentation Transcript


  1. CRAM: reference-based compression format developed by Vadim Zalunin

  2. Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.

  3. The need for compression. Red alert.

  4. Compression, what is it? Lossless: BMP, 190 kb; PNG, 100 kb. Lossy: JPG, 21 kb; JPG, 4 kb.

  5. Compression, when we know what to expect. Lossless: BMP, 145 kb; PNG, 2 kb. Lossy: JPG, 6 kb; JPG, 3 kb. But the actual message is only 40 characters (bytes) long!

  6. Compression at its best. "Five little ducks went swimming one day". Compress: IMAGE, 145 kb, to TEXT, 40 b. Uncompress: TEXT, 40 b, to IMAGE, 145 kb. ~3500 times more efficient.
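As a rough check of the figure on the slide above, the ratio can be computed directly. This is a minimal sketch that assumes 1 kb means 1000 bytes; the function name `compression_ratio` is made up for illustration.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return how many times smaller the compressed form is."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


message = "Five little ducks went swimming one day"
message_bytes = len(message.encode("ascii"))  # the slide rounds this to 40

image_bytes = 145 * 1000  # assumption: 1 kb = 1000 bytes
ratio = compression_ratio(image_bytes, 40)  # 40 b is the size quoted on the slide
print(f"message: {message_bytes} bytes, ratio: about {ratio:.0f}x")
```

The result is in the same range as the "~3500 times" quoted on the slide; the exact value depends on how a kilobyte is counted.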

  7. What are we talking about? A bug (the bug's DNA is hidden somewhere), then a sample, then sequencing machines, then a bunch of huge files.

  8. Looking closer at the data. The bunch of huge files boils down to a long list of reads: read 1, read 2, read 3, ….., read bazillion. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.

  9. What is a Read? An excerpt from a FASTQ file: @SRR081241.20758946 CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

  10. What is a Read? An excerpt from a FASTQ file: @SRR081241.20758946 (read name) CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

  11. What is a Read? An excerpt from a FASTQ file: @SRR081241.20758946 (read name) CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… (read bases) + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI… Bases: ACGTN

  12. What is a Read? An excerpt from a FASTQ file: @SRR081241.20758946 (read name) CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG… (read bases) + IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI… (read quality scores) Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).
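The three parts of a read named on the slides above (read name, read bases, read quality scores) can be pulled out of a four-line FASTQ record as follows. This is a minimal sketch, not a full FASTQ parser: it assumes each record occupies exactly four lines, and the names `FastqRecord` and `parse_fastq_record` as well as the sample record are hypothetical.

```python
from typing import NamedTuple, Sequence


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: str


def parse_fastq_record(lines: Sequence[str]) -> FastqRecord:
    """Parse one record: '@name', bases, '+' separator, quality string."""
    if len(lines) != 4:
        raise ValueError("a record is assumed to span exactly four lines")
    header, bases, separator, qualities = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("read name line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(bases) != len(qualities):
        raise ValueError("one quality symbol is expected per base")
    if set(bases) - set("ACGTN"):
        raise ValueError("bases are expected to be drawn from ACGTN")
    return FastqRecord(header[1:], bases, qualities)


# Hypothetical short record, made up for illustration.
record = parse_fastq_record(["@example.1\n", "ACGTN\n", "+\n", "II5I!\n"])
print(record.name, record.bases, record.qualities)
```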

  13. What is a quality score? The quality score is a Phred quality score encoded as ASCII symbols 33-126. Basically: higher scores are better, so ‘!’ is bad and ‘I’ is good.
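To make the encoding on the slide above concrete, the sketch below turns a quality symbol back into a number. The offset of 33 comes from the slide; the error-probability formula is the standard Phred definition, Q = -10 log10(p), which the slides do not spell out. The function names are made up for illustration.

```python
def phred_from_ascii(symbol: str, offset: int = 33) -> int:
    """Decode one quality symbol into its Phred score."""
    score = ord(symbol) - offset
    if score < 0:
        raise ValueError("symbol is below the ASCII offset")
    return score


def error_probability(phred_score: float) -> float:
    """Standard Phred definition: p = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


for symbol in "!I":
    q = phred_from_ascii(symbol)
    print(symbol, q, error_probability(q))
```

Under this definition ‘!’ decodes to score 0 (the base call is no better than a guess) and ‘I’ decodes to score 40 (roughly one error in 10,000 calls).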

  14. Reference-based encoding. Read start position. Read end position.

  15. Reference-based encoding.

  16. Reference-based encoding. Mismatching bases.
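The idea on the three slides above (store where a read aligns, plus only the bases that differ from the reference) can be sketched as follows. This is an illustration under simplifying assumptions, not the actual CRAM encoding: it handles substitutions only (no insertions, deletions or clipping), and the record layout and the names `encode_read` and `decode_read` are hypothetical.

```python
from typing import Dict, List, Tuple


def encode_read(reference: str, read: str, start: int) -> Dict[str, object]:
    """Describe a read as start position, length and mismatching bases."""
    segment = reference[start:start + len(read)]
    if len(segment) != len(read):
        raise ValueError("read does not fit inside the reference")
    mismatches: List[Tuple[int, str]] = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(segment, read))
        if ref_base != base
    ]
    return {"start": start, "length": len(read), "mismatches": mismatches}


def decode_read(reference: str, record: Dict[str, object]) -> str:
    """Rebuild the read from the reference and the stored differences."""
    start, length = record["start"], record["length"]
    bases = list(reference[start:start + length])
    for offset, base in record["mismatches"]:
        bases[offset] = base
    return "".join(bases)


# Made-up reference and read for illustration.
reference = "ACGTACGTACGT"
read = "GTAAGT"  # differs from reference[2:8] == "GTACGT" at offset 3
record = encode_read(reference, read, start=2)
print(record)
print(decode_read(reference, record) == read)
```

A read that matches the reference exactly carries an empty mismatch list, which is where the saving comes from when most bases agree with the reference.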

  17. Lossy quality scores: horizontal, vertical. Approach 1: Quality scores are usually values from 0 to 39. Let’s shrink them so that they range from 0 to 7. Approach 2: Let’s treat quality scores using alignment information. For example: preserve quality scores only for mismatching bases.
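Both approaches on the slide above can be sketched in a few lines. The slide does not say how the range 0 to 39 is mapped onto 0 to 7, so equal-width bins of five scores each are an assumption here, and the function names are made up for illustration.

```python
from typing import Dict, Iterable, Sequence


def bin_quality(score: int, max_score: int = 39, bins: int = 8) -> int:
    """Approach 1: shrink a score in 0..max_score to a bin in 0..bins-1.

    Assumes equal-width bins; with the defaults each bin covers 5 scores.
    """
    width = (max_score + 1) // bins
    clamped = min(max(score, 0), max_score)
    return min(clamped // width, bins - 1)


def keep_mismatch_qualities(
    qualities: Sequence[int], mismatch_offsets: Iterable[int]
) -> Dict[int, int]:
    """Approach 2: keep scores only at positions that mismatch the reference."""
    return {offset: qualities[offset] for offset in sorted(set(mismatch_offsets))}


scores = [0, 4, 5, 22, 39]
print([bin_quality(s) for s in scores])
print(keep_mismatch_qualities([30, 31, 12, 38], mismatch_offsets=[2]))
```

Both steps discard information: binned scores cannot be restored to their original values, and scores at matching positions are dropped entirely.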

  18. Comparison study: 1K Genomes exomes. BAM, compress, CRAM, uncompress, BAM.

  19. Comparison study: 1K Genomes exomes. BAM, compress, CRAM, uncompress, BAM. The original BAM and the restored BAM each go through some analysis pipeline.

  20. Comparison study: 1K Genomes exomes. BAM, compress, CRAM, uncompress, BAM. The original BAM and the restored BAM each go through some analysis pipeline, giving original SNPs and restored SNPs.

  21. Comparison study: 1K Genomes exomes.

  22. CRAM NGS data compression. [Chart: bits/base, from bad to good, for Untreated, CRAM lossless, CRAM lossy and CRAM very lossy; groups: Do nothing, Lossless, Lossy.]

  23. Progressive application of compression. [Chart: sample accessibility (hard to easy) against sample value (low to high); labels: 20-fold, Lossless, 200-fold, 2-fold.]

  24. References
      More information:
      • http://www.ebi.ac.uk/ena/about/cram_toolkit
      Mailing list:
      • http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
      Publications:
      • Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
      • Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. Gigascience 1
