CRAM: reference-based compression format developed by Vadim Zalunin
Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.
The need for compression. Red alert.
Compression, what is it? LOSSLESS: BMP, 190 kb; PNG, 100 kb. LOSSY: JPG, 21 kb; JPG, 4 kb.
Compression, when we know what to expect. LOSSLESS: BMP, 145 kb; PNG, 2 kb. LOSSY: JPG, 6 kb; JPG, 3 kb. But the actual message is only 40 characters (bytes) long!
Compression at its best. "Five little ducks went swimming one day". IMAGE, 145 kb → compress → TEXT, 40 b → uncompress → IMAGE, 145 kb. ~3500 times more efficient.
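The figures on the slides above can be checked with a few lines of Python. This is only an illustrative sketch: it uses the standard `zlib` module as a stand-in for any lossless compressor, and it assumes that "kb" on the slide means 1000 bytes.

```python
import zlib

message = b"Five little ducks went swimming one day"
payload = message * 1000  # very predictable input: the same sentence repeated

# Lossless: decompressing returns exactly the original bytes.
packed = zlib.compress(payload)
restored = zlib.decompress(packed)
assert restored == payload

# Predictable data shrinks a lot, because the compressor "knows what to expect".
repeat_ratio = len(payload) / len(packed)

# Slide figures: a 145 kb image versus a 40-byte text message.
IMAGE_BYTES = 145 * 1000  # assumption: 1 kb = 1000 bytes
TEXT_BYTES = 40
slide_ratio = IMAGE_BYTES / TEXT_BYTES

print(f"repeated text: {repeat_ratio:.0f}x smaller")
print(f"image vs text: {slide_ratio:.0f}x smaller")
```

With these assumptions the image-to-text ratio comes out at a few thousand, in line with the "~3500 times" quoted on the slide.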
What are we talking about? bug (the bug’s DNA is hidden somewhere) → sample → sequencing machines → bunch of huge files.
Looking closer at the data. The bunch of huge files boils down to a long list of reads: read 1, read 2, read 3, ….., read bazillion. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.
What is a Read? An excerpt from a FASTQ file:
@SRR081241.20758946   (read name)
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…   (read bases)
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…   (read quality scores)
Bases: ACGTN. Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).
What is a quality score? The quality score is a Phred quality score encoded as ASCII symbols 33-126. Basically: higher scores are better, so ‘!’ is bad and ‘I’ is good.
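To make the encoding concrete, here is a minimal sketch that splits a four-line FASTQ record and decodes its quality string. The function names are hypothetical, and the sample record uses only the first ten bases and quality symbols of the excerpt shown above, since the rest is elided on the slide. The conversion from a Phred score Q to an error probability, P = 10^(-Q/10), is the standard Phred definition.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, bases, quality string)."""
    header, bases, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("bases and quality scores differ in length")
    return header[1:], bases, quality


def phred_scores(quality, offset=33):
    """Decode ASCII quality symbols into integer Phred scores ('!' -> 0)."""
    return [ord(symbol) - offset for symbol in quality]


def error_probability(score):
    """Phred definition: probability that the base call is wrong."""
    return 10 ** (-score / 10)


# Only the first ten symbols of the slide's excerpt (the rest is elided there).
record = ["@SRR081241.20758946", "CCAGATCCTG", "+", "IDCEFFGGHH"]
name, bases, quality = parse_fastq_record(record)
scores = phred_scores(quality)

for base, symbol, score in zip(bases, quality, scores):
    print(f"{base} {symbol} Q{score} p(error)={error_probability(score):.5f}")
```

Here ‘I’ decodes to Q40, an error probability of 1 in 10,000, while ‘!’ decodes to Q0.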
Reference-based encoding: read start position, read end position.
Reference-based encoding: mismatching bases.
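The idea on these two slides can be sketched in a few lines: instead of storing the bases of a read, store where it aligns on the reference and only the bases that differ. This is a simplified illustration and not the actual CRAM format; the function names and the toy reference are hypothetical, and only substitutions are handled (no insertions, deletions, or clipping).

```python
def encode_read(reference, read, start):
    """Describe a read as (start, length, mismatches) against the reference.

    mismatches is a list of (offset within the read, read base) pairs.
    """
    segment = reference[start:start + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(segment, read))
        if ref_base != read_base
    ]
    return start, len(read), mismatches


def decode_read(reference, start, length, mismatches):
    """Rebuild the read from the reference plus the stored mismatches."""
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"   # toy reference, made up for illustration
read = "GTACCTAC"                # aligns at position 2 with one mismatch

encoded = encode_read(reference, read, start=2)
print(encoded)
assert decode_read(reference, *encoded) == read
```

A read that matches the reference perfectly reduces to a start position and a length, which is where most of the saving comes from.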
Lossy quality scores. Approach 1 (horizontal): Quality scores are usually values from 0 to 39. Let’s shrink them, so that they are from 0 to 7 now. Approach 2 (vertical): Let’s treat quality scores using alignment information. For example: preserve only quality scores for mismatching bases.
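Both approaches can be sketched briefly. The code below is an assumption-laden illustration, not the binning scheme used by any particular tool: `bin_quality` maps scores 0 to 39 onto eight evenly sized bins, and `keep_mismatch_qualities` keeps scores only at the read offsets that mismatch the reference. Both function names are hypothetical.

```python
def bin_quality(score, levels=8, max_score=39):
    """Approach 1: shrink a score in 0..max_score to a bin in 0..levels-1.

    Scores outside the range are clamped first (an assumption of this sketch).
    """
    clamped = min(max(score, 0), max_score)
    return clamped * levels // (max_score + 1)


def keep_mismatch_qualities(scores, mismatch_offsets):
    """Approach 2: keep quality scores only where the read mismatches."""
    return {offset: scores[offset] for offset in sorted(set(mismatch_offsets))}


scores = [40, 35, 34, 36, 37]

binned = [bin_quality(score) for score in scores]
kept = keep_mismatch_qualities(scores, mismatch_offsets=[4])

print(binned)  # eight possible values need 3 bits each instead of 6
print(kept)    # every other score is discarded
```

Either way the original scores cannot be recovered exactly, which is what makes these approaches lossy.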
Comparison study: 1K Genomes exomes. BAM → compress → CRAM → uncompress → BAM. The original BAM and the restored BAM each go through some analysis pipeline, giving original SNPs and restored SNPs.
CRAM NGS data compression. [Chart: bits/base, from bad to good, for untreated data (do nothing), CRAM lossless, CRAM lossy, and CRAM very lossy.]
Progressive application of compression. [Chart: sample accessibility (hard to easy) against sample value (low to high), with compression levels labeled lossless, 2-fold, 20-fold, and 200-fold.]
References
More information:
• http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list:
• http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications:
• Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
• Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. GigaScience 1