Data formats. Gabor T. Marth Boston College for folks developing data standards for 1000G analysis 1000 Genomes Meeting Philadelphia, November 10-11, 2008. Why have standard formats?. slide courtesy of Richard Durbin. Standard formats.
Data formats Gabor T. Marth Boston College for folks developing data standards for 1000G analysis 1000 Genomes Meeting Philadelphia, November 10-11, 2008
Why have standard formats? slide courtesy of Richard Durbin
Standard formats • aggregate data from different platforms on a common footing (ABI/capillary, 454 FLX, 454 GS20, Illumina)
Standard formats • provide algorithms with a well-defined input and output • plug alternate tools into pipeline • compare performance • integrate results across different algorithms • capture “checkpoints” in the analysis pipeline
Read data formats – SRF and FASTQ • What is the data: trace information, base calls, base qualities • Produced by base callers, used by read mappers/aligners • Standard formats: FASTQ, SRF
Read data formats – SRF and FASTQ SRF (Sequence Read Format): • designed to store machine-specific trace information, alternative base calls, extended base quality value schemes • complex format • used mostly for archiving FASTQ: • stores only base calls + 1 Q-value per base • simple format • the same for all platforms • the de facto format for downstream analysis • is there information in SRF (but not in FASTQ) that is required by downstream analysis?
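As an illustration of "base calls + 1 Q-value per base", here is a minimal Python sketch of a FASTQ reader. It assumes unwrapped four-line records and the Sanger quality encoding (Phred score = ASCII code minus 33); some platforms of the period used other offsets, so `offset` is a parameter. The names `FastqRecord` and `read_fastq` are hypothetical, chosen for this example only.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str         # read name, without the leading '@'
    bases: str        # base calls
    quals: List[int]  # one Phred quality value per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumption: sequences are not wrapped across lines, and qualities are
    encoded as chr(Q + offset), with offset 33 for the Sanger convention.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(bases) != len(qual_line):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        quals = [ord(c) - offset for c in qual_line]
        yield FastqRecord(header[1:], bases, quals)


example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0])
```

Nothing platform-specific survives in the record, which is what makes the format identical across platforms and also why trace-level information is unavailable to tools that read only FASTQ.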
Alignment formats • What is the data? • generated by read mappers / aligners / assemblers • used by e.g. allele callers, SV callers
Alignment formats • A standard format (SAM, TAM, BAM) is being defined (Heng Li [Sanger], Bob Handsaker [Broad], etc.)… a standard is within reach • Compatible with all technologies (AB?), allows aggregation of data from different individuals, different platforms • “Lean and mean”: cannot be all-encompassing • Remaining issues: gapped / padded alignments, read pairs, compression, indexing • Extremely high priority for 1000G data analysis
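To make the alignment record concrete, the following Python sketch parses one tab-delimited SAM line. The slide describes a format still being defined, so the field layout here is an assumption taken from the SAM specification as it was later published (11 mandatory columns, bit-flag field, CIGAR string). The names `SamRecord`, `parse_sam_line` and `reference_span`, and the example read, are hypothetical.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags (0x1 = paired, 0x10 = reverse strand)
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # gapped-alignment description
    seq: str    # read bases
    qual: str   # base qualities, ASCII-encoded


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory columns of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    # Columns 7-9 (mate reference, mate position, template length) are skipped.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: only M, D, N, = and X consume reference."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")


line = "read1\t99\tchr1\t100\t60\t3M1I2M1D2M\t=\t250\t160\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.pos + reference_span(rec.cigar) - 1)
```

The CIGAR string is how a gapped alignment is represented without padding the reference, and the flag bits carry the read-pair information; both are among the "remaining issues" listed on the slide.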
SNP / short-INDEL allele calling • Data: SNP probability, individual genotype probabilities • Produced by SNP caller, used by downstream analysis
Genotype likelihood format: GLF
• Reads covering the site: B1 = aacc (sample 1), Bi = aaaacc (sample i), Bn = cccc (sample n)
• “Genotype likelihoods”, sample 1: P(B1=aacc|G1=aa), P(B1=aacc|G1=cc), P(B1=aacc|G1=ac)
• “Genotype likelihoods”, sample i: P(Bi=aaaacc|Gi=aa), P(Bi=aaaacc|Gi=cc), P(Bi=aaaacc|Gi=ac)
• “Genotype likelihoods”, sample n: P(Bn=cccc|Gn=aa), P(Bn=cccc|Gn=cc), P(Bn=cccc|Gn=ac)
• Prior(G1,..,Gi,..,Gn)
• “Genotype probabilities”, sample 1: P(G1=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(G1=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(G1=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
• “Genotype probabilities”, sample i: P(Gi=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gi=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gi=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
• “Genotype probabilities”, sample n: P(Gn=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gn=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gn=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
• P(SNP)
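The distinction between genotype likelihoods P(B|G) and genotype probabilities P(G|B) can be sketched in Python. This is not the GLF file layout or any particular caller's model; it assumes a single uniform per-base error rate, independent bases, and an independent prior per sample (the slide's prior is joint over all samples). The error rate and the prior values are invented for illustration, and the names `genotype_likelihood` and `genotype_posteriors` are hypothetical.

```python
import math
from typing import Dict

GENOTYPES = ("aa", "ac", "cc")


def genotype_likelihood(bases: str, genotype: str, error: float = 0.01) -> float:
    """P(B | G) under an assumed uniform error model.

    Each observed base comes from one of the two alleles with probability 1/2,
    is read correctly with probability 1 - error, and is miscalled as any one
    specific other base with probability error / 3.
    """
    likelihood = 1.0
    for b in bases:
        per_allele = [(1.0 - error) if b == allele else error / 3.0 for allele in genotype]
        likelihood *= 0.5 * per_allele[0] + 0.5 * per_allele[1]
    return likelihood


def genotype_posteriors(bases: str, prior: Dict[str, float]) -> Dict[str, float]:
    """P(G | B) by Bayes' rule: likelihood times prior, normalized over genotypes."""
    joint = {g: genotype_likelihood(bases, g) * prior[g] for g in GENOTYPES}
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}


# Illustrative per-sample prior (an assumption, not taken from the slides).
prior = {"aa": 0.9985, "ac": 0.001, "cc": 0.0005}

for sample, bases in (("B1", "aacc"), ("Bi", "aaaacc"), ("Bn", "cccc")):
    post = genotype_posteriors(bases, prior)
    best = max(post, key=post.get)
    print(sample, bases, best, round(post[best], 3))
```

The likelihoods depend only on one sample's reads, so they can be computed once and stored; the probabilities change whenever the prior or the set of samples changes, which is one reason to keep the two quantities separate.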