
Data formats

Gabor T. Marth, Boston College. For folks developing data standards for 1000G analysis. 1000 Genomes Meeting, Philadelphia, November 10-11, 2008.


Presentation Transcript


  1. Data formats
      Gabor T. Marth, Boston College
      for folks developing data standards for 1000G analysis
      1000 Genomes Meeting, Philadelphia, November 10-11, 2008

  2. Why have standard formats?
      (slide courtesy of Richard Durbin)

  3. Standard formats
      • aggregate data from different platforms on a common footing: ABI/capillary, 454 FLX, 454 GS20, Illumina

  4. Standard formats
      • provide algorithms with a well-defined input and output
      • plug alternate tools into pipeline
      • compare performance
      • integrate results across different algorithms
      • capture “checkpoints” in the analysis pipeline

  5. Data types with standard formats

  6. Read data formats – SRF and FASTQ
      • What is the data: trace information, base calls, base qualities
      • Produced by base callers, used by read mappers/aligners
      • Standard formats: FASTQ, SRF

  7. Read data formats – SRF and FASTQ
      SRF (Sequence Read Format):
      • designed to store machine-specific trace information, alternative base calls, extended base quality value schemes
      • complex format
      • used mostly for archival
      FASTQ:
      • only stores base calls + 1 Q-value per base
      • simple format
      • the same for all platforms
      • the de facto format for downstream analysis
      • is there information in SRF (but not in FASTQ) that is required by downstream analysis?
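
A minimal Python sketch of why FASTQ is simple to consume downstream: each record carries only a name, the base calls, and one quality character per base. The Phred+33 quality offset, the strict four-line record layout, and the names `FastqRecord` and `read_fastq` are assumptions made for illustration; they are not taken from the slides, and some platforms of the period encoded qualities differently.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO

# Assumption: Sanger-style encoding, where quality = ord(char) - 33.
PHRED_OFFSET = 33


class FastqRecord(NamedTuple):
    name: str
    bases: str
    quals: List[int]  # one Q-value per base


def read_fastq(handle: TextIO, offset: int = PHRED_OFFSET) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        bases = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(qual_line):
            raise ValueError(f"base/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in qual_line])


# Hypothetical one-record input, used only to exercise the parser.
example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].name, records[0].bases, records[0].quals)
```

Nothing platform-specific (traces, alternative base calls) survives in this representation, which is the trade-off the last bullet of the slide asks about.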

  8. Alignment formats
      • What is the data?
      • generated by read mappers / aligners / assemblers
      • used by e.g. allele callers, SV callers

  9. Alignment formats
      • A standard format (SAM, TAM, BAM) is being defined (Heng Li [Sanger], Bob Handsaker [Broad], etc.)… a standard is within reach
      • Compatible with all technologies (AB?), allows aggregation of data from different individuals, different platforms
      • “Lean and mean”, so it cannot be all-encompassing
      • Remaining issues: gapped / padded alignments, read pairs, compression, indexing
      • Extremely high priority for 1000G data analysis
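
To make the "lean" alignment record concrete, here is a hedged Python sketch that parses one tab-separated alignment line. The slide says the format was still being defined, so the field layout below is an assumption: it follows the eleven mandatory columns of the SAM specification as later published. The names `SamRecord`, `parse_sam_line`, and `reference_span`, and the example line, are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str                     # read name
    flag: int                      # bitwise flag
    rname: str                     # reference sequence name
    pos: int                       # 1-based leftmost mapping position
    mapq: int                      # mapping quality
    cigar: List[Tuple[int, str]]   # gapped alignment as (length, op) pairs
    seq: str
    qual: str


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(text: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '3M1I2M' into (length, op) pairs."""
    if text == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(text)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != text:
        raise ValueError(f"malformed CIGAR: {text!r}")
    return ops


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]),
        cigar=parse_cigar(fields[5]), seq=fields[9], qual=fields[10],
    )


def reference_span(cigar: List[Tuple[int, str]]) -> int:
    """Number of reference bases covered (ops that consume the reference)."""
    return sum(n for n, op in cigar if op in "MDN=X")


# Hypothetical alignment with one insertion and one deletion.
line = "read1\t0\tchr1\t100\t60\t3M1I2M1D2M\t*\t0\t0\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(line)
span = reference_span(rec.cigar)
print(rec.rname, rec.pos, rec.cigar, span)
```

The CIGAR column is where the "gapped / padded alignments" issue from the slide shows up: insertions and deletions are encoded per read rather than by padding a shared multiple alignment.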

  10. SNP / short-INDEL allele calling
      • Data: SNP probability, individual genotype probabilities
      • Produced by SNP caller, used by downstream analysis

  11. Genotype likelihood format: GLF
      Figure: reads aligned at one site for individuals 1, i, and n, with observed bases B1=aacc, Bi=aaaacc, Bn=cccc.
      • “genotype likelihoods”, per individual:
        P(B1=aacc|G1=aa), P(B1=aacc|G1=cc), P(B1=aacc|G1=ac)
        P(Bi=aaaacc|Gi=aa), P(Bi=aaaacc|Gi=cc), P(Bi=aaaacc|Gi=ac)
        P(Bn=cccc|Gn=aa), P(Bn=cccc|Gn=cc), P(Bn=cccc|Gn=ac)
      • Prior(G1,..,Gi,..,Gn)
      • “genotype probabilities”, per individual:
        P(G1=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(G1=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(G1=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
        P(Gi=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gi=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gi=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
        P(Gn=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gn=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gn=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
      • P(SNP)
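
The distinction between genotype likelihoods P(B|G) and genotype probabilities P(G|B) can be sketched in Python using the three example pileups from the slide. Everything about the model is an assumption made for illustration: independent base observations, a single fixed per-base error rate, and a flat per-individual prior standing in for the joint Prior(G1,..,Gi,..,Gn) on the slide. The slide's probabilities condition each individual's genotype on all individuals' data through that joint prior; this sketch does not. The function names are hypothetical.

```python
from math import prod
from typing import Dict

GENOTYPES = ("aa", "ac", "cc")


def base_given_allele(base: str, allele: str, error_rate: float) -> float:
    """P(observed base | true allele); errors split evenly over 3 other bases."""
    return 1.0 - error_rate if base == allele else error_rate / 3.0


def genotype_likelihoods(bases: str, error_rate: float = 0.01) -> Dict[str, float]:
    """P(B | G) for each diploid genotype, treating bases as independent."""
    return {
        g: prod(
            0.5 * base_given_allele(b, g[0], error_rate)
            + 0.5 * base_given_allele(b, g[1], error_rate)
            for b in bases
        )
        for g in GENOTYPES
    }


def genotype_posteriors(
    likelihoods: Dict[str, float], prior: Dict[str, float]
) -> Dict[str, float]:
    """P(G | B) by Bayes' rule: likelihood times prior, then normalise."""
    unnormalised = {g: likelihoods[g] * prior[g] for g in likelihoods}
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}


# Assumption: flat, independent prior per individual (not the joint prior).
flat_prior = {g: 1.0 / len(GENOTYPES) for g in GENOTYPES}

pileups = {"B1": "aacc", "Bi": "aaaacc", "Bn": "cccc"}
posteriors = {
    name: genotype_posteriors(genotype_likelihoods(bases), flat_prior)
    for name, bases in pileups.items()
}
calls = {name: max(post, key=post.get) for name, post in posteriors.items()}
print(calls)
```

A likelihood-based format would store only the output of `genotype_likelihoods`, leaving the choice of prior to whichever downstream caller consumes the file.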

  12. Other data types that need standard format?
