FASTG: Representing the true information content of a genome assembly
Iain MacCallum, David B. Jaffe
Broad Institute of MIT and Harvard, Cambridge, MA
In collaboration with: Michael Schatz, Daniel Rokhsar, and Assemblathon Format Group

The Solution: FASTG
• Enhanced FASTA to capture known complexity of the genome
• Superset of FASTA
• Faithfully represents genome assemblies without error or loss of information
• Hybrid approach:
  - preserves underlying linearity of the genome, but
  - captures nonlinear complexity
• Easily converted to FASTA

The Problem: FASTA is flat
• The FASTA format is the standard way to represent an assembly:
  - Linear representation of the genome
  - Easy to parse and human readable
  - Provides a simple co-ordinate system that allows easy annotation
  - Supported by many tools
  - Has been the warhorse for ~20 years
• But FASTA has a number of limitations (see the sections below)

FASTG captures important gap information
• Gaps in FASTA often hide additional information, e.g. frame shifts
• FASTG retains this information

FASTA:
  >scaffold
  TACTAGGCNNNNNNNATTAGGCCG
  TGCNNNNNNNNNNNGCGCCGTTAC
  CATTCNNNNNNACTGCCGTTGACT

FASTG:
  >assembly;
  ATCGGCNNNN[4:gap:size=(4,3..9)]ATTACCTG
  GCTTATAC[1:alt|C,G]TACCCGATACGTTTACGGTA
  TACGAAAAA[5:tandem:size=(5,4..11)|A]TCT

Assembly gap with complex contents

FASTA (any information about the gap content is lost):
  >scaffold1
  CATATNNNNNNNNNNGATGT

FASTG (captures possible content of the gap):
  >scaffold1;
  CATATNNNNNNNNNN[10:gap:size=(10,5..20),start=(a,e),end=(d,g)|
  >a:b,c,d;CA
  >b:c,d;T
  >c:f,g;GAC
  >d;AA
  >e:f,g;TT
  >f:g;A
  >g;AGT
  ]GATGT

Putative gap sequences derived from FASTG, e.g.
  1) CATGACAAGT
  2) CATTTAA
  3) TTAAAAGT

FASTA forces assemblers to make mistakes
• Strictly linear nature forces assemblers to introduce errors:
  - Uncertain base or SNP (ACATT, then A or T, then TACTG): assembler forced to pick A or T
  - Uncertain tandem repeat (ACATT, then 5, 6 or 7 Cs, then TACTG): assembler forced to choose the repeat length
  - Haplotype separation (ACATT, then CGAGG or AAGCC, then TACTG): assembler forced to choose a haplotype
• These simple events are difficult to represent in the FASTA format
• The assembler is forced to choose, resulting in a loss of information and errors
• Quality scores, annotation, and IUPAC codes provide only a partial solution

FASTG encodes all ambiguities
• FASTG natively encodes ambiguities that are lost in FASTA
• Optional properties (probability, copy-number, etc.) can be associated with these events, e.g. A[1:alt:allele|A,T]

  Uncertain base or SNP:    ACATT A[1:alt|A,T] TACTG
  Uncertain tandem repeat:  ACATT CCCCCC[6:tandem:size=(6,5..7)|C] TACTG
  Haplotype separation:     ACATT CGAGG[5:digraph:path=(a)|>a;CGAGG>b;AAGCC] TACTG

FASTG is FASTA compatible
• FASTG looks like FASTA
• FASTG can be easily converted to FASTA
• Any existing tool that works on FASTA can use converted FASTG

FASTG to FASTA conversion
• FASTG can easily be converted to FASTA by removing the FASTG extensions “[…]”
• Conversion can be done with a simple shell or perl script
• The resulting FASTA can be processed by existing tools
• FASTG and derived FASTA files share the same base co-ordinate system
• FASTG extensions plus start location = Markup
• No additional markup language required
• FASTA + Markup will produce the original FASTG
• Can convert markup to existing annotation formats
  - but only for a subset of FASTG features

FASTG:
  >contig1;
  TACCGCNNNN[4:gap:size=(4,3..5)]AGCCTGCC
  GTTATAC[1:alt:allele|C,G]TCCCTGGATACGTT
  TAGGATATAT[6:tandem:size=(3,2..5)|AT]CC

FASTA:
  >contig1
  TACCGCNNNNAGCCTGCC
  GTTATACCTCCCTGGATA
  CGTTTAGGATATATCC

Markup:
  >contig1;
  6 [4:gap:size=(4,3..5)]
  26 [1:alt:allele|C,G]
  52 [6:tandem:size=(3,2..5)|AT]

FASTA cannot represent graph assemblies
• Not all assemblies can be reduced to a linear form, due to:
  - Polymorphism that cannot be linearized
  - Long repeats that cannot be bridged with jumping data
  - Inversions that cannot be disambiguated
• Assemblies must be broken into linear sections, losing information

FASTG captures graphs using a hierarchic approach
• FASTG is FASTA-like, preserves linearity and keeps local complexity local
• FASTG is easy to use

Figure: a genome containing a long repeat that the jumping libraries are too short to disambiguate, and the resulting assembly graph of ContigA, ContigB and ContigC. Labelled features: long imperfect repeat, uncertain tandem repeat, single base difference. The global graph structure is encoded in the FASTG header lines:

  >ContigA:ContigC;
  TCGA…[7:tandem:size=(7,6..9)|T]…CATG
  >ContigB:ContigC;
  ATAGCG…ATCCAT
  >ContigC:ContigA,ContigB;
  CGTA…[1:alt|C,G]…AATC

Coming soon: FASTG support in ALLPATHS-LG
• Our genome assembler ALLPATHS-LG will soon produce FASTG assemblies
• For the latest ALLPATHS-LG news visit our blog: http://www.broadinstitute.org/software/allpaths-lg/blog/
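
The FASTG to FASTA conversion described in the poster (strip the "[…]" extensions, keep them with their start location as markup) can be sketched roughly as follows. This is a minimal illustration, not the authors' script. It assumes that extensions are never nested, that the leading number in an extension counts the bases immediately before it, that the markup location is the 0-based start of those bases, and that the FASTA name is the part of the header before the first ":" or ";". The function names `fastg_record_to_fasta` and `apply_markup` are hypothetical.

```python
import re

# Assumption: FASTG extensions are non-nested "[...]" spans.
EXTENSION = re.compile(r"\[[^\[\]]*\]")


def strip_whitespace(text):
    """Remove line breaks and spaces from a sequence fragment."""
    return re.sub(r"\s+", "", text)


def fastg_record_to_fasta(header, body):
    """Split one FASTG record into a FASTA header, sequence and markup.

    Markup is a list of (start, extension) pairs, where start is the
    0-based offset (an assumed convention) of the bases the extension
    describes in the derived FASTA sequence.
    """
    # Assumption: the name ends at the first ':' (neighbour list) or ';'.
    name = re.split(r"[:;]", header.lstrip(">"), maxsplit=1)[0]
    parts, markup = [], []
    length = 0  # bases emitted so far
    pos = 0     # read position in the FASTG body
    for match in EXTENSION.finditer(body):
        chunk = strip_whitespace(body[pos:match.start()])
        parts.append(chunk)
        length += len(chunk)
        ext = strip_whitespace(match.group(0))
        covered = int(ext[1:].split(":", 1)[0])  # leading base count
        markup.append((length - covered, ext))
        pos = match.end()
    parts.append(strip_whitespace(body[pos:]))
    return ">" + name, "".join(parts), markup


def apply_markup(sequence, markup):
    """Re-insert extensions into a FASTA sequence (FASTA + markup)."""
    out, pos = [], 0
    for start, ext in markup:
        covered = int(ext[1:].split(":", 1)[0])
        end = start + covered  # the extension follows the bases it covers
        out.append(sequence[pos:end])
        out.append(ext)
        pos = end
    out.append(sequence[pos:])
    return "".join(out)


if __name__ == "__main__":
    fastg = "ACATTCCCCCC[6:tandem:size=(6,5..7)|C]TACTG"
    header, seq, markup = fastg_record_to_fasta(">contig1;", fastg)
    print(header)  # >contig1
    print(seq)     # ACATTCCCCCCTACTG
    print(markup)  # [(5, '[6:tandem:size=(6,5..7)|C]')]
    print(apply_markup(seq, markup) == fastg)  # True
```

Because the derived FASTA and the FASTG share one base co-ordinate system, the markup only needs an offset per extension. Under the assumed convention, the first extension of the poster's contig1 example gets offset 6.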
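
The gap example with complex contents embeds a small graph inside the brackets, with `start=(a,e)` and `end=(d,g)`. As a hedged sketch of how candidate gap sequences could be read off such a graph, the code below parses the node list exactly as written in that example and walks every path from a start node to an end node. It assumes the graph is acyclic and that each node has the form `>name:successors;SEQUENCE`. The helper names `parse_gap_graph` and `gap_sequences` are hypothetical, and this is not a full FASTG parser.

```python
def parse_gap_graph(text):
    """Parse '>a:b,c,d;CA>b:c,d;T...' into {name: (sequence, successors)}."""
    nodes = {}
    for record in text.split(">"):
        record = "".join(record.split())  # drop line breaks and spaces
        if not record:
            continue
        header, seq = record.split(";", 1)
        name, _, successors = header.partition(":")
        nodes[name] = (seq, successors.split(",") if successors else [])
    return nodes


def gap_sequences(nodes, starts, ends):
    """Enumerate sequences along all start-to-end paths (acyclic graphs only)."""
    results = []

    def walk(name, prefix):
        seq, successors = nodes[name]
        current = prefix + seq
        if name in ends:
            results.append(current)
        for nxt in successors:
            walk(nxt, current)

    for start in starts:
        walk(start, "")
    return results


if __name__ == "__main__":
    # Node list copied from the scaffold1 gap example.
    graph_text = """
    >a:b,c,d;CA
    >b:c,d;T
    >c:f,g;GAC
    >d;AA
    >e:f,g;TT
    >f:g;A
    >g;AGT
    """
    nodes = parse_gap_graph(graph_text)
    for sequence in gap_sequences(nodes, starts=["a", "e"], ends=["d", "g"]):
        print(sequence)
```

With the graph as written, the path a, b, c, f, g spells CATGACAAGT, the first putative gap sequence listed in the example.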