230 likes | 354 Views
A hierarchical approach to building contig scaffolds. Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004. Sequencing pipeline. Random sequencing un-related reads 500-700 base-pairs Assembly un-related contigs 5,000-10,000 base-pairs Scaffolding
E N D
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp. 149-159, 2004.
Sequencing pipeline • Random sequencing • un-related reads 500-700 base-pairs • Assembly • un-related contigs 5,000-10,000 base-pairs • Scaffolding • un-related scaffolds 30,000-50,000 base-pairs • Finishing/gap closure • completed genomes millions-billions of base-pairs
II I III IV II III IV I Scaffolding • Given a set of non-overlapping contigsorder and orient them along a chromosome
I II R F I II R F II I R F Clone-mates Insert R F Clone
Scaffolder output Physical gaps Sequencing gaps
Problems with the data • Incorrect sizing of inserts • cut from gel – sizing is subjective • error increases with size • Chimeras (ends belong to different inserts) • biological reasons (esp. for large sized inserts) • sample tracking (human error) • Software must handle a certain error rate.
Theoretical abstraction • Given a set of entities (reads/contigs) and constraints between them (overlaps/mate pairs) provide a linear/circular embedding that preserves most constraints.
Graph representation • Nodes: contigs • Directed edges: constraints on relative placement of contigs – relative order and relative orientation • Embedding: order (coordinate along chromosome) and orientation (strand sampled)
Challenges • Orientation – node coloring problem (forward/reverse) • feasibility – no cycles with odd number of “reversal” edges (blue edges) • optimality – remove minimum number of edges such that a solution exists (NP-hard)
Challenges • Ordering – generate a linear embedding • feasibility – lengths of parallel DAG paths are consistent • optimality – remove minimum number of edges such that DAG is feasible (NP-hard)
The real world • Use of scaffolds • Analysis – longest unambiguous sub-graphs • Finishing – present all “reliable” relationships between contigs • Sources of error • mis-assemblies • sizing errors (increases with library size) • chimeras
II II II II I’ I I I I III III III III Ambiguous scaffold
Hierarchical scaffolding • For each contig pair, consolidate all linking data into a single relationship – 2 correct links required
Hierarchical scaffolding • Use most reliable links to build scaffolds • Repeatedly build super-scaffolds based on less reliable linking data
Rationale Problem complexity problem size (#nodes) error rate Hierarchical step
Linking information • Overlaps • Mate-pair links • Similarity links • Physical markers • Gene synteny reference genome physical map
BAMBOO (BAMBUS) Best effort Attempt Multiple Branches allowed Order, Orient
BE EB min <= dist <= max Inputs • Set of contigs: names and lengths • Groups of contig links: • groups correspond to “quality” of links • link: relative distance between contig origins relative orientation of contigs • Priorities for each group – specify order in which links are considered
Outputs • XML representation of layout: • contig orientations • contig position (x-coordinate of contig origin) • links used to construct layout • Graphical display of the layout • uses GraphViz package from AT&T
1.0 release • XML input not yet supported • All scaffold placed in the same output file • Only Linux executable released • Hacker friendly
Current release: 2.33 • XML input and more general output module • Collection of input modules from common assembly formats • Better handling of priority data • Repeat masking features • More platforms supported and source code released as open source • http://amos.sourceforge.net
Future enhancements • Option to generate un-ambiguous (non-branching) scaffolds • Better layout algorithms • Specialized drawing tools • Interactive browser • Represent/handle multiple haplotypes
Acknowledgements • Dan Kosack • Steven Salzberg • Martin Shumway • Hean Koo • Luke Tallon • Jessica Vamathevan