Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*
University of Connecticut*, Georgia State University
De-novo Assembly Paradigm
The Genome -> (Sequencing) -> The Reads -> (Assembly) -> The Contigs -> (Scaffolding) -> The Scaffolds
Why Scaffolding?
[Figure: "No scaffold": gene XYZ; "Scaffold": 5’ UTR, gene XYZ, 3’ UTR]
• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!
Why Scaffolding?
Biologist: "There are holes in my genes!"
[Figure: 5’ UTR, gene XYZ, 3’ UTR, before and after Sanger Sequencing]
Read Pairs
Paired Read Construction
[Figure labels: R1, R2, 2kb, same strand and orientation]
Informative Reads
• Align each read against the contigs
• Only accept uniquely mapped reads
  • Use the non-unique reads later
• Both reads in a pair must map to different contigs
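The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the input layout (a list of alignment hits per read, each hit a hypothetical `(contig, position, strand)` tuple) and the function name `informative_pairs` are assumptions made for the example.

```python
def informative_pairs(pairs):
    """Split read pairs into informative pairs and pairs deferred for later.

    pairs: iterable of (hits1, hits2); each hits list holds the alignments of
    one read as (contig, position, strand) tuples (assumed layout).
    """
    kept, deferred = [], []
    for hits1, hits2 in pairs:
        if len(hits1) == 1 and len(hits2) == 1:
            # Both reads map uniquely: keep only pairs that span two contigs.
            if hits1[0][0] != hits2[0][0]:
                kept.append((hits1[0], hits2[0]))
        elif hits1 and hits2:
            # At least one read is non-unique: set aside for later use.
            deferred.append((hits1, hits2))
    return kept, deferred


example = [
    ([("c1", 100, "+")], [("c2", 40, "+")]),                   # informative
    ([("c1", 100, "+")], [("c1", 900, "+")]),                  # same contig
    ([("c1", 5, "+"), ("c3", 7, "-")], [("c2", 40, "+")]),     # non-unique
    ([], [("c2", 40, "+")]),                                   # unmapped mate
]
kept, deferred = informative_pairs(example)
print(len(kept), len(deferred))
```

Pairs with an unmapped read are dropped here; whether to keep them for another purpose is a design choice the slides do not address.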
Linkage Information
• Two contigs are adjacent if:
  • A read pair spans the contigs
• State (A, B, C, D)
  • Depends on orientation of the read
  • Order of contigs is arbitrary
• Each read pair can be “consistent” with one of the four states
Possible States
[Figure: reads R1 and R2 on contig i and contig j, 5’ to 3’, in states A, B, C, D]
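As a rough sketch of how read pairs could be tallied per state: the mapping from read strands to the labels A to D below is an assumption made for illustration (the slide's figure defines the real one), and `tally_states` is a hypothetical helper name.

```python
from collections import Counter

# ASSUMED labelling, for illustration only:
# (strand of R1 on its contig, strand of R2 on its contig) -> state.
STATE_BY_STRANDS = {
    ("+", "+"): "A",
    ("+", "-"): "B",
    ("-", "+"): "C",
    ("-", "-"): "D",
}


def tally_states(pairs):
    """Count read pairs supporting each (contig of R1, contig of R2, state).

    pairs: iterable of ((contig, pos, strand), (contig, pos, strand)).
    """
    support = Counter()
    for (c1, _, s1), (c2, _, s2) in pairs:
        support[(c1, c2, STATE_BY_STRANDS[(s1, s2)])] += 1
    return support


pairs = [
    (("c1", 10, "+"), ("c2", 5, "+")),
    (("c1", 20, "+"), ("c2", 8, "+")),
    (("c1", 30, "+"), ("c2", 9, "-")),
]
support = tally_states(pairs)
print(support)
```

The counts per state can serve as un-weighted support for each contig pair.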
The Scaffolding Problem
• Given
  • Contigs
  • Paired reads
• Find
  • Orientation
  • Ordering
  • Relative distance
• Goal
  • Recreate true scaffolds
• Possible Objectives
  • Un-weighted
    • Max number of consistent read pairs
  • Weighted
    • Each state is weighted:
      • Overlap with repeat
      • Deviation from expected distance
      • …
Graph Representation
Using the input we can define a scaffolding graph G = (V, E):
• V, set of contigs
• E, set of read pairs linking two contigs
• This is an undirected multi-graph
• Assume it is connected
Integer Linear Program Formulation
Variables
• Contig orientation
• Pairwise contig consistency
• Contig pair state
Objective
• Maximize weight of consistent pairs
Constraints
• Pairwise orientation
• Mutual exclusivity
• Forbid 2- and 3-cycles explicitly
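The formulas on these two slides did not survive, so the following is only one plausible way to write such a program, not necessarily the authors' exact formulation. All symbols are assumptions: $S_i$ for the orientation of contig $i$, $S_{ij}$ for whether contigs $i$ and $j$ differ in orientation, $A_{ij}, B_{ij}, C_{ij}, D_{ij}$ for the chosen state, $w^{X}_{ij}$ for the weight of state $X$, and $x_{ij}$ for a hypothetical indicator that $i$ precedes $j$. It also assumes that states A and D require equal orientations and B and C require opposite ones.

```latex
\begin{align*}
&\text{Variables (all binary):} && S_i,\; S_{ij},\; A_{ij},\, B_{ij},\, C_{ij},\, D_{ij} \in \{0,1\} \\[4pt]
&\text{Objective:} && \max \sum_{(i,j) \in E}
   \left( w^{A}_{ij} A_{ij} + w^{B}_{ij} B_{ij} + w^{C}_{ij} C_{ij} + w^{D}_{ij} D_{ij} \right) \\[4pt]
&\text{Pairwise orientation } (S_{ij} = S_i \oplus S_j): && S_{ij} \le S_i + S_j, \quad S_{ij} \le 2 - S_i - S_j, \\
& && S_{ij} \ge S_i - S_j, \quad S_{ij} \ge S_j - S_i \\[4pt]
&\text{Mutual exclusivity:} && A_{ij} + D_{ij} \le 1 - S_{ij}, \quad B_{ij} + C_{ij} \le S_{ij} \\[4pt]
&\text{No 2-cycles:} && x_{ij} + x_{ji} \le 1 \\
&\text{No 3-cycles:} && x_{ij} + x_{jk} + x_{ki} \le 2
\end{align*}
```

The four orientation inequalities are the standard linearization of an exclusive-or, and the two exclusivity inequalities together allow at most one state per contig pair.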
Graph Decomposition: Articulation Points
[Figure: scaffolding graph split at an articulation point; labels: solve, articulation point]
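For reference, articulation points can be found with a single depth-first search. The sketch below is a textbook lowpoint DFS in plain Python, not code from the talk. The function name and the edge-list input are assumptions, and parallel edges of the multi-graph are collapsed because they do not affect which nodes are articulation points.

```python
from collections import defaultdict


def articulation_points(edges):
    """Return the set of articulation points of an undirected graph.

    edges: iterable of (u, v) pairs; parallel edges and self-loops are ignored.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    disc, low, cut = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u separates v's subtree if it cannot climb above u.
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
        # The DFS root is a cut node only if it has several DFS children.
        if parent is None and children > 1:
            cut.add(u)

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return cut


# Two triangles that share contig "c".
two_triangles = [("a", "b"), ("b", "c"), ("c", "a"),
                 ("c", "d"), ("d", "e"), ("e", "c")]
print(articulation_points(two_triangles))
```

The recursion is fine for small examples; a graph with many thousands of contigs would need an iterative DFS or a raised recursion limit.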
Graph Decomposition: 2-cuts
[Figure: a 2-cut, with the orientation combinations of the two cut contigs: (+, +), (+, -), (-, +), (-, -)]
Non-Serial Dynamic Programming
• SPQR-tree to schedule decomposition
• Traverse tree using DFS
• NSDP utilizes solutions of previous stage in current stage
Post Processing ILP Solution
ILP Solution
• May have cycles
• Not a total ordering for each connected component
Bipartite matching
[Figure: ILP solution over contigs B, D, F, A, E, C; bipartite graph with outgoing and incoming copies of contigs A to F]
• Objectives:
  • Max weight
  • Max cardinality
  • Max cardinality / Max weight
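A minimal sketch of the max-cardinality variant, assuming each contig gets an "outgoing" copy and an "incoming" copy and that the ILP solution is given as a dictionary of candidate successors. It uses the classic augmenting-path method (Kuhn's algorithm); the function name and input layout are assumptions, and the weighted objectives would need a different algorithm.

```python
def max_cardinality_matching(successors):
    """Match outgoing copies of contigs to incoming copies.

    successors: dict mapping contig -> list of candidate next contigs
    (assumed layout). Returns dict contig -> chosen successor, so every
    contig ends up with at most one successor and at most one predecessor.
    """
    match_in = {}  # incoming copy -> outgoing copy currently matched to it

    def try_augment(u, seen):
        for v in successors.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can move elsewhere.
            if v not in match_in or try_augment(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    for u in successors:
        try_augment(u, set())
    return {u: v for v, u in match_in.items()}


# Hypothetical candidate edges: A may be followed by B or C, D only by B.
candidates = {"A": ["B", "C"], "D": ["B"]}
chain = max_cardinality_matching(candidates)
print(chain)
```

A matching limits every contig to one predecessor and one successor, but it can still close a cycle, so any remaining cycles would have to be broken in a separate step.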
Testing Framework
Venter Genome
• 4x Assembly
Testing Metrics
• Computer Scientists
  • Finding Scaffold = Binary Classification Test
  • n contigs, try to predict n-1 adjacencies
  • TP, FP, TN, FN, Sensitivity, PPV
• Biologists (main focus)
  • N50 (length-weighted median scaffold size, ignoring gaps)
  • TP50
    • Break scaffold at incorrect edges, then find N50
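The metrics above are simple to compute. The sketch below assumes scaffolds are non-empty lists of contig names, contig lengths come from a dictionary, and adjacencies are unordered contig pairs; the function names are illustrative. N50 is taken as the largest length L such that scaffolds of length at least L cover half of the total.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp50(scaffolds, contig_len, true_adjacencies):
    """Break each scaffold at incorrect adjacencies, then compute N50."""
    truth = {frozenset(pair) for pair in true_adjacencies}
    pieces = []
    for scaffold in scaffolds:  # each scaffold is a non-empty list of contigs
        current = contig_len[scaffold[0]]
        for a, b in zip(scaffold, scaffold[1:]):
            if frozenset((a, b)) in truth:
                current += contig_len[b]
            else:
                pieces.append(current)
                current = contig_len[b]
        pieces.append(current)
    return n50(pieces)


def sensitivity_ppv(predicted, true_adjacencies):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    pred = {frozenset(pair) for pair in predicted}
    truth = {frozenset(pair) for pair in true_adjacencies}
    tp = len(pred & truth)
    sensitivity = tp / len(truth) if truth else 0.0
    ppv = tp / len(pred) if pred else 0.0
    return sensitivity, ppv


lengths = {"a": 10, "b": 20, "c": 30}
scaffolds = [["a", "b", "c"]]
truth = [("a", "b")]
print(n50([60]), tp50(scaffolds, lengths, truth))
print(sensitivity_ppv([("a", "b"), ("b", "c")], truth))
```

Gap sizes are ignored throughout, as on the slide.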
Conclusions
• Success
  • ILP solves scaffolding problem!
  • NSDP works.
• Improvements
  • Finalize large test cases (then publish?!)
  • Practical considerations (read style, multi-libraries, merge contigs)
• Future Work
  • Where else can I apply NSDP?
  • Scaffold before assembly??
  • Structural Variation??