310 likes | 432 Views
ILP-based maximum likelihood genome scaffolding. James Lindsay Ion Mandoiu University of Connecticut Hamed Salooti Alex Zelikovsky Georgia State University. De-novo Assembly. Genome. Reads. S equencing. Assembly. Scaffolds. S caffolding. Contigs. Why Scaffolding?. Annotation
E N D
ILP-based maximum likelihood genome scaffolding James Lindsay Ion Mandoiu University of Connecticut HamedSalooti Alex Zelikovsky Georgia State University
De-novo Assembly Genome Reads Sequencing Assembly Scaffolds Scaffolding Contigs
Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! No scaffold gene XYZ Scaffold 5’ UTR gene XYZ 3’ UTR
Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! Biologist: There are holes in my genes! 5’ UTR gene XYZ 3’ UTR Sanger Sequencing 5’ UTR gene XYZ 3’ UTR
Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap Filling • Structural variation!
Read Pairs Informative Reads Paired Read Construction • Align each read against the contigs • Uniquely mapped reads • Save repetitive • Both reads in a pair must map to different contigs 2kb 2kb same strand and orientation R2 R1
Linkage Information Possible States • Two contigs are adjacent if: • A read pair spans the contigs • State (A, B, C, D) • Depends on orientation of the read • Order of contigs is arbitrary • Each read pair can be “consistent” with one of the four states 5’ 3’ R2 R1 A B C D contigi contig j
Scaffolding Problem Input • Contigs • Linkage information from paired reads Output • Contig orientation • Contig ordering • Relative distance between contigs Objective: Find the longest and most accurate scaffolds
Existing Tools Bambus2 GRASS OPERA MIP SCAPRA SOPRA SGA SOAPDENOVO SSPACE
Scaffolding Graph read pairs Scaffolding graph: G = (V, E, w) V = set of all contigs E = set of pairs of contigs connected with mapped read pairs of a particular state (A,B,C, or D) Edge weight = probability of read pairs being correctly aligned: • Amount of repeat overlap • Contig coverage dissimilarity contigs
Structure of Scaffold Graph • Paired reads have upper bound • Contigs have minimum size • Upper bound of # contigs spanned Scaffold graph has bounded width (Opera) 2kb 2kb
Elimination Order: SPQR-tree Bi-connected Elimination order Tri-connected
Maximum Likelihood Contig Orientation • Given read pair r, let pr be probability of r being correctly aligned • For given orientation O, let RO be subset of all reads R agreeing with O • Probability of O being correct is estimated as • Log likelihood • Equivalent to maximize
Integer Linear Program Formulation • Variables: • Binary Si= 0 if i-thcontig keeps default orientation, =1 if it is flipped • Binary Sij= 0 if contigs in (i,j)-edge are both flipped or both not, = 1 otherwise • Binary Aij=1 if the edge (i,j) in state A, = 0 otherwise (similarlyBij,Cij, Dij) • Weight of edge • Objective
i i i i ILP Constraints j j j j i i j j k k k k • Connecting Si and Sij • Connecting Sijand Aij • Forbidding 2-cycles • Forbidding 3-cycles
? Non-Serial Dynamic Programming ? 2-Component A ? ? ILP_A ? ? • 1-cuts Collapsing 1-Cut Splitting 1-Cut ? 1-Cut ? ? ? ? ? ILP_B ? ? 2-Component B ?
Non-Serial Dynamic Programming • 2-cuts 3-Component A ? ILP_A + ILP_A ILP_A ? ? Collapsing 2-cut Splitting 2-Cut ? ? 2-Cut ? ? ILP_B ? ? ? 3-Component B
A A B A B B Total Ordering via Bipartite Matching A C F A B F B C C D E C D E C D D ILP Bipartite Bipartite Total Output graph matching order E D E E F F F
Gap Estimation via QP CONTIG i CONTIG i+1 CONTIG i+2 CONTIG i+3 • Maximum likelihood gap estimation following Opera and previous scaffolders
Results • Verification on simulation • Simulate contigs from real draft assemblies (GAGE) & real reads • Control error rate • Actual orientation, order and position is known • Evaluation on real assembly • Draft Human genome & real reads • Alignment based evaluation is challenging
GAGE Simulation • Step: • Staph, Rhodo and Chr14 assemblies • Mix all contig sizes, sample uniformly at random to build simulated contigs (or gaps) • Align reads against simulated contigs • Control error rates • Repeat 10x • Metrics: • Binary classification (n-1) real edges • Corrected N50 (break edges at bad points) • Runtime • Scaffold size distribution
Conclusions • Powerful and flexible ILP for genome scale problems • NSDP is a viable solution for big problems • ILP will probably work well in Metagenomicsdomain • Validation through N50, gene content, • Future work: • Validate through recent framework from Genome Biology 03/ 2014 • http://genomebiology.com/2014/15/3/R42 • http://genomebiology.com/2014/15/3/R42/table/T1
Scaffolding NSDP Algorithm Nonserial Dynamic Programming (NSDP) • Compute ILP solution in stages such that each uses results from the previous stage. • Use SPQR-tree to determine variable elimination order stack = [root of SPQR-tree] visited = {empty}while stack is not empty do p = stack.pop() foreach child q of p do if p not in visited then stack.push(p) endif endfor if p is root then solfinal = ILP(p) else (s,t) = getcut(p, parent(p)) sol00 = ILP(p, s=0, t=0), sol01 = ILP(p, s=0, t=1) sol01 = ILP(p, s=0, t=1), sol00 = ILP(p, s=1, t=1) endif endwhile