1 / 31

ILP-based maximum likelihood genome scaffolding

ILP-based maximum likelihood genome scaffolding. James Lindsay Ion Mandoiu University of Connecticut Hamed Salooti Alex Zelikovsky Georgia State University. De-novo Assembly. Genome. Reads. S equencing. Assembly. Scaffolds. S caffolding. Contigs. Why Scaffolding?. Annotation

norman
Download Presentation

ILP-based maximum likelihood genome scaffolding

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ILP-based maximum likelihood genome scaffolding James Lindsay Ion Mandoiu University of Connecticut HamedSalooti Alex Zelikovsky Georgia State University

  2. De-novo Assembly Genome Reads Sequencing Assembly Scaffolds Scaffolding Contigs

  3. Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! No scaffold gene XYZ Scaffold 5’ UTR gene XYZ 3’ UTR

  4. Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap filling • Structural variation! Biologist: There are holes in my genes! 5’ UTR gene XYZ 3’ UTR Sanger Sequencing 5’ UTR gene XYZ 3’ UTR

  5. Why Scaffolding? • Annotation • Comparative biology • Re-sequencing and gap Filling • Structural variation!

  6. Read Pairs Informative Reads Paired Read Construction • Align each read against the contigs • Uniquely mapped reads • Save repetitive • Both reads in a pair must map to different contigs 2kb 2kb same strand and orientation R2 R1

  7. Linkage Information Possible States • Two contigs are adjacent if: • A read pair spans the contigs • State (A, B, C, D) • Depends on orientation of the read • Order of contigs is arbitrary • Each read pair can be “consistent” with one of the four states 5’ 3’ R2 R1 A B C D contigi contig j

  8. Scaffolding Problem Input • Contigs • Linkage information from paired reads Output • Contig orientation • Contig ordering • Relative distance between contigs Objective: Find the longest and most accurate scaffolds

  9. Existing Tools Bambus2 GRASS OPERA MIP SCAPRA SOPRA SGA SOAPDENOVO SSPACE

  10. SILP Flow

  11. Scaffolding Graph read pairs Scaffolding graph: G = (V, E, w) V = set of all contigs E = set of pairs of contigs connected with mapped read pairs of a particular state (A,B,C, or D) Edge weight = probability of read pairs being correctly aligned: • Amount of repeat overlap • Contig coverage dissimilarity contigs

  12. Structure of Scaffold Graph • Paired reads have upper bound • Contigs have minimum size • Upper bound of # contigs spanned Scaffold graph has bounded width (Opera) 2kb 2kb

  13. Elimination Order: SPQR-tree Bi-connected Elimination order Tri-connected

  14. Maximum Likelihood Contig Orientation • Given read pair r, let pr be probability of r being correctly aligned • For given orientation O, let RO be subset of all reads R agreeing with O • Probability of O being correct is estimated as • Log likelihood • Equivalent to maximize

  15. Integer Linear Program Formulation • Variables: • Binary Si= 0 if i-thcontig keeps default orientation, =1 if it is flipped • Binary Sij= 0 if contigs in (i,j)-edge are both flipped or both not, = 1 otherwise • Binary Aij=1 if the edge (i,j) in state A, = 0 otherwise (similarlyBij,Cij, Dij) • Weight of edge • Objective

  16. i i i i ILP Constraints j j j j i i j j k k k k • Connecting Si and Sij • Connecting Sijand Aij • Forbidding 2-cycles  • Forbidding 3-cycles

  17. ? Non-Serial Dynamic Programming ? 2-Component A ? ? ILP_A ? ? • 1-cuts Collapsing 1-Cut Splitting 1-Cut ? 1-Cut ? ? ? ? ? ILP_B ? ? 2-Component B ?

  18. Non-Serial Dynamic Programming • 2-cuts 3-Component A ? ILP_A + ILP_A  ILP_A  ? ? Collapsing 2-cut Splitting 2-Cut ? ? 2-Cut ? ? ILP_B ? ? ? 3-Component B

  19. A A B A B B Total Ordering via Bipartite Matching A C F A B F B C C D E C D E C D D ILP Bipartite Bipartite Total Output  graph matching  order E D E E F F F

  20. Gap Estimation via QP CONTIG i CONTIG i+1 CONTIG i+2 CONTIG i+3 • Maximum likelihood gap estimation following Opera and previous scaffolders

  21. Results • Verification on simulation • Simulate contigs from real draft assemblies (GAGE) & real reads • Control error rate • Actual orientation, order and position is known • Evaluation on real assembly • Draft Human genome & real reads • Alignment based evaluation is challenging

  22. GAGE Simulation • Step: • Staph, Rhodo and Chr14 assemblies • Mix all contig sizes, sample uniformly at random to build simulated contigs (or gaps) • Align reads against simulated contigs • Control error rates • Repeat 10x • Metrics: • Binary classification (n-1) real edges • Corrected N50 (break edges at bad points) • Runtime • Scaffold size distribution

  23. Simulation: MCC

  24. Simulation: N50: Staph

  25. Simulation: N50: Rhodo

  26. Simulation: N50: Chr14

  27. NA12878 2x: Runtime

  28. Metagenomics: GAGE Simulation

  29. Conclusions • Powerful and flexible ILP for genome scale problems • NSDP is a viable solution for big problems • ILP will probably work well in Metagenomicsdomain • Validation through N50, gene content, • Future work: • Validate through recent framework from Genome Biology 03/ 2014 • http://genomebiology.com/2014/15/3/R42 • http://genomebiology.com/2014/15/3/R42/table/T1

  30. Thank you!

  31. Scaffolding NSDP Algorithm Nonserial Dynamic Programming (NSDP) • Compute ILP solution in stages such that each uses results from the previous stage. • Use SPQR-tree to determine variable elimination order stack = [root of SPQR-tree] visited = {empty}while stack is not empty do p = stack.pop() foreach child q of p do if p not in visited then stack.push(p) endif endfor if p is root then solfinal = ILP(p) else (s,t) = getcut(p, parent(p)) sol00 = ILP(p, s=0, t=0), sol01 = ILP(p, s=0, t=1) sol01 = ILP(p, s=0, t=1), sol00 = ILP(p, s=1, t=1) endif endwhile

More Related