1 / 23

On Genome Assembly

On Genome Assembly. Abhiram Ranade IIT Bombay. Genome. Constituent of living cells that determines hereditary characteristics a.k.a. DNA Sequence of nucleobases : A denine, G uanine, T hymine, C ytosine ACCTGGA… Human genome: 3 billion nucleobases

kale
Download Presentation

On Genome Assembly

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. On Genome Assembly Abhiram Ranade IIT Bombay

  2. Genome • Constituent of living cells that determines hereditary characteristics a.k.a. DNA • Sequence of nucleobases: Adenine, Guanine, Thymine, Cytosine ACCTGGA… • Human genome: 3 billion nucleobases • Knowledge of sequence is very useful

  3. Genome Sequencing Biochemical techniques can “read” genomes of length ~ 700 nucleobases • Make many copies of the genome. • Break the copies randomly into pieces of length ~ 700. • Read the pieces. • Try to infer what original genome must have been. Genome Assembly

  4. Assembly Example • Input pieces “Reads”: abcd, cdefghi, hijkl • Assembly: • Input pieces “Reads”: abcd, cdefghi, hijkl, hicd • Assembly? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl abcdefghijkl

  5. Strategy • Characterize all possible assemblies of the given read set • Assign a probability to each assembly • Pick the assembly with the highest probability

  6. All possible assemblies • A = valid assembly of reads if • each read appears in A at least once. • Nothing else appears, i.e. A is made by pasting together the reads in possibly overlapping faction • Can we compute/represent the set {A | A is a valid assembly} Overlap/String Graph!

  7. Overlap Graph: Intuition • Vertices = Reads. • Edge from read u to read v if read u, read v likely to overlap in assembly. e.g. abcd, cdef => abcdef • Assembly = walk in the graph: will encourage overlaps

  8. Overlap graph • Vertices: reads + empty read ϕ • Edges: (ri, rj) : if long suffix of ri = prefix of rj abcdcdefghi • Long = ? Real genomes: 50? 100? Any value that indicates overlap is not coincidental • Edge label: portion of rj not belonging to overlap. (Long = 2) efghi cdefghi abcd

  9. Overlap graph, long=2 efghi jkl abcd cdefghi hijkl cd efghi ϕ abcd hicd hicd ϕ ϕ ϕ

  10. Overlap graph, long=2 efghi jkl abcd cdefghi hijkl cd efghi ϕ abcd hicd hicd ϕ ϕ ϕ

  11. Walk => Assembly • Assembly = Walk in the overlap graph which • Starts at ϕ, Ends atϕ • Passes through every vertex at least once. • Passes through every edge at least once? • Assembled sequence = concatenation of labels along the walk. • Every read appears in the sequence • Walk revisits ϕ: reconstruction is incomplete, in several pieces.

  12. Assembly => Walk • Input: Assembly A • Output: Walk which will generate A • Visit vertices in the order of appearance in A Overlap graph characterizes assemblies. Variations on graph also studied.

  13. Approaches to assembly • Occam’s Razor: Most likely = “Shortest” • Shortest walk that visits every vertex at least once: NP-hard • Shortest walk that visits every edge at least once: Chinese Postman problem. Polytime. • Pragmatic: Use some greedy approach to find above. • Model probability more accurately

  14. A Twist: pair constraints • Sequencing process may give additional constraints: distance from ri to rj in assembly is about D • Example: ri = abcd, rj = hijkl, D = 10. Which of the following assembiles is more likely? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl

  15. Systematic estimation of probability of a given assembly

  16. Algebraic representation of walks • Walk is cyclic: number of times vertex entered = number of times it is exited. • Walk = fluid flow Total fluid coming in = Total fluid going out Xij = fluid going from i to j. = number of times walk goes from i to j Formulate conditions on Xij and solve

  17. Algebraic representation: Xij,δj Xij= Number of times walk goes from i to j. δj = Number of times walk goes over j Lij = Length of label of edge (i,j) Length of genome L = L may be known.

  18. Maximum likelihood reconstruction(Medvedev-Brudno 08) Goal: Find assembly A most likely given the observations. maximize Pr(A | r1,r2,…rn) =Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn) =Pr(r1,r2,…,rn|A) * Pr(A) / Pr(r1,r2,…,rn) Standard assumption: Unconditional probability Pr(A) same for all A. Maximize Pr(r1,r2,…,rn|A) and output A that maximizes.

  19. Computing Pr(r1,r2,…,rn|A) • A = abcdefghicdefghijkl r1= abcd, r2 = cdefghi, r3 = hijkl, r4 = hicd • Process of generating reads: • Pick a random starting point. • Pick a length at random • Pr(r2) = ? = 2/Length * Pr(read length = 7) = δ2/Length * Pr(read length = 7)

  20. Computing Pr(r1,r2,…,rn|A) Generative model: • For i=1 to n • Pick starting point for ri • Pick length Li • Probability of generating ri: = Number of times ri appears in A/Length of A * Probability of getting the correct length = δi/L * Pr(Li)

  21. Computing Pr(r1,r2,…,rn|A) • Pr = Πiδi/L * Pr(Li) • We want to pick A for which this probability is maximum • Score(A) = Πiδi/L * Pr(Li) • Best A will have max value of Πiδi/L • So now we have a program

  22. Finding the best assembly • Maximize Πiδi/L • s.t. • L known approximately. L = Lgeε ≈ Lg(1+ε) • Lg, Lij : constants. Solve for Xij≥0, δj≥0, ε • Convex optimization

  23. Concluding Remarks • Experiments seem to indicate our approach works well. www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf • Computationally intensive, but well founded. • May not be useful for large genomes – linear time algorithms only! • How to handle pair constraints: important open problem. • Graphs are everywhere!

More Related