240 likes | 464 Views
On Genome Assembly. Abhiram Ranade IIT Bombay. Genome. Constituent of living cells that determines hereditary characteristics a.k.a. DNA Sequence of nucleobases : A denine, G uanine, T hymine, C ytosine ACCTGGA… Human genome: 3 billion nucleobases
E N D
On Genome Assembly Abhiram Ranade IIT Bombay
Genome • Constituent of living cells that determines hereditary characteristics a.k.a. DNA • Sequence of nucleobases: Adenine, Guanine, Thymine, Cytosine ACCTGGA… • Human genome: 3 billion nucleobases • Knowledge of sequence is very useful
Genome Sequencing Biochemical techniques can “read” genomes of length ~ 700 nucleobases • Make many copies of the genome. • Break the copies randomly into pieces of length ~ 700. • Read the pieces. • Try to infer what original genome must have been. Genome Assembly
Assembly Example • Input pieces “Reads”: abcd, cdefghi, hijkl • Assembly: • Input pieces “Reads”: abcd, cdefghi, hijkl, hicd • Assembly? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl abcdefghijkl
Strategy • Characterize all possible assemblies of the given read set • Assign a probability to each assembly • Pick the assembly with the highest probability
All possible assemblies • A = valid assembly of reads if • each read appears in A at least once. • Nothing else appears, i.e. A is made by pasting together the reads in possibly overlapping faction • Can we compute/represent the set {A | A is a valid assembly} Overlap/String Graph!
Overlap Graph: Intuition • Vertices = Reads. • Edge from read u to read v if read u, read v likely to overlap in assembly. e.g. abcd, cdef => abcdef • Assembly = walk in the graph: will encourage overlaps
Overlap graph • Vertices: reads + empty read ϕ • Edges: (ri, rj) : if long suffix of ri = prefix of rj abcdcdefghi • Long = ? Real genomes: 50? 100? Any value that indicates overlap is not coincidental • Edge label: portion of rj not belonging to overlap. (Long = 2) efghi cdefghi abcd
Overlap graph, long=2 efghi jkl abcd cdefghi hijkl cd efghi ϕ abcd hicd hicd ϕ ϕ ϕ
Overlap graph, long=2 efghi jkl abcd cdefghi hijkl cd efghi ϕ abcd hicd hicd ϕ ϕ ϕ
Walk => Assembly • Assembly = Walk in the overlap graph which • Starts at ϕ, Ends atϕ • Passes through every vertex at least once. • Passes through every edge at least once? • Assembled sequence = concatenation of labels along the walk. • Every read appears in the sequence • Walk revisits ϕ: reconstruction is incomplete, in several pieces.
Assembly => Walk • Input: Assembly A • Output: Walk which will generate A • Visit vertices in the order of appearance in A Overlap graph characterizes assemblies. Variations on graph also studied.
Approaches to assembly • Occam’s Razor: Most likely = “Shortest” • Shortest walk that visits every vertex at least once: NP-hard • Shortest walk that visits every edge at least once: Chinese Postman problem. Polytime. • Pragmatic: Use some greedy approach to find above. • Model probability more accurately
A Twist: pair constraints • Sequencing process may give additional constraints: distance from ri to rj in assembly is about D • Example: ri = abcd, rj = hijkl, D = 10. Which of the following assembiles is more likely? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl
Algebraic representation of walks • Walk is cyclic: number of times vertex entered = number of times it is exited. • Walk = fluid flow Total fluid coming in = Total fluid going out Xij = fluid going from i to j. = number of times walk goes from i to j Formulate conditions on Xij and solve
Algebraic representation: Xij,δj Xij= Number of times walk goes from i to j. δj = Number of times walk goes over j Lij = Length of label of edge (i,j) Length of genome L = L may be known.
Maximum likelihood reconstruction(Medvedev-Brudno 08) Goal: Find assembly A most likely given the observations. maximize Pr(A | r1,r2,…rn) =Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn) =Pr(r1,r2,…,rn|A) * Pr(A) / Pr(r1,r2,…,rn) Standard assumption: Unconditional probability Pr(A) same for all A. Maximize Pr(r1,r2,…,rn|A) and output A that maximizes.
Computing Pr(r1,r2,…,rn|A) • A = abcdefghicdefghijkl r1= abcd, r2 = cdefghi, r3 = hijkl, r4 = hicd • Process of generating reads: • Pick a random starting point. • Pick a length at random • Pr(r2) = ? = 2/Length * Pr(read length = 7) = δ2/Length * Pr(read length = 7)
Computing Pr(r1,r2,…,rn|A) Generative model: • For i=1 to n • Pick starting point for ri • Pick length Li • Probability of generating ri: = Number of times ri appears in A/Length of A * Probability of getting the correct length = δi/L * Pr(Li)
Computing Pr(r1,r2,…,rn|A) • Pr = Πiδi/L * Pr(Li) • We want to pick A for which this probability is maximum • Score(A) = Πiδi/L * Pr(Li) • Best A will have max value of Πiδi/L • So now we have a program
Finding the best assembly • Maximize Πiδi/L • s.t. • L known approximately. L = Lgeε ≈ Lg(1+ε) • Lg, Lij : constants. Solve for Xij≥0, δj≥0, ε • Convex optimization
Concluding Remarks • Experiments seem to indicate our approach works well. www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf • Computationally intensive, but well founded. • May not be useful for large genomes – linear time algorithms only! • How to handle pair constraints: important open problem. • Graphs are everywhere!