THE RNA DETECTIVE GAME: FINDING RNA CHAINS FROM FRAGMENTS. RNA. Detective. Fred Roberts, Rutgers University. DNA and RNA. Deoxyribonucleic acid, DNA, is the basic building block of inheritance. DNA can be thought of as a chain consisting of bases.
THE RNA DETECTIVE GAME: FINDING RNA CHAINS FROM FRAGMENTS. Fred Roberts, Rutgers University
DNA and RNA Deoxyribonucleic acid, DNA, is the basic building block of inheritance. DNA can be thought of as a chain consisting of bases. Each base is one of four possible chemicals: Thymine (T), Cytosine (C), Adenine (A), Guanine (G)
DNA and RNA Some DNA chains: GGATCCTGG, TTCGCAAAAAGAATC. Real DNA chains are long: Algae (P. salina): 6.6 x 10^5 bases long. Slime mold (D. discoideum): 5.4 x 10^7 bases long.
DNA and RNA Insect (D. melanogaster, the fruit fly): 1.4 x 10^8 bases long. Bird (G. domesticus): 1.2 x 10^9 bases long.
DNA and RNA Human (H. sapiens): 3.3 x 10^9 bases long. The sequence of bases in DNA encodes certain genetic information. In particular, it determines long chains of amino acids known as proteins.
DNA and RNA How many possible DNA chains are there in humans?
Aside: Counting Fundamental methods of combinatorics are important in mathematical biology.
The Product Rule How many sequences of 0’s and 1’s are there of length 2? There are 2 ways to choose the first digit and, no matter how we choose the first digit, there are 2 ways to choose the second digit. Thus, there are 2x2 = 2^2 = 4 ways to choose the sequence: 00, 01, 10, 11. How many sequences are there of length 3? By similar reasoning: 2x2x2 = 2^3 = 8.
The Product Rule Is this interesting?
The Product Rule Boring!
The Product Rule Really boring!
The Product Rule Counting may be boring at times, but we will see that it can be really powerful.
The Product Rule Product Rule: If something can happen in n1 ways and, no matter how the first thing happens, a second thing can happen in n2 ways, then the two things together can happen in n1 x n2 ways. More generally, if something can happen in n1 ways and, no matter how the first thing happens, a second thing can happen in n2 ways, and no matter how the first two things happen a third thing can happen in n3 ways, and so on, then all the things together can happen in n1 x n2 x n3 x … ways.
DNA and RNA How many possible DNA chains are there in humans? How many DNA chains are there with two bases? Answer (Product Rule): 4x4 = 4^2 = 16. There are 4 choices for the first base and, for each such choice, 4 choices for the second base. How many with 3 bases? How many with n bases?
DNA and RNA How many with 3 bases? 4^3 = 64. How many with n bases? 4^n. How many human DNA chains are possible? 4^(3.3 x 10^9). This is greater than 10^(1.98 x 10^9) (1 followed by 1.98 billion zeroes!)
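The counts above can be checked with a short Python sketch. The function name `count_chains` and the variable names are illustrative assumptions, not part of the slides. Because 4^(3.3 x 10^9) is far too large to print, the sketch works with its base-10 logarithm instead.

```python
import math
from itertools import product

def count_chains(n, alphabet="ACGT"):
    """Number of chains of length n over the alphabet, by the product rule."""
    return len(alphabet) ** n

# Brute-force check for a small case: list every 2-base chain and count them.
two_base_chains = ["".join(p) for p in product("ACGT", repeat=2)]

# Exponent of 10 for the number of human-length chains: log10(4^n) = n * log10(4).
human_length = 3.3e9
log10_count = human_length * math.log10(4)

print(count_chains(2), count_chains(3), len(two_base_chains))
print(f"4^(3.3e9) is roughly 10^({log10_count:.4g})")
```

The logarithm comes out a little above 1.98 x 10^9, which is where the bound 10^(1.98 x 10^9) comes from.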
DNA and RNA RNA is a “messenger molecule” whose links are defined from DNA. An RNA chain has at each link one of four bases. The possible bases are the same as those in DNA except that the base Uracil (U) replaces the base Thymine (T).
The RNA Detective Game Sample RNA chains: GGCAUUGGA, UAUAUGCGGCUUC RNA chains are very long. Can we discover what they look like without actually observing them? Trick: Use enzymes.
The RNA Detective Game Some enzymes break up an RNA chain into fragments after each G link. Some enzymes break up the chain after each C or U link. Consider the chain CCGGUCCGAAAG Applying the G enzyme breaks the chain into the following fragments: G fragments: CCG, G, UCCG, AAAG We know that these are the fragments, but we do not know the order in which they appear. How many possible chains have these four fragments?
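A minimal sketch of what the two enzymes do to a chain. The helper name `digest` is a hypothetical choice for illustration. The function returns fragments in chain order, whereas in the game the fragments arrive unordered; sorting the result mimics that loss of order.

```python
def digest(chain, cut_bases):
    """Split chain after every base in cut_bases; return fragments in chain order."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_bases:
            fragments.append(current)
            current = ""
    if current:  # trailing piece that does not end in a cut base
        fragments.append(current)
    return fragments

chain = "CCGGUCCGAAAG"
g_fragments = digest(chain, "G")     # G enzyme
uc_fragments = digest(chain, "UC")   # U,C enzyme
print(g_fragments)
print(sorted(uc_fragments))          # sorted: the detective does not see the order
```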
The RNA Detective Game Chain: CCGGUCCGAAAG G fragments: CCG, G, UCCG, AAAG Product rule again: 4 choices for the first fragment, for each such choice 3 choices for second fragment, … There are 4x3x2x1 = 4! = 24 possible chains. One chain corresponding to each permutation of these four fragments. One such chain different from the original: UCCGGCCGAAAG
The RNA Detective Game Chain: CCGGUCCGAAAG Suppose we instead apply the U,C enzyme. We get the following fragments: U,C fragments: C, C, GGU, C, C, GAAAG How many chains are there with these fragments? Is 6! = 720 the correct answer??? Consider two permutations: one takes the fragments in the order given; the other swaps the first two fragments (both are C) and leaves all others in place. They give rise to the same chain.
The RNA Detective Game So 6! is wrong. What is the answer?? What if the fragments were C, C, C, C, C? There are 5! permutations of these fragments, but only one RNA chain with these fragments: CCCCC
Multinomial Coefficients Putting n distinguishable balls into k distinguishable boxes: The number of ways to put n1 balls into the first box, n2 balls into the second box, …, nk balls into the kth box is denoted by C(n;n1,n2,…,nk), where n = n1 + n2 + … + nk.
Multinomial Coefficients Theorem: C(n;n1,n2,…,nk) = n!/(n1! n2! … nk!) Example: How many RNA chains of length 6 have 3 C’s and 3 A’s? Think of 2 boxes, a C box and an A box. How many ways are there to put 3 positions (balls) into the C box and 3 into the A box? Answer: C(6;3,3) = 6!/(3! 3!) = 20. Some of these are: CACACA, ACACAC, AAACCC.
Multinomial Coefficients If a 6-link RNA chain is chosen at random, what is the probability of obtaining one with 3 C’s and 3 A’s? Answer: There are 4^6 = 4096 possible RNA chains of length 6. The probability is therefore C(6;3,3)/4^6 = 20/4096 ≈ 0.005.
Multinomial Coefficients The number of 10-link RNA chains consisting of 3 A’s, 2 C’s, 2 U’s, and 3 G’s is C(10;3,2,2,3) = 25,200 What if we know they end in AAG? Then, only the first 7 positions need to be filled, and 2 A’s and one G are already used up. Hence, the answer is C(7;1,2,2,2) = 630 Notice how knowing the end of a chain can dramatically reduce the number of possible chains.
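The multinomial counts on these slides can be reproduced with a few lines of Python. This is a sketch: the function name `multinomial` is an assumption for illustration, and the formula is the one in the theorem above.

```python
from math import factorial

def multinomial(n, *parts):
    """C(n; n1, ..., nk) = n! / (n1! n2! ... nk!), requiring n1 + ... + nk = n."""
    if sum(parts) != n:
        raise ValueError("box sizes must add up to n")
    result = factorial(n)
    for k in parts:
        result //= factorial(k)  # each intermediate quotient is an integer
    return result

print(multinomial(6, 3, 3))            # chains with 3 C's and 3 A's
print(multinomial(6, 3, 3) / 4 ** 6)   # probability of drawing such a chain
print(multinomial(10, 3, 2, 2, 3))     # 3 A's, 2 C's, 2 U's, 3 G's
print(multinomial(7, 1, 2, 2, 2))      # same bases, chain known to end in AAG
```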
The RNA Detective Game Recall that we have the following U,C fragments: C, C, GGU, C, C, GAAAG The number of RNA chains with these fragments is not 6! = 720. Think of having 6 positions (there are 6 fragments) and assigning 4 positions to the C box, 1 to the GGU box, and 1 to the GAAAG box. Then the number of ways of doing this is C(6;4,1,1) = 6!/(4! 1! 1!) = 30.
The RNA Detective Game U,C fragments: C, C, GGU, C, C, GAAAG Actually, this computation is still a bit off, though not because the combinatorial argument is wrong. Notice that the fragment GAAAG does not end in U or C. Thus, we know it comes last. There are 5 remaining U,C fragments. The number of chains beginning with these 5 fragments is given by C(5;4,1) = 5 Beginning of the chains: CCCCGGU, CCCGGUC, CCGGUCC, CGGUCCC, GGUCCCC
The RNA Detective Game We get all chains with the given U,C fragments by adding GAAAG to the end of each of these: CCCCGGUGAAAG CCCGGUCGAAAG CCGGUCCGAAAG CGGUCCCGAAAG GGUCCCCGAAAG
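Both counts for the U,C fragments can be confirmed by brute force. The sketch below uses hypothetical helper names (`distinct_chains`, `chains_with_last`) and simply tries every ordering, so it is only practical for a handful of fragments.

```python
from itertools import permutations

def distinct_chains(fragments):
    """All distinct chains obtained by concatenating the fragments in some order."""
    return sorted({"".join(order) for order in permutations(fragments)})

def chains_with_last(fragments, cut_bases):
    """Distinct chains, forcing a fragment that does not end in a cut base to come last."""
    abnormal = [f for f in fragments if f[-1] not in cut_bases]
    if len(abnormal) > 1:
        return []  # only one fragment can be the tail of the chain
    if not abnormal:
        return distinct_chains(fragments)
    rest = list(fragments)
    rest.remove(abnormal[0])
    return [head + abnormal[0] for head in distinct_chains(rest)]

uc_fragments = ["C", "C", "GGU", "C", "C", "GAAAG"]
print(len(distinct_chains(uc_fragments)))      # ignores the "GAAAG comes last" rule
for chain in chains_with_last(uc_fragments, "UC"):
    print(chain)
```

Collecting the concatenations in a set is what removes the overcounting in 6!: orderings that differ only by swapping identical C fragments produce the same string.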
The RNA Detective Game Thus, there are 24 possible chains with the given G fragments and 5 with the given U,C fragments. But: We have not yet combined our knowledge of both G and U,C fragments. G fragments: CCG, G, UCCG, AAAG U,C fragments: C, C, GGU, C, C, GAAAG Which of the 5 chains with these U,C fragments has the right G fragments?
The RNA Detective Game G fragments: CCG, G, UCCG, AAAG U,C fragments: C, C, GGU, C, C, GAAAG Which of the 5 chains with these U,C fragments has the right G fragments? CCCCGGUGAAAG CCCGGUCGAAAG CCGGUCCGAAAG CGGUCCCGAAAG GGUCCCCGAAAG CCCCGGUGAAAG does not: It has CCCCG as a G fragment. What about the others?
The RNA Detective Game Checking the remaining 4 possible RNA chains with the given U,C fragments shows that only the third one, CCGGUCCGAAAG has the given G fragments. Hence, we have recovered the initial chain. This is an example of recovery of an RNA chain given a complete digest by enzymes. How remarkable is it that we could recover the initial RNA chain this way?
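The recovery can be sketched as a brute-force search: build every chain from the G fragments, then keep those whose two digests match the given fragment lists. The names `digest` and `recover` are illustrative assumptions, and trying all orderings is only feasible for small examples.

```python
from itertools import permutations

def digest(chain, cut_bases):
    """Split chain after every base in cut_bases; return fragments in chain order."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_bases:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

def recover(g_fragments, uc_fragments):
    """Chains whose G digest and U,C digest both match the given fragments."""
    target_g, target_uc = sorted(g_fragments), sorted(uc_fragments)
    candidates = {"".join(order) for order in permutations(g_fragments)}
    return sorted(
        c for c in candidates
        if sorted(digest(c, "G")) == target_g
        and sorted(digest(c, "UC")) == target_uc
    )

print(recover(["CCG", "G", "UCCG", "AAAG"], ["C", "C", "GGU", "C", "C", "GAAAG"]))
```

A result with exactly one chain means the two digests determine the chain; more than one would mean the digests are ambiguous.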
The RNA Detective Game CCGGUCCGAAAG How many RNA chains are there with the same bases as this chain? There are 12 bases: 4 C’s, 4 G’s, 3 A’s, and 1 U. The number of chains with these bases is given by C(12;4,4,3,1) = 138,600. Thus, knowing how many of each base the chain contains is not nearly as useful as knowing the fragments.
The RNA Detective Game Another example. G fragments: UG, ACG, AC U,C fragments: U, GAC, GAC Step 1: Does any fragment have to come last?
The RNA Detective Game G fragments: UG, ACG, AC U,C fragments: U, GAC, GAC Step 1: Does any fragment have to come last? None of the U,C fragments has to come last. However, the G fragment AC has to come last. Thus, the other two G fragments come first in some order and there are only two possible RNA chains with these G fragments: UGACGAC, ACGUGAC
The RNA Detective Game G fragments: UG, ACG, AC U,C fragments: U, GAC, GAC There are only two possible RNA chains with these G fragments: UGACGAC, ACGUGAC The latter has AC as a U,C fragment. So, the former is the correct chain.
The RNA Detective Game Is it always possible to completely recover the original RNA chain given its G fragments and U,C fragments? RNA
The RNA Detective Game Is it always possible to completely recover the original RNA chain given its G fragments and U,C fragments? No: sometimes the solution is ambiguous. Exercise: Find two RNA chains with the same G and U,C fragments.
Eulerian Paths Surprisingly, eulerian paths in multidigraphs can be used to help with the RNA detective game. When a digraph is allowed to have more than one arc from vertex x to vertex y, we call it a multidigraph. A path in a multidigraph is called eulerian if it uses every arc once and only once. (Recall the Königsberg Bridge Problem.) A closed path (one that ends where it starts) is eulerian if it is eulerian as a path.
Eulerian Paths [Figure: multidigraph on vertices a, b, c, d, e] eulerian closed path: a, b, c, d, b, e, a
Eulerian Paths [Figure: multidigraph on vertices a, b, c, d, e] eulerian path: a, b, c, d, b, e
Eulerian Paths When does a multidigraph have an eulerian path or closed path? Theorem (I.J. Good, 1946): A connected multidigraph has an eulerian closed path iff for every vertex, the indegree (number of incoming arcs) equals the outdegree (number of outgoing arcs). Theorem (I.J. Good, 1946): A connected multidigraph has an eulerian path iff indegree equals outdegree at every vertex, with the possible exception of two vertices: at one of these, outdegree exceeds indegree by one (the path starts there), and at the other, indegree exceeds outdegree by one (the path ends there).
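The degree conditions in these theorems are easy to test mechanically. In the sketch below, a multidigraph is a list of (tail, head) arcs, and `euler_status` is a hypothetical name. The function assumes the multidigraph is connected and does not check this. The example arc lists are an assumption: they are read off the vertex sequences a, b, c, d, b, e, a and a, b, c, d, b, e, since the original figures are not available here.

```python
from collections import Counter

def euler_status(arcs):
    """Classify a connected multidigraph by the indegree/outdegree conditions.

    Returns "closed path", "path", or "none". Connectivity is assumed, not checked.
    """
    outdeg, indeg = Counter(), Counter()
    for tail, head in arcs:
        outdeg[tail] += 1
        indeg[head] += 1
    vertices = set(outdeg) | set(indeg)
    # Record outdegree - indegree only where the two differ.
    diffs = {v: outdeg[v] - indeg[v] for v in vertices if outdeg[v] != indeg[v]}
    if not diffs:
        return "closed path"
    if sorted(diffs.values()) == [-1, 1]:
        return "path"
    return "none"

# Arcs assumed from the closed path a, b, c, d, b, e, a.
closed_example = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("b", "e"), ("e", "a")]
# Dropping the arc e -> a leaves the arcs of the path a, b, c, d, b, e.
path_example = closed_example[:-1]

print(euler_status(closed_example))
print(euler_status(path_example))
```

A loop (x, x) adds one to both counters for x, so it never changes the classification.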
Eulerian Paths [Figure: example multidigraphs; surviving vertex labels: a, b, d, a, b, c]
Eulerian Paths Note that these theorems hold if there are loops from a vertex to itself. A loop adds 1 to indegree and 1 to outdegree. Thus, loops do not affect the existence of eulerian paths or closed paths.
Eulerian Paths and the RNA Detective Game Assume that there are at least two G fragments and at least two U,C fragments. Otherwise, we can recover the original chain immediately: if an enzyme yields only one fragment, that fragment is the whole chain. Example: G fragments: CCG, G, UCACG, AAAG, AA U,C fragments: C, C, GGU, C, AC, GAAAGAA
Eulerian Paths and the RNA Detective Game G fragments: CCG, G, UCACG, AAAG, AA U,C fragments: C, C, GGU, C, AC, GAAAGAA Step 1: Break down each fragment after each G, U, or C. E.g.: GAAAGAA becomes GxAAAGxAA GGU becomes GxGxU UCACG becomes UxCxACxG Each piece is called an extended base. All extended bases in a fragment except first and last are called interior extended bases.
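Step 1 can be sketched in a few lines of Python. The function name `extended_bases` is an assumption chosen to match the term on the slide.

```python
def extended_bases(fragment):
    """Break a fragment after each G, U, or C; the pieces are its extended bases."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:  # a trailing run of A's forms the final extended base
        pieces.append(current)
    return pieces

for fragment in ["GAAAGAA", "GGU", "UCACG"]:
    pieces = extended_bases(fragment)
    # Interior extended bases: everything except the first and last piece.
    print(fragment, pieces, "interior:", pieces[1:-1])
```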
Eulerian Paths and the RNA Detective Game G fragments: CCG, G, UCACG, AAAG, AA U,C fragments: C, C, GGU, C, AC, GAAAGAA Step 2: Use the extended base breakup of fragments to find the beginning and end of the RNA chain. Start by making two lists. List 1, all interior extended bases of all fragments: C, C, AC, G, AAAG. List 2, fragments with one extended base: G, AAAG, AA, C, C, C, AC.
Eulerian Paths and the RNA Detective Game All interior extended bases of all fragments: C, C, AC, G, AAAG Fragments with one extended base: G, AAAG, AA, C, C, C, AC Theorem: Every entry on the first list is on the second list. There are always exactly two entries on the second list not on the first. One of these is the first extended base of the entire RNA chain and the other is the last. Thus: chain begins in AA or C and ends in AA or C. How do you tell how it ends?
Eulerian Paths and the RNA Detective Game Thus: chain begins in AA or C and ends in AA or C. How do you tell how it ends? One of these must be from an abnormal fragment: a G fragment that doesn’t end in G or a U,C fragment that doesn’t end in U or C. G fragments: CCG, G, UCACG, AAAG, AA U,C fragments: C, C, GGU, C, AC, GAAAGAA AA is such an abnormal fragment. An abnormal fragment marks the end of the chain. So: chain ends in AA and begins in C.
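Step 2 and the abnormal-fragment rule can be combined into one sketch. The names `extended_bases` and `chain_ends` are hypothetical. The code assumes the two fragment lists really come from complete digests of a single chain, so that the two-list theorem applies; it does not validate its input.

```python
from collections import Counter

def extended_bases(fragment):
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

def chain_ends(g_fragments, uc_fragments):
    """Return (first, last) extended base of the chain, using the two lists."""
    interior, single = Counter(), Counter()
    for fragment in list(g_fragments) + list(uc_fragments):
        pieces = extended_bases(fragment)
        if len(pieces) == 1:
            single[pieces[0]] += 1        # second list: one-extended-base fragments
        else:
            interior.update(pieces[1:-1])  # first list: interior extended bases
    # Entries on the second list that are not matched on the first list.
    leftover = list((single - interior).elements())
    # Abnormal fragments: a G fragment not ending in G, or a U,C fragment not ending in U or C.
    abnormal = [f for f in g_fragments if f[-1] != "G"]
    abnormal += [f for f in uc_fragments if f[-1] not in "UC"]
    last = extended_bases(abnormal[0])[-1]
    leftover.remove(last)
    return leftover[0], last

g_fragments = ["CCG", "G", "UCACG", "AAAG", "AA"]
uc_fragments = ["C", "C", "GGU", "C", "AC", "GAAAGAA"]
print(chain_ends(g_fragments, uc_fragments))
```

Using `Counter` subtraction treats the lists as multisets, which matters here because C appears three times on the second list but only twice on the first.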