300 likes | 408 Views
“Computers are to Biology what Mathematics is to Physics” - Harold Morowitz. Corollaries:. A computer scientist who does not understand the subject matter and questions arising from Biological research will very likely not be able to make a significant contribution.
E N D
“Computers are to Biology what Mathematics is to Physics” - Harold Morowitz Corollaries: • A computer scientist who does not understand the subject matter and questions arising from Biological research will very likely not be able to make a significant contribution. • A Biologist who does not understand algorithmics and the construction of algorithms will be severely handicapped in doing the Biological Research of the future.
Report of the National Reseearch Council National Academy of Sciences 2002 “Biological concepts, models, and theories are becoming more quantitative, and the connections between the life and physical sciences are becoming deeper and stronger” “Computers now play a central role in the acquisition, storage, analysis, interpretation, and visualization of vast quantities of biological data”
Some Descriptions of Bioinformatics • (A Computer Scientist, 2005) Bioinformatics is a study of the algorithms and programs that are used by Molecular Biologists and others in the Biological and Medical Sciences in their quest for understanding protein structure and function in living organisms • (Claverie & Notredame, 2003) Bioinformatics is nothing but good, sound, regular biology appropriately dressed so that it can fit into a computer. • (Attwood & Parry-Smith, 1999) The current research drive is to be able to understand evolutionary relationships in terms of the expression of protein function. Two computational approaches have been brought to bear on the problem: tackling it from the prespectives of sequence analysis and structure analysis, respectively. • (Augen, 2005) Bioinformatics lies at the intersection of Information Science and Molecular Biology. Furthermore, it’s development is highly dependent on simultaneous technical advances in both areas. • (Pevsner, 2003) Bioinformatics is designed to help the biologist use computer programs and databases to solve biological problems related to proteins, genes, and genomes with a larger goal of understanding broader issues such as the relationship of structure to function, development, and disease. • (Krane & Raymer, 2003) Bioinformatics strives to determine what information is biologically important and to decipher how it is used to precisely control the chemical environment within living organisms.
In Lab 2 we analysed a single sequence of cDNA. We determined the number of each type of nucleotide making up the sequence and also looked for repeated sequences. This type of analysis is useful for identifying key subsequences, determining the makeup of the sample, and searching for genetically caused diseases. It may also serve to help explain the protein that is manufactured by this sample of DNA, In Lab 3 we began our consideration of the problem of comparing two DNA samples. This comparison can lead to identification of the query sequence, finding homologous subsequences within the sequence, identifying strongly conserved regions, and establishing an evolutionary relationship between some of the genes contained within the given sample. Unfortunately, the problem is not a simple one. For example, where do we begin the comparison? What about the nucleotides between conserved regions?
An Analogy Information is stored on a floppy disc as a sequence of 0’s and 1’s. We may be able to read this code and still not tell what is on the disk. Two discs may contain the same files and programs yet the patterns of 0’s and 1’s may be entirely different. WHY? Information is not stored as single sequential files. As files become written and erased, new files are placed in ‘holes’ and maybe distributed in several of these available spaces.
Sequence Comparison A segment of a blast report: Score: 76 What does this number mean? Where did it come from? How good a score is this? We will back off a little and work up to answering these questions.
First Attempt at Alignment – Dot Plots • Basic Idea – • Form a rectangle. • Place the first sequence along the right hand side of the rectangle and the second along the top. • Start with the character at the top of the first sequence place a dot in the rectangle for every character in the second sequence that matches that character from the first sequence. • Examine the result to determine if there are sequences of consecutive characters that match. • If there are such sequences, align them. • Use mismatches and/or gap to match as many matching sequences as possible
Well, That’s the Idea, But ….. The following is the exact implementation of the previous slide. Recall black pixels mean that there was a match. What have you learned from the plot about the sequence alignment?
The two sequences that were aligned were: Escherichia coli O157:H7 mutS And Shigella boydii strain 374(S37) phosphoprotein phosphatase B (prpB) gene, partial sequence The second is a well-conserved homolog of a subsequence of the first. If the Dot Plot method is to be of any value, it should make this fact obvious. The plotting method is modified by using a “sliding window” of some specified length. If the window is of length n the point is only plotted if the character and its next n – 1 characters match. What is a good value for n? There is no consensus. Some other methods use as the default value of 11 since (1/4)11 = .000000238… Which would be the probability that the match of 11 characters would occur at random. The Dot Plot Tool that is part of Mol Kit at Colorado State uses n = 9.
Here is the Dot Plot with n = 11: In this case the relationship is obvious. So the moral of the story is that it is not just the tool, it is how skilled you are at using it.
Let’s do something a little more simple That may also involve some gaps, i.e., indels. How about the sequences ACACTG and ACACTGATCG ? GACGGATTAG and GATCGGAATAG ? Looks like an alignment of ACACTG- - - - ACACTGATCG Is best. Gaps are used to fill out the end of the sequence. One possibility GA- CGGATTAG GATCGGAATAG Note without the gap nothing after GA would match up.
In the last two examples the gaps played roles that are significantly different In the first: ACACTG- - - - ACACTGATCG the gaps probably denote that the quite possibly the first sequence was terminated prematurely In the second: GA- CGGATTAG GATCGGAATAG the gap denotes either an insertion or a deletion. These are called an “indel” ‘s. Note there is also a possible mutation four characters from the end.
Dealing with Gaps Consider the two sequences: AATCTATA and AAGATA These sequences have different lengths – 8 versus 6 If we insist that the sequences do not contain any gaps, i.e. are ungapped, then there are only 3 ways to align them. AATCTATA AATCTATA AATCTATA AAGATA AAGATA AAGATA However, if we decide to put gaps in the smaller sequence to have the two sequence sizes match up. We now have 8 spaces to place the two gaps (note they may be separated). This means there are 8 choose 2 ( C8,2) or 28 possible placements for the gaps and, thus, 28 different possible alignments. Note: Cn,k=n!/(k!*(n-k)!) where n! = n*(n-1)*(n-2)*….*3*2*1. If we were to in addition allow 3 gaps in the top sequence, we would now have C11,3*C11,5 = 76,230 candidates for alignments. The inclusion of gaps makes the problem much larger. But, it is an unavoidable necessity.
Three of the 28 possible alignments with gaps allowed in the shorter sequence. AATCTATA AATCTATA AATCTATA AAG- AT -A AAG- -ATA AAGATA- - When scoring these alignments we need to assign a “penalty” for these gaps or ‘indels’. One possibility is to give a -1 for each gap character in the alignment. Another is to consider starting a gap to be more serious than allowing a gap to continue, i.e. G- AT -A is considered a more serious difference than G- -ATA For example, we might assess a penalty of -5 for starting a gap and only -1 for continuing it. In the above example the first instance would have a penalty of -10 and the second, -6.
Strategy 1 The Identity Matching Scheme This is essentially a “no penalty for mismatches” scheme Penalty for indel = -1 Scores for our previous alignment candidates: AATCTATA AATCTATA AATCTATA AAG- AT -A AAG- -ATA AAGATA- - 1+1+0-1+0+0 -1+1 = 1 1+1+0-1-1+1+1+1= 3 1+1+0+0+1+1-1-1= 2 Or, if we do not assign a penalty for ending gaps 1+1+0+0+1+1 = 4
Discussion of Strategy 1 • The second alignment has the highest score • Problem is: what does this score represent? • There is no recognition of the quality of the mismatch • There is no reference to the history of the two sequences taken in to account. We just have assigned numbers • May not be at all appropriate for a situation where we have a quickly mutating organism, such as a virus. In such an organism substitutions are common – given site may undergo several changes.
Strategy Due to Jukes/Cantor Make the assumption that any nucleotide is equally likely to change into any of the other nucleotides. Jukes and Cantor made no provision for indels. Others have added a gap penalty, but no consensus as to what that should be. We will arbitrarily choose -8. Scoring the three alignments: AATCTATA AATCTATA AATCTATA AAG- AT -A AAG- -ATA AAGATA- - 5+5-4-8-4-4-8+5 = -13 5+5-4-8-8+5+5 = 0 5+5-4-4+5+5-8-8 = -4 OR if we ignore gaps at the end of the sequence: 5+5-4-4+5+5 = 12
Discussion of Jukes/Cantor • The scoring of the alignments yielded exactly the same ranking as was the case for the identity matrix score • The “quality” of the scores is different – It is based on observations of frequency data of “known” sequences. • Assumption that all mismatches are equally likely is not supported by our previous lectures or by examining larger, more recent, databases of known sequences where ancestral data is known. • No standard way to choose a gap penalty.
The Kimura Two Parameter Model Based on assigning different probabilities to different types of matches, i.e., purine – purine, pyrimidine - pyrimidine as opposed to purine – pyrimidine. Once again, the issue of gaps is decided arbitrarily. We will use -8, as in the previous case Scoring the three sequences once more: AATCTATA AATCTATA AATCTATA AAG- AT -A AAG- -ATA AAGATA- - 1+1-5-8-5-5-8+1 = -28 1+1-5-8-8+1+1+1 = -16 1+1-5-5+1+1-8-8 = -20 OR 1+1-5-5+1+1 = -6
Discussion of the Kimura Matrix • More realistically set up – uses more recent (1980) database information to calculate frequency probabilities and assign scoring values. • Seems to have a propensity towards negative scores • Still no resolution of the “proper” weighting for gaps • Note the rankings of the three sequences still maintained the same order. • Again, the issue is the quality of the rankings.
No matter what scoring scheme is used the basic problem is simply this: Given two strands of DNA, how can we find the best possible alignment for these two sequences? We need to follow certain rules: • Must use a “sensible” scoring matrix and gap penalty • The order of the characters in the sequence can not be changed, but gaps can be inserted between them. • Gaps appear in either of the sequences. • The score for the pairing of two characters, one from each sequence or a gap with a character from the other sequence, must be done so that the score to that point is the best possible score.
How do we implement the last rule? • In Mathematics this is called “The Principle of Optimality” or POO (The method of problem solving using POO is called Dynamic Programming). • How can the score for a particular position within the alignment be calculated? • We can pair the two available characters • This adds the score for a match or mismatch to the previous score • We can pair the character in the first string with a gap • We can pair the character in the second string with a gap • Either of these add the gap penalty to the previous score Scorei = Scorei-1 + s(a,b) Scorei = Scorei-1 +s(a,-) Scorei = Scorei-1 + s(-,b)
The Three Ways To Score A Position Gap in second sequence Pair the two symbols Gap in first sequence First make a table with the first sequence along the right and the second along the top. Calculate the three possible values for each cell. Choose the largest of these values.
Let’s try one The negative numbers along the top and down the right are the gap penalties for leading gaps in either of the sequences. Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1
Calculating a Cell Value The negative numbers along the top and down the right are the gap penalties for leading gaps in either of the sequences. 1 -2 -2 Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1
Filling in the row and column Indicate the arrow that gave the highest score for each cell. Dynamic Programming records the best score and where it came from. This will be important when the table is filled Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1
The Complete Matrix Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1
Dynamic Programming Finds the Best Score and the Corresponding Alignment Alignment: Start in lower right corner and work backwards: AC- - TCG ACAGTAG
Rules to Discover The Alignment • Start in the lower right box – this box contains the best alignment score for the two sequences relative to this particular scoring scheme. NOTE: This may NOT be the largest value in the table, but it is the best score for completely aligning the two sequences. All other scores in the table are for partial alignments of the sequences. • Work backwards following the arrows from the present box in reverse order. • Diagonal arrow is a pairing of the characters • Vertical arrow represents a gap in the sequence across the top • Horizontal arrow represents a gap in the sequence along the side.