DNA sequences alignment measurement

DNA sequences alignment measurement Lecture 13

Introduction • Measurement of alignment “strength” • Nucleic acid and amino acid substitutions • Measurement of alignment gaps

Measurement of aligned sequences • When aligning sequences (DNA/AA) it is assumed that: • they have a common ancestor; • the differences between the sequences are the result of mutations; • important areas like coding sequences (CDS) will be conserved, i.e. there is a bias “against” mutations in these areas. • Furthermore there is a bias in the types of mutation: substitutions are more likely than insertions/deletions. • The dot plot gives a visual representation of aligned regions, but how do we measure the strength of these alignments?

Measurement of aligned sequences • One way is to count the mismatches: the “difference” between the sequences. • Hamming distance: the number of mismatched positions between two strings of equal length. • agtc • cgta Distance is 2 (give another example) • If the sequences (strings) are not of equal length then use: • The Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another: • ag-tcc • cgctca What is the Levenshtein distance? • The latter technique has the advantage of allowing the inclusion of gaps.
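
The two distances above can be sketched in a few lines of Python. This is a minimal illustration, not an optimised implementation; the function names `hamming_distance` and `levenshtein_distance` are chosen here for the example and do not come from the lecture.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of alter/insert/delete operations turning a into b."""
    # previous[j] is the edit distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # alter x into y (free on a match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming_distance("agtc", "cgta"))
    print(levenshtein_distance("agtcc", "cgctca"))
```

The first call reproduces the distance of 2 from the slide; the second can be used to check a hand-worked answer for the unequal-length pair.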

Measurement of matching • But how biologically plausible are these approaches to measuring “differences” between sequences (strings)? • DNA sequence mismatches differ from plain string mismatches: • the probabilities of substitutions, insertions and deletions are not the same. • Certain types of mutation, such as inversions, translocations and duplications, complicate the assessment of similarity; e.g. how would you treat tandem repeats or inverted repeats?

Nucleic Acid mutations • In sequence alignment we are trying to determine whether the differences (similarities) have occurred due to: • chance (random mutations) • a common origin (degree of conservation) • One approach would be to count the percentage of matches, but there is also a need to include the bias associated with possible substitutions. • However, similarity does not necessarily imply a common ancestor, or vice versa: Zvelebil and Baum (2008, p. 74) suggest this can occur in convergent/divergent evolution. • So the findings of alignment tests need to be put in context (bat and bird both have wings…).

Alignment Scoring methods • In general, each aligned position is given a score, and the alignment with the largest total score is optimal and is chosen; however, suboptimal alignments may also need to be considered. • The most basic approach is to measure the percentage of similarity. • Given that not all “changes” occur with equal chance, there is a need to develop: • A nucleotide substitution matrix
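
The most basic measure, percentage similarity, can be sketched as follows for a pair of sequences that have already been aligned. The function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption made for this sketch; other conventions divide by the shorter sequence or by the ungapped columns only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as a match only if the residues agree and neither is a gap.
    matches = sum(x == y and x != gap for x, y in zip(aligned_a, aligned_b))
    # Assumption: the denominator is the full alignment length, gaps included.
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    print(percent_identity("ag-tcc", "cgctca"))
```

Every mismatch is treated alike here, which is exactly the limitation a substitution matrix is meant to address.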

Nucleotide scoring Matrix • It is known that certain mutations are more likely to occur than others: e.g. transitions (a<->g, c<->t) are more common than transversions (e.g. a<->c). • However, since the difference in probability is insignificant in relation to the chance of a mutation itself, the differences are mostly ignored. The following shows a typical scoring matrix for nucleotides. Adapted from Baxevanis p. 303
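
As a small illustration of the two substitution classes, the sketch below uses the standard definitions: a transition swaps a purine for a purine (a<->g) or a pyrimidine for a pyrimidine (c<->t), while a transversion swaps a purine for a pyrimidine or the reverse. The function name `classify_substitution` is made up for this example.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def classify_substitution(x: str, y: str) -> str:
    """Label a pair of nucleotides as 'match', 'transition' or 'transversion'."""
    x, y = x.lower(), y.lower()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA nucleotide: {base!r}")
    if x == y:
        return "match"
    # Same chemical class on both sides means a transition.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in [("a", "g"), ("c", "t"), ("a", "c"), ("g", "t")]:
        print(pair, classify_substitution(*pair))
```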

Nucleic acid scoring Matrix • The values are based on the probability of a type of substitution occurring (expected value); this includes a nucleotide substituting with itself. • These expected values are calculated as the ratio: • number of “observed changes” / number of changes “due to chance” • These values are obtained by examining large numbers of DNA sequences.

Nucleic acid scoring Matrix • Then calculate 10*log10(“expected value”). • Taking the logarithm means that the values for adjacent nucleotides can be added, as opposed to being multiplied, when determining the alignment score.
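
The conversion from ratio to score, and back again, can be sketched as below. The function names and the example counts are hypothetical and chosen only to show that adding scores is equivalent to multiplying the underlying ratios.

```python
import math


def log_odds_score(observed: float, expected_by_chance: float) -> float:
    """Score = 10 * log10(observed changes / changes expected by chance)."""
    return 10 * math.log10(observed / expected_by_chance)


def ratio_from_score(score: float) -> float:
    """Invert the score to recover the observed/chance ratio."""
    return 10 ** (score / 10)


if __name__ == "__main__":
    # Hypothetical ratios for two adjacent positions: 2 and 4.
    summed = log_odds_score(2, 1) + log_odds_score(4, 1)
    # The sum of the scores equals the score of the product 2 * 4 = 8.
    print(math.isclose(summed, log_odds_score(8, 1)))
    print(ratio_from_score(summed))
```

A ratio of 1 gives a score of 0, so positive scores indicate substitutions seen more often than chance predicts and negative scores indicate substitutions seen less often.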

Nucleic acid scoring Matrix • An expected value of 1 indicates the substitution has the same chance of occurring as it would if it were occurring randomly. • A value greater than 1 indicates a bias in favour of the substitution. • A value less than 1 indicates a bias against the substitution. • A score of 5 corresponds to what expected value?

Measuring Protein similarity • Deriving a matrix for proteins is more complex because: • There are 20 amino acids, so there is a much larger set of possible substitutions. • The amino acids have properties that affect the structure, and so the functionality, of the protein. • Therefore substitutions can be conserved or semi-conserved. • Observations show that conserved substitutions, e.g. hydrophobic <-> hydrophobic, are more common than semi-conserved ones, e.g. hydrophilic <-> hydrophobic.

Dot plot Matrix: imperfect match • Some alignments require gaps to increase the matching score; the gaps are used to represent insertion/deletion mutations. • The diagram shows that most of the 2 sequences are aligned. Breaks in the diagonal indicate areas of non-alignment or mismatch: gaps or substitutions. Adapted from: dotplot example

Measurement of alignment gaps • Gaps represent insertions and deletions. • Baxevanis (2005) suggests that no more than “one gap in 20 pairs is a good rule of thumb”. • Gaps in alignments are penalised, i.e. given a negative scoring value. • The penalty associated with using gaps depends on: • Opening the gap (introducing an insertion or deletion) • Extending the gap (as opposed to opening a new gap) • The length of the gap (the number of deletions/insertions).
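
One common way to combine the opening, extension and length components is an affine gap penalty. The sketch below assumes that the first gap position is charged the opening penalty and each further position the extension penalty; the default values of -5 and -1 and the function names are hypothetical and not taken from the lecture or the cited texts.

```python
import re


def affine_gap_penalty(length: int, gap_open: float = -5.0,
                       gap_extend: float = -1.0) -> float:
    """Penalty for one gap: opening cost plus extension cost per extra position."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned: str, gap_open: float = -5.0,
                      gap_extend: float = -1.0) -> float:
    """Sum the affine penalty over every run of '-' in one aligned sequence."""
    runs = re.findall(r"-+", aligned)
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend) for run in runs)


if __name__ == "__main__":
    # One gap of length 2 and one of length 1 (hypothetical alignment row).
    print(total_gap_penalty("ag--tc-c"))
```

With these values a single gap of length 3 costs -7, while three separate gaps of length 1 cost -15, so the scheme favours fewer, longer gaps over many short ones.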

Gap penalties • There is no overall agreement on what values should be assigned to gap penalties (Zvelebil and Baum 2008). • The purpose of inserting a gap is to increase the strength of the alignment. • So choosing a high penalty will eliminate alignments with gaps, while if the penalty is too low then alignments with more and larger gaps will be chosen. • The value should also depend on how closely “related” the aligned sequences must be: • So sequences requiring a very strict match would use a high gap penalty. • Alignment between distantly related species would use a low gap penalty.

Potential Exam Questions • What is the purpose of measuring the strength of an alignment? (3 marks) • Explain two differences between analysing a string (sequence) and a DNA string. (4 marks) • Describe how you would measure the similarity between two DNA sequences. (10 marks) • Discuss the use of gap penalties in a sequence alignment score. (13 marks)

References • Baxevanis, A.D. (2005) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, chapter 11; Wiley • Lesk, A. (2008) Introduction to Bioinformatics, 3rd edition; Oxford University Press • Zvelebil and Baum (2008) Understanding Bioinformatics
