Measuring the degree of similarity: PAM and BLOSUM matrices

Measuring the degree of similarity: PAM and BLOSUM matrices • Lecture 13

Introduction • Measurement of matching • Nucleic acid and amino acid substitutions • The BLOSUM matrix • The PAM matrix • Appropriate use of the BLOSUM and PAM matrices • Measurement of alignment gaps

Measurement of matching • The dot plot gives a visual representation of a sequence alignment. So how do we measure the alignment? • One way is to count the matches and mismatches and take the difference between them. • Hamming distance: the number of mismatched positions between two strings of equal length. • agtc • cgta • Distance is 2 (give another example)
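The Hamming distance on this slide can be sketched in a few lines of Python. This is a minimal illustration, and the function name `hamming_distance` is an assumption made for the example, not something from the lecture.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


print(hamming_distance("agtc", "cgta"))  # 2, as on the slide
```

The length check matters: for strings of different length the measure is undefined, which is why the next slide turns to an edit distance.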

Measurement of matching • If the sequences (strings) are not of equal length, then use: • The Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another: • ag-tcc • cgctca • What is the Levenshtein distance? • But what about the biological plausibility of this approach? Strings are not the same as sequences! (hint: amino acid alignment)
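One common way to compute the Levenshtein distance is a row-by-row table of partial distances. The sketch below assumes unit cost for each alter, insert and delete operation; the function name `levenshtein_distance` is illustrative.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning s1 into s2."""
    # previous[j] is the distance between the prefix of s1 processed so far
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]  # turning i characters of s1 into the empty string
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # delete a
                               current[j - 1] + 1,       # insert b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Calling `levenshtein_distance("agtcc", "cgctca")` gives a way to check an answer to the question on the slide.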

Nucleic acid mutations • It is known that transitions (purine <-> purine, a<->g, or pyrimidine <-> pyrimidine, c<->t) are more common than transversions (purine <-> pyrimidine, e.g. a<->c). • In sequence alignment we are trying to determine the degree of similarity, not dissimilarity, but the Hamming and Levenshtein distances measure dissimilarity. • One approach would be to count the number of matches, but the score then needs to include the bias associated with the possible substitutions.

Nucleic acid scoring table • Based on known rates we could propose a simple table like the following: • each match scores 1000 • a transition (A<->G or C<->T) scores 100 • a transversion (e.g. A<->C) scores 10 • The values correspond to the relative chances of a substitution (or of no substitution).

Nucleic acid scoring table • To calculate the similarity, we compare the two sequences position by position and look up the score for each pair. • Seq 1: agtc • Seq 2: cgta • Scores: 10 1000 1000 10. Since the positions are, we assume, independent events, we have to multiply the values to get the score: 10 × 1000 × 1000 × 10 = 10^8. • log A + log B = log(A × B) • So by taking the log of each value we only have to add: the log10 values give 1 + 3 + 3 + 1 = 8. • What would the table be if log values were used?
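The multiply-versus-add point can be checked with a short sketch. It uses the illustrative weights from the slide (1000, 100, 10) and assumes the standard classification: a substitution within the purines (a, g) or within the pyrimidines (c, t) is a transition, anything else a transversion. The function names are made up for this example.

```python
import math

PURINES = {"a", "g"}  # the remaining bases, c and t, are pyrimidines


def substitution_weight(x: str, y: str) -> int:
    """Illustrative weights: match 1000, transition 100, transversion 10."""
    if x == y:
        return 1000
    is_transition = (x in PURINES) == (y in PURINES)
    return 100 if is_transition else 10


def product_score(s1: str, s2: str) -> int:
    """Multiply the per-position weights (positions assumed independent)."""
    return math.prod(substitution_weight(x, y) for x, y in zip(s1, s2))


def log_score(s1: str, s2: str) -> int:
    """Add the log10 of each weight instead of multiplying the weights."""
    return sum(round(math.log10(substitution_weight(x, y)))
               for x, y in zip(s1, s2))


print(product_score("agtc", "cgta"))  # 100000000
print(log_score("agtc", "cgta"))      # 8
```

Working with logs keeps the numbers small and turns the scoring of an alignment into a simple sum, which is the form real substitution matrices use.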

Nucleic acid matrix • With log values, all we have to do is add the values. Note that this is an example to illustrate the concept; it is not an actual substitution matrix for nucleic acids (bases), which can be found on the internet. Lesk (2008, p. 255) gives an example of one. Measurement of sequence similarity plays a much greater role in assessing proteins. Why do you think the similarity of proteins is more critical than that of nucleic acids? (hint: the genetic code and amino acid properties)

Measuring protein similarity • Deriving a matrix for proteins is more complex because: • There are 20 amino acids, so there is a much larger set of substitutions. • The amino acids have properties that affect the structure and so the function of the protein. • Therefore substitutions can be conserved or semi-conserved. • Observations show that conserved substitutions, e.g. hydrophobic <-> hydrophobic, are more common than semi-conserved ones, e.g. hydrophilic <-> hydrophobic.

PAM 1 matrix • PAM (point accepted mutation, also read as percent accepted mutation) 1 corresponds to one accepted point mutation per 100 residues; in other words, a first round of divergence. The scores in the PAM 1 table depend on the expected frequency of each substitution. • Clearly A <-> A, no change, has a high score. • Hydrophobic <-> hydrophobic: V <-> A (13), while V <-> I is (57) • Hydrophilic <-> hydrophilic: K <-> T (11); K <-> R (37) • Hydrophilic <-> hydrophobic: K <-> V (1)

Dayhoff PAM (250) matrix • The most commonly used PAM matrix is PAM 250. • It represents a greater degree of evolutionary divergence and corresponds to multiplying the PAM 1 matrix by itself 250 times (repeated matrix multiplication, not dynamic programming). • To derive the values you use: • the observed rate of mutation divided by the random mutation rate (based on the amino acid frequencies). This ratio is the expected value: 1 means no bias, above 1 a positive bias, below 1 a negative bias. • The log10 of this expected value is multiplied by 10 to give the entries in the PAM 250 table. • Therefore a C <-> S value of 2 corresponds to an expected value of 1.6: the substitution occurred 1.6 times more often than if it were random. log10(1.6) = 0.2; multiplying this by 10 gives a value of 2. • The values in the PAM 250 are obviously lower, but the distribution is about the same: why?
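The two steps on this slide, repeated matrix multiplication and the 10 × log10 conversion, can be sketched as follows. The two-letter "alphabet" and its 1% change probability are hypothetical numbers chosen only to keep the example small; the real PAM 1 matrix is 20 × 20. The function names are illustrative.

```python
import math


def log_odds_score(observed: float, expected: float) -> int:
    """10 * log10(observed / expected), rounded to the nearest integer."""
    return round(10 * math.log10(observed / expected))


def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_power(m, n):
    """Multiply the square matrix m by itself to obtain m to the power n (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# A substitution seen 1.6 times more often than chance scores 2.
print(log_odds_score(1.6, 1.0))  # 2

# Hypothetical two-letter alphabet: 1% chance of change per PAM step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250[0])
```

After 250 steps each row of the toy matrix still sums to 1, but the probability of seeing the original letter has dropped well below its PAM 1 value, which mirrors the greater divergence that PAM 250 represents.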

BLOSUM62 matrix • Another matrix, the BLOSUM matrix, used a larger data set (as there was more information available in 1992 than in 1978). • Moreover, BLOSUM looked at substitutions within blocks of conserved sequences, as opposed to point mutations on individual sequences in both conserved and variable regions. [What was the logic behind excluding the variable regions?] • Unlike the PAM 250 matrix, which is the PAM 1 multiplied by itself 250 times, the BLOSUM62 probabilities are derived directly from blocks of sequences sharing 62% conservation. • Like the PAM matrix, it scores conserved substitutions more highly: • Hydrophobic to hydrophobic • V <-> A (0) • V <-> I (3) • Hydrophilic to hydrophilic • K <-> T (-1) • K <-> R (2) • Hydrophobic to hydrophilic • K <-> V (-2)
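Because the matrix entries are already log values, scoring an ungapped alignment is a matter of adding the entries for each aligned pair. The sketch below holds only the five BLOSUM62 entries quoted on this slide, not the full 20 × 20 matrix, so identical residues and any other pair raise a `KeyError`. The names and the example peptides are hypothetical.

```python
# Off-diagonal BLOSUM62 entries quoted on the slide (a small subset only).
BLOSUM62_SUBSET = {
    frozenset("VA"): 0,
    frozenset("VI"): 3,
    frozenset("KT"): -1,
    frozenset("KR"): 2,
    frozenset("KV"): -2,
}


def pair_score(x: str, y: str) -> int:
    """Look up a substitution score; the matrix is symmetric, so order is ignored."""
    return BLOSUM62_SUBSET[frozenset((x, y))]


def ungapped_score(s1: str, s2: str) -> int:
    """Sum the substitution scores of two aligned, equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(x, y) for x, y in zip(s1, s2))


# V/I scores 3, K/R scores 2, K/V scores -2
print(ungapped_score("VKK", "IRV"))  # 3
```

Using a `frozenset` as the key is one way to express that V <-> I and I <-> V share a single entry.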

PAM and BLOSUM matrices • In the PAM matrices, as the number increases so does the evolutionary distance, while it is the reverse in the BLOSUM matrices. • According to Baxevanis (2005), the following pairs are equivalent, which guides the most appropriate use of both matrices: • PAM250 and BLOSUM45 • PAM160 and BLOSUM62

PAM and BLOSUM matrices • Adapted from Baxevanis (2005). • An excellent review of scoring matrices can be found in Henikoff and Henikoff (2000).

Measurement of alignment gaps • Gaps represent insertions and deletions. • They need to be limited so that the alignment remains biologically plausible. • Baxevanis (2005) suggests that no more than “one in 20 is a good rule of thumb”. • Baxevanis (2005) proposes that the use of gaps in alignments is penalised; in other words, each gap reduces the measured similarity. • The penalty associated with using gaps depends on: • opening the gap • extending the gap • the length of the gap.
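A penalty that depends on opening, extending and length is often written as an affine function of the gap length. The sketch below is one such formulation under stated assumptions: the default costs of 10 and 1 are hypothetical values chosen for illustration, and some tools charge the extension cost for every gap position rather than only for positions after the first.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Affine penalty: one opening cost plus an extension cost per further position."""
    if length <= 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * (length - 1)


print(gap_penalty(1))  # 10.0
print(gap_penalty(5))  # 14.0
```

With these numbers one gap of length 5 costs 14, while five separate gaps of length 1 cost 50, so the scheme favours a few longer gaps over many scattered ones. The penalty is subtracted from the summed substitution scores of the alignment.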

The BLAST algorithm • The most widely used approach to determining similarity is the BLAST algorithm. • Basically, the algorithm combines the idea of the dot plot with one of the scoring matrices, such as BLOSUM or PAM. • It is used to determine the best region of local alignment between the query sequence and the target sequences (refer to dot plot example 1 in lecture 12).

Potential exam questions • Discuss how to derive both the PAM and BLOSUM matrices and why it is necessary to use different variants of each in different types of similarity analysis. • The dot plot and the PAM and BLOSUM matrices are important tools in the measurement of amino acid sequence similarity. Discuss the best variant of each that should be used in the determination of sequence alignment similarity. • Distinguish between the two main types of scoring matrices (PAM and BLOSUM) and explain how they are used to measure the amount of similarity between two sequences.

References • Baxevanis, A.D. 2005. Bioinformatics: a practical guide to the analysis of genes and proteins, chapter 11. Wiley. • Lesk, A. 2008. Introduction to bioinformatics, 3rd edition. Oxford University Press.
