Pairwise sequence alignment

Pairwise sequence alignment Based on presentation by Irit Gat-Viks, which is based on presentation by Amir Mitchel, Introduction to bioinformatics course, Bioinformatics unit, Tel Aviv University. and of Benny shomer, Bar-Ilan university

Where we are in the course? • Ways to interrogate biobanks: • By identifier-based search (GenBank etc.) • By genome location (genome browsers) • By mining annotation files with scrips • Now: searching by sequence similarity

What is it good for? • Function inference – if we know something about A and A is similar to B, we can say something about B – guilt by association • Conservation arguments – if we know that A and B do something similar, by looking at the conserved segments we can infer which parts of A and B are important for their function • Looking for repeats etc. • Identifying the position of an mRNA/any transcript in the genome • Resequencing • Etc.

Issues with sequence similarity • Things we’re after • A score: how well do two sequences fit? • Statistics: is this score significant or expected at random? • Regions: which parts of the query and the target sequence are actually similar/different? • Next time • How to efficiently search a large sequence database

Start from simple: Dot plots • The most intuitive method to compare two sequences. • Each dot represents a identity of two characters. • No real score/significance, but very easy to assess visually

To Reduce Random Noise in Dot Matrix • Specify a window size, w • Take w residues from each of the two sequences • Among the w pairs of residues, count how many pairs are matches • Specify a stringency

Simple Dot Matrix, Window Size 1

Window Size is 3

Window Size is 3; Stringency is 2

Protein Sequences single residue identity 6 out of 23 identical

Insertion/Deletion, Inversion

ABCDEFGEFGHIJKLMNO tandem duplication compared to no duplication tandem duplication compared to self

What Is This? 5’ GGCGG 3’ Palindrome (Intrastrand)

Identifies low complexity/repeat regions Compare a sequence with itself…

Dotlet example • http://myhits.isb-sib.ch/cgi-bin/dotlet • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=672 • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=353120

Definition Alignment: A matching of two sequences. A good alignment will match many identical (similar) characters in the two sequences VLSPADKTNVKAAWAKVGAHAAGHG ||| | | |||| | |||| VLSEAEWQLVLHVWAKVEADVAGHG

How similar are two sequences? • The common measure of sequence similarity is their alignment score • Simpler measures, e.g., % identity are also common • These require algorithm that compute the optimal alignment between sequences

How to present the alignment? • | - character-wise identity • : - very similar amino acids • . – less similar amino acids • - gap in out of the sequences

Pairwise Alignment - Scoring • The final score of the alignment is the sum of the positive scores and penalty scores: + Number of Identities + Number if Similarities - Number of Dissimilarities - Number of gap insertions - Number of Gap extensions Alignment score

Comparison methods • Global alignment – Finds the best alignment across the whole two sequences. • Local alignment – Finds regions of similarity in parts of the sequences.GlobalLocal _____ _______ __ ____ __ ____ ____ __ ____

Global Alignment • Algorithm of Needleman and Wunsch (1970) • Finds the alignment of two complete sequences:ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ • Semi-global alignment allows “free ends” GFHKKKADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQJKLLKJ

Local Alignment • Algorithm of Smith and Waterman (1981) • Makes an optimal alignment of the best segment of similarity between two sequences. ADLG CDRYFQ |||| |||| | ADLG CDRYYQ • Can return a number of well aligned segments.

Finding an optimal alignment • Pairwise alignment algorithms identify the highest scoring alignment from all possible alignments. • Different scoring systems can produce different best alignments!!! • Unfortunately the number of possible alignments if pretty huge • Dynamic programming to the rescue

Intuition of Dynamic Programming • Lets say we want to align XYZ and ABC • If we already computed the optimal way to: • Align XY and AB – Opt1 • Align XY and ABC – Opt2 • Align XYZ and AB – Opt3 • We now need to test three possible alignments Opt1Z or Opt2Z or Opt3- • Opt1COpt2-Opt3C • (where “-” indicates a gap). • Thus, if we construct small alignments first, we are able to extend then by testing only 3 scenarios.

Formally: solving global alignment Global Alignment Problem: Input: Two sequences S=s1…sn, T=t1….tm (n~m) Goal: Find an optimal alignment according to the alignment quality (or scoring). Notation: • Let (a,b) be the score (weight) of the alignment of character a with character b. • Let V(i,j) be the optimal score of the alignment of S’=s1…si and T’=t1…tj (0  i  n, 0  j  m)

V(i,j) := optimal score of the alignment of S’=s1…si and T’=t1…tj (0  i  n, 0  j  m) V(k,l) is computed as follows: • Base conditions: • V(i,0) = k=0..i(sk,-) • V(0,j) = k=0..j(-,tk) • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = max V(i-1,j) + (si,-) V(i,j-1) + (-,tj) Alignment with 0 elements  spacing S’=s1...si-1 with T’=t1...tj-1 si with tj. S’=s1...si with T’=t1...tj-1 and ‘-’ with tj.

Costs: match 2, mismatch/indel -1 Snapshot of computing the table Optimal Alignment - Tabular Computation • Use dynamic programming to compute V(i,j) for all possible i,j values: for i=1 to n do begin For j=1 to m do begin Calculate V(i,j) using V(i-1,j-1), V(i,j-1), V(i-1,j) end end

Optimal Alignment - Tabular Computation • Add back pointer(s) from cell (i,j) to father cell(s) realizing V(i,j). • Trace back the pointers from (m,n) to (0,0) • Needleman-Wunsch, ‘70 Backtracking the alignment

Solving Local Alignment • Algorithm of Smith and Waterman (1981). • V(i,j) : the value of optimal local alignment between S[1..i] and T[1..j] • Assume the weights fulfill the following condition: (x,y) =  0 if x,y match  0 o/w (mismatch or indel)

Algorithm - Recursive Definition Base Condition: i,j V(i,0) = 0, V(0,j) = 0 Recursion Step:  i>0, j>0 0, V(i,j) = max V(i-1, j-1) + (si, tj), V(i, j-1) + (-, tj), V(i-1, j) + (si, -) Compute i*, j* s.t. V(i*, j*) = max1i n, 1 j mV(i,j) Computing Local Alignment (2) • A scheme of the algorithm: • Find maximum similarity between suffixes of S’=s1...si and T’=t1...tj • Discard the prefixes S’=s1...si, and T’=t1...tj whose similarity is  0 (and therefore decrease the overall similarity) • Find the indices i*, j* of S and T respectively after which the similarity only decreases.

As usual the pointers are created while filling the values in the table, • The alignments are found by tracking the pointers from cell (i*, j*) until reaching an entry (i’, j’) that has value 0.

Computational complexity • Computing the table requires O(n2) operations for both global and local alignment • Saving the pointers for traceback - O(n2) • But – what if we are only interested in the optimal alignment score? • Only need to remember the last row – O(n) space

Outline • We now figured out • What an alignment is • What alignment score consists of • How to ± efficiently compute an optimal alignment • Still left to figure out • Where do we obtain good σ(i,j) values • When do we use global/local alignment • How to use alignment to search large databases

Scoring amino acid similarity • Identity: Count the number of identical matches, divide by length of aligned region. The homology rule: above 25% for amino acids, above 75% for nucleotides. • Similarity: A less well defined measure • A problematic idea: Give positive score for aligning amino acids from the same group Can we find a better definition for similarity?

Scoring System based on evolution • Some substitutions are more frequent than other substitutions • Chemically similar amino acids can be replaced without severely effecting the protein’s function and structure • Orthologous proteins: proteins derived from the same common ancestor • By comparing reasonably close orthologous proteins we can compute the relative frequencies of different amino acid changes • Amino acid substitution matrices: Families of matrices that list the probability of change from one amino acid to another during evolution (i.e., defining identity and similarity relationships between amino acids). • The two most popular matrices are the PAM and the BLOSUM matrix

PAM matrix • PAM units measure evolutionary distance. • 1 PAM unit indicates the probability of 1 point mutation per 100 residues. • Multiplying PAM1 by itself gives higher PAMs matrices that are suitable for larger evolutionary distance. • JTT matrices are a newer generation of PAMs

PAM 1

PAM 250

Log Odds matrices • The score might arise from bias in amino acid frequency -> We use the log odds of the PAM matrix. (120 PAM)

Rules of thumb • The most widely used PAM250 is good for about 20% identity between the proteins • 40% --> PAM120 • 50% --> PAM80 • 60% --> PAM60

PAM vs. BLUSOM • Choosing n • Different BLOSUM matrices are derived from blocks with different identity percentage. (e.g., blosum62 is derived from an alignment of sequences that share at least 62% identity.) Larger n  smaller evolutionary distance. • Single PAM was constructed from at least 85% identity dataset. Different PAM matrices were computationally derived from it. Larger n  larger evolutionary distance • Blosum matrices are newer (based on more sequences) 62 120 250

Gap parameters • Observation: Alignments can differ significantly when using different gap parameters. • Assumption: For each matrix there are constant default parameters that produce optimum alignments. • Each matrix was checked with different parameters until a “true” alignment was reached. • Where can we obtain “true” alignments? • We can use sequence alignments based on structural alignments. • The structural alignments are “true” for our purpose.

Match DNA scoring matrices • Uniform substitutions in all nucleotides: Mismatch

DNA scoring matrices • The bases are divided to two groups, purines (A,G) and pyrmidines (C,T) • Mutations are divided into transitions and transversions. • Transitions – purine to purine or pyrmidine to pyrmidine (4 possibilities) . • Transversions – purine to pyrmidine or pyrmidine to purine. (8 possibilities). • By chance alone transversions should occur twice than transition. • De-facto transitions are more frequent than transversions • Bottom line: Meaningful DNA substitution matrices can be defined

Mismatch transversion Mismatch transition Match DNA scoring matrices • Non-uniform substitutions in all nucleotides:

Evaluation • Remember that one of our goals was to estimate significance of the scores • How can we estimate the significance of the alignment? • Which alignment is better ? ? Does the score arise from order or from composition ?

Evaluation - bootstrap approach • Data with same composition but different order: • Shuffle one of the sequences. • Re-align and score. • Repeat numerous times. • Calculate the mean and standard deviation of shuffled alignments scores.

Evaluation - bootstrap approach • Data with the same composition but with a different order: Shuffle one of the sequences Compare alignment score with mean of shuffled alignments Align with the second sequence Calculate mean and standard deviation of shuffled alignments

Evaluation • We can compare the score of the original alignment with the average score of the shuffled alignments. • Thumb rule:If:original alignment >>average score + 6*SDThen:the alignment is statistically significant.

Pairwise sequence alignment