540 likes | 855 Views
Chapter 3. Pairwise Sequence Alignment. The comparison of sequences is probably the most important bioinformatics analysis First step towards understanding structure and function of a new sequence Most fundamental sequence comparison in sequence alignment
E N D
Chapter 3 Pairwise Sequence Alignment
The comparison of sequences is probably the most important bioinformatics analysis • First step towards understanding structure and function of a new sequence • Most fundamental sequence comparison in sequence alignment • Sequence comparison and alignment gives information on evolutionary relatedness of two sequences • Identification of highly conserved amino acid residues, perhaps involved in enzyme active sites • Place protein of DNA sequence within a family • Allows functional predictions
Sequences that share a common evolutionary ancestor are said to share homology • Sequence similarity is the degree of relatedness between two aligned sequences • For proteins, sequence similarity means conserved amino acids (similar physicochemical properties) at identical positions in the aligned sequence • Sequence identity means identical amino acids at aligned positions • S =[(Ls×2)/(La+Lb)] ×100 • S: percentage sequence similarity • Ls: number of aligned residues • La and Lb: number of residues in sequences a and b, respectively • I =[(Li×2)/(La+Lb)] ×100 • I: percentage identity • I(S)%=Li(s)/La%
Global alignment • Sequences aligned are assumed to be similar over entire length • Single best alignment over entire length • Sequences must be roughly the same length • Highly homologous regions may not be recognized in divergent proteins in an attempt to generate the best global alignment • Local Alignment • Align local areas without regard for rest of sequence • Sequence lengths immaterial • Good to find homologous regions between divergent proteins
Dot Matrix Plots • High noise level (especially DNA; only 4 nucleotides) • Use a sliding window with perfect match to place a dot • Sequence aligned with itself identify internal repeats • Repeats give short diagonals above and below the main diagonal • For proteins can use weighting matrices • Very powerful visual identification of gene duplications or reversals in genome analysis • Lacks statistical rigor • Only applicable to pairwise comparisons Dotmatcher (http://bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html) Dothelix (http://www.genebee.msu.su/services/dhm/advanced.html) Matrixplot (http://www.cbs.dtu.dk/services/MatrixPlot/)
Dynamic Programming Sequences to be aligned: Sequence 1: GAATTCAGTTA Sequence 2: GGATCGA X Z +a 0 Y A 0 Scoring scheme: Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • Make a matrix with the 2 sequences to be aligned • Start at the top left and work left to right, row by row • Since the introduction of a starting gap in either sequence 1 or sequence 2 is 0 (gap score=0) we can fill the first row and first column with 0
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme: Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • Every aligned nucleotide must follow either after a • match/mismatch (diagonal movement, X->A ) • gap in sequence 1 (horizontal movement, Y->A) • gap in sequence 2 (vertical movement, Z-<A) • The value to be written in the yellow box is the sum of either a match or mismatch plus the value in either the X, Y or Z boxes, whatever is the greatest. • must be the maximumAccording to the scoring scheme, the value • Make a matrix with the 2 sequences to be aligned • Start at the top left and work left to right, row by row • Since the introduction of a starting gap in either sequence 1 or sequence 2 is 0 (gap score=0) we can fill the first row and first column with 0
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme: Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • Since a G to G match scores 1, and all three the X, Y and Z boxes have a value 0, the maximum value of the yellow box is 1 • The green box is a G to A mismatch (score=0) • The maximum of the movements from the X, Y and Z boxes is the Y->A movement, since the Y box has a value of 1. • Thus, the green box is maximum[1+0, 0+0, 0+0] which is 1
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme: Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • By the same logic, all the other boxes in the row has a value of 1 up to the first G to G match (red box) • Since a match has a score=1, the maximum value possible in the red box is 2 (maximum[1+1, 0+1, 0+1] for Y->A, X->A and Z->A movements, respectively)
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme: Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • Since the other boxes in the row align G with a mismatched nucleotide (T, T and A; score=0) the values for the remaining boxes are 2.
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme: Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • The first block of the next row is a G to G alignment (score=1) • The value is maximum()+1, 0+1, 1+1) =2
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme: Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • The value of each block is similarly calculated, and the row completed
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • The entire matrix is completed
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • The next part is to trace the path backward from the bottom rights, following the largest values • The A to A match (value 9) must have followed either the • G to T mismatch (X box) • Introduction of a gap in sequence 1 (Y box) • Introduction of a gap in sequence 2 (Z box) • Although we can choose either the X, Y or Z box, because they all have the same value, let choose the X box
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • The next value (8) could only have been obtained from intruducing a gap in either sequence I (Z box) or sequence 2 (Y box) • Let’s choose the Y box GAATTCAGTTA | GGATCGA
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • The next value (8) could only have been obtained from intruducing a gap in sequence 2 (Y box) GAATTCAGTTA | GGATCG--A
Dynamic Programming X Z 1 or 0 0 Y A 0 Scoring scheme Match = 1 Mismatch = 0 Gap=0 Gap extension=0 • Following the route of only or possible predecessors to each box gives a path and the alignments below. • There is more than one patch and alignment with an equal score • Note that the best possible (one or more equally scoring) alignments are identified G-AATTCAGTTA | | | | | GGA-T-C-G--A
Dynamic Programming (More complex scoring scheme) X Z 2 or -1 -2 Y A -2 Scoring scheme Match = 2 Mismatch = -1 Gap=-2 Gap extension=0 • OK, lets try a more complex scoring scheme • Alignment of G to G is one of the following • Introduction of gap into sequence 1 (Y->A box): 0 - 2 • Introduction of gap into sequence 2 (Z->A box): 0 - 2 • Match: 0 + 2 • Thus, maximum(-2,-2,+2) = 2 • We now also indicate where we came from in calculating the maximum
Dynamic Programming (More complex scoring scheme) X Z 2 or -1 -2 Y A -2 Scoring scheme Match = 2 Mismatch = -1 Gap=-2 Gap extension=0 • G to A mismatch • Introduction of gap into sequence 1 (Y->A box): 2 – 2 = 0 • Introduction of gap into sequence 2 (Z->A box): 0 – 2 = -2 • Mismatch: 0 – 1 = -1 • Thus, maximum(0,-2,-1) = 0
Dynamic Programming (More complex scoring scheme) X Z 2 or -1 -2 Y A -2 Scoring scheme Match = 2 Mismatch = -1 Gap=-2 Gap extension=0 • G to A mismatch • Introduction of gap into sequence 1 (Y->A box): 0 – 2 = -2 • Introduction of gap into sequence 2 (Z->A box): 0 – 2 = -2 • Mismatch: 0 – 1 = -1 • Thus, maximum(-2,-2,-1) = -1
Dynamic Programming (More complex scoring scheme) X Z 2 or -1 -2 Y A -2 Scoring scheme Match = 2 Mismatch = -1 Gap=-2 Gap extension=0 • G to T mismatch • Introduction of gap into sequence 1 (Y->A box): -1 – 2 = -3 • Introduction of gap into sequence 2 (Z->A box): 0 – 2 = -2 • Mismatch: 0 – 1 = -1 • Thus, maximum(-2,-2,-1) = -1
X Z Dynamic Programming (More complex scoring scheme) 2 or -1 -2 Y A -2 Scoring scheme Match = 2 Mismatch = -1 Gap=-2 Gap extension=0 • Using the same logic, complete the matrix
Dynamic Programming X Z +a 0 Y A 0 • Continue to the bottom right cell • The highest score of the Y→A, X → A and Z → A are written in the A box
Dynamic Programming X Z +a 0 Y A 0 • Trace a path back from the bottom right cell, following the cells with the highest scores ATTG-C A--GGC More than one solution!
Dynamic Programming Global alignmentNeedleman-Wunsch algorithm Sequences of similar length May miss homologous local regions Example: GAP (http://bioinformatics.iastate.edu/aat/align/align.html) Local alignments Smith-Waterman algorithm Trace back from highest to lowest value cell Aim is to get highest local score Alignment of divergent sequences or alignment of sequences with multiple domains Examples: SIM: (http://bioinformatics.iastate.edu/aat/align/align.html) SSEARCH: (http://pir.georgetown.edu/pirwww/search/pairwise.html) LALIGN: (http://www.ch.embnet.org/software/LALIGN_form.html)
Scoring matrices DNA comparisons easy: 4 nucleotides Transitions (G →A, C →T) more like than transversion (G →C, A →T) Proteins composed of 20 amino acids with physicochemical groups Comparisons can be based on giving “conservative” mutations higher scores that “non-conservative” Reflect functionality or folding and structural stabilization Alternative is to base comparisons on empirical data: observed conservation in aligned sequences – PAM and BLOSUM Positive score: greater frequency observed than expected by random mutation Zero: expected by random mutation Negative score: lower frequency observed than expected by random mutation
Log-odds Log2/10 (observed rate/random rate) Alignment shows nine F and one I Thus, observed frequency is 1/10 = 0.1 Random substitution for F by I is 1/20 = 0.05 Thus log2(0.1/0.05) = 1 Twice as likely to be substituted than by random chance
PAM matrices • Alignment of 71 groups of closely related proteins • Since the proteins were related, the observed mutations were not expected to chance function considerably • Thus, “point accepted mutation” • Protein sequences clustered using maximum parsimony • Matrix based on evolutionary divergence of proteins in the same cluster • 1 PAM unit means 1% of amino acids (1 in 100) have mutated • A group of proteins with 1% difference is chosen to construct PAM1 matrix • Number of mutational changes for a residue divided by number of residues in the sequence normalized to random mutation expressed as log10
PAM250 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 0 0 0 -8 R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 -1 0 -1 -8 N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 2 1 0 -8 D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 3 3 -1 -8 C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -4 -5 -3 -8 Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 1 3 -1 -8 E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 3 3 -1 -8 G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 0 0 -1 -8 H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 1 2 -1 -8 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -2 -2 -1 -8 L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -3 -3 -1 -8 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 1 0 -1 -8 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -2 -2 -1 -8 F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -4 -5 -2 -8 P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 -1 0 -1 -8 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 0 0 0 -8 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 0 -1 0 -8 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -5 -6 -4 -8 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -3 -4 -2 -8 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 -2 -2 -1 -8 B 0 -1 2 3 -4 1 3 0 1 -2 -3 1 -2 -4 -1 0 0 -5 -3 -2 3 2 -1 -8 Z 0 0 1 3 -5 3 3 0 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3 -1 -8 X 0 -1 0 -1 -3 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 0 0 -4 -2 -1 -1 -1 -1 -8 * -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 1
BLOSUM • Black amino acid substitution matrices • Empirically on 2000 amino acid patterns representing 500 groups • Blocks of less than 60 ungapped amino acids (blocks) • BLOSUM number indicated actual shared identity • BLOSUM62 average identity of 62% • Log2 of observed mutation frequency relative to that expected by random mutation
Difference between PAM and BLOSUM • Only PAM1 observed, other derived • All BLOSUM matrices are observed • PAM matrices mostly used to construct phylogenetic trees • BLOSUM based on conserved blocks, more successful for searching conserved regions • BLOSUM outperforms PAM on alignment accuracy • Modern PAM based on larger datasets: Gonnet and the Jones-Taylor-Thornton matrices
Statistical significance of sequence alignment When large numbers of sequences are aligned and the alignment score plotted against the standard deviation off the average, a “Gumbel extreme value” distribution is obtained • Get best alignment between two sequences • Jumble letters of one sequence and recalculate “alignment” score • Do this randomization a large number of times • Use this data top generate a Gumbel extreme value distribution • Compare the first, proper alignment with the distribution. • If the score is located at the extreme margins of the distribution, the alignment is significant • A P-value is given to indicate the probability of the original alignment being due to random chance
Chapter 4 Database Similarity Searching
An important bioinformatics application is to find sequences in a database that are homologous to a query sequence • There are three main concerns • Sensitivity: ability to find correct matches (“true positives”) • Selectivity: ability to ignore incorrect matches (“false positives”) • Speed • Dynamic programming is too inefficient to perform this task • Heuristic programming better at finding an optimal solution • “A 'heuristic (hyu̇-ˈris-tik)' is a method to help solve a problem, commonly informal. It is particularly used for a method that often rapidly leads to a solution that is usually reasonably close to the best possible answer. Heuristics are "rules of thumb", educated guesses, intuitive judgments or simply common sense.” – Wiki dictionary
Basic Local Alignment Search Tool (BLAST) Developed by Stephen Altchul, NCBI, 1990 (http://blast.ncbi.nlm.nih.gov/Blast) Create a list of words from the query sequence (“seeding”) For the sequence MRDPYNKLIS, PYN is an example Scan every 3 residues to be used in the search Assuming one of the words finds a match in the database Query PYN PYN PYN PYN … Database PYN PFN PFQ PFE… Calculate sums of matched scores using BLOSUM62 matrix Query PYN PYN PYN PYN … Database PYN PFN PFQ PFE … Sum of score: 20 16 10 10
Find the database sequence corresponding to the highest score, and extend the sequence on both sides Query: MRDPYNKLIS Database: MHEPYNDVPW Determine high score segment above the threshold (default=22 for proteins) Query: M R D P Y N K L I S Database: M H E P Y N D V P W Score: 5 0 2 20 -1 1-3-3 |----------------| High-scoring segment pair (HSP) THE HSP is given in the BLAST report Gapped BLAST allows alignment score to temporarily drop below threshold
Statistical significance in BLAST E-value expectancy value, the likelihood of finding a match by chance E = m×n×P M: total number of residues in the database N: length of the query sequence P: probability that an HSP is due to random chance E < 10-50: extremely likely to be homologous sequences 10-50 < E < 10-2: likely to be homologous sequences 10-2 < E < 10: not significant 10 < E: unrelated Note: The larger the database, the larger E The longer the query sequence the larger E (more likely to find local matches in longer sequences)
BLAST Bit score Database grow daily: significant homologue today but not significant tomorrow? Independent of database size or query length S’ = (×S-LnK)/Ln2 : Gumbel distribution constant S: raw alignment score K: constant related to scoring matrix S’ linearly related to raw alignment score
Low Complexity Regions • Repetitive sequences may be over-represented by small number of residues • Such regions are known as low complexity regions (LCRs) • About 15% of database sequences are LCRs • Can cause spurious matches and high alignment scores • LCRs can be masked • Hard masking • Replace LCR sequences with N or X (nucleotides or protein) • Repeat Masker (http://woody.embl-heidelberg.de/repeatmask/) • Alignment scores of true homolog may be lower • Soft masking • Convert LCR to lower case characters • These characters are ignored with word searching, but used for extension • Mask repetitive sequences with SEG (incorporated into BLAST)
FASTA FAST ALL (http://www.ebi.ac.uk/fasta33/) Unlike “words”in BLAST, FASTA uses “hash table” Sequence 1: AMPSDGL Sequence 2: GPSDNAT Make hash table Identify residues with the identical offset (tuple)
FASTA Select the 10 highest scoring groups of tuples using substitution matrix Tuples on the same diagonal are selected, and a gapped alignment performed with Smith-Waterman FASTX: translated nucleotide query searched against protein database TFASTX: protein query searched against translated nucleotide database FASTA also use E-values and bit scores In addition: Z-score number of standard deviations from the mean for the database search The higher the Z-score, the more significant the match
Difference between BLAST and FASTA • FASTA uses smaller tuple • FASTA more sensitive than BLAST • Slower than BLAST • Masking improves sensitivity in BLAST • BLAST given many best-scoring alignments • FASTA gives only one
Database searching with Smith-Waterman method • Dynamic programming not used for database searching because it is too slow • Heuristics like BLAST and FASTA developed for speed • Heuristic methods have limited sensitivity not guaranteed to find optimal alignment • BLAST can miss 30% of truly significant hits • Needleman-Wunsch and Smith-Waterman have been modified to run in parallel processing environment • General availability of GRID computers may make this a feasible option • Example implementations on the web: • ScanPS (http://www.ebi.ac.uk/scanps/) • ParAlign (http://www.paralign.org/)