Pairwise Alignments and Database Searches: Algorithms

Pairwise Alignments andDatabase Searches: Algorithms • Ricardo González Méndez • rgonzalez@rcm.upr.edu • Hugh B. Nicholas Jr. • nicholas@psc.edu • Alexander J. Ropelewski • ropelewski@psc.edu • David W. Deerfield II Additional Information: • http://www.nrbsc.org/ • Workshops • On-line tutorials • National Resource for Biomedical Supercomputing

CURATED DATASET: • All related sequences sharing a common function • All substantial motifs • Evolutionary history • Structural information • Experimental information Goal of Sequence Analyses

Structural Libraries Evolutionary Analysis Homology Modeling CURATED DATASET Classification Libraries Multiple Sequence Alignment Initial Query Profile & PSSM Hidden Markov Model Sequence Libraries Local Patterns The Process

DEFINE THE PROBLEM! Get New Sequence Get Sequences from Database Literature Search Single Sequence Analysis Database Search Multiple Sequence Alignment Database Search for Distantly Related Sequences Examine Conserved Patterns Find Distinctive Subfamily Patterns Integrate Patterns with Structure Sequence Analysis - Overview

What are we doing? • Database search • Separate the library sequences that are related to your query sequences (by some model) from the sequences that are not. • Sequence database searches are carried out by doing PAIRWISE SEQUENCE ALIGNMENTS • A pairwise alignment represents our best guess of the history of substitutions at each position.

Why align sequences? • Lots of sequences with unknown structure and function. A few sequences with known structure and function. • If they align, they are similar, maybe due to common descent. • If they are similar, then they might have the same structure or function. • If one of them has known structure/function, then alignment to the other yields insight about how the structure or function works. From Russ Altman - Stanford University

What is a sequence alignment? • THE PROBLEM: • Given two sequences of letters, and a scoring scheme for evaluating matching letters, find the optimal pairing of letters from one sequence to letters of the other sequence. • Align: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT. THIS IS A SHORT SENTENCE. • Alignments: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT. THIS IS A ######SHORT## SENTENCE##############. OR THIS IS A SHORT#########SENTENCE##############. From Russ Altman - Stanford University

Alignments - Scoring and gap effects • Exact Matches OK, Inexact Costly, Gaps cheap. This is a rather longer sentence than the next. This is a ############# sentence##############. OR This is a *rather longer sentence than the next. This is a s###h####o###rtsentence##############. • Exact Matches OK, Inexact Moderate, Gaps cheap. This is a rather longer sentence than the next. This is a ##short###### sentence##############. • Exact Matches cheap, Inexact cheap, Gaps expensive. This is a rather longer sentence than the next. This is a short sentence.###################### From Russ Altman - Stanford University

Database Searching: An Overview • Database searching is the application of previously determined experimental knowledge to the problem of discovering the biochemistry and physiology of a newly discovered gene or its protein product. • This involves finding a homologous gene or protein - one related to our newly determined sequence through a common evolutionary ancestor.

Types of Database Searches • Sequence against database of abstractions from sequences • Classification Libraries • Sequence against database of sequences • NCBI Databases, UniProt, iProClass • Abstraction from your sequence against a database of sequences • PSI-BLAST • Sometimes against a database of abstraction from sequences

Knowledge in Database Searching • Search programs: Smith-Waterman, BLAST, FASTA • Each is a mathematical method for finding the best fit between two sets of discreet data (e.g., biological sequences) • Any limitations in finding the best fit between sequences can be interpreted as restrictions on the model of evolution relating the sequences and what kinds of changes are allowed. • Similarity scores for alignments: PAM, Blosum, DNA • For a given evolutionary divergence what substitutions are likely to be observed and which are unlikely to be observed

Sequence Databases • GenBank, PIR’s iProClass, Swiss-Prot, UNIPROT, • Databases that contain most of the available sequence information • Many sequences where we already have knowledge of their biochemistry or physiology • Conserved features in the sequence and their functional role • Specialized databases: • Many genome projects have their own database/knowledge base web sites that allow sequence searching using some version of BLAST, and text searching capabilities.

Database Searching: Classification Libraries • Allow you to classify your sequence against other sequences, sequence motifs, Profiles, HMM’s of protein families, etc. • Pfam, PANTHER, SMART, SUPERFAMILY - databases of HMM’s to do protein classification • Blocks, Prints, Prosite - databases of models of motifs that look for small motifs that are conserved and/or associated with particular functions or biochemistry • Other specialized databases exist such as a G-protein coupled receptor database, a calmodulin binding motif prediction database, etc.

InterPro Resource at EBI • InterPro is a database of protein families, domains, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences. • http://www.ebi.ac.uk/interpro/ • InterProScan is the tool at EBI to do the serach on your protein sequence. • http://www.ebi.ac.uk/Tools/InterProScan/

Sequence Evolution • Homology • The sequences being sought with the query sequence have an ancestral sequence in common from which they and the query sequence have descended. • Parsimony • Our best guess of the actual path of evolution is the path that requires the fewest evolutionary events. • All substitutions are not equally likely and should be weighted to account for this. • Insertions and deletions are less likely than substitutions and should be weighted to account for this.

Descendant 1: gctggaaggca-t Descendant 2: gc-ag-a-gcact Database Searching: Reversibility Unobserved Ancestral Sequence Descendant 1 gctggaaggcat Descendant 2 gcagagcact

Database Searching: The Choices • The choice of search algorithm determines which, if any, restrictions you place on the evolutionary model of the database search; this determines the sensitivity, selectivity, and computational requirements • The choice of scoring matrix (similarity matrix) determines the degree of divergence and pattern of substitutions that will be found • The choice of gap penalty influences the shape of the distribution of alignment scores for random sequences and hence your ability to distinguish homologous from non-homologous sequences

Database Searching • Algorithms • Dynamic Programming (Needleman-Wunsch, Smith-Waterman, Sellers) • Heuristics (FASTA, BLAST) • Similarity Scores • Information theory (log-odds) and entropy (information content in bits) • Approaches for creating matrices (PAM vs. BLOSUM, Protein vs. Nucleic acid) • Efficiency – multiple matrices? • Gaps • InDel = Insertion/Deletion • Statistical Significance • Database as a reference distribution • Information content of the alignment

Database Searching: Algorithms • Dynamic Programming Algorithms • required computational resources are proportional to the products of the lengths of the sequences • more sensitive, less selective • rigorous - finds the best possible alignment given a set of scoring matrices • Needleman-Wunsch (global) • Smith-Waterman (local) • Sellers (quasi-global) • Heuristic Algorithms • less sensitive, more selective, not rigorous - local minima problem • local alignments only • FASTA (Word search) • BLAST (Maximal Segment Pairs)

Dynamic Programming Algorithms • Place no restrictions on the model of evolution • Mathematically rigorous -- finds maximum similarity • Computationally demanding • CPU scales 3 * length sequence 1 * length sequence 2 • Memory scales length sequence 1 * length sequence 2 • Algorithmic Variants • Global -- Needleman-Wunsch; aligns entire lengths of sequences • Quasi-global -- Sellers; fits a short sequence into a longer sequence • Local -- Smith-Waterman; finds most similar regions

Dynamic Programming:Needleman-Wunsch • Aligns the entire lengths of a pair of sequences, thus assumes they are related over their entire lengths • Places the fewest restrictions on model of sequence evolution • Mathematically rigorous • Computationally intensive

Dynamic Programming: Needleman-Wunsch NWi,j = max{NWi-1,j-1 + s(ai,bj); NWi-1,j + g; NWi,j-1 + g } • NWi,j is the cumulative similarity of the alignment up to • residue i in sequence A and residue j in sequence B • s(ai,bj) the similarity score for aligning residue ai with • residue bj (e.g., from a PAM of Blosum table) g is the gap penalty, g = open gap + extend gap * gap length

a5 a6 a1 a2 a3 a4 (open gap + 3 * extend gap) b1 b2 (open gap + 2 * extend gap) NW3,3 + s(a4,b4) b3 (open gap + extend gap) b4 NW4-l,4 + g; l = 1, 2, or 3 b5 Computation Grid:Needleman-Wunsch

Needleman-Wunsch Alignments PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7 0 G C T G G A A G G C A T 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70 -77 -84 G -7 5 -2 -9 C -14 -2 10 A -21 -9 G -28 A -35 G -42 C -49 A -56 C -63 T -70

0 G C T G G A A G G C A T 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70 -77 -84 G -7 5 -2 -9 C -14 -2 10 A -21 -9 G -28 A -35 G -42 C -49 A -56 C -63 T -70 [NWT,C] { NWC,G+s(T,C)=-2-4=-6 NWT,C= MAX NWT,G+Gap=-9-7=-16 NWC,C+Gap=10-7=3 NWT,C= 3 Needleman-Wunsch Alignments PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7

0 G C T G G A A G G C A T 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70 -77 -84 G -7 5 -2 -9 C -14 -2 10 3 A -21 -9 G -28 A -35 G -42 C -49 A -56 C -63 T -70 { [NWC,A] NWG,C+s(C,A)=-2-4=-6 NWC,A= MAX NWC,C+Gap=10-7=3 NWG,A+Gap=-9-7=-16 NWC,A = 3 Needleman-Wunsch Alignments PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7

0 G C T G G A A G G C A T 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70 -77 -84 G -7 5 -2 -9 C -14 -2 103 A -21 -9 3 G -28 A -35 G -42 C -49 A -56 C -63 T -70 { [NWT,A] NWC,C+s(T,A)=10-4=6 NWT,A= MAX NWT,C+Gap=3-7=-4 NWC,A+Gap=3-7=-4 NWT,A = 6 Needleman-Wunsch Alignments PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7

Needleman-Wunsch Alignments PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7 0 G C T G G A A G G C A T 0 0 -7 -14 -21 -28 -35 -42 -49 -56 -63 -70 -77 -84 G -7 5 -2 -9 -16 -23 -30 -37 -44 -51 -58 -65 -72 C -14 -2 10 3 -4 -11 -18 -25 -32 -39 -46 -53 -60 A -21 -9 3 6 -1 -8 -6 -13 -20 -27 -34 -41 -48 G -28 -16 -4 -1 114 -3 -10 -8 -15 -22 -29 -36 A -35 -23 -11 -8 4 7 92 -5 -12 -19 -17 -24 G -42 -30 -18 -15 -3 9 3 5 70 -7 -14 -21 C -49 -37 -25 -22 -10 2 5 -1 1 3 5 -2 -9 A -56 -44 -32 -29 -17 -5 7 10 3 -3 -1 10 3 C -63 -51 -39 -36 -24 -12 0 3 6 -1 2 3 6 T -70 -58 -46 -34 -31 -19 -7 -4 -1 2 -5 -2 8 G C T G G A A G G C A - T G C - A G - A - G C A C T

PT Gla Domain FIX Kringle Domain FX Trypsin Domain FVII EGF Domain PC Repeat FXI Fibronectin (Type I) FXII Fibronectin (Type II) Signal peptide plus propeptide Coagulation proenzymes

Global vs. Local Alignment • Global alignment: find alignment in which total score is highest, perhaps at expense of areas of great local similarity. • Local alignment: find alignment in which the highest scoring subsequences are identified, at the expense of the overall score. • Local alignment can be done with minor modifications of the global alignment algorithm! From Russ Altman - Stanford University

Dynamic Programming:Smith-Waterman • Finds “local alignments” - the best matching regions in a pair of sequences rather than aligning their entire lengths. Ignores poorly matching regions • Solves the problem of aligning sequences that share a functional or structural domain in common but are not related over their entire length • Mathematically rigorous • Computationally intensive

Dynamic Programming:Smith-Waterman SWi,j = max{SWi-1,j-1 + s(ai,bj); SWi-1,j + g; SWi,j-1 + g; 0} • SWi,j is the cumulative similarity of the alignment up to • residue i in sequence A and residue j in sequence B • s(ai,bj) the similarity score for aligning residue ai with • residue bj (e.g., a PAM of Blosum table) g is the gap penalty, g = open gap + extend gap * gap length 0, zero, prevents poorly matching regions from hiding good matching regions

a5 a6 a1 a2 a3 a4 (open gap + 3 * extend gap) b1 b2 (open gap + 2 * extend gap) SW3,3 + s(a4,b4) b3 (open gap + extend gap) b4 SW4-l,4 + g; l = 1, 2, or 3 b5 Computation Grid:Smith-Waterman

Smith-Waterman Alignment PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7 0 G C T G G A A G G C A T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 G 0 5 0 0 5 5 0 0 5 5 0 0 0 C 0 0 10 3 0 1 1 0 0 1 10 3 0 A 0 0 3 6 0 0 6 6 0 0 3 15 8 G 0 5 0 0 11 5 0 2 11 5 0 8 11 A 0 0 1 0 4 7 10 5 4 7 1 5 4 G 0 5 0 0 5 9 3 6 10 9 3 0 1 C 0 0 10 3 0 2 5 0 3 6 14 7 0 A 0 0 3 6 0 0 7 10 3 0 7 19 12 C 0 0 5 0 2 0 0 3 6 0 5 12 15 T 0 0 0 10 3 0 0 0 0 2 0 5 17 [ ] G A A G - G C A G C A G A G C A

SW Alignment – Waterman-Eggert Extension (MaxSegs) PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7 0 G C T G G A A G G C A T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 G 0 5 0 0 5 0* 0 0 5 5 0 0 0 C 0 0 10 3 0 1 0* 0 0 1 10 3 0 A 0 0 3 6 0 0 6 0* 0 0 3 15 8 G 0 5 0 0 11 5 0 2 0* 5 0 8 11 A 0 0 1 0 4 7 10 5 0* 0 1 5 4 G 0 5 0 0 5 9 3 6 10 0* 0 0 1 C 0 0 10 3 0 2 5 0 3 6 0* 0 0 A 0 0 3 6 0 0 7 10 3 0 2 0* 0 C 0 0 5 0 2 0 0 3 6 0 5 0 0 T 0 0 0 10 3 0 0 0 0 2 0 0 0 [ ] 1ST (Score = 19): G A A G - G C A G C A G A G C A 3RD (11): G C T G G C A G 2ND (15): G C A G C A

Sellers (Quasi-Global) Alignment PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7 0 G C T G G A A G G C A T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 G 0 5 -2 -4 5 5 -2 -4 5 5 -2 -4 -4 C 0 -2 10 3 -2 1 1 -6 -2 1 10 3 -4 A 0 -4 3 6 -1 -6 6 6 -1 -6 3 15 8 G 0 5 -2 -1 11 4 -1 2 11 4 -3 8 11 A 0 -2 1 -6 4 7 9 4 4 7 0 1 4 G 0 5 -2 -3 1 9 3 5 9 9 3 -4 -3 C 0 -2 10 3 -4 2 5 -1 2 5 14 7 0 A 0 -4 3 6 -1 -5 7 10 3 -2 7 19 12 C 0 -4 1 -1 2 -5 0 3 6 -1 3 12 15 T 0 -4 -6 5 -2 -2 -5 -4 -1 2 -4 5 17 [ ] NWi,j = max{NWi-1,j-1 + s(ai,bj); NWi-1,j + g; NWi,j-1 + g } G A A G - G C A - T G C A G A G C A C T

[ [ [ [ [ ] ] ] ] ] Comparison of Alignments from the Algorithms PAM 47, Match = 5, Mismatch = -4, Open Gap = 0, Extend Gap = -7 0 G C T G G A A G G C A T 0 G C A G A G C A C T Needleman-Wunsch (global) G C T G G A A G G C A - T G C - A G - A - G C A C T Smith-Waterman (local) 1ST (Score = 19): G A A G - G C A G C A G A G C A 2ND (15): G C A G C A 3RD (11): G C T G G C A G Sellers (Quasi-Gloval) G A A G - G C A - T G C A G A G C A C T

Comparison of Algorithms • Needleman-Wunsch • Global alignment • Must start in upper left corner and finish in lower right corner. • Smith-Waterman • Local Alignment • Small regions of high similarity • Sellers • Quasi—global alignment • Must start on left or top edge and finish on right or bottom edge • In well behaved cases, forces an alignment of smaller sequence into larger sequence Sequence 1 Sequence 2 Global Local-1 Quasi- global Local-2

Heuristic Algorithms • Fast initial screen of each database sequence to see if the similarity with the query sequence is high enough to warrant a more complete examination. • The initial screen is based on finding highly similar, small, fixed length segments of sequence called “words” or “tuples”. • This initial screen can be thought of as a restriction on the model of sequence evolution relative to that embodied in dynamic programming algorithms. • Entail some loss of sensitivity and sometimes a gain in selectivity over dynamic programming. • Algorithmic Variants: FASTA, BLAST

Generalized Heuristic Algorithm Structure • Initial Word Search • Exact • Similarity • First evaluation: Threshold length or score • Initial Alignment • Join words or • Extend words toward maximal segment pair • Final, optimized alignment • Join MSPs • Banded Smith-Waterman (Fasta, prior BLASTs) • “Centered” Smith-Waterman (BLAST 2)

FASTA • Restricts the evolutionary model to require a few unmutated dipeptides or hexanucleotides • Carefully considers only alignments including the unmutated sites • The window size limits the maximum cumulative gaps • Much faster, more selective, and less sensitive than Smith-Waterman; searches for sequences with several identical dipeptides or hexanucleotides

A Word List for FASTA, Word Size = 6 G C T G A A G G C A T G C T G A A C T G A A G T G A A G A G A A G G C A A G G C A A G G C A T

FASTA Algorithm • 4 steps: • use lookup table to find all identities at least ktup long, find regions of identities (Fig.A) • rescan 10 regions (diagonals) with highest density of identities using PAM250 (Fig.B) • join regions if possible without decreasing score below threshold (Fig.C) • rescore ala Smith-Waterman 32 residues around initial region (Note: doesn’t save alignment) (Fig.D) • Pearson and Lippman 1988

BLAST • Requires the evolutionary model to include a site with a long stretch of mostly identities and conservative substitutions, with very few unlikely substitutions and no insertions or deletions • Less sensitive and more selective than Smith-Waterman when the evolutionary model is appropriate • Intended for proteins - BLAST is the least effective algorithm for nucleic acid sequences • The restrictions allow rigorous statistical evaluation of the results of the database search • Recent extension allows insertions and deletions through a second pass Smith-Waterman alignment

BLAST Algorithm • 3 steps: • Compile a list of high-scoring words • scan database for “hits” • extend hits

BLAST Algorithm

BLAST 2 - Improved Sensitivity • BLAST 2 requires two different words to match • Both matches must be on the same diagonal of the path graph • The matches must not overlap • The pair of matches must be within a limited number of amino acids of one another, typically 40 amino acids along the diagonal • This allows the threshold score for each match to be lowered from the value used in the original BLAST • These matches are extended as in the original BLAST • If the resulting HSP score is high enough a limited Smith-Waterman alignment is done.

BLAST 2: Smith-Waterman • Finds the highest scoring 11 amino acid segment in the HSPs from the first pass BLAST2 analysis • Takes the middle aligned pair of amino acids from the segment as the starting point for the Smith-Waterman alignment • Extends the alignment in both directions until the alignment score drop to some limit X below the highest Smith-Waterman score calculated up to that point for the current pair of sequences • This allows odd shaped areas rather than the band used in FASTA and the earlier gapped BLAST

Pairwise Alignments and Database Searches: Algorithms