370 likes | 396 Views
Sequence Alignment II Gaps, Heuristic Search. Alignment with Gaps – Example. 1. 2. Gaps. Both alignments have the same number of matches and spaces but alignment 2 seems better Definition : A gap is any maximal, consecutive run of spaces in a single string.
E N D
Gaps • Both alignments have the same number of matches and spaces but alignment 2 seems better • Definition: A gap is any maximal, consecutive run of spaces in a single string. • The length of a gap = the number of spaces in it • Example 1 has 11 gaps, example 2 only 2 gaps • Idea: develop alignment scores that take gaps (not spaces) into account
Biological Motivation • Number of mutational events: • A single gap – due to a single event that removed a number of residues • Each gap – due to distinct, independent event • Protein structure: • Protein secondary structure consists of alpha helices, beta sheets and loops • Loops of varying size can lead to very similar structure
cDNA Matching • cDNA is the sequence after splicing (introns have been removed) and editing • We expect regions of high similarity, separated by long gaps
Gap Penalty Models (I) Constant model • Gives each gap a constant score, spaces are free • Maximize: • Time: O(mn) • Works well with cDNA matching Affine model • Penalty for starting a gap + penalty for each additional space • Each gap costs: Wg + qWs • Maximize: • Time: O(mn) • Widely used
Gap Penalty Models (II) Convex model • Each extra space contributes less penalty • Gap function is convex in its length • Example: Ws + log(q) • Time O(mnlogm) • A better model of biology General model • The weight of a gap is some arbitrary w(q) • Time O(mn2 + nm2)
Example Revised 1 2
Indel Model 1 Score: -6 2 Score: -6 Scoring Parameters Match: +1 Indel: -2
Constant Model 1 Score: -6 2 Score: 12 Scoring Parameters Match: +1 Open gap: -2
Affine Model 1 Score: -17 2 Score: 1 Scoring Parameters Match: +1 Open gap: -2, each space: -1
Convex Model 1 Score: -6 2 Score: ~7 Scoring Parameters Match: +1 Open gap: -2, gap length: -logn
Affine Weight Model We divide all possible alignments of the prefixes s[1..i] and t[1..j] into 3 types s: i t: j s: i----- t: j s: i t: j-----
Affine Weight Model Recurrence relations:
Affine Weight Model Initial condition: Optimal alignment: Complexity: Time: O(mn) Space: O(mn)
B Wg+Ws Ws A S(i,j) S(i,j) Wg+Ws S(i,j) C Ws Affine Weight Model This model has a natural explanation as a finite state automata
Alignment in Real Life • One of the major uses of alignments is to find sequences in a “database” • Such collections contain massive number of sequences (order of 106) • Finding homologies in these databases with the standard dynamic programming can take too long • Example: • query protein : 232 AAs • NR protein DB: 2.7 million sequences; 748 million AAs • m*n = ~ 1.7 *1011cells !
Heuristic Search • Instead, most searches rely on heuristic procedures • These are not guaranteed to find the best match • Sometimes, they will completely miss a high-scoring match • We now describe the main ideas used by some of these procedures • Actual implementations often contain additional tricks and hacks
Basic Intuition • The main resource consuming factor in the standard DP is decision of where the gaps are. If there were no gaps, life was easy! • Almost all heuristic search procedures are based on the observation that real-life well-matching pairs of sequences often do contain long strings with gap-less matches. • These heuristics try to find significant local gap-less matches and then extend them.
Banded DP • Suppose that we have two strings s[1..n] and t[1..m] such that nm • If the optimal global alignment of s and t has few gaps, then path of the alignment will be close to the diagonal s t
Banded DP • To find such a path, it suffices to search in a diagonal region of the matrix • If the diagonal band has presumed width a, then the dynamic programming step takes O(an) • Much faster than O(n2) of standard DP in this case s a t
Banded DP Problem (for local alignment): • If we know that t[i..j] matches the query s[p..q], then we can use banded DP to evaluate quality of the match • However, we do not know i,j,p,q ! • How do we select which sub-sequences to align using banded DP?
FASTA Overview • Main idea: Find (fast!) “good” diagonals and extend them to complete matches • Suppose that we have a relatively long gap-less local match (diagonal): …AGCGCCATGGATTGAGCGA… …TGCGACATTGATCGACCTA… • Can we find “clues” that will let us find it quickly?
s t Signature of a Match Assumption: good matches contain several “patches” of perfect matches AGCGCCATGGATTGAGCGA TGCGACATTGATCGACCTA
FASTA • Given s and t, and a parameter k • Find all pairs (i,j) such that s[i..i+k] and t[j..j+k] match perfectly • Locate sets of pairs that are on the same diagonal by sorting according to i-j thus… • Locating diagonals that contain many close pairs. • This is faster than O(nm) ! s i i+k j j+k t
FASTA • Extend the “best” diagonal matches to imperfect (yet ungapped) matches, compute alignment scores per diagonal. Pick the best-scoring matches. • Try to combine close diagonals to potential gapped matches, picking the best-scoring matches. • Finally, run banded DP on the regions containing these matches, resulting in several good candidate alignments. • Most applications of FASTA use very small k(2 for proteins, and 4-6 for DNA)
BLAST Overview • FASTA drawback is its reliance on perfect matches • BLAST (Basic Local Alignment Search Tool) uses similar intuition, but relies on high scoringmatches rather than exact matches • Given parameters: length k, and threshold T • Two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T
High-Scoring Pair • Given a query string s, BLAST construct all words w (“neighborhood words”), such that w is an HSP with a k-substring of s. • Note: not all k-mers have an HSP in s
BLAST: phase 1 • Phase 1: compile a list of word pairs (k=3) above threshold T • Example: for the following query: …FSGTWYA… (query word is in green) • A list of words (k=3) is: • FSG SGT GTW TWY WYA • YSG TGT ATW SWY WFA • FTG SVT GSW TWF WYS
BLAST: phase 1 scores GTW 6,5,11 22 neighborhood ASW 6,1,11 18 word hits ATW 0,5,11 16 > threshold NTW 0,5,11 16 GTY 6,5,2 13 GNW 10 neighborhood GAW 9 word hits below threshold (T=11)
BLAST: phase 2 • Search the database for perfect matches with neighborhoodwords. Those are “hits” for further alignment. • We can locate seed words in a large database in a single pass, given the database is properly preprocessed (using hashing techniques).
s t Extending Potential Matches • Once a hit is found, BLAST attempts to find a local alignment that extends it. • Seeds on the same diagonal tend to be combined (as in FASTA)
Two HSP diagonal • An improvement: look for 2 HSPs on close diagonals • Extend the alignment between them • Fewer extensions considered • There is a version of BLAST, involving gapped extensions. • Generally faster then FASTA, arguably better. s t
Blast Variants • blastn (nucleotide BLAST) • blastp (protein BLAST) • tblastn (protein query, translated DB BLAST) • blastx (translated query, protein DB BLAST) • tblastx (translated query, translated DB BLAST) • bl2seq (pairwise alignment)
Biological Databases • Today, most of the biological information can be freely accessed on the web. • One can: • Search for information on a known gene • Check if a sequence exists in a database • Find a homologous protein, helping us guess: • Structure • Function
Databases and Tool • Important gateways: • National Center for Biotechnology (GenBank) • http://www.ncbi.nlm.nih.gov/ • European Bioinformatics Institue (EMBL-Bank) • http://www.ebi.ac.uk/ • Expert Protein Analysis System (SwissProt) • http://www.expasy.org/ → Different tools and DBs to allow biologists a rich suite of queries
Database Types • Nucleotide DBs (GenBank, EMBL-Bank): • Contain any and every type of DNA fragment: • Full cDNA, ESTs, repeats, fragments • “Dirty” and redundant • Protein DBs (SwissProt): • Contain amino-acid sequences for full proteins • High quality, strict screening process • Lots of annotated information on each protein