Chapter 2 Pairwise Alignment
Pairwise Alignment • Ask whether two sequences are related • First align the sequences (or parts of them), then decide whether that alignment is more likely to have occurred because the sequences are related or just by chance
Sequence Alignment • Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences • Pair-wise alignment: compare two sequences • Multiple sequence alignment: compare more than two sequences
Example sequence alignment • Task: align “abcdef” with “abdgf” • Write second sequence below the first abcdef abdgf • Move sequences to give maximum match between them • Show characters that match using the identical letter
Example sequence alignment abcdef ab abdgf • Insert gap between b and d on lower sequence to allow d and f to align
Example sequence alignment abcdef ab d f ab-dgf • Note e and g don’t match
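The match line in the example above can be generated mechanically from the two gapped rows. The following is a minimal sketch; the function name `match_line` is a hypothetical helper, not something defined in the slides.

```python
def match_line(top: str, bottom: str) -> str:
    """Show the letter where two aligned rows agree, and a space elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    # A gap character never counts as a match.
    return "".join(a if a == b and a != "-" else " " for a, b in zip(top, bottom))


top, bottom = "abcdef", "ab-dgf"
print(top)
print(match_line(top, bottom))
print(bottom)
```

Printed one above the other, the three rows reproduce the slide: a, b, d and f line up, while c faces a gap and e faces g.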
Matching Similarity vs. Identity • Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters • More on how to define similarity later
Global vs. Local Alignment • We distinguish • Global alignment algorithms which optimize overall alignment between two sequences • Local alignment algorithms which seek only relatively conserved pieces of sequence • Alignment stops at the ends of regions of strong similarity • Favors finding conserved patterns in otherwise different pairs of sequences
Global vs. Local Alignment • Global LGPSSKQTGKGS-SRIWDN L k GKG R D LN-ITKSAGKGAIMRLGDA • Local --------GKG-------- GKG --------GKG--------
Why do sequence alignments? • To find whether two (or more) genes or proteins are evolutionarily related to each other • To find structurally or functionally similar regions within proteins
Key Issues • What sorts of alignment should be considered • The scoring system used to rank alignments • The algorithm used to find optimal (or good) scoring alignments • The statistical methods used to evaluate the significance of an alignment score
Example • The following figure shows an example of three pairwise alignments, all to the same region of the human alpha globin protein sequence (SWISS-PROT database identifier HBA_HUMAN) • Identical positions are marked with letters, and ‘similar’ positions with a plus (+) sign
Example • In the first alignment, there are many identical positions (“matches”); many others are functionally conservative substitutions (such as the D-E pair towards the end) • The second alignment is also biologically meaningful (the proteins are evolutionarily related, with the same 3D structure and the same function in oxygen binding), but it has many fewer identities • The third alignment has a similar number of identities or conservative changes, but it is a spurious alignment to a protein that has a completely different structure and function
Challenges • How to distinguish the second one from the third one? • The determination of the scoring system is crucial • It is difficult to distinguish true alignments from spurious alignments
The Scoring Model • When comparing sequences, we look for evidence that they have diverged from a common ancestor by a process of mutation and selection • Basic mutational processes • Substitutions: change residues in a sequence • Insertions and deletions: add or remove residues • Insertions and deletions are referred to as “gaps” • The total score assigned to an alignment will be a sum of terms for each aligned pair of residues, plus terms for each gap
The Scoring Model • We expect identities and conservative substitutions to be more likely in alignments than we expect by chance, and so to contribute positive score terms • Non-conservative changes are expected to be observed less frequently in real alignments than we expect by chance, and so these contribute negative score terms
Assumption • We assume that mutations at different sites in a sequence occur independently • This appears to be a reasonable approximation for DNA and protein sequences, even though interactions between residues play a critical role in determining structure • For structural RNAs, long-range dependencies should be considered
Substitution Matrices • Consider a pair of sequences, x and y, of lengths n and m • Let xi be the ith symbol in x and yj be the jth symbol in y • These symbols come from some alphabet A; in the case of DNA this will be the four bases {A, G, C, T}, and in the case of proteins the twenty amino acids • For now, we will only consider ungapped global pairwise alignments, i.e., two completely aligned equal-length sequences
Rationale • Given a pair of aligned sequences, we want to assign a score to the alignment that gives a measure of the relative likelihood that the sequences are related as opposed to being unrelated • Assign a probability to the alignment in each of the two cases • We consider the ratio of the two probabilities
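Unrelated or Random Model • Let R be the unrelated model • The letter a occurs independently with some frequency qa, and hence the probability of the two sequences is just the product of the probabilities of each amino acid: $P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$
Alternative Match Model • Let M be the alternative match model • Aligned pairs of residues occur with a joint probability pab • The probability of the whole alignment is $P(x, y \mid M) = \prod_i p_{x_i y_i}$
The Odds Ratio • The ratio of these two likelihoods is known as the odds ratio: $\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
The Log Odds Ratio • To obtain an additive score, we take the logarithm of the odds ratio: $S = \sum_i s(x_i, y_i)$, where $s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$ is the log likelihood ratio of the residue pair (a, b) occurring as an aligned pair, as opposed to an unaligned pair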
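As a rough illustration of the log-odds score, the sketch below uses a hypothetical two-letter alphabet with made-up probabilities (the values of `q` and `p` are assumptions chosen only so that the joint probabilities sum to 1 and agree with the background frequencies). It checks that summing per-pair scores s(a, b) gives the same number as taking the log of the ratio of the two model probabilities directly.

```python
import math

# Hypothetical background frequencies q_a (illustration only).
q = {"A": 0.6, "B": 0.4}
# Hypothetical joint probabilities p_ab for aligned pairs (illustration only).
p = {("A", "A"): 0.5, ("B", "B"): 0.3, ("A", "B"): 0.1, ("B", "A"): 0.1}


def s(a: str, b: str) -> float:
    """Log-odds score of one aligned pair: log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def alignment_score(x: str, y: str) -> float:
    """Score of an ungapped alignment as a sum of per-pair terms."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(s(a, b) for a, b in zip(x, y))


def log_odds_direct(x: str, y: str) -> float:
    """Log of P(x, y | M) / P(x, y | R), computed from the two products."""
    p_match = math.prod(p[(a, b)] for a, b in zip(x, y))
    p_random = math.prod(q[a] for a in x) * math.prod(q[b] for b in y)
    return math.log(p_match / p_random)


print(alignment_score("AAB", "ABB"), log_odds_direct("AAB", "ABB"))
```

With these toy numbers, identical pairs score positive and the mismatched pair scores negative, which is the behavior the scoring model expects of identities and non-conservative changes.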
Substitution Matrices • The s(a, b) scores can be arranged in a matrix • For proteins, they form a 20×20 matrix (score matrix or substitution matrix) • Using the BLOSUM50 matrix, the first alignment gets a score of 130 • PAM matrices are another commonly used family • Any substitution matrix is making a statement about the probability of observing (a, b) pairs in real alignments
Gap Penalties • The standard cost associated with a gap of length g is given either by a linear score, $\gamma(g) = -gd$, or by an affine score, $\gamma(g) = -d - (g - 1)e$, where d is called the gap-open penalty and e is called the gap-extension penalty
Gap Penalties • The gap-extension penalty e is usually set to something less than the gap-open penalty d, allowing long insertions and deletions to be penalized less than they would be by the linear gap cost • This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue
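The difference between the two gap costs can be seen numerically. This sketch assumes the usual linear form γ(g) = -gd and affine form γ(g) = -d - (g - 1)e; the function names and the example values d = 12, e = 2 are hypothetical, chosen only for illustration.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear gap score: every gap position costs d."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine gap score: d to open the gap, e for each further position."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e


# Compare the two costs for growing gap lengths (d and e are assumed values).
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d=12), affine_gap(g, d=12, e=2))
```

For g = 1 the two scores agree; as g grows, the affine score falls much more slowly when e is smaller than d, which is why long insertions and deletions are penalized less.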
Gap Probability • The probability of a gap occurring at a particular site in a given sequence is the product of a function f(g) of the gap length and the combined probability of the inserted residues: $P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$ • The qa probabilities are the same as those used in the random model • When we divide by the probability of this region according to the random model to form the odds ratio, the $q_{x_i}$ terms cancel out, so we are left only with a term dependent on length, $\gamma(g) = \log(f(g))$; i.e., gap penalties correspond to the log probability of a gap of that length
Alignment Algorithms • Given a scoring system, we need an algorithm for finding an optimal alignment for a pair of sequences • When no gaps are allowed and both sequences have the same length n, there is only one possible global alignment of the complete sequences • When gaps are allowed, there are $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \simeq \frac{2^{2n}}{\sqrt{\pi n}}$ possible global alignments between two sequences of length n
Example • Gapped global alignments of ab with cd, written as top row / bottom row: (1) ab / cd • (2) ab- / -cd • (3) ab- / c-d • (4) -ab / cd- • (5) ab-- / --cd • (6) -ab- / c--d
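To get a feel for how quickly the number of gapped global alignments grows, the sketch below evaluates the binomial coefficient C(2n, n), on the assumption that this is the count the slides intend (it gives 6 for n = 2, matching the six alignments listed), together with the approximation 2^(2n) / sqrt(πn). The function names are hypothetical.

```python
import math


def count_global_alignments(n: int) -> int:
    """Binomial coefficient C(2n, n), assumed here to be the intended count."""
    return math.comb(2 * n, n)


def approx_count(n: int) -> float:
    """Large-n approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (2, 10, 50, 100):
    print(n, count_global_alignments(n), f"{approx_count(n):.3e}")
```

Even for sequences of length 100 the count is astronomically large, so enumerating alignments is not feasible; this is the motivation for dynamic programming.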
Dynamic Programming • Guaranteed to find the optimal scoring alignment or set of alignments • Central to computational sequence analysis • Maximize the score to find the optimal alignment
Example • We wish to align two short amino acid sequences: HEAGAWGHEE and PAWHEAE • We use the BLOSUM50 score matrix and a linear gap penalty of d = 8, so each unaligned residue contributes -8 to the score
Global Alignment: Needleman-Wunsch Algorithm • Construct a matrix F indexed by i and j, one index for each sequence • F(i, j) is the score of the best alignment between the initial segment x1…i of x up to xi and the initial segment y1…j of y up to yj • Begin with F(0, 0) = 0, then fill the matrix from top left to bottom right • If F(i-1, j-1), F(i-1, j), and F(i, j-1) are known, it is possible to calculate F(i, j)
Three Ways of Alignments (top row / bottom row) • xi is aligned to yj: IGA xi / LGV yj • xi is aligned to a gap: AIGA xi / GV yj - - • yj is aligned to a gap: GA xi - - / SLGV yj
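Three Ways of Alignments • xi is aligned to yj: F(i, j) = F(i-1, j-1) + s(xi, yj) • xi is aligned to a gap: F(i, j) = F(i-1, j) - d • yj is aligned to a gap: F(i, j) = F(i, j-1) - d • F(i, j) takes the largest of these three values: $F(i, j) = \max\{F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d\}$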
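The Diagram • [Figure: a square of four neighboring cells; F(i, j) is reached from F(i-1, j-1) by a diagonal step adding s(xi, yj), and from F(i-1, j) or F(i, j-1) by a step adding -d]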
The F Matrix • As we fill in the F(i, j) values, we also keep a pointer in each cell back to the cell from which its F(i, j) was derived • Along the top row, where j = 0, the values F(i, j-1) and F(i-1, j-1) are not defined, so the values F(i, 0) must be handled specially • The values F(i, 0) represent alignments of a prefix of x to all gaps in y, so we can define F(i, 0) = -id. Likewise, F(0, j) = -jd • F(n, m) is the best score for an alignment of x1…n to y1…m
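The fill and traceback steps described above can be sketched as follows. This is a minimal illustration with a linear gap penalty d, not a reference implementation: the function names are hypothetical, and the example uses an assumed +1/-1 match/mismatch score in place of BLOSUM50 so that it stays self-contained.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Boundary cells: a prefix aligned entirely to gaps.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    # Fill each cell from its three already-computed neighbors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),  # xi with yj
                (F[i - 1][j] - d, "up"),                                # xi with a gap
                (F[i][j - 1] - d, "left"),                              # yj with a gap
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])
    # Traceback: follow the stored pointers from F(n, m) back to F(0, 0).
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


def simple_score(a, b):
    """Assumed toy scoring: +1 for identical characters, -1 otherwise."""
    return 1 if a == b else -1


print(needleman_wunsch("abcdef", "abdgf", simple_score, d=1))
```

When several options tie, this sketch keeps the first one (diagonal before gaps), so it reports a single optimal alignment rather than the full set.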
Local Alignment: Smith-Waterman Algorithm • In the previous section, we assumed that we know which sequences we want to align, and that we are looking for the best match between them from one end to the other • Much more often, we are looking for the best alignment between subsequences of x and y • This arises, for example, when it is suspected that two protein sequences may share a common domain, or when comparing extended sections of genomic DNA sequence
Local Alignment • It is usually the most sensitive way to detect similarity when comparing two very highly diverged sequences, even when they may share an evolutionary origin • In this case, only part of the sequence has been under strong enough selection to preserve detectable similarity; the rest will have accumulated so much noise through mutation that it is no longer alignable • The highest scoring alignment of subsequences of x and y is called the best local alignment
The Algorithm • The algorithm is closely related to that for global alignments; however, there are two differences • In each cell in the table, F(i, j) is allowed to take the value 0 if all other options have values less than 0 • An alignment can end anywhere in the matrix
The First Difference • Taking the option 0 corresponds to starting a new alignment • If the best alignment up to some point has a negative score, it is better to start a new one rather than extend the old one • The top row and left column will be filled with 0s, not -id and -jd as for global alignment
The Second Difference • We look for the highest value of F(i, j) over the whole matrix, and start the traceback from there • The traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment
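The two differences from global alignment can be sketched in code as follows. This is a minimal illustration with a linear gap penalty d; the function names, the toy +2/-1 scoring, and the example sequences are assumptions made for the sketch, not taken from the slides.

```python
def smith_waterman(x, y, score, d):
    """Local alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    # Top row and left column stay 0, with no pointer.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (0, None),                                              # start a new alignment
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),  # xi with yj
                (F[i - 1][j] - d, "up"),                                # xi with a gap
                (F[i][j - 1] - d, "left"),                              # yj with a gap
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])
            # The alignment may end anywhere, so track the highest cell.
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the highest cell until a cell with no pointer (value 0).
    row_x, row_y = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


def toy_score(a, b):
    """Assumed toy scoring: +2 for identical characters, -1 otherwise."""
    return 2 if a == b else -1


print(smith_waterman("xxxGKGyyy", "zzGKGww", toy_score, d=2))
```

Because the 0 option is listed first, a cell whose best extension scores exactly 0 is treated as the start of a new alignment, so the traceback stops there.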