370 likes | 387 Views
This lecture covers the purpose of sequence alignment, including identification of homology and inference of evolutionary history. Different types of comparisons and alignment methods are discussed, along with scoring schemes and substitution matrices. Examples of dotplots and alignments are provided.
E N D
Lecture 4 Sequence alignment and searching
What is the purpose of sequence alignment? • Identification of homology and homologous sites in related sequences • Inference of evolutionary history that lead to the differences in observed sequences
Types of comparisons and alignment methods According to sequence Coverage: According to number od sequences:
Introduction to sequence alignment For the sequences gctgaacg and ctataatc: An uninformative alignment: -------gctgaacg ctataatc------- An alignment without gaps gctgaacg ctataatc An alignment with gaps gctga-a--cg --ct-ataatc And another gctg-aa-cg -ctataatc- Given two text strings: First string = a b c d e Second string = a c d e f a reasonable alignment would be a b c d e - a - c d e f We must choose criteria so that algorithm can choose the best alignment. Source: Lesk,Introduction to Bioinformatics
The dotplot (1) • A simple picture that gives an overview of the similarities between two sequences • Dotplot showing identities between short name (DOROTHYHODGKIN) and full name (DOROTHYCROWFOOTHODGKIN)of a famous protein crystallographer: Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown inboldface, are the first and last names DOROTHY and HODGKIN. Shorter matching regions, such as the OTH ofdorOTHy and crowfoOTHodgkin, or the RO of doROthy and cROwfoot, are noise. Source: Lesk,Introduction to Bioinformatics
The dotplot (2) Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal. Source: Lesk,Introduction to Bioinformatics
The dotplot (3) Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. Thepalindrome reveals itself as a stretch of matches perpendicular to the main diagonal. Source: Lesk,Introduction to Bioinformatics
The dotplot (4) Source: Lesk,Introduction to Bioinformatics
Dotplots and sequence alignments Any path through the dotplot from upper left to lower right passes through a succession of cells, eachof which picks out a pair of positions, one from the row and one from the column, that correspond in the alignment; or that indicates a gap in one of the sequences. The path need not pass through filled-in points only. However, themore filled-in points on the diagonal segments of the path, the more matching residues in the alignment. Corrseponding alignment: DOROTHY--------HODGKIN DOROTHYCROWFOOTHODGKIN Source: Lesk,Introduction to Bioinformatics
Measures of sequence similarity • Two measures of distance between two character strings: • The Hamming distance, defined between two strings of equal length, is the number of positions withmismatching characters. • The Levenshtein, or edit distance, between two strings of not necessarily equal length, is the minimalnumber of 'edit operations' required to change one string into the other, where an edit operation is a deletion,insertion or alteration of a single character in either sequence. A given sequence of edit operations induces aunique alignment, but not vice versa. agtc cgta Hamming distance = 2 ag-tcc cgctca Levenshtein distance =3 Source: Lesk,Introduction to Bioinformatics
Scoring schemes • In molecular biology, certain changes are more likely to occur than others • Amino acid substitutions tend to be conservative • In nucleotide sequences, transitions are more frequent than transversions • -> We want to give different weights to different edit operations • Hamming and Lenenshtein distance are measues of dissimilarity. In MB it is more common to measure similarity (higher similarity = higher score) • Example: a DNA substitution matrix: a g c t a 20 10 5 5 g 10 20 5 5 c 5 5 20 10 t 5 5 10 20 Source: Lesk,Introduction to Bioinformatics
Derivation of substitution matrices Start frequencies for a genetic matrix (upper triangular matrix) Expected number of random mutation events Start frequencies for PAM matrices (lower triangular matrix) Observed number of substitutions
PAM Matrices • First substitution matrices widely used • Based on the point-accepted-mutation (PAM) model of evolution (Dayhoff, 1978) • PAMs are relative measures of evolutionary distance • 1 PAM = 1 accepted mutation per 100 AAs • Does not mean that after 100 PAMs every AA will be different? Why or why not? To view and compare different substitution matrices, visit http://eta.embl-heidelberg.de:8000/misc/mat/
PAM Matrices • Derived from global alignments of closely related sequences. • Matrices for greater evolutionary distances are extrapolated from those for lesser ones. • The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances. • Does not take into account different evolutionary rates between conserved and non-conserved regions.
BLOSUM Matrices • Henikoff, S. & Henikoff J.G. (1992) • Use blocks of protein sequence fragments from different families (the BLOCKS database) • Amino acid pair frequencies calculated by summing over all possible pairs in block • Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over particular threshold = same cluster) • Similar idea to PAM matrices • Probabilities estimated from blocks of sequence fragments • Blocks represent structurally conserved regions
BLOSUM Matrices • Target frequencies are identified directly instead of extrapolation. • Sequences more than x% identitical within the block where substitutions are being counted, are grouped together and treated as a single sequence • BLOSUM 50 : >= 50% identity • BLOSUM 62 : >= 62 % identity To view and compare different substitution matrices, visit http://eta.embl-heidelberg.de:8000/misc/mat/
BLOSUM Matrices • Derived from local, ungapped alignments of distantly related sequences • All matrices are directly calculated; no extrapolations are used – no explicit model • The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances. • The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches (Proteins 17:49).
How do you prefer it: done right or done fast? • Exact methods - the result is guaranteed to be (mathematically) optimal • Pairwise alignment • Needleman-Wunsch (global) • Smith-Waterman (local) • Multiple alignment • Rarely used; quickly becomes computationally intractable with increasing number of sequences and/or their length • Heuristic methods: make some assumptions that hold most, but not all of the time • Pairwise alignment – BLAST • Multiple alignment – all widely used multiple alignment programs (ClustalW, TCofee…)
Exact pairwise methods: Dynamic programming Equivalence matrix Cumulative matrix (can be computed from C-or N-terminus) An example of a cumulative matrix (computed from C-terminus) with traces of possible alignments)
What does it mean? Principle of reduction of number of paths that need to be examined: If a path from X->Z passes through Y, the best path from X->Y is independent of the best path from Y->Z Source: Lesk,Introduction to Bioinformatics
BLAST = Basic local alignment search tool When you have a nucleotide or protein sequence that you want to search against sequence databases to determine what the sequence is to find related sequences (homologs) Bioinformatics researchers use standalone version of BLAST, run from the command line or other programs BLAST – the workhorse of bioinformatcshttp://www.ncbi.nlm.nih.gov/BLAST
BLAST algorithm principles (Basic Local Alignment Search Tool) Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman query DB
BLAST Original Version …… query Dictionary: All words of length k (~11) Alignment initiated between words of alignment score T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold …… scan DB query
BLAST Original Version A C G A A G T A A G G T C C A G T Example: k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A
Gapped BLAST A C G A A G T A A G G T C C A G T Added features: • Pairs of words can initiate alignment • Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A
Gapped BLAST A C G A A G T A A G G T C C A G T Added features: • Pairs of words can initiate alignment • Nearby alignments are merged • Extensions with gaps until score < T below best score so far Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A
BLAST: Example of sequence search Query sequence or accession number Sequence database to search
BLAST: Example of sequence search Please come back later
BLAST results: List of hits E–value - number of unrelated databank sequences expected to yield same or higher score by pure chance
BLAST results: High scoring pairs (HSPs) Fundamental unit of the BLAST algorithm output HSP (high scoring pair) Aligned fragments of query and detected sequence with similarity score exceeding a set cutoff value E-value (Expectation) Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65 Identities = 10/17 (58%), Positives = 16/17 (94%) Query: 81 SGDLSMLVLLPDEVSDL 97 +GD+SM +LLPDE++D+ Sbjct:259 AGDVSMFLLLPDEIADV 275 HSP
BLAST Confidence measures Score and bit-score :depend on scoring method E-value (Expect value) : number of unrelated database sequences expected to yield same or higher score by pure chance P-value (Probability): probability that database yields by pure chanceat least one alignment with same or higer score The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. (E-value approching zero => significant alignment. Less than 0,01 = almost always found to be homologous; 1e-10 for nucleotide searches of 1e-4 for protein searches= frequently related) In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
Be careful when comparing E- or P-values from different searches Comparison is only meaningful for different query sequences searched agains the same database with the same BLAST parameters
Variants of BLAST • MEGABLAST: • Optimized to align very similar sequences • Works best when k = 4i 16 • Linear gap penalty • PSI-BLAST: • BLAST produces many hits • Those are aligned, and a pattern is extracted • Pattern is used for next search; above steps iterated • WU-BLAST: (Washington University BLAST) • Optimized, added features • BlastZ • Combines BLAST/PatternHunter methodology • Good for local alignments of non-coding (regulatory) regions
IMPORTANT: When NOT to use BLAST • A typical situation: you have a piece of human DNA sequence and want to extend it or find what gene it belongs to. • In other words, you want an exact or near-exact match to a sequence that is part of an assembled genome. • Genome browsers are equipped with very fast algorithms for finding near-exact matches in genomic sequences: • BLAT (with UCSC Genome Browser) • Highly recommended: the BLAT paper (Kent WJ (2003) Genome Res 12:656-64) - it is famous for its unorthodox writing style • SSAHA (with Ensembl)
BLAT: BLAST-like alignment tool • Differences with respect to BLAST: • Dictionary of the DB (BLAST: query) • Initiate alignment on any # of matching hits (BLAST: 1 or 2 hits) • More aggressive stitching-together of alignments • Special code to align mature mRNA to DNA • Result: very efficient for: • Aligning many sequences to a DB • Aligning two genomes