570 likes | 771 Views
CS 5263 Bioinformatics. Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment. Roadmap. Last lecture review Sequence alignment statistics Today Heuristic alignment algorithms B asic L ocal A lignment S earch T ools Multiple sequence alignment algorithms.
E N D
CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment
Roadmap • Last lecture review • Sequence alignment statistics • Today • Heuristic alignment algorithms • Basic Local Alignment Search Tools • Multiple sequence alignment algorithms
Sequence Alignment Statistics • Substitution matrices • How is the BLOSUM-N matrix made • How to make your own substitution matrix • What’s the meaning of an arbitrary substitution matrix • Significance of sequence alignments • P-value estimation for • Global alignment scores • Local alignment scores • Critical score for local alignment to guarantee significance
State of biological databases Sequenced Genomes: Human 3109 Yeast 1.2107 Mouse 2.7109 Rat 2.6109 Neurospora 4107 Fugu fish 3.3108 Tetraodon 3108 Mosquito 2.8108 Drosophila 1.2108 Worm 1.0108 Rice 1.0109 Arabidopsis 1.2108 sea squirts 1.6108 Current rate of sequencing: 4 big labs 3 109 bp /year/lab 10s small labs Private sectors
State of biological databases Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences
Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? Assume we try Smith-Waterman: Our new gene 104 The entire genomic database 1010 - 1011 May take several weeks!
Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? - Potential genes - Other functional units Assume we try Smith-Waterman: Our newly sequenced mammal 3109 The entire genomic database 1010 - 1011 > 1000 years ???
BLAST • Basic Local Alignment Search Tool • Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 • The most widely used bioinformatics tool • One of the most cited papers in the history of science • Which is better: long mediocre match or a few nearby, short, strong matches with the same total score? • Score-wise, exactly equivalent • Biologically, later may be more interesting, & is common • At least, if must miss some, rather miss the former • BLAST is a heuristic algorithm emphasizing the later • speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed
BLAST • Available at NCBI (National Center for Biotechnology Information) for download and online use. http://blast.ncbi.nlm.nih.gov/ • Along with many sequence databases Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman query DB
BLAST Original Version …… Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold query …… scan DB query
BLAST Original Version A C G A A G T A A G G T C C A G T Example: k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A
Gapped BLAST A C G A A G T A A G G T C C A G T Added features: • Pairs of words can initiate alignment • Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A
Example Query:gattacaccccgattacaccccgattaca (29 letters) [2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi|28570323|gb|AC108906.9|Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 125138 tacacccagattacaccccga 125158 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 125104 tacacccagattacaccccga 125124 >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits(17),Expect = 4.5Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 3891 tacacccagattacaccccga 3911
Example Query: Human atoh enhancer, 179 letters [1.5 min] Result: 57 blast hits • gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... 355 1e-95 • gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... 264 4e-68 • gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... 256 9e-66 • gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... 78 5e-12 • gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... 54 7e-05 • gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... 44 0.068 • gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... 42 0.27 • gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 42 0.27 gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129),Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318
BLAST Score: bit score vs raw score • Bit score is converted from raw score by taking into account K and : • S’ = ( S – log K) / log 2 • To compute E-value from bit score: • E = KM’N’ e-S = M’N’ 2-S’ • Critical score is now: • S* = log2(M’N’) • If S’ >> S*: significant • If S’ << S*: not significant (M’ ~ M, N’ ~ N)
Different types of BLAST • blastn: search nucleic acid databases • blastp: search protein databases • blastx: you give a nucleic acid sequence, search protein databases • tblastn: you give a protein sequence, search nucleic acid databases • tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences
BLAST cons and pros • Advantages • Fast!!!! • A few minutes to search a database of 1011 bases • Disadvantages • Sensitivity may be low • Often misses weak homologies • New improvement • Make it even faster • Mainly for aligning very similar sequences or really long sequences • E.g. whole genome vs whole genome • Make it more sensitive • PSI-BLAST: iteratively add more homologous sequences • PatternHunter: discontinuous seeds
Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower
Pattern hunter • Instead of exact matches of consecutive matches of k-mer, we can • look for discontinuous matches • My query sequence looks like: • ACGTAGACTAGCAGTTAAG • Search for sequences in database that match • AXGXAGXCTAXC • X stands for don’t care • Seed: 101011011101
Pattern hunter • A good seed may give you both a higher sensitivity and higher specificity • You may think 110110110110 is the best seed • Because mutation in the third position of a codon often doesn’t change the amino acid • Best seed is actually 110100110010101111 • Empirically determined • How to design such seed is an open problem • May combine multiple random seeds
Things we’ve covered so far • Global alignment • Needleman-Wunsch and variants • Local Alignment • Smith-Waterman • Improvement on space and time • More accurate gap penalty models • Heuristic algorithms • BLAST families • Statistics for sequence alignment
Commonality: They all deal with aligning two sequences • Pair-wise sequence alignment • Next: Aligning multiple sequences all together • Multiple sequence alignment • Motivation: • A faint similarity between two sequences becomes very significant if present in many sequences • Protein domains • Motifs responsible for gene regulation
Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global mapping is maximum • Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences • Same for a multiple alignment!
Scoring Function • Ideally: • Find alignment that maximizes probability that sequences evolved from common ancestor x y z ? Phylogenetic tree or evolution tree w v
Scoring Function (cont’d) • Unfortunately: too many parameters • Compromises: • Ignore phylogenetic tree • Compute from pair-wise scores • Based on sum of all pair-wise scores • Based on scores with a consensus sequence
First assumption • Columns are independent • Similar in pair-wise alignment • Therefore, the score of an alignment is the sum of all columns • Need to decide how to score a single column
Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG - - - -
Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments S(m) = k<l s(mk, ml) s(mk, ml): score of induced alignment (k,l)
Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG (A,A) + (A,G) x 2 = -1 (C,C) x 3 = 3 (-,A) x 2 + (A,A) = -1 Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
Sum Of Pairs (cont’d) • Drawback: no evolutionary characterization • Every sequence derived from all others • Heuristic way to incorporate evolution tree • Weighted Sum of Pairs: • S(m) = k<l wkl s(mk, ml) • wkl: weight decreasing with distance Human Mouse Duck Chicken
Consensus score -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Consensus sequence: • Find optimal consensus string m* to maximize S(m) = i s(m*, mi) s(mk, ml): score of pairwise alignment (k,l)
Multiple Sequence Alignments Algorithms
Multidimensional Dynamic Programming (MDP) Generalization of Needleman-Wunsh: • Find the longest path in a high-dimensional cube • As opposed to a two-dimensional grid • Uses a N-dimensional matrix • As apposed to a two-dimensional array • Entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] F(i1,i2,…,iN) = max(all neighbors of a cell) (F(nbr)+S(current))
Multidimensional Dynamic Programming (MDP) • Example: in 3D (three sequences): • 23 – 1 = 7 neighbors/cell (i-1,j-1,k-1) (i-1,j,k-1) (i-1,j-1,k) (i-1,j,k) F(i-1,j-1,k-1) + S(xi, xj, xk), F(i-1,j-1,k ) + S(xi, xj, -), F(i-1,j ,k-1) + S(xi, -, xk), F(i,j,k) = max F(i ,j-1,k-1) + S(-, xj, xk), F(i-1,j ,k ) + S(xi, -, -), F(i ,j-1,k ) + S(-, xj, -), F(i ,j ,k-1) + S(-, -, xk) (i,j,k-1) (i,j-1,k-1) (i,j-1,k) (i,j,k)
Multidimensional Dynamic Programming (MDP) Running Time: • Size of matrix: LN; Where L = length of each sequence N = number of sequences • Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN)
Faster MDP • Carrillo & Lipman, 1988 • Branch and bound • Other heuristics • Practical for about 6 sequences of length about 200-300.
Progressive Alignment • Multiple Alignment is NP-hard • Most used heuristic: Progressive Alignment Algorithm: • Align two of the sequences xi, xj • Fix that alignment • Align a third sequence xk to the alignment xi,xj • Repeat until all sequences are aligned Running Time: O(NL2) Each alignment takes O(L2) Repeat N times
Progressive Alignment x • When evolutionary tree is known: • Align closest first, in the order of the tree Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) y z w
Progressive Alignment: CLUSTALW CLUSTALW: most popular multiple protein alignment Algorithm: • Find all dij: alignment dist (xi, xj) • High alignment score => short distance • Construct a tree (Neighbor-joining hierarchical clustering. Will discuss in future) • Align nodes in order of decreasing similarity + a large number of heuristics
CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD
CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD Distance matrix
CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD s1 s3 s2 s4
CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK s1 s3 s2 s4
CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -TNSD NT-SD s1 s3 s2 s4
CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -ALSK -TNSD NA-SK NT-SD -TNSD NT-SD s1 s3 s2 s4
Iterative Refinement Problems with progressive alignment: • Depend on pair-wise alignments • If sequences are very distantly related, much higher likelihood of errors • Initial alignments are “frozen” even when new evidence comes Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG Frozen! Now clear: correct y should be GA-CTT
Iterative Refinement Algorithm (Barton-Stenberg): • Align most similar xi, xj • Align xk most similar to (xixj) • Repeat 2 until (x1…xN) are aligned • For j = 1 to N, Remove xj, and realign to x1…xj-1xj+1…xN • Repeat 4 until convergence Progressive alignment
allow y to vary x,z fixed projection Iterative Refinement (cont’d) For each sequence y • Remove y • Realign y (while rest fixed) z x y Note: Guaranteed to converge (why?) Running time: O(kNL2), k: number of iterations