CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment

Roadmap • Last lecture review • Sequence alignment statistics • Today • Heuristic alignment algorithms • Basic Local Alignment Search Tools • Multiple sequence alignment algorithms

Sequence Alignment Statistics • Substitution matrices • How is the BLOSUM-N matrix made • How to make your own substitution matrix • What’s the meaning of an arbitrary substitution matrix • Significance of sequence alignments • P-value estimation for • Global alignment scores • Local alignment scores • Critical score for local alignment to guarantee significance

Heuristic Local Aligners-- BLAST and alike

State of biological databases Sequenced Genomes: Human 3109 Yeast 1.2107 Mouse 2.7109 Rat 2.6109 Neurospora 4107 Fugu fish 3.3108 Tetraodon 3108 Mosquito 2.8108 Drosophila 1.2108 Worm 1.0108 Rice 1.0109 Arabidopsis 1.2108 sea squirts 1.6108 Current rate of sequencing: 4 big labs  3 109 bp /year/lab 10s small labs Private sectors

State of biological databases Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences

Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? Assume we try Smith-Waterman: Our new gene 104 The entire genomic database 1010 - 1011 May take several weeks!

Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? - Potential genes - Other functional units Assume we try Smith-Waterman: Our newly sequenced mammal 3109 The entire genomic database 1010 - 1011 > 1000 years ???

BLAST • Basic Local Alignment Search Tool • Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 • The most widely used bioinformatics tool • One of the most cited papers in the history of science • Which is better: long mediocre match or a few nearby, short, strong matches with the same total score? • Score-wise, exactly equivalent • Biologically, later may be more interesting, & is common • At least, if must miss some, rather miss the former • BLAST is a heuristic algorithm emphasizing the later • speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed

BLAST • Available at NCBI (National Center for Biotechnology Information) for download and online use. http://blast.ncbi.nlm.nih.gov/ • Along with many sequence databases Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman query DB

BLAST  Original Version …… Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score  T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold query …… scan DB query

BLAST  Original Version A C G A A G T A A G G T C C A G T Example: k = 4, T = 4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A

Gapped BLAST A C G A A G T A A G G T C C A G T Added features: • Pairs of words can initiate alignment • Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A

Example Query:gattacaccccgattacaccccgattaca (29 letters) [2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi|28570323|gb|AC108906.9|Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 125138 tacacccagattacaccccga 125158 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 125104 tacacccagattacaccccga 125124 >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits(17),Expect = 4.5Identities = 20/21 (95%) Strand = Plus / Plus Query: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| Sbjct: 3891 tacacccagattacaccccga 3911

Example Query: Human atoh enhancer, 179 letters [1.5 min] Result: 57 blast hits • gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... 355 1e-95 • gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... 264 4e-68 • gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... 256 9e-66 • gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... 78 5e-12 • gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... 54 7e-05 • gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... 44 0.068 • gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... 42 0.27 • gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres... 42 0.27 gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517 Score = 256 bits (129),Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

BLAST Score: bit score vs raw score • Bit score is converted from raw score by taking into account K and : • S’ = ( S – log K) / log 2 • To compute E-value from bit score: • E = KM’N’ e-S = M’N’ 2-S’ • Critical score is now: • S* = log2(M’N’) • If S’ >> S*: significant • If S’ << S*: not significant (M’ ~ M, N’ ~ N)

Different types of BLAST • blastn: search nucleic acid databases • blastp: search protein databases • blastx: you give a nucleic acid sequence, search protein databases • tblastn: you give a protein sequence, search nucleic acid databases • tblastx: you give a nucleic sequence, search nucleic acid database, implicitly translate both into protein sequences

BLAST cons and pros • Advantages • Fast!!!! • A few minutes to search a database of 1011 bases • Disadvantages • Sensitivity may be low • Often misses weak homologies • New improvement • Make it even faster • Mainly for aligning very similar sequences or really long sequences • E.g. whole genome vs whole genome • Make it more sensitive • PSI-BLAST: iteratively add more homologous sequences • PatternHunter: discontinuous seeds

Variants of BLAST NCBI-BLAST: most widely used version WU-BLAST: (Washington University BLAST): another popular version Optimized, added features MEGABLAST: Optimized to align very similar sequences. Linear gap penalty BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower

Pattern hunter • Instead of exact matches of consecutive matches of k-mer, we can • look for discontinuous matches • My query sequence looks like: • ACGTAGACTAGCAGTTAAG • Search for sequences in database that match • AXGXAGXCTAXC • X stands for don’t care • Seed: 101011011101

Pattern hunter • A good seed may give you both a higher sensitivity and higher specificity • You may think 110110110110 is the best seed • Because mutation in the third position of a codon often doesn’t change the amino acid • Best seed is actually 110100110010101111 • Empirically determined • How to design such seed is an open problem • May combine multiple random seeds

Things we’ve covered so far • Global alignment • Needleman-Wunsch and variants • Local Alignment • Smith-Waterman • Improvement on space and time • More accurate gap penalty models • Heuristic algorithms • BLAST families • Statistics for sequence alignment

Commonality: They all deal with aligning two sequences • Pair-wise sequence alignment • Next: Aligning multiple sequences all together • Multiple sequence alignment • Motivation: • A faint similarity between two sequences becomes very significant if present in many sequences • Protein domains • Motifs responsible for gene regulation

Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global mapping is maximum • Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences • Same for a multiple alignment!

Scoring Function • Ideally: • Find alignment that maximizes probability that sequences evolved from common ancestor x y z ? Phylogenetic tree or evolution tree w v

Scoring Function (cont’d) • Unfortunately: too many parameters • Compromises: • Ignore phylogenetic tree • Compute from pair-wise scores • Based on sum of all pair-wise scores • Based on scores with a consensus sequence

First assumption • Columns are independent • Similar in pair-wise alignment • Therefore, the score of an alignment is the sum of all columns • Need to decide how to score a single column

Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG - - - -

Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments S(m) = k<l s(mk, ml) s(mk, ml): score of induced alignment (k,l)

Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG (A,A) + (A,G) x 2 = -1 (C,C) x 3 = 3 (-,A) x 2 + (A,A) = -1 Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5

Sum Of Pairs (cont’d) • Drawback: no evolutionary characterization • Every sequence derived from all others • Heuristic way to incorporate evolution tree • Weighted Sum of Pairs: • S(m) = k<l wkl s(mk, ml) • wkl: weight decreasing with distance Human Mouse Duck Chicken

Consensus score -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Consensus sequence: • Find optimal consensus string m* to maximize S(m) = i s(m*, mi) s(mk, ml): score of pairwise alignment (k,l)

Multiple Sequence Alignments Algorithms

Multidimensional Dynamic Programming (MDP) Generalization of Needleman-Wunsh: • Find the longest path in a high-dimensional cube • As opposed to a two-dimensional grid • Uses a N-dimensional matrix • As apposed to a two-dimensional array • Entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] F(i1,i2,…,iN) = max(all neighbors of a cell) (F(nbr)+S(current))

Multidimensional Dynamic Programming (MDP) • Example: in 3D (three sequences): • 23 – 1 = 7 neighbors/cell (i-1,j-1,k-1) (i-1,j,k-1) (i-1,j-1,k) (i-1,j,k) F(i-1,j-1,k-1) + S(xi, xj, xk), F(i-1,j-1,k ) + S(xi, xj, -), F(i-1,j ,k-1) + S(xi, -, xk), F(i,j,k) = max F(i ,j-1,k-1) + S(-, xj, xk), F(i-1,j ,k ) + S(xi, -, -), F(i ,j-1,k ) + S(-, xj, -), F(i ,j ,k-1) + S(-, -, xk) (i,j,k-1) (i,j-1,k-1) (i,j-1,k) (i,j,k)

Multidimensional Dynamic Programming (MDP) Running Time: • Size of matrix: LN; Where L = length of each sequence N = number of sequences • Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN)

Faster MDP • Carrillo & Lipman, 1988 • Branch and bound • Other heuristics • Practical for about 6 sequences of length about 200-300.

Progressive Alignment • Multiple Alignment is NP-hard • Most used heuristic: Progressive Alignment Algorithm: • Align two of the sequences xi, xj • Fix that alignment • Align a third sequence xk to the alignment xi,xj • Repeat until all sequences are aligned Running Time: O(NL2) Each alignment takes O(L2) Repeat N times

Progressive Alignment x • When evolutionary tree is known: • Align closest first, in the order of the tree Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) y z w

Progressive Alignment: CLUSTALW CLUSTALW: most popular multiple protein alignment Algorithm: • Find all dij: alignment dist (xi, xj) • High alignment score => short distance • Construct a tree (Neighbor-joining hierarchical clustering. Will discuss in future) • Align nodes in order of decreasing similarity + a large number of heuristics

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD Distance matrix

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD s1 s3 s2 s4

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK s1 s3 s2 s4

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -TNSD NT-SD s1 s3 s2 s4

CLUSTALW example • S1ALSK • S2TNSD • S3NASK • S4NTSD -ALSK NA-SK -ALSK -TNSD NA-SK NT-SD -TNSD NT-SD s1 s3 s2 s4

Iterative Refinement Problems with progressive alignment: • Depend on pair-wise alignments • If sequences are very distantly related, much higher likelihood of errors • Initial alignments are “frozen” even when new evidence comes Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG Frozen! Now clear: correct y should be GA-CTT

Iterative Refinement Algorithm (Barton-Stenberg): • Align most similar xi, xj • Align xk most similar to (xixj) • Repeat 2 until (x1…xN) are aligned • For j = 1 to N, Remove xj, and realign to x1…xj-1xj+1…xN • Repeat 4 until convergence Progressive alignment

allow y to vary x,z fixed projection Iterative Refinement (cont’d) For each sequence y • Remove y • Realign y (while rest fixed) z x y Note: Guaranteed to converge (why?) Running time: O(kNL2), k: number of iterations

CS 5263 Bioinformatics