Multiple Sequence Alignments

Multiple Sequence Alignments

z x y The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can help improve the pairwise alignments

Scoring Function: Sum Of Pairs Definition:Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Sum Of Pairs (cont’d) • Heuristic way to incorporate evolution tree: Human Mouse Duck Chicken • Weighted SOP: • S(m) = k<l wkl s(mk, ml) • wkl: weight decreasing with distance

A Profile Representation - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 O .2 .8 .4 .4 E .4 C .2 .8 .4 .2 • Given a multiple alignment M = m1…mn • Replace each column mi with profile entry pi • Frequency of each letter in  • # gaps • Optional: # gap openings, extensions, closings

Multiple Sequence Alignments Algorithms

1. Multidimensional Dynamic Programming Generalization of Needleman-Wunsh: S(m) = i S(mi) (sum of column scores) F(i1,i2,…,iN): Optimal alignment up to (i1, …, iN) F(i1,i2,…,iN) = max(all neighbors of cube)(F(nbr)+S(nbr))

1. Multidimensional Dynamic Programming • Example: in 3D (three sequences): • 7 neighbors/cell F(i,j,k) = max{ F(i-1,j-1,k-1)+S(xi, xj, xk), F(i-1,j-1,k )+S(xi, xj, - ), F(i-1,j ,k-1)+S(xi, -, xk), F(i-1,j ,k )+S(xi, -, - ), F(i ,j-1,k-1)+S( -, xj, xk), F(i ,j-1,k )+S( -, xj, xk), F(i ,j ,k-1)+S( -, -, xk) }

1. Multidimensional Dynamic Programming • How do affine gaps generalize? • VERY badly! • Require 2N states, one per combination of gapped/ungapped sequences • Running time: O(2N 2N  LN) = O(4N LN) Running Time: • Size of matrix: LN; Where L = length of each sequence N = number of sequences • Neighbors/cell: 2N – 1 Therefore………………………… O(2N LN) Y YZ XY XYZ Z X XZ

2. Progressive Alignment x • When evolutionary tree is known: • Align closest first, in the order of the tree • In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult Weighted version: • Tree edges have weights, proportional to the divergence in that edge • New profile is a weighted average of two old profiles y Example Profile: (A, C, G, T, -) px = (0.8, 0.2, 0, 0, 0) py = (0.6, 0, 0, 0, 0.4) s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -) Result:pxy= (0.7, 0.1, 0, 0, 0.2) s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -) Result:px-= (0.4, 0.1, 0, 0, 0.5) z w

2. Progressive Alignment x • When evolutionary tree is unknown: • Perform all pairwise alignments • Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment • Construct a tree (we will describe more in detail later in the course) • Align on the tree y ? z w

Aligning two alignments • Given two alignments, m1, m2, can we find the optimal alignment under SOP scoring, with affine gaps? m1 x GGGCACTGCAT y GGTTACGTC-- m2 z GGGAACTGCAG w GGACGTACC-- v GGACCT----- GTAGTCAGTCG x m1 ---GTCACGTG y GTCGTCAGTCG z m2 --CGCCAGGGG w --CGCCAGGGA v

Aligning two alignments • Given two alignments, m1, m2, can we find the optimal alignment under SOP scoring, with affine gaps? NP-hard! m1 x GGGCACTGCAT y GGTTACGTC-- m2 z GGGAACTGCAG w GGACGTACC-- v GGACCT----- GTAGTCAGTCG x m1 ---GTCACGTG y GTCGTCAGTCG z m2 --CGCCAGGGG w --CGCCAGGGA v Optimistic: assume no gap – don’t pay gap-open penalty Pessimistic: assume gap – pay gap-open penalty

Heuristics to improve multiple alignments • Iterative refinement schemes • A*-based search • Consistency • Simulated Annealing • …

Iterative Refinement One problem of progressive alignment: • Initial alignments are “frozen” even when new evidence comes Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG Frozen! Now clear correct y = GA-CTT

Iterative Refinement Algorithm (Barton-Stenberg): • Align most similar xi, xj • Align xk most similar to (xixj) • Repeat 2 until (x1…xN) are aligned • For j = 1 to N, Remove xj, and realign to x1…xj-1xj+1…xN • Repeat 4 until convergence Note: Guaranteed to converge

allow y to vary x,z fixed projection Iterative Refinement For each sequence y • Remove y • Realign y (while rest fixed) z x y

Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x: GAAGTTA y: GAC-TTA z: GAACTGA w: GTACTGA After realigning y: x: GAAGTTA y: G-ACTTA + 3 matches z: GAACTGA w: GTACTGA

Iterative Refinement Example not handled well: x: GAAGTTA y1: GAC-TTA y2: GAC-TTA y3: GAC-TTA z: GAACTGA w: GTACTGA • Realigning any single yi changes nothing

A* for Multiple Alignments Review of the A* algorithm v GOAL START • Say that we have a gigantic graph G • START: start node • GOAL: we want to reach this node with the minimum path Dijkstra: O(VlogV + E) – too slow if the number of edges is huge A*: a way of finding the optimal solution faster in practice

A* for Multiple Alignments Review of the A* algorithm h(v) g(v) v GOAL Lemma Given sequences x, y, z, … The sum-of pairs score of multiple alignment M is lower (worse) than the sum of the optimal pairwise alignments Proof M induces projected pairwise alignments axy, ayz, axz, …, and Score(M) = d(axy) + d(axz) + d(ayz) +… Each of d(.) is smaller than the optimal edit distance START • g(v) is the cost so far • h(v) is an estimate of the minimum cost from v to GOAL • f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v • Expand v with the smallest f(v) • Never expand v, if f(v) ≥ shortest path to the goal found so far

A* for Multiple Alignments • Nodes: Cells in the DP matrix • g(v): alignment cost so far • h(v): sum-of-pairs of individual pairwise alignments • Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments h(v) g(v) v GOAL START To compute h(v) For each pair of sequences x, y, Compute FR(x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y Then, at position (i1, i2, …, iN), h(v) becomes the sum of (N choose 2) FR scores

Consistency zk z xi x y yj yj’

Consistency zk z Basic method for applying consistency • Compute all pairs of alignments xy, xz, yz, … • When aligning x, y during progressive alignment, • For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz) • Align x and y with DP using the modified s(.,.) function xi x y yj yj’

Some Resources Genome Resources Annotation and alignment genome browser at UCSC http://genome.ucsc.edu/cgi-bin/hgGateway Specialized VISTA alignment browser at LBNL http://pipeline.lbl.gov/cgi-bin/gateway2 ABC—Nice Stanford tool for browsing alignments http://encode.stanford.edu/~asimenos/ABC/ Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable http://probcons.stanford.edu/ PROBCONS – most accurate

Whole-genome alignment Rat—Mouse—Human

Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned

Multiple Sequence Alignments