Multiple Sequence Alignments
Progressive Alignment • When the evolutionary tree is known: • Align the closest sequences first, in the order given by the tree • In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult • Weighted version: • Tree edges have weights, proportional to the divergence along that edge • The new profile is a weighted average of the two old profiles [figure: guide tree over x, y, z, w with internal-node profiles pxy, pzw, pxyzw]
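A minimal sketch of the weighted profile average, assuming profiles are stored as per-column frequency vectors over the alphabet (the function name, the toy two-letter alphabet, and the weights are illustrative only):

```python
import numpy as np

def merge_profiles(p_x, p_y, w_x, w_y):
    """Weighted average of two already-aligned profiles.

    p_x, p_y: arrays of shape (L, |alphabet|), one frequency row per
    alignment column (both profiles padded/aligned to the same length L).
    w_x, w_y: branch weights, e.g. proportional to divergence on each edge.
    """
    return (w_x * p_x + w_y * p_y) / (w_x + w_y)

# Toy example over a two-letter alphabet {A, C}:
p_x = np.array([[1.0, 0.0], [0.5, 0.5]])
p_y = np.array([[0.0, 1.0], [1.0, 0.0]])
print(merge_profiles(p_x, p_y, w_x=1.0, w_y=3.0))
```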
Progressive Alignment • When the evolutionary tree is unknown: • Perform all pairwise alignments • Define a distance matrix D, where D(x, y) is a measure of evolutionary distance based on the pairwise alignment • Construct a tree (UPGMA / Neighbor Joining / other methods) • Align on the tree [figure: unresolved tree over x, y, z, w]
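The tree-unknown case can be sketched with SciPy's average-linkage clustering standing in for UPGMA; the pairwise_distance placeholder below is a crude identity-based measure, not the distance any particular aligner uses:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

def pairwise_distance(x, y):
    """Placeholder: in practice this would come from a pairwise alignment,
    e.g. 1 - (fraction of identical aligned positions)."""
    matches = sum(a == b for a, b in zip(x, y))
    return 1.0 - matches / max(len(x), len(y))

seqs = ["GAAGTT", "GACTT", "GAACTG", "GTACTG"]
n = len(seqs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = pairwise_distance(seqs[i], seqs[j])

# Average-linkage clustering (UPGMA) yields the order in which
# sequences/profiles are merged during progressive alignment.
tree = linkage(squareform(D), method="average")
print(tree)  # each row: the two clusters merged, their distance, new cluster size
```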
Heuristics to improve alignments • Iterative refinement schemes • A*-based search • Consistency • Simulated Annealing • …
Iterative Refinement • One problem of progressive alignment: initial alignments are "frozen" even when later sequences provide new evidence • Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG • The (x, y) alignment is frozen, even though once z and w are added it becomes clear that the correct alignment is y = G-ACTT
Iterative Refinement • Algorithm (Barton-Sternberg): • For j = 1 to N: remove xj and realign it to the profile of x1…xj-1 xj+1…xN (the projection of the alignment with xj removed) • Repeat until convergence [figure: y is allowed to vary while the projected alignment of x and z stays fixed]
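A sketch of the refinement loop, with realign_one and sum_of_pairs as hypothetical placeholder helpers (they are not part of any library):

```python
def barton_sternberg(msa, realign_one, sum_of_pairs, max_rounds=10):
    """Sketch of the Barton-Sternberg refinement loop.

    msa: list of gapped sequences (the current multiple alignment).
    realign_one(msa, j) -> candidate msa in which sequence j was removed,
        its gap-only columns projected out, and the sequence realigned
        against the profile of the remaining rows.
    sum_of_pairs(msa) -> score used to accept or reject a realignment.
    """
    best = sum_of_pairs(msa)
    for _ in range(max_rounds):
        improved = False
        for j in range(len(msa)):
            candidate = realign_one(msa, j)
            score = sum_of_pairs(candidate)
            if score > best:              # keep the realignment only if it helps
                msa, best, improved = candidate, score, True
        if not improved:                  # converged: a full pass changed nothing
            break
    return msa
```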
Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x: GAAGTTA y: GAC-TTA z: GAACTGA w: GTACTGA After realigning y: x: GAAGTTA y: G-ACTTA + 3 matches z: GAACTGA w: GTACTGA Variant: Refinement on a tree
Iterative Refinement Example not handled well: x: GAAGTTA y1: GAC-TTA y2: GAC-TTA y3: GAC-TTA z: GAACTGA w: GTACTGA • Realigning any single yi changes nothing
Consistency [figure: sequences x, y, z with positions xi in x, yj and yj' in y, and zk in z, illustrating how alignments through a third sequence z support a choice between xi ~ yj and xi ~ yj']
Consistency • Basic method for applying consistency: • Compute all pairwise alignments xy, xz, yz, … • When aligning x, y during progressive alignment, for each pair (xi, yj) let s(xi, yj) = function_of(xi, yj, axz, ayz), i.e. the match score is modified according to how the x–z and y–z alignments support xi ~ yj • Align x and y with DP using the modified s(.,.) function
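One possible way to realize such a modified score, assuming the pairwise alignments are stored as sets of aligned position pairs; the additive bonus and its weight are illustrative, since the actual combination function varies between methods:

```python
def consistency_score(i, j, base_score, xz_pairs, yz_pairs, bonus=1.0):
    """Consistency-modified match score for positions x_i, y_j (sketch).

    xz_pairs: set of (i, k) pairs aligned in the x-z pairwise alignment.
    yz_pairs: set of (j, k) pairs aligned in the y-z pairwise alignment.
    Every third-sequence position k that supports x_i ~ y_j adds a bonus.
    """
    support = sum(1 for (ii, k) in xz_pairs if ii == i and (j, k) in yz_pairs)
    return base_score + bonus * support
```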
Some Resources • Genome resources: • Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway • Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2 • Protein multiple aligners: • CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/ • MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py • PROBCONS (most accurate): http://probcons.stanford.edu/
MUSCLE at a glance • Fast measurement of all pairwise distances between sequences • DDRAFT(x, y) defined in terms of the number of common k-mers (k ≈ 3) – O(N² L log L) time • Build tree TDRAFT from those distances, with UPGMA • Progressive alignment over TDRAFT, resulting in multiple alignment MDRAFT • Measure new Kimura-based distances D(x, y) based on MDRAFT • Build tree T based on D • Progressive alignment over T, to build M • Only perform alignment steps for the parts of the tree that have changed relative to TDRAFT • Iterative refinement; for many rounds, do: • Tree partitioning: split M on one branch and realign the two resulting profiles • If the new alignment M' has a better sum-of-pairs score than the previous one, accept it
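The draft k-mer distance might look roughly like this; MUSCLE's exact formula differs in detail, so treat this only as an illustration of the idea:

```python
from collections import Counter

def kmer_distance(x, y, k=3):
    """Draft distance in the spirit of MUSCLE's first stage (sketch):
    1 minus the fraction of shared k-mers, counted with multiplicity."""
    kx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    ky = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    shared = sum((kx & ky).values())                  # common k-mers
    denom = min(sum(kx.values()), sum(ky.values()))
    return 1.0 - shared / denom if denom else 1.0

print(kmer_distance("GAAGTTA", "GTACTGA"))
```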
PROBCONS at a glance • Computation of all posterior matrices Mxy, where Mxy(i, j) = Prob(xi ~ yj), using an HMM • Re-estimation of posterior matrices M'xy with probabilistic consistency: M'xy(i, j) = (1/N) Σz Σk Mxz(i, k) Mzy(k, j), i.e. M'xy = Avgz(Mxz Mzy) • For every pair x, y, compute the maximum expected accuracy alignment Axy: the alignment A that maximizes Σ(i, j) aligned in A M'xy(i, j) • Define E(x, y) = Σ(i, j) aligned in Axy M'xy(i, j) • Build tree T with hierarchical clustering using the similarity measure E(x, y) • Progressive alignment on T to maximize E(.,.) • Iterative refinement; for many rounds, do: • Randomized partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
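A sketch of the consistency re-estimation step using NumPy, assuming posterior matrices are available for all ordered pairs of sequences (with identity matrices stored for (x, x), so that averaging over all N sequences matches M'xy = Avgz(Mxz Mzy)):

```python
import numpy as np

def probabilistic_consistency(M, rounds=1):
    """One or more rounds of the consistency transformation (sketch).

    M: dict mapping an ordered pair of sequence ids (x, y) to its posterior
    matrix, M[(x, y)][i, j] ~ P(x_i ~ y_j). Entries are assumed to exist for
    every ordered pair, including identity matrices for (x, x).
    """
    seqs = sorted({s for pair in M for s in pair})
    for _ in range(rounds):
        new_M = {}
        for (x, y), Mxy in M.items():
            acc = np.zeros_like(Mxy)
            for z in seqs:
                acc += M[(x, z)] @ M[(z, y)]   # (Mxz Mzy)(i, j)
            new_M[(x, y)] = acc / len(seqs)    # average over all N sequences z
        M = new_M
    return M
```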
Rapid Global Alignments How to align genomic sequences in (more or less) linear time
Motivation • Genomic sequences are very long: • Human genome = 3 × 10⁹ bases long • Mouse genome = 2.7 × 10⁹ bases long • Aligning genomic regions is useful for revealing common gene structure • Useful to compare regions > 1,000,000 bases long
Main Idea Genomic regions of interest contain islands of similarity, such as genes • Find local alignments • Chain an optimal subset of them • Refine/complete the alignment Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others
Saving cells in DP • Find local alignments • Chain them – O(N log N) with L.I.S. • Restricted DP
Methods to CHAIN Local Alignments • Sparse Dynamic Programming – O(N log N)
The Problem: Find a Chain of Local Alignments • Chaining (x, y) → (x', y') requires x < x' and y < y' • Each local alignment has a weight • FIND the chain with the highest total weight
Quadratic Time Solution • Build a Directed Acyclic Graph (DAG): • Nodes: local alignments [(xa, xb) × (ya, yb)] with their scores • Directed edges: pairs of local alignments that can be chained – edge ((xa, xb, ya, yb), (xc, xd, yc, yd)) exists when xa < xb < xc < xd and ya < yb < yc < yd • Each local alignment is a node vi with alignment score si
Quadratic Time Solution • Initialization: for each node va with no incoming edge (u, va), set V(a) = sa • Iteration: for each vi, the optimal chain ending in vi has total score V(i) = max{ weight(vj, vi) + V(j) : there is an edge (vj, vi) } • Termination: the optimal global chain ends at j = argmax V(j); trace the chain back from vj • Worst-case time: quadratic
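A sketch of the quadratic chaining DP; here each fragment's score sits on the node and chainability is tested directly, rather than through precomputed edge weights:

```python
def chain_local_alignments(frags):
    """O(N^2) chaining over the implicit DAG described above (sketch).

    frags: list of (xa, xb, ya, yb, s) local alignments with score s.
    Fragment j may precede fragment i when j ends strictly before i
    begins in both x and y.
    """
    if not frags:
        return 0.0, []
    frags = sorted(frags, key=lambda f: f[0])      # process in order of start in x
    V = [f[4] for f in frags]                      # V[i]: best chain score ending at i
    back = [-1] * len(frags)
    for i, (xa, xb, ya, yb, s) in enumerate(frags):
        for j in range(i):
            _, xb_j, _, yb_j, _ = frags[j]
            if xb_j < xa and yb_j < ya and V[j] + s > V[i]:   # j can precede i
                V[i], back[i] = V[j] + s, j
    best = max(range(len(frags)), key=V.__getitem__)
    chain, i = [], best
    while i != -1:                                 # trace back the optimal chain
        chain.append(frags[i])
        i = back[i]
    return V[best], chain[::-1]

score, chain = chain_local_alignments(
    [(0, 3, 0, 2, 5.0), (4, 7, 3, 6, 4.0), (2, 9, 1, 8, 6.0)]
)
print(score, chain)   # chains the first two fragments, total score 9.0
```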
Sparse Dynamic Programming Back to the LCS problem: • Given two sequences • x = x1,…, xm • y = y1, …, yn • Find the longest common subsequence • Quadratic solution with DP • How about when “hits” xi = yj are sparse?
Sparse Dynamic Programming • Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S. • Longest Increasing Subsequence • Given a sequence over an ordered alphabet • x = x1,…, xm • Find a longest subsequence • s = s1, …, sk with • s1 < s2 < … < sk
Sparse Dynamic Programming – L.I.S. Let the input be w: w1,…, wn INITIALIZATION: L: 1-indexed array, L[1] ← w1 B: 0-indexed array of backpointers; B[0] = 0 P: array used for traceback // Invariant: L[j] holds the smallest last element wi of any j-long increasing subsequence seen so far ALGORITHM: for i = 2 to n { binary search for j such that L[j] < w[i] ≤ L[j+1] L[j+1] ← w[i] B[j+1] ← i P[i] ← B[j] } That’s it!!! • Running time? O(n log n) – each of the n steps is one binary search in L
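A runnable version of the same idea on plain integers (the (y, x) pairs used on the next slides work the same way under lexicographic comparison):

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """O(n log n) LIS with traceback: tails[j] holds the smallest element
    ending any increasing subsequence of length j+1 seen so far."""
    tails = []                 # last elements of the best subsequences, by length
    tails_idx = []             # index in w of each such element
    prev = [-1] * len(w)       # backpointer for traceback
    for i, v in enumerate(w):
        j = bisect_left(tails, v)          # first position with tails[j] >= v
        if j == len(tails):
            tails.append(v); tails_idx.append(i)
        else:
            tails[j] = v; tails_idx[j] = i
        prev[i] = tails_idx[j - 1] if j > 0 else -1
    out, i = [], tails_idx[-1] if tails_idx else -1
    while i != -1:                         # trace back the longest subsequence
        out.append(w[i]); i = prev[i]
    return out[::-1]

print(longest_increasing_subsequence([4, 3, 10, 2, 8, 1, 3, 4, 7, 5, 9]))
# -> [1, 3, 4, 5, 9]
```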
Sparse LCS expressed as LIS • Create a sequence w: every matching point (i, j) is inserted into w as follows: for each column j, from smallest to largest, insert into w the points (i, j) in decreasing row i order • The 11 example points are inserted in the order given • Two points a = (y, x) and b = (y', x') can be chained iff • a is before b in w, and • y < y'
Sparse LCS expressed as LIS • Create a sequence w: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) • Consider w’s elements as ordered lexicographically, where (y, x) < (y', x') if y < y' • Claim: an increasing subsequence of w is a common subsequence of x and y
Sparse Dynamic Programming for LIS • Algorithm: initialize an empty array L /* at each point, lj contains the smallest last element of any j-long increasing subsequence seen so far */ for i = 1 to |w|: binary search for w[i] in L, to find j with lj < w[i] ≤ lj+1; replace lj+1 with w[i]; keep a backpointer lj → w[i] That’s it!!!
Sparse Dynamic Programming for LIS • Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) • L after each insertion: • (4,2) • (3,3) • (3,3) (10,5) • (2,5) (10,5) • (2,5) (8,6) • (1,6) (8,6) • (1,6) (3,7) • (1,6) (3,7) (4,8) • (1,6) (3,7) (4,8) (7,9) • (1,6) (3,7) (4,8) (5,9) • (1,6) (3,7) (4,8) (5,9) (9,10) • Longest common subsequence: s = 4, 24, 3, 11, 18
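Putting the reduction together, a small end-to-end sketch that builds w from the match points and takes a strict LIS over the y-positions; the input strings are arbitrary examples, and only the LCS length is reported (the backpointer scheme above would recover the subsequence itself):

```python
from bisect import bisect_left
from collections import defaultdict

def lcs_via_lis(x, y):
    """Sparse LCS: list match points column by column (increasing position
    in x), within a column in decreasing position in y, then take a longest
    strictly increasing subsequence of the y-positions. The x-order is
    implicit in the insertion order, so only y-positions need to be stored."""
    by_char = defaultdict(list)
    for i, c in enumerate(y):
        by_char[c].append(i)
    w = []
    for c in x:                              # columns j, left to right
        for i in reversed(by_char[c]):       # decreasing row i within a column
            w.append(i)
    tails = []                               # strict LIS over w
    for v in w:
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(tails)

print(lcs_via_lis("GAACTG", "GAAGTT"))   # -> 4 (e.g. G, A, A, T)
```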
Sparse DP for rectangle chaining • 1,…, N: rectangles • (hj, lj): y-coordinates of rectangle j • w(j): weight of rectangle j • V(j): optimal score of a chain ending in j • L: list of triplets (lj, V(j), j) • L is sorted by lj: top to bottom • L is implemented as a balanced binary tree [figure: rectangle with top y-coordinate h and bottom y-coordinate l]
Sparse DP for rectangle chaining • Main idea: • Sweep through x-coordinates • To the right of b, anything chainable to a is chainable to b • Therefore, if V(b) > V(a), rectangle a is “useless” – remove it • In L, keep the rectangles j sorted by increasing lj-coordinate; they are then also sorted by increasing V(j)
Sparse DP for rectangle chaining • Go through rectangle x-coordinates, from lowest to highest: • When on the leftmost end of i: • j: rectangle in L with largest lj < hi • V(i) = w(i) + V(j) • When on the rightmost end of i: • k: rectangle in L with largest lk ≤ li • If V(i) > V(k): • INSERT (li, V(i), i) into L • REMOVE all (lj, V(j), j) with V(j) ≤ V(i) and lj ≥ li
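A sweep-line sketch of the procedure above; a plain sorted Python list stands in for the balanced binary tree, so the per-update cost here is not truly O(log N):

```python
from bisect import bisect_left

def chain_rectangles(rects):
    """Sweep-line rectangle chaining (sketch).

    rects: list of (x_left, x_right, h, l, w): x-extent, top/bottom
    y-coordinates (h < l, y growing downward), and weight w.
    """
    events = []
    for i, (xl, xr, h, l, w) in enumerate(rects):
        events.append((xl, 0, i))   # left end first at equal x: chaining stays strict in x
        events.append((xr, 1, i))
    events.sort()

    V = [0.0] * len(rects)
    L = []                          # sorted triplets (l_j, V(j), j); V increases with l_j
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:                                     # leftmost end of rectangle i
            p = bisect_left(L, (h, float("-inf"), -1))    # entries with l_j < h_i lie before p
            V[i] = w + (L[p - 1][1] if p > 0 else 0.0)    # V(i) = w(i) + V(j)
        else:                                             # rightmost end of rectangle i
            p = bisect_left(L, (l, float("inf"), len(rects)))
            prev_best = L[p - 1][1] if p > 0 else 0.0     # V(k), largest l_k <= l_i
            if V[i] > prev_best:
                pos = bisect_left(L, (l, V[i], i))
                L.insert(pos, (l, V[i], i))
                while pos + 1 < len(L) and L[pos + 1][1] <= V[i]:
                    L.pop(pos + 1)                        # remove dominated entries
    return max(V) if V else 0.0
```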
Example [figure: five rectangles, numbered 1–5 with weights 5, 6, 3, 4, 2, swept left to right along the x-axis]
Time Analysis • Sorting the x-coordinates takes O(N log N) • Going through the x-coordinates: N steps • Each of the N steps requires O(log N) time: • Searching L takes log N • Inserting into L takes log N • All deletions are consecutive, so log N per deletion • Each element is deleted at most once: N log N for all deletions • Recall that INSERT, DELETE, and SUCCESSOR take O(log N) time in a balanced binary search tree
Examples • Human Genome Browser [figures]
Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned