
Multiple Sequence Alignments


cale

Presentation Transcript


  1. Multiple Sequence Alignments

  2. The Global Alignment problem
     AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
     AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
     (Figure: axes labeled x, y, z.)

  3. Definition
     • Given N sequences $x_1, x_2, \ldots, x_N$:
       • Insert gaps (-) in each sequence $x_i$, such that
         • All sequences have the same length L
         • Score of the global map is maximum
     • A faint similarity between two sequences becomes significant if present in many
     • Multiple alignments can help improve the pairwise alignments

  4. Scoring Function: Sum Of Pairs
     Definition: Induced pairwise alignment
     A pairwise alignment induced by the multiple alignment
     Example:
       x: AC-GCGG-C
       y: AC-GC-GAG
       z: GCCGC-GAG
     Induces:
       x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
       y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
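
The projection on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the scoring values (match = 1, mismatch = -1, linear gap = -2) are assumptions, not taken from the slides. Columns where both rows hold a gap are dropped, which is exactly what turns a multiple alignment into an induced pairwise alignment.

```python
from itertools import combinations

def induced_pair(row_a, row_b):
    """Project a multiple alignment onto two rows, dropping gap-gap columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score one induced pairwise alignment (assumed linear gap penalty)."""
    total = 0
    for a, b in zip(*induced_pair(row_a, row_b)):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def sum_of_pairs(rows, **scoring):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))

# The example alignment from the slide.
msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
for k, l in combinations(msa, 2):
    print(k, l, induced_pair(msa[k], msa[l]))
print("SOP:", sum_of_pairs(list(msa.values())))
```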

  5. Sum Of Pairs (cont’d)
     • Heuristic way to incorporate evolution tree (figure: tree over Human, Mouse, Duck, Chicken)
     • Weighted SOP:
       • $S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$
       • $w_{kl}$: weight decreasing with distance
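
A hedged sketch of the weighted sum-of-pairs formula follows. The slide only says that the weight decreases with distance, so the weights are passed in by the caller; the names `weighted_sop` and `pair_score`, the scoring values, and the example weights are all illustrative assumptions.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """s(m^k, m^l): score two rows of the alignment, ignoring gap-gap columns."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def weighted_sop(rows, weights):
    """S(m) = sum over k < l of w_kl * s(m^k, m^l).

    rows:    dict mapping sequence name -> aligned row
    weights: dict mapping frozenset({k, l}) -> w_kl (supplied by the caller)
    """
    return sum(
        weights[frozenset((k, l))] * pair_score(rows[k], rows[l])
        for k, l in combinations(rows, 2)
    )

rows = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
# Made-up weights: x and y are treated as close, z as more distant.
weights = {
    frozenset(("x", "y")): 1.0,
    frozenset(("x", "z")): 0.5,
    frozenset(("y", "z")): 0.5,
}
print(weighted_sop(rows, weights))
```

With all weights set to 1 this reduces to the plain sum-of-pairs score.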

  6. A Profile Representation

         -   A   G   G   C   T   A   T   C   A   C   C   T   G
         T   A   G   -   C   T   A   C   C   A   -   -   -   G
         C   A   G   -   C   T   A   C   C   A   -   -   -   G
         C   A   G   -   C   T   A   T   C   A   C   -   G   G
         C   A   G   -   C   T   A   T   C   G   C   -   G   G

     A       1                   1          .8
     C  .6               1          .4   1      .6  .2
     G           1  .2                      .2          .4   1
     T  .2                   1      .6                  .2

     Gap rows (nonzero entries, left to right):
     O (openings):   .2 .8 .4 .4
     E (extensions): .4
     C (closings):   .2 .8 .4 .2

     • Given a multiple alignment $M = m_1 \ldots m_n$
       • Replace each column $m_i$ with profile entry $p_i$
         • Frequency of each letter in $\Sigma$
         • # gaps
         • Optional: # gap openings, extensions, closings
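
As a rough illustration, the letter and gap frequencies of a profile can be computed directly from the alignment columns. The function name `profile` and the choice to store gaps as a fifth symbol are assumptions for this sketch; gap openings, extensions and closings are left out.

```python
from collections import Counter

def profile(rows, alphabet="ACGT-"):
    """Return one dict per alignment column with the frequency of each symbol."""
    n = len(rows)
    columns = []
    for col in zip(*rows):
        counts = Counter(col)
        columns.append({ch: counts[ch] / n for ch in alphabet})
    return columns

# The five aligned sequences from the slide.
rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(rows)
for ch in "ACGT-":
    print(ch, " ".join(f"{p[ch]:.1f}" for p in prof))
```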

  7. Multiple Sequence Alignments Algorithms

  8. 1. Multidimensional Dynamic Programming
     Generalization of Needleman-Wunsch:
       $S(m) = \sum_i S(m_i)$ (sum of column scores)
       $F(i_1, i_2, \ldots, i_N)$: optimal alignment up to $(i_1, \ldots, i_N)$
       $F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

  9. 1. Multidimensional Dynamic Programming
     • Example: in 3D (three sequences):
     • 7 neighbors/cell
     $$F(i,j,k) = \max \begin{cases}
       F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
       F(i-1,j-1,k) + S(x_i, x_j, -), \\
       F(i-1,j,k-1) + S(x_i, -, x_k), \\
       F(i-1,j,k) + S(x_i, -, -), \\
       F(i,j-1,k-1) + S(-, x_j, x_k), \\
       F(i,j-1,k) + S(-, x_j, -), \\
       F(i,j,k-1) + S(-, -, x_k)
     \end{cases}$$
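
The three-sequence recurrence can be written out as a small Python sketch. It assumes sum-of-pairs column scoring with illustrative values (match = 1, mismatch = -1, linear gap = -2, gap-gap = 0); `sp_column` and `align3` are hypothetical names. Each of the 7 moves consumes a residue from a non-empty subset of the three sequences.

```python
from itertools import product

def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; gap-gap pairs score 0."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        total += gap if "-" in (p, q) else (match if p == q else mismatch)
    return total

def align3(x, y, z):
    """Exact global alignment of three sequences by 3D dynamic programming."""
    moves = [m for m in product((0, 1), repeat=3) if any(m)]  # the 7 neighbors
    F, back = {(0, 0, 0): 0}, {}
    dims = (range(len(x) + 1), range(len(y) + 1), range(len(z) + 1))
    for i, j, k in product(*dims):  # lexicographic order: predecessors come first
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue
            col = (x[pi] if di else "-", y[pj] if dj else "-", z[pk] if dk else "-")
            cand = F[pi, pj, pk] + sp_column(*col)
            if best is None or cand > best:
                best, back[i, j, k] = cand, (di, dj, dk)
        F[i, j, k] = best
    # Trace back from the far corner to recover the aligned rows.
    rows, cell = ["", "", ""], (len(x), len(y), len(z))
    while cell != (0, 0, 0):
        step = back[cell]
        for r, (seq, pos, used) in enumerate(zip((x, y, z), cell, step)):
            rows[r] = (seq[pos - 1] if used else "-") + rows[r]
        cell = tuple(p - d for p, d in zip(cell, step))
    return F[len(x), len(y), len(z)], rows

score, rows = align3("GAAGTT", "GACTT", "GAACTG")
print(score)
print("\n".join(rows))
```

The table has (|x|+1)(|y|+1)(|z|+1) cells, so this is only practical for short sequences.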

  10. 1. Multidimensional Dynamic Programming
      Running time:
      • Size of matrix: $L^N$, where L = length of each sequence, N = number of sequences
      • Neighbors/cell: $2^N - 1$
      • Therefore: $O(2^N L^N)$
      How do affine gaps generalize?
      • VERY badly!
      • Require $2^N$ states, one per combination of gapped/ungapped sequences (figure: states X, Y, Z, XY, XZ, YZ, XYZ)
      • Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$
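
To make the counts concrete, the neighbor moves can be enumerated as bit patterns: each move advances a non-empty subset of the N sequences. The helper names below are hypothetical, and the cell count uses (L + 1)^N because a DP table also has a row and column for the empty prefix.

```python
from itertools import product

def predecessor_moves(n):
    """All 2^n - 1 ways to advance a non-empty subset of n sequences."""
    return [m for m in product((0, 1), repeat=n) if any(m)]

def dp_work(n, length):
    """Rough operation count for the linear-gap DP: cells times neighbors."""
    cells = (length + 1) ** n
    return cells * len(predecessor_moves(n))

for n in (2, 3, 4):
    print(n, len(predecessor_moves(n)), dp_work(n, 100))
```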

  11. 2. Progressive Alignment
      • When evolutionary tree is known:
        • Align closest first, in the order of the tree
        • In each step, align two sequences x, y, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{result}$
      Weighted version:
        • Tree edges have weights, proportional to the divergence in that edge
        • New profile is a weighted average of two old profiles
      Example
        Profile: (A, C, G, T, -)
        $p_x$ = (0.8, 0.2, 0, 0, 0)
        $p_y$ = (0.6, 0, 0, 0, 0.4)
        $s(p_x, p_y)$ = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
        Result: $p_{xy}$ = (0.7, 0.1, 0, 0, 0.2)
        $s(p_x, -)$ = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
        Result: $p_{x-}$ = (0.4, 0.1, 0, 0, 0.5)
      (Figure: tree with leaves x, y, z, w.)
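
The profile example on this slide can be checked with a short sketch. The substitution scores in `s` (match = 1, mismatch = -1, gap = -2, gap-gap = 0) are assumed values chosen only for illustration; the slide leaves s(.,.) unspecified. `merge` uses equal weights, which reproduces the unweighted results shown.

```python
ALPHABET = "ACGT-"

def s(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Assumed symbol-vs-symbol score; gap against gap scores 0."""
    if a == "-" and b == "-":
        return 0.0
    if "-" in (a, b):
        return gap
    return match if a == b else mismatch

def column_score(p, q):
    """Expected score of aligning profile column p against profile column q."""
    return sum(
        p[i] * q[j] * s(a, b)
        for i, a in enumerate(ALPHABET)
        for j, b in enumerate(ALPHABET)
    )

def merge(p, q, wp=0.5, wq=0.5):
    """New profile column as a weighted average of the two old columns."""
    return tuple(wp * a + wq * b for a, b in zip(p, q))

px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_column = (0.0, 0.0, 0.0, 0.0, 1.0)

print(column_score(px, py), merge(px, py))
print(column_score(px, gap_column), merge(px, gap_column))
```

In the weighted version, `wp` and `wq` would come from the tree edge weights instead of being fixed at 0.5.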

  12. 2. Progressive Alignment
      • When evolutionary tree is unknown:
        • Perform all pairwise alignments
        • Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
        • Construct a tree (described in more detail later in the course)
        • Align on the tree
      (Figure: unknown tree "?" over leaves x, y, z, w.)
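
One possible way to go from pairwise alignments to a tree is sketched below. The slides defer tree construction to later in the course, so both choices here are assumptions: D(x, y) is taken to be the unit-cost edit distance (the cost of an optimal global pairwise alignment), and the tree is built by average-linkage clustering, in the style of UPGMA. The function names are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Cost of an optimal global alignment with unit mismatch and gap costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def guide_tree(seqs):
    """Join the closest clusters first; returns nested tuples of sequence names."""
    d = {
        frozenset((k, l)): edit_distance(seqs[k], seqs[l])
        for k, l in combinations(seqs, 2)
    }
    clusters = {name: [name] for name in seqs}  # tree node -> leaf names below it

    def average_distance(pair):
        a, b = pair
        total = sum(d[frozenset((u, v))] for u in clusters[a] for v in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=average_distance)
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))

seqs = {"x": "GAAGTT", "y": "GACTT", "z": "GAACTG", "w": "GTACTG"}
print(guide_tree(seqs))
```

Progressive alignment would then walk this tree from the leaves up, aligning the two children of each node.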

  13. Aligning two alignments
      • Given two alignments, m1, m2, can we find the optimal alignment under SOP scoring, with affine gaps?
        m1  x  GGGCACTGCAT   GTAGTCAGTCG
            y  GGTTACGTC--   ---GTCACGTG
        m2  z  GGGAACTGCAG   GTCGTCAGTCG
            w  GGACGTACC--   --CGCCAGGGG
            v  GGACCT-----   --CGCCAGGGA

  14. Aligning two alignments
      • Given two alignments, m1, m2, can we find the optimal alignment under SOP scoring, with affine gaps? NP-hard!
        m1  x  GGGCACTGCAT   GTAGTCAGTCG
            y  GGTTACGTC--   ---GTCACGTG
        m2  z  GGGAACTGCAG   GTCGTCAGTCG
            w  GGACGTACC--   --CGCCAGGGG
            v  GGACCT-----   --CGCCAGGGA
      Optimistic: assume no gap, don’t pay gap-open penalty
      Pessimistic: assume gap, pay gap-open penalty

  15. Heuristics to improve multiple alignments
      • Iterative refinement schemes
      • A*-based search
      • Consistency
      • Simulated Annealing
      • …

  16. Iterative Refinement
      One problem of progressive alignment:
      • Initial alignments are “frozen” even when new evidence comes
      Example:
        x: GAAGTT
        y: GAC-TT   Frozen!
        z: GAACTG
        w: GTACTG
      Now it is clear that the correct alignment is y = GA-CTT

  17. Iterative Refinement
      Algorithm (Barton-Sternberg):
      1. Align most similar $x_i$, $x_j$
      2. Align $x_k$ most similar to $(x_i x_j)$
      3. Repeat 2 until $(x_1 \ldots x_N)$ are aligned
      4. For j = 1 to N, remove $x_j$, and realign to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
      5. Repeat 4 until convergence
      Note: Guaranteed to converge

  18. Iterative Refinement
      For each sequence y
      • Remove y
      • Realign y (while rest fixed)
      (Figure: axes x, y, z; allow y to vary; x, z fixed projection.)

  19. Iterative Refinement
      Example: align (x,y), (z,w), (xy, zw):
        x: GAAGTTA
        y: GAC-TTA
        z: GAACTGA
        w: GTACTGA
      After realigning y:
        x: GAAGTTA
        y: G-ACTTA   + 3 matches
        z: GAACTGA
        w: GTACTGA
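
The "remove one sequence and realign it" step can be sketched as a sequence-to-alignment DP under sum-of-pairs scoring. The scoring values (match = 1, mismatch = 0, gap = -1 per residue-gap pair) and the function name `realign` are assumptions for illustration. The fixed rows keep their columns; the only freedom is where the removed sequence goes, plus optional new all-gap columns in the fixed rows.

```python
def realign(seq, rows, match=1, mismatch=0, gap=-1):
    """Realign seq against the fixed alignment rows; returns (aligned_seq, rows)."""
    cols = list(zip(*rows))
    n, m, k = len(seq), len(cols), len(rows)

    def col_score(ch, col):
        # Sum of pair scores between ch and every symbol already in the column.
        total = 0
        for c in col:
            if c == "-" and ch == "-":
                continue
            total += gap if "-" in (c, ch) else (match if c == ch else mismatch)
        return total

    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            cands = []
            if i and j:  # place seq[i-1] in existing column j-1
                cands.append((F[i - 1][j - 1] + col_score(seq[i - 1], cols[j - 1]), "M"))
            if j:        # gap in seq under existing column j-1
                cands.append((F[i][j - 1] + col_score("-", cols[j - 1]), "G"))
            if i:        # new column: seq[i-1] against gaps in all fixed rows
                cands.append((F[i - 1][j] + gap * k, "I"))
            if cands:
                F[i][j], back[i][j] = max(cands)

    aligned, new_cols, i, j = [], [], n, m
    while i or j:
        move = back[i][j]
        if move == "M":
            aligned.append(seq[i - 1]); new_cols.append(cols[j - 1]); i -= 1; j -= 1
        elif move == "G":
            aligned.append("-"); new_cols.append(cols[j - 1]); j -= 1
        else:
            aligned.append(seq[i - 1]); new_cols.append(("-",) * k); i -= 1
    aligned.reverse()
    new_cols.reverse()
    return "".join(aligned), ["".join(r) for r in zip(*new_cols)]

# Remove y from the slide's alignment and realign it to x, z, w.
y, fixed = realign("GACTTA", ["GAAGTTA", "GAACTGA", "GTACTGA"])
print(y)
print("\n".join(fixed))
```

A full refinement loop would apply this to each sequence in turn and stop once the sum-of-pairs score no longer improves.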

  20. Iterative Refinement
      Example not handled well:
        x:  GAAGTTA
        y1: GAC-TTA
        y2: GAC-TTA
        y3: GAC-TTA
        z:  GAACTGA
        w:  GTACTGA
      • Realigning any single $y_i$ changes nothing

  21. A* for Multiple Alignments
      Review of the A* algorithm
      • Say that we have a gigantic graph G
        • START: start node
        • GOAL: we want to reach this node with the minimum path
      Dijkstra: O(V log V + E); too slow if the number of edges is huge
      A*: a way of finding the optimal solution faster in practice
      (Figure: graph with START, v, GOAL.)

  22. A* for Multiple Alignments
      Review of the A* algorithm
      • g(v) is the cost so far
      • h(v) is an estimate of the minimum cost from v to GOAL
      • f(v) ≥ g(v) + h(v), where f(v) is the minimum cost of a path passing through v
      • Expand v with the smallest f(v)
      • Never expand v, if f(v) ≥ shortest path to the goal found so far
      Lemma
        Given sequences x, y, z, …
        The sum-of-pairs score of a multiple alignment M is no better than the sum of the scores of the optimal pairwise alignments
      Proof
        M induces projected pairwise alignments $a_{xy}$, $a_{yz}$, $a_{xz}$, …, and
        Score(M) = $d(a_{xy}) + d(a_{xz}) + d(a_{yz}) + \ldots$
        Each d(.) is no better than the score of the corresponding optimal pairwise alignment
      (Figure: path from START through v to GOAL, with g(v) and h(v) marked.)

  23. A* for Multiple Alignments
      • Nodes: Cells in the DP matrix
      • g(v): alignment cost so far
      • h(v): sum-of-pairs of individual pairwise alignments
      • Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments
      To compute h(v):
        For each pair of sequences x, y, compute FR(x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y
        Then, at position $(i_1, i_2, \ldots, i_N)$, h(v) becomes the sum of $\binom{N}{2}$ FR scores
      (Figure: path from START through v to GOAL, with g(v) and h(v) marked.)
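
The computation of h(v) can be sketched as follows. One assumption to note: the slides speak of costs (lower is better), while this sketch uses alignment scores (higher is better) with illustrative values match = 1, mismatch = -1, gap = -2, so h(v) here is an upper bound on the score still obtainable from v. The names `suffix_scores` and `make_heuristic` are hypothetical.

```python
from itertools import combinations

def suffix_scores(x, y, match=1, mismatch=-1, gap=-2):
    """FR[i][j] = best global alignment score of the suffixes x[i:] and y[j:]."""
    n, m = len(x), len(y)
    FR = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue  # two empty suffixes score 0
            cands = []
            if i < n and j < m:
                cands.append(FR[i + 1][j + 1] + (match if x[i] == y[j] else mismatch))
            if i < n:
                cands.append(FR[i + 1][j] + gap)
            if j < m:
                cands.append(FR[i][j + 1] + gap)
            FR[i][j] = max(cands)
    return FR

def make_heuristic(seqs, **scoring):
    """Precompute all (N choose 2) suffix tables; return h(position)."""
    tables = {
        (k, l): suffix_scores(seqs[k], seqs[l], **scoring)
        for k, l in combinations(range(len(seqs)), 2)
    }

    def h(position):
        return sum(t[position[k]][position[l]] for (k, l), t in tables.items())

    return h

seqs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]
h = make_heuristic(seqs)
print(h((0, 0, 0)))  # bound for the whole problem
print(h((3, 3, 4)))  # bound for the part remaining after cell (3, 3, 4)
```

Each lookup costs (N choose 2) table reads, so h(v) is cheap to evaluate once the pairwise tables are built.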

  24. Consistency (figure: sequences x, y, z with positions xi, yj, yj’, zk marked)

  25. Consistency
      Basic method for applying consistency
      • Compute all pairs of alignments xy, xz, yz, …
      • When aligning x, y during progressive alignment,
        • For each $(x_i, y_j)$, let $s(x_i, y_j)$ = function_of($x_i$, $y_j$, $a_{xz}$, $a_{yz}$)
        • Align x and y with DP using the modified s(.,.) function
      (Figure: sequences x, y, z with positions xi, yj, yj’, zk marked.)
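
The slide leaves function_of(...) open, so the sketch below picks one simple possibility purely as an assumption: add a fixed bonus to the base score of $(x_i, y_j)$ whenever some $z_k$ is aligned to $x_i$ in $a_{xz}$ and to $y_j$ in $a_{yz}$. The function names, the base score, and the bonus value are all illustrative.

```python
def aligned_pairs(row_a, row_b):
    """Set of (i, j) residue index pairs that share a column in a pairwise alignment."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        i += a != "-"
        j += b != "-"
    return pairs

def consistency_score(x, y, a_xz, a_yz, bonus=1):
    """Return s(i, j): base score plus a bonus when z supports pairing x_i with y_j.

    a_xz is (gapped x, gapped z) and a_yz is (gapped y, gapped z).
    """
    x_to_z = dict(aligned_pairs(*a_xz))                # i -> k
    z_to_y = {k: j for j, k in aligned_pairs(*a_yz)}   # k -> j

    def s(i, j):
        base = 1 if x[i] == y[j] else -1
        supported = z_to_y.get(x_to_z.get(i)) == j
        return base + (bonus if supported else 0)

    return s

x, y, z = "GAC", "GTC", "GAC"
s = consistency_score(x, y, a_xz=("GAC", "GAC"), a_yz=("GTC", "GAC"))
for i in range(len(x)):
    print([s(i, j) for j in range(len(y))])
```

The resulting s(i, j) would replace the plain substitution score inside the pairwise DP for x and y.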

  26. Some Resources
      Genome Resources
      • Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
      • Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
      • ABC, a nice Stanford tool for browsing alignments: http://encode.stanford.edu/~asimenos/ABC/
      Protein Multiple Aligners
      • CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/
      • MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
      • PROBCONS (most accurate): http://probcons.stanford.edu/

  27. Whole-genome alignment Rat—Mouse—Human

  28. Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned
