
Inverse Alignment


Presentation Transcript


  1. Inverse Alignment CS 374 Bahman Bahmani Fall 2006

  2. The Papers To Be Presented

  3. Sequence Comparison - Alignment • Alignments can be thought of as a record of how two sequences came to differ through mutations that occurred during evolution. The unaligned sequences
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
can be aligned as
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
where columns pairing identical letters are matches, columns pairing different letters are mismatches, and '-' marks an insertion or deletion.

  4. Scoring Alignments • Alignments are based on three basic operations: • Substitutions • Insertions • Deletions • A score is assigned to each single operation (via a scoring matrix for substitutions, and gap penalties for insertions and deletions). Alignments are then scored by summing the scores of their operations. • Standard formulations of string alignment optimize this alignment score.

  5. An Example of Scoring an Alignment Using a Scoring Matrix
AKRANR
KAAANK
(-1) + (-1) + (-2) + 5 + 7 + 3 = 11
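A minimal Python sketch that reproduces this computation; the six pair scores are taken from the slide's example (they agree with the corresponding BLOSUM50 entries):

    # Score an ungapped alignment column by column with a substitution matrix.
    # Only the six pairs needed for the slide's example are included here.
    PAIR_SCORES = {
        ("A", "K"): -1, ("K", "A"): -1, ("R", "A"): -2,
        ("A", "A"): 5, ("N", "N"): 7, ("R", "K"): 3,
    }

    def score_ungapped(seq1: str, seq2: str) -> int:
        """Sum the substitution score of every aligned column."""
        assert len(seq1) == len(seq2)
        return sum(PAIR_SCORES[(a, b)] for a, b in zip(seq1, seq2))

    print(score_ungapped("AKRANR", "KAAANK"))  # (-1) + (-1) + (-2) + 5 + 7 + 3 = 11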

  6. Scoring Matrices in Practice • Some choices for substitution scores are now common, largely due to convention • Most commonly used Amino-Acid substitution matrices: • PAM (Percent Accepted Mutation) • BLOSUM (Blocks Amino Acid Substitution Matrix) BLOSUM50 Scoring Matrix

  7. Gap Penalties • Inclusion of gaps and gap penalties is necessary to obtain the best alignment • If gap penalty is too high, gaps will never appear in the alignment AATGCTGC ATGCTGCA • If gap penalty is too low, gaps will appear everywhere in the alignment AATGCTGC---- A----TGCTGCA

  8. Gap Penalties (Cont’d) Separate penalties for gap opening and gap extension Opening: the cost to introduce a gap Extension: the cost to elongate a gap Opening a gap is costly, while extending a gap is cheap Unlike substitution matrices, no gap penalty values are commonly agreed upon
LETVGY
W----L
   -5 -1 -1 -1 (one opening penalty, then three extension penalties)
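A minimal sketch of this affine gap cost, assuming the penalty values shown on the slide (open = -5, extend = -1):

    # Affine gap penalty: the first position of a gap pays the opening
    # cost; every further position pays the cheaper extension cost.
    GAP_OPEN = -5     # assumed from the slide's example
    GAP_EXTEND = -1

    def affine_gap_score(gap_length: int) -> int:
        """Total penalty for a single gap of the given length."""
        if gap_length == 0:
            return 0
        return GAP_OPEN + (gap_length - 1) * GAP_EXTEND

    print(affine_gap_score(4))  # -5 + 3*(-1) = -8, as in the W----L example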

  9. Parametric Sequence Alignment • For a given pair of strings, the alignment problem is solved for effectively all possible choices of the scoring parameters and penalties (exhaustive search). • A correct alignment is then used to find the best parameter values. • However, this method is very inefficient if the number of parameters is large.

  10. Inverse Parametric Alignment • INPUT: an alignment of a pair of strings. • OUTPUT: a choice of parameters that makes the input alignment an optimal-scoring alignment of its strings. • From a machine-learning point of view, this learns the parameters for optimal alignment from training examples of correct alignments.

  11. Inverse Optimal Alignment Definition (Inverse Optimal Alignment): INPUT: alignments A1, A2, …, Ak of strings, and an alignment scoring function fw with parameters w = (w1, w2, …, wp). OUTPUT: values x = (x1, x2, …, xp) for w. GOAL: each input alignment is an optimal alignment of its strings under fx. ATTENTION: This problem may have no solution!

  12. Inverse Near-Optimal Alignment • When minimizing the scoring function f, we say an alignment A of a set of strings S is $\delta$-optimal, for some $\delta \ge 0$, if: $f(A) \le (1 + \delta)\, f(A^*)$ where $A^*$ is the optimal alignment of S under f.

  13. Inverse Near-Optimal Alignment (Cont’d) • Definition (Inverse Near-Optimal Alignment): INPUT: alignments Ai, a scoring function f, and a real number $\delta \ge 0$. OUTPUT: find parameter values x. GOAL: each alignment Ai is $\delta$-optimal under fx. The smallest possible $\delta$ can be found to within accuracy $\epsilon$ by binary search, using $O(\log(1/\epsilon))$ calls to the algorithm.
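The binary search is ordinary bisection. A sketch, assuming a hypothetical oracle is_feasible(delta) that runs Inverse Near-Optimal Alignment for one fixed delta and reports whether suitable parameter values exist:

    # Bisection for the smallest feasible delta, to within accuracy eps.
    # is_feasible is a hypothetical oracle solving the inverse problem
    # for a single fixed delta.
    def smallest_delta(is_feasible, hi: float, eps: float) -> float:
        lo = 0.0
        while hi - lo > eps:           # O(log((hi - lo) / eps)) oracle calls
            mid = (lo + hi) / 2
            if is_feasible(mid):
                hi = mid               # feasible: try a smaller delta
            else:
                lo = mid               # infeasible: delta must grow
        return hi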

  14. Inverse Unique-Optimal Alignment • When minimizing the scoring function f, we say an alignment A of a set of strings S is $\Delta$-unique, for some $\Delta > 0$, if: $f(B) \ge f(A) + \Delta$ for every alignment B of S other than A.

  15. Inverse Unique-Optimal Alignment (Cont’d) • Definition (Inverse Unique-Optimal Alignment): INPUT: alignments Ai, a scoring function f, and a real number $\Delta > 0$. OUTPUT: parameter values x. GOAL: each alignment Ai is $\Delta$-unique under fx. The largest possible $\Delta$ can be found to within accuracy $\epsilon$ by binary search, using $O(\log(1/\epsilon))$ calls to the algorithm.

  16. Let There Be Linear Functions … • For most standard forms of alignment, the alignment scoring function f is a linear function of its parameters: $f_w(A) = w_1 f_1(A) + w_2 f_2(A) + \cdots + w_p f_p(A)$ where each $f_i$ measures one of the features of A.

  17. Let There Be Linear Functions … (Example I) • With fixed substitution scores, and the two parameters gap open ($\gamma$) and gap extension ($\lambda$) penalties, p = 2 and: $f_{\gamma,\lambda}(A) = \gamma\, g(A) + \lambda\, \ell(A) + s(A)$ where: g(A) = number of gaps, $\ell(A)$ = total length of gaps, s(A) = total score of all substitutions
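A minimal sketch of extracting these features from a two-row alignment; sub_scores is assumed to be a dict mapping letter pairs to substitution scores, and a gap is a maximal run of '-' in either row:

    import re

    def alignment_features(row1: str, row2: str, sub_scores):
        """Return (g, l, s): the gap count, total gap length, and total
        substitution score of an alignment given as two padded rows."""
        runs = re.findall(r"-+", row1) + re.findall(r"-+", row2)
        g = len(runs)                        # number of maximal gap runs
        l = sum(len(run) for run in runs)    # total gapped positions
        s = sum(sub_scores[(a, b)]
                for a, b in zip(row1, row2) if a != "-" and b != "-")
        return g, l, s

    def linear_score(row1, row2, gamma, lam, sub_scores):
        """f(A) = gamma * g(A) + lambda * l(A) + s(A)."""
        g, l, s = alignment_features(row1, row2, sub_scores)
        return gamma * g + lam * l + s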

  18. Let There Be Linear Functions … (Example II) • With no parameters fixed, the substitution scores are also among our parameters and: $f_w(A) = \gamma\, g(A) + \lambda\, \ell(A) + \sum_{a,b} w_{ab}\, h_{ab}(A)$ where: a and b range over all letters in the alphabet, and $h_{ab}(A)$ = # of substitutions in A replacing a by b

  19. Linear Programming Problem • INPUT: variables x = (x1, x2, …, xn) a system of linear inequalities in x a linear objective function in x OUTPUT: assignment of real values to x GOAL: satisfy all the inequalities and minimize the objective In general, the program can be infeasible, bounded, or unbounded.
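As a quick illustration, here is a tiny linear program solved with SciPy; linprog minimizes c·x subject to A_ub·x ≤ b_ub and variable bounds (the numbers are illustrative, not from the slides):

    from scipy.optimize import linprog

    # Minimize x0 + x1 subject to x0 + x1 >= 1, x0 <= x1, 0 <= xi <= 10.
    result = linprog(
        c=[1, 1],
        A_ub=[[-1, -1], [1, -1]],   # constraints rewritten as A_ub @ x <= b_ub
        b_ub=[-1, 0],
        bounds=[(0, 10), (0, 10)],
    )
    print(result.status, result.x)  # status 0 means an optimum was found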

  20. Reducing The Inverse Alignment Problems To Linear Programming • Inverse Optimal Alignment: For each Ai and every alignment B of the set Si, we have an inequality: $f_x(A_i) \le f_x(B)$ or equivalently: $f_x(A_i) - f_x(B) \le 0$. The number of alignments of a pair of strings of length n is exponential in n, hence a total of exponentially many inequalities in p variables. Also, there is no specific objective function.
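Because fx is linear in x (slide 16), each such inequality is a linear constraint whose coefficients are feature differences. A sketch, assuming a hypothetical features(A) returning the vector (f1(A), …, fp(A)):

    import numpy as np

    def inequality_row(features, A_i, B):
        """LP row encoding f_x(A_i) - f_x(B) <= 0. Since f_x(A) is the
        dot product x . features(A), the row is the feature difference."""
        return np.asarray(features(A_i)) - np.asarray(features(B))

    # Stacking one such row per (A_i, B) pair gives A_ub with b_ub = 0:
    # any x with A_ub @ x <= 0 makes every A_i optimal.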

  21. Separation Theorem • Some definitions: • Polyhedron: intersection of half-spaces • Rational polyhedron: described by inequalities with only rational coefficients • Bounded polyhedron: no infinite rays

  22. Separation Theorem (Cont’d) • Optimization Problem for a rational polyhedron P in $\mathbb{R}^n$: INPUT: rational coefficients c specifying the objective. OUTPUT: a point x in P minimizing cx, or a determination that P is empty. • Separation Problem for P: INPUT: a point y in $\mathbb{R}^n$. OUTPUT: rational coefficients w and b such that $wx \le b$ for all points x in P but $wy > b$ (a violated inequality), or a determination that y is in P.

  23. Separation Theorem (Cont’d) • Theorem (Equivalence of Separation and Optimization): The optimization problem on a bounded rational polyhedron can be solved in polynomial time if and only if the separation problem can be solved in polynomial time. That is, for bounded rational polyhedra: Optimization ≡ Separation

  24. Cutting-Plane Algorithm 1. Start with a small subset S of the set L of all inequalities 2. Compute an optimal solution x under the constraints in S 3. Call the separation algorithm for L on x 4. If x is determined to satisfy L, output it and halt; otherwise, add the violated inequality to S and loop back to step (2)
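A skeleton of this loop for inverse optimal alignment, assuming three hypothetical helpers: solve_lp(constraints) returns parameters satisfying the current inequalities, best_alignment(x, S_i) is the separation oracle (an ordinary alignment algorithm run with parameters x), and score(x, A) evaluates f_x(A):

    def cutting_plane(training, solve_lp, best_alignment, score, tol=1e-9):
        """training is a list of (A_i, S_i) pairs; returns parameters x
        under which every A_i is optimal, generating violated
        inequalities lazily instead of enumerating all of L."""
        constraints = []                      # the small subset S of L
        while True:
            x = solve_lp(constraints)         # step 2: optimize over S
            violated = False
            for A_i, S_i in training:         # step 3: separation oracle
                B = best_alignment(x, S_i)
                if score(x, A_i) > score(x, B) + tol:
                    # step 4: record the violated inequality
                    # f_x(A_i) <= f_x(B), then re-solve
                    constraints.append((A_i, B))
                    violated = True
            if not violated:
                return x                      # x satisfies all of L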

  25. Complexity of Inverse Alignment • Theorem: Inverse Optimal and Near-Optimal Alignment can be solved in polynomial time for any form of alignment in which: 1. the alignment scoring function is linear, 2. the parameter values can be bounded, and 3. for any fixed parameter choice, an optimal alignment can be found in polynomial time. Inverse Unique-Optimal Alignment can be solved in polynomial time if, in addition: 3'. for any fixed parameter choice, a next-best alignment can be found in polynomial time.

  26. Application to Global Alignment • Initializing the Cutting-Plane Algorithm: We consider the problem in two cases: • All scores and penalties varying: Then the parameter space can be made bounded. • Substitution costs fixed: Then, in O(1) time, one can find (if they exist) either (1) a bounding inequality, or (2) two inequalities, one a downward half-space and the other an upward half-space, with the slope of the former less than the slope of the latter.

  27. Application to Global Alignment (Cont’d) • Choosing an Objective Function: Again we consider two different cases: • Fixed substitution scores: in this case we choose the following objective: • Varying substitution scores: in this case we choose the following objective: where s is the minimum of all non-identity substitution scores and i is the maximum of all identity scores.

  28. Application to Global Alignment (Cont’d) • For every objective, two extreme solutions exist: xlarge and xsmall. Then for every $\alpha \in [0, 1]$ we have a corresponding solution $x_\alpha = \alpha\, x_{\mathrm{large}} + (1 - \alpha)\, x_{\mathrm{small}}$. The midpoint x1/2 is expected to generalize better to alignments outside the training set.

  29. Computational Results

  30. Computational Results (Cont’d)

  31. Computational Results (Cont’d)

  32. CONTRAlign • What: an extensible and fully automatic parameter-learning framework for protein pairwise sequence alignment • How: pair conditional random fields (pair-CRFs) • Who:

  33. Pair-HMMs for Sequence Alignment

  34. Pair-HMMs … (Cont’d) • If w is the vector of log transition and emission probabilities of the pair-HMM, then: $P(a, x, y; w) = \exp(w^{\top} f(a, x, y))$ where: $f(a, x, y)$ is the vector counting how many times each transition and emission is used in the alignment a of x and y.
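A toy numeric check of this identity (the model and numbers are illustrative, not from the paper): multiplying the probabilities actually used equals exponentiating the dot product of their logs with the usage counts:

    import numpy as np

    probs = np.array([0.6, 0.2])   # hypothetical transition/emission probabilities
    w = np.log(probs)              # w = log-probabilities
    f = np.array([2, 1])           # how often each is used in some alignment a

    direct = probs[0] ** 2 * probs[1]      # product form of P(a, x, y)
    log_linear = np.exp(w @ f)             # exp(w . f(a, x, y))
    print(np.isclose(direct, log_linear))  # True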

  35. Training Pair-HMMs • INPUT: a set of training examples (sequence pairs with their correct alignments) • OUTPUT: the feature vector w • METHOD: maximizing the joint log-likelihood of the data and alignments, $\sum_i \log P(a^{(i)}, x^{(i)}, y^{(i)}; w)$, under constraints on w (its entries must remain valid log-probabilities).

  36. Generating Alignments Using Pair-HMMs • Viterbi Algorithm on a Pair-HMM: INPUT: two sequences x and y OUTPUT: the alignment a of x and y that maximizes P(a|x,y;w) RUNNING TIME: O(|x|·|y|)
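For intuition, the same O(|x|·|y|) dynamic program in its simplest form: a Needleman-Wunsch-style global alignment where fixed match/mismatch/gap scores stand in for the pair-HMM's log-probabilities (a sketch, not the paper's implementation):

    def viterbi_like_align(x: str, y: str, match=1, mismatch=-1, gap=-2):
        """Fill an (|x|+1) x (|y|+1) table of best prefix-alignment scores;
        each cell maximizes over substitute/delete/insert moves, just as
        Viterbi maximizes over pair-HMM states. Returns the best score."""
        n, m = len(x), len(y)
        V = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            V[i][0] = i * gap
        for j in range(1, m + 1):
            V[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = match if x[i - 1] == y[j - 1] else mismatch
                V[i][j] = max(V[i - 1][j - 1] + sub,   # substitution
                              V[i - 1][j] + gap,       # gap in y
                              V[i][j - 1] + gap)       # gap in x
        return V[n][m]

    print(viterbi_like_align("AATGCTGC", "ATGCTGCA"))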

  37. Pair-CRFs • Directly model the conditional probabilities: $P(a \mid x, y; w) = \frac{\exp(w^{\top} f(a, x, y))}{\sum_{a'} \exp(w^{\top} f(a', x, y))}$ where w is a real-valued parameter vector not necessarily corresponding to log-probabilities

  38. Training Pair-CRFs • INPUT: a set of training examples • OUTPUT: real-valued feature vector w • METHOD: maximizing the conditional log-likelihood of the data (discriminative/conditional learning): $\sum_i \log P(a^{(i)} \mid x^{(i)}, y^{(i)}; w) - \frac{\lVert w \rVert^2}{2\sigma^2}$ where the second term is a Gaussian prior on w, to prevent over-fitting.
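A toy sketch of this objective for one example, with the candidate alignments enumerated explicitly as (hypothetical) feature vectors; a real implementation computes the normalizer by dynamic programming rather than enumeration:

    import numpy as np

    def conditional_log_likelihood(w, correct_f, candidate_fs, sigma=1.0):
        """log P(correct | x, y; w) minus the Gaussian prior penalty.
        candidate_fs lists feature vectors of all candidate alignments."""
        scores = np.array([f @ w for f in candidate_fs])
        log_z = np.log(np.sum(np.exp(scores)))     # log of the normalizer
        return correct_f @ w - log_z - (w @ w) / (2 * sigma ** 2)

    # Hypothetical 2-feature example: the correct alignment vs. two rivals.
    w = np.array([0.5, -0.3])
    correct = np.array([3.0, 1.0])
    rivals = [np.array([2.0, 2.0]), np.array([1.0, 4.0])]
    print(conditional_log_likelihood(w, correct, [correct] + rivals))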

  39. Properties of Pair-CRFs • Far weaker independence assumptions than pair-HMMs • Capable of utilizing complex, non-independent feature sets • Directly optimizes predictive ability, ignoring P(x,y), the model that generates the input sequences

  40. Choice of Model Topology in CONTRAlign • Some possible model topologies: CONTRAlign_DOUBLE-AFFINE and CONTRAlign_LOCAL

  41. Choice of Feature Sets in CONTRAlign • Some possible feature sets to utilize: 1. Hydropathy-based gap context features (CONTRAlign_HYDROPATHY) 2. External information: 2.1. Secondary structure (CONTRAlign_DSSP) 2.2. Solvent accessibility (CONTRAlign_ACCESSIBILITY)

  42. Results: Comparison of Model Topologies and Feature Sets

  43. Results: Comparison to Modern Sequence Alignment Tools

  44. Results: Alignment Accuracy in the “Twilight Zone” For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali) in that order, and the colored bar indicates the accuracy for CONTRAlign.

  45. Questions?

  46. Thank You!
