Computational Molecular Biology Multiple Sequence Alignment
Sequence Alignment
• Problem Definition:
  • Given: two DNA or protein sequences
  • Find: the best match between them
• What is an Alignment:
  • Given: two strings S and S’
  • Goal: make the lengths of S and S’ equal by inserting spaces (written --, sometimes denoted ∆) into these strings
My T. Thai mythai@cise.ufl.edu
Matches, Mismatches and Indels
• Match: two aligned, identical characters in an alignment
• Mismatch: two aligned, unequal characters
• Indel: a character aligned with a space
• Example:
  A  A  C  T  A  C  T  -- C  C  T  A  A  C  A  C  T  -- --
  -- -- C  T  C  C  T  A  C  C  T  -- -- T  A  C  T  T  T
  10 matches, 2 mismatches, 7 indels
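The three column types can be counted mechanically. The following is a minimal sketch, not part of the original slides; the function name `classify_columns` and the single-character gap symbol `-` (standing in for the slide's `--`) are assumptions made for illustration. It is applied to the example alignment above.

```python
def classify_columns(row_a, row_b, gap="-"):
    """Count (matches, mismatches, indels) for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = indels = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column of two spaces is not a valid alignment column")
        if a == gap or b == gap:
            indels += 1          # a character aligned with a space
        elif a == b:
            matches += 1         # two aligned, identical characters
        else:
            mismatches += 1      # two aligned, unequal characters
    return matches, mismatches, indels


# The example alignment from the slide, with "-" used as the space symbol.
top = "AACTACT-CCTAACACT--"
bottom = "--CTCCTACCT--TACTTT"
print(classify_columns(top, bottom))  # (10, 2, 7)
```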
Basic Algorithmic Problem
• Find the alignment of the two strings that:
  • maximizes m, where m = (# matches – # mismatches – # indels)
  • or minimizes m, where m is the SP-score of the alignment
• The optimal value of m defines the similarity of the two strings; such an alignment is called an optimal global alignment
• Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
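For two strings, the maximization form of this problem can be solved by the standard dynamic program for global alignment (Needleman-Wunsch). The sketch below is an illustration under the slide's scoring of +1 per match and -1 per mismatch or indel; the function name `global_alignment_score` is an assumption, and only the optimal value is returned, not the alignment itself.

```python
def global_alignment_score(s, t):
    """Maximum of (#matches - #mismatches - #indels) over all global alignments."""
    n, m = len(s), len(t)
    # best[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -i                      # s[:i] aligned against i spaces
    for j in range(1, m + 1):
        best[0][j] = -j                      # t[:j] aligned against j spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 1 if s[i - 1] == t[j - 1] else -1
            best[i][j] = max(
                best[i - 1][j - 1] + pair,   # match or mismatch
                best[i - 1][j] - 1,          # s[i-1] aligned with a space
                best[i][j - 1] - 1,          # t[j-1] aligned with a space
            )
    return best[n][m]


print(global_alignment_score("ACGT", "AGT"))  # 2
```

The table has (n+1)(m+1) entries, so time and space are both O(nm) for strings of lengths n and m.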
Multiple Sequence Alignment
• Problem Definition:
  • Similar to the sequence alignment problem, but the input has more than two strings
• Challenges:
  • NP-hard and MAX-SNP-hard
  • Approximation guarantee factor: 2 – 2/k, where k is the number of input sequences
  • More work is needed to reduce the time and space complexity
Sum of Pairs Score (SP-Score)
• Given a finite alphabet Σ, extended with the symbol ∆, which denotes a space
• Consider k sequences over Σ that we want to align. After an alignment, each sequence has length l
• A score d is assigned to each pair of letters
SP-Score
• The SP-Score of an alignment A is defined as follows:
  • Consider a matrix of l columns and k rows, where the rows represent the sequences and the columns represent the letters
  • The SP-Score is the sum of the scores of all columns
  • The score of each column is the sum of the scores of all distinct unordered pairs of letters in the column
  • Equivalently, it is the sum of the pairwise sequence alignment values
• Goal: find an (optimal) alignment that minimizes the SP-Score
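Both views of the SP-Score (column by column, or as a sum over pairs of rows) can be checked on a small example. The sketch below is illustrative only: the pair score `unit_cost` (0 for identical symbols, 1 otherwise) is a hypothetical choice of d, since the slide's definition of d is not reproduced here, and `-` stands for the space symbol ∆.

```python
from itertools import combinations


def unit_cost(a, b):
    """Hypothetical pair score d: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def sp_score(rows, d=unit_cost):
    """Sum over columns of the scores of all unordered pairs in the column."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            total += d(a, b)
    return total


def sp_score_pairwise(rows, d=unit_cost):
    """The same value, viewed as a sum of pairwise alignment values."""
    return sum(
        sum(d(a, b) for a, b in zip(r1, r2))
        for r1, r2 in combinations(rows, 2)
    )


alignment = ["AC-T", "A-GT", "ACGT"]
print(sp_score(alignment), sp_score_pairwise(alignment))  # 4 4
```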
Proving that MSA with a Metric SP-Score is NP-hard
Some Notations
Some Basic Properties
• Lemma 1: Let $s_1$, $s_2$ be two sequences over Σ such that $l_1 = |s_1|$, $l_2 = |s_2|$, $l_2 \ge l_1$, and there are m symbols of $s_1$ that are not in $s_2$. Then every alignment of the set $\{s_1, s_2\}$ has at least $m + l_2 - l_1$ mismatches
The construction
• Reduce vertex cover (or node cover) to MSA.
• Vertex cover:
  • Instance: a graph G = (V, E) and an integer k ≤ |V|
  • Question: is there a vertex cover $V_1$ of G of size k or less?
• MSA:
  • Instance: a set $S = \{s_1, …, s_n\}$ of finite sequences over a fixed alphabet Σ, an SP-score, and an integer C
  • Question: is there a multiple alignment of the sequences in S that is of value C or less?
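The vertex cover decision problem used as the source of the reduction can be stated in a few lines of code. This is a brute-force sketch for small graphs only, written for illustration; the function names are assumptions, and the exponential search merely makes the "Instance / Question" format concrete.

```python
from itertools import combinations


def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)


def has_vertex_cover(vertices, edges, k):
    """Decide by exhaustive search whether a vertex cover of size k or less exists."""
    vertices = list(vertices)
    for size in range(min(k, len(vertices)) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, candidate):
                return True
    return False


# A path on four vertices: {1, 2} is a cover, but no single vertex is.
path_edges = [(0, 1), (1, 2), (2, 3)]
print(has_vertex_cover(range(4), path_edges, 1))  # False
print(has_vertex_cover(range(4), path_edges, 2))  # True
```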
SP-Score (alphabet of size 6)
The Reduction
• T is a set of $C_2$ sequences t and X contains $C_1$ sequences x(k), where $C_1$ and $C_2$ will be determined later
An Example
Intuition
• By the above construction, an optimal alignment A of S is obtained when A satisfies certain properties (such an alignment is called a standard alignment)
• The value of a standard alignment is bounded by a given threshold C only when G has a vertex cover of size k
• How to obtain this:
  • Force the d’s of the test sequences to be aligned with the b’s of the edge sequences
  • Only one b of each edge sequence can be aligned to a d
  • The number of such alignments determines the value of the alignment
Standard Alignment
• Let $U_S$ and $U_{S,X}$ denote the upper bounds of $D(A_S)$ and $D(A_{S,X})$, respectively
• By Corollary 8 and Lemma 9, the standard alignment has value not greater than $D_{SD} + U_S + U_{S,X}$
  • where $D_{SD} = D(A_X) + D(A_T) + D(A_{X,T}) + D(A_{S,T})$ over a standard alignment A
• Now, letting $C_1 > U_S$ and $C_2 > U_S + U_{S,X}$, we can prove that an optimal alignment must be a standard one
• Show that MSA is NP-hard for every scoring matrix in a broad class $\mathcal{M}$
• Show that there is a scoring matrix $M_0$ such that MSA for $M_0$ is MAX-SNP-hard
Interesting Observation
• Brute-force computation shows that optimal MSAs contain very few gaps
• This suggests studying gap limitations:
  • Put an upper bound on the number of gaps one can insert during the alignment
• Special cases:
  • Gap-0: no gaps are allowed, but the strings can be shifted for an alignment (spaces are inserted only at the beginning or at the end of a string)
  • Gap-0-1: a gap-0 alignment in which the gap at the beginning or at the end of each string is exactly one space
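A gap-0-1 alignment is fully described by one bit per string: the single space goes either at the end (the string stays in place) or at the beginning (the string shifts right). The sketch below enumerates all such choices for a small input. It is an illustration only and assumes equal-length input strings, a hypothetical unit pair score, and `-` as the space symbol; none of these names come from the slides.

```python
from itertools import combinations, product


def unit_cost(a, b):
    """Hypothetical pair score: 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def sp_score(rows, d=unit_cost):
    """Sum over columns of the scores of all unordered pairs in the column."""
    return sum(d(a, b) for column in zip(*rows) for a, b in combinations(column, 2))


def best_gap_0_1_alignment(seqs, d=unit_cost, gap="-"):
    """Try every way of placing one space before or after each sequence."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("this sketch assumes equal-length sequences")
    best_rows, best_value = None, None
    for shifts in product((0, 1), repeat=len(seqs)):
        # shift 0: space appended at the end; shift 1: sequence shifted right
        rows = [gap + s if shift else s + gap for s, shift in zip(seqs, shifts)]
        value = sp_score(rows, d)
        if best_value is None or value < best_value:
            best_rows, best_value = rows, value
    return best_rows, best_value


rows, value = best_gap_0_1_alignment(["TATAT", "ATATA"])
print(rows, value)  # ['TATAT-', '-ATATA'] 2
```

The search visits 2^k shift vectors for k strings, which is only practical for tiny inputs; the hardness results on the following slides explain why no efficient exact method is expected.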
Problem Definition
• Given a finite alphabet
• Scoring matrix
  • For i, j > 0, $s_{i,j}$ represents the penalty for aligning $a_i$ with $a_j$
  • For i > 0, $s_{0,i}$ and $s_{i,0}$ are called indel penalties
• Gap opening penalties (in addition to the indel penalties) apply when aligning $a_i$ with the first or last ∆ in a string of ∆’s
Generic Scoring Matrix
• Here Σ = {A, T}, x, y, z are fixed nonnegative numbers, and $u > \max\{0, v_A, v_T\}$ holds
• Let $\mathcal{M}_2$ be the class of all scoring matrices that contain a generic submatrix M
• Let $\mathcal{M}_1$ be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with $z > v_T$
• Let $\mathcal{M}$ be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with $y > u$ and $z > v_T$
• Theorem 1:
  • (a) The gap-0-1 multiple alignment problem is NP-hard for every scoring matrix M in $\mathcal{M}_2$
  • (b) The gap-0 multiple alignment problem is NP-hard for every M in $\mathcal{M}_1$
  • (c) The multiple alignment problem is NP-hard for every M in $\mathcal{M}$
• Note that $\mathcal{M}$ is quite broad and covers most scoring schemes used in biological applications.
Reduction
• Reduce from MAX-CUT-B:
  • Given G = (V, E), where k = |V| and each vertex has degree at most B
  • Find a partition of V into two disjoint sets that maximizes the number of edges crossing between the two sets
• Given a graph G = (V, E) with k vertices $v_0, …, v_{k-1}$ and l edges $e_0, …, e_{l-1}$, we construct a set of $k^2$ sequences $t_0, …, t_{k^2-1}$ as follows:
Reduction
• For each vertex $v_i$, construct a sequence $t_i$ such that
  • for each edge $e_m = \{v_h, v_i\}$ incident at $v_i$, h < i, n < $k^5$, certain positions $t_{i,j}$ are set, where $t_{i,j}$ represents the character at the j-th position in $t_i$
  • For all other j, let $t_{i,j}$ = T
• For i ≥ k, set $t_i$ = T T T … T with length $k^{12} l$
An Example
Proof of Theorem 1(a)
• We will show that a gap-0-1 alignment partitions V into two disjoint subsets $V_0$ and $V_1$:
  • $V_0$: all vertices $v_i$ such that $t_i$ remains in place (a space is appended at the end)
  • $V_1$: all vertices $v_i$ such that $t_i$ shifts to the right
• Thus, from the alignment we can find the cut and, vice versa, from the cut we can find the alignment
• What remains is to prove that, if k is sufficiently large, the optimal gap-0-1 alignment yields a partition of V with maximum edge cut.
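The correspondence between a gap-0-1 alignment and a cut can be written down directly. In the sketch below, an alignment of the k vertex sequences is represented only by its shift vector (0 means the sequence remains in place, 1 means it shifts to the right); the function names and this encoding are assumptions made for illustration, not the notation of the proof.

```python
def partition_from_shifts(shifts):
    """Map a gap-0-1 shift vector to the partition (V0, V1) of vertex indices."""
    v0 = {i for i, s in enumerate(shifts) if s == 0}   # t_i remains in place
    v1 = {i for i, s in enumerate(shifts) if s == 1}   # t_i shifts to the right
    return v0, v1


def shifts_from_partition(k, v1):
    """Inverse direction: build the shift vector from the set V1."""
    return [1 if i in v1 else 0 for i in range(k)]


def cut_size(edges, v0, v1):
    """Number of edges with one endpoint in V0 and the other in V1."""
    return sum(
        1 for h, i in edges
        if (h in v0 and i in v1) or (h in v1 and i in v0)
    )


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
v0, v1 = partition_from_shifts([0, 1, 0, 1])
print(cut_size(cycle, v0, v1))  # 4
```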
Proof of Theorem 1(a)
• Let c denote the cut determined by the alignment A
• Consider all the sequences $t_i$ after the alignment A:
  • The total indel penalty is of order $O(k^4)$ (it appears in the first and last columns of the SP-score matrix)
  • The total number of mismatches before the alignment is $3k^5 l (k^2 - 1)$
  • To maximally reduce this number:
    • One A-A match removes 2 A-T mismatches
    • For each edge $(v_h, v_i)$, if its endpoints are in different subsets (of the partition), then a total of $k^5$ A-A matches between sequences $t_h$ and $t_i$ are created
    • No other A-T mismatches can be eliminated
• Thus the SP-score is:
  • $k^{12} l \, v_T \, k^2 (k^2 - 1)/2 + 3k^5 l (u - v_T)(k^2 - 1) - c \, k^5 (2u - v_A - v_T) + O(k^4)$
Theorem 2
• Consider the following scoring matrix $M_0$ for the alphabet $\Sigma_0$ = {A, T, C}.
  • The gap-0-1 MSA problem is MAX-SNP-hard
  • The gap-0 MSA problem is MAX-SNP-hard
  • The MSA problem is MAX-SNP-hard
MAX-SNP-hard Proof
• To prove that a problem A’ is MAX-SNP-hard, we need to L-reduce a MAX-SNP-hard problem A to A’
• L-reduction:
  • There are two polynomial-time algorithms f, g and constants a, b > 0 such that for each instance I of A:
    • f produces an instance I’ = f(I) of A’ such that OPT(I’) ≤ a·OPT(I)
    • Given any solution of I’ with cost c’, g produces a solution of I with cost c such that |c – OPT(I)| ≤ b·|c’ – OPT(I’)|
Proof of Theorem 2
• To prove that MSA (with $M_0$, the scoring matrix mentioned before) is MAX-SNP-hard:
  • L-reduce MAX-CUT-B to another optimization problem, called A’, which in turn L-reduces to a scaled version of MSA
• Problem A’:
  • Given a graph G = (V, E) with degree bounded by B. For every partition $P = \{V_0, V_1\}$ of V, let $c_P$ be the size of the cut determined by P.
  • Find the partition P of V that minimizes $d_P = 3|E| - 2c_P$
Show A’ is MAX-SNP-hard
• Let f and g be identity functions
• Set a = 3B and b = 2; the two properties of the L-reduction then follow easily, since:
  • $c_P \ge |E|/B$ and $d_P = 3|E| - 2c_P \le 3|E|$
  • Increasing $c_P$ by 1 decreases $d_P$ by 2
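Because $d_P = 3|E| - 2c_P$ decreases by 2 whenever the cut grows by 1, minimizing $d_P$ and maximizing $c_P$ pick the same partitions. The brute-force sketch below makes this concrete on small graphs; it is for illustration only, and the function names and the 0/1 side-vector encoding of a partition are assumptions.

```python
from itertools import product


def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides (side[v] is 0 or 1)."""
    return sum(1 for u, v in edges if side[u] != side[v])


def solve_a_prime(k, edges):
    """Exhaustively minimize d_P = 3|E| - 2 c_P over all partitions of k vertices."""
    best = None
    for side in product((0, 1), repeat=k):
        c = cut_size(edges, side)
        d = 3 * len(edges) - 2 * c
        if best is None or d < best[2]:
            best = (side, c, d)
    return best  # (side vector, c_P, d_P)


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
side, c, d = solve_a_prime(4, cycle)
print(c, d)  # 4 4
```

For the 4-cycle (degree bound B = 2), the optimum of A’ is 4 and the maximum cut is 4, so OPT(I’) ≤ a·OPT(I) holds with a = 3B = 6.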
Show that A’ L-reduces to scaled MSA
• The construction is similar to the one above
• Similar to the proof of Theorem 1, we obtain the optimal SP-score
• If the SP-score is scaled by a factor of $k^{-5/2}$ for an MSA of k sequences, then A’ L-reduces to MSA.
How do genetic algorithms (GAs) work?
• Create a population of random solutions
• Use natural selection:
  • crossover and mutation to improve the solutions
• Stop when a stopping criterion is met, such as:
  • No improvement in the fitness function
  • The improvement is less than a given threshold
  • The number of iterations exceeds a given threshold
Terms and Definitions
• Chromosomes: potential solutions
• Population: a collection of chromosomes
• Generations: successive populations
Terms and Definitions
• Crossover: exchange of genes between two chromosomes
• Mutation: random change of one or more genes in a chromosome
• Elitism: copy the best solutions without doing crossover or mutation
Terms and Definitions
• Offspring: a new chromosome created by crossover between two parent chromosomes
• Fitness function: measures how “good” a chromosome is
• Encoding scheme: how do we represent every chromosome/gene?
  • Binary, combination, syntax trees
Why are GAs attractive?
• No particular algorithm is needed to solve the given problem; only a fitness function is required to evaluate the quality of the solutions.
• GAs are implicitly parallel and can be implemented efficiently on powerful parallel computers for demanding large-scale problems.
Basic Outline of a GA
• Start with an initial population of random chromosomes, called the first generation
• Evaluate the fitness of each chromosome in the population
• Create a new population:
  • Select two parent chromosomes from the population according to their fitness
  • Crossover (with some probability) the parents to form new offspring
  • Mutate (with some probability) the new offspring
  • Place the new offspring in the new population
• Repeat the process until a satisfactory solution evolves
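The outline above can be sketched in a few dozen lines. The code below is one possible reading, not the slides' implementation: the toy fitness `one_max` (number of 1 bits in a binary chromosome), tournament selection, single-point crossover, the elitism step, and all parameter values are assumptions chosen to keep the example small.

```python
import random


def one_max(chromosome):
    """Toy fitness (an assumption for this sketch): the number of 1 bits."""
    return sum(chromosome)


def run_ga(fitness, length=20, pop_size=30, generations=40,
           p_crossover=0.9, p_mutation=0.02, seed=0):
    rng = random.Random(seed)
    # First generation: random binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        new_population = [best[:]]                  # elitism: keep the best as-is
        while len(new_population) < pop_size:
            # Select two parents according to fitness (tournament of size 3).
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            child = parent_a[:]
            if rng.random() < p_crossover:          # single-point crossover
                cut = rng.randrange(1, length)
                child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            new_population.append(child)
        population = new_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = run_ga(one_max)
print(one_max(best), history[0], history[-1])
```

Because the best chromosome is copied unchanged into each new generation, the recorded best fitness in `history` never decreases. A fixed number of generations is used here as the stopping criterion.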
Operations
• Mutation Operation:
  • Modify a single parent
  • Try to avoid local minima
Let's see some running examples
• Minimum of a function: http://cs.felk.cvut.cz/~xobitko/ga/example_f.html
• Elitism: http://cs.felk.cvut.cz/~xobitko/ga/params.html
• The travelling salesman problem: http://cs.felk.cvut.cz/~xobitko/ga/tspexample.html
Multiple Sequence Alignment
• A fitness function is used to compare the different alignments
  • Based on the number of matching symbols and the number and size of gaps
  • Also called the cost function
  • Different weights for different types of matches
  • Gap costs
• The fitness function can be simple and count the total number of matching symbols
• It can also be complicated and consider the type of matching symbols, location in the sequence, neighboring symbols, etc.
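A simple fitness function of this kind can reward matching symbols and penalize both the number of gaps and their size. The sketch below is illustrative only: the weights `match_weight`, `gap_open`, and `gap_extend`, the gap symbol `-`, and the function names are assumptions, not values taken from the slides.

```python
from itertools import combinations


def count_gaps(row, gap="-"):
    """Return (number of gap runs, total number of gap symbols) in one row."""
    runs = total = 0
    previous_was_gap = False
    for symbol in row:
        if symbol == gap:
            total += 1
            if not previous_was_gap:
                runs += 1                     # a new gap starts here
        previous_was_gap = symbol == gap
    return runs, total


def alignment_fitness(rows, match_weight=1.0, gap_open=2.0,
                      gap_extend=0.5, gap="-"):
    """Weighted matching pairs minus penalties for gap count and gap size."""
    matches = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            if a == b and a != gap:           # two spaces do not count as a match
                matches += 1
    penalty = 0.0
    for row in rows:
        runs, total = count_gaps(row, gap)
        penalty += gap_open * runs + gap_extend * total
    return match_weight * matches - penalty


print(alignment_fitness(["AC-T", "A-GT", "ACGT"]))  # 3.0
```

In this example there are 8 matching pairs and two gaps of size one, each costing 2.0 + 0.5, which gives 8 - 5 = 3.0. Higher values mean better alignments, so a GA would maximize this function.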