A new approach to sequence comparison: Normalized sequence alignment
B87506029 李建鴻, R88725032 朱漢農, D87631003 饒瑞佶, D90921014 吳明龍
Abstract
• Local sequence alignment
• Input: sequences A and B, and a score table S
• Output: indices i1, j1, i2, j2 such that A[i1..j1] and B[i2..j2] have the best alignment score
• Example: 0 1 1 0 1 vs. 1 0 0 1 0
The Smith-Waterman algorithm
• Definition: local sequence alignment using dynamic programming.
• One of the most important techniques in computational molecular biology.
• Discards poorly conserved initial and terminal segments.
The Smith-Waterman algorithm (cont.)
• Flaws:
• Does not discard poorly conserved intermediate segments.
• Mosaic effect.
• Cannot answer the question: do two sequences share a fragment with more than 70% similarity?
• Normalized local alignment: reports the maximum degree of similarity.
Introduction
• Gene prediction (Human Genome Project):
• Gene prediction in the human genome often amounts to using related proteins from other species as clues for finding the exon-intron structure.
• Similarity: exons 85%, introns 35%.
• Local alignment (the Smith-Waterman algorithm) is used to find the most similar segments.
Shadow effect
• A biologically more important short alignment with high similarity will not be detected if there is a longer alignment with a higher score but lower similarity.
Mosaic effect
• Local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly conserved or even unrelated fragments.
Earlier fixes
• Goad and Kanehisa:
• Introduced alignment with minimal mismatch density.
• Did not lead to successful algorithms.
• Webb Miller: fix the problem at the post-processing stage.
• Zhang et al.:
• Decompose a local alignment into sub-alignments that avoid the mosaic effect.
• The approach may miss the alignments with the best degree of similarity if the Smith-Waterman algorithm missed them.
Earlier fixes (cont.)
• X-drop: a region within an alignment that scores below X.
• X-alignments: alignments that contain no X-drops.
• Expensive to compute in practice.
Another problem
• The Smith-Waterman algorithm cannot correctly find the most biologically adequate relative in a benchmark sample of different protein families.
• The algorithm does not take the length of the alignment into account.
• Remedy: normalize the alignment score by its length.
Normalized local alignment problem
• For substrings I and J, let $s(I,J)$ be their alignment score.
• Maximize $s(I,J)/(|I|+|J|)$ subject to $|I|+|J| \ge T$, where T is a threshold for the minimal overall length.
• With no restriction on the overall length, fractional programming can be used to develop fast algorithms, but the result is not biologically meaningful.
Normalized local alignment problem (cont.)
• A slightly different objective: maximize $s(I,J)/(|I|+|J|+L)$ for a given parameter L.
• Varying L controls the degree of normalization.
• Fractional programming can still be used for fast computation.
Parameter L
• If L = 0:
• a = A, b = A: NLA*1 = 1/2
• a = ACG..ACGT, b = ACG..ACGT, |a| = |b| = 100: NLA*2 = 100/200 = 1/2
• If L = 100:
• NLA*1 = 1/(2+100) = 1/102 ≈ 0.01
• NLA*2 = 100/(200+100) = 100/300 ≈ 0.33
• L cannot be too large.
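The arithmetic on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `normalized_score` and the tuple layout are assumptions made for this example, not part of the original slides.

```python
def normalized_score(score, len_a, len_b, L):
    """Alignment score divided by the total segment length plus the constant L."""
    return score / (len_a + len_b + L)

# (score, |a|, |b|) for the two cases on the slide
short_pair = (1, 1, 1)        # a = b = "A": one match
long_pair = (100, 100, 100)   # two identical 100-letter segments

for L in (0, 100):
    print(L, normalized_score(*short_pair, L), normalized_score(*long_pair, L))
```

With L = 0 both pairs get the same normalized score, so the single-letter match is as good as the long one; a positive L favors the longer alignment.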
Outline of this paper
• Formal definitions
• Dinkelbach's and Megiddo's methods, as used in our algorithms
• Description of the algorithms
• Discussion of implementation
• Conclusion
Normalized Local Alignment
• First, formulate the alignment problems.
• Let $a = a_1 a_2 \dots a_n$ and $b = b_1 b_2 \dots b_m$ be two sequences with $n \ge m$.
Alignment graph $G_{a,b}$
• Represents all possible alignments between a and b.
• Directed acyclic graph.
• Vertices: the $(n+1) \times (m+1)$ lattice points $(u,v)$ with $0 \le u \le n$ and $0 \le v \le m$.
Four types of arcs in $G_{a,b}$:
1. Horizontal arcs: $\{((u,v-1),(u,v)) \mid 0 \le u \le n,\ 0 < v \le m\}$
2. Vertical arcs: $\{((u-1,v),(u,v)) \mid 0 < u \le n,\ 0 \le v \le m\}$
3. Matching diagonal arcs: $\{((u-1,v-1),(u,v)) \mid a_u = b_v,\ 0 < u \le n,\ 0 < v \le m\}$
4. Mismatching diagonal arcs: $\{((u-1,v-1),(u,v)) \mid a_u \ne b_v,\ 0 < u \le n,\ 0 < v \le m\}$
Alignment path in $G_{a,b}$
• Performing the edit operations that correspond to the arcs of the path on $a_i \dots a_k$ yields $b_j \dots b_l$.
• Horizontal arc $((u,v-1),(u,v))$: insert $b_v$ after $a_u$.
• Vertical arc $((u-1,v),(u,v))$: delete $a_u$.
• Mismatching diagonal arc $((u-1,v-1),(u,v))$: substitute $b_v$ for $a_u$.
Example: a = ATTGT. The arcs of the path are applied from last to first; the string after each step is shown.
• ((4,6),(5,7)): ATTGT
• ((4,5),(4,6)): ATTGAT
• ((4,4),(4,5)): ATTGCAT
• ((4,3),(4,4)): ATTGACAT
• ((3,2),(4,3)): ATTGACAT
• ((2,1),(3,2)): ATGGACAT
• ((1,1),(2,1)): AGGACAT
• ((0,0),(1,1)): AGGACAT
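The example path can be checked mechanically. The sketch below classifies each arc of a path and counts matches, mismatches, and indels. The helper name `walk_path` is hypothetical, and the target string b = AGGACAT is read off the last line of the example rather than stated separately in the slides.

```python
def walk_path(a, b, path):
    """Classify each arc of an alignment path; return (matches, mismatches, indels)."""
    x = y = z = 0
    for (u0, v0), (u1, v1) in zip(path, path[1:]):
        step = (u1 - u0, v1 - v0)
        if step == (1, 1):
            # diagonal arc: compare a_u with b_v (1-based in the slides)
            if a[u1 - 1] == b[v1 - 1]:
                x += 1
            else:
                y += 1
        elif step in ((0, 1), (1, 0)):
            # horizontal arc = insertion, vertical arc = deletion
            z += 1
        else:
            raise ValueError(f"not an arc of the alignment graph: {step}")
    return x, y, z

a, b = "ATTGT", "AGGACAT"
path = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (4, 4), (4, 5), (4, 6), (5, 7)]
x, y, z = walk_path(a, b, path)
print((x, y, z), 2 * x + 2 * y + z, len(a) + len(b))
```

For this path the counts are 3 matches, 1 mismatch, and 4 indels, and 2x + 2y + z equals the combined length of the two strings.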
Operation types in $G_{a,b}$
• indel: insertions (horizontal arcs) and deletions (vertical arcs)
• match: matching diagonal arcs
• mismatch: mismatching diagonal arcs
Scoring assumption
• Match: 1
• Mismatch: $-\delta$
• Indel: $-\mu$
• $\delta$ and $\mu$ are positive reals.
Alignment vector
• $(x, y, z)$ is an alignment vector for $a_i \dots a_k$ and $b_j \dots b_l$ if there is an alignment path between the vertices $(i-1, j-1)$ and $(k, l)$ in $G_{a,b}$ with x matches, y mismatches, and z indels.
• The set of all such alignment vectors is denoted $AV_{i,j,k,l}(a,b) = \{(x, y, z) \mid (x, y, z) \text{ is an alignment vector for } a_i \dots a_k \text{ and } b_j \dots b_l\}$.
• Next, define $AV(a,b)$ as the set of all alignment vectors: $AV(a,b) = \bigcup_{i,j,k,l} AV_{i,j,k,l}(a,b)$ (1)
• With this score table, the score of an alignment vector is $SCORE(x, y, z) = x - \delta y - \mu z$ (2)
• The maximum score between $a_i \dots a_k$ and $b_j \dots b_l$ is $S_{\delta,\mu}(a_i \dots a_k, b_j \dots b_l) = \max\{SCORE(x, y, z) \mid (x, y, z) \in AV_{i,j,k,l}(a,b)\}$ (3)
Local alignment (LA) problem
• Seeks two segments with the highest similarity score LA*:
$LA^*_{\delta,\mu}(a,b) = \max_{i,j,k,l} S_{\delta,\mu}(a_i \dots a_k, b_j \dots b_l) = \max_{i,j,k,l} \max\{SCORE(x,y,z) \mid (x,y,z) \in AV_{i,j,k,l}(a,b)\}$
• By equation (1): $LA^*_{\delta,\mu}(a,b) = \max\{SCORE(x,y,z) \mid (x,y,z) \in AV(a,b)\}$ (4)
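The optimum in (4) is what the Smith-Waterman recurrence computes. Below is a minimal sketch under the scoring assumed in these slides (match 1, mismatch penalty `delta`, indel penalty `mu`, no separate gap-opening cost). The function name is hypothetical, the empty alignment is allowed and scores 0, and only two rows of the matrix are kept.

```python
def local_alignment_score(a, b, delta, mu):
    """Best local alignment score with match = 1, mismatch = -delta, indel = -mu."""
    prev = [0.0] * (len(b) + 1)          # row u-1 of the score matrix
    best = 0.0
    for u in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)       # row u
        for v in range(1, len(b) + 1):
            diag = prev[v - 1] + (1.0 if a[u - 1] == b[v - 1] else -delta)
            # 0.0 restarts the alignment at this vertex
            cur[v] = max(0.0, diag, prev[v] - mu, cur[v - 1] - mu)
            best = max(best, cur[v])
        prev = cur
    return best

print(local_alignment_score("XXACGTXX", "YYACGTYY", 1, 1))
```

Passing the shorter sequence as `b` keeps the row length, and hence the memory use, proportional to m.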
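Normalized score (NS):
$NS_{\delta,\mu,L}(a_i \dots a_k, b_j \dots b_l) = \frac{S_{\delta,\mu}(a_i \dots a_k, b_j \dots b_l)}{LENGTH_L(a_i \dots a_k, b_j \dots b_l)}$ (5)
where $LENGTH_L(a_i \dots a_k, b_j \dots b_l) = (k-i+1) + (l-j+1) + L$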
Normalized Local Alignment (NLA) problem:
$NLA^*_{\delta,\mu,L}(a,b) = \max_{i,j,k,l} \{NS_{\delta,\mu,L}(a_i \dots a_k, b_j \dots b_l)\}$
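Observation
• If $(x, y, z)$ is an alignment vector for $a_i \dots a_k$ and $b_j \dots b_l$, then $(k-i+1) + (l-j+1) = 2x + 2y + z$, because each match or mismatch consumes one symbol from each sequence and each indel consumes one symbol from one sequence.
• So $LENGTH_L(x, y, z) = 2x + 2y + z + L$ (6)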
• By (1), (3), (5), and (6), the objective of the NLA problem is
$NLA^*_{\delta,\mu,L}(a,b) = \max\left\{\frac{SCORE(x,y,z)}{LENGTH_L(x,y,z)} \,\middle|\, (x,y,z) \in AV(a,b)\right\}$ (7)
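Algorithms
• The alignment problems are optimization problems over linear functions of $(x, y, z)$: LA maximizes a linear function, and NLA maximizes a ratio of two linear functions.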
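By equations (2) and (6) and definitions (4) and (7):
• $LA_{\delta,\mu}(a,b)$: maximize $x - \delta y - \mu z$ subject to $(x,y,z) \in AV(a,b)$
• $NLA_{\delta,\mu,L}(a,b)$: maximize $\frac{x - \delta y - \mu z}{2x + 2y + z + L}$ subject to $(x,y,z) \in AV(a,b)$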
Parametric local alignment problem
• For a given $\lambda$, define the problem $LA_{\delta,\mu,L}(\lambda)(a,b)$: maximize $x - \delta y - \mu z - \lambda(2x + 2y + z + L)$ subject to $(x,y,z) \in AV(a,b)$
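Proposition 1. For any candidate normalized score $\lambda < 1/2$, the optimum value $LA^*(\lambda)$ of the parametric problem can be formulated in terms of the optimum value $LA^*$ of a local alignment problem.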
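Proof of Proposition 1:
$LA^*(\lambda) = \max\{(1-2\lambda)x - (\delta+2\lambda)y - (\mu+\lambda)z\} - \lambda L$
$= (1-2\lambda)\max\{x - \delta' y - \mu' z\} - \lambda L$
$= (1-2\lambda)\,LA^*_{\delta',\mu'}(a,b) - \lambda L$ (8)
where $\delta' = \frac{\delta + 2\lambda}{1 - 2\lambda}$ and $\mu' = \frac{\mu + \lambda}{1 - 2\lambda}$, and each maximum is taken over $(x,y,z) \in AV(a,b)$.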
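• Thus, computing $LA^*(\lambda)$ involves solving $LA_{\delta',\mu'}(a,b)$.
• Since $\delta$, $\mu$ and L are positive, for any alignment vector $(x', y', z')$ the normalized score satisfies $\frac{x' - \delta y' - \mu z'}{2x' + 2y' + z' + L} \le \frac{x'}{2x' + L} < \frac{1}{2}$, so the condition $\lambda < 1/2$ always holds for candidate values.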
Dinkelbach's algorithm
• Dinkelbach (1967) developed a general algorithm based on the parametric method of fractional programming, an optimization technique for ratio objectives.
• NLA* can be obtained by solving a series of parametric problems LA(λ) for different values of λ.
• $\lambda = NLA^*$ iff $LA^*(\lambda) = 0$
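Pick an arbitrary alignment vector (x,y,z) ∈ AV(a,b)
λ* ← (x − δy − μz) / (2x + 2y + z + L)
do {
  λ ← λ*
  Using Prop. 1, solve LA(λ) and obtain an optimal alignment vector (x,y,z)
  λ* ← (x − δy − μz) / (2x + 2y + z + L)
} while (λ* ≠ λ)
return (λ*)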
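The iteration can be sketched in Python as follows. Several choices here are assumptions made for illustration, not taken from the slides: the parametric problem is solved directly with the weights 1 − 2λ, −(δ + 2λ), −(μ + λ) instead of rescaling to δ′ and μ′ (the two are equivalent by Proposition 1), the starting value is λ = 0 (the empty alignment), L is assumed positive, and `Fraction` is used so that the test λ* = λ is exact.

```python
from fractions import Fraction

def solve_parametric(a, b, delta, mu, lam):
    """Smith-Waterman with weights (1-2*lam, -(delta+2*lam), -(mu+lam)).

    Returns an optimal alignment vector (x, y, z)."""
    w_match, w_mis, w_indel = 1 - 2 * lam, -(delta + 2 * lam), -(mu + lam)
    zero = (Fraction(0), 0, 0, 0)        # (weighted score, x, y, z)
    prev = [zero] * (len(b) + 1)
    best = zero
    for u in range(1, len(a) + 1):
        cur = [zero] * (len(b) + 1)
        for v in range(1, len(b) + 1):
            s, x, y, z = prev[v - 1]
            if a[u - 1] == b[v - 1]:
                diag = (s + w_match, x + 1, y, z)
            else:
                diag = (s + w_mis, x, y + 1, z)
            s, x, y, z = prev[v]
            up = (s + w_indel, x, y, z + 1)
            s, x, y, z = cur[v - 1]
            left = (s + w_indel, x, y, z + 1)
            cur[v] = max(zero, diag, up, left)   # tuples compare by score first
            best = max(best, cur[v])
        prev = cur
    return best[1:]

def dinkelbach_nla(a, b, delta, mu, L):
    """Return (NLA*, an optimal alignment vector); assumes L > 0."""
    delta, mu, L = Fraction(delta), Fraction(mu), Fraction(L)
    lam_star = Fraction(0)               # normalized score of the empty alignment
    while True:
        lam = lam_star
        x, y, z = solve_parametric(a, b, delta, mu, lam)
        lam_star = (x - delta * y - mu * z) / (2 * x + 2 * y + z + L)
        if lam_star == lam:
            return lam_star, (x, y, z)

print(dinkelbach_nla("ACGT", "ACGT", 1, 1, 2))
```

Each pass through the loop costs one Smith-Waterman run, which matches the stated time complexity of iterations times the cost of the S-W algorithm.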
• Time complexity: the number of iterations times the time complexity of the Smith-Waterman (S-W) algorithm.
• Space complexity: O(m)
• The position of an optimal alignment may also be desired.
• It can be obtained by extending the S-W algorithm to store, at each entry of the score matrix, information about the alignment path that ends at that node, including the starting node of the path.
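One way to carry the starting node along is to store it next to the score in every cell. The sketch below does this for the plain local alignment problem; the function name and the return layout are assumptions for this example. A path from node (i−1, j−1) to node (k, l) aligns a_i…a_k with b_j…b_l, so in 0-based Python slicing the aligned segments are `a[start_u:end_u]` and `b[start_v:end_v]`.

```python
def local_alignment_with_position(a, b, delta, mu):
    """Return (best score, start node, end node) of an optimal local alignment."""
    # each cell holds (score, start node of the best path ending here)
    prev = [(0, (0, v)) for v in range(len(b) + 1)]
    best_score, best_start, best_end = 0, (0, 0), (0, 0)
    for u in range(1, len(a) + 1):
        cur = [(0, (u, 0))]
        for v in range(1, len(b) + 1):
            sub = 1 if a[u - 1] == b[v - 1] else -delta
            candidates = [
                (0, (u, v)),                              # start a new path here
                (prev[v - 1][0] + sub, prev[v - 1][1]),   # diagonal arc
                (prev[v][0] - mu, prev[v][1]),            # vertical arc (deletion)
                (cur[v - 1][0] - mu, cur[v - 1][1]),      # horizontal arc (insertion)
            ]
            cell = max(candidates, key=lambda c: c[0])
            cur.append(cell)
            if cell[0] > best_score:
                best_score, best_start, best_end = cell[0], cell[1], (u, v)
        prev = cur
    return best_score, best_start, best_end

a, b = "XXACGTXX", "YYACGTYY"
score, (u0, v0), (u1, v1) = local_alignment_with_position(a, b, 1, 1)
print(score, a[u0:u1], b[v0:v1])
```

Only the start node is propagated, so the extra memory stays proportional to one row of the matrix.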
RationalNLA
• Objective: better time complexity.
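Introduction
• Megiddo (1979): parametric search.
• Scores of the parametric problem: match: $1 - 2\lambda$, mismatch: $-(\delta + 2\lambda)$, indel: $-(\mu + \lambda)$.
• λ is treated as a variable, and its candidate values can be precomputed.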
Precomputed Method
• Binary search with the following criteria:
• If LA*(λ) = 0, then λ = NLA*, and an optimal alignment vector of LA(λ) is also an optimal solution of NLA.
• If LA*(λ) > 0, try a larger λ.
• If LA*(λ) < 0, try a smaller λ.
Observation • Any two distinct candidate values for NLA* are not arbitrarily close to each other if the scores are rational
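Proof (I)
• Let Q(a, b) be the set of possible values of NLA*:
• $Q(a,b) = \{(x - \delta y - \mu z)/(2x + 2y + z + L) \mid (x, y, z) \in AV(a,b)\}$
• PROPOSITION 2
• Let $\sigma = \min\{|q_1 - q_2| \mid q_1, q_2 \in Q(a,b),\ q_1 \ne q_2\}$.
• Let $\delta = p/q$ and $\mu = r/s$ be rational numbers.
• Then $\sigma \ge \frac{1}{qs(m+n+L)^2}$.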
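Proof (II)
• Let $q_1, q_2 \in Q(a,b)$ be the normalized scores obtained from the alignment vectors $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ respectively, with $q_2 < q_1$ and $q_1 - q_2 = \sigma$.
• $\sigma = \frac{x_1 - \delta y_1 - \mu z_1}{2x_1 + 2y_1 + z_1 + L} - \frac{x_2 - \delta y_2 - \mu z_2}{2x_2 + 2y_2 + z_2 + L}$
• For two rationals $c_1/d_1 > c_2/d_2$ with integer numerators and positive integer denominators, $c_1/d_1 - c_2/d_2 \ge 1/(d_1 d_2)$, and for any alignment vector $(x, y, z) \in AV(a,b)$, $2x + 2y + z \le m + n$.
• Therefore $\sigma = \frac{1}{qs}\left[\frac{qs\,x_1 - ps\,y_1 - qr\,z_1}{2x_1 + 2y_1 + z_1 + L} - \frac{qs\,x_2 - ps\,y_2 - qr\,z_2}{2x_2 + 2y_2 + z_2 + L}\right] \ge \frac{1}{qs(m+n+L)^2}$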
Algorithm
• Compute σ. Maintain an interval [e, f] such that NLA* lies in [eσ, fσ].
• Initially, e = 0 and $f = \frac{1}{2}\sigma^{-1}$, since NLA* lies in [0, ½).
• Let k = (e+f)/2 and iteratively solve the parametric local alignment problem with parameter kσ.
• The interval is updated according to the sign of the optimum value of the parametric problem.
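The binary search can be sketched as below. This is a hedged illustration rather than the authors' implementation: the names are hypothetical, L is assumed to be a positive integer (so that all denominators in Proposition 2 are integers), σ is taken to be the lower bound 1/(qs(m+n+L)²), e, f and k are kept as integers with f rounded up, and the final step reads NLA* off an optimal vector of LA(eσ), relying on the fact that at most one candidate value fits in a half-open interval of width σ.

```python
from fractions import Fraction

def solve_parametric(a, b, delta, mu, lam):
    """Return (W, x, y, z): the best weighted score without the -lam*L term and its vector."""
    w_match, w_mis, w_indel = 1 - 2 * lam, -(delta + 2 * lam), -(mu + lam)
    zero = (Fraction(0), 0, 0, 0)
    prev = [zero] * (len(b) + 1)
    best = zero
    for u in range(1, len(a) + 1):
        cur = [zero] * (len(b) + 1)
        for v in range(1, len(b) + 1):
            s, x, y, z = prev[v - 1]
            if a[u - 1] == b[v - 1]:
                diag = (s + w_match, x + 1, y, z)
            else:
                diag = (s + w_mis, x, y + 1, z)
            s, x, y, z = prev[v]
            up = (s + w_indel, x, y, z + 1)
            s, x, y, z = cur[v - 1]
            left = (s + w_indel, x, y, z + 1)
            cur[v] = max(zero, diag, up, left)
            best = max(best, cur[v])
        prev = cur
    return best

def rational_nla(a, b, delta, mu, L):
    """Binary search for NLA*; delta and mu rational, L a positive integer."""
    delta, mu = Fraction(delta), Fraction(mu)
    inv_sigma = delta.denominator * mu.denominator * (len(a) + len(b) + L) ** 2
    sigma = Fraction(1, inv_sigma)
    e, f = 0, (inv_sigma + 1) // 2        # NLA* lies in [e*sigma, f*sigma)
    while f - e > 1:
        k = (e + f) // 2
        w, x, y, z = solve_parametric(a, b, delta, mu, k * sigma)
        value = w - k * sigma * L         # LA*(k*sigma)
        if value == 0:
            return k * sigma
        if value > 0:
            e = k                         # NLA* is larger than k*sigma
        else:
            f = k                         # NLA* is smaller than k*sigma
    # only one candidate value can remain in [e*sigma, f*sigma)
    w, x, y, z = solve_parametric(a, b, delta, mu, e * sigma)
    return (x - delta * y - mu * z) / (2 * x + 2 * y + z + L)

print(rational_nla("XXACGTXX", "YYACGTYY", Fraction(1, 2), Fraction(1, 3), 3))
```

The number of parametric problems solved is logarithmic in 1/σ, so it grows with log(qs) and log(m+n+L) rather than with the number of Dinkelbach iterations.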