Post-processing long pairwise alignments

Post-processing long pairwise alignments 陳啟煌 93/4/28 Zheng Zhang et al., Bioinformatics Vol.15 no. 12 1999

Outline • Motivation • Theoretical basis of the proposed algorithms • How to build up Useful Tree • An application

Motivation • Avoid local alignment problems • Smith-Waterman lead to inclusion of an arbitrarily poor internal segment. • Others approaches may generate an alignment score less than some internal segment

C G G A T C A T 0 0 0 0 0 0 0 0 0 CTTAACT 0 8 5 2 0 0 8 5 2 0 5 3 0 0 8 5 3 13 0 2 0 0 0 8 5 2 11 0 0 0 0 8 5 3 13 10 0 0 0 0 8 5 2 11 8 The best score 0 8 5 2 5 3 13 10 7 0 5 3 0 2 13 10 8 18 Smith-Waterman approach

Inclusion of a poor segment • Inclusion of an arbitrarily poor region in an alignment • Smith-Waterman approach potential flaws.

X-Alignments • An X-Drop within an alignment, where X>0 is fixed in advance. • A region of consecutive columns scoring less than <-X • Alignments contain no X-Drop, we call X-alignments

hit Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.) BLAST In Blast Step 3: Extend hits.

2X-Drop

Non-normal alignment The HSP has been extended to the right side in such a way that the entire alignment score less than the section from a to b

The Proposed Approach • Provide techniques for decomposing a long alignment into sub-alignments that avoid the both problems. • Show how to scan an alignment to collect information from which a decomposition corresponding any X can be found almost instantaneously. • Provide a method for detecting variations in the rate of genome evolution

Useful Tree

X-full alignment • An alignment are normal if each of its prefixes or suffixes has non-negative score. • An alignment is not contained in any longer normal alignment is called full • X-alignment + maximal X-normal is called X-full

X-full alignment • 0-full alignment is maximal runs of columns of A with non-negative scores. • For every X, X-full alignments are pairwise disjoint. • If X<YX-full alignment contained in Y-full alignment. •  -full alignments are just full alignments

Useful Tree • Encode X-full alignments for all X≥ 0 in tree data structure. • Leaves: 0-full alignments & maximal runs of negative score columns alternately • Terminal Leaves: add two special leaves with score - • Each internal node is a disjoint union of its threechildren. Keep alignment’s score and the minimum sub-alignment’s score

Time complexity • Construct time: O(N) • Search Time: • If k such alignments,need inspect at most 3k+1 nodes • (2k+1) leaves+((2k+1-1)/2) internal nodes =3k+1 nodes

Decompose rules • Alignment A A1,A2,…., A2n-1 • # of sub-alignment is odd • i :score of Ai • Negative & Non-negative score alternately • 0=2n= -∞

Theoretical basis

Theoretical basis • Lemma1:X is consistent • Lemma2:A normal drop is consistent with X

Lemma 3

Useful tree definition • Each node of T is a segment consistent with X. • Each leaf of T is of the form [i,i+1) • Each internal node [a,d) has exactly three children. [a,b),[b,c) and [c,d) and the signs of their scores alternate.

Lemma 4

Possible negative merge • LEMMA 5. Assume that three consecutive roots in our sequence, [a,b),[b,c),and [c,d), satisfy • 0 ≤(b,c)< min(- (a,b),- (c,d)) • Then merging these trees into a single tree with root [a,d) creates a useful tree and the resulting sequence still satisfies P1 and P2. • If a,b,c and d satisfy this lemma,[a,d) is a possible negative merger.

Possible positive merge • LEMMA 6. Assume that five consecutive roots in our sequence, [a,b),[b,c),[c,d),[d,e) and [e,f) satisfy • 0 >(c,d) ≥ max( (a,b), (e,f)) • neither [a,d) nor [c,f) is a possible negative • Then merging these trees into a single tree with roots[b,c),[c,d),[d,e)into a single root[b,e) creates a useful tree and the resulting sequence still satisfies P1 and P2. • If a,b,c,d,e and f satisfy this lemma,[a,d) is a possible positive merger.

Lemma 7

Theoretical basis • Normal rise and normal drop • Useful Tree contains every segments • Possible negative merger • Possible positive merger • Always exists possible negative merger or possible positive merger

Decompose rules • Alignment A A1,A2,…., A2n-1 • # of sub-alignment is odd • i :score of Ai • Negative(odd i) & Non-negative(even i) score alternately • 0=2n= -∞

Useful Tree build up procedure 1.Push the first leaf on the stack 2.While the stack size exceeds 1 or there is an unvisited leaf do 3. if the top three stack items indicate a negative merger then 4. pop three items,merge them and push the result onto the stack 5. else if the top five segments indicate a positive merge then 6. pop an item{e,f} perform line 4. and push {e,f} back 7. else 8. push the next two leaves onto the stack

Construct Useful Tree • ACAACAGAAACT • | | || ||| • ATA--AG-CACT • Gop:0 • Gep:1 • Match/mismatch: 1/-1

Push 1 • Push 2,3 • Push 4,5, • Merge 2,3,4 as a • Merge 1,a,5 as b • Push 6,7 • Push 8,9 • Merge 6,7,8 as c • Push 10,11 • Merge 9,10,11 as d • Merge b,c,d as e

Source code of this paper • http://globin.cse.psu.edu/dist/decom/

Alignment file • a { • s 562 • b 1 1 • e 3 3 • l 1 1 3 3 99 • l 6 4 9 7 99 • l 11 8 12 9 99 • } • #:lav • d { • "simu elegans briggsae • M = 10, I = -10, V = -10, O = 60, E = 2" • } • s { • "s1" 1 12 • "s2" 1 9 • } • h { • ">SUPERLINK_RWXL 2782216-2889703" • ">dna -c briggsae.dna " • }

An Application • Different regions of a mammalian genome evolve at different rates. • Provide a method for detecting variations in the rate of genome evolution • To compare the rates of evolution in different genomic regions from humans and mice. • Align each pair of homologous regions and determined

Pitfalls • Tally statistics only at sequence not in exons • Regions adjacent to an exon maybe be aligned • Remove the exons before producing the alignment • The alignment program is unable to differentiate the biologically meaning alignment

Proposed approach • First align the sequences using the exons as guideposts • Then re-score the alignment where positions within exons are masked,so that they cannot be aligned to another nucleotide.

References • Zheng Zhang et al., “Post-processing long pairwise alignments”,Bioinformatics, Vol.15 no. 12 1999 • http://globin.cse.psu.edu/dist/decom/ • Kun-Mao Chao ,Algorithms for Biological Sequence Analysis Lecture Notes, National Taiwan University, Spring 2004

Q&A • Thank you!

Possible mistakes, but maybe not • P.1015 left col., last 2 row ∑ k=1  ∑ k=i • P.1015 Right col. [i,i) should be [i,j) • P.1016 proof of lemma4 4 [i,i) should be [i,j) • P.1017 proof of lemma5 (b,c)  (e,c) should be (b,c)- (e,c) • P.1017 lemma7 (ai-3,ai-2) (ai-4,ai-1)

Lemma1:X is consistent Proof 1

Proof of lemma2

Lemma 3

Proof of lemma3

Lemma 4

Proof of lemma4

Possible negative merge • LEMMA 5. Assume that three consecutive roots in our sequence, [a,b),[b,c),and [c,d), satisfy • 0 ≤(b,c)< min(- (a,b),- (c,d)) • Then merging these trees into a single tree with root [a,d) creates a useful tree and the resulting sequence still satisfies P1 and P2. • If a,b,c and d satisfy this lemma,[a,d) is a possible negative merger.

Proof of lemma5

Post-processing long pairwise alignments

Post-processing long pairwise alignments

Presentation Transcript

Pairwise sequence alignments

Pairwise and multiple sequence alignments

Pairwise Alignments

Bioinformatics 01 Part 3: Pairwise Alignments and Database Searches

Database search and pairwise alignments

Pairwise Alignments

Post- processing

Pairwise Alignments and Sequence Similarity-Based Searching

Bioinformatics Part 3: Pairwise Alignments and Database Searches

Pairwise Sequence Alignments

Pairwise sequence alignments

Pairwise alignments

Pairwise sequence alignments

Pairwise Alignments and Database Searches: Algorithms

Pairwise Sequence Alignments

Chapter 2 Data Searches and Pairwise Alignments

Pairwise Alignments Part 1

Post Processing

Pairwise Alignments Part 1

The biological meaning of pairwise alignments

Pairwise Alignments

Pairwise alignments