Multiple Sequence Composition Alignment. Name: Yip Chi Kin Date: 21-12-2006. Studied Papers. [B03] Composition Alignment. [S98] Divide-and-conquer Alignment. [M99] DIALIGN Algorithm. [SMS03] DCA + Segment-based. Main Aspects. ․Dynamic Programming
Multiple Sequence Composition Alignment
Name: Yip Chi Kin
Date: 21-12-2006
Studied Papers
․[B03] Composition Alignment
․[S98] Divide-and-conquer Alignment
․[M99] DIALIGN Algorithm
․[SMS03] DCA + Segment-based
Main Aspects
․Dynamic Programming
․Composition Alignment
․Meta-code MSA
․Simultaneous MSA
Side notes on the slide: Pairwise Library (Global & Local); Consistency & Ungapped; Divide-and-conquer; Segment-based (Optimal scores)
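Dynamic Programming
[Figure: three views of a pairwise comparison: Dot Matrix, DP Matrix, and Edit Graph. The edit graph has edges labelled matches, deletions, and insertions; the DP matrix has moves scored s(ai,bi), -d, and -d. Sequence labels in the figure: CTG, CTA, CTGA.]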
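The three move scores on the DP Matrix panel correspond to the standard global-alignment recurrence. The form below is a sketch under the usual textbook convention, not taken from the slide itself: $F(i,j)$ is the best score for the first $i$ symbols of $a$ and the first $j$ symbols of $b$, $s$ is the substitution score, and $d$ is a linear gap penalty. The substitution term is indexed as $s(a_i, b_j)$ here.

```latex
F(0,0) = 0, \qquad F(i,0) = -i\,d, \qquad F(0,j) = -j\,d,
\qquad
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{(match or mismatch)} \\
F(i-1,\,j) - d             & \text{(gap in } b\text{)} \\
F(i,\,j-1) - d             & \text{(gap in } a\text{)}
\end{cases}
```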
Global Alignment
Needleman-Wunsch Algorithm

Scoring
        -    C    T    T    C    T
   -    0   -2   -4   -6   -8  -10
   G   -2   -1   -3   -5   -7   -9
   C   -4   -1   -2   -4   -4   -6
   A   -6   -3   -2   -3   -5   -5
   T   -8   -5   -2   -1   -3   -4
   C  -10   -7   -4   -3    0   -2

GA Results
G C A T C -
- C T T C T
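The matrix on this slide can be recomputed with a short Needleman-Wunsch sketch. The scoring values are not legible in the transcript; match = +1, mismatch = -1, and gap = -2 are assumptions inferred from the matrix entries, and the function name and traceback tie-breaking order are illustrative choices.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score matrix, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    f = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        f[i][0] = i * gap            # a[:i] aligned against gaps only
    for j in range(1, cols):
        f[0][j] = j * gap            # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + s,   # diagonal move
                          f[i - 1][j] + gap,     # gap in b
                          f[i][j - 1] + gap)     # gap in a
    # Traceback from the bottom-right corner (diagonal preferred on ties).
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return f, "".join(reversed(top)), "".join(reversed(bottom))


matrix, aligned_a, aligned_b = needleman_wunsch("GCATC", "CTTCT")
print(matrix[-1][-1])
print(aligned_a)
print(aligned_b)
```

With these assumed scores the bottom-right cell is the global alignment score, and the traceback yields the alignment listed under GA Results.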
Local Alignment
Smith-Waterman Algorithm

Scoring
        -  T  T  T  A  C  A  G  G  C  A  G
   -    0  0  0  0  0  0  0  0  0  0  0  0
   G    0  0  0  0  0  0  0  2  2  0  0  2
   A    0  0  0  0  2  0  2  0  1  1  2  0
   A    0  0  0  0  2  1  2  1  0  0  3  1
   C    0  0  0  0  0  4  2  1  0  2  1  2
   G    0  0  0  0  0  2  3  4  3  2  1  3
   G    0  0  0  0  0  0  1  5  6  4  2  3
   T    0  2  2  2  0  0  0  3  4  5  3  1

GA Results
- G A A C - G G T - -
T T T A C A G G C A G
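A minimal Smith-Waterman sketch for the same pair of sequences follows. The scores match = +2, mismatch = -1, and gap = -2 are assumptions inferred from the slide's matrix values rather than stated in the transcript, and the function name is illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                      # start a new local alignment
                          h[i - 1][j - 1] + s,    # diagonal move
                          h[i - 1][j] + gap,      # gap in b
                          h[i][j - 1] + gap)      # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Traceback from the highest-scoring cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, local_a, local_b = smith_waterman("GAACGGT", "TTTACAGGCAG")
print(score, local_a, local_b)
```

Unlike the global case, cells are clamped at zero and the traceback starts at the maximum cell, so only the best-matching region of the two sequences is reported.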
MSA Methods
․Consistency-based
․Exact method
․Progressive method
․Iterative method
․Stochastic method
․Hidden Markov method
MSA Concepts
Consistency-based method
[Figure: the sequences CGTCTC, TGTCC, and CGATATT compared through pairwise alignments (PSAs); panels: Trace formulation, Latter formulation.]
MSA Results
Results of MSA
[Figure: alignments of CGTCTC, TGTCC, and CGATATT, with aligned regions marked as Consistent, Realized, or Unrealized.]
Divide-and-conquer
․Divide: cut the sequences S1, S2, S3 at the positions C1, C2, C3 into prefixes (S1C1, S2C2, S3C3) and suffixes (C1S1, C2S2, C3S3)
․Divide the prefixes and the suffixes again
․Align optimally
․Concatenate
DP Distance
Sequence: GTTCATGCCAGGTGTAAATC (split into Prefix and Suffix)

Wopt (suffix)
        C   T   A   T   A   C   -
   G    3   4   3   4   6   8  10
   T    4   2   3   2   4   6   8
   A    6   4   2   2   2   4   6
   T    8   6   4   2   1   2   4
   C   10   8   6   4   2   0   2
   -   12  10   8   6   4   2   0

Wopt (prefix)
        -   C   T   A   T   A   C
   -    0   2   4   6   8  10  12
   G    2   1   3   5   7   9  11
   T    4   3   1   3   5   7   9
   A    6   5   3   1   3   5   7
   T    8   7   5   3   1   3   5
   C   10   8   7   5   3   2   3

$C_{S_1,S_2}[c_1,c_2] = W_{opt}(\text{prefix}) + W_{opt}(\text{suffix}) - W_{opt}(\text{total})$
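Additional-cost
        -   C   T   A   T   A   C
   -    0   3   4   7  11  15  19
   G    3   0   3   4   8  12  16
   T    7   4   0   2   4   8  12
   A   11   8   4   0   1   4   8
   T   15  12   8   4   0   0   4
   C   19  15  12   8   4   1   0

Cost of Diagonal (S1 = CTATAC, S2 = GTATC)
$C_{S_1,S_2}[1,1] = 0$, $C_{S_1,S_2}[2,2] = 0$, $C_{S_1,S_2}[3,3] = 0$, $C_{S_1,S_2}[4,4] = 0$, $C_{S_1,S_2}[5,4] = 0$, $C_{S_1,S_2}[6,5] = 0$

$C_{S_1,S_2}[2,2] = W_{opt}[\mathrm{CT},\mathrm{GT}] + W_{opt}[\mathrm{ATAC},\mathrm{ATC}] - W_{opt}[\mathrm{CTATAC},\mathrm{GTATC}] = 1 + 2 - 3 = 0$
$C_{S_1,S_2}[4,3] = W_{opt}[\mathrm{CTAT},\mathrm{GTA}] + W_{opt}[\mathrm{AC},\mathrm{TC}] - W_{opt}[\mathrm{CTATAC},\mathrm{GTATC}] = 3 + 1 - 3 = 1$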
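The additional-cost values can be recomputed from two distance matrices, one over prefixes and one over suffixes. The sketch below assumes a cost of 0 for a match, 1 for a mismatch, and 2 for a gap, which is inferred from the slide's numbers rather than stated on it; the function names are illustrative.

```python
def distance_matrix(a, b, gap=2, mismatch=1):
    """d[i][j] = minimal alignment cost between a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            d[i][j] = min(d[i - 1][j - 1] + sub,
                          d[i - 1][j] + gap,
                          d[i][j - 1] + gap)
    return d


def additional_cost(s1, s2, gap=2, mismatch=1):
    """cost[c1][c2] = Wopt(prefixes) + Wopt(suffixes) - Wopt(total)."""
    n, m = len(s1), len(s2)
    prefix = distance_matrix(s1, s2, gap, mismatch)
    # Suffix costs come from the same DP run on the reversed strings:
    # reversed_dp[n - c1][m - c2] is the cost of aligning s1[c1:] with s2[c2:].
    reversed_dp = distance_matrix(s1[::-1], s2[::-1], gap, mismatch)
    total = prefix[n][m]
    return [[prefix[c1][c2] + reversed_dp[n - c1][m - c2] - total
             for c2 in range(m + 1)]
            for c1 in range(n + 1)]


cost = additional_cost("CTATAC", "GTATC")
print(cost[2][2], cost[4][3])   # cut positions [2,2] and [4,3] from the slide
```

Here `cost[c1][c2]` is indexed by the cut position in S1 first and in S2 second, so the slide's table corresponds to the transpose of this list of lists. Cut pairs with cost 0 lie on an optimal alignment path, which is why they are good places to divide.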
Space & Time
․Full sequence searching
․'Chain' of boxes along the diagonal, in order to reduce searching time
DIALIGN
Sequences: YIAVLFAED, LACVIFGS, PWDDVTFDAE
[Figure: diagonals between the three sequences; panels: Non-Consistent (Simultaneous), Non-Consistent (Cross over), Consistent diagonals, and GA Results.]
Weighting
Sequences: YIAVLFAYDD, LACVIFGS, SWDDVMFYAE

Diagonal Weights
$w(D) = -\log P(l_D, S_D)$
where $l_D$ is the length of diagonal $D$ and $S_D$ is the sum of the similarity values on the same diagonal.

Overlap weighting
w(D1) = 1.9, w(D2) = 1.7, w(D3) = 1.5, w(D4) = 2.6, w(D5) = 0.2
․Diagonals D1, D4 and D5: Score = 1.9 + 2.6 + 0.2 = 4.7
․Diagonals D1, D2, D3 and D5: Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
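The comparison of the two diagonal sets is plain arithmetic over the weights; a small sketch makes the choice explicit. The weights and set memberships are the slide's, while the variable and function names are illustrative.

```python
# Diagonal weights as given on the slide.
weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}

# The two consistent sets of diagonals compared on the slide.
candidate_sets = [
    ("D1", "D4", "D5"),
    ("D1", "D2", "D3", "D5"),
]


def alignment_score(diagonals):
    """Score of an alignment = sum of the weights of its diagonals."""
    return sum(weights[d] for d in diagonals)


best_set = max(candidate_sets, key=alignment_score)
for diagonals in candidate_sets:
    print(diagonals, round(alignment_score(diagonals), 1))
print("best:", best_set)
```

Although D4 has the largest single weight, the set without it scores higher, because including D4 excludes both D2 and D3.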
Consistency check
[Figure: sequences S1 (positions 1 to 18), S2 (positions 1 to 16), and S3 (positions 1 to 18) with fragments f1, f2, and f3; labels: Overlap weights, Fragments checking, Transitivity frontier [1,9].]
Greedy Strategy
[Figure: Greedy Approach applied to motifs M1 (1), M1 (2), M2, M3 in sequences S1, S2, and S3; panels: Tandem duplications (S1, S2) and Consistency conflicts (S1, S2, S3).]
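A greedy strategy of this kind can be sketched for the pairwise case: take diagonals in order of decreasing weight and keep each one only if it is consistent with those already accepted. This is a simplified illustration and not the DIALIGN implementation; the `Diagonal` type, the consistency test (one diagonal lies entirely before the other in both sequences), and all positions are hypothetical. Only the five weights are borrowed from the Weighting slide.

```python
from typing import NamedTuple


class Diagonal(NamedTuple):
    name: str
    start1: int    # start position in sequence 1 (hypothetical)
    start2: int    # start position in sequence 2 (hypothetical)
    length: int
    weight: float


def consistent(d, e):
    """Simplified pairwise test: one diagonal ends before the other starts in both sequences."""
    d_first = d.start1 + d.length <= e.start1 and d.start2 + d.length <= e.start2
    e_first = e.start1 + e.length <= d.start1 and e.start2 + e.length <= d.start2
    return d_first or e_first


def greedy_select(diagonals):
    """Accept diagonals by decreasing weight, skipping any that conflict."""
    chosen = []
    for d in sorted(diagonals, key=lambda x: x.weight, reverse=True):
        if all(consistent(d, c) for c in chosen):
            chosen.append(d)
    return sorted(chosen, key=lambda x: x.start1)


diagonals = [
    Diagonal("D1", 0, 0, 3, 1.9),
    Diagonal("D2", 4, 5, 3, 1.7),
    Diagonal("D3", 8, 9, 3, 1.5),
    Diagonal("D4", 5, 4, 5, 2.6),   # placed so that it conflicts with D2 and D3
    Diagonal("D5", 12, 12, 2, 0.2),
]
selected = greedy_select(diagonals)
print([d.name for d in selected], round(sum(d.weight for d in selected), 1))
```

In this made-up layout the greedy pass accepts D4 first and therefore rejects D2 and D3, which shows how a greedy choice can miss a higher-scoring consistent set. For more than two sequences the consistency test is more involved than the pairwise check used here.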
Composition Alignment
[Figure: Sequence #1 and Sequence #2 compared by prefix length, contrasting a single character match with composition matches; labels: CM of Prefix Length, Matching Prefix length.]
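As a reminder of what a composition match is: in composition alignment [B03], two substrings of equal length match when they contain the same characters with the same counts, regardless of order. A minimal check might look as follows; the function name and the example strings are made up for illustration.

```python
from collections import Counter


def composition_match(u, v):
    """True if u and v have equal length and the same character counts."""
    return len(u) == len(v) and Counter(u) == Counter(v)


print(composition_match("ACCT", "CTCA"))   # same letters, different order
print(composition_match("ACCT", "ACGT"))   # different composition
```

A single character match is the special case where both substrings have length 1.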
Composition Matching
[Figure: the strings 111010001 and 001101110 compared by prefix length; matching prefixes are replaced by their match length (e.g. "Replaced by 7"). Axis: Prefix length.]
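For the two binary strings on this slide, the prefix lengths at which both prefixes have the same composition can be found in one pass by tracking, for each symbol, the difference between its counts in the two strings. This is a sketch of one way to do it, not necessarily the procedure used in the talk; the function name is illustrative.

```python
from collections import Counter


def matching_prefix_lengths(s, t):
    """Prefix lengths k for which s[:k] and t[:k] have equal composition."""
    balance = Counter()    # count in s minus count in t, per symbol
    unbalanced = 0         # number of symbols whose balance is non-zero
    lengths = []
    for k, (a, b) in enumerate(zip(s, t), start=1):
        for symbol, delta in ((a, 1), (b, -1)):
            if balance[symbol] == 0:
                unbalanced += 1
            balance[symbol] += delta
            if balance[symbol] == 0:
                unbalanced -= 1
        if unbalanced == 0:
            lengths.append(k)
    return lengths


print(matching_prefix_lengths("111010001", "001101110"))
```

The first matching prefix length for this pair is 7, in line with the slide's "Replaced by 7" label, and the full length 9 also matches.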
Composition Matching
[Figure: binary Sequence #1 and Sequence #2 compared at successive prefix lengths (total = 9), with the CM value shown at each step (CM = 2, CM = -1, CM = 1, CM = 0).]
Meta-Code
․Meta-code: code about code
[Diagram labels: Input code, Original Code, Code for Testing, Code Reservoir, Match code, Mismatch code, Meta-Code, Control Rule.]
Reservoir Codes
․Code from S1 (Code 'A', then Code 'G'): store the code in Reservoir S1
․Code from S2 (Code 'T', then Code 'C'): store the code in Reservoir S2
․If the same code is found in both Reservoir S1 and Reservoir S2, delete these two codes
․Code 'AG' in Reservoir S1 and Code 'CT' in Reservoir S2 give the reservoir code (e.g. AGRCT)
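One possible reading of the reservoir rule is sketched below: at each position the code from S1 goes into Reservoir S1 and the code from S2 into Reservoir S2, and any code present in both reservoirs is deleted from both. The reservoir code is then written as the contents of Reservoir S1, the letter R, and the contents of Reservoir S2. The function name, the insertion-order layout of the code string, and the short inputs are assumptions made for illustration; the slide does not spell out these details.

```python
def reservoir_codes(s1, s2):
    """Reservoir code after each position of the two sequences."""
    reservoir1, reservoir2 = [], []
    codes = []
    for a, b in zip(s1, s2):
        reservoir1.append(a)          # store code from S1
        reservoir2.append(b)          # store code from S2
        for code in (a, b):
            # A code found in both reservoirs cancels out.
            if code in reservoir1 and code in reservoir2:
                reservoir1.remove(code)
                reservoir2.remove(code)
        codes.append("".join(reservoir1) + "R" + "".join(reservoir2))
    return codes


print(reservoir_codes("AG", "CT"))      # ends with the slide's example code AGRCT
print(reservoir_codes("ACCT", "CTCA"))  # ends with "R": both reservoirs are empty
```

Under this reading, a bare "R" means both reservoirs are empty, so the two prefixes read so far have the same composition.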
Meta-Code Rule
․Inputs: the codes from S1 and S2, and the values of r and p
․If the CM length is valid: reservoir code = r, position = p
․Looping for creating meta-code: copy the codes from S1 and S2, set p = p - 1, and output the meta-code (e.g. AMT)
․If reservoir code = r, then stop the looping
CM (Lengths & Codes)
Composition Matching of S1 and S2 in prefix length

S1: T A A C A G A G A T A C A G G A G T A C G G G A A C G A T
S2: T T C T T T T G T T C C T C C C C C G A C C T T C T C

Length:    0  1    1    2      1    1    2      2      1    1    2      2
Meta Code: R  ART  ART  AGRTT  ART  ART  AGRCT  AGRCC  ARC  GRC  GARTC  GARTC

Labels: Reservoir codes in S1, Reservoir codes in S2
CM of Metacode
[Figure: Composition Matching of the reservoir codes by prefix length; codes shown include ART, ARC, AGRCT, AGRTT, and GARTC, with AGRTT and GARTC marked "Invalid length". Axis: Prefix length.]
Composition MSA
[Figure: Composition matching between S1 and S2, followed by Meta-code MSA on S1 and a New S2 that contains meta-codes such as TMG, GMT, CMT, GMC, AMG, and TMA.]
Code catalogue
A = Currency / Cards
B = Stock / Structured P.
C = Unit Trusts / Bonds
D = Insurance / Finance
E = Mortgages / Loans
…

Time Granularities: 1t1 1t2 1t3 1t4 1t5 (Week #1), 2t1 2t2 2t3 2t4 2t5 (Week #2), 3t1 3t2
Branch bank #1: A C B C E B A A E B C A …
Branch bank #2: B E B A A A C E D B E A …
Branch bank #3: A C A A B C E E D B E E …

Fixed Segment (Segment Length LS = 5)
․Semi-global alignment
․Least overlap problem
․Simple segmentation
․Composition alignment
․Weekly behaviour
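Fixed segmentation with LS = 5 can be sketched as cutting each branch's code sequence into consecutive five-code segments and summarising each segment by its composition. Only the first twelve codes of each branch are legible on the slide, so the strings below stop there; the function name and the decision to drop an incomplete trailing segment are assumptions.

```python
from collections import Counter


def fixed_segments(codes, segment_length=5):
    """Split a code string into consecutive full segments of equal length."""
    usable = len(codes) - len(codes) % segment_length   # drop an incomplete tail
    return [codes[i:i + segment_length]
            for i in range(0, usable, segment_length)]


# First twelve codes of each branch, as listed on the slide.
branches = {
    "Branch bank #1": "ACBCEBAAEBCA",
    "Branch bank #2": "BEBAAACEDBEA",
    "Branch bank #3": "ACAABCEEDBEE",
}

weekly = {name: fixed_segments(codes) for name, codes in branches.items()}
for name, segments in weekly.items():
    for week, segment in enumerate(segments, start=1):
        print(name, "week", week, segment, dict(Counter(segment)))
```

Each segment's `Counter` is the composition that a composition alignment would compare, so two weeks with the same mix of products match even when the daily order differs.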
Fixed-Segment Composition MSA
[Figure: Meta-Code PSAs between Branch bank #1, Branch bank #2, and Branch bank #3, followed by Composition alignment and Family Classifications into Family Groups.]
Further Problems
Meta-Code Composition MSA
․Fixed-segment length
․Prior sequence choice
․Speed-up of PSAs
․Numbers of segments/codes
Conclusions
․Fixed-segment Composition (Least Overlap Problems)
․Meta-code Approach (Easier Transform Applications)
․Widespread use of MSA (Simultaneous Multiple Sequences)