Sequence Local Alignment using Directed Acyclic Word Graph Do Huy Hoang
Sequence Similarity • Alignment • Arrange DNA/Protein sequences to show the similarity • A gap symbol (conventionally “-”) denotes an insertion/deletion event
Other variations • Edit distance • Longest common substring • Affine gap scoring • Using scoring matrix (BLOSUM, PAM)
Alignment score computation • Needleman–Wunsch • Dynamic programming
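The dynamic programming behind the alignment score can be sketched as below. This is a minimal sketch of a Needleman–Wunsch style global score, assuming a linear gap penalty; the function name and the default scores (match = 1, mismatch = -1, gap = -1) are illustrative assumptions and do not come from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b with a linear gap penalty.

    The score values are illustrative defaults, not taken from the slides.
    Only two rows of the DP table are kept in memory.
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # a[i-1] against a gap
                cur[j - 1] + gap,            # b[j-1] against a gap
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("AC", "AGC"))
```

Keeping two rows is enough when only the score is needed; recovering the alignment itself would require the full table or a divide-and-conquer variant.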
Local alignment • Find the best alignment of two substrings, one from each sequence
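Local alignment differs from the global case in that every table entry is clamped at zero, so an alignment may start and end anywhere. The sketch below is a Smith–Waterman style computation, assuming a linear gap penalty and illustrative scores; the function name and the choice to return the two aligned substrings are assumptions made for the example.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score plus the two substrings it spans.

    Linear gap penalty and default scores are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at 0 lets an alignment restart at any cell.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero entry is reached.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]


if __name__ == "__main__":
    print(smith_waterman("xxACGTyy", "zzzACGTw"))
```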
BWTSW • Motivation • Scoring 75% similarity • Most entries of the local alignment table are zero • Meaningful alignment • Suffix tree • Meaningful alignment • Meaningful alignment with gap • How good is it?
Meaningful alignment (1) • Sequence similarity sometimes implies functional similarity. • Biologists are usually NOT interested in sequences with less than 70% similarity. • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2
Meaningful alignment (2) • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2 • At least 70% of the positions must match to obtain a nonzero score (without gaps, Match = 1 and Mismatch = -3 require more than 75% matches for a positive score)
Meaningful alignment (3) • BLAST score • Match = 1 • Mismatch = -3 • Open Gap = -5 • Extending Gap = -2 • How many nonzero entries are there in the local alignment DP table?
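One way to get a feeling for the question is to fill the table and count. The sketch below uses a Gotoh-style affine-gap recurrence with the BLAST scores listed on the slide. It assumes that a gap of length k costs Open Gap + k * Extending Gap (the slide does not state the convention), and the function name is hypothetical.

```python
NEG_INF = float("-inf")


def count_positive_entries(text, pattern, match=1, mismatch=-3,
                           gap_open=-5, gap_extend=-2):
    """Return (positive_entries, best_score, total_entries) of the local DP table.

    Assumption: a gap of length k costs gap_open + k * gap_extend.
    """
    n, m = len(text), len(pattern)
    H_prev = [0] * (m + 1)        # best local score ending at (i-1, j)
    F_prev = [NEG_INF] * (m + 1)  # best score ending with a gap in the pattern
    positive, best = 0, 0
    for i in range(1, n + 1):
        H = [0] * (m + 1)
        F = [NEG_INF] * (m + 1)
        E = NEG_INF               # best score ending with a gap in the text
        for j in range(1, m + 1):
            E = max(E + gap_extend, H[j - 1] + gap_open + gap_extend)
            F[j] = max(F_prev[j] + gap_extend, H_prev[j] + gap_open + gap_extend)
            s = match if text[i - 1] == pattern[j - 1] else mismatch
            H[j] = max(0, H_prev[j - 1] + s, E, F[j])
            if H[j] > 0:
                positive += 1
            best = max(best, H[j])
        H_prev, F_prev = H, F
    return positive, best, n * m


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    text = "".join(rng.choice("ACGT") for _ in range(500))
    pattern = "".join(rng.choice("ACGT") for _ in range(100))
    positive, best, total = count_positive_entries(text, pattern)
    print(positive, best, total)
```

On random DNA the positive entries should be a small fraction of the table, which is the motivation for not storing the zero entries.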
How to improve? • Idea: • Not storing zero score entries • Using suffix tree to prune off early
BWTSW details • FM index for suffix tree representation • Prune zero entries • Store DP vector using linked list
Analysis • Text length = N • Pattern length = M • Alphabet size = σ
Average running time (1) • Let F(L) be the number of pairs of strings of length L for which Score(S1,S2) > 0 • F(L) = Sizeof{(S1,S2) : Len(S1)=Len(S2)=L, Score(S1,S2)>0} • F(L) counts the number of pairs with at least 75% identity. • F(L) = sum(i=0..L/4, Binomial(L,i) * (σ-1)^i) • F(L) ≤ k1 * k2^L • F(log(N)) ≤ k3 * N^0.68
Average running time (2) • Given S1, Pr(Score(S1,S2) > 0 | S1) = F(L)/σ^L • For M < log(N) • The number of entries is • O(M * F(M)) < O(log(N) * F(log(N))) • For M > log(N) • O(M * N * F(M) / σ^M) • On average • Time = O(M * F(log(N))) = O(M * N^0.68)
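The counting function can be evaluated directly. The sketch below computes the sum with exact integer arithmetic, assuming a DNA alphabet of size 4 by default; the parameter name `sigma` and the helper names are assumptions for illustration. It also assumes that the conditional probability is normalized by the number of strings of length L, that is `sigma ** L`. Note that the term i = L/4 corresponds to a score of exactly zero under Match = 1 and Mismatch = -3, so the sum as written is a slight overcount of the strictly positive case.

```python
from math import comb, log


def F(L, sigma=4):
    """Sum over i = 0..floor(L/4) of Binomial(L, i) * (sigma - 1)^i."""
    return sum(comb(L, i) * (sigma - 1) ** i for i in range(L // 4 + 1))


def positive_score_probability(L, sigma=4):
    """F(L) divided by the number of strings of length L (assumed normalization)."""
    return F(L, sigma) / sigma ** L


def growth_exponent(L, sigma=4):
    """Exponent e such that F(L) = (sigma^L)^e, so F(log_sigma N) = N^e."""
    return log(F(L, sigma)) / (L * log(sigma))


if __name__ == "__main__":
    for L in (8, 16, 32, 64, 128):
        print(L, F(L), positive_score_probability(L), growth_exponent(L))
```

Printing the growth exponent for increasing L gives a quick numerical check of how F(log(N)) scales as a power of N.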
Possible improvement of BWTSW • Worst case running time O(N^2 M) • When M=N • O(M N^0.68 + M^3) when the pattern is a substring of the text • What about ST vs. ST?
What we used in BWTSW is Suffix Trie (not suffix tree). • #Prove it# • Suffix trie has O(N^2) nodes • DAWG is a similar structure with O(N) nodes
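The quadratic size of the suffix trie is easy to observe on small inputs. The sketch below builds an uncompressed suffix trie from nested dictionaries and counts its nodes; the function name is hypothetical. Each node other than the root corresponds to one distinct substring of the text.

```python
def suffix_trie_node_count(text):
    """Number of nodes (including the root) in the uncompressed suffix trie."""
    root = {}
    count = 1  # the root
    for start in range(len(text)):
        node = root
        # Insert the suffix text[start:] one character at a time.
        for ch in text[start:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        w = "a" * n + "b" * n
        # For a^n b^n the count is (n + 1)^2, quadratic in the text length 2n.
        print(len(w), suffix_trie_node_count(w))
```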
DAWG (2) • DAWG: Directed Acyclic Word Graph • DAWG is an acyclic automaton that recognizes all the substrings of the given string.
DAWG (3) • Example: • DAWG of “abcbc”
DAWG (4) • End-set view
Trivial DAWG construction • Using End-set class
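A direct, non-optimized reading of the end-set construction is sketched below: two substrings belong to the same state when they end at the same set of positions in w, and there is an edge labeled c from the class of x to the class of xc. The function names and the representation of states as frozensets of end positions are assumptions for the example. The construction enumerates all substrings, so it is only meant for small inputs.

```python
from collections import defaultdict


def dawg_by_end_sets(w):
    """Return (root, states, transitions) of the DAWG of w via end-set classes."""
    n = len(w)
    ends = defaultdict(set)
    ends[""] = set(range(n + 1))  # the empty string ends at every position
    for i in range(n):
        for j in range(i + 1, n + 1):
            ends[w[i:j]].add(j)   # substring w[i:j] ends at position j
    # A state is the end-set shared by all substrings in the class.
    state_of = {sub: frozenset(e) for sub, e in ends.items()}
    transitions = {}
    for sub, state in state_of.items():
        if sub:
            # Edge from the class of sub[:-1] to the class of sub.
            transitions[(state_of[sub[:-1]], sub[-1])] = state
    return state_of[""], set(state_of.values()), transitions


def recognizes(root, transitions, s):
    """True if s can be spelled along a path from the root, i.e. s is a substring."""
    state = root
    for ch in s:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return True


if __name__ == "__main__":
    root, states, transitions = dawg_by_end_sets("abcbc")
    print(len(states), len(transitions))
    print(recognizes(root, transitions, "cbc"), recognizes(root, transitions, "cc"))
```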
DAWG properties • For |w|>2, the Directed Acyclic Word Graph for w has at most 2|w|-1 states, and 3|w|-4 edges
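The size bounds can be checked empirically. The sketch below uses the standard online suffix-automaton construction, which is a swapped-in technique not described on the slides, and it assumes that its states coincide with the end-set classes of the DAWG. All function names are hypothetical.

```python
def build_suffix_automaton(w):
    """Online construction; returns the list of transition dicts, one per state."""
    nxt, link, length = [{}], [-1], [0]  # state 0 is the initial state
    last = 0
    for ch in w:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings of q's class.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt


def dawg_size(w):
    """Return (number of states, number of edges)."""
    nxt = build_suffix_automaton(w)
    return len(nxt), sum(len(t) for t in nxt)


def is_substring(nxt, s):
    """Walk the automaton from the initial state; success means s occurs in w."""
    state = 0
    for ch in s:
        if ch not in nxt[state]:
            return False
        state = nxt[state][ch]
    return True


if __name__ == "__main__":
    for w in ("abcbc", "abbb", "abbbc"):
        states, edges = dawg_size(w)
        print(w, states, 2 * len(w) - 1, edges, 3 * len(w) - 4)
```

Each printed row shows the measured state and edge counts next to the bounds 2|w|-1 and 3|w|-4.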
D(w) and ST(w^R) • There is a map between the nodes of the DAWG D(w) and the nodes of the implicit ST(w^R) • Example: w=abcbc, w^R=cbcba • Store DAWG using ST, which uses only o(N) bits • [Figure: suffix tree ST(w^R) for w^R=cbcba]
D(w) and ST(w^R) (2) • List all incoming edges of node q in D(w) using ST(w^R)
Local Alignment using DAWG • Basis • Induction
Extensions • Meaningful alignment using DAWG • Prune the nodes whose Score is less than zero • Shortest path pruning style • Cache log(N) nodes: the worst-case running time is O(M*N*log(N)); the average case is the same for M << N.