Discover missing patterns in a given text using suffix trees and sparse suffix trees. Specifically, find a pair of patterns of minimum total length that never occur close to each other in the text.
Shunsuke Inenaga Teemu Kivioja Veli Mäkinen Department of Computer Science, University of Helsinki, Finland
Finding Missing Pattern • Finding Missing Pattern • Definition • How to Find “Missing” Pattern? • Algorithm with Suffix Tree • Finding Missing Pattern Pair • Finding Missing Pattern Pair of Same Length • Experiments
Finding Missing Pattern
Input: text T
Query: Find the shortest pattern A ∈ S* which does not appear in T.
S: set of characters in T
Example text: For a decade, pattern discovery has played a central role in bioinformatics. Especially extracting surprising and useful patterns is a core of knowledge discovery from textual data. One extreme example of surprising patterns is missing patterns, namely, patterns that do not appear in a given text T are to be discovered. Amir et al. introduced a generalized version of the missing pattern problem in such a way that pattern P may ‘approximately’ occur in T. They call this problem the inverse pattern matching problem. Some improvements for discovering inverse patterns appeared in literature. Another related work is the farthest substring problem by Lanctot et al., where a set of text strings is considered as input.
HOW?
Example string: “WABISABI”, a keyword for beauty in classical Japanese culture.
Using Suffix Tree T: WABISABI$ (positions 1 to 9), S = {A, B, I, S, W}. [Figure: suffix tree of WABISABI$, with leaves labeled by the start positions 1 to 9 of the suffixes.]
Using Suffix Tree T: WABISABI$, S = {A, B, I, S, W}. [Figure: the same suffix tree, annotated with the characters of S that never follow each single character: after A: A, I, S, W; after B: A, B, S, W; after I: A, B, I, W; after S: B, I, S, W; after W: B, I, S, W.] Shortest missing patterns: {AA, AI, AS, AW, BA, BB, BS, BW, IA, IB, II, IW, SB, SI, SS, SW, WB, WI, WS, WW}
Complexities • Suffix Trees can be constructed in O(n) time and space. • Missing patterns can be found in O(n) time and space.
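The search for a shortest missing pattern can be illustrated with a minimal sketch. It deliberately swaps the suffix tree for a plain set of substrings, so it does not achieve the O(n) bound stated above; the function name `shortest_missing_pattern` is made up for this illustration.

```python
from itertools import product


def shortest_missing_pattern(text, alphabet=None):
    """Return one shortest string over the alphabet that is not a substring of text.

    Brute-force stand-in for the suffix-tree search: it tries lengths
    k = 1, 2, ... and compares all strings of length k against the set
    of substrings of length k that occur in the text.
    """
    if alphabet is None:
        alphabet = sorted(set(text))  # S: the characters that occur in T
    if not alphabet:
        raise ValueError("the alphabet must not be empty")
    k = 1
    while True:
        present = {text[i:i + k] for i in range(len(text) - k + 1)}
        for letters in product(alphabet, repeat=k):
            candidate = "".join(letters)
            if candidate not in present:
                return candidate
        k += 1


print(shortest_missing_pattern("WABISABI"))  # AA
```

For WABISABI every single character occurs, so the answer has length 2, in line with the 20 missing patterns of length 2 listed on the example slide.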
Finding Missing Pattern Pair • Finding Missing Pattern • Finding Missing Pattern Pair • Definition • Biological Motivation – Nested PCR • Algorithm with Sparse Suffix Tree • Improved Algorithm • Finding Missing Pattern Pair of Same Length • Experiments
Finding Missing Pattern “Pair” Input: text T. Query: Find a pair of patterns (A, B) ∈ S* × S* which does not appear a-close in T, and for which |A| + |B| is minimum. S: set of characters in T. Patterns A and B are said to be a-close (w.r.t. T) if some occurrence of A and some occurrence of B lie at distance at most a from each other in T.
If A and B are not a-close, (A, B) is said to be a missing pair. [Figure: three examples of occurrences of A and B in T, with a window of a positions on each side. Case 1: a-close. Case 2: not a-close. Case 3: not a-close.]
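The definition can be checked directly on small inputs. The sketch below assumes that distance is measured between the start positions of the occurrences, which matches the later definition of Zone(A) but is an assumption at this point in the slides; the function names are illustrative only.

```python
def occurrences(text, pattern):
    """1-based start positions of all (possibly overlapping) occurrences."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def is_a_close(text, A, B, a):
    """True if some occurrence of A and some occurrence of B start within distance a.

    Assumption: distance is the difference of start positions.
    """
    occ_b = occurrences(text, B)
    return any(abs(p - q) <= a for p in occurrences(text, A) for q in occ_b)


def is_missing_pair(text, A, B, a):
    """(A, B) is a missing pair if A and B are not a-close."""
    return not is_a_close(text, A, B, a)


print(is_a_close("WABISABI", "W", "S", 4))       # True: positions 1 and 5
print(is_missing_pair("WABISABI", "W", "S", 3))  # True: distance 4 exceeds 3
```

A pattern that does not occur at all forms a missing pair with any other pattern, since there is no occurrence that could be close.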
Biological Motivation PCR (Polymerase Chain Reaction) • Standard technique to produce many copies of a region of DNA (can be a tiny sample). • In Medicine, to detect infections. • In Forensic Science, to identify individuals.
Nested PCR • Repeated PCR with nested primers • Achieving ultrasensitive detection • Good adapter primers for nested PCR: bind only to the adapters, and amplify nothing directly from the samples!
Nested PCR [Figure: sample strand S (5’ to 3’) and its complement S’ (3’ to 5’), with an adapter at each end, the left and right specific primers, and a distance marked a.] • We want a pair of good adapter primers which amplify nothing directly from S or S’. (Adapter primers are complements to adapters.) • If (A,B) is a missing pair in SS’, then (A’,B) is not a pair of binding sites for any region of length less than a. (A’,B): good adapter primers!!
Algorithm
  Build suffix tree for T;
  Construct set P of candidates for A;
  For each pattern A in P:
    Build sparse suffix tree (SST) on Zone(A);
    B := shortest missing pattern in SST;
  Return pair (A, B) of minimum total length.
Example of set P for A [Figure: suffix tree of WABISABI$.] P = {A, B, I, S, W, ABIS, BIS, IS}, |P| ≤ 2n + 1
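The example set P appears to contain one pattern per suffix-tree edge: the shortest string ending on that edge. All strings on one edge share the same set of occurrences, so they also share the same Zone. The sketch below reproduces P under that reading by grouping substrings by their start positions instead of building the tree; this grouping and the name `candidate_set` are assumptions made for illustration, and the approach is far slower than a suffix tree.

```python
from collections import defaultdict


def candidate_set(text):
    """Shortest substring for each distinct set of start positions.

    Substrings with identical occurrence sets lie on the same
    suffix-tree edge; the shortest of them is kept as the candidate.
    """
    n = len(text)
    occ = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[text[i:j]].append(i + 1)  # 1-based start position
    shortest = {}
    for sub, positions in occ.items():
        key = tuple(positions)
        if key not in shortest or len(sub) < len(shortest[key]):
            shortest[key] = sub
    return sorted(shortest.values(), key=lambda s: (len(s), s))


print(candidate_set("WABISABI"))
# ['A', 'B', 'I', 'S', 'W', 'IS', 'BIS', 'ABIS']
```

For WABISABI this yields the same eight patterns as the example slide.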
Sparse Suffix Tree
• Suffix Tree (ST): represents the set of all suffixes of text T.
• Sparse Suffix Tree (SST): represents a subset of suffixes of text T. (Kärkkäinen and Ukkonen ’96, Andersson et al. ’99)
Suffixes of T = WABISABI$ and their start positions: WABISABI$: 1, ABISABI$: 2, BISABI$: 3, ISABI$: 4, SABI$: 5, ABI$: 6, BI$: 7, I$: 8, $: 9, ε: 10
Sparse Suffix Tree [Figure: sparse suffix tree for a subset of the suffixes of WABISABI$ (start positions 1 to 10, where 10 is the empty suffix ε).]
Zone(A) = {j : p - a ≤ j ≤ p + a, p ∈ Occ(A)}. A = label(root, v)c, where v is a node of ST(T) and c is a character on an edge leaving v. [Figure: a window of a positions on each side of every occurrence of A in T; together these windows form Zone(A).]
Build SST on Zone(A) • B = label(root, u)d • u: minimum-depth node (possibly implicit) of the SST on Zone(A) at which some character of S is unused • d: any character in S that is unused in outgoing edges from u.
Complexity • Zone(A) for each A: O(n) time. • SST for each A: O(n) time by pruning ST(T). • Second pattern B for each A: O(n) time. • We have O(n) candidates for A. • Total: O(n^2) time & O(n) space.
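The overall loop (Zone(A), then the shortest pattern missing from the zone, then the best pair) can be sketched without any tree. This is a brute-force illustration under stated assumptions, not the O(n^2) suffix-tree algorithm: a set of substrings stands in for the SST, every distinct substring is tried as A rather than only the set P, and all function names are hypothetical.

```python
from itertools import product


def occurrences(text, pattern):
    """1-based start positions of all occurrences of pattern in text."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def zone(text, A, a):
    """Zone(A): positions j with p - a <= j <= p + a for some p in Occ(A)."""
    n = len(text)
    positions = set()
    for p in occurrences(text, A):
        positions.update(range(max(1, p - a), min(n, p + a) + 1))
    return positions


def shortest_missing_in_zone(text, positions, alphabet):
    """Shortest string that starts at no position of the zone.

    Stand-in for the SST step: a set of substrings replaces the tree.
    """
    n = len(text)
    k = 1
    while True:
        present = {text[j - 1:j - 1 + k] for j in positions if j - 1 + k <= n}
        for letters in product(alphabet, repeat=k):
            candidate = "".join(letters)
            if candidate not in present:
                return candidate
        k += 1


def best_missing_pair(text, a):
    """Pair (A, B) of minimum total length with B missing from Zone(A)."""
    alphabet = sorted(set(text))
    n = len(text)
    # Simplification: every distinct substring is a candidate for A.
    # The slides restrict this to the smaller set P.
    candidates = sorted({text[i:j] for i in range(n) for j in range(i + 1, n + 1)},
                        key=lambda sub: (len(sub), sub))
    best = None
    for A in candidates:
        B = shortest_missing_in_zone(text, zone(text, A, a), alphabet)
        if best is None or len(A) + len(B) < len(best[0]) + len(best[1]):
            best = (A, B)
    return best


print(best_missing_pair("WABISABI", 1))  # ('A', 'I')
```

With a = 1, A occurs at positions 2 and 6 and I at positions 4 and 8, so the two patterns are never within distance 1 of each other.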
Improved Algorithm • Lemma: If s^k > n, then there always exists a single missing pattern of length k (s = |S|). • We can restrict to patterns of length k ≈ log_s n.
Improved Algorithm • For a node v of ST(T), length(v) = k ≈ log_s n. • path(root, v) has at most k patterns for A (for Zone(A)). • A position p of T can belong to at most 2a + 1 different leaves (Zones). • p can belong to at most (2a + 1) log_s n different SSTs. • Total time & space: O(a n log_s n)
Space Reduction But O(a n log_s n) space is too much!! OK, we can reduce it to O(n log_s n).
Incremental Construction of SST • In the DFS on ST(T), we keep at most k ≈ log_s n SSTs. • SST_u can be reconstructed from SST_v in linear time. • Each SST occupies O(n) space, so O(n log_s n) space in total. [Figure: nodes v and u of ST(T) with their sparse suffix trees SST_v and SST_u.]
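The lemma follows from counting: a text of length n contains at most n substrings of length k, while there are s^k strings of length k over S, so one of them must be missing once s^k > n. A small helper (hypothetical name, assuming s ≥ 2) finds the smallest such k:

```python
def lemma_length_bound(n, s):
    """Smallest k with s**k > n, for alphabet size s >= 2."""
    if s < 2:
        raise ValueError("alphabet size must be at least 2")
    k, power = 0, 1
    while power <= n:
        power *= s
        k += 1
    return k


print(lemma_length_bound(8, 5))      # 2: WABISABI has n = 8, s = 5
print(lemma_length_bound(10**6, 4))  # 10: a DNA-like alphabet
```

For WABISABI the bound is 2, which agrees with the shortest missing patterns of length 2 in the earlier example.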
Finding Missing Pattern Pair • Finding Missing Pattern • Finding Missing Pattern Pair • Finding Missing Pattern Pair of Same Length • Restricted Case • Directly Motivated by Nested PCR • Algorithm with Bit Table • Experiments
For Patterns of Same Length • Assume |A| = |B|. • Directly motivated by the nested PCR application. • Previous algorithms can easily be adapted to this case. • Any more efficient solutions in this special case?
Using Bit Tables • Use a bit table of size s^(2k). • Mark all entries that correspond to a-close pattern pairs in T. • An unmarked entry corresponds to a missing pair. • Running time: O(s^(2k) + a n) for a fixed k. • For k = 1, 2, …: O(s^2 a n + a n log log_s n) time. • Result: O(a n log log_s n) time.
Summary of Results (on constant alphabet)
              time                space
Algorithm 1   O(n^2)              O(n)
Algorithm 2   O(a n log n)        O(n log n)
Algorithm 3   O(a n log log n)    O(a n)
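For a fixed k, the table-marking step might look like the sketch below. It is an illustration under assumptions, not the implementation used in the experiments: one byte per entry is used instead of one bit, distance is measured between start positions, and the function name is made up.

```python
def missing_pairs_same_length(text, k, a, alphabet=None):
    """All pairs (A, B) with |A| = |B| = k that are not a-close in text."""
    if alphabet is None:
        alphabet = sorted(set(text))
    s = len(alphabet)
    code = {c: i for i, c in enumerate(alphabet)}

    # Integer code (base s) of the k-mer starting at each position.
    kmers = []
    for i in range(len(text) - k + 1):
        value = 0
        for c in text[i:i + k]:
            value = value * s + code[c]
        kmers.append(value)

    size = s ** k
    # One byte per entry for simplicity; a real bit table would pack 8 entries per byte.
    table = bytearray(size * size)
    for p, x in enumerate(kmers):
        lo, hi = max(0, p - a), min(len(kmers) - 1, p + a)
        for q in range(lo, hi + 1):
            table[x * size + kmers[q]] = 1  # mark the a-close pair

    def decode(value):
        letters = []
        for _ in range(k):
            value, r = divmod(value, s)
            letters.append(alphabet[r])
        return "".join(reversed(letters))

    return [(decode(i // size), decode(i % size))
            for i, marked in enumerate(table) if not marked]


pairs = missing_pairs_same_length("WABISABI", k=1, a=1)
print(len(pairs))           # 10
print(("A", "I") in pairs)  # True
```

Marking costs O(a n) table writes and the final scan costs O(s^(2k)), which mirrors the running time stated for a fixed k.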
Finding Missing Pattern Pair • Finding Missing Pattern • Finding Missing Pattern Pair • Finding Missing Pattern Pair of Same Length • Experiments • Implemented Bit Table Algorithm • With Yeast Genome
Preliminary Experiments • Implemented the bit-table algorithm. • Tested with yeast genome. • Set a = 5000 (realistic value) • Found missing pattern pairs with k = 8. • Ultimate goal: Human Genome!! (feasible, but needs much time and space)
Thanks to… K. Kataja & R. Satokari (VTT Biotechnology) Juha Kärkkäinen (University of Helsinki) Jens Stoye, Sven Rahmann, Sebastien Böcker, & Matthias Steinrucken (Bielefeld University)