Suffix Trees
Purpose
• Given a (very long) text R, preprocess it so that, once a query text P is given, we can efficiently decide whether P appears in R.
• (Later: also where P appears in R.)
• Example: R=“HelloWorldWhatANiceDay”
  • IsIn(“World”) = YES
  • IsIn(“Word”) = NO
  • IsIn(“l”) = YES (note: appears more than once)
Definition: A suffix
• For a word R, a suffix is what is left of R after deleting its first few characters (possibly none).
• All the non-empty suffixes of R=“Hello”:
  • Hello
  • ello
  • llo
  • lo
  • o
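A minimal C sketch of the definition above. The helper names suffix_at and print_suffixes are made up for illustration. In C the suffix starting at index k needs no copying, because it is the pointer R + k and shares R's terminating '\0'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the suffix of r that starts at index k.
   It is just a pointer into r, so no characters are copied. */
const char *suffix_at(const char *r, size_t k) {
    assert(k <= strlen(r));   /* k == strlen(r) gives the empty suffix */
    return r + k;
}

/* Print every non-empty suffix of r, longest first. */
void print_suffixes(const char *r) {
    size_t n = strlen(r);
    for (size_t k = 0; k < n; k++)
        printf("%s\n", suffix_at(r, k));
}
```

Calling print_suffixes("Hello") lists the five suffixes from the slide, one per line.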
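Alg for answering IsIn
• Preprocessing:
  • Create an empty trie T.
  • Given R=“HelloWorldWhatANiceDay”, insert into T all suffixes of R.
• Answering IsIn(P):
  • Just check whether P is in T.
  • That is, return find(P).
  • (Here, find is as studied in the lecture on tries.)
• Why this works: P appears in R exactly when P is a prefix of some suffix of R. So find(P) should succeed whenever the path spelling P exists in T, even if that path does not end at the end of an inserted word.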
Example
• R=“hello”. Suffixes: “hello”, “ello”, “llo”, “lo”, “o”.
• Example query: P=“ll”.
• [Figure: the trie T containing all suffixes of “hello”]
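The IsIn algorithm can be sketched in C as follows. This is only an illustration under stated assumptions: the node layout (an array ar with one slot per byte value) is a guess at the trie from the lecture, and the names new_node, insert, build_suffix_trie and is_in are not from the slides. Here is_in plays the role of find, succeeding whenever the path spelling P exists.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256   /* one child slot per byte value (assumption) */

typedef struct NODE {
    struct NODE *ar[ALPHABET];   /* ar[c] = child reached by character c, or NULL */
} NODE;

static NODE *new_node(void) {
    /* calloc zero-fills the node; all-zero bits are NULL on common platforms */
    NODE *p = calloc(1, sizeof *p);
    assert(p != NULL);
    return p;
}

/* Insert the string s, creating nodes along its path as needed. */
static void insert(NODE *root, const char *s) {
    NODE *p = root;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (p->ar[c] == NULL)
            p->ar[c] = new_node();
        p = p->ar[c];
    }
}

/* Preprocessing: build a trie holding every suffix of r. */
NODE *build_suffix_trie(const char *r) {
    NODE *root = new_node();
    for (size_t k = 0; r[k] != '\0'; k++)
        insert(root, r + k);
    return root;
}

/* Return 1 if the path spelling p exists in the trie (p is a prefix of
   some suffix of r, i.e. p appears in r), and 0 otherwise. */
int is_in(const NODE *root, const char *p) {
    const NODE *v = root;
    for (; *p != '\0'; p++) {
        v = v->ar[(unsigned char)*p];
        if (v == NULL)
            return 0;
    }
    return 1;
}
```

With R=“hello”, is_in(T, "ll") follows the edges l, l from the root and returns 1. The sketch never frees the trie, which is acceptable for a demo only.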
Let's get greedy
• Given a (very long) text R, preprocess it so that, once a query text P is given, we can efficiently find the location of P in R (if it appears at all).
• More specifically, report the index at which P starts in R (indices start at 0).
• (If there is more than one answer, report the last one.)
• Example: R=“HelloWorldWhatANiceDay”
  • Where(“World”) = 5, since “World” appears starting at index 5 in R.
  • Where(“Word”) = NoWhere
  • Where(“l”) = 8 (“l” also appears in other places)
Alg for answering Where
• Modify the trie so that each node also contains a field b_inx.
• When inserting into the trie a word s whose first character is at index k of R, set b_inx = k in every node along the insertion path.
• If the suffixes are inserted in order of increasing k, later insertions overwrite b_inx, so every node ends up holding the last index at which its string starts.
• Preprocessing:
  • Create an empty trie T.
  • Given R=“HelloWorldWhatANiceDay”, insert into T all suffixes of R.
• Answering Where(P):
  • Just check whether P is in T.
  • That is, run find(P) and, if it succeeds, return the value of b_inx in the node where the search terminates.
  • (Here, find is as studied in the lecture on tries.)
• The resulting data structure is called an Uncompressed Suffix Tree.
Example
• R=“hello”. Suffixes: “hello”, “ello”, “llo”, “lo”, “o”.
• Example query: P=“ll”.
• [Figure: the uncompressed suffix tree for “hello”, with each node labeled by its b_inx value]
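A C sketch of the Where algorithm. It assumes the same array-based node layout as a plain trie, plus the b_inx field. The NOWHERE sentinel and the function names are hypothetical. It also assumes suffixes are inserted in order of increasing k with each insertion overwriting b_inx, which is one way to get the "report the last one" behaviour.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256    /* one child slot per byte value (assumption) */
#define NOWHERE  (-1)   /* hypothetical sentinel for "P does not appear" */

typedef struct NODE {
    struct NODE *ar[ALPHABET];
    int b_inx;   /* start index in R of the last suffix inserted through this node */
} NODE;

static NODE *new_node(void) {
    NODE *p = calloc(1, sizeof *p);   /* zero-filled: no children yet */
    assert(p != NULL);
    p->b_inx = NOWHERE;
    return p;
}

/* Insert the suffix of r starting at index k, writing k into every
   node on the insertion path (overwriting earlier values). */
static void insert_suffix(NODE *root, const char *r, int k) {
    NODE *p = root;
    for (const char *s = r + k; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (p->ar[c] == NULL)
            p->ar[c] = new_node();
        p = p->ar[c];
        p->b_inx = k;
    }
}

/* Preprocessing: insert all suffixes, smallest start index first. */
NODE *build_suffix_trie(const char *r) {
    NODE *root = new_node();
    for (int k = 0; r[k] != '\0'; k++)
        insert_suffix(root, r, k);
    return root;
}

/* Where(P): b_inx of the node where the search ends, or NOWHERE.
   The empty pattern ends at the root, whose b_inx is NOWHERE. */
int where(const NODE *root, const char *p) {
    const NODE *v = root;
    for (; *p != '\0'; p++) {
        v = v->ar[(unsigned char)*p];
        if (v == NULL)
            return NOWHERE;
    }
    return v->b_inx;
}
```

For R=“hello”, the query P=“ll” ends at a node that only the suffix “llo” passes through, so where returns 2.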
So much memory?
• The problem with this data structure comes from long paths: sequences of nodes in which every node except the last has a single child, and all nodes have the same value of b_inx.
• [Figure: the uncompressed suffix tree for “hello” with its b_inx labels, illustrating such a path]
More examples of paths
• [Figure: further examples of such paths in the tree]
Solution
• Recall that all strings in the tree are suffixes of the same text R.
• Add two new fields to each node, called c_idx and lng, such that if lng>0 then, when computing the string of that node, we need to concatenate lng chars from R starting at position c_idx.
• Example from the figure: R=“hello” (indices 0 to 4). The path h, e, l, l, o, whose nodes all have b_inx=0, is collapsed, and the node that remains gets c_idx=1, lng=4, which stands for the 4 chars R[1..4]=“ello”.
• [Figure: the suffix tree for “hello” before and after collapsing the path h, e, l, l, o]
Compressing the tree
• Assume we are visiting a node v of the tree whose distance (number of edges) from the root in the uncompressed trie is k.
• Also assume that v is the first node on a path.
• Then c_idx = b_inx + k: the k chars leading to v are R[b_inx], ..., R[b_inx+k-1], so the first char skipped by the path is R[b_inx+k].
• So the function compress_tree should ‘know’ the distance from the root (in the uncompressed tree) of the visited node.
• Need a function compress_tree that accepts a node v of the tree and the depth of v in the uncompressed tree.
• Also need a function check_path(NODE *p) returning the length (in number of edges) of the path starting at *p. For example, if *p has two children, it returns 0.
Compressing the tree (cont.)
• compress_tree( NODE * p, int depth ){
  • for each non-NULL cell ar[i] of *p
    • if ( (h = check_path( p->ar[i] )) > 0 ){
      • Let q be a pointer to the node at the end of the path, let h be the length of the path, and let d = depth+1+h be the depth of q (in the uncompressed tree). All of q, h and d should be obtained from check_path (think how).
      • Set p->ar[i] = q
      • Free unused nodes
      • q->c_idx = q->b_inx + depth + 1
      • q->lng = h
      • compress_tree( q, d )
    • } else compress_tree( p->ar[i], depth+1 )
• }
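One possible C rendering of check_path and compress_tree, offered as a sketch rather than the intended solution. Assumptions not in the slides: check_path gets an extra output parameter end (one answer to "think how"); a path continues only while a node has exactly one child with the same b_inx; only_child, NOWHERE and the compressed query where are hypothetical names. Children that start no path are still visited, with depth+1.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 256   /* one child slot per byte value (assumption) */
#define NOWHERE  (-1)  /* hypothetical sentinel for "P does not appear" */

typedef struct NODE {
    struct NODE *ar[ALPHABET];
    int b_inx;         /* start index in R of a suffix through this node */
    int c_idx, lng;    /* skipped chars are R[c_idx .. c_idx+lng-1] */
} NODE;

static NODE *new_node(void) {
    NODE *p = calloc(1, sizeof *p);   /* zero-filled: no children, lng = 0 */
    assert(p != NULL);
    p->b_inx = NOWHERE;
    return p;
}

/* Uncompressed suffix tree: insert suffixes in increasing k, stamping b_inx. */
NODE *build_suffix_trie(const char *r) {
    NODE *root = new_node();
    for (int k = 0; r[k] != '\0'; k++) {
        NODE *p = root;
        for (const char *s = r + k; *s != '\0'; s++) {
            unsigned char c = (unsigned char)*s;
            if (p->ar[c] == NULL) p->ar[c] = new_node();
            p = p->ar[c];
            p->b_inx = k;
        }
    }
    return root;
}

/* The single child of p, or NULL if p has zero or several children. */
static NODE *only_child(const NODE *p) {
    NODE *found = NULL;
    for (int c = 0; c < ALPHABET; c++) {
        if (p->ar[c] == NULL) continue;
        if (found != NULL) return NULL;
        found = p->ar[c];
    }
    return found;
}

/* Length in edges of the path starting at p; *end receives its last node. */
int check_path(NODE *p, NODE **end) {
    int h = 0;
    for (NODE *next; (next = only_child(p)) != NULL && next->b_inx == p->b_inx; h++)
        p = next;
    *end = p;
    return h;
}

/* p sits at distance `depth` from the root in the uncompressed tree. */
void compress_tree(NODE *p, int depth) {
    for (int c = 0; c < ALPHABET; c++) {
        NODE *child = p->ar[c], *q;
        if (child == NULL) continue;
        int h = check_path(child, &q);
        if (h > 0) {
            for (NODE *u = child; u != q; ) {   /* free the h skipped nodes */
                NODE *next = only_child(u);
                free(u);
                u = next;
            }
            p->ar[c] = q;
            q->c_idx = q->b_inx + depth + 1;
            q->lng = h;
        }
        compress_tree(q, depth + 1 + h);        /* q == child when h == 0 */
    }
}

/* Where(P) on the compressed tree; r must be the text the tree was built from. */
int where(const NODE *root, const char *r, const char *p) {
    const NODE *v = root;
    while (*p != '\0') {
        v = v->ar[(unsigned char)*p++];
        if (v == NULL) return NOWHERE;
        for (int j = 0; j < v->lng; j++, p++) {
            if (*p == '\0') return v->b_inx;    /* P ends inside the path */
            if (*p != r[v->c_idx + j]) return NOWHERE;
        }
    }
    return v->b_inx;
}
```

For R=“hello”, calling compress_tree(root, 0) leaves the node under the h slot with c_idx=1 and lng=4, the values given for this path in the Solution slide. The query has to consult R, because the skipped characters are no longer stored in the tree.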
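How large is the tree now?
• Lemma: If T is a tree with no node of degree 1 (that is, no node with exactly one child), then the number of nodes is O(number-of-leaves).
• In our scenario, number-of-leaves ≤ |R|, since every leaf is the end of a suffix of R.
• So the size of the trie (its number of nodes) is O(|R|).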
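A short counting argument for the lemma, as a sketch. It reads "degree 1" as "exactly one child", so every internal node has at least two children. The symbols N (number of nodes), I (internal nodes) and L (leaves) are introduced here and do not appear in the slides.

```latex
\begin{align*}
N &= I + L, \\
N - 1 &= \#\text{edges}
       = \sum_{v \text{ internal}} \#\text{children}(v) \;\ge\; 2I, \\
\Rightarrow\quad I &\le L - 1
       \qquad (\text{since } I + L - 1 \ge 2I), \\
\Rightarrow\quad N &= I + L \;\le\; 2L - 1 = O(L).
\end{align*}
```

With L ≤ |R| this gives at most 2|R| − 1 nodes.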