Space-Efficient Data Structures for Top-k Completion. 蔡晓华.
Outline • Motivation • Completion Trie • RMQ Trie • Score-Decomposed Trie • Q & A
Motivation: focus on the case where the string set is so large that compression is needed to fit the data structure in memory. Contribution: present three different trie-based data structures to address this problem.
Definition: scored string set. The completion suggestions are drawn from a set of strings, each associated with a score. We call such a set a scored string set. Problem: top-k completion. Given a string p and an integer k, a top-k completion query in the scored string set S returns the k highest-scored pairs in Sp, the subset of S whose strings are prefixed by p.
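To make the problem statement concrete, here is a minimal brute-force sketch of a top-k completion query. It is only a reference definition, not one of the data structures presented in the talk; the function name `top_k_bruteforce` and the dict-based representation of S are assumptions made for illustration.

```python
import heapq

def top_k_bruteforce(scored, p, k):
    """Reference answer for a top-k completion query.

    scored: dict mapping each string to its score (hypothetical representation of S).
    Returns up to k (string, score) pairs whose string starts with p,
    in order of decreasing score.
    """
    # S_p: every pair whose string is prefixed by p.
    matches = [(s, r) for s, r in scored.items() if s.startswith(p)]
    # Keep only the k highest-scored pairs.
    return heapq.nlargest(k, matches, key=lambda pair: pair[1])
```

This scans the whole set on every query, which is exactly the cost the trie-based structures are meant to avoid.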
Completion Trie Trie: each edge represents a single character (the simple trie). Compacted trie: allows a sequence of characters to be associated with each edge (except root)
Completion Trie Score: (1) Assign to each leaf node the score of the string it represents (2) Assign to each intermediate node the maximum score among its descendant leaf nodes. By construction, the score of each non-leaf node is simply the maximum score among its children
Completion Trie Compacted trie with max scores in each node
Completion Trie The way to find top-k completions: a) Find the locus node for the input prefix p b) Add the locus node to a priority queue ordered by score c) Repeatedly extract the highest-scored node: if it is a leaf node, report it as a completion; otherwise insert all of its children into the priority queue d) Stop once k completions have been reported
Completion Trie Improvement: instead of inserting all children of each expanded node into the priority queue, keep the children sorted in order of decreasing score. Result: only the first child and the next sibling (if any) of each expanded node need to be added
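The search with the first-child/next-sibling improvement can be sketched as follows. This is a simplified illustration, not the implementation from the talk: it uses a plain (non-compacted) trie with one character per edge, appends a terminator character so that every string ends at a leaf, and the names `Node`, `build` and `top_k` are hypothetical.

```python
import heapq
from itertools import count

END = "\0"  # assumed terminator, so every string ends at a leaf

class Node:
    def __init__(self):
        self.kids = {}               # char -> Node, filled while inserting
        self.order = []              # (char, Node) pairs sorted by decreasing score
        self.score = float("-inf")   # maximum score among descendant leaves

def build(scored):
    """Build the trie from (string, score) pairs and sort children by score."""
    root = Node()
    for s, r in scored:
        node = root
        node.score = max(node.score, r)
        for ch in s + END:
            node = node.kids.setdefault(ch, Node())
            node.score = max(node.score, r)
    stack = [root]
    while stack:
        node = stack.pop()
        node.order = sorted(node.kids.items(), key=lambda kv: -kv[1].score)
        stack.extend(node.kids.values())
    return root

def top_k(root, p, k):
    """Return up to k (string, score) completions of p, best first."""
    node = root
    for ch in p:                     # find the locus node
        node = node.kids.get(ch)
        if node is None:
            return []
    tie = count()                    # tiebreaker so nodes are never compared
    # Heap entry: (-score, tiebreak, string so far, node, parent, sibling index)
    heap = [(-node.score, next(tie), p, node, None, 0)]
    out = []
    while heap and len(out) < k:
        neg, _, s, node, parent, idx = heapq.heappop(heap)
        # Next sibling: the following child of the same parent, if any.
        if parent is not None and idx + 1 < len(parent.order):
            ch, sib = parent.order[idx + 1]
            heapq.heappush(heap, (-sib.score, next(tie), s[:-1] + ch, sib, parent, idx + 1))
        if node.order:
            # First child: it carries the same maximum score as this node.
            ch, first = node.order[0]
            heapq.heappush(heap, (-first.score, next(tie), s + ch, first, node, 0))
        else:
            out.append((s[:-1], -neg))   # leaf: strip the terminator and report
    return out
```

Each expanded node adds at most two entries to the queue, regardless of how many children it has.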
Completion Trie • Reduce the time complexity to find the top-k completions • In practice , reduce the number of comparisons needed to find the locus node
Completion Trie Example. Input: prefix = c, k = 2. Find the locus node: c. Strings: caca, cbac
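Advantages of the algorithm • In a conventional trie, each leaf node stores the score of its string; to answer a top-k query, all leaf nodes matching the prefix must be found, sorted on the fly, and the first k returned • Problem: when the prefix is short, there are many results, and sorting them costs both time and space • One alternative is to preprocess the data, finding the k completions of every prefix and storing them • Problem: k must be known in advance, and it is fixed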
Compressed Encoding Motivation • Improve the theoretical time complexity • Improve the locality of memory access (random access to RAM and hard drive is much slower than that from CPU cache)
Compressed Encoding Two strategies. BFS: store each group of child nodes consecutively, so that when finding the locus node, accessing the next sibling is less likely to incur a cache miss
Prefix=c Cache size=2
DFS: encode the nodes in DFS order. As each internal node is assigned the maximum score of its children and the children are sorted by decreasing score, following the first child is guaranteed to reach a leaf node matching the score of the internal node. This typically incurs only one cache miss per completion
Encoding for each node • Character sequence associated with its incoming edge • Score • Whether it is the last sibling • An offset pointer to its first child (0 if it has no children). For an edge label of length l: (l+1) + 4 + 1 + 4 = l + 10 bytes
Apply variable-byte encoding to scores and offsets • Store only the score difference between the current node and its previous sibling • Store the delta between the first-child offset of the current node and that of its previous sibling
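The effect of delta plus variable-byte encoding can be illustrated with a small sketch. The byte layout below (7 data bits per byte, high bit set when more bytes follow) is one common variable-byte scheme and is an assumption, as is storing the first sibling's score as an absolute value; the function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)   # high bit set: more bytes follow
        else:
            out.append(low)
            return bytes(out)

def vbyte_decode(buf, pos=0):
    """Decode one integer starting at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def encode_sibling_scores(scores):
    """scores: sibling scores in decreasing order.
    The first is stored as-is (assumption); the rest as the difference
    from the previous sibling, which is never negative."""
    out = bytearray()
    prev = None
    for s in scores:
        out += vbyte_encode(s if prev is None else prev - s)
        prev = s
    return bytes(out)

def decode_sibling_scores(buf, n):
    """Inverse of encode_sibling_scores for n siblings."""
    scores, pos, prev = [], 0, None
    for _ in range(n):
        d, pos = vbyte_decode(buf, pos)
        prev = d if prev is None else prev - d
        scores.append(prev)
    return scores
```

Because siblings are sorted by decreasing score, the differences are small non-negative numbers that usually fit in one byte instead of the fixed four.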
Implementation Details How to recover the string that matches a leaf node? DFS: reconstruct the string by starting from the root node and iteratively finding the child whose subtrie offset range includes the target leaf node. This cost can be reduced by keeping additional bookkeeping in the search algorithm • Store the nodes to be inserted into the queue in an array, along with the index of each node's parent in the array • The path from each completion node to the locus node can then be retrieved by following the parent indices
End of the First Section Questions?
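RMQ Trie What is RMQ? • RMQ is short for Range Minimum Query, a data structure that returns the position of the minimum within a given range of an array • The trie is used as a dictionary that maps the set of strings to consecutive integers in lexicographic order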
RMQ Trie • If the string set S is represented with a trie, the set Sp of strings prefixed by p is a subtrie • If the scores are arranged in DFS order within an array R, the scores of Sp are those in an interval R[a,b] • PrefixRange(p): an operation that, given p, returns the pair (a,b), or null if no string is prefixed by p
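The interval property can be seen without a trie at all: in a lexicographically sorted list, the strings prefixed by p are contiguous. The sketch below swaps the trie for a sorted Python list and binary search purely for illustration; `prefix_range` is a hypothetical name, and the upper bound assumes that no string contains the code point U+10FFFF.

```python
import bisect

def prefix_range(sorted_strings, p):
    """Return (a, b) such that sorted_strings[a..b] are exactly the strings
    prefixed by p, or None if there are none.

    sorted_strings must be in lexicographic order. Assumes no string
    contains the character U+10FFFF, which is used as an upper sentinel.
    """
    a = bisect.bisect_left(sorted_strings, p)
    b = bisect.bisect_right(sorted_strings, p + "\U0010FFFF") - 1
    return (a, b) if a <= b else None
```

With the strings in this order, the score array R is simply the list of scores at the same indices.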
RMQ Trie Build an RMQ data structure on top of R using an inverted ordering, i.e. the minimum is the highest score. Strategy • The index of the first completion is i = RMQ(a, b) • The second completion is the higher-scored of RMQ(a, i-1) and RMQ(i+1, b) • Recursive splitting: each reported index splits its interval in two • In general, the index of the next completion is the highest-scored RMQ result among all the current intervals • The intervals are maintained in a priority queue ordered by score
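The interval-splitting strategy can be sketched as below. For simplicity the RMQ is a plain sparse table that answers "index of the highest score in R[a..b]" directly; this is a stand-in chosen for brevity and is not a space-efficient RMQ. The class and function names are hypothetical.

```python
import heapq

class SparseTableArgmax:
    """Answers: index of the maximum of R[a..b] (inclusive), O(1) per query."""
    def __init__(self, R):
        self.R = R
        self.table = [list(range(len(R)))]   # level j covers windows of length 2**j
        j = 1
        while (1 << j) <= len(R):
            prev, half = self.table[-1], 1 << (j - 1)
            self.table.append([
                max(prev[i], prev[i + half], key=R.__getitem__)
                for i in range(len(R) - (1 << j) + 1)
            ])
            j += 1

    def query(self, a, b):
        j = (b - a + 1).bit_length() - 1
        return max(self.table[j][a], self.table[j][b - (1 << j) + 1],
                   key=self.R.__getitem__)

def top_k_in_range(rmq, a, b, k):
    """Return the indices of the k highest scores in R[a..b], best first."""
    heap = []

    def push(lo, hi):
        if lo <= hi:                          # skip empty intervals
            i = rmq.query(lo, hi)
            heapq.heappush(heap, (-rmq.R[i], i, lo, hi))

    push(a, b)
    out = []
    while heap and len(out) < k:
        _, i, lo, hi = heapq.heappop(heap)
        out.append(i)
        push(lo, i - 1)                       # split around the reported index
        push(i + 1, hi)
    return out
```

Each reported completion costs two RMQ queries and a constant number of heap operations.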
RMQ Trie • Advantage • Simplicity and modularity (re-uses an existing dictionary data structure without any significant modification) • Disadvantage • Hard to implement the operation PrefixRange • The cost of PrefixRange is significantly worse
Score-Decomposed Trie • Path decompositions: • Let T be the trie built on the strings of the scored string set S. A path decomposition of T is a tree Tc whose nodes correspond to node-to-leaf paths in T. It is built by choosing a root-to-leaf path Π in T and associating it with the root node of Tc; the children of the root node are defined recursively as the path decompositions of the subtries hanging off the path Π
Score-Decomposed Trie • Find a root-to-leaf path • Let the path be the root node of the new trie Tc • Recursively define the children of the root node
Score-Decomposed Trie Note that while each string s in S corresponds to a root-to-leaf path in T, in Tc it corresponds to a root-to-node path. • Max-score path decomposition • It is one way to choose the path • Choose the path to the leaf with the highest score. The subtries at the same level are arranged in decreasing order of score (the score of a subtrie is defined as the highest score in the subtrie)
Score-Decomposed Trie • r represents the highest score in the subtrie rooted at vi • Add r to the label of the edge leading to the corresponding child, so that the label becomes the pair (b, r), where b is the branching character
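Score-decomposed Trie example. The first path is root + B. B is a leaf node, so the path ends there; the path ab has two siblings branching off it, hence the label 2ab. Edges in the decomposed tree correspond to nodes of the original trie, such as (c,3) and (b,2) on the second level. Take (c,3) on the left as an example: after following this edge, recursively find the path of the subtrie, i.e. node C + E + G. The path string is aca, and since ac and a each have one sibling, the label is 1ac1a. G has only one sibling, so the next path is H itself; the cc is split in two, giving the edge (c,1) and the node c. The rest follows in the same way. Note the rightmost node K: its label is just 1, because its string is empty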
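[Figure: score-decomposed trie for the example. Labels as extracted, level by level: 2ab; (c,3), (b,2); 1ac1a, 1; (c,1), (b,2), (b,1); c, 1ac, a; (b,1); a]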
Score-Decomposed Trie How to support top-k completion enumeration? 1) Because of the max-score decomposition strategy, the highest score in each subtrie is exactly the score of the decomposition path for that subtrie. 2) The tree has the heap property: the score of each node is less than or equal to the score of its parent
How to support top-k completions enumeration ? 3)This implies that for each (s,r) in S, if u is the node corresponding to s, then r is stored in the incoming edge of u, except when u is the root, whose score is stored separately.
Score-Decomposed Trie • First, follow the algorithm of the Lookup operation until the prefix p is exhausted, leading to the locus node u, the highest node whose corresponding string has p as a prefix (report it) • Find the next completions: the prefix p ends at some position i in the label Lu, so all the other completions must be in the subtrees whose roots are the children of u branching after position i; add these children to a priority queue • Extract the highest-scored node from the priority queue, report the string corresponding to it, and add all its children to the priority queue; repeat until k completions are reported
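The construction and the enumeration can be sketched together as follows. This is an uncompressed, in-memory illustration under stated assumptions, not the succinct encoding of the talk: strings are distinct and prefix-free (no string is a prefix of another), nodes are plain tuples `(label, children)`, each child is `(branch position, branching character b, score r, node)`, and the names `decompose` and `top_k` are hypothetical.

```python
import heapq
from itertools import count

def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def decompose(items):
    """Max-score path decomposition of a non-empty, prefix-free list of
    (string, score) pairs. Returns the node (label, children)."""
    path, _ = max(items, key=lambda it: it[1])     # highest-scored string is the path
    groups = {}
    for s, r in items:
        if s != path:
            i = lcp(s, path)                       # where s branches off the path
            groups.setdefault((i, s[i]), []).append((s[i + 1:], r))
    children = [(i, b, max(r for _, r in g), decompose(g))
                for (i, b), g in groups.items()]
    children.sort(key=lambda c: (c[0], -c[2]))     # by position, then decreasing score
    return (path, children)

def top_k(root, root_score, p, k):
    """Return up to k (string, score) completions of p, best first."""
    node, prefix, score, rest = root, "", root_score, p
    while True:                                    # Lookup until p is exhausted
        label, children = node
        i = lcp(rest, label)
        if i == len(rest):
            break                                  # locus node found
        nxt = [c for c in children if c[0] == i and c[1] == rest[i]]
        if not nxt:
            return []
        pos, b, score, node = nxt[0]
        prefix += label[:pos] + b
        rest = rest[i + 1:]
    tie = count()
    # Heap entry: (-score, tiebreak, string before the label, node, min branch position)
    heap = [(-score, next(tie), prefix, node, i)]
    out = []
    while heap and len(out) < k:
        neg, _, pre, (label, children), min_pos = heapq.heappop(heap)
        out.append((pre + label, -neg))
        for pos, b, r, child in children:
            if pos >= min_pos:                     # only branches after the prefix end
                heapq.heappush(heap, (-r, next(tie), pre + label[:pos] + b, child, 0))
    return out
```

Thanks to the heap property, every node extracted from the queue is itself a completion, so exactly k nodes are popped for k results.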
Example: prefix = cac, k = 2. Locus node: 1ac1a. Completions: caca, caccc
Thank You! Questions ?