1. Succinct Data Structures: Upper, Lower & Middle Bounds
Ian Munro, University of Waterloo
CPM June 2008
Joint work with Arash Farzan, Alex Golynski, Meng He
How do we encode a large combinatorial object (e.g. a tree, string, graph, group), even a static one, in a small amount of space and still perform the required operations in constant time?
2. Example of a Succinct Data Structure: The (Static) Bounded Subset
Given: a universe of n elements [0, ..., n-1] and m arbitrary elements from this universe
Create: a static structure to support search in constant time (lg n bit word & usual operations)
Using: essentially the minimum possible number of bits
Operation: member query in O(1) time
(Brodnik & M.)
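Not the Brodnik & M. structure itself, but a minimal sketch of the member-query interface it supports, assuming a plain n-bit characteristic vector packed into 64-bit words (the class name BitVectorSet and the word size are our choices; the real structure gets the space down to essentially ceil(lg C(n, m)) bits plus lower-order terms):

# Constant-time membership via an n-bit characteristic vector.
class BitVectorSet:
    def __init__(self, n, elements):
        self.words = [0] * ((n + 63) // 64)        # n bits packed into 64-bit words
        for e in elements:
            self.words[e >> 6] |= 1 << (e & 63)    # set bit e

    def member(self, x):
        # One word access plus a shift: O(1) with lg n (here 64) bit words.
        return (self.words[x >> 6] >> (x & 63)) & 1 == 1

s = BitVectorSet(100, [3, 17, 42])
assert s.member(42) and not s.member(41)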
3. Careful ... Lower Bounds
Beame-Fich: finding the largest element less than i is tough in some ranges of m (e.g. m around 2^√(lg n))
But it is OK if i is present, and this can be added (Raman, Raman, Rao)
5. A Big Patricia Trie / Suffix Trie
Given a large text file; treat it as a bit vector
Construct a trie with leaves pointing to the unique locations in the text that match the path in the trie (paths must start at character boundaries)
Skip the nodes where there is no branching (n-1 internal nodes)
6. Space for Trees
Abstract data type: binary tree
Size: n-1 internal nodes, n leaves
Operations: child, parent, subtree size, leaf data
Motivation: the obvious representation of an n node tree takes about 6n lg n-bit words (up, left, right, size, memory manager, leaf reference),
i.e. a full suffix tree takes about 5 or 6 times the space of a suffix array (i.e. leaf references only)
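Spelling out the comparison (our arithmetic, under the slide's assumption of six lg n-bit fields per node):

\[
\text{pointer-based tree: } 6n \cdot \lg n \text{ bits}, \qquad
\text{suffix array: } n \cdot \lg n \text{ bits}, \qquad
\text{tree shape alone: } \approx 2n \text{ bits},
\]

so the full suffix tree costs roughly six times the suffix array, while the information-theoretic minimum for the tree structure itself is only about 2 bits per node (next slide).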
7. Start with Jacobson, then others:
There are about 4^n / (πn)^(3/2) ordered rooted trees, and the same number of binary trees
The lower bound on specifying such a tree is therefore about 2n bits
What are the natural representations?
8. Arbitrary Ordered Trees
Use parenthesis notation
Represent the tree as the binary string (((())())((())()())): traverse the tree, writing ( for a node, then its subtrees, then )
Each node takes 2 bits
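A minimal sketch of this encoding, assuming the tree is given as nested lists of children (our convention); the example tree is one that reproduces the string above:

# Write '(' on entering a node, recurse on its children in order, ')' on leaving.
def encode(tree):
    # `tree` is the list of a node's children, each child itself such a list
    return "(" + "".join(encode(child) for child in tree) + ")"

# A 10-node tree: every node contributes one '(' and one ')', so 2 bits per node.
t = [[[[]], []], [[[]], [], []]]
s = encode(t)
print(s, len(s))   # (((())())((())()())) 20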
9. Heap-like Notation for a Binary Tree (figure)
10. How do we Navigate?
Jacobson's key suggestion: operations on a bit vector
rank(x) = # 1s up to & including x
select(x) = position of the x-th 1
So in the binary tree:
leftchild(x) = 2 rank(x)
rightchild(x) = 2 rank(x) + 1
parent(x) = select(⌊x/2⌋)
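A minimal sketch of how these formulas navigate, assuming the heap-like encoding of the previous slide: real nodes are marked 1, absent children 0, slots listed in level order and numbered from 1, so the j-th real node's two child slots sit at positions 2j and 2j+1. The rank/select below are naive O(n) stand-ins; making them O(1) is the point of the next slide.

def rank(B, x):            # number of 1s in B[1..x]   (naive O(n) version)
    return sum(B[1:x + 1])

def select(B, j):          # position of the j-th 1    (naive O(n) version)
    count = 0
    for i in range(1, len(B)):
        count += B[i]
        if count == j:
            return i
    raise ValueError("fewer than j ones")

def leftchild(B, x):  return 2 * rank(B, x)
def rightchild(B, x): return 2 * rank(B, x) + 1
def parent(B, x):     return select(B, x // 2)

# Example: root, its left child, and that child's right child (3 real nodes, 7 slots).
#   slot:  1  2  3  4  5  6  7
B = [None, 1, 1, 0, 0, 1, 0, 0]
assert leftchild(B, 1) == 2      # root's left child is in slot 2
assert rightchild(B, 2) == 5     # that node's right child is in slot 5
assert parent(B, 5) == 2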
11. Rank & Select
Rank: auxiliary storage ~ 2n lg lg n / lg n bits
#1s up to each (lg n)^2-th bit
#1s within these blocks, at each lg n-th bit
Table lookup after that
Select: more complicated (especially to get this lower order term) but similar notions
Key issue: Rank & Select take O(1) time with a lg n bit word (M. et al)
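A simplified sketch of the two-level rank directory (our class name RankSupport; the counters are plain Python lists rather than packed o(n)-bit arrays, and the final scan stands in for the constant-time table lookup):

import math, random

class RankSupport:
    # Two-level directory for rank over a 0/1 list B.
    def __init__(self, B):
        self.B = B
        n = len(B)
        lg = max(1, int(math.log2(max(n, 2))))
        self.small = lg                # block of ~ lg n bits
        self.big = lg * lg             # superblock of ~ (lg n)^2 bits
        # cumulative #1s up to each superblock boundary j*big
        self.super_counts = [0]
        for j in range(1, n // self.big + 2):
            self.super_counts.append(self.super_counts[-1]
                                     + sum(B[(j - 1) * self.big : j * self.big]))
        # #1s from the enclosing superblock boundary up to each block boundary k*small
        self.block_counts = []
        for k in range(n // self.small + 2):
            start = (k * self.small // self.big) * self.big
            self.block_counts.append(sum(B[start : k * self.small]))

    def rank(self, x):
        # #1s in B[0..x] (0-indexed, inclusive)
        r = self.super_counts[x // self.big]
        r += self.block_counts[x // self.small]
        r += sum(self.B[(x // self.small) * self.small : x + 1])  # table lookup in the real structure
        return r

B = [random.randint(0, 1) for _ in range(200)]
rs = RankSupport(B)
assert all(rs.rank(i) == sum(B[: i + 1]) for i in range(len(B)))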
12. Aside: Dynamic Rank & Select
Rank/Select structures: raw data plus some cumulative arrays
Model: we keep a finger at a position and can insert, delete or change at that spot, or move 1 spot left/right
When at position i, maintain structures up to i and backwards from n down to i+1
Problem: in most (tree) applications rank/select updates are all over
13. Lower Bound: for Rank & for Select
Theorem (Golynski): Given a bit vector of length n and an index (extra data) of size r bits, let t be the number of bits probed to perform rank (or select); then r = Ω(n (lg t)/t).
Proof idea: argue that otherwise the entire string could be reconstructed from too few rank queries (similarly for select)
Corollary (Golynski): Under the lg n bit RAM model, an index of size Θ(n lg lg n / lg n) is necessary and sufficient to perform the rank and the select operations.
14. More on Trees
Updating trees: a simple mapping plus rank/select does not work well
Other kinds of trees: for free trees (no root or ordering on children) a simple mapping may not exist
So break the tree into little hunks (say of size (1-ε) lg n), small enough to keep explicitly in a table, with special constraints (e.g. few edges going out of a hunk)
15. More on Trees
Keep most nodes in these little hunks (or a couple of levels of hunk size classes); a limited number can be in a core tree with real pointers
16. Hunks Lead to Updates on binary trees (M., Raman & Storm) & more general trees (Farzan & M.)
Also representing special classes of trees optimally (Farzan & M.),
e.g. free trees in 1.56...n bits,
free binary trees in 1.31...n bits
17. Other Combinatorial Objects
Planar Graphs (Lu et al, Barbay et al)
Permutations [n] → [n], or more generally Functions [n] → [n]. But what operations?
Clearly π(i), but also π^(-1)(i)
And then π^k(i) and π^(-k)(i)
Suffix Arrays (special permutations) in linear space
Arbitrary Graphs (Farzan & M.)
18. Permutations: Backpointer Notation
Let P be a simple array giving π: P[i] = π[i]
Also have B[i] be a pointer t positions back in (the cycle of) the permutation: B[i] = π^(-t)[i] ... but only define B for every t-th position in a cycle (t is a constant; ignore cycle length round-off)
So the array representation is
position:  1  2  3  4  5  6  7  8  9 10 11 12 13
P =       [8  4 12  5 13  x  x  3  x  2  x 10  1]
19. Representing Shortcuts
In a cycle there is a B every t positions
But these positions can be in arbitrary order
Which positions have a B, and how do we store it?
Keep a bit vector over all positions: 0 = no B, 1 = B
Rank gives the position of B[i] in the B array
So: π(i) & π^(-1)(i) in O(1) time & (1+ε) n lg n bits
Theorem: Under a pointer machine model with space (1+ε) n references, we need time 1/ε to answer π and π^(-1) queries; i.e. this is as good as it gets in the pointer model.
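A minimal sketch of the backpointer scheme with shortcut spacing t (the class name BackpointerPerm is ours, indices are 0-based, and a dict stands in for the rank-indexed array of shortcuts):

# P[i] = pi(i); every t-th element of each cycle also stores a pointer t steps back.
class BackpointerPerm:
    def __init__(self, pi, t):
        n = len(pi)
        self.P = pi
        self.has_B = [0] * n          # bit vector: does position i carry a shortcut?
        self.B = {}                   # i -> pi^(-t)(i), only for marked positions
        seen = [False] * n
        for start in range(n):
            if seen[start]:
                continue
            cycle = []
            i = start
            while not seen[i]:
                seen[i] = True
                cycle.append(i)
                i = pi[i]
            for k in range(0, len(cycle), t):            # every t-th element of the cycle
                j = cycle[k]
                self.has_B[j] = 1
                self.B[j] = cycle[(k - t) % len(cycle)]  # t steps back along the cycle
        # In the real structure, rank() over has_B maps i to its slot in a dense B array.

    def apply(self, i):               # pi(i): a single array access
        return self.P[i]

    def inverse(self, i):
        # pi^(-1)(i): walk forward to a shortcut, jump t back, then walk forward
        # again until the predecessor of i appears -- O(t) steps in total.
        j = i
        while not self.has_B[j]:
            j = self.P[j]
        j = self.B[j]
        while self.P[j] != i:
            j = self.P[j]
        return j

pi = [7, 3, 11, 4, 12, 5, 6, 2, 8, 1, 10, 9, 0]   # one 9-cycle plus four fixed points
p = BackpointerPerm(pi, t=3)
assert p.apply(0) == 7 and p.inverse(7) == 0
assert all(p.inverse(p.apply(i)) == i for i in range(len(pi)))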
20. Aside: Extending to Powers of π
Consider the cycles of π: (2 6 8)(3 5 9 10)(4 1 7)
A bit vector indicates the start of each cycle:
(2 6 8 3 5 9 10 4 1 7)
Ignore the parens and view this sequence as a new permutation σ
Note: σ^(-1)(i) is the position containing i
So we have σ and σ^(-1) as before
Use σ^(-1)(i) to find i, then the bit vector (rank, select) to find π^k or π^(-k)
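A minimal sketch of this reduction (our class PermPowers; sigma is the cycle-order permutation, starts marks where each cycle begins, and bisect stands in for succinct rank/select on that bit vector):

from bisect import bisect_right

# pi^k restricted to a cycle is a rotation by k, so one sigma^(-1) lookup plus
# rank/select on the cycle-start marks answers pi^k and pi^(-k).
class PermPowers:
    def __init__(self, pi):
        n = len(pi)
        self.sigma, self.starts = [], []
        seen = [False] * n
        for s in range(n):
            if seen[s]:
                continue
            self.starts.append(len(self.sigma))    # this cycle begins here in sigma
            i = s
            while not seen[i]:
                seen[i] = True
                self.sigma.append(i)
                i = pi[i]
        self.pos = {v: j for j, v in enumerate(self.sigma)}   # sigma^(-1)
        self.starts.append(n)                      # sentinel: end of the last cycle

    def power(self, i, k):
        # pi^k(i) for any integer k (k may be negative)
        p = self.pos[i]                            # sigma^(-1)(i)
        c = bisect_right(self.starts, p) - 1       # rank: which cycle holds p
        s, e = self.starts[c], self.starts[c + 1]  # select: cycle boundaries
        return self.sigma[s + (p - s + k) % (e - s)]   # rotate within the cycle

pi = [7, 3, 11, 4, 12, 5, 6, 2, 8, 1, 10, 9, 0]    # same permutation as above
pp = PermPowers(pi)
assert pp.power(0, 1) == pi[0] and pp.power(pi[0], -1) == 0
assert pp.power(0, 9) == 0                          # 0 lies on a 9-cycle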
21. Aside: Functions
Consider an arbitrary function f: [n] → [n]
Note: f^(-1)(i) is a set
All tree edges lead to a cycle
A function is just a hairy permutation
Deal with level ancestors, and the result holds
22. Back to π & π^(-1) in Fewer Bits
This is the best we can do for O(1) operations
But using Benes networks:
a 1-Benes network is a 2 input / 2 output switch
an (r+1)-Benes network joins two r-Benes networks with a column of switches on each side (tops to tops)
#bits(n) = 2 #bits(n/2) + n = n lg n - n + 1 = min + O(n)
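For reference, the recurrence and the target, in our rendering (the exact constant in the linear term depends on how the base case is counted, so we only claim Θ(n)):

\[
\#\mathrm{bits}(n) = 2\,\#\mathrm{bits}(n/2) + n = n \lg n - \Theta(n),
\qquad
\lg(n!) = n \lg n - n \lg e + O(\lg n),
\]

so a Benes network spends only O(n) bits above the information-theoretic minimum.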
23. A Benes Network
Realizing the permutation (standard π(i) notation): (3 5 7 8 1 6 4 2)
Note: O(n) bits more than necessary
24. What can we do with it?
Divide the network into blocks of lg lg n gates and encode each block's actions in a word, taking advantage of the regularity of the addressing mechanism,
and also modify the approach to avoid the power-of-2 issue
Can trace a path in time O(lg n / lg lg n)
Beats the previous lower bound by using micro pointers
25. Backpointers & Benes: Both are Best
Recall: the Benes method violates the pointer machine lower bound by using micropointers.
Indeed: with (a lot of) care, the space required is lg(n!) + O(n (lg lg n)^2 / lg n) bits
But more generally,
Lower Bound (Golynski): both methods are optimal for their respective extra space constraints
26. Permutation Lower Bound
Operations: π(i), π^(-1)(i) with times t and t'
Backpointers: the natural representation + an index
Benes: just a pile of bits, in lg n bit words
General model: memory of lg(n!) + r bits, in words
Lower bound: r = extra space = Ω(lg(n!) / (t·t'))
It works out that both Backpointers and Benes are optimal
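Checking both constructions against the bound, using the parameters from the earlier slides (our arithmetic, with lg(n!) ≈ n lg n):

\[
\text{backpointers (spacing } t\text{): } r \approx \frac{n \lg n}{t},\; t_{\pi}=O(1),\; t_{\pi^{-1}}=O(t)
\;\Rightarrow\; r \cdot t_{\pi} \cdot t_{\pi^{-1}} \approx n \lg n,
\]
\[
\text{Benes: } r = O\!\left(\frac{n (\lg\lg n)^2}{\lg n}\right),\; t_{\pi}=t_{\pi^{-1}}=O\!\left(\frac{\lg n}{\lg\lg n}\right)
\;\Rightarrow\; r \cdot t_{\pi} \cdot t_{\pi^{-1}} = O(n \lg n),
\]

so each meets the r = Ω(lg(n!)/(t·t')) trade-off up to constant factors.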
27. Proof of Lower Bound: Model
Model: tree program
A separate tree for each query π(i) or π^(-1)(i)
Start at the root; look at a memory location (word) based on the value required
At depth d take the appropriate branch based on which of the n possible values is read
28. Proof of Lower Bound: Set up
Fix the permutation (for now)
Consider the table of locations inspected at every step for every query
29. Proof of Lower Bound: cont'd
Take the least used cell (over all queries for this permutation)
30. Proof of Lower Bound: cont'd
Take the least used cell (over all queries for this permutation)
and NUKE (eliminate) it
31. Proof of Lower Bound: cont'd
And continue removing cells for a while ...
This means some queries may become unanswerable (no matter how many probes are made) but others are still OK,
e.g. removing a cell used by π(6) (=56) and by π^(-1)(56) (=6) makes both of those queries unanswerable, versus a cell used by π(9) (=52) but not by π^(-1)(52) (=9)
We do have to remember what we removed (though not the order)
32. Proof of Lower Bound: Saving Space
So we save the space for the values we no longer need, but we do have to remember which cells were destroyed
d locations destroyed, order doesn't matter:
d lg(n/d) bits used to say what is gone,
but d lg(n) bits saved
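The bookkeeping, spelled out (our rendering; the slide only states the two terms):

\[
\underbrace{\lg \binom{n}{d}}_{\text{recording which cells are gone}} \approx d \lg\frac{n}{d}
\qquad \text{vs.} \qquad
\underbrace{d \lg n}_{\text{bits saved}},
\]

a net saving of about d lg d bits once d cells have been removed.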
33. Proof of Lower Bound: Finishing
Now some queries don't work:
π(i_s), s = 1, ..., c and π^(-1)(j_s), s = 1, ..., c
We know the i_s and the j_s but not their correspondence, so encode it
After the reduction we still need lg(n!) bits (averaging over all permutations)
So reduce to that point ... do the arithmetic, and the bound follows
34. Text Search Lower Bound
Key point: a reciprocal relation
Text search operations:
F = access the substring of length p starting at position ip+1, for i = 0, ..., n/p
I = search(X, j): the j-th (aligned) occurrence of X
Theorem (Golynski): r·t_F·t_I = Ω(n p (lg σ)^2 / ω^2),
where r = extra space in words, t_F and t_I are the times for F and I, σ = alphabet size, ω = word size
For lg n-length substrings, linear extra space is needed: the same as Demaine & Lopez-Ortiz, but in a better model
35. Conclusion
Interesting, and useful, combinatorial objects can be:
Stored succinctly: lower bound + o() bits
So that natural queries are performed in O(1) time (or at least very close)
Indeed our o() terms are often optimal
But the border on operations is subtle