A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK) and Venkatesh Raman (Institute for Mathematical Sciences, Chennai, India) CPM 2004

A Parentheses Data Structure
• Given: a balanced string of 2n parentheses, e.g. ( ( ( ( ) ) ) ( ) ( ) )
• Supported operations:
  • ENCLOSE(i)
  • FINDCLOSE(i), FINDOPEN(i)
  • EXCESS(i)
• Applications to suffix trees, ordinal trees and stack-sortable permutations.
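
The slide only names the four operations. As a reference point, here is a minimal linear-time sketch in Python. It assumes the usual meanings (FINDCLOSE/FINDOPEN return the matching parenthesis, ENCLOSE returns the opening parenthesis of the tightest enclosing pair, EXCESS counts opens minus closes up to a position) and 0-based indices; the function names are illustrative and this is not the succinct structure from the talk.

```python
def find_close(s, i):
    """Index of the ')' matching the '(' at index i."""
    depth = 0
    for j in range(i, len(s)):
        depth += 1 if s[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced string")


def find_open(s, i):
    """Index of the '(' matching the ')' at index i."""
    depth = 0
    for j in range(i, -1, -1):
        depth += 1 if s[j] == ")" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced string")


def excess(s, i):
    """Number of '(' minus number of ')' in s[0..i], inclusive."""
    return sum(1 if c == "(" else -1 for c in s[: i + 1])


def enclose(s, i):
    """Index of the '(' of the pair that most tightly encloses the pair
    opening at index i, or None for an outermost pair."""
    depth = 0
    for j in range(i - 1, -1, -1):
        if s[j] == ")":
            depth += 1          # skip over a complete earlier sibling
        elif depth == 0:
            return j            # unmatched '(' to the left: the encloser
        else:
            depth -= 1
    return None


s = "(((()))()())"              # the example string from the slide
print(find_close(s, 1), find_open(s, 6), excess(s, 3), enclose(s, 7))
```

Each call scans the string, so this corresponds to the "2n bits, O(n) time" end of the trade-off.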

Parentheses Representation
• 2n bits, O(n) time.
• Θ(n lg n) bits, O(1) time.
• O(n) bits, O(1) time. [Jacobson '89]
• 2n + o(n) bits, O(1) time. [Munro, Raman '01]
• 2n + o(n) bits, O(1) time. New data structure.
• Our new D/S:
  • is simpler (no perfect hash tables),
  • has a smaller o(n) term,
  • has a uniform o(n) time and space construction algorithm.
• Implemented and shown to be quite practical:
  • far more compact than a D/S using the naïve representation,
  • speed comparable to a D/S using the naïve representation.

XML
• XML: eXtensible Markup Language
  • de facto standard for electronic data interchange.
• Document Object Model (DOM): standard API for manipulating XML documents
  • holds all data in memory,
  • large memory usage.

Example XML document
<person>
  <name>
    <first>Bill</first>
    <surname>Bloggs</surname>
  </name>
  <dob>
    <day>1</day>
    <month>April</month>
    <year>1961</year>
  </dob>
</person>
[Figure: tree view of the document; node labels: person, name, dob, firstname, surname, day, month, year]
• The DOM Node interface has methods PARENT(x), NEXTSIB(x), PREVSIB(x), LASTCHILD(x), FIRSTCHILD(x).

Obvious representation
• 2n pointers.
• DOM: 3n pointers.
• Ω(n log n) bits.
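
Using parentheses
<person> <name> <first>Bill</first> <surname>Bloggs</surname> </name> <dob> <day>1</day> <month>April</month> <year>1961</year> </dob> </person>
[Figure: the document tree with nodes numbered 1 to 8 in document order: 1 at the root; 2 and 5 on the next level; 3, 4, 6, 7, 8 at the leaves]
• Parentheses representation: ( ( ( ) ( ) ) ( ( ) ( ) ( ) ) ), with the opening parentheses labelled 1 to 8 from left to right.
• 2n + o(n) bits for tree structure.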
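
The mapping from the example tree to its parentheses string can be sketched as a depth-first walk that writes "(" on entering a node and ")" on leaving it. The tuple-based tree encoding and the function names below are assumptions made for illustration only.

```python
def to_parens(node):
    """node is (label, [children]); returns the balanced-parentheses string."""
    _label, children = node
    return "(" + "".join(to_parens(child) for child in children) + ")"


def preorder_labels(node):
    """Labels in document (preorder) order; the k-th label belongs to the
    k-th opening parenthesis."""
    label, children = node
    labels = [label]
    for child in children:
        labels.extend(preorder_labels(child))
    return labels


# The example document from the slide, element structure only.
doc = ("person", [
    ("name", [("first", []), ("surname", [])]),
    ("dob", [("day", []), ("month", []), ("year", [])]),
])

print(to_parens(doc))        # ((()())(()()()))
print(preorder_labels(doc))
```

The 8 nodes produce 16 parentheses, i.e. 2n bits before any auxiliary o(n)-bit structures are added.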

Node interface ops using Parentheses DS
Node interface    Parentheses DS
PARENT            ENCLOSE
NEXTSIB           FINDCLOSE
PREVSIB           FINDOPEN
LASTCHILD         FINDCLOSE, FINDOPEN
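
One way the table above might translate into code is sketched below. A node is identified by the 0-based index of its opening parenthesis; the helper operations are naive linear scans standing in for the O(1) structure, and the exact index arithmetic (for example "+ 1" after FINDCLOSE) is an assumption, since the slide lists only which operation each method relies on.

```python
def find_close(s, i):
    depth = 0
    for j in range(i, len(s)):
        depth += 1 if s[j] == "(" else -1
        if depth == 0:
            return j


def find_open(s, i):
    depth = 0
    for j in range(i, -1, -1):
        depth += 1 if s[j] == ")" else -1
        if depth == 0:
            return j


def enclose(s, i):
    depth = 0
    for j in range(i - 1, -1, -1):
        if s[j] == ")":
            depth += 1
        elif depth == 0:
            return j
        else:
            depth -= 1
    return None


def parent(s, x):
    return enclose(s, x)


def first_child(s, x):
    # A child, if any, opens immediately after its parent opens.
    return x + 1 if s[x + 1] == "(" else None


def next_sib(s, x):
    j = find_close(s, x) + 1
    return j if j < len(s) and s[j] == "(" else None


def prev_sib(s, x):
    # The previous sibling, if any, closes immediately before x opens.
    return find_open(s, x - 1) if x > 0 and s[x - 1] == ")" else None


def last_child(s, x):
    # The last child, if any, closes immediately before x closes.
    j = find_close(s, x) - 1
    return find_open(s, j) if s[j] == ")" else None


s = "((()())(()()()))"   # person = 0, name = 1, dob = 7 in this string
print(parent(s, 7), next_sib(s, 1), prev_sib(s, 7), last_child(s, 7))
```

Every method returns None when the requested relative does not exist.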

Succinct DOM
• Succinct DOM:
  • uses far less space than standard DOM,
  • performance competitive with DOM.
• Node interface implemented by natural parentheses ops.
• Operations supported by parentheses data structures:
  • Jacobson '89,
  • Munro and Raman '01,
  • our new data structure.

Our new D/S
• Input: balanced string of 2n parentheses.
• Assume a recursive data structure that stores a balanced string of 2N ≤ 2n parentheses.
• If N is O(n / lg² n), store answers explicitly for every pair of parentheses.
• Otherwise, divide the string into blocks of size B, giving β blocks (the expressions for the block size and the number of blocks are missing in the source).

FINDCLOSE(x)
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
[Figure labels under the string: 1 2 3 4 5 6 7 8 9 10]
• FINDCLOSE(3)?
• Matching parenthesis inside the same block: near parenthesis.
• A pre-computed table stores the position of the matching parenthesis for all near parentheses.
• O(1) time if near parenthesis.
• Table size is [expression missing in the source].
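
The near-parenthesis lookup can be illustrated with a small table indexed by block contents. This is a sketch under assumptions: blocks of size 4 (a choice that is consistent with the slide's example, not something the slide states), a string whose length is a multiple of the block size, 0-based indices, and a Python dict keyed by the block's characters in place of a table indexed by the block's bit pattern.

```python
from itertools import product


def build_near_table(B):
    """For every possible block of B parentheses, record for each position
    the in-block offset of its match, or None if the match lies outside."""
    table = {}
    for chars in product("()", repeat=B):
        block = "".join(chars)
        match = [None] * B
        stack = []
        for k, c in enumerate(block):
            if c == "(":
                stack.append(k)
            elif stack:
                j = stack.pop()
                match[j], match[k] = k, j
        table[block] = match
    return table


def find_match_near(s, i, B, table):
    """Match of position i if it is near (same block), else None."""
    start = (i // B) * B
    offset = table[s[start:start + B]][i - start]
    return None if offset is None else start + offset


s = "((()((()))()(())()))"       # the 20-parenthesis example string
table = build_near_table(4)
print(find_match_near(s, 2, 4, table))   # slide's FINDCLOSE(3), 1-based: near
print(find_match_near(s, 4, 4, table))   # slide's FINDCLOSE(5), 1-based: far
```

The table depends only on the block size, not on the input string, which is why it can be shared by all blocks.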

Pioneer Parentheses
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
[Figure labels under the string: 1 2 3 4 5 6 7 8 9 10]
• FINDCLOSE(5)?
• Matching parenthesis outside the block: far parenthesis.
• b(p) = block number of the parenthesis at position p.
• μ(p) = position of the match of p.
• q is the first far parenthesis before p.
• p is a pioneer if b(μ(p)) ≠ b(μ(q)).
• At most 2β-3 open pioneers. Similarly, at most 2β-3 close pioneers.

Pioneer Family
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
[Figure labels under the string: 1 2 3 4 5 6 7 8 9 10]
• Pioneer family: the set of all opening and closing pioneers along with their matching parentheses.
• It is a balanced string of size at most 4β-6.
• Pioneer family of the example string: ( ( ) )
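
To make the pioneer definitions concrete, the sketch below computes the pioneer family of the example string. Several details are assumptions rather than statements from the slides: blocks of size 4, 0-based positions, opening pioneers found by a left-to-right scan over far opening parentheses, and closing pioneers found symmetrically by a right-to-left scan over far closing parentheses.

```python
def match_positions(s):
    """match[p] is the position of the parenthesis matching position p."""
    match = [0] * len(s)
    stack = []
    for i, c in enumerate(s):
        if c == "(":
            stack.append(i)
        else:
            j = stack.pop()
            match[i], match[j] = j, i
    return match


def pioneer_family(s, B):
    """Sorted positions of all pioneers together with their matches."""
    match = match_positions(s)

    def block(p):
        return p // B

    far = [p for p in range(len(s)) if block(p) != block(match[p])]
    family = set()
    for kind, order in (("(", far), (")", list(reversed(far)))):
        prev = None                       # previous far parenthesis of this kind
        for p in order:
            if s[p] != kind:
                continue
            if prev is None or block(match[p]) != block(match[prev]):
                family.update((p, match[p]))   # p is a pioneer
            prev = p
    return sorted(family)


s = "((()((()))()(())()))"
family = pioneer_family(s, 4)
print(family)                              # [0, 4, 9, 19]
print("".join(s[p] for p in family))       # (())
```

With these assumptions the family is the balanced string ( ( ) ), and marking the family positions with 1s gives the bit vector 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 shown on the NND slide.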

Our D/S
[Figure: the string of 2N parentheses, ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ), its NND, and the pioneer family ( ( ) ), labelled O(N / lg N)]
• Two levels of recursion.
• When the pioneer family has size O(N / lg² N), we store explicit answers.

Space usage
• NND uses O(N lg lg N / lg N) bits.
• Tables use O(N lg lg N / lg N) bits.
• S(n) = 2n + O(n lg lg n / lg n) = 2n + o(n) bits.

Pseudo-pioneers
• Near blocks: blocks which have no pioneers.
• Insert pseudo-pioneers at the start and end of every near block.
• Pseudo-pioneers do not affect FINDOPEN(x), FINDCLOSE(x), ENCLOSE(x).
• Gap between pioneers is now at most 2B = O(lg N).

NND
( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) )
1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
• 2n-bit vector used to find the pioneer for a far parenthesis.
• If there is a pioneer at position i in the parentheses string, then there is a 1 at position i in the NND.
• Operations we need:
  • Find the address of the most recent 1 at position i: r = Rank(i), p = Select(r).
  • Find the ith 1 in the bit vector: p = Select(i).
• We want a succinct representation.
• D/S should be simple and fast.
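
The Rank/Select combination for "most recent 1" can be sketched with naive linear scans. The conventions here are assumptions: 0-based positions, Rank(i) counts 1s up to and including position i, and Select(r) counts from r = 1. This shows only the interface, not the succinct representation.

```python
def rank(bits, i):
    """Number of 1s in bits[0..i], inclusive."""
    return sum(bits[: i + 1])


def select(bits, r):
    """Position of the r-th 1, counting from r = 1."""
    count = 0
    for p, b in enumerate(bits):
        count += b
        if b and count == r:
            return p
    raise IndexError("fewer than r ones in the bit vector")


def most_recent_one(bits, i):
    """Position of the last 1 at or before position i."""
    return select(bits, rank(bits, i))


# The bit vector from the slide.
bits = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(most_recent_one(bits, 8))    # 4
print(most_recent_one(bits, 18))   # 9
```

If no 1 occurs at or before position i, Rank returns 0 and this sketch raises IndexError.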

NND
• Bit vector of length M with N 1s.
• Gap between 1s is at most (lg M)^c.
• t = lg M / (2c lg lg M).

Select(i)
• Find the ith 1 in the bit vector.
• Array A1 stores the position of every tth 1.
  • Space is [expression missing in the source].
• Array A2 stores the gaps between consecutive 1s.
  • Space is O(N lg lg M) or O(M lg lg M / lg M) bits.
• Table T1 allows us to look up the sum of up to t gaps.
  • Space is [expression missing in the source].
SELECT(i)
  i' = [expression missing in the source]
  i'' = (i+1) mod t
  y = concatenation of A2[i'+1], ..., A2[i'+i'']
  return A1[x] + T1[y]
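
The sampled-position-plus-gaps idea behind Select can be sketched as follows. The indexing is an assumption chosen for clarity (0-based, with the ith 1 counted from i = 0) and does not attempt to reproduce the slide's pseudocode; the T1 table lookup is replaced by a direct sum over at most t - 1 gaps. The class and attribute names are illustrative.

```python
class SampledSelect:
    def __init__(self, bits, t):
        ones = [p for p, b in enumerate(bits) if b]
        self.t = t
        # a1[j]: position of the (j * t)-th 1 (every t-th 1 is sampled).
        self.a1 = ones[::t]
        # a2[k]: gap between the (k-1)-th and k-th 1; a2[0] is unused.
        self.a2 = [0] + [b - a for a, b in zip(ones, ones[1:])]

    def select(self, i):
        """Position of the i-th 1, counting from i = 0."""
        sample, rest = divmod(i, self.t)
        start = sample * self.t
        # Stand-in for T1: add the gaps between the sampled 1 and the i-th 1.
        return self.a1[sample] + sum(self.a2[start + 1 : start + 1 + rest])


bits = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
sel = SampledSelect(bits, t=2)
print([sel.select(i) for i in range(4)])   # [0, 4, 9, 19]
```

In the structure described on the slide, the gaps are small enough that up to t of them fit in a short bit string, so the sum can be read from a table instead of being computed in a loop.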

Rank(i)
• Prefix sum at position i.
• Needs two more arrays and tables of size at most O(M lg lg M / lg M) bits.

Implementation Details
• C++ on Sun UltraSparc-III and Pentium 4.
• Implemented the new D/S and an optimised Jacobson D/S.
• CenterPoint XML for DOM.
• Sample of 12 XML documents of varying sizes and node counts.
• Block sizes 32, 64, 128 and 256.
• Test was a depth-first tree walk, counting nodes of a given XML type.

Space usage and performance
• Space usage for tree structure:
  • Std DOM: 96 bits per node.
  • Jacobson: 3.3 to 16 bits per node.
  • New D/S: 2.9 to 12.8 bits per node.
• Avg performance for succinct D/S relative to std DOM:
  • UltraSparc: 1 to 2.5 times slower.
  • Pentium 4: 1.7 to 4 times slower.

Conclusions and Future work
• Conceptually simple succinct representation for balanced parentheses with O(1) time ops.
• o(n) time and space construction algorithm.
• Improved lower-order term in the space bound.
• Relative performance very good on UltraSparc but poorer on Pentium 4, which has a small cache.
  • Cache optimisation is an interesting problem.
• Complete set of D/S for succinct DOM.