Storing and Querying Multi-version XML Documents using Durable Node Numbers Shu-Yao Chien Dept. of CS UCLA csy@cs.ucla.edu Vassilis J. Tsotras Dept. of CS&E UC Riverside tsotras@cs.ucr.edu Carlo Zaniolo Dept. of CS UCLA zaniolo@cs.ucla.edu Donghui Zhang Dept. of CS&E UC Riverside donghui@cs.ucr.edu
Document Version Management • Traditional applications migrating to the web: • Software configuration management • Cooperative work • CAD • An array of web-based applications: • Web content providers and trackers • Link Permanence • WebDAV
Problem Definition • A range of new and old applications look to XML for a shared technology and toolset to support their varied requirements • Main requirements and research challenges: • Efficient version retrieval • Storage efficiency • Complex query support
Traditional Versioning Schemes • The naive approach stores each version in its entirety: it minimizes retrieval cost but uses storage very inefficiently. • RCS (Revision Control System): • stores the latest version in its entirety, and • represents old versions by deltas (reverse edit scripts) • minimizes storage cost • version retrieval cost grows linearly with version number • SCCS (Source Code Control System): • objects are timestamped and stored in document order • version retrieval cost can be as high as reading the whole change history • Most current systems use these schemes, but they need improvements in storage management, retrieval, query, and support for complex objects.
Databases: Temporal, OO, Semi-structured, XML DB, … • DBs for CAD and for semi-structured information have paid much attention to version support • Temporal DBs: efficient support for transaction time through various indexing schemes (Snapshot Index, Multi-Version B+-Trees, etc.) • But typical DBs do not support object ordering (since reconstruction of the complete document is not a critical query) • Numbering schemes have been proposed to represent document structure and to evaluate regular path expressions more efficiently.
Storage Level Enhancement • UBCC [WebDB200] enhances RCS with page management • Flexibility in trading off storage and retrieval costs • Uses the concept of Page Usefulness • Captures the order of the objects in the document in the (forward) edit script
Page Usefulness by Example
Version | Change to page (A B C D) | Usefulness
T1 | none | 100% (useful)
T2 | one object deleted | 75%
T3 | two more objects deleted | 25% (useless)
• We set a minimum usefulness requirement Umin, e.g. 70% (0 < Umin <= 1). • A page is useful/useless when its usefulness is above/below Umin.
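The example above can be checked with a few lines of Python. This is a minimal sketch that assumes usefulness is measured as the fraction of a page's objects that are still live, and that a page sitting exactly at Umin counts as useful; the slide does not settle either point. The names `usefulness` and `is_useful` are hypothetical.

```python
def usefulness(total_objects: int, deleted_objects: int) -> float:
    """Fraction of a page's objects still valid (assumed object-count measure)."""
    if total_objects <= 0 or not 0 <= deleted_objects <= total_objects:
        raise ValueError("invalid object counts")
    return (total_objects - deleted_objects) / total_objects


def is_useful(u: float, u_min: float = 0.7) -> bool:
    """A page is useful when its usefulness reaches Umin (tie handling assumed)."""
    return u >= u_min


# The page holding A B C D at versions T1, T2, T3:
# 0, 1 and 3 of its 4 objects have been deleted so far.
history = {"T1": 0, "T2": 1, "T3": 3}
report = {t: (usefulness(4, d), is_useful(usefulness(4, d)))
          for t, d in history.items()}
```

With Umin = 70%, `report` marks the page as useful at T1 and T2 and useless at T3, matching the table.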
Usefulness Based Copy Control (UBCC) • STEP 1: Determine page usefulness for copying. • STEP 2: Append new/copied objects into new pages by their logical order.
[Figure: UBCC example with Umin = 70%. VERSION 1: objects Root, Ch A, Fig D, Sec E, Ch B, Sec F, Fig G, Fig H stored in pages P1 and P2. VERSION 2: three DEL operations plus INS(Sec J), INS(Fig M), INS(Ch K), INS(Sec L), giving U(P1) = 75% and U(P2) = 50% < Umin. COPY: Sec J, Ch B, Sec F, Fig M, Ch K, Sec L are written to new pages P3 and P4, with U(P3) = 100% and U(P4) = 100%.]
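The two UBCC steps can be sketched as follows. This is an illustration under several assumptions that are not on the slide: pages hold four objects each, usefulness is the fraction of live objects, Fig D is the object deleted from P1 (the figure does not say which one), and the positions of Sec J and Fig M in the new document order are guessed from the order of the COPY list. The function name `ubcc_new_version` is hypothetical.

```python
def ubcc_new_version(pages, new_order, u_min=0.7, capacity=4):
    """Return the new pages written when a version is added.

    pages:     {page_name: [object, ...]} as stored before the update
    new_order: objects of the new version in logical (document) order
    """
    live = set(new_order)
    kept_in_place = set()
    # STEP 1: a page stays in use only if enough of its objects are still live.
    for objects in pages.values():
        alive = [o for o in objects if o in live]
        if len(alive) / len(objects) >= u_min:
            kept_in_place.update(alive)
    # STEP 2: new objects and live objects of useless pages are appended
    # to fresh pages, following the logical order of the new version.
    to_write = [o for o in new_order if o not in kept_in_place]
    return [to_write[i:i + capacity] for i in range(0, len(to_write), capacity)]


pages_v1 = {
    "P1": ["Root", "Ch A", "Fig D", "Sec E"],
    "P2": ["Ch B", "Sec F", "Fig G", "Fig H"],
}
# Hypothetical version 2: Fig D, Fig G and Fig H deleted; four insertions.
order_v2 = ["Root", "Ch A", "Sec E", "Sec J", "Ch B", "Sec F",
            "Fig M", "Ch K", "Sec L"]
new_pages = ubcc_new_version(pages_v1, order_v2)
```

Under these assumptions P1 (75%) stays in use, P2 (50%) is abandoned, and `new_pages` holds the six objects of the COPY list split across two new pages.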
New Support Is Needed … • Complex Query Support: • Temporal Selection • Structural Projection • Content-Based Selection • Regular Path Expression • Query on Diff • UBCC is not efficient in supporting version queries. • A new scheme is needed …
The SPaR Versioning Scheme • SPaR numbering scheme • Version model • Complex query support • Usefulness-based storage strategy
SPaR Numbering Scheme • XML document structure is represented by: • a Durable Node Number (DNN), and • a Range • DNN is a sparse numbering scheme that preserves element order. • Range preserves parent-child relationships. • Documents can be decomposed and stored as separate elements, then reconstructed (possibly only in part) when needed. • Indexes can be built upon DNN and Range for efficient XML query evaluation.
SPaR Numbering Scheme by Example
Root (dnn=1, range=100)
  Ch A (dnn=5, range=25)
    Fig D (dnn=11, range=2)
    Sec E (dnn=21, range=5)
  Ch B (dnn=51, range=30)
    Sec F (dnn=55, range=10)
      Fig G (dnn=61, range=2)
    Fig H (dnn=71, range=2)
Interval labels shown in the figure: 1-100, 5-30, 51-80, 21-25, 55-65.
• DNN is a sparse numbering scheme that preserves element order under pre-order traversal (the same as document order). • Range preserves the parent-child containment relationship such that: dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
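The containment rule can be tried directly on the dnn and range values of this example. The sketch below is illustrative only: `Node`, `contains` and `parent_of` are hypothetical names, and picking the parent as the innermost enclosing node is one straightforward reading of the rule.

```python
from typing import NamedTuple, Optional, Sequence


class Node(NamedTuple):
    name: str
    dnn: int
    range: int


# dnn/range values as given in the example.
NODES = [
    Node("Root", 1, 100), Node("Ch A", 5, 25), Node("Fig D", 11, 2),
    Node("Sec E", 21, 5), Node("Ch B", 51, 30), Node("Sec F", 55, 10),
    Node("Fig G", 61, 2), Node("Fig H", 71, 2),
]


def contains(p: Node, c: Node) -> bool:
    """dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P)."""
    return p.dnn < c.dnn < c.dnn + c.range < p.dnn + p.range


def parent_of(node: Node, nodes: Sequence[Node]) -> Optional[Node]:
    """The parent is the innermost ancestor, i.e. the one with the largest dnn."""
    ancestors = [n for n in nodes if contains(n, node)]
    return max(ancestors, key=lambda n: n.dnn) if ancestors else None


by_name = {n.name: n for n in NODES}
fig_g_parent = parent_of(by_name["Fig G"], NODES)
```

Here `fig_g_parent` is Sec F, and `contains(by_name["Ch A"], by_name["Sec F"])` is false, so ancestor tests need no tree traversal.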
Durability upon Updates • Unused ranges are reserved between consecutive elements for future insertions. • When a new element Y is inserted between two consecutive elements X and Z, an unused SPaR range is assigned to Y according to the structural relationship between X, Y, and Z. • Range overflow is handled by variable-length floating-point numbers.
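One possible way to pick a range for a new element is sketched below. The slide does not give the assignment policy, so the "middle third of the free gap" rule is an assumption, as is the use of Python's `Fraction` as a stand-in for variable-length floating-point numbers. The caller is assumed to have derived the free gap from the structural relationship, e.g. from dnn(X)+range(X) to dnn(Z) when Y is a sibling between X and Z.

```python
from fractions import Fraction


def assign_range(gap_lo, gap_hi):
    """Return (dnn, range) for a new element inside the open gap (gap_lo, gap_hi).

    The new element takes the middle third of the gap (hypothetical policy),
    leaving unused space on both sides for later insertions.
    """
    gap_lo, gap_hi = Fraction(gap_lo), Fraction(gap_hi)
    if gap_hi <= gap_lo:
        raise ValueError("empty gap")
    width = (gap_hi - gap_lo) / 3
    return gap_lo + width, width


# Insert a sibling between Fig D (dnn=11, range=2) and Sec E (dnn=21):
# the free gap runs from 11 + 2 = 13 to 21.
dnn_y, range_y = assign_range(13, 21)
```

Because the numbers are exact rationals, repeated insertions into the same gap keep producing valid, ever smaller ranges instead of overflowing, and existing elements never need renumbering.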
SPaR Version Model • Elements are stored in DNN order along with: • Lifespan: (Tstart, Tend) • SPaR range • Adding a new version, VN: • Delete(E): set E.Tend to N-1 and free its SPaR range. • Insert(E): set the lifespan of E to (N, now) and assign it an unused SPaR range. • Update(E, new_value): Delete(E) + Insert(new_value) using the same SPaR range. • New elements of VN are appended into data pages in DNN order. • However, elements of VN may be scattered among low-usefulness data pages …
Version Reconstruction • To reconstruct version VN: • Step 1: Locate useful data pages using the Sparse Page Index. • Step 2: Order elements by their DNN. • Step 3: Reconstruct the ordered-tree structure of the document.
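The three operations can be mimicked with an append-only list of timestamped elements. This is a minimal in-memory sketch, not the paged storage the slides describe: `Element` and `VersionedStore` are hypothetical names, `now` is modelled as infinity, lifespans are treated as inclusive, and the bookkeeping of freed SPaR ranges is left out.

```python
import math
from dataclasses import dataclass

NOW = math.inf  # stand-in for the open-ended "now" timestamp


@dataclass
class Element:
    value: str
    dnn: float
    range: float
    t_start: int
    t_end: float = NOW


class VersionedStore:
    def __init__(self):
        self.elements = []  # append-only, as in the version model

    def insert(self, value, dnn, rng, n):
        """Insert(E): lifespan (N, now); the caller supplies an unused range."""
        element = Element(value, dnn, rng, n)
        self.elements.append(element)
        return element

    def delete(self, element, n):
        """Delete(E): E stops being valid after version N-1."""
        element.t_end = n - 1

    def update(self, element, new_value, n):
        """Update(E, new_value): Delete(E) + Insert(new_value), same SPaR range."""
        self.delete(element, n)
        return self.insert(new_value, element.dnn, element.range, n)

    def snapshot(self, version):
        """Elements valid at `version`, in DNN (document) order."""
        live = [e for e in self.elements if e.t_start <= version <= e.t_end]
        return sorted(live, key=lambda e: e.dnn)


store = VersionedStore()
ch_a = store.insert("Ch A", 5, 25, 1)
sec_e = store.insert("Sec E", 21, 5, 1)
store.update(sec_e, "Sec E (revised)", 2)
```

`store.snapshot(1)` still returns the original Sec E, while `store.snapshot(2)` returns the revised element at the same position in document order.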
Step 1: Locate Useful Pages • Sparse Page Index
[Figure: Sparse Page Index over versions 1 to 10 and pages P1 to P8. List L: P1(1,now), P4(3,now), P8(9,now), with an lpr pointer. Other entries: P2(1,6), P5(4,10), P3(2,5), P6(7,8), P7(8,9).]
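Reading each entry P(a,b) as the inclusive version interval during which page P is useful (an assumption; the figure does not spell this out), the lookup in Step 1 amounts to an interval-stabbing query. The sketch below answers it with a plain scan over the intervals from the figure; it illustrates what the index returns, not how the Sparse Page Index is organised internally.

```python
import math

NOW = math.inf  # stand-in for "now"

# Page intervals as listed in the figure.
PAGE_INTERVALS = {
    "P1": (1, NOW), "P2": (1, 6), "P3": (2, 5), "P4": (3, NOW),
    "P5": (4, 10), "P6": (7, 8), "P7": (8, 9), "P8": (9, NOW),
}


def useful_pages(version):
    """Pages whose interval covers `version` (linear scan, for illustration)."""
    return [page for page, (start, end) in PAGE_INTERVALS.items()
            if start <= version <= end]


pages_for_v8 = useful_pages(8)
```

Under this reading, reconstructing version 8 touches P1, P4, P5, P6 and P7 and skips P2, P3 and P8.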
Steps 2 and 3 • Ordering elements by their DNN: • Valid elements inserted at the same version are already sorted by DNN, for instance: …, V3: Ch Sec Fig Sec Fig Fig, …, V7: Sec Fig Sec Fig Sec Fig, …, V13: Ch Fig Sec Sec Fig Fig • Merge-sort these sorted lists. • Reconstructing the ordered-tree structure: • The parent-child relationship is determined by SPaR ranges. • Sibling order is implied by the DNN order. • Maintain a backward ancestor stack for back-tracking.
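Steps 2 and 3 can be sketched as a k-way merge followed by a single pass with an ancestor stack. The element tuples reuse the dnn and range values from the numbering example; how they are split across two insertion versions is hypothetical, and the output format (a list of depth/name pairs) is chosen only to keep the sketch short.

```python
import heapq


def merge_by_dnn(sorted_lists):
    """Step 2: merge per-version lists that are each already in DNN order."""
    return list(heapq.merge(*sorted_lists, key=lambda e: e[1]))


def build_tree(elements):
    """Step 3: derive each element's depth from SPaR ranges.

    elements: (name, dnn, range) tuples in DNN order.
    The stack holds dnn+range of the currently open ancestors.
    """
    stack, outline = [], []
    for name, dnn, rng in elements:
        # Back-track: drop ancestors whose range does not enclose this element.
        while stack and dnn + rng >= stack[-1]:
            stack.pop()
        outline.append((len(stack), name))
        stack.append(dnn + rng)
    return outline


# Hypothetical split of the example elements across two insertion versions.
inserted_v1 = [("Root", 1, 100), ("Ch A", 5, 25), ("Sec E", 21, 5),
               ("Ch B", 51, 30), ("Fig H", 71, 2)]
inserted_v2 = [("Fig D", 11, 2), ("Sec F", 55, 10), ("Fig G", 61, 2)]

outline = build_tree(merge_by_dnn([inserted_v1, inserted_v2]))
```

Each element is pushed and popped at most once, so the tree is rebuilt in one pass over the merged list.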
Regular Path Expression • Regular path query: “For version 10, retrieve all figures contained by a chapter.” doc[version=10]/Ch/*/Fig • Basic ideas: • Traditional algorithms traverse the tree structure to match the path pattern. • SPaR ranges make it possible to evaluate a path query using only a relational join operator. • We use the SPaR ranges of Ch elements to reduce the search space for Fig elements. • A Multi-version B+ Tree is built to support search by DNN.
Dense Element Index • A Multi-version B+ Tree (MVBT) keeps the history of a B+ Tree. • We use MVBTs to build dense element indexes.
Index | SPaR | Life | Loc
Ch_MVBT | (200,300) | (1,now) | Page P1
Ch_MVBT | (500,700) | (3,now) | Page P1
Fig_MVBT | (250,260) | (2,10) | Page P3
Fig_MVBT | (400,410) | (1,now) | Page P5
Fig_MVBT | (480,490) | (1,15) | Page P1
Fig_MVBT | (550,560) | (2,9) | Page P1
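The join idea can be illustrated with the index entries on this slide. The sketch rests on several assumptions: the two wide SPaR entries belong to Ch_MVBT and the four narrow ones to Fig_MVBT, each SPaR pair is read as the start and end of an interval, lifespans are inclusive, and a sorted list searched with `bisect` stands in for the MVBT. It also tests containment at any depth, which is looser than the exact two-level path `/Ch/*/Fig`.

```python
import bisect
import math

NOW = math.inf  # stand-in for "now"

# (SPaR interval, lifespan) pairs from the slide; index assignment is assumed.
CH_ENTRIES = [((200, 300), (1, NOW)), ((500, 700), (3, NOW))]
FIG_ENTRIES = [((250, 260), (2, 10)), ((400, 410), (1, NOW)),
               ((480, 490), (1, 15)), ((550, 560), (2, 9))]


def alive(life, version):
    return life[0] <= version <= life[1]


def figs_in_chapters(version):
    """SPaR intervals of Fig elements enclosed by some Ch element at `version`."""
    figs = sorted(spar for spar, life in FIG_ENTRIES if alive(life, version))
    starts = [lo for lo, _ in figs]
    result = []
    for (ch_lo, ch_hi), life in CH_ENTRIES:
        if not alive(life, version):
            continue
        # Only figures starting inside the chapter's range are examined.
        i = bisect.bisect_right(starts, ch_lo)
        while i < len(figs) and starts[i] < ch_hi:
            if figs[i][1] < ch_hi:
                result.append(figs[i])
            i += 1
    return result


figs_v10 = figs_in_chapters(10)
figs_v5 = figs_in_chapters(5)
```

At version 10 only the figure at (250,260) qualifies, because the one at (550,560) ended at version 9; at version 5 both are returned.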
Performance • Pages stored: size(RCS)/(1-Umin) • Retrieval of a single version: size(Version)/Umin pages • UBCC uses a separate edit script pointing to the data • to retrieve only useful pages • in the right order! • The SPaR scheme needs only SPaR ranges to reconstruct versions. • SPaR is slightly better than UBCC in storage cost and version reconstruction.
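To get a feel for the trade-off, the two expressions can be evaluated for a concrete Umin. The sketch below only plugs numbers into the formulas as stated on the slide; the page counts are made up for illustration, and the function name `usefulness_costs` is hypothetical. Umin = 1 is excluded here because the storage expression divides by 1-Umin.

```python
def usefulness_costs(size_rcs_pages, size_version_pages, u_min):
    """Return (pages stored, pages read per version) from the slide's formulas."""
    if not 0 < u_min < 1:
        raise ValueError("u_min must lie strictly between 0 and 1 here")
    pages_stored = size_rcs_pages / (1 - u_min)
    pages_read = size_version_pages / u_min
    return pages_stored, pages_read


# Made-up sizes: an RCS-style store of 300 pages, a single version of 70 pages.
stored, read = usefulness_costs(300, 70, 0.7)
```

With Umin = 0.7 this gives about 1000 pages stored and about 100 pages read per version. Raising Umin shrinks the retrieval cost toward the size of the version and inflates storage; lowering it does the opposite.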
Conclusion and Future Work • The web changes everything; XML unifies everything. • It’s time for a new technology that merges traditional versioning schemes and temporal databases and overcomes their limitations. • Usefulness-based clustering is effective and versatile: we applied it to edit-script-based schemes (UBCC) and to the SPaR scheme. • The SPaR numbering scheme makes it possible to build structural indexes over documents and to evaluate complex version queries efficiently. • Emerging issues: • Query language support for version queries. • User interface for browsing versions and presenting query results.