Storing and Querying Multi-version XML Documents using Durable Node Numbers

Presentation Transcript


  1. Storing and Querying Multi-version XML Documents using Durable Node Numbers • Shu-Yao Chien, Dept. of CS, UCLA, csy@cs.ucla.edu • Vassilis J. Tsotras, Dept. of CS&E, UC Riverside, tsotras@cs.ucr.edu • Carlo Zaniolo, Dept. of CS, UCLA, zaniolo@cs.ucla.edu • Donghui Zhang, Dept. of CS&E, UC Riverside, donghui@cs.ucr.edu

  2. Document Version Management • Traditional applications migrating to the web: • Software configuration management • Cooperative work • CAD • An array of web-based applications: • Web content providers and trackers • Link Permanence • WebDAV

  3. Problem Definition • New and old applications alike look to XML for a shared technology and toolset to support their varied requirements • Main requirements and research challenges: • Efficient version retrieval • Storage efficiency • Complex query support

  4. Traditional Versioning Schemes • Naive approach stores each version in its entirety: minimizes retrieval cost but uses storage very inefficiently. • RCS (Revision Control System): • stores the latest version in its entirety, and • represents old versions by deltas --- reverse edit scripts • minimizes storage cost • version retrieval cost grows linearly with version number • SCCS (Source Code Control System): • objects are timestamped and stored by their document order • version retrieval cost can be as high as reading the whole change history • These schemes are used by most current systems --- but they need improvements in storage management, retrieval, query, and support for complex objects.

  5. Databases --- Temporal, OO, Semi-structured, XML DB, … • DBs for CAD and for semi-structured information paid much attention to version support • Temporal DBs: efficient support for transaction time through various indexing schemes, e.g. the Snapshot Index and Multi-Version B+-Trees • But typical DBs do not support object ordering (since reconstruction of the complete document is not a critical query) • Numbering schemes have been proposed to represent document structure and to make the evaluation of regular path expressions more efficient.

  6. Storage Level Enhancement • UBCC [WebDB200] enhances RCS with page management • Flexibility of trading off storage and retrieval costs • Using the concept of Page Usefulness • Captures the document order of the objects in the (forward) edit script

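  7. Page Usefulness – by Example • A page holding objects A, B, C, D, observed at three versions: • T1: no deletions; usefulness 100% (useful) • T2: one object deleted (DEL); usefulness 75% • T3: two more objects deleted (DEL, DEL); usefulness 25% (useless) • We set a minimum usefulness requirement Umin, e.g. 70% (0 < Umin <= 1). • A page is useful when its usefulness is above Umin and useless when it is below Umin.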
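
The usefulness figures in this example can be reproduced with a small sketch. It assumes a full page of four objects, lifespans with an inclusive end version, and an arbitrary choice of which object is deleted when (the transcript only preserves the counts). Treating a page exactly at Umin as useful is also an assumption.

```python
NOW = float("inf")  # stands in for the slides' "now"

def page_usefulness(lifespans, version):
    """Fraction of a (full) page's objects that are valid at `version`."""
    valid = sum(1 for t_start, t_end in lifespans if t_start <= version <= t_end)
    return valid / len(lifespans)

def is_useful(lifespans, version, u_min=0.7):
    # Assumption: a page sitting exactly at Umin still counts as useful.
    return page_usefulness(lifespans, version) >= u_min

# Objects A, B, C, D, all inserted at version 1. One is deleted in
# version 2 and two more in version 3 (which ones is hypothetical).
page = [(1, 1), (1, 2), (1, 2), (1, NOW)]

for v in (1, 2, 3):
    print(v, page_usefulness(page, v), is_useful(page, v))
```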

  8. Usefulness Based Copy Control (UBCC) • STEP 1: Determine page usefulness for copying. • STEP 2: Append new/copied objects into new pages by their logical order. • Example with Umin = 70%: • VERSION 1 objects, stored in pages P1 and P2: Root, Ch A, Fig D, Sec E, Ch B, Sec F, Fig G, Fig H • VERSION 2 edit script: DEL, DEL, DEL, INS(Sec J), INS(Fig M), INS(Ch K), INS(Sec L) • After the deletions, U(P1) = 75% and U(P2) = 50% < Umin = 70%, so the still-valid objects of P2 are copied (COPY) • New pages P3 and P4 hold Sec J, Ch B, Sec F, Fig M, Ch K, Sec L, with U(P3) = 100% and U(P4) = 100%
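
The two UBCC steps can be sketched as follows. This is only an illustration under stated assumptions: pages hold four objects, P1 holds the first four Version 1 objects and P2 the rest, and the deleted objects are Sec E, Fig G and Fig H (the transcript does not say which object of P1 was deleted, so Sec E is a guess). The function name `ubcc_step` is hypothetical.

```python
def ubcc_step(pages, deleted, inserted, new_doc_order, page_size=4, u_min=0.7):
    """Return the new pages written for the next version."""
    to_write = set(inserted)
    # STEP 1: a page whose usefulness drops below Umin has its
    # still-valid objects copied forward.
    for page in pages:
        valid = [obj for obj in page if obj not in deleted]
        if len(valid) / page_size < u_min:
            to_write.update(valid)
    # STEP 2: append new and copied objects in logical (document) order.
    ordered = sorted(to_write, key=new_doc_order.index)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]

p1 = ["Root", "Ch A", "Fig D", "Sec E"]
p2 = ["Ch B", "Sec F", "Fig G", "Fig H"]
deleted = {"Sec E", "Fig G", "Fig H"}            # assumed deletions
inserted = ["Sec J", "Fig M", "Ch K", "Sec L"]
# Assumed document order of Version 2.
v2_order = ["Root", "Ch A", "Fig D", "Sec J", "Ch B", "Sec F",
            "Fig M", "Ch K", "Sec L"]

new_pages = ubcc_step([p1, p2], deleted, inserted, v2_order)
print(new_pages)
```

Under these assumptions P1 stays at 75% and is left alone, while P2 falls to 50% and its survivors Ch B and Sec F are rewritten next to the inserted objects, which matches the contents listed for P3 and P4.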

  9. New Support Is Needed … • Complex Query Support: • Temporal Selection • Structural Projection • Content-Based Selection • Regular Path Expression • Query on Diff • UBCC is not efficient in supporting version queries. • A new scheme is needed …

  10. The SPaR Versioning Scheme • SPaR numbering scheme • Version model • Complex query support • Usefulness-based storage strategy

  11. SPaR Numbering Scheme • XML document structure is represented by: • a Durable Node Number (DNN), and • a Range • DNN is a sparse numbering scheme that preserves element order. • Range preserves parent-child relationships. • Documents can be decomposed and stored as separate elements, then reconstructed (possibly only in part) when needed. • Indexes can be built upon DNN and Range for efficient XML query evaluation.

  12. SPaR Numbering Scheme --- by Example • Example tree: • Root (dnn=1, range=100) • Ch A (dnn=5, range=25), with children Fig D (dnn=11, range=2) and Sec E (dnn=21, range=5) • Ch B (dnn=51, range=30), with children Sec F (dnn=55, range=10) and Fig H (dnn=71, range=2) • Fig G (dnn=61, range=2), child of Sec F • Interval labels shown in the figure: 1 to 100, 5 to 30, 51 to 80, 21 to 25, 55 to 65 • DNN is a sparse numbering scheme that preserves element order as pre-order traversal (the same as document order). • Range preserves the parent-child containment relationship such that: dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
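
The containment inequality can be checked directly against the numbers of this example. The sketch below is illustrative only; the `Node` type and the `contains` helper are names introduced here, not part of the presentation. Note that the test holds for any ancestor, not only the direct parent.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    dnn: float
    range: float

def contains(p: Node, c: Node) -> bool:
    """dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P)."""
    return p.dnn < c.dnn < c.dnn + c.range < p.dnn + p.range

root  = Node("Root", 1, 100)
ch_a  = Node("Ch A", 5, 25)
ch_b  = Node("Ch B", 51, 30)
fig_d = Node("Fig D", 11, 2)
sec_f = Node("Sec F", 55, 10)
fig_g = Node("Fig G", 61, 2)

print(contains(ch_a, fig_d))   # Fig D lies inside Ch A
print(contains(ch_a, sec_f))   # Sec F belongs to Ch B instead
print(contains(root, fig_g))   # ancestors further up also contain Fig G
```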

  13. Durability upon Updates • Unused ranges are saved between consecutive elements for future insertions. • When a new element Y is inserted between two consecutive elements X and Z, an unused SPaR range is assigned to Y according to the structural relationship between X, Y, and Z. • Range overflow is handled by variable-length floating-point numbers.
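
The slide does not spell out how the unused range is split, so the following is only a sketch of one possible policy: give the new element the middle third of the gap, leaving room on both sides for later insertions. Python's `Fraction` is used here as a stand-in for variable-length numbers, so repeated insertions never run out of precision; the function name `assign_in_gap` and the one-third policy are assumptions.

```python
from fractions import Fraction

def assign_in_gap(gap_start, gap_end):
    """Pick (dnn, range) strictly inside the unused gap (gap_start, gap_end)."""
    start, end = Fraction(gap_start), Fraction(gap_end)
    width = end - start
    if width <= 0:
        raise ValueError("no unused range left in this gap")
    third = width / 3
    return start + third, third   # middle third of the gap

# Append a new last child to a parent whose range ends at 30, after a
# sibling whose own range ends at 26 (numbers chosen for illustration).
dnn, rng = assign_in_gap(26, 30)
print(dnn, rng, dnn + rng)

# Keep inserting in front of the previous insertion: the gap shrinks,
# but an exact rational always leaves room for one more element.
upper = dnn
for _ in range(40):
    d, r = assign_in_gap(26, upper)
    assert 26 < d < d + r < upper
    upper = d
```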

  14. SPaR Version Model • Elements are stored by their DNN order along with: • Lifespan --- (Tstart, Tend) • SPaR range • Adding a new version, VN: • Delete(E) – Set E.Tend to N-1 and free its SPaR range. • Insert(E) – Set the lifespan of E to (N, now) and assign it an unused SPaR range. • Update(E, new_value) – Delete(E) + Insert(new_value) using the same SPaR range. • New elements of VN are appended into data pages by their DNN order. • However, elements of VN may be scattered among low usefulness data pages …
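
The three operations can be sketched with a lifespan attached to every element. This is a minimal in-memory illustration, assuming an inclusive end version and ignoring data pages; the `Element` class and the `snapshot` helper are hypothetical names.

```python
from dataclasses import dataclass

NOW = float("inf")  # stands in for "now"

@dataclass
class Element:
    value: str
    dnn: float
    range: float
    t_start: int
    t_end: float = NOW

def insert(store, value, dnn, rng, n):
    """Insert(E): lifespan (N, now) with the given unused SPaR range."""
    e = Element(value, dnn, rng, n)
    store.append(e)
    return e

def delete(e, n):
    """Delete(E): the element was last valid in version N-1."""
    e.t_end = n - 1

def update(store, e, new_value, n):
    """Update(E, new_value): Delete + Insert reusing the same SPaR range."""
    delete(e, n)
    return insert(store, new_value, e.dnn, e.range, n)

def snapshot(store, v):
    """Values valid at version v, in DNN (document) order."""
    live = [e for e in store if e.t_start <= v <= e.t_end]
    return [e.value for e in sorted(live, key=lambda e: e.dnn)]

store = []
insert(store, "Ch A", 5, 25, 1)
fig = insert(store, "Fig D", 11, 2, 1)
sec = insert(store, "Sec E", 21, 5, 1)
update(store, fig, "Fig D (revised)", 2)   # version 2 changes Fig D
delete(sec, 2)                             # and deletes Sec E
print(snapshot(store, 1), snapshot(store, 2))
```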

  15. Version Reconstruction • To reconstruct version VN: • Step 1 --- Locate useful data pages using the Sparse Page Index. • Step 2 --- Order elements by their DNN. • Step 3 --- Reconstruct the ordered-tree structure of the document.

  16. Step 1 --- Locate Useful Pages • Sparse Page Index, drawn over versions 1 to 10 and pages P1 to P8, each page annotated with a version interval: • List L: P1(1,now), P4(3,now), P8(9,now) • Labelled lpr in the figure: P2(1,6), P5(4,10), P3(2,5), P6(7,8), P7(8,9)
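
Reading each pair as the interval of versions in which the page is useful, the page lookup for a given version reduces to an interval test. The sketch below uses a plain scan rather than the Sparse Page Index itself, whose internal structure is not recoverable from the transcript; the inclusive interval end is an assumption.

```python
NOW = float("inf")

# Page intervals as listed on the slide.
PAGE_INTERVALS = {
    "P1": (1, NOW), "P2": (1, 6), "P3": (2, 5), "P4": (3, NOW),
    "P5": (4, 10), "P6": (7, 8), "P7": (8, 9), "P8": (9, NOW),
}

def useful_pages(index, version):
    """Pages whose interval covers `version` (linear scan stand-in)."""
    return [page for page, (start, end) in index.items()
            if start <= version <= end]

print(useful_pages(PAGE_INTERVALS, 5))
print(useful_pages(PAGE_INTERVALS, 9))
```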

  17. Step 2 and 3 • Ordering elements by their DNNs --- • Valid elements inserted at the same version are already sorted by their DNN, for instance: • V3: … Ch Sec Fig Sec Fig Fig … • V7: … Sec Fig Sec Fig Sec Fig … • V13: … Ch Fig Sec Sec Fig Fig … • Merge-sort these sorted lists. • Reconstructing ordered-tree structure --- • Parent-child is determined by SPaR ranges. • Sibling order is implied by the DNN order. • Maintain a backward ancestor stack for back-tracking.
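
Steps 2 and 3 can be sketched as a k-way merge of the per-version sorted runs followed by a single pass with an ancestor stack. The element numbers below are illustrative, and splitting them into two runs is a hypothetical insertion history; elements are `(name, dnn, range)` tuples.

```python
import heapq

def reconstruct(sorted_runs):
    """Merge DNN-sorted runs, then rebuild the tree with an ancestor stack."""
    merged = heapq.merge(*sorted_runs, key=lambda e: e[1])
    roots, stack = [], []
    for name, dnn, rng in merged:
        node = {"name": name, "end": dnn + rng, "children": []}
        # Back-track: pop ancestors whose range does not contain this node.
        while stack and not node["end"] < stack[-1]["end"]:
            stack.pop()
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots

def outline(node):
    """Nested (name, [children]) view, convenient for comparing trees."""
    return (node["name"], [outline(c) for c in node["children"]])

run_a = [("Root", 1, 100), ("Ch A", 5, 25), ("Sec E", 21, 5), ("Ch B", 51, 30)]
run_b = [("Fig D", 11, 2), ("Sec F", 55, 10), ("Fig G", 61, 2), ("Fig H", 71, 2)]

tree = [outline(r) for r in reconstruct([run_a, run_b])]
print(tree)
```

Because the merged stream is in DNN order, each element only needs to be compared with the top of the stack, so the tree is rebuilt in one pass without revisiting earlier siblings.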

  18. Regular Path Expression • Regular Path Query --- “For version 10, retrieve all figures contained by a chapter.” doc[version=10]/Ch/*/Fig • Basic Ideas: • Traditional algorithms trace the tree structure to match the path pattern. • SPaR ranges make it possible to evaluate a path query using a relational join operator. • We use the SPaR ranges of Ch elements to reduce the search space for Fig elements. • A Multi-version B+ Tree is built to support searches on DNNs.

  19. Dense Element Index • Multi-version B+ Tree (MVBT) keeps the history of a B+ Tree. • We use MVBT to build dense element indexes. • Ch_MVBT entries: … • SPaR: (200,300), Life: (1,now), Loc: Page P1 • SPaR: (500,700), Life: (3,now), Loc: Page P1 • … • Fig_MVBT entries: … • SPaR: (250,260), Life: (2,10), Loc: Page P3 • SPaR: (400,410), Life: (1,now), Loc: Page P5 • … • SPaR: (480,490), Life: (1,15), Loc: Page P1 • SPaR: (550,560), Life: (2,9), Loc: Page P1
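
The join idea can be sketched on the index entries shown on this slide, with sorted lists and binary search standing in for the MVBTs. It assumes the SPaR pairs are (start, end) intervals, that lifespans are inclusive, and that the first two entries belong to Ch_MVBT and the remaining four to Fig_MVBT. The sketch answers "figures contained by a chapter" by containment alone; it does not enforce the exact one-level gap implied by `/Ch/*/Fig`, which would need depth information.

```python
from bisect import bisect_left, bisect_right

NOW = float("inf")

# (spar_start, spar_end, life_start, life_end, page)
CH_INDEX = [(200, 300, 1, NOW, "P1"), (500, 700, 3, NOW, "P1")]
FIG_INDEX = [(250, 260, 2, 10, "P3"), (400, 410, 1, NOW, "P5"),
             (480, 490, 1, 15, "P1"), (550, 560, 2, 9, "P1")]

def alive(entry, version):
    return entry[2] <= version <= entry[3]

def figs_in_chapters(ch_index, fig_index, version):
    """Containment join: Fig entries lying inside a live Ch range."""
    figs = sorted(f for f in fig_index if alive(f, version))
    starts = [f[0] for f in figs]
    result = []
    for ch in sorted(c for c in ch_index if alive(c, version)):
        # Only Fig entries starting inside the chapter's range are examined.
        lo = bisect_right(starts, ch[0])
        hi = bisect_left(starts, ch[1])
        result.extend(f for f in figs[lo:hi] if f[1] < ch[1])
    return result

print(figs_in_chapters(CH_INDEX, FIG_INDEX, 10))
print(figs_in_chapters(CH_INDEX, FIG_INDEX, 5))
```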

  20. Performance • Pages stored: size(RCS)/(1-Umin) • Retrieval of a single version: size(Version)/Umin pages • UBCC uses a separate edit script pointing to the data • to retrieve only useful pages • in the right order! • The SPaR scheme needs only SPaR ranges to reconstruct versions. • SPaR is slightly better than UBCC in storage cost and version reconstruction.
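
Taking the two expressions on this slide at face value, a short calculation shows how Umin trades storage against retrieval cost. The page counts are made-up inputs, and the function names are introduced here for illustration.

```python
def pages_stored(rcs_pages, u_min):
    """size(RCS) / (1 - Umin); undefined at Umin = 1."""
    if not 0 < u_min < 1:
        raise ValueError("this expression needs 0 < Umin < 1")
    return rcs_pages / (1 - u_min)

def pages_read(version_pages, u_min):
    """size(Version) / Umin pages to retrieve a single version."""
    return version_pages / u_min

# Hypothetical sizes: 1000 pages of RCS storage, 100 pages per version.
for u_min in (0.25, 0.5, 0.75):
    print(u_min, pages_stored(1000, u_min), pages_read(100, u_min))
```

Raising Umin increases the storage figure and lowers the retrieval figure, which is the trade-off the usefulness threshold is meant to control.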

  21. Performance and Storage Cost (10% inserted, 10% deleted)

  22. Conclusion and Future Work • The web changes everything; XML unifies everything. • It’s time for a new technology that merges traditional versioning schemes and temporal databases and overcomes their limitations. • Usefulness-based clustering is effective and versatile: we applied it to edit-script-based schemes (UBCC) and to the SPaR scheme. • The SPaR numbering scheme makes it possible to build a document structural index and to evaluate complex version queries efficiently. • Emerging issues: • Query language support for version queries. • User interface for browsing versions and presenting query results.
