Version Management for XML Documents: Copy-Based vs Edit-Based Schemes
Vassilis J. Tsotras, Department of Computer Science and Engineering, University of California, Riverside, tsotras@cs.ucr.edu
Carlo Zaniolo, Computer Science Department, University of California, Los Angeles, zaniolo@cs.ucla.edu
Shu-Yao Chien, Computer Science Department, University of California, Los Angeles, csy@cs.ucla.edu
The Problem
• Managing (storing and querying) multiple versions of documents is important for content providers and cooperative work.
• Temporal DBs: transaction time; CAD/OO applications.
• Web/XML changes and unifies everything.
• Traditional schemes (RCS, SCCS): not optimized for secondary storage, with no temporal clustering.
• DB-oriented approaches: not optimized for retrieval of complete documents.
• Transport level: exchange and browser-side processing of multiversion documents are also critical, so the storage and exchange representations need to be reconciled.
Version Management: Approaches
• Time stamping of objects.
• Store all snapshots: fast retrieval, excessive storage.
• Edit-based schemes store the deltas: minimal storage but slow retrieval.
• Traditionally line-oriented DIFF, but semistructured objects in Lorel.
• Our scheme: Usefulness Based Copy Control (UBCC)
  - Separate edit scripts from the objects.
  - Temporal clustering of objects using page usefulness.
Example: an Evolving XML Document

VERSION 1
Order  Object
1      <root>
2        <ch A>
3          <sec D> ... </sec>
4          <sec E> … </sec>
         </ch>
5        <ch B>
6          <sec F> … </sec>
7          <sec G> … </sec>
8          <sec H> … </sec>
         </ch>
       </root>

VERSION 2
Order  Object
1      <root>
2        <ch A>
3          <sec J> … </sec>
4          <sec E> … </sec>
         </ch>
5        <ch B>
6          <sec F> … </sec>
7          <sec G’> … </sec>
         </ch>
8        <ch K>
9          <sec L> … </sec>
         </ch>
       </root>
Temporal Clustering by Page Usefulness
• Usefulness: the percentage of a page occupied by objects from the current version; the rest is occupied by ‘dead’ objects from previous versions.
• We set a minimum usefulness requirement, e.g. 50%.
• When the usefulness of a page falls below this minimum, we copy its live objects to a new page.
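The usefulness test can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption that is not in the slides: all objects have the same size, so usefulness reduces to the number of live objects divided by the page capacity. The function names and the `u_min` parameter are hypothetical.

```python
def page_usefulness(live_objects: int, capacity: int) -> float:
    """Fraction of the page occupied by objects of the current version.

    Assumes fixed-size objects, so the fraction is a simple count ratio.
    """
    return live_objects / capacity


def should_copy(live_objects: int, capacity: int, u_min: float = 0.5) -> bool:
    """True when usefulness has fallen below the minimum requirement."""
    return page_usefulness(live_objects, capacity) < u_min


# A page with 4 slots and 3 live objects stays put under a 70% minimum,
# while a page with only 2 live objects would have its live objects copied.
print(should_copy(3, 4, u_min=0.7))
print(should_copy(2, 4, u_min=0.7))
```

A page exactly at the minimum is not copied in this sketch, since the slide speaks of usefulness falling below the threshold.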
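Maintaining Page Usefulness above 70% by Copying Live Objects
[Figure: In Version 1, objects O1 to O8 are stored on pages P1 and P2. After three deletions in Version 2, U(P1) = 75% and U(P2) = 50% < Umin = 70%, so the live objects O5 and O6 are copied and stored with the new objects O9 and O10 on a new page P3, with U(P3) = 100%.]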
Usefulness Based Copy Control (UBCC)
• STEP 1: Determine page usefulness for copying.
• STEP 2: Append new/copied objects into new pages by their logical order.

VERSION 1
  P1: root, ch A, sec D, sec E
  P2: ch B, sec F, sec G, sec H
VERSION 2
  Edits: DEL, DEL, DEL; INS(sec J), INS(sec G’), INS(ch K), INS(sec L); COPY
  U(P1) = 75%
  U(P2) = 50% < Umin = 70%
  P3: sec J, ch B, sec F, sec G’ (U(P3) = 100%)
  P4: ch K, sec L (U(P4) = 100%)
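The two steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes fixed-size objects, a page capacity of four objects, objects identified by their labels, and a hypothetical `location` map from each live object to its page index. The function name `ubcc_update` is likewise made up for this sketch.

```python
def ubcc_update(pages, location, new_version, capacity=4, u_min=0.7):
    """Apply one version change under a simplified UBCC policy.

    pages:       list of pages, each a list of object ids (append-only store)
    location:    dict mapping each live object id to its page index
    new_version: object ids of the new version in logical (document) order
    Returns the location map of the new version; new pages are appended to pages.
    """
    wanted = set(new_version)
    # Deletions: objects absent from the new version become dead.
    live = {obj: page for obj, page in location.items() if obj in wanted}

    # STEP 1: find pages whose usefulness fell below the minimum.
    live_count = {}
    for page in live.values():
        live_count[page] = live_count.get(page, 0) + 1
    low = {page for page, n in live_count.items() if n / capacity < u_min}

    # New objects are inserted; live objects on low-usefulness pages are copied.
    to_append = [obj for obj in new_version
                 if obj not in live or live[obj] in low]

    # STEP 2: append new/copied objects to fresh pages in logical order.
    for start in range(0, len(to_append), capacity):
        chunk = to_append[start:start + capacity]
        pages.append(chunk)
        for obj in chunk:
            live[obj] = len(pages) - 1
    return live


# The example from the slides: two pages in Version 1, then Version 2.
pages = [["root", "ch A", "sec D", "sec E"],
         ["ch B", "sec F", "sec G", "sec H"]]
location = {obj: i for i, page in enumerate(pages) for obj in page}
version2 = ["root", "ch A", "sec J", "sec E", "ch B",
            "sec F", "sec G'", "ch K", "sec L"]
location2 = ubcc_update(pages, location, version2)
print(pages[2])
print(pages[3])
```

With these inputs the page holding root stays in place at 75% usefulness, while the live objects of the second page are copied and interleaved with the inserted objects on the new pages.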
Document Object Order
• Version 2 objects are not stored in sequence (subscripts give each object's position in document order):
  VERSION 2 = (root₁, ch A₂, sec J₃, sec E₄, ch B₅, sec F₆, sec G’₇, ch K₈, sec L₉)
  P1: root₁, ch A₂, sec D, sec E₄
  P2: ch B, sec F, sec G, sec H
  P3: sec J₃, ch B₅, sec F₆, sec G’₇
  P4: ch K₈, sec L₉
• Hence, we use the edit script to recover the document order.
Beyond Edit-Based Versioning
• The UBCC scheme achieves good storage and retrieval efficiency.
• But it is not suitable at the transport level or for queries on content.
• Thus, we propose a copy-based model which:
  • explores shared elements
  • needs no edit script
  • yields a simple XML representation for the document history
The XML Version Model (XVM)
• XVM is a list of version nodes.
• Each version node is an ordered tree consisting of four types of nodes:
  • element node
  • attribute node
  • text node
  • copy record node
• Minimal extensions to the XPath data model: the copy record node is actually a link.
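Copy-Based XML Version Model (XVM)
[Figure: two version trees; a copy record node in the later version holds a tree address reference (Tree Addr Ref: V1.2.1) to a subtree of the earlier version.]
Legend: V = version node, E = element node, T = text node, A = attribute node, C = copy record node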
XVM: Example
Changes from V1 to V2:
1. DELETE chapter “Tutorial”
2. INSERT chapter “Second Ex”
Changes from V2 to V3:
1. UPDATE the textual content of chapter “Second Ex”
2. COPY the “Concepts” section and insert it after section “Test Data”.
[Figure: version nodes V1, V2, and V3. The chapters shown are “Intro”, “Tutorial”, and “Second Ex”; the sections shown are “Concepts”, “Scope”, “Test Data”, and “Context”. Unchanged content is referenced through copy record nodes (C) with tree addresses V1.1, V2.1, V2.2.1, and V2.1.2.]
XVM Version Retrieval: Example
[Figure: the V1, V2, V3 tree from the XVM example, with copy record nodes (C) carrying the tree addresses V1.1, V2.1, V2.2.1, and V2.1.2.]
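Retrieving a version amounts to replacing every copy record with the subtree it points to. The Python sketch below illustrates this under several assumptions that the slides do not confirm: a tree address such as V1.2.1 is read as "version 1, second child, then first child" with 1-based positions, a copy record met along an address path is followed transparently, and attribute and text nodes are left out for brevity. The class and function names are hypothetical, and the small repository only borrows labels from the example; it is not the exact tree of the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    label: str
    children: list = field(default_factory=list)


@dataclass
class CopyRecord:
    ref: str  # tree address, e.g. "V1.2.1" (address format is an assumption)


def lookup(repo, ref):
    """Return the node at a tree address, following copy records on the way."""
    head, *steps = ref.split(".")
    children = repo[int(head[1:])]          # top-level nodes of that version
    node = None
    for step in steps:
        node = children[int(step) - 1]      # 1-based child position
        while isinstance(node, CopyRecord):
            node = lookup(repo, node.ref)
        children = getattr(node, "children", [])
    return node


def expand(repo, node):
    """Copy a subtree with every copy record replaced by its target."""
    if isinstance(node, CopyRecord):
        return expand(repo, lookup(repo, node.ref))
    return Element(node.label, [expand(repo, c) for c in node.children])


def retrieve(repo, version):
    """Materialize the full snapshot of one version."""
    return [expand(repo, node) for node in repo[version]]


# Hypothetical three-version repository using labels from the example.
repo = {
    1: [Element("chapter Intro", [Element("section Concepts"),
                                  Element("section Scope")]),
        Element("chapter Tutorial")],
    2: [CopyRecord("V1.1"),
        Element("chapter Second Ex", [Element("section Test Data")])],
    3: [CopyRecord("V2.1"),
        Element("chapter Second Ex", [CopyRecord("V2.2.1"),
                                      CopyRecord("V2.1.1")])],
}
snapshot = retrieve(repo, 3)
print([chapter.label for chapter in snapshot])
```

Because copy records always point to earlier versions in this sketch, the recursion terminates; a real implementation would also need to guard against dangling or cyclic references.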
XVM Benefits
• Transport level: represent XVM as an XML document, with its DTD automatically generated from the document DTD.
• Storage level: we extended the usefulness-based temporal clustering scheme to XVM.
XVM Implementation: Use XML to Represent XVM
• DTD transformation:
  • Define three new elements: <Repository>, <Version> and <CopyRecord>.
  • For each element in the original DTD, add a CopyRecord to its content model as an alternate.
• Example:

Original DTD
  <!ELEMENT volumn (chapter)*>
  <!ELEMENT chapter (title,(sec)*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT sec (#PCDATA)>
  . . .

Version DTD
  <!ELEMENT Repository (Version)+>
  <!ELEMENT Version (volumn)>
  <!ELEMENT CopyRecord EMPTY>
  <!ATTLIST CopyRecord Ref IDREF #REQUIRED>
  <!ELEMENT volumn (chapter)*>
  <!ELEMENT chapter ((title,(sec)*)|CopyRecord)>
  <!ELEMENT title ((#PCDATA)|CopyRecord)>
  <!ELEMENT sec ((#PCDATA)|CopyRecord)>
  . . .
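The transformation can be sketched as a small Python function over the element declarations. It is only an illustration: the regular expression handles simple `<!ELEMENT name model>` declarations and nothing else, the function name `to_version_dtd` is hypothetical, and, following the example rather than the stated rule, the root element is left unchanged. One deliberate deviation is labeled in the code: XML 1.0 only accepts mixed content in the form `(#PCDATA|Name)*`, so text-only elements are emitted in that form rather than in the slide's shorthand.

```python
import re

# Matches simple declarations such as: <!ELEMENT chapter (title,(sec)*)>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\w+)\s*(.*?)\s*>", re.S)


def to_version_dtd(original_dtd: str, root: str) -> str:
    """Derive a version DTD by adding CopyRecord as an alternate."""
    out = [
        "<!ELEMENT Repository (Version)+>",
        f"<!ELEMENT Version ({root})>",
        "<!ELEMENT CopyRecord EMPTY>",
        "<!ATTLIST CopyRecord Ref IDREF #REQUIRED>",
    ]
    for name, model in ELEMENT_DECL.findall(original_dtd):
        if name == root:
            # The example keeps the root's content model as it is.
            out.append(f"<!ELEMENT {name} {model}>")
        elif model == "(#PCDATA)":
            # Deviation from the slide: valid XML 1.0 mixed-content syntax.
            out.append(f"<!ELEMENT {name} (#PCDATA|CopyRecord)*>")
        else:
            # Other mixed-content models are not handled by this sketch.
            out.append(f"<!ELEMENT {name} ({model}|CopyRecord)>")
    return "\n".join(out)


original = """
<!ELEMENT volumn (chapter)*>
<!ELEMENT chapter (title,(sec)*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT sec (#PCDATA)>
"""
version_dtd = to_version_dtd(original, root="volumn")
print(version_dtd)
```

Generating the version DTD mechanically like this is what lets the transport representation follow any document DTD without hand-written schema changes.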
Conclusion
• UBCC is efficient at the storage level.
• The copy-based scheme is effective as a storage representation and a transport representation.
• Our current research focuses on efficient evaluation of queries on versions:
  • content queries,
  • snapshot queries,
  • history queries.