Etude comparative sur la détection de changements en XML (A comparative study for XML change detection) Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)
Context
• Consider change control in XML data warehouses. We want to understand changes
• We have only the old and new versions of the documents
• A diff needs to be computed
Grégory Cobéna (INRIA)
Organization
• Motivations
• Data Model
• Representing Changes
  • Version Management and Querying
  • Comparison of Change representation models
  • Experiments
• Detecting Changes
  • State of the art in change detection
  • Performance analysis and experiments
  • Quality analysis and experiments
• Summary
Motivations: Representing Changes
• Version management: the representation should allow for effective storage strategies
• Temporal databases: support for persistent identification of nodes is mandatory
• Monitoring: information about changes is used to support triggers or detect events
• Note: HTML or XHTML documents may be used
Motivations: Detecting Changes
• Correctness: the diff programs miss no changes
• Minimality: a small result is important to save storage space and network bandwidth
• Semantics: some algorithms take more of the semantics of XML documents into account
• Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
• ‘Move’ operations: some algorithms support move operations whereas others don’t. This impacts both the performance of the tool and the quality of results.
Data Model (quick overview)
• Operations are:
  • (i) insert, delete applied to leaves or subtrees
  • (ii) update of text nodes
  • (iii) move applied to a subtree root, moving the entire subtree
• An edit cost is assigned to each operation. Usually, the cost is 1 per node touched
• The semantics of move is to identify subtrees even when their context has changed.
• We use the notion of a mapping between the two trees. Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).
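The cost model above can be illustrated with a minimal Python sketch. The names `Op`, `subtree_size`, `op_cost` and `script_cost` are hypothetical, and charging a flat cost of 1 for `update` and `move` is an assumption of this sketch, not something the slides specify; only "1 per node touched" comes from the slide.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


def subtree_size(node: ET.Element) -> int:
    """Count element nodes plus non-empty text nodes in a subtree (tails are ignored)."""
    size = 1 + (1 if node.text and node.text.strip() else 0)
    return size + sum(subtree_size(child) for child in node)


@dataclass
class Op:
    kind: str            # "insert", "delete", "update" or "move"
    subtree: ET.Element  # root of the subtree the operation applies to


def op_cost(op: Op) -> int:
    # insert/delete touch every node of the subtree (1 per node touched);
    # update and move are charged 1 here, which is an assumption of this sketch.
    if op.kind in ("insert", "delete"):
        return subtree_size(op.subtree)
    if op.kind in ("update", "move"):
        return 1
    raise ValueError(f"unknown operation: {op.kind}")


def script_cost(script: list[Op]) -> int:
    """Total cost of an edit script: the sum of its operation costs."""
    return sum(op_cost(op) for op in script)


u, v = ET.fromstring("<u/>"), ET.fromstring("<v/>")
move_script = [Op("move", u), Op("move", v)]
delete_insert_script = [Op("delete", u), Op("delete", v),
                        Op("insert", u), Op("insert", v)]
print(script_cost(move_script), script_cost(delete_insert_script))
```

Under these assumptions, moving two leaves costs 2, while deleting and re-inserting them costs 4.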
Data Model: Intuition
[Figure: two copies of a tree with nodes root, a, b, c, x, y. Panel titles: "Tai’s model: delete ‘b’" and "Selkow’s model: delete ‘b’"]
Representing Changes
• Version Management
  • There are several version management strategies. For instance, when only deltas are stored, their size must be reduced
  • We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases.
  • A simple text-based version management is possible but cannot be used for querying.
• Querying Changes
  • Labeling nodes by prefix+postfix identifiers improves querying algorithms
  • Labeling nodes with persistent identifiers improves temporal databases
  • There is no short labeling scheme that is good for both
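To illustrate why prefix+postfix labels help querying, here is a minimal sketch, assuming each element is labeled with its preorder and postorder rank. The helper names `label` and `is_ancestor` are hypothetical. With such labels, an ancestor test needs only two integer comparisons and no tree traversal.

```python
import xml.etree.ElementTree as ET


def label(root: ET.Element) -> dict:
    """Assign a (preorder rank, postorder rank) pair to every element."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node:
            visit(child)
        # The postorder rank is assigned once all children have been visited.
        labels[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return labels


def is_ancestor(labels, a, d) -> bool:
    """a is a proper ancestor of d iff a starts before d and finishes after d."""
    pre_a, post_a = labels[a]
    pre_d, post_d = labels[d]
    return pre_a < pre_d and post_a > post_d


doc = ET.fromstring(
    "<catalog>"
    "<product><name/><description/><price/></product>"
    "<product><name/><description/><status/></product>"
    "</catalog>"
)
labels = label(doc)
print(is_ancestor(labels, doc[1], doc[1][2]))  # second product vs. its status
print(is_ancestor(labels, doc[0], doc[1][2]))  # first product vs. that status
```

Inserting or deleting a node shifts the ranks of the nodes that follow it, so these labels are not persistent across versions, which is the trade-off the last bullet points at.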
Our Example
Old version:
<catalog>
  <product>
    <name>Notebook</name>
    <description> 2200MHz Pentium4 </description>
    <price>$1999</price>
  </product>
  <product>
    <name>Digital Camera</name>
    <description> Fuji FinePix 2600Z </description>
    <status> Not Available </status>
  </product>
</catalog>
New version:
<catalog>
  <product>
    <name>Notebook</name>
    <description> 2200MHz Pentium4 </description>
    <price>$1999</price>
  </product>
  <product>
    <name>Digital Camera</name>
    <description> Fuji FinePix 2600Z </description>
    <price> $299 </price>
  </product>
</catalog>
Different representations
Change Models: XUpdate
<xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate">
  <xupdate:insert-after select="/catalog[1]/product[2]/description[1]" >
    <xupdate:element name="price"> $299 </xupdate:element>
  </xupdate:insert-after>
  <xupdate:remove select="/catalog[1]/product[2]/status[1]" />
</xupdate:modifications>
• The select attributes are XPath expressions
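The effect of the two XUpdate operations can be checked with a small hand-written sketch. This is not an XUpdate processor: the helpers `resolve`, `insert_after` and `remove` are hypothetical and handle only simple positional paths of the form `/name[i]/name[j]`, which is all the example needs.

```python
import xml.etree.ElementTree as ET

OLD = """<catalog>
  <product><name>Notebook</name><description>2200MHz Pentium4</description><price>$1999</price></product>
  <product><name>Digital Camera</name><description>Fuji FinePix 2600Z</description><status>Not Available</status></product>
</catalog>"""


def resolve(root, path):
    """Resolve a path such as /catalog[1]/product[2]/status[1] to (parent, node)."""
    steps = [s for s in path.split("/") if s]
    name, index = steps[0][:-1].split("[")
    assert root.tag == name and index == "1"
    parent, node = None, root
    for step in steps[1:]:
        name, index = step[:-1].split("[")
        # Positions in the path are 1-based; Python lists are 0-based.
        parent, node = node, node.findall(name)[int(index) - 1]
    return parent, node


def insert_after(root, path, new_element):
    """Insert new_element as the next sibling of the node selected by path."""
    parent, node = resolve(root, path)
    parent.insert(list(parent).index(node) + 1, new_element)


def remove(root, path):
    """Remove the node selected by path from its parent."""
    parent, node = resolve(root, path)
    parent.remove(node)


doc = ET.fromstring(OLD)
price = ET.Element("price")
price.text = "$299"
insert_after(doc, "/catalog[1]/product[2]/description[1]", price)
remove(doc, "/catalog[1]/product[2]/status[1]")
print([child.tag for child in doc[1]])
```

After both operations the second product holds name, description and price, as in the new version of the example.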
Change Models: DeltaXML (Example)
<catalog deltaxml:delta='modified'>
  <product deltaxml:delta='unchanged' />
  <product deltaxml:delta='modified'>
    <status deltaxml:delta='deleted'> Not Available </status>
    <name deltaxml:delta='unchanged'/>
    <description deltaxml:delta='unchanged'/>
    <price deltaxml:delta='inserted'> $299 </price>
  </product>
</catalog>
• Same look and feel as the document
• Mentions some unchanged nodes
• The order is important (no ids, no move)
Change Models: XyDelta (Example)
<xydelta v1_XidMap="(1-30)" v2_XidMap="(1-14;18-23;31-33;24-30)">
  <delete xid="(15-17)" parent="6" position="1">
    <status>Not Available</status>
  </delete>
  <insert xid="(31-33)" parent="6" position="4">
    <price>$299</price>
  </insert>
</xydelta>
• Persistent identifiers
• What is the parent node?
Change Models: Microsoft XDL (Example)
<xd:xmldiff srcDocHash="fd452bab54320191" xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff">
  <xd:node match="1">
    <xd:node match="2">
      <xd:change match="3" name="price">
        <xd:change match="1"> $299 </xd:change>
      </xd:change>
    </xd:node>
  </xd:node>
</xd:xmldiff>
• srcDocHash: verify consistency
• xd:change: updates an element node
• match: identify nodes
Summary
• Unique advantages of XyDelta
  • A formal model and nice mathematical properties
  • Persistent identification of nodes (at least as an option)
• Still missing for all of them
  • A framework for querying
• Useful features that some models lack
  • Validation by a DTD (may be a problem for DeltaXML, XyDelta)
  • Verify the source document (only XDL)
  • Support of ‘move’ operations (only XyDelta and XDL)
  • Backward deltas (only XyDelta)
  • Monitoring the delta (only XUpdate and DeltaXML)
Storage Experiments: identifiers save space when there are few updates
Change Models: Conclusion
• Change monitoring is easier with DeltaXML and XUpdate
• Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
• Future work:
  • It is not yet clear how to query changes
  • Define transaction or synchronization protocols
State of the art
• Based on the String Edit Problem (1966)
• Tree-to-tree correction algorithms:
  • Find the minimum edit script
  • Run in O(m*n) time and space, where m and n are the sizes of the two documents
• Other algorithms
  • Run in linear or near-linear time
  • Match nodes or subtrees depending on their content
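As a reminder of where the O(m*n) bound comes from, here is a minimal sketch of the classic dynamic program for the string edit problem, assuming unit costs for insertion, deletion and substitution. It handles strings only; the tree-to-tree correction algorithms mentioned above build on the same idea but are more involved, and this is not the code of any tool compared here.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dist[i][j] is the edit distance between a[:i] and b[:j]:
    # an (m+1) x (n+1) table, hence O(m*n) time and space.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete the first i characters of a
    for j in range(n + 1):
        dist[0][j] = j  # insert the first j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))
```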
Experiments: Speed of several algorithms
Algorithms: Overview
From:
<root>
  <a> <x/><y/><z/> </a>
  <a> <x/><y/><z/> <u/><v/> </a>
</root>
To:
<root>
  <a> <u/><v/> <x/><y/><z/> </a>
  <a> <x/><y/><z/> </a>
</root>
• The cheapest choice would be to move <u> and <v> (cost=2), but finding the best script with ‘move’ operations is NP-hard
• The minimum edit script without moves consists of deleting <u> and <v> and then inserting them (cost=4) (MMDiff)
• Preprocessing often consists of mapping identical subtrees. In this case, an additional ‘move’ operation is needed (cost=5)
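The preprocessing step of mapping identical subtrees can be sketched as follows. This is an illustration under stated assumptions, not the method of any tool in the comparison: the helpers `signature` and `match_identical_children` are hypothetical, only the children of the two roots are compared, and a plain canonical string stands in for the hash a real tool would likely use.

```python
import xml.etree.ElementTree as ET

FROM = "<root><a><x/><y/><z/></a><a><x/><y/><z/><u/><v/></a></root>"
TO = "<root><a><u/><v/><x/><y/><z/></a><a><x/><y/><z/></a></root>"


def signature(node):
    """Canonical string for a subtree: tag, stripped text, then child signatures in order."""
    text = (node.text or "").strip()
    return "(" + node.tag + ":" + text + "".join(signature(c) for c in node) + ")"


def match_identical_children(old_root, new_root):
    """Greedily pair children of the two roots that have identical signatures.

    Returns a list of (old_index, new_index) pairs.
    """
    unused = {}
    for j, child in enumerate(new_root):
        unused.setdefault(signature(child), []).append(j)
    pairs = []
    for i, child in enumerate(old_root):
        candidates = unused.get(signature(child))
        if candidates:
            pairs.append((i, candidates.pop(0)))
    return pairs


pairs = match_identical_children(ET.fromstring(FROM), ET.fromstring(TO))
print(pairs)
```

On this example the first old <a> is paired with the second new <a>. Because the matched subtree changes position, the resulting script needs a ‘move’ on top of the deletions and insertions of <u> and <v>, which is how the cost rises from 4 to 5.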
Experiments: Quality (measured by the Edit Cost)
Experiments: Speed (focus on DeltaXML)
Comparison summary
• Many other algorithms exist that offer no advantage
• MMDiff is the reference for quality
• DeltaXML and XyDiff are good quality/performance compromises, but the performance of XyDiff is more regular
• Performance measurements for Microsoft's tool will be available soon; its performance seems comparable to that of DeltaXML
Other issues
• Constrained diff is often interesting:
  • Using ‘keys’ to match specific nodes (e.g. DeltaXML)
  • Using XMLSchema or DTD information
  • Time-constrained diff (e.g. XyDiff)
• Postprocessing of results?
What’s next?
• Representing Changes:
  • Unify and improve existing features
  • Support queries!
  • Chain versions?
• Change Detection:
  • We are currently working on Microsoft’s XML Diff
  • Use XMLSchema (or DTD) information
  • Mining changes? Use learning?