(A comparative study for XML change detection)

Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)



  1. A comparative study for XML change detection (original title: Etude comparative sur la détection de changements en XML). Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)

  2. Context
     • Consider change control in XML data warehouses: we want to understand changes
     • We have only the old and new versions of the documents
     • A diff needs to be computed

  3. Organization
     • Motivations
     • Data Model
     • Representing Changes
        • Version Management and Querying
        • Comparison of change representation models
        • Experiments
     • Detecting Changes
        • State of the art in change detection
        • Performance analysis and experiments
        • Quality analysis and experiments
     • Summary

  4. Motivations

  5. Motivations: Representing Changes
     • Version management: the representation should allow for effective storage strategies
     • Temporal databases: support for persistent identification of nodes is mandatory
     • Monitoring: information about changes is used to support triggers or detect events
     • Note: HTML or XHTML documents may be used

  6. Motivations: Detecting Changes
     • Correctness: the diff program must miss no changes
     • Minimality: a small result saves storage space and network bandwidth
     • Semantics: some algorithms take more of the semantics of XML documents into account
     • Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
     • ‘Move’ operations: some algorithms support move operations whereas others don’t. This affects both the performance of the tool and the quality of the results.

  7. Data Model

  8. Data Model (quick overview)
     • Operations are:
        • (i) insert and delete, applied to leaves or subtrees
        • (ii) update of text nodes
        • (iii) move, applied to a subtree root and moving the entire subtree
     • An edit cost is assigned to each operation. Usually, the cost is 1 per node touched
     • The semantics of move is to identify subtrees even when their context has changed
     • We use the notion of a mapping between the two trees. Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).
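The cost model above can be sketched in a few lines. This is only an illustration: the `Node` class and the function names are hypothetical, and charging a flat cost of 1 for a move regardless of subtree size is an assumption (it is consistent with the cost=2 and cost=4 figures quoted on slide 24, but the slides do not state it explicitly).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node used only for this sketch."""
    label: str
    children: list["Node"] = field(default_factory=list)

    def size(self) -> int:
        """Number of nodes in the subtree rooted at this node."""
        return 1 + sum(child.size() for child in self.children)


def op_cost(kind: str, subtree: Node) -> int:
    """Cost of one operation: 1 per node touched."""
    if kind in ("insert", "delete"):
        return subtree.size()  # every node of the subtree is touched
    if kind in ("update", "move"):
        return 1  # assumption: only the text node / subtree root is touched
    raise ValueError(f"unknown operation: {kind}")


def script_cost(script: list[tuple[str, Node]]) -> int:
    """Total cost of an edit script given as (operation, subtree) pairs."""
    return sum(op_cost(kind, subtree) for kind, subtree in script)
```

With two leaf nodes `u` and `v`, a script of two moves costs 2, while deleting and re-inserting both costs 4.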

  9. Data Model: Intuition
     [Figure: the same tree drawn twice, with root above a, b, c and with x, y on the lowest level. Left panel: Tai’s model, delete ‘b’. Right panel: Selkow’s model, delete ‘b’.]
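The difference between the two deletion semantics can be sketched as follows. The sketch assumes that x and y are the children of b in the figure (the transcript does not preserve the drawing), and it represents a tree as a hypothetical `(label, children)` tuple. In Tai's model, deleting a node attaches its children to its parent; in Selkow's model, deletions apply to leaves, so removing an inner node removes its whole subtree.

```python
# A tree is a (label, [children]) tuple; this encoding is an assumption
# made for illustration only.

def delete_tai(tree, label):
    """Tai's model: the deleted node's children move up to its parent."""
    name, kids = tree
    new_kids = []
    for kid in kids:
        if kid[0] == label:
            # splice the grandchildren in place of the deleted node
            new_kids.extend(delete_tai(k, label) for k in kid[1])
        else:
            new_kids.append(delete_tai(kid, label))
    return (name, new_kids)


def delete_selkow(tree, label):
    """Selkow's model: the whole subtree rooted at the node disappears."""
    name, kids = tree
    return (name, [delete_selkow(k, label) for k in kids if k[0] != label])


def labels(tree):
    """Labels of the direct children, for quick inspection."""
    return [kid[0] for kid in tree[1]]


EXAMPLE = ("root", [("a", []), ("b", [("x", []), ("y", [])]), ("c", [])])
```

On `EXAMPLE`, `labels(delete_tai(EXAMPLE, "b"))` keeps x and y between a and c, whereas `labels(delete_selkow(EXAMPLE, "b"))` leaves only a and c.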

  10. Representing Changes

  11. Representing Changes
     • Version Management
        • There are several version management strategies. For instance, when only deltas are stored, their size must be kept small
        • We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases.
        • A simple text-based version management is possible but cannot be used for querying.
     • Querying Changes
        • Labeling nodes with prefix+postfix identifiers improves querying algorithms
        • Labeling nodes with persistent identifiers improves temporal databases
        • There is no short labeling scheme that is good for both
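One common reading of prefix+postfix identifiers is a pair of preorder and postorder ranks, which turns the ancestor test into two integer comparisons. The sketch below assumes that reading; the function names are hypothetical and the labeling is not taken from any of the tools discussed.

```python
import xml.etree.ElementTree as ET


def label_nodes(root):
    """Map each element to its (preorder rank, postorder rank)."""
    labels = {}
    pre = post = 0

    def visit(node):
        nonlocal pre, post
        pre += 1
        my_pre = pre
        for child in node:
            visit(child)
        post += 1
        labels[node] = (my_pre, post)

    visit(root)
    return labels


def is_ancestor(labels, a, b):
    """a is a proper ancestor of b iff it starts earlier and ends later."""
    return labels[a][0] < labels[b][0] and labels[a][1] > labels[b][1]
```

Such ranks shift whenever a node is inserted or deleted, so they are not persistent across versions, which is one way to read the last bullet of the slide.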

  12. Our Example
     Old version:
       <catalog>
         <product>
           <name>Notebook</name>
           <description> 2200MHz Pentium4 </description>
           <price>$1999</price>
         </product>
         <product>
           <name>Digital Camera</name>
           <description> Fuji FinePix 2600Z </description>
           <status> Not Available </status>
         </product>
       </catalog>
     New version:
       <catalog>
         <product>
           <name>Notebook</name>
           <description> 2200MHz Pentium4 </description>
           <price>$1999</price>
         </product>
         <product>
           <name>Digital Camera</name>
           <description> Fuji FinePix 2600Z </description>
           <price> $299 </price>
         </product>
       </catalog>

  13. Different representations

  14. Change Models: XUpdate
       <xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate">
         <xupdate:insert-after select="/catalog[1]/product[2]/description[1]">
           <xupdate:element name="price"> $299 </xupdate:element>
         </xupdate:insert-after>
         <xupdate:remove select="/catalog[1]/product[2]/status[1]" />
       </xupdate:modifications>
     • The select attribute holds an XPath expression
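To make the effect of this delta concrete, the sketch below applies its two operations by hand to the old version of the example, using Python's standard `xml.etree.ElementTree`. It is not an XUpdate processor: the function name is hypothetical, the two operations are hard-coded, and whitespace inside text nodes is dropped for brevity.

```python
import xml.etree.ElementTree as ET

# Old version of the example catalog, written without indentation.
OLD = (
    "<catalog>"
    "<product><name>Notebook</name>"
    "<description>2200MHz Pentium4</description>"
    "<price>$1999</price></product>"
    "<product><name>Digital Camera</name>"
    "<description>Fuji FinePix 2600Z</description>"
    "<status>Not Available</status></product>"
    "</catalog>"
)


def apply_example_delta(old_xml: str) -> ET.Element:
    """Hand-apply the insert-after and remove operations of the slide."""
    catalog = ET.fromstring(old_xml)  # plays the role of /catalog[1]
    product = catalog.find("product[2]")

    # insert-after select="/catalog[1]/product[2]/description[1]"
    description = product.find("description[1]")
    price = ET.Element("price")
    price.text = "$299"
    product.insert(list(product).index(description) + 1, price)

    # remove select="/catalog[1]/product[2]/status[1]"
    product.remove(product.find("status[1]"))
    return catalog
```

After the call, the second product contains name, description and price, matching the new version of the example.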

  15. Change Models: DeltaXML (Example)
     • Same look and feel as the document
     • Mentions some unchanged nodes
       <catalog delta='modified'>
         <product deltaxml:delta='unchanged' />
         <product deltaxml:delta='modified'>
           <status deltaxml:delta='deleted'> Not Available </status>
           <name deltaxml:delta='unchanged'/>
           <description deltaxml:delta='unchanged'/>
           <price deltaxml:delta='inserted'> $299 </price>
         </product>
       </catalog>
     • The order is important (no ids, no move)

  16. Change Models: XyDelta (Example)
     • Persistent identifiers
       <xydelta v1_XidMap="(1-30)" v2_XidMap="(1-14;18-23;31-33;24-30)">
         <delete xid="(15-17)" parent="6" position="1">
           <status>Not Available</status>
         </delete>
         <insert xid="(31-33)" parent="6" position="4">
           <price>$299</price>
         </insert>
       </xydelta>
     • What is the parent node?
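The two XidMap attributes can be read as lists of persistent identifiers, one per node of each version. The sketch below assumes the syntax is a parenthesised, semicolon-separated list of inclusive ranges, as it appears on the slide; the helper name is hypothetical and this is not the XyDelta parser.

```python
def expand_xid_map(xid_map: str) -> list[int]:
    """Expand e.g. "(1-14;18-23)" into the list of identifiers it denotes."""
    ids = []
    for part in xid_map.strip("()").split(";"):
        low, _, high = part.partition("-")
        # a single identifier such as "7" is treated as the range 7-7
        ids.extend(range(int(low), int(high or low) + 1))
    return ids


v1 = expand_xid_map("(1-30)")
v2 = expand_xid_map("(1-14;18-23;31-33;24-30)")

deleted = sorted(set(v1) - set(v2))   # identifiers that disappear
inserted = sorted(set(v2) - set(v1))  # identifiers that are new
```

Under this reading, the identifiers that disappear are 15 to 17 and the new ones are 31 to 33, which agrees with the `xid` attributes of the delete and insert operations, while every other node keeps its identifier across versions.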

  17. Change Models: Microsoft XDL (Example)
       <xd:xmldiff srcDocHash="fd452bab54320191" xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff">
         <xd:node match="1">
           <xd:node match="2">
             <xd:change match="3" name="price">
               <xd:change match="1"> $299 </xd:change>
             </xd:change>
           </xd:node>
         </xd:node>
       </xd:xmldiff>
     • Verify consistency (srcDocHash)
     • Updates an element node (xd:change)
     • Identify nodes (match)

  18. Summary
     • Unique advantages of XyDelta
        • A formal model and nice mathematical properties
        • Persistent identification of nodes (at least as an option)
     • Still missing from all of them
        • A framework for querying
     • Useful features that some models lack
        • Validation by a DTD (may be a problem for DeltaXML, XyDelta)
        • Verification of the source document (only XDL)
        • Support for ‘move’ operations (only XyDelta and XDL)
        • Backward deltas (only XyDelta)
        • Monitoring the delta (only XUpdate and DeltaXML)

  19. Storage Experiments: identifiers save space when there are few updates

  20. Change Models: Conclusion
     • Change monitoring is easier with DeltaXML and XUpdate
     • Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
     • Future work:
        • It is not yet clear how to query changes
        • Define transaction or synchronization protocols

  21. Detecting Changes

  22. State of the art
     • Based on the String Edit Problem (1966)
     • Tree-to-tree correction algorithms:
        • find the minimum edit script
        • in O(m*n) time and space, where m and n are the sizes of the two documents
     • Other algorithms
        • Run in linear or near-linear time
        • Match nodes or subtrees depending on their content
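The O(m*n) bound is easiest to see on the string edit problem itself. The sketch below is the textbook dynamic program for unit-cost insert, delete and substitute on sequences; it is not one of the tree algorithms compared in the talk, only the problem they generalise.

```python
def edit_distance(a, b):
    """Minimum number of unit-cost inserts, deletes and substitutions
    turning sequence a into sequence b, in O(m*n) time and space."""
    m, n = len(a), len(b)
    # dist[i][j] = distance between the first i items of a and first j of b
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete everything
    for j in range(n + 1):
        dist[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete a[i-1]
                dist[i][j - 1] + 1,                 # insert b[j-1]
                dist[i - 1][j - 1] + substitution,  # keep or substitute
            )
    return dist[m][n]
```

The table has (m+1)*(n+1) cells and each is filled in constant time, which is where the quadratic time and space come from.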

  23. Experiments: Speed of several algorithms

  24. Algorithms: Overview
     From:
       <root>
         <a> <x/><y/><z/> </a>
         <a> <x/><y/><z/> <u/><v/> </a>
       </root>
     To:
       <root>
         <a> <u/><v/> <x/><y/><z/> </a>
         <a> <x/><y/><z/> </a>
       </root>
     • The cheapest choice would be to move <u> and <v> (cost=2), but finding the best script with ‘move’ operations is NP-hard
     • The minimum edit script without ‘move’ (MMDiff) consists of deleting <u> and <v> and then inserting them (cost=4)
     • Preprocessing often consists of mapping identical subtrees. In this case, an additional ‘move’ operation is needed (cost=5)
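The preprocessing step mentioned in the last point can be sketched with a simple subtree signature. This is an illustration under strong assumptions: the signature only looks at tag names (the example has no text or attributes), leaves are skipped, matching is greedy in document order, and the function names are hypothetical rather than taken from any of the tools.

```python
import xml.etree.ElementTree as ET

FROM = "<root><a><x/><y/><z/></a><a><x/><y/><z/><u/><v/></a></root>"
TO = "<root><a><u/><v/><x/><y/><z/></a><a><x/><y/><z/></a></root>"


def signature(node):
    """Canonical string for a subtree: tag names in document order."""
    return "(" + node.tag + "".join(signature(c) for c in node) + ")"


def match_identical_subtrees(old_root, new_root):
    """Pair each non-leaf old subtree with an identical new subtree."""
    by_signature = {}
    for node in new_root.iter():
        if len(node):  # skip leaves to keep the sketch small
            by_signature.setdefault(signature(node), []).append(node)
    matches = []
    for node in old_root.iter():
        candidates = by_signature.get(signature(node)) if len(node) else None
        if candidates:
            matches.append((node, candidates.pop(0)))
    return matches


old_root, new_root = ET.fromstring(FROM), ET.fromstring(TO)
matches = match_identical_subtrees(old_root, new_root)
```

On the slide's example, the only match pairs the first `<a>` of the old document with the second `<a>` of the new one, so the two `<a>` elements are mapped crosswise. That is the situation in which the slide counts an extra ‘move’.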

  25. Experiments: Quality (measured by the edit cost)

  26. Experiments: Speed (focus on DeltaXML)

  27. Comparison summary
     • Many other algorithms offer no advantages
     • MMDiff is the reference for quality
     • DeltaXML and XyDiff are good compromises between quality and performance, but XyDiff’s performance is more regular
     • Performance measurements for Microsoft’s tool will be available soon; it seems comparable in performance to DeltaXML

  28. Other issues
     • Constrained diff is often interesting:
        • Using ‘keys’ to match specific nodes (e.g. DeltaXML)
        • Using XMLSchema or DTD information
        • Time-constrained diff (e.g. XyDiff)
     • Postprocessing of results?

  29. Summary

  30. What’s next?
     • Representing Changes:
        • Unify and improve existing features
        • Support queries!
        • Chain versions?
     • Change Detection:
        • We are currently working on Microsoft’s XML Diff
        • Use XMLSchema (or DTD) information
        • Mining changes? Use learning?

  31. Thank you
