Finding Syntactic Similarities Between XML Documents

Explore methods to measure and compare syntactic similarities between XML documents, useful for efficient storage, indexing, and retrieval. Evaluate clustering effectiveness and propose future enhancements.

Presentation Transcript


  1. Finding Syntactic Similarities Between XML Documents • Davood Rafiei, University of Alberta • Joint work with Daniel Moise and Dabo Sun, University of Alberta

  2. Motivations • Ranked retrievals e.g. query: book[author=‘Abiteboul’ and year=‘2000’] • DTD extraction • useful for query processing • Clustering • for efficient storage and indexing • for efficient retrievals (similar documents are expected to match the same queries more often)

  3. Problem Statement • How to measure similarity (or distance) between XML documents • Desired properties • The distance must be a metric • Documents generated by the same DTD are expected to have less distance • Documents with more common tags are expected to be more similar • Interested in syntactic similarity only

  4. Examples • Similar documents • Non-similar documents <book><author>Abiteboul</author><year>2000</year></book> <book><author>John</author><year>1994</year></book> <book><author>George</author><title>Animal Farm</title></book> <book><author>Abiteboul</author><year>2000</year></book> <کتاب><نویسنده>Abiteboul</نویسنده><سال>2000</سال></کتاب> <KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA> <X><Y>John</Y><Z>20</Z></X>

  5. Related Work • Structural Similarity • Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. [96]) • Edit distance between unordered trees: NP-Complete (Zhang et al. [22]) • Specialized Solutions (Flesca et al. [5], Zaki and Aggarwal [20])

  6. Related Work (Cont.) • More Syntactic Similarity • Based on common parent-child tags (Lian et al. [10]); e.g. of non-similar documents: <paper><journal><author>A</author><title>T</title><year>2006</year></journal></paper> <paper><conference><author>A</author><title>T</title><year>2006</year></conference></paper> • Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])
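
One reading of the journal/conference example above is that a measure built only on shared parent-child tag pairs sees the two documents as having nothing in common. The sketch below illustrates that reading. It is not the method of Lian et al.; the helper name `parent_child_pairs` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def parent_child_pairs(xml_text):
    """Return the set of (parent tag, child tag) pairs in a document."""
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its direct children.
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}

journal = ("<paper><journal><author>A</author><title>T</title>"
           "<year>2006</year></journal></paper>")
conference = ("<paper><conference><author>A</author><title>T</title>"
              "<year>2006</year></conference></paper>")

a = parent_child_pairs(journal)
b = parent_child_pairs(conference)
print(sorted(a & b))  # prints [] because no parent-child pair is shared
```

Both documents contain author, title and year, but every pair is anchored at either journal or conference, so the intersection is empty.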

  7. Structural Sketch <user> <person><name>John</name></person> </user> <user> <person> <name>Mary</name> <id>u200</id> </person> </user> • Figure labels: t (sketch tree), d (document tree) • For every path in d, there is a path in t and vice versa, and t is minimal.
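
A minimal sketch of one way to build such a tree t, assuming "path" means a root-to-node sequence of tag names and that element order is ignored: merge sibling elements that share a tag. The function names and the single-document example (two `person` elements under one `user`) are assumptions for illustration, not taken from the slides.

```python
import xml.etree.ElementTree as ET

def merge_into(target, source):
    """Merge one nested {tag: children} dict into another, in place."""
    for tag, sub in source.items():
        merge_into(target.setdefault(tag, {}), sub)

def sketch(element):
    """Return the merged children of an element as a nested dict."""
    merged = {}
    for child in element:
        # Same-tag siblings collapse into a single node of the sketch.
        merge_into(merged.setdefault(child.tag, {}), sketch(child))
    return merged

def sketch_tree(xml_text):
    root = ET.fromstring(xml_text)
    return {root.tag: sketch(root)}

doc = ("<user><person><name>John</name></person>"
       "<person><name>Mary</name><id>u200</id></person></user>")
print(sketch_tree(doc))
```

Under these assumptions every tag path of the document appears exactly once in the result, and no path appears that the document lacks.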

  8. Sketch Similarity • Problems of matching trees • Sketch tree is not unique <user> <person><name>John</name></person> </user> <user> <person> <name>Mary</name> <id>u200</id> </person> </user> • Figure labels: t (sketch tree), d (document tree)

  9. Path Sets • Root paths: user/person/name, user/person/id • Path set: user/person/name, user/person/id, user/person, person/name, person/id, user, person, name, id
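
The path set listed above can be reproduced with a short sketch, assuming a root path is a root-to-leaf sequence of tag names and the path set holds every contiguous sub-path of every root path. The function names `root_paths` and `path_set` are illustrative, not from the slides.

```python
import xml.etree.ElementTree as ET

def root_paths(xml_text):
    """Return the set of root-to-leaf tag paths, each as a tuple of tags."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            paths.add(path)  # a leaf element ends a root path
        for child in children:
            stack.append((child, path + (child.tag,)))
    return paths

def path_set(xml_text):
    """Return every contiguous sub-path of every root path, joined by '/'."""
    result = set()
    for path in root_paths(xml_text):
        for start in range(len(path)):
            for end in range(start + 1, len(path) + 1):
                result.add("/".join(path[start:end]))
    return result

doc = "<user><person><name>Mary</name><id>u200</id></person></user>"
print(sorted(path_set(doc)))
```

For this document the result has nine entries, matching the slide: the two root paths share the sub-paths user, person and user/person, so duplicates collapse.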

  10. Similar Path Sets • Standard set comparisons apply • E.g. Cosine, Jaccard, Dice • Path set size: at most n·l(l+1)/2 • for n root paths, each of length l • Fast similarity comparison • Cost: linear in the size of the path set
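
The three measures named above can be sketched over plain Python sets. This assumes unweighted (binary) path sets; the slides do not say whether paths are weighted, so the cosine here is the set-based form |A∩B| / sqrt(|A|·|B|). The two example sets are the path sets of the John and Mary documents from the earlier slides, written out by hand.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dice(a, b):
    """2|A ∩ B| / (|A| + |B|); two empty sets count as identical."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|) for binary membership vectors."""
    if not a or not b:
        return 1.0 if not a and not b else 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

john = {"user", "person", "name", "user/person", "person/name",
        "user/person/name"}
mary = john | {"id", "person/id", "user/person/id"}

print(round(jaccard(john, mary), 3))  # 0.667
print(round(dice(john, mary), 3))     # 0.8
print(round(cosine(john, mary), 3))   # 0.816
```

Each measure needs only one set intersection plus the set sizes, which is consistent with a cost linear in the size of the path sets when they are stored as hash sets.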

  11. Evaluation • Effectiveness in clustering documents generated by the same DTD • Count the mis-clusterings • For result comparison • Used the same dataset and setting as some earlier work • Also used a larger dataset
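
The slides do not define how mis-clusterings are counted. One plausible convention, shown here purely as an assumption, is to treat the most frequent DTD in each cluster as that cluster's label and count every document whose DTD differs from it. The function name and the toy data are illustrative.

```python
from collections import Counter, defaultdict

def count_misclusterings(cluster_ids, dtd_labels):
    """Count documents whose DTD is not the majority DTD of their cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, dtd_labels):
        by_cluster[cluster_id].append(label)
    errors = 0
    for labels in by_cluster.values():
        majority_size = Counter(labels).most_common(1)[0][1]
        errors += len(labels) - majority_size
    return errors

# Toy data: six documents, two clusters, two DTDs (A and B).
clusters = [0, 0, 0, 1, 1, 1]
dtds = ["A", "A", "B", "B", "B", "B"]
print(count_misclusterings(clusters, dtds))  # 1
```

Other conventions exist, for example also penalising a DTD that is split across several clusters, and they would give different counts.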

  12. Real Data • XML files of ACM SIGMOD Record since March 1999 • Four DTDs (total of 989 XML files) • ProceedingsPage: 17 XML files • IndexTermsPage: 920 XML files • OrdinaryIssuePage: 51 XML files • SigmodRecord: 1 XML file

  13. Synthetic Data • Generated using the IBM XML Generator • DTDs • Set A: the set used by Nierman and Jagadish • Set B: set A plus 5 more DTDs • Parameters • M: max repeat for + or * • P: probability of an optional attribute

  14. Example Clusters

  15. Mis-Clusterings • Cosine was used for similarity measurements • Also tried Jaccard and Dice coefficients but the results weren’t better.

  16. Comparison Our results Earlier results

  17. Tag Frequency

  18. Conclusions • Presented a method for clustering documents generated by the same DTD • Compared to tree-edit distance-based methods, our method is • more effective (based on our evaluations) • and also much more efficient

  19. Future Work • Detecting documents with similar structures and related tag names, e.g. <book><author>Abiteboul</author><year>2000</year></book> and <KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA> • Possible solutions: • allow users to specify relabeling rules • learn relabeling rules from training data
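
The first proposed solution, user-specified relabeling rules, could look roughly like the sketch below. It is only an illustration of the idea, not something the slides implement: the rule format (a plain tag-to-tag dictionary) and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

def relabeled_root_paths(xml_text, rules):
    """Return root-to-leaf tag paths after renaming tags via `rules`."""
    paths = set()

    def walk(node, prefix):
        # Tags without a rule keep their original name.
        path = prefix + (rules.get(node.tag, node.tag),)
        children = list(node)
        if not children:
            paths.add("/".join(path))
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths

# Hypothetical user-supplied rules mapping Polish tags to English ones.
rules = {"KSIĄŻKA": "book", "autor": "author", "rok": "year"}

english = "<book><author>Abiteboul</author><year>2000</year></book>"
polish = "<KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA>"

print(relabeled_root_paths(english, rules) == relabeled_root_paths(polish, rules))
```

After relabeling, both documents yield the same root paths, so any path-set comparison would treat them as structurally identical.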
