Explore methods to measure and compare syntactic similarities between XML documents, useful for efficient storage, indexing, and retrieval. Evaluate clustering effectiveness and propose future enhancements.
Finding Syntactic Similarities Between XML Documents • Davood Rafiei, University of Alberta • Joint work with Daniel Moise, University of Alberta, and Dabo Sun, University of Alberta
Motivations • Ranked retrievals e.g. query: book[author=‘Abiteboul’ and year=‘2000’] • DTD extraction • useful for query processing • Clustering • for efficient storage and indexing • for efficient retrievals (similar documents are expected to match the same queries more often)
Problem Statement • How to measure similarity (or distance) between XML documents • Desired properties • The distance must be a metric • Documents generated by the same DTD are expected to have less distance • Documents with more common tags are expected to be more similar • Interested in syntactic similarity only
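For reference, the metric requirement above amounts to the standard axioms below, stated for a distance d over documents x, y, z. How document equality is defined for the second axiom (for example, identical syntactic representations) is an assumption not spelled out on the slide.

```latex
\begin{align*}
d(x, y) &\ge 0 && \text{(non-negativity)} \\
d(x, y) &= 0 \iff x = y && \text{(identity)} \\
d(x, y) &= d(y, x) && \text{(symmetry)} \\
d(x, z) &\le d(x, y) + d(y, z) && \text{(triangle inequality)}
\end{align*}
```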
Examples • Similar documents • Non-similar documents <book><author>Abiteboul</author><year>2000</year></book> <book><author>John</author><year>1994</year></book> <book><author>George</author><title>Animal Farm</title></book> <book><author>Abiteboul</author><year>2000</year></book> <کتاب><نویسنده>Abiteboul</نویسنده><سال>2000</سال></کتاب> <KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA> <X><Y>John</Y><Z>20</Z></X>
Related Work • Structural Similarity • Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. [96]) • Edit distance between unordered trees: NP-Complete (Zhang et al. [22]) • Specialized Solutions (Flesca et al. [5], Zaki and Aggarwal [20])
Related Work (Cont.) • More Syntactic Similarity • Based on common parent-child tags (Lian et al. [10]); e.g. of non-similar documents: <paper><journal><author>A</author><title>T</title><year>2006</year></journal></paper> <paper><conference><author>A</author><title>T</title><year>2006</year></conference></paper> • Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])
Structural Sketch <user> <person><name>John</name></person> </user> <user> <person> <name>Mary</name> <id>u200</id> </person> </user> t d For every path in d, there is a path in t and vice versa and t is minimal.
Sketch Similarity • Problems of matching trees • Sketch tree is not unique <user> <person><name>John</name></person> </user> <user> <person> <name>Mary</name> <id>u200</id> </person> </user> t d
Path Sets • Root paths: user/person/name, user/person/id • Path set: user/person/name, user/person/id, user/person, person/name, person/id, user, person, name, id
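The construction above can be sketched as follows. This is a minimal illustration, assuming that a root path is a root-to-leaf sequence of tag names and that the path set contains every contiguous sub-path of every root path; the function names `root_paths` and `path_set` are hypothetical, not taken from the talk.

```python
import xml.etree.ElementTree as ET


def root_paths(xml_text):
    """Return the distinct root-to-leaf tag paths as tuples of tag names."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        children = list(elem)
        if not children:
            # A leaf element closes one root path.
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, ())
    return paths


def path_set(xml_text):
    """Return every contiguous sub-path of every root path, joined with '/'."""
    result = set()
    for path in root_paths(xml_text):
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                result.add("/".join(path[i:j]))
    return result


doc = "<user><person><name>Mary</name><id>u200</id></person></user>"
print(sorted(path_set(doc)))
```

For the document with both name and id, this yields the nine paths listed on the slide.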
Similar Path Sets • Standard set comparisons apply • E.g. Cosine, Jaccard, Dice • Path set size: at most n·l(l+1)/2 • for n root paths, each of length l (each root path contributes l(l+1)/2 contiguous sub-paths; shared sub-paths are counted once) • Fast similarity comparison • Cost: linear in the size of the path set
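A minimal sketch of the three set comparisons follows, assuming path sets are treated as plain (unweighted) sets, so that cosine reduces to its binary-vector form. Whether the original work weights paths is not stated on the slide, and the function names are illustrative.

```python
import math


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """2|A ∩ B| / (|A| + |B|); two empty sets are treated as identical."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a, b):
    """Cosine of the binary membership vectors: |A ∩ B| / sqrt(|A| |B|)."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Path sets of the two <user> example documents (name only vs. name and id).
t = {"user/person/name", "user/person", "person/name", "user", "person", "name"}
d = t | {"user/person/id", "person/id", "id"}

print(jaccard(t, d), dice(t, d), cosine(t, d))
```

With hash-based sets, each comparison takes time roughly proportional to the sizes of the two path sets, which matches the linear cost noted above.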
Evaluation • Effectiveness in clustering documents generated by the same DTD • Count the mis-clusterings • For result comparison • Used the same dataset and setting as some earlier work • Also used a larger dataset
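One way to count mis-clusterings is sketched below. The slide does not define the counting rule, so this assumes a majority-vote convention: each cluster is attributed to its most frequent DTD, and every other document in that cluster counts as mis-clustered. The function name and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def count_misclusterings(true_dtds, cluster_ids):
    """Count documents that disagree with the majority DTD of their cluster."""
    members = defaultdict(list)
    for dtd, cid in zip(true_dtds, cluster_ids):
        members[cid].append(dtd)

    errors = 0
    for dtds in members.values():
        # Size of the largest same-DTD group in this cluster.
        majority = Counter(dtds).most_common(1)[0][1]
        errors += len(dtds) - majority
    return errors


# Illustrative labels only; not data from the evaluation.
true_dtds = ["A", "A", "A", "B", "B", "C"]
cluster_ids = [0, 0, 1, 1, 1, 2]
print(count_misclusterings(true_dtds, cluster_ids))
```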
Real Data • XML files of ACM SIGMOD Record since March 1999 • Four DTDs (total of 989 XML files) • ProceedingsPage: 17 XML files • IndexTermsPage: 920 XML files • OrdinaryIssuePage: 51 XML files • SigmodRecord: 1 XML file
Synthetic Data • Generated using the IBM XML Generator • DTDs • Set A: the set used by Nierman and Jagadish • Set B: set A plus 5 more DTDs • Parameters • M: max repeat for + or * • P: probability of an optional attribute
Mis-Clusterings • Cosine was used for similarity measurements • Also tried Jaccard and Dice coefficients but the results weren’t better.
Comparison Our results Earlier results
Conclusions • Presented a method for clustering documents generated by the same DTD • Compared to tree-edit distance-based methods, our method is • more effective (based on our evaluations) • and also much more efficient
Future Work • Detecting documents with similar structures and related tag names, e.g. <book><author>Abiteboul</author><year>2000</year></book> and <KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA> • Possible solutions: • Allow users to specify relabeling rules • Learn relabeling rules from training data
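A user-specified relabeling rule set could be applied as a preprocessing step before any similarity computation. The sketch below assumes rules are a simple tag-to-tag mapping; the `RELABEL` table and the `relabel` function are hypothetical illustrations, not part of the presented method.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-supplied rules mapping Polish tag names to English ones.
RELABEL = {"KSIĄŻKA": "book", "autor": "author", "rok": "year"}


def relabel(xml_text, rules):
    """Rewrite tag names according to rules; unknown tags are left unchanged."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = rules.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


polish = "<KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA>"
print(relabel(polish, RELABEL))
```

After relabeling, the Polish document has the same tag structure as the English one, so a tag-based comparison would treat the two as identical in structure.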