Finding Syntactic Similarities Between XML Documents

Davood Rafiei, University of Alberta. Joint work with Daniel Moise, University of Alberta, and Dabo Sun, University of Alberta.

Motivations • Ranked retrievals e.g. query: book[author=‘Abiteboul’ and year=‘2000’] • DTD extraction • useful for query processing • Clustering • for efficient storage and indexing • for efficient retrievals (similar documents are expected to match the same queries more often)

Problem Statement • How to measure similarity (or distance) between XML documents • Desired properties • The distance must be a metric • Documents generated by the same DTD are expected to have a smaller distance • Documents with more common tags are expected to be more similar • Interested in syntactic similarity only

Examples
• Similar documents
<book><author>Abiteboul</author><year>2000</year></book>
<book><author>John</author><year>1994</year></book>
<book><author>George</author><title>Animal Farm</title></book>
• Non-similar documents
<book><author>Abiteboul</author><year>2000</year></book>
<کتاب><نویسنده>Abiteboul</نویسنده><سال>2000</سال></کتاب>
<KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA>
<X><Y>John</Y><Z>20</Z></X>

Related Work • Structural Similarity • Edit distance between ordered trees (Nierman and Jagadish [11], Zhang et al. [21, 23], Chawathe et al. [96]) • Edit distance between unordered trees: NP-Complete (Zhang et al. [22]) • Specialized Solutions (Flesca et al. [5], Zaki and Aggarwal [20])

Related Work (Cont.) • More Syntactic Similarity • Based on common parent-child tags (Lian et al. [10]); e.g. of non-similar documents:
<paper><journal><author>A</author><title>T</title><year>2006</year></journal></paper>
<paper><conference><author>A</author><title>T</title><year>2006</year></conference></paper>
• Use parent-child tags, twigs, content terms, semantic relationships (Theobald et al. [18])
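The parent-child comparison can be illustrated with a minimal sketch. It is a simplified reading of the idea, not the algorithm of Lian et al.; the helper name `parent_child_pairs` is hypothetical. It collects the (parent tag, child tag) pairs of each document and shows that the two example documents share none.

```python
import xml.etree.ElementTree as ET

def parent_child_pairs(xml_text):
    """Return the set of (parent tag, child tag) pairs in a document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    return {(parent.tag, child.tag) for parent in root.iter() for child in parent}

journal = ("<paper><journal><author>A</author><title>T</title>"
           "<year>2006</year></journal></paper>")
conference = ("<paper><conference><author>A</author><title>T</title>"
              "<year>2006</year></conference></paper>")

# No pair is shared, because every edge involves either 'journal' or 'conference'.
common = parent_child_pairs(journal) & parent_child_pairs(conference)
print(len(common))
```

Every edge in the first document touches `journal` and every edge in the second touches `conference`, so the intersection is empty even though both documents share the tags `paper`, `author`, `title` and `year`.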

Structural Sketch • A sketch tree t of a document d: for every path in d there is a path in t and vice versa, and t is minimal. • Example:
<user> <person><name>John</name></person> </user>
<user> <person> <name>Mary</name> <id>u200</id> </person> </user>

Sketch Similarity • Problems of matching trees • Sketch tree is not unique • Example:
<user> <person><name>John</name></person> </user>
<user> <person> <name>Mary</name> <id>u200</id> </person> </user>

Path Sets • Root paths: user/person/name, user/person/id • Path set: user/person/name, user/person/id, user/person, person/name, person/id, user, person, name, id
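The construction on this slide can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: it assumes a root path is a root-to-leaf sequence of tags and that the path set holds every contiguous sub-path of the root paths. The function names `root_paths` and `path_set` are hypothetical.

```python
import xml.etree.ElementTree as ET

def root_paths(xml_text):
    """Return the distinct root-to-leaf tag paths, each as a tuple of tags."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        path = prefix + (elem.tag,)
        children = list(elem)
        if not children:          # leaf element: the path is complete
            paths.add(path)
        for child in children:
            walk(child, path)

    walk(root, ())
    return paths

def path_set(paths):
    """Return every contiguous sub-path of the given root paths, joined with '/'."""
    result = set()
    for p in paths:
        for i in range(len(p)):
            for j in range(i + 1, len(p) + 1):
                result.add("/".join(p[i:j]))
    return result

doc = "<user><person><name>Mary</name><id>u200</id></person></user>"
mary_paths = path_set(root_paths(doc))
print(len(mary_paths))  # 9 distinct paths, matching the slide
```

Text content is ignored, so only tag names contribute, which matches the stated interest in syntactic similarity only.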

Similar Path Sets • Standard set comparisons apply • E.g. Cosine, Jaccard, Dice • Path set size: at most n·l(l+1)/2 • for n root paths, each of length l • Fast similarity comparison • Cost: linear in the size of the path set
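The three coefficients named above can be written directly on Python sets. This is a minimal sketch that assumes binary membership (a path is either in the set or not); whether the original work weights paths by frequency is not stated here. The two example path sets are written out by hand for the John and Mary documents.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dice(a, b):
    """2|A ∩ B| / (|A| + |B|)"""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|), the cosine of two 0/1 membership vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Path sets of <user><person><name>John</name></person></user> and of
# <user><person><name>Mary</name><id>u200</id></person></user>
john = {"user/person/name", "user/person", "person/name", "user", "person", "name"}
mary = john | {"user/person/id", "person/id", "id"}

print(jaccard(john, mary), dice(john, mary), cosine(john, mary))
```

Each function needs only one intersection (and at most one union) over hashed sets, which is where the linear cost comes from. For two root paths of length 3, n·l(l+1)/2 gives 12, while the set for the larger document holds 9 distinct paths because shared sub-paths are stored once.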

Evaluation • Effectiveness in clustering documents generated by the same DTD • Count the mis-clusterings • For result comparison • Used the same dataset and setting as some earlier work • Also used a larger dataset
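One way to count mis-clusterings is sketched below. The slides do not define the count precisely, so this assumes a document is mis-clustered when its DTD differs from the majority DTD of its cluster. The function name `count_misclusterings` and the sample data are hypothetical.

```python
from collections import Counter

def count_misclusterings(clusters, dtd_of):
    """Count documents whose DTD is not the majority DTD of their cluster.

    clusters: list of lists of document ids
    dtd_of:   dict mapping document id -> name of the DTD that generated it
    (Assumed definition; the original evaluation may count differently.)
    """
    mis = 0
    for members in clusters:
        if not members:
            continue
        counts = Counter(dtd_of[doc] for doc in members)
        majority_size = counts.most_common(1)[0][1]
        mis += len(members) - majority_size
    return mis

# Made-up example: one document from DTD "B" lands in a cluster dominated by "A".
clusters = [["a1", "a2", "b1"], ["b2", "b3"]]
dtd_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B"}
print(count_misclusterings(clusters, dtd_of))
```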

Real Data • XML files of ACM SIGMOD Record since March 1999 • Four DTDs (total of 989 XML files) • ProceedingsPage: 17 XML files • IndexTermsPage: 920 XML files • OrdinaryIssuePage: 51 XML files • SigmodRecord: 1 XML file

Synthetic Data • Generated using the IBM XML Generator • DTDs • Set A: the set used by Nierman and Jagadish • Set B: set A plus 5 more DTDs • Parameters • M: max repeat for + or * • P: probability of an optional attribute

Example Clusters

Mis-Clusterings • Cosine was used for similarity measurements • Also tried Jaccard and Dice coefficients but the results weren’t better.

Comparison • Our results • Earlier results

Tag Frequency

Conclusions • Presented a method for clustering documents generated by the same DTD • Compared to tree-edit distance-based methods, our method is • more effective (based on our evaluations) • and also much more efficient

Future Work • Detecting documents with similar structures and related tag names, e.g.
<book><author>Abiteboul</author><year>2000</year></book>
<KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA>
• Possible solutions: • allow users to specify relabeling rules • learn relabeling rules from training data
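User-specified relabeling rules could be as simple as a tag-to-tag mapping applied before paths are extracted. The sketch below is only one possible reading of this proposal; the `RULES` dictionary and the `tag_paths` helper are hypothetical, and the path set is assumed to contain every contiguous sub-path of the root-to-leaf tag paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-supplied relabeling rules (Polish tag -> English tag).
RULES = {"KSIĄŻKA": "book", "autor": "author", "rok": "year"}

def tag_paths(xml_text, rules=None):
    """Return the path set of a document after relabeling its tags."""
    rules = rules or {}
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        path = prefix + (rules.get(elem.tag, elem.tag),)
        # Every suffix of a root-to-node path is a contiguous sub-path.
        for i in range(len(path)):
            paths.add("/".join(path[i:]))
        for child in elem:
            walk(child, path)

    walk(root, ())
    return paths

english = "<book><author>Abiteboul</author><year>2000</year></book>"
polish = "<KSIĄŻKA><autor>Abiteboul</autor><rok>2000</rok></KSIĄŻKA>"

print(tag_paths(english) & tag_paths(polish))          # no shared paths
print(tag_paths(english) == tag_paths(polish, RULES))  # identical after relabeling
```

Without rules the two documents share no paths; with the mapping applied they produce the same path set, so any set-based similarity would treat them as identical.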
