Measuring the Structural Similarity of Semistructured Documents Using Entropy
Sven Helmer, University of London, Birkbeck
VLDB’07, September 23-28, 2007, Vienna, Austria
2008. 2. 5
Summarized and Presented by Seungseok Kang, IDS Lab., Seoul National University
Outline
• Introduction
• Measuring Entropy
  • Information Distance
• Measuring Structural Similarity
  • Extract structural information
  • Ziv-Lempel Encoding
  • Ziv-Merhav Crossparsing
• Competitors
  • Tree-editing distance
  • Discrete Fourier Transformation
  • Path shingles
• Experiments
  • Clustering Quality
  • Document Collections
  • Results
• Conclusion
Center for E-Business Technology
Introduction
• Semistructured documents
  • HTML, XHTML, XML, …
• In traditional IR, similarity detection is widely used
  • For querying
  • For clustering
• Lots of similarity measures exist for text documents
• New challenges with semistructured documents
  • Measuring structural similarity
  • Semistructured documents show great structural diversity
  • Measuring structural similarity can help integrate heterogeneous data sources
Measuring Entropy
• Bennett et al.
  • Introduced the concept of a universal information metric
  • Based on Kolmogorov complexity
• Given a data object x, the Kolmogorov complexity K(x) is the length of the shortest program that outputs x
• Conditional Kolmogorov complexity K(x|y)
  • Generalized form of Kolmogorov complexity
  • Length of the shortest program that outputs x when given y as input
Information Distance
• Similarity of two data objects can be measured by the normalized information distance
  • Based on conditional Kolmogorov complexity
• Has some nice properties
  • Almost a metric
  • Lower bound for admissible distances
• Unfortunately, Kolmogorov complexity is not computable in general
  • Can be approximated by compression
  • C(x) : length of compressed x
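As a rough illustration of the compression-based approximation, the sketch below computes the normalized compression distance, NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, which is the usual computable stand-in for the normalized information distance. It assumes zlib as the compressor C; the function names `compressed_len` and `ncd` are illustrative and do not come from the slides.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """C(x): length in bytes of the compressed input (zlib is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"book title author year " * 100
    b = b"order item price quantity " * 100
    print("NCD(a, a) =", ncd(a, a))  # small: a says nothing new about itself
    print("NCD(a, b) =", ncd(a, b))  # larger: little shared structure
```

Values near 0 indicate that one input is almost fully predictable from the other; with a real compressor the result can slightly exceed 1 and is never exactly 0.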
Measuring Structural Similarity
• Just compressing XML files does not get the job done
• Extract structural information
  • Tags
    • List element/attribute names in document order
  • Pairwise
    • Like tags, but with names of parents
  • Path
    • Like tags, but with full path to root
  • Family order
    • Family-order traversal of document
• Except for path, all extractions can be done in linear time
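The first three extraction variants can be sketched as follows. This is a minimal illustration built on Python's `xml.etree.ElementTree`. The output conventions (a `@` prefix for attribute names, `parent/name` for pairs, a leading `/` for paths) are assumptions made for the example, and the family-order variant is left out because the slides do not define the traversal.

```python
import xml.etree.ElementTree as ET


def extract_structure(xml_text):
    """Return (tags, pairwise, paths) as lists in document order."""
    tags, pairwise, paths = [], [], []

    def record(name, ancestors):
        parent = ancestors[-1] if ancestors else ""
        tags.append(name)                                  # "tags" variant
        pairwise.append(parent + "/" + name)               # "pairwise" variant
        paths.append("/" + "/".join(ancestors + [name]))   # "path" variant

    def visit(node, ancestors):
        record(node.tag, ancestors)
        below = ancestors + [node.tag]
        for attr in node.attrib:        # attribute names, marked with "@"
            record("@" + attr, below)
        for child in node:
            visit(child, below)

    visit(ET.fromstring(xml_text), [])
    return tags, pairwise, paths


if __name__ == "__main__":
    for seq in extract_structure("<a><b x='1'/><c><b/></c></a>"):
        print(seq)
```

In this sketch the path variant writes out the whole ancestor chain for every node, so its output grows with document depth, whereas the tag and pairwise variants emit a constant amount per node.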
Measuring Structural Similarity (cont.)
• After extracting structural information
  • Come up with a similarity measure
  • NCD with GZIP (Ziv-Lempel encoding)
  • Ziv-Merhav crossparsing
  • Can be done in linear time
• We can calculate the actual value for the NCD by encoding or crossparsing
• Crossparsing is an alternative way to measure relative entropy between sequences in files
  • Instead of compressing
Measuring Structural Similarity (cont.)
• Ziv-Lempel Encoding
  • Build a dictionary for encoding by utilizing substrings that have already appeared in the previously encoded part of the document
  • <distance, length, symbol>
  • ex) String to encode: “aababbccccd” → <0,0,a>, <1,1,b>, <2,2,b>, <0,0,c>, <1,3,d>
• Ziv-Merhav Crossparsing
  • Repeatedly find the longest prefix of the remaining part of string x that appears as a substring in y, until we have parsed the whole document x
  • ex) x = “abbbbaaabba”, y = “baababaabba”: c(x|y) = 3, the three phrases being “abb”, “bba”, “aabba”
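Both examples on this slide can be reproduced with the short sketch below. It is a naive illustration, not the linear-time procedure: matches are found by brute-force search, ties between equally long matches go to the smallest distance, and a symbol of x that never occurs in y becomes a one-symbol phrase. The last two choices are assumptions that the slide does not specify. The function names are illustrative.

```python
def lz77_triples(s):
    """Encode s as <distance, length, symbol> triples (naive search)."""
    out, i, n = [], 0, len(s)
    while i < n:
        best_len, best_dist = 0, 0
        for start in range(i - 1, -1, -1):       # nearest candidates first
            length = 0
            # Stop one symbol early so a literal "next symbol" always remains.
            while i + length < n - 1 and s[start + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        out.append((best_dist, best_len, s[i + best_len]))
        i += best_len + 1
    return out


def cross_parse(x, y):
    """Split x into the longest phrases that occur as substrings of y."""
    phrases, i = [], 0
    while i < len(x):
        length = 0
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        length = max(length, 1)   # assumption: unseen symbol = own phrase
        phrases.append(x[i:i + length])
        i += length
    return phrases


if __name__ == "__main__":
    print(lz77_triples("aababbccccd"))
    print(cross_parse("abbbbaaabba", "baababaabba"))
```

The number of phrases returned by `cross_parse` is the quantity c(x|y) from the slide. Fewer phrases mean that x is easier to describe in terms of y.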
Competitors
• Tree-editing distance
  • Nierman and Jagadish
• Discrete Fourier Transformation
  • Flesca et al.
• Path shingles
  • Buttler
Tree-editing distance
• Measuring the minimum editing distance between strings
  • Levenshtein distance
• Five different edit operations
  • Relabel
  • Insert
  • Delete
  • Insert Tree
  • Delete Tree
• Quadratic runtime
  • O(|d1||d2|)
  • |dn| : size of document dn
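To make the quadratic cost concrete, here is a sketch of the string Levenshtein distance that the slide mentions. It is only the string analogue (insert, delete, relabel on single symbols), not the tree algorithm of Nierman and Jagadish with its subtree operations. It does show where the O(|d1||d2|) table comes from.

```python
def levenshtein(s, t):
    """Minimum number of single-symbol inserts, deletes and relabels
    needed to turn s into t, using one row of the DP table at a time."""
    prev = list(range(len(t) + 1))        # distance from "" to each prefix of t
    for i, a in enumerate(s, 1):
        cur = [i]                         # distance from s[:i] to ""
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,              # delete a
                cur[j - 1] + 1,           # insert b
                prev[j - 1] + (a != b),   # relabel (free if symbols match)
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Every cell of the |s| by |t| table is filled exactly once, which gives the quadratic runtime.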
Discrete Fourier Transformation
• Called the DFT technique
• The features of an XML document relevant to its structure are described via a numerical encoding
• Numerical values are interpreted as a time series
  • Rotate document by 90°, interpret indentations as time series
• Use the DFT to compute similarity
  • Tag encoding
  • Document encoding
  • Document similarity
• Runtime
  • O(N log N)
  • where N = max(|d1|, |d2|)
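The general idea can be sketched as follows, under strong simplifying assumptions. The element depth (the "indentation") is used directly as the signal, the shorter series is zero-padded, and the distance is the Euclidean distance between the magnitude spectra. This is not the tag and document encoding of Flesca et al., and the plain DFT here is quadratic. An FFT routine such as `numpy.fft.fft` would be needed for O(N log N).

```python
import cmath
import xml.etree.ElementTree as ET


def depth_series(xml_text):
    """Element depths in document order (root = 1), read as a time series."""
    series = []

    def visit(node, depth):
        series.append(float(depth))
        for child in node:
            visit(child, depth + 1)

    visit(ET.fromstring(xml_text), 1)
    return series


def dft_magnitudes(series, n):
    """Magnitudes of the n-point DFT of the zero-padded series (naive O(n^2))."""
    padded = list(series) + [0.0] * (n - len(series))
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(padded)))
        for k in range(n)
    ]


def dft_distance(a, b):
    """Euclidean distance between the magnitude spectra of two series."""
    n = max(len(a), len(b))
    fa, fb = dft_magnitudes(a, n), dft_magnitudes(b, n)
    return sum((p - q) ** 2 for p, q in zip(fa, fb)) ** 0.5


if __name__ == "__main__":
    d1 = depth_series("<a><b/><c><b/></c></a>")
    d2 = depth_series("<a><b/><b/></a>")
    print(d1, d2, dft_distance(d1, d2))
```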
Path Shingles
• Extract structural information using the full path variant
• Compute a hash value hj for each path
• A shingle of width w is the combination of w consecutive hash values
• Compute similarity between two documents using the Dice coefficient on the two sets of shingles
• Runtime is not linear
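The steps above can be sketched as below. The hash function (`zlib.crc32`), the default width `w=2`, and the function names are assumptions chosen for illustration. The slide does not say which hash or width was used.

```python
import zlib


def shingles(paths, w=2):
    """Set of width-w shingles over the hashed path sequence."""
    hashes = [zlib.crc32(p.encode("utf-8")) for p in paths]   # h_j per path
    return {tuple(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two shingle sets."""
    if not a and not b:
        return 1.0        # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    d1 = ["/a", "/a/b", "/a/c", "/a/c/b"]
    d2 = ["/a", "/a/b", "/a/d", "/a/d/b"]
    print(dice(shingles(d1), shingles(d2)))
```

In this example each document yields three shingles and exactly one is shared, so the coefficient is 2·1 / (3 + 3) = 1/3.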
Clustering quality
• Measure quality of similarity measure by clustering
• Hierarchical agglomerative clustering
• Quality expressed in number of misclusterings in dendrogram
Document Collections
• Use three different document collections for the experiments
• Real data sets
  • SIGMOD Record, INEX 2005, music sheets encoded in XML
• Synthetically generated data sets from the DFT paper
• Own synthetically generated data sets, varying
  • Element names
  • Element frequencies
  • Element positions
  • Element depths
Overall Results
• Number of misclusterings for different methods
Detailed results
• Different methods have different strengths and weaknesses
  • Tree-edit
    • Generally good, has problems with largely varying document sizes
  • DFT
    • Good at frequencies, bad at element names, position, and depth
  • GZIP/Ziv-Merhav
    • Bad at frequencies, good at element names, position, and depth
• DFT and GZIP/Ziv-Merhav are complementary to each other
  • Performance of a hybrid becomes even better
  • Hybrid approach does not have linear runtime
Runtime
• Shallow documents
  • Crossparsing/pairwise, shingle path algorithm
• Deep documents
  • Crossparsing/pairwise
Conclusion
• Determining the similarity of documents
  • Based on measuring entropy
• Encoding/crossparsing approach is totally different from previous approaches
  • Can be done in linear time
  • Important for large document collections
• Future work
  • More sophisticated ways of encoding document structure
  • Different entropy measures better suited for structural information