Applying Syntactic Similarity Algorithms for Enterprise Information Management Lucy Cherkasova, Kave Eshghi, Brad Morrey, Joseph Tucek, Alistair Veitch Hewlett-Packard Labs
New Applications in the Enterprise • Document deletion and compliance rules • how do you identify all the users who might have a copy of these files? • E-Discovery • identify and retrieve a complete set of related documents (all earlier or later versions of the same document) • Simplify the review process: in the set of semantically similar documents (returned to the expert) identify clusters of syntactically similar documents • Keep document repositories with up-to-date information • to identify and filter out the documents that are largely duplicates of newer versions in order to improve the quality of the collection.
Syntactic Similarity • Syntactic similarity is useful to identify documents with a large textual intersection. • Syntactic similarity algorithms are entirely defined by the syntactic (text) properties of the document. • Shingling technique (Broder et al.) • Goal: to identify near-duplicates on the web • Document A is represented by the set of its shingles (sequences of adjacent words)
Shingling Technique • A document A is represented by S(A) = {w1, w2, …, wj, …, wN}, the set of all shingles in A. • Parameter: the shingle size (moving window). Traditionally, the shingle size is defined as a number of words. In our work, we define the shingle size (moving window) via the number of bytes.
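The byte-level moving window described above can be sketched in a few lines of Python (the window size of 20 bytes is just a placeholder default, not prescribed by this slide):

```python
def shingles(data: bytes, window: int = 20) -> set:
    """Return the set of all byte-level shingles (moving window) in a document."""
    if len(data) < window:
        # A document shorter than the window yields itself as the only shingle.
        return {data}
    return {data[i:i + window] for i in range(len(data) - window + 1)}
```

With `window=3`, the document `b"abcdef"` yields the four shingles `abc`, `bcd`, `cde`, `def`.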
Basic Metrics • Similarity metric (documents A and B are ~similar): sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| • Containment metric (document A is ~contained in B): cont(A, B) = |S(A) ∩ S(B)| / |S(A)|
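Both metrics are simple set operations over the shingle sets, as in this minimal sketch:

```python
def similarity(sa: set, sb: set) -> float:
    """sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| (Jaccard resemblance)."""
    return len(sa & sb) / len(sa | sb)

def containment(sa: set, sb: set) -> float:
    """cont(A, B) = |S(A) ∩ S(B)| / |S(A)| -- how much of A appears in B."""
    return len(sa & sb) / len(sa)
```

Note the asymmetry: containment of a small document in a large one can be 1.0 while their similarity stays low.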
Shingling-Based Approach • Instead of comparing shingles (sequences of words), it is more convenient to deal with fingerprints (hashes) of shingles. • 64-bit Rabin fingerprints are used due to their fast software implementation. • To further simplify the computation of the similarity metric, one can sample the document shingles to build a more compact document signature • i.e., instead of 1000 shingles, take a sample of 100 shingles. • Different ways of sampling the shingles lead to different syntactic similarity algorithms.
Four Algorithms • We will compare the performance and properties of four syntactic similarity algorithms: • three shingling-based algorithms (Minn, Modn, Sketchn) • a chunking-based algorithm (BSWn) • The three shingling-based algorithms (Minn, Modn, Sketchn) differ in how they sample the set of document shingles and build the document signature.
Minn Algorithm • Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles for document A. • Minn selects the n numerically smallest fingerprinted shingles. • Documents are represented by fixed-size signatures.
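A minimal sketch of Minn; hashlib's `blake2b` truncated to 8 bytes stands in for the 64-bit Rabin fingerprints (an assumption for illustration, not the paper's implementation):

```python
import hashlib

def fingerprint(shingle: bytes) -> int:
    # Stand-in for a 64-bit Rabin fingerprint (assumption for illustration).
    return int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")

def min_n_signature(shingle_set, n: int = 100):
    """Fixed-size signature: the n numerically smallest fingerprinted shingles."""
    return sorted(fingerprint(s) for s in shingle_set)[:n]
```

Because every document contributes exactly n fingerprints (or fewer, for tiny documents), storage per document is bounded regardless of document length.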
Modn Algorithm • Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles for A. • Modn selects all fingerprints whose value modulo n is zero. • Example: if n=100 and A=1000 bytes, then Mod100(A) is represented by approximately 10 fingerprints. • Documents are represented by variable-size signatures (proportional to the document size).
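Modn is one filter over the fingerprint set; as above, `blake2b` is only a stand-in for the Rabin fingerprints:

```python
import hashlib

def fingerprint(shingle: bytes) -> int:
    # Stand-in for a 64-bit Rabin fingerprint (assumption for illustration).
    return int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")

def mod_n_signature(shingle_set, n: int = 100):
    """Variable-size signature: keep fingerprints whose value modulo n is zero."""
    return {f for f in map(fingerprint, shingle_set) if f % n == 0}
```

On average one fingerprint in n survives, so the signature size tracks the document size, which is the trade-off against the fixed-size Minn signatures.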
Sketchn Algorithm • Each shingle is fingerprinted with a family of n independent hash functions f1, …, fn. • For each fi, the fingerprint with the smallest value, min{fi(w1), fi(w2), …, fi(wN)}, is retained in the sketch. • Documents are represented by fixed-size signatures: {min f1(A), min f2(A), …, min fn(A)}. • This algorithm has an elegant theoretical justification: the percentage of common entries in the sketches of A and B accurately approximates the percentage of common shingles in A and B.
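A sketch of Sketchn, using one salted `blake2b` per index as a stand-in for the family of independent hash functions (an assumption; the paper does not specify this construction):

```python
import hashlib

def sketch(shingle_set, n: int = 8):
    """Fixed-size sketch: the minimum of each hash function over all shingles."""
    def f(i: int, s: bytes) -> int:
        # The salt makes each f_i behave like an independent hash function (assumption).
        return int.from_bytes(
            hashlib.blake2b(s, digest_size=8, salt=str(i).encode()).digest(), "big")
    return [min(f(i, s) for s in shingle_set) for i in range(n)]

def resemblance(sketch_a, sketch_b) -> float:
    """Fraction of matching sketch entries estimates the fraction of common shingles."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)
```

Identical shingle sets always produce identical sketches, and the per-entry match rate is the min-hash estimator of the resemblance.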
BSWn (Basic Sliding Window) Algorithm • The document is partitioned into chunks: a chunk boundary is set at shingle wk whenever f(wk) mod n = 0. • Each chunk is represented by the smallest fingerprint within the chunk: min{f(w1), f(w2), …, f(wk)}. • Documents are represented by variable-size signatures (the signature is proportional to the document size). • Example: if n=100 and A=1000 bytes, then BSW100(A) is represented by approximately 10 fingerprints.
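The content-defined chunking above can be sketched as follows (again with `blake2b` standing in for the Rabin fingerprints; handling of the trailing, unterminated chunk is an assumption):

```python
import hashlib

def fp(window: bytes) -> int:
    # Stand-in for the 64-bit Rabin fingerprint (assumption for illustration).
    return int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")

def bsw_signature(data: bytes, window: int = 20, n: int = 100):
    """Variable-size signature: one minimum fingerprint per content-defined chunk."""
    sig, chunk_fps = [], []
    for i in range(len(data) - window + 1):
        f = fp(data[i:i + window])
        chunk_fps.append(f)
        if f % n == 0:              # chunk boundary condition
            sig.append(min(chunk_fps))
            chunk_fps = []
    if chunk_fps:                   # close the trailing chunk (assumed policy)
        sig.append(min(chunk_fps))
    return sig
```

Because boundaries depend only on local content, an edit in one part of the document shifts chunk boundaries only locally, which is what makes chunking attractive for deduplication.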
Algorithm’s Properties and Parameters • Algorithm’s properties: • Algorithm’s parameters: • sliding window size • sampling frequency • Published papers use very different values • Questions: • sensitivity of the similarity metric to different values of the algorithms’ parameters • comparison of the four algorithms
Objective and Fair Comparison • How can we objectively compare the algorithms? • While one document collection might favor a particular algorithm, another collection might show better results for a different algorithm. • Can we design a framework for fair comparison? • Can the same framework be used for sensitivity analysis of the parameters?
Methodology • Research corpus RCorig: 100 different HPLabs TRs from 2007 converted to a text format • Introduce modifications to the documents in a controlled way: • add/remove words to/from the document a predefined number of times • modifications can be done in a random fashion or uniformly spread through the document • RCia = {RCorig, where the word “a” is inserted into each document i times} • New average similarity metric: the similarity between each original document and its modified counterpart, averaged over the whole collection.
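The random-insertion step of building RCia can be sketched like this (the seeded RNG and word-boundary insertion points are assumptions made so the modification is reproducible):

```python
import random

def insert_word(text: str, word: str = "a", times: int = 50, seed: int = 0) -> str:
    """Insert `word` at `times` random word-boundary positions in `text`."""
    rng = random.Random(seed)            # seeded for reproducible corpora (assumption)
    words = text.split()
    for _ in range(times):
        words.insert(rng.randrange(len(words) + 1), word)
    return " ".join(words)
```

Running each similarity algorithm over (original, modified) pairs and averaging the per-pair scores then yields the average similarity metric used in the sensitivity experiments.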
Sensitivity to Sliding Window Size • Window = 20 bytes is a good choice (~4 words). • A larger window size significantly decreases the similarity metric.
Frequency Sampling (RCa50) • There is a large variance in similarity metric values for different documents under smaller sampling frequencies. • The sampling frequency parameter depends on the document length distribution and should be tuned accordingly. • There is a trade-off between accuracy and storage requirements.
Comparison of Similarity Algorithms • Sketchn and BSWn are more sensitive to the number of changes in the documents (especially short ones) than Modn and Minn.
Case study using Enterprise Collections • Two enterprise collections: • Collection_1 with 5040 documents; • Collection_2 with 2500 documents.
Results • The Modn and Minn algorithms identified a higher number of similar documents (with Modn being the leader). • However, Modn has a higher number of false positives. • For longer documents, the difference between the algorithms is smaller. • Moreover, for long documents (larger than 100 KB), BSWn and related chunking-based algorithms might be a better choice (accuracy- and storage-wise).
Runtime Comparison • Executing Sketchn is more expensive, especially for larger window sizes.
Conclusion • Syntactic similarity is useful to identify documents with a large textual intersection. • We designed a framework for fair algorithm comparison: • compared the performance of four syntactic similarity algorithms, and • identified a useful range of their parameters • Future work: modify, refine, and optimize the BSW algorithm • Chunking-based algorithms are actively used for deduplication in enterprise backup and storage solutions.
Sensitivity to Sliding Window Size • Potentially, the Modn algorithm might have a higher rate of false positives.