Applying Syntactic Similarity Algorithms for Enterprise Information Management Lucy Cherkasova, Kave Eshghi, Brad Morrey, Joseph Tucek, Alistair Veitch Hewlett-Packard Labs
New Applications in the Enterprise • Document deletion and compliance rules • how do you identify all the users who might have a copy of these files? • E-Discovery • identify and retrieve a complete set of related documents (all earlier or later versions of the same document) • Simplify the review process: in the set of semantically similar documents (returned to the expert) identify clusters of syntactically similar documents • Keep document repositories with up-to-date information • to identify and filter out the documents that are largely duplicates of newer versions in order to improve the quality of the collection.
Syntactic Similarity • Syntactic similarity is useful to identify documents with a large textual intersection. • Syntactic similarity algorithms are entirely defined by the syntactic (text) properties of the document. • Shingling technique (Broder et al.) • Goal: to identify near-duplicates on the web • Document A is represented by the set of its shingles (sequences of adjacent words)
Shingling Technique • A document A is represented by S(A) = {w1, w2, …, wj, …, wN}, the set of all shingles in A. • Parameter: the shingle size (moving window). Traditionally, the shingle size is defined as a number of words. In our work, we define the shingle size (moving window) via the number of bytes.
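The byte-level moving window described above can be sketched in a few lines of Python (the window size of 20 bytes is just a placeholder default, not prescribed by this slide):

```python
def shingles(data: bytes, window: int = 20) -> set:
    """Return the set of all byte-level shingles (moving window) in a document."""
    if len(data) < window:
        # A document shorter than the window yields itself as the only shingle.
        return {data}
    return {data[i:i + window] for i in range(len(data) - window + 1)}
```

With `window=3`, the document `b"abcdef"` yields the four shingles `abc`, `bcd`, `cde`, `def`.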
Basic Metrics • Similarity metric (documents A and B are ~similar): sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| • Containment metric (document A is ~contained in B): cont(A, B) = |S(A) ∩ S(B)| / |S(A)|
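Both metrics are simple set operations over the shingle sets, as in this minimal sketch:

```python
def similarity(sa: set, sb: set) -> float:
    """sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| (Jaccard resemblance)."""
    return len(sa & sb) / len(sa | sb)

def containment(sa: set, sb: set) -> float:
    """cont(A, B) = |S(A) ∩ S(B)| / |S(A)| -- how much of A appears in B."""
    return len(sa & sb) / len(sa)
```

Note the asymmetry: containment of a small document in a large one can be 1.0 while their similarity stays low.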
Shingling-Based Approach • Instead of comparing shingles (sequences of words), it is more convenient to deal with fingerprints (hashes) of shingles. • 64-bit Rabin fingerprints are used due to their fast software implementation. • To further simplify the computation of the similarity metric, one can sample the document shingles to build a more compact document signature • i.e., instead of 1000 shingles, take a sample of 100 shingles. • Different ways of sampling the shingles lead to different syntactic similarity algorithms.
Four Algorithms • We will compare the performance and properties of four syntactic similarity algorithms: • three shingling-based algorithms (Minn, Modn, Sketchn) • a chunking-based algorithm (BSWn) • The three shingling-based algorithms (Minn, Modn, Sketchn) differ in how they sample the set of document shingles and build the document signature.
Minn Algorithm • Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles for document A. • Minn selects the n numerically smallest fingerprinted shingles. • Documents are represented by fixed-size signatures.
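A minimal sketch of Minn; hashlib's `blake2b` truncated to 8 bytes stands in for the 64-bit Rabin fingerprints (an assumption for illustration, not the paper's implementation):

```python
import hashlib

def fingerprint(shingle: bytes) -> int:
    # Stand-in for a 64-bit Rabin fingerprint (assumption for illustration).
    return int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")

def min_n_signature(shingle_set, n: int = 100):
    """Fixed-size signature: the n numerically smallest fingerprinted shingles."""
    return sorted(fingerprint(s) for s in shingle_set)[:n]
```

Because every document contributes exactly n fingerprints (or fewer, for tiny documents), storage per document is bounded regardless of document length.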
Modn Algorithm • Let S(A) = {f(w1), f(w2), …, f(wN)} be all fingerprinted shingles for A. • Modn selects all fingerprints whose value modulo n is zero. • Example: if n=100 and A=1000 bytes, then Mod100(A) is represented by approximately 10 fingerprints. • Documents are represented by variable-size signatures (proportional to the document size).
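Modn is one filter over the fingerprint set; as above, `blake2b` is only a stand-in for the Rabin fingerprints:

```python
import hashlib

def fingerprint(shingle: bytes) -> int:
    # Stand-in for a 64-bit Rabin fingerprint (assumption for illustration).
    return int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")

def mod_n_signature(shingle_set, n: int = 100):
    """Variable-size signature: keep fingerprints whose value modulo n is zero."""
    return {f for f in map(fingerprint, shingle_set) if f % n == 0}
```

On average one fingerprint in n survives, so the signature size tracks the document size, which is the trade-off against the fixed-size Minn signatures.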
Sketchn Algorithm • Each shingle is fingerprinted with a family of n independent hash functions f1, …, fn. • For each fi, the fingerprint with the smallest value, min{fi(w1), fi(w2), …, fi(wN)}, is retained in the sketch. • Documents are represented by fixed-size signatures: {min f1(A), min f2(A), …, min fn(A)}. • This algorithm has an elegant theoretical justification: the percentage of common entries in the sketches of A and B accurately approximates the percentage of common shingles in A and B.
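A sketch of Sketchn, using one salted `blake2b` per index as a stand-in for the family of independent hash functions (an assumption; the paper does not specify this construction):

```python
import hashlib

def sketch(shingle_set, n: int = 8):
    """Fixed-size sketch: the minimum of each hash function over all shingles."""
    def f(i: int, s: bytes) -> int:
        # The salt makes each f_i behave like an independent hash function (assumption).
        return int.from_bytes(
            hashlib.blake2b(s, digest_size=8, salt=str(i).encode()).digest(), "big")
    return [min(f(i, s) for s in shingle_set) for i in range(n)]

def resemblance(sketch_a, sketch_b) -> float:
    """Fraction of matching sketch entries estimates the fraction of common shingles."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)
```

Identical shingle sets always produce identical sketches, and the per-entry match rate is the min-hash estimator of the resemblance.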
BSWn (Basic Sliding Window) Algorithm • The document is partitioned into chunks: a chunk boundary is set at shingle wk whenever f(wk) mod n = 0. • Each chunk is represented by the smallest fingerprint within the chunk: min{f(w1), f(w2), …, f(wk)}. • Documents are represented by variable-size signatures (the signature is proportional to the document size). • Example: if n=100 and A=1000 bytes, then BSW100(A) is represented by approximately 10 fingerprints.
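The content-defined chunking above can be sketched as follows (again with `blake2b` standing in for the Rabin fingerprints; handling of the trailing, unterminated chunk is an assumption):

```python
import hashlib

def fp(window: bytes) -> int:
    # Stand-in for the 64-bit Rabin fingerprint (assumption for illustration).
    return int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")

def bsw_signature(data: bytes, window: int = 20, n: int = 100):
    """Variable-size signature: one minimum fingerprint per content-defined chunk."""
    sig, chunk_fps = [], []
    for i in range(len(data) - window + 1):
        f = fp(data[i:i + window])
        chunk_fps.append(f)
        if f % n == 0:              # chunk boundary condition
            sig.append(min(chunk_fps))
            chunk_fps = []
    if chunk_fps:                   # close the trailing chunk (assumed policy)
        sig.append(min(chunk_fps))
    return sig
```

Because boundaries depend only on local content, an edit in one part of the document shifts chunk boundaries only locally, which is what makes chunking attractive for deduplication.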
Algorithm’s Properties and Parameters • Algorithm’s properties: • Algorithm’s parameters: • sliding window size • sampling frequency • Published papers use very different values • Questions: • sensitivity of the similarity metric to different values of the algorithms’ parameters • comparison of the four algorithms
Objective and Fair Comparison • How can we objectively compare the algorithms? • While one document collection might favor a particular algorithm, another collection might show better results for a different algorithm. • Can we design a framework for fair comparison? • Can the same framework be used for sensitivity analysis of the parameters?
Methodology • Research corpus RCorig: 100 different HPLabs TRs from 2007 converted to a text format • Introduce modifications to the documents in a controlled way: • add/remove words to/from the document a predefined number of times • modifications can be done in a random fashion or uniformly spread through the document • RCia = {RCorig, where the word “a” is inserted into each document i times} • New average similarity metric: the similarity between each original document and its modified counterpart, averaged over the whole collection.
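The random-insertion step of building RCia can be sketched like this (the seeded RNG and word-boundary insertion points are assumptions made so the modification is reproducible):

```python
import random

def insert_word(text: str, word: str = "a", times: int = 50, seed: int = 0) -> str:
    """Insert `word` at `times` random word-boundary positions in `text`."""
    rng = random.Random(seed)            # seeded for reproducible corpora (assumption)
    words = text.split()
    for _ in range(times):
        words.insert(rng.randrange(len(words) + 1), word)
    return " ".join(words)
```

Running each similarity algorithm over (original, modified) pairs and averaging the per-pair scores then yields the average similarity metric used in the sensitivity experiments.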
Sensitivity to Sliding Window Size • Window = 20 bytes is a good choice (~4 words). • A larger window size significantly decreases the similarity metric.
Frequency Sampling (RCa50) • There is a large variance in similarity metric values for different documents under smaller sampling frequencies. • The sampling frequency parameter depends on the document length distribution and should be tuned accordingly. • There is a trade-off between accuracy and storage requirements.
Comparison of Similarity Algorithms • Sketchn and BSWn are more sensitive to the number of changes in the documents (especially short ones) than Modn and Minn.
Case study using Enterprise Collections • Two enterprise collections: • Collection_1 with 5040 documents; • Collection_2 with 2500 documents.
Results • The Modn and Minn algorithms identified a higher number of similar documents (with Modn being the leader). • However, Modn has a higher number of false positives. • For longer documents, the difference between the algorithms is smaller. • Moreover, for long documents (larger than 100 KB), BSWn and related chunking-based algorithms might be a better choice (accuracy- and storage-wise).
Runtime Comparison • Executing Sketchn is more expensive, especially for larger window sizes.
Conclusion • Syntactic similarity is useful to identify documents with a large textual intersection. • We designed a framework for fair algorithm comparison: • compared the performance of four syntactic similarity algorithms, and • identified a useful range of their parameters • Future work: modify, refine, and optimize the BSW algorithm • Chunking-based algorithms are actively used for deduplication in enterprise backup and storage solutions.
Sensitivity to Sliding Window Size • Potentially, the Modn algorithm might have a higher rate of false positives.