Document sketching

  1. Document sketching
     • Problem: identify duplicate or near-duplicate documents in a collection
     • How to measure the similarity between documents?
     • A reasonable (?) candidate: edit distance
     • But edit distance is computationally expensive
     • Another measure: resemblance, due to [Broder '97]

  2. Resemblance of documents [Broder '97]
     • $r(A, B)$: resemblance between documents $A$ and $B$
     • $0 \le r(A, B) \le 1$. Similar means $r(A, B)$ close to $1$
     • Convert documents to sets of integers
     • A contiguous sequence of $w$ words contained in document $D$ is called a $w$-shingle
     • Example: $D$ = (a rose is a rose is a rose)
     • The 4-shingles of $D$ are: (a rose is a), (rose is a rose), (is a rose is), (a rose is a), (rose is a rose)
     • The set of 4-shingles of $D$: {(a rose is a), (rose is a rose), (is a rose is)}
     • Map shingles to integers in $\{1, \dots, n\}$ (for some fixed $n$)
     • From now on, identify the documents with sets of integers in $\{1, \dots, n\}$
     • Thus a document $D$ is represented as a set $S_D \subseteq \{1, \dots, n\}$
     • $r(A, B) = |S_A \cap S_B| \,/\, |S_A \cup S_B|$ (also known as the Jaccard similarity between the sets $S_A$ and $S_B$)
     • Thus $r(A, A) = 1$, but $r(A, B) = 1$ does not mean $A = B$
     • In practice, $r(A, B)$ is a reasonable approximation of the informal notion of similarity of $A$ and $B$
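The shingling and resemblance definitions on this slide can be sketched in a few lines of Python. This is only an illustration under some assumptions: words are split on whitespace, shingles are kept as tuples of words rather than mapped to integers, and the function names `shingles` and `resemblance` are made up for this example.

```python
def shingles(text, w):
    """Return the set of w-shingles (contiguous runs of w words) of text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a rose is a rose is a rose"  # a different document

set_a = shingles(doc_a, 4)
set_b = shingles(doc_b, 4)

# Both documents have the same three 4-shingles, so their resemblance is 1
# even though the documents themselves differ.
r_ab = resemblance(set_a, set_b)
```

The pair `doc_a`, `doc_b` illustrates the remark that a resemblance of 1 does not imply that the documents are equal.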

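  3. Estimating resemblance
     • Given: $A, B \subseteq \{1, \dots, n\}$
     • Estimate: $r(A, B) = |A \cap B| \,/\, |A \cup B|$
     • Exact computation of $r(A, B)$ requires time at least linear in $|A| + |B|$
     • A basic estimator for $r(A, B)$:
     • $S_n$: the set of permutations of $\{1, \dots, n\}$
     • Choose a uniformly random $\pi \in S_n$
     • Then $\Pr[\min \pi(A) = \min \pi(B)] = r(A, B)$, so the indicator of the event $\min \pi(A) = \min \pi(B)$ is an unbiased estimator of $r(A, B)$
     • Variance too high: the estimator only takes the values 0 and 1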
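The claim that a uniformly random permutation makes the two minima collide with probability exactly equal to the resemblance can be checked by brute force on a tiny universe. The sketch below enumerates all permutations; it assumes the universe is {0, ..., n-1} rather than {1, ..., n}, and `collision_probability` is a name invented for this illustration.

```python
from fractions import Fraction
from itertools import permutations


def collision_probability(a, b, n):
    """Exact probability, over all permutations pi of {0..n-1},
    that min pi(a) == min pi(b)."""
    hits = total = 0
    for pi in permutations(range(n)):
        total += 1
        if min(pi[x] for x in a) == min(pi[x] for x in b):
            hits += 1
    return Fraction(hits, total)


A = {0, 1, 2, 3}
B = {2, 3, 4}
n = 6  # element 5 belongs to neither set

exact_r = Fraction(len(A & B), len(A | B))  # 2/5
prob = collision_probability(A, B, n)       # enumerates 6! = 720 permutations
```

The enumeration is only feasible for very small n; it is meant as a sanity check, not as a way to estimate resemblance.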

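  4. Reducing variance
     First method
     • Sample $k$ independent random permutations $\pi_1, \dots, \pi_k$
     • The sketch of document $A$ is $(\min \pi_1(A), \dots, \min \pi_k(A))$
     • Resemblance can be estimated as $\frac{1}{k} \, |\{ i : \min \pi_i(A) = \min \pi_i(B) \}|$ (this is an unbiased estimator; the proof follows from the previous slide)
     Second method
     • Let $\min_s(W)$ denote the set of the $s$ smallest elements of $W$; if $|W| < s$, then $\min_s(W) = W$
     • Write $M(A) = \min_s(\pi(A))$. For a constant $s$ and a uniformly random $\pi$, the ratio $|\min_s(M(A) \cup M(B)) \cap M(A) \cap M(B)| \,/\, |\min_s(M(A) \cup M(B))|$ is an unbiased estimator of $r(A, B)$ (details on the board)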
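A minimal sketch of the first method, assuming documents are already sets of integers in {0, ..., n-1} and that permutations are stored explicitly as shuffled lists. The function names, the seed, and the example sets are illustrative choices, not part of the slides.

```python
import random


def minhash_sketch(doc, perms):
    """One minimum per permutation: (min pi_1(doc), ..., min pi_k(doc))."""
    return [min(pi[x] for x in doc) for pi in perms]


def estimate(sk_a, sk_b):
    """Fraction of permutations on which the two minima agree."""
    matches = sum(x == y for x, y in zip(sk_a, sk_b))
    return matches / len(sk_a)


n, k = 1000, 200
rng = random.Random(0)
perms = []
for _ in range(k):
    pi = list(range(n))
    rng.shuffle(pi)          # a uniformly random permutation of {0..n-1}
    perms.append(pi)

A = set(range(0, 400))
B = set(range(100, 500))     # true resemblance: 300 / 500 = 0.6

r_hat = estimate(minhash_sketch(A, perms), minhash_sketch(B, perms))
```

Each of the k comparisons is an independent 0/1 estimate with mean r(A, B), so averaging them reduces the variance by a factor of k.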

  5. We can estimate $r(A, B)$ within a given multiplicative error with a sufficiently large sketch ($k$ permutations in the first method, $s$ elements in the second)
     • The second method above gives us a way of sketching the documents: fix a permutation $\pi$ and a constant $s$. For a document $A$, its sketch is $\min_s(\pi(A))$
     • Now given the sketches of documents $A_1, \dots, A_m$, all computed with the same permutation $\pi$, we can estimate the resemblance of any pair $A_i, A_j$
     • The sketch of a document takes space $O(s)$, and estimating resemblance takes time $O(s)$
     • (We can also do it with the first method, but we will need to store $k$ permutations)
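The sketching scheme on this slide (one shared permutation, the s smallest images per document) might look as follows. This is a sketch under assumptions: documents are nonempty sets of integers in {0, ..., n-1}, the permutation is stored as a list, and the names `bottom_s_sketch` and `estimate` are invented here.

```python
import random
from itertools import combinations


def bottom_s_sketch(doc, pi, s):
    """The s smallest values of pi over doc (all of them if |doc| < s)."""
    return set(sorted(pi[x] for x in doc)[:s])


def estimate(sk_a, sk_b, s):
    """|min_s(sk_a | sk_b) & sk_a & sk_b| / |min_s(sk_a | sk_b)|."""
    smallest = set(sorted(sk_a | sk_b)[:s])
    return len(smallest & sk_a & sk_b) / len(smallest)


n, s = 1000, 50
rng = random.Random(0)
pi = list(range(n))
rng.shuffle(pi)                   # one permutation shared by all documents

docs = {
    "d1": set(range(0, 400)),
    "d2": set(range(100, 500)),   # true resemblance with d1: 300 / 500 = 0.6
    "d3": set(range(600, 900)),   # disjoint from d1 and d2
}
sketches = {name: bottom_s_sketch(d, pi, s) for name, d in docs.items()}
estimates = {
    (a, b): estimate(sketches[a], sketches[b], s)
    for a, b in combinations(sketches, 2)
}
```

Only the sketches are needed at comparison time. When s is at least the size of the union, the sketches contain every image and the estimate equals the exact resemblance.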

  6. Document sketching in small space
     • One problem with this: storing permutations is expensive
     • Question: can we work with a small set of permutations instead of the set $S_n$ of all permutations?
     • Yes: min-wise independent permutations [Broder et al. '98]
     • Can also use 2-wise independent hash functions [Thorup 2013]
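To illustrate why a hash function is cheaper than a stored permutation, here is a sketch that replaces the permutation with the classic linear family h(x) = (a x + b) mod p, which is 2-wise independent when a and b are uniform in Z_p. The prime p = 2^61 - 1, the exclusion of a = 0 (so that h is injective), and all names are assumptions of this example; it is not taken from the cited papers.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed larger than every shingle id


def draw_hash(rng):
    """Draw h(x) = (a*x + b) mod P; only the two integers a, b are stored."""
    a = rng.randrange(1, P)   # a != 0 keeps h injective on {0..P-1}
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P


def bottom_s(doc, h, s):
    """The s smallest hash values of doc (all of them if |doc| < s)."""
    return set(sorted(h(x) for x in doc)[:s])


rng = random.Random(0)
h = draw_hash(rng)
sketch = bottom_s(set(range(100, 600)), h, 50)
```

The hash function needs two machine words of storage, compared with n entries for an explicit permutation of {0, ..., n-1}.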

  7. Sampling from data streams

  8. Sampling from a data stream
     • How to select a uniformly random size-$k$ subset of $\{1, \dots, n\}$?
     • Scan $i = 1, \dots, n$ and choose the $i$-th element with probability $(k - j)/(n - i + 1)$ if $j$ elements have already been selected
     • What if the set is given via a stream and we don't know its length in advance?
     • There is a solution similar to the previous one, but the following is easier:
     • For the $i$-th item, sample $\alpha_i \in [0, 1]$ uniformly at random; keep the $k$ items with the highest values of $\alpha_i$
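Both procedures on this slide can be sketched as follows. The first assumes the length n is known and uses the selection probability (k - j)/(n - i + 1); the second assigns each stream item a uniform random tag and keeps the k largest tags in a min-heap. Function names are illustrative, and k is assumed to be at most n in the first function.

```python
import heapq
import random


def select_known_length(n, k, rng):
    """Uniformly random size-k subset of {1..n}, scanning i = 1..n once."""
    chosen = []
    for i in range(1, n + 1):
        j = len(chosen)                       # elements selected so far
        if rng.random() < (k - j) / (n - i + 1):
            chosen.append(i)
    return chosen


def sample_stream(stream, k, rng):
    """Keep the k items with the highest random tags; length not needed."""
    heap = []                                 # min-heap of (tag, arrival index, item)
    for idx, item in enumerate(stream):
        entry = (rng.random(), idx, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:                 # beats the smallest kept tag
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


rng = random.Random(0)
subset = select_known_length(100, 10, rng)
stream_sample = sample_stream(iter(range(1000)), 5, rng)
```

The heap keeps memory at O(k) regardless of the stream length, and each arrival costs O(log k) time in the worst case.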

  9. Sampling for subset sum estimation
     • Given a stream of positive weights $w_1, \dots, w_n$, we want to keep a small amount of information so that later we can estimate the weight of any given subset (the weight of a subset is the sum of the weights in it)
     First solution (Poisson sampling)
     • Choose any probabilities $p_i \in (0, 1]$, one for each weight $w_i$
     • On encountering $w_i$, include it in the sample set $S$ with probability $p_i$ (independent of previous decisions)
     • Given any set $T \subseteq \{1, \dots, n\}$ (chosen in advance, before the selection of $S$), the estimator for $w_T = \sum_{i \in T} w_i$ is $\hat{w}_T = \sum_{i \in T \cap S} w_i / p_i$
     • This is an unbiased estimator for $w_T$
     • The expected number of samples is $\sum_i p_i$
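A small sketch of Poisson sampling and its subset-sum estimator, with an empirical check that the estimate averages out to the true subset weight. The weights, the probabilities, the subset `T`, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random


def poisson_sample(weights, probs, rng):
    """Include item i independently with probability probs[i];
    store the scaled weight w_i / p_i for each sampled item."""
    sample = {}
    for i, (w, p) in enumerate(zip(weights, probs)):
        if rng.random() < p:
            sample[i] = w / p
    return sample


def estimate_subset(sample, subset):
    """Sum of scaled weights over the sampled items that lie in subset."""
    return sum(sample.get(i, 0.0) for i in subset)


weights = [3.0, 1.0, 4.0, 1.0, 5.0]
probs = [0.5] * len(weights)
T = {0, 2, 4}                     # fixed before sampling; true weight 12.0

rng = random.Random(0)
trials = 20000
mean_estimate = sum(
    estimate_subset(poisson_sample(weights, probs, rng), T)
    for _ in range(trials)
) / trials
```

A single estimate can be far from 12.0; only the average over many independent samples is close, which is what unbiasedness promises.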

  10. Poisson sampling
     • A smaller sample set does not come for free: the variance of the estimate of the weight of the $i$-th item is $w_i^2 (1/p_i - 1)$, which grows as $p_i$ shrinks
     • One issue with this solution: the sample size is not fixed (although it is concentrated around its mean)
     • Another issue: what should the values of the $p_i$ be? If we want the sample to have size $k$ in expectation, then a possible choice is $p_i = k/n$
     • But:
       • $n$ may not be known
       • this sampling is not weight-sensitive: we may want to choose $p_i$ to be larger for larger $w_i$ to reduce the variance
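For reference, the variance of the single-item Poisson estimator can be derived in two lines. The notation is assumed here: $\hat{w}_i$ denotes the estimate for item $i$, which equals $w_i / p_i$ if the item is sampled (probability $p_i$) and $0$ otherwise.

```latex
\begin{align*}
\mathbb{E}[\hat{w}_i] &= p_i \cdot \frac{w_i}{p_i} + (1 - p_i) \cdot 0 = w_i, \\
\mathbb{E}[\hat{w}_i^2] &= p_i \cdot \frac{w_i^2}{p_i^2} = \frac{w_i^2}{p_i}, \\
\operatorname{Var}[\hat{w}_i] &= \mathbb{E}[\hat{w}_i^2] - \mathbb{E}[\hat{w}_i]^2
  = \frac{w_i^2}{p_i} - w_i^2 = w_i^2 \left( \frac{1}{p_i} - 1 \right).
\end{align*}
```

For example, halving $p_i$ from $1/2$ to $1/4$ raises the variance from $w_i^2$ to $3 w_i^2$, and heavy items contribute quadratically, which is the motivation for weight-sensitive sampling.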

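  11. Priority sampling [Duffield et al. 2007]
     Second solution (priority sampling):
     • For each item $i$, generate an independent uniform $\alpha_i \in (0, 1]$
     • The priority of item $i$ is given by $q_i = w_i / \alpha_i$
     • We assume all priorities are distinct (true with probability $1$)
     • For a given $k$, the priority sample $S$ of size $k$ is given by the $k$ items of highest priority
     • $\tau$: the $(k+1)$-th highest priority; thus $i \in S$ iff $q_i > \tau$
     • For $i \in S$ let $\hat{w}_i = \max\{w_i, \tau\}$, and $\hat{w}_i = 0$ otherwise
     Properties of priority sampling:
     • Maintains a sample of fixed size $k$
     • For every item $i$: $\mathbb{E}[\hat{w}_i] = w_i$
     • And so, for any subset $T$ of items: $\mathbb{E}[\sum_{i \in T} \hat{w}_i] = \sum_{i \in T} w_i$ (proof on the board; also in the Duffield et al. paper)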
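The priority sampling procedure can be sketched with a min-heap that keeps the k + 1 highest priorities seen so far. This is an illustrative implementation, not code from the Duffield et al. paper; the function name, the return format, and the convention that the threshold is 0 when the stream has at most k items are assumptions of this example.

```python
import heapq
import random


def priority_sample(weights, k, rng):
    """Return (estimates, tau), where estimates maps each sampled index i
    to max(w_i, tau) and tau is the (k+1)-th highest priority."""
    heap = []                                  # min-heap of (priority, index, weight)
    for i, w in enumerate(weights):
        alpha = 1.0 - rng.random()             # uniform in (0, 1]
        entry = (w / alpha, i, w)
        if len(heap) <= k:                     # grow until k + 1 entries are held
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    if len(heap) <= k:                         # at most k items: keep all exactly
        return {i: w for _, i, w in heap}, 0.0
    tau = heapq.heappop(heap)[0]               # smallest of the k + 1 priorities
    return {i: max(w, tau) for _, i, w in heap}, tau


weights = [1.0, 2.0, 3.0, 4.0, 10.0]
estimates, tau = priority_sample(weights, 2, random.Random(1))

# Estimate of the weight of the subset T = {0, 4}: sum the estimates of the
# sampled items that fall in T (unsampled items contribute 0).
T = {0, 4}
subset_estimate = sum(estimates.get(i, 0.0) for i in T)
```

Items with large weights tend to have large priorities, so they are more likely to be kept, which is the weight sensitivity that Poisson sampling with uniform probabilities lacks.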

  12. Priority sampling properties
     We won't prove the following:
     • For distinct items $i \ne j$, the estimators $\hat{w}_i$ and $\hat{w}_j$ have covariance $0$
     • So the variance of the estimate of the weight of a set is the sum of the variances of the estimators for the items in the set
     • The total variance (the sum of the variances of the estimators of all individual items) of priority sampling is near-minimal among unbiased estimators

