
Document sketching



Presentation Transcript


  1. Document sketching • Problem: duplicate or near-duplicate identification in a collection of documents • How to measure the similarity between documents? • A reasonable (?) candidate: edit distance • Computationally expensive • Another measure: resemblance, due to [Broder '97]

  2. Resemblance of documents [Broder '97] • $r(A,B)$: resemblance between documents $A$ and $B$ • $0 \le r(A,B) \le 1$. Similar means close to $1$ • Convert documents to sets of integers • A contiguous sequence of length $w$ contained in document $D$ is called a $w$-shingle • Example: $D$ = (a rose is a rose is a rose) • $4$-shingles of $D$ are: (a rose is a), (rose is a rose), (is a rose is), (a rose is a), (rose is a rose) • The set of $4$-shingles of $D$: {(a rose is a), (rose is a rose), (is a rose is)} • Map shingles to integers in $[n] = \{1, \dots, n\}$ (for some fixed $n$) • From now on, identify the documents with sets of integers in $[n]$ • Thus a document $D$ is represented as a set $S_D \subseteq [n]$ of integers • $r(A,B) = |S_A \cap S_B| / |S_A \cup S_B|$ (also known as the Jaccard similarity between sets $S_A$ and $S_B$) • Thus $r(A,A) = 1$, but $r(A,B) = 1$ does not mean $A = B$ • In practice, $r(A,B)$ is a reasonable approximation of the informal notion of similarity of $A$ and $B$
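The shingling and resemblance definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: documents are tokenized on whitespace, and Python sets of word tuples stand in for the sets of integers in $[n]$. The function names `shingles` and `resemblance` are hypothetical.

```python
def shingles(text, w):
    """Return the set of w-shingles (word tuples) of a whitespace-tokenized text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(set_a, set_b):
    """Jaccard similarity |A & B| / |A | B|; taken to be 1.0 for two empty sets."""
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen for this sketch, not from the slides
    return len(set_a & set_b) / len(union)


s_doc = shingles("a rose is a rose is a rose", 4)
print(sorted(s_doc))  # the three distinct 4-shingles from the slide
print(resemblance(s_doc, shingles("a rose is a rose", 4)))
# A longer, different document with the same shingle set has resemblance 1.
print(resemblance(s_doc, shingles("a rose is a rose is a rose is a rose", 4)))
```

The last call illustrates the remark that resemblance $1$ does not imply the documents are equal.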

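  3. Estimating resemblance • Given $S_A, S_B \subseteq [n]$ • Estimate: $r(A,B) = |S_A \cap S_B| / |S_A \cup S_B|$ • Exact computation of $r(A,B)$ requires time linear in $|S_A| + |S_B|$ • A basic estimator for $r(A,B)$ • $S_n$: set of permutations of $[n]$ • Choose a uniformly random $\pi \in S_n$; then $\Pr[\min \pi(S_A) = \min \pi(S_B)] = r(A,B)$ • Variance too high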
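A short justification of the basic estimator, assuming $\pi$ is uniform over all permutations of $[n]$ and writing $S_A, S_B$ for the integer sets of the two documents (notation assumed here): every element of $S_A \cup S_B$ is equally likely to receive the smallest image under $\pi$, and the two minima coincide exactly when that element lies in both sets.

```latex
\Pr_{\pi}\bigl[\min \pi(S_A) = \min \pi(S_B)\bigr]
  = \Pr_{\pi}\Bigl[\operatorname*{arg\,min}_{x \in S_A \cup S_B} \pi(x) \in S_A \cap S_B\Bigr]
  = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}
  = r(A,B).
```

The indicator of this event is a single 0/1 sample with mean $r(A,B)$, so its variance is $r(A,B)\,(1 - r(A,B))$, which is why one sample alone is too noisy.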

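  4. Reducing variance First method • Sample $t$ independent random permutations $\pi_1, \dots, \pi_t \in S_n$ • Sketch of document $A$ is $(\min \pi_1(S_A), \dots, \min \pi_t(S_A))$ • Resemblance can be estimated as $\frac{1}{t} |\{ i : \min \pi_i(S_A) = \min \pi_i(S_B) \}|$ (this is an unbiased estimator; proof follows from the previous slide) Second method • Let $\mathrm{MIN}_s(W)$ denote the set of $s$ smallest elements of $W$, and if $|W| < s$, then $\mathrm{MIN}_s(W) = W$ • For a constant $s$ and uniformly random $\pi \in S_n$, $\frac{|\mathrm{MIN}_s(\pi(S_A \cup S_B)) \cap \mathrm{MIN}_s(\pi(S_A)) \cap \mathrm{MIN}_s(\pi(S_B))|}{|\mathrm{MIN}_s(\pi(S_A \cup S_B))|}$ is an unbiased estimator of $r(A,B)$ (details on the board)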
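The first method can be sketched as follows. This is an illustration only, assuming documents are already sets of integers in $\{0, \dots, n-1\}$ and that storing $t$ explicit permutations is acceptable; the names `make_permutations`, `sketch` and `estimate_resemblance` are hypothetical.

```python
import random


def make_permutations(n, t, seed=0):
    """Sample t independent uniformly random permutations of {0, ..., n-1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(t):
        perm = list(range(n))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def sketch(doc_set, perms):
    """Sketch (min pi_1(S), ..., min pi_t(S)); doc_set must be non-empty."""
    return [min(perm[x] for x in doc_set) for perm in perms]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of coordinates on which the two sketches agree."""
    agree = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return agree / len(sketch_a)


n, t = 1000, 500
perms = make_permutations(n, t)
s_a = set(range(0, 600))
s_b = set(range(300, 900))  # true resemblance is 300 / 900 = 1/3
est = estimate_resemblance(sketch(s_a, perms), sketch(s_b, perms))
print(est)  # a random estimate that should fall near 1/3
```

Averaging $t$ independent indicators divides the variance of the single-permutation estimator by $t$.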

  5. We can estimate $r(A,B)$ within multiplicative error $1 \pm \epsilon$ with suitably large $t$ and $s$ (for both methods above) • The second method above gives us a way of sketching the documents: fix a permutation $\pi$ and a constant $s$. For document $A$, its sketch is $\mathrm{MIN}_s(\pi(S_A))$ • Now given the sketches of documents $A, B$, computed using the same permutation $\pi$, we can estimate the resemblance of pairs • Sketch of a document takes $O(s)$ space and estimating resemblance takes $O(s)$ time • (We can also do it with the first method, but we will need to store $t$ permutations)
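A sketch of the single-permutation scheme, assuming documents are sets of integers in $\{0, \dots, n-1\}$ and the permutation is stored explicitly as a list. It relies on the fact that the $s$ smallest values of the union of two sketches equal the $s$ smallest values of the permuted union of the documents. The names `bottom_s_sketch` and `estimate_from_sketches` are hypothetical.

```python
import heapq
import random


def bottom_s_sketch(doc_set, perm, s):
    """The s smallest permuted values of the document (all of them if |S| < s)."""
    return set(heapq.nsmallest(s, (perm[x] for x in doc_set)))


def estimate_from_sketches(m_a, m_b, s):
    """|MIN_s(M_A | M_B) & M_A & M_B| / |MIN_s(M_A | M_B)| from two sketches."""
    smallest = set(heapq.nsmallest(s, m_a | m_b))
    if not smallest:
        return 1.0  # convention for two empty documents, chosen for this sketch
    return len(smallest & m_a & m_b) / len(smallest)


n, s = 1000, 400
rng = random.Random(1)
perm = list(range(n))
rng.shuffle(perm)

s_a = set(range(0, 600))
s_b = set(range(300, 900))  # true resemblance is 1/3
m_a = bottom_s_sketch(s_a, perm, s)
m_b = bottom_s_sketch(s_b, perm, s)
est = estimate_from_sketches(m_a, m_b, s)
print(est)  # a random estimate that should fall near 1/3
```

Only the two sketches are needed at query time, which matches the $O(s)$ space and time stated on the slide.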

  6. Document sketching in small space • One problem with this: storing permutations is expensive • Question: Can we work with a small set of permutations instead of $S_n$? • Yes: Min-wise independent permutations [Broder et al. '98] • Can also use 2-wise independent hash functions [Thorup 2013]

  7. Sampling from data streams

  8. Sampling from a data stream • How to select a uniformly random size-$k$ subset of $[n]$? • Choose the $i$th element with probability $(k - j)/(n - i + 1)$ if $j$ elements have already been selected • What if the set is given via a stream and we don't know its length in advance? • There is a solution similar to the previous one, but the following is easier: • For the $i$th item, sample $\alpha_i \in (0,1)$ uniformly at random, keep the $k$ items with the highest values of $\alpha_i$
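The random-tag approach for streams of unknown length can be sketched with a bounded min-heap. This is one possible implementation, not necessarily the one intended on the slide; the name `sample_stream` and the choice of a heap are assumptions.

```python
import heapq
import random


def sample_stream(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length.

    Each item gets an independent uniform tag; the k items with the highest
    tags are kept. If the stream has fewer than k items, all are returned.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (tag, index, item); the root holds the smallest kept tag
    for index, item in enumerate(stream):
        entry = (rng.random(), index, item)  # index breaks ties between equal tags
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


picked = sample_stream((x * x for x in range(100)), 5, random.Random(0))
print(picked)
```

The heap never holds more than $k$ entries, so memory use is independent of the stream length.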

  9. Sampling for subset sum estimation • Given a stream $w_1, \dots, w_n$ of positive weights, we want to keep a small amount of information so that later we can estimate the weight of any given subset (the weight of a subset is the sum of the weights in it) First solution (Poisson sampling) • Choose any probabilities $p_i \in (0,1]$ for each weight $w_i$ • On encountering $w_i$, include $i$ in set $S$ with probability $p_i$ (independent of previous decisions) • Given any set $T \subseteq [n]$ (chosen in advance before the selection of $S$), estimator for $w_T = \sum_{i \in T} w_i$: $\hat{w}_T = \sum_{i \in T \cap S} w_i / p_i$ • This is an unbiased estimator for $w_T$ • The expected number of samples is $\sum_{i} p_i$
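A small simulation of Poisson sampling, assuming the weights and inclusion probabilities are held in lists indexed from 0. The helper names `poisson_sample` and `estimate_subset_weight` are hypothetical. Averaging the estimate over many independent samples illustrates unbiasedness.

```python
import random


def poisson_sample(probs, rng):
    """Include index i independently with probability probs[i]."""
    return {i for i, p in enumerate(probs) if rng.random() < p}


def estimate_subset_weight(subset, sample, weights, probs):
    """Sum of w_i / p_i over the sampled indices that lie in the subset."""
    return sum(weights[i] / probs[i] for i in subset if i in sample)


weights = [float(w) for w in range(1, 21)]
probs = [0.5] * len(weights)
subset = set(range(0, 20, 2))  # weights 1, 3, ..., 19 with true total 100
rng = random.Random(0)
trials = 5000
mean_estimate = sum(
    estimate_subset_weight(subset, poisson_sample(probs, rng), weights, probs)
    for _ in range(trials)
) / trials
print(mean_estimate)  # the average should be close to 100
```

Individual estimates vary widely from sample to sample; only their average is close to the true subset weight.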

  10. Poisson sampling • Smaller sample set does not come for free: variance in the estimate of the weight of the $i$th item is $\mathrm{Var}[\hat{w}_i] = w_i^2 (1/p_i - 1)$ • One issue with this solution: the sample size is not fixed (although it can be concentrated around the mean) • Another issue: what should be the values of the $p_i$? If we want the sample to be of size $k$ in expectation, then a possible choice is $p_i = k/n$ • But • $n$ may not be known • this sampling is not weight-sensitive: we may want to choose $p_i$ to be larger for larger $w_i$ to reduce the variance
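The per-item variance follows directly from the two-point distribution of the estimate. A short derivation, writing $\hat{w}_i$ for the estimate of the $i$th weight (notation assumed here):

```latex
\hat{w}_i =
\begin{cases}
  w_i / p_i & \text{with probability } p_i,\\
  0         & \text{with probability } 1 - p_i,
\end{cases}
\qquad
\mathbb{E}[\hat{w}_i] = p_i \cdot \frac{w_i}{p_i} = w_i,
\qquad
\mathrm{Var}[\hat{w}_i] = p_i \cdot \frac{w_i^2}{p_i^2} - w_i^2
  = w_i^2 \left( \frac{1}{p_i} - 1 \right).
```

The variance grows as $p_i$ shrinks and scales with $w_i^2$, which is the reason for wanting larger $p_i$ on heavier items.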

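  11. Priority sampling [Duffield et al. 2007] Second solution (priority sampling): • For each item $i$, generate an independent uniform $\alpha_i \in (0,1)$ • Priority of item $i$ is given by $q_i = w_i / \alpha_i$ • We assume all priorities are distinct (true with probability $1$) • For a given $k$, the priority sample $S$ of size $k$ is given by the $k$ items of highest priority • $\tau$ = $(k+1)$th highest priority, thus $i \in S$ iff $q_i > \tau$ • For $i \in [n]$ let $\hat{w}_i = \max\{w_i, \tau\}$ if $i \in S$ and $\hat{w}_i = 0$ otherwise Properties of priority sampling: • Maintains sample of fixed size $k$ • For $i \in [n]$: $E[\hat{w}_i] = w_i$ • And so, for $T \subseteq [n]$, $E[\hat{w}_T] = w_T$, where $\hat{w}_T = \sum_{i \in T} \hat{w}_i$ and $w_T = \sum_{i \in T} w_i$ (proof on the board; also in the Duffield et al. paper)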
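Priority sampling can be sketched in one pass with a heap that holds the $k+1$ highest priorities seen so far. This is an illustrative implementation under stated assumptions: weights are positive floats in a list, the function names `priority_sample` and `estimate_subset` are hypothetical, and the threshold is taken to be 0 when the stream has at most $k$ items.

```python
import heapq
import random


def priority_sample(weights, k, rng):
    """Return ({index: weight estimate} for the sampled items, threshold tau)."""
    heap = []  # min-heap of (priority, index) holding the k+1 highest priorities
    for i, w in enumerate(weights):
        alpha = 1.0 - rng.random()  # uniform on (0, 1], so division is always safe
        heapq.heappush(heap, (w / alpha, i))
        if len(heap) > k + 1:
            heapq.heappop(heap)
    if len(heap) <= k:
        # At most k items in total: keep everything, estimates are exact.
        return {i: weights[i] for _, i in heap}, 0.0
    tau, _ = heapq.heappop(heap)  # the (k+1)-th highest priority
    return {i: max(weights[i], tau) for _, i in heap}, tau


def estimate_subset(subset, estimates):
    """Sum the estimates of sampled items in the subset; unsampled items count 0."""
    return sum(estimates.get(i, 0.0) for i in subset)


weights = [float(w) for w in range(1, 21)]
subset = set(range(0, 20, 2))  # weights 1, 3, ..., 19 with true total 100
rng = random.Random(0)
trials = 10000
total = 0.0
for _ in range(trials):
    estimates, _tau = priority_sample(weights, 5, rng)
    total += estimate_subset(subset, estimates)
mean_estimate = total / trials
print(mean_estimate)  # the average should be close to 100
```

Unlike Poisson sampling, every run returns exactly $k$ items, and heavier items are more likely to be kept because their priorities tend to be larger.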

  12. Priority sampling properties We won't prove the following: • For distinct $i, j$, the estimators $\hat{w}_i$ and $\hat{w}_j$ have covariance $0$ • So the variance of the estimate of the weight of a set is the sum of the variances of the estimators for the items in the set • The total variance (sum of variances of the estimators of all individual items) of priority sampling is near-minimal among unbiased estimators
