My Favorite Algorithms for Large-Scale Data Mining

My Favorite Algorithms for Large-Scale Data Mining Shingling Minhashing Locality-Sensitive Hashing

Similarity Search • A universal set of “objects.” • A collection of sets of objects. • Find the pairs of sets that are “similar.” • Jaccard similarity of sets = size of intersection divided by size of union.

Example: Jaccard Similarity 3 in intersection. 8 in union. Jaccard similarity = 3/8

Applications • Collaborative Filtering : Amazon customers as the set of products they buy. • Recommend what similar customers bought. • Similar Documents : A document as its set of k-shingles = strings of k consecutive characters. • Examples: news articles from same source, plagiarism.

Applications – (2) • Fingerprint Checking : Represent a fingerprint by the set of positions of minutiae. • Requires discretization. • Entity Resolution : Represent records describing individuals by sets of attribute/value pairs.

Key Ideas • Shingling : (Andrei Broder) Convert documents into sets. • Minhashing : (Edith Cohen, Broder) Construct small signatures for sets so Jaccard similarity of sets can be determined from the signatures. • Locality-Sensitive Hashing : (Rajeev Motwani, Piotr Indyk) Focus on (likely) similar pairs without looking at all pairs.

Candidate pairs : those pairs of signatures that we need to test for similarity. Locality- sensitive Hashing Minhash- ing The set of strings of length k that appear in the document Signatures : short integer vectors that represent the sets and reflect their similarity The Big Picture – Documents Shingling Docu- ment

Candidate pairs : those pairs of fingerprints that we need to test for similarity. Locality- sensitive Hashing Optional minhashing here The Big Picture – Fingerprints Extract minutiae and discret- ize Finger- print

When Is Similarity Interesting? • When the sets are so large or so many that they cannot fit in main memory. • When there are so many sets that comparing all pairs of sets takes too much time.

Shingling k -Shingles Documents as Sets

Shingles • A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document. • Example: k=2; doc = abcab. Set of 2-shingles = {ab, bc, ca}. • Option: regard shingles as a bag, and count ab twice. • Represent a doc by its set of k -shingles.

Working Assumption • Documents that have lots of shingles in common have similar text, even if the text appears in different order. • Careful: you must pick k large enough, or most documents will have most shingles. • k = 5 is OK for short documents; k = 10 is better for long documents.

Shingles: Compression Option • To compress long shingles, we can hash them to (say) 4 bytes. • Represent a doc by the set of hash values of its k-shingles. • Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared.

Minhashing Matrix Formulation Signatures Similarity of Signatures

Similarity as a Matrix Problem • Think of sets represented by a matrix of 0’s and 1’s. • Row = object. • Column = set. • 1 means that object is in that set.

* * * * * * * Example: Similarity of Columns C1 C2 u 0 1 v 1 0 w 1 1 Sim (C1, C2) = x 0 0 2/5 = 0.4 y 1 1 z 0 1 C1 = {v,w,y} C2 = {u,w,y,z}

Four Types of Rows • Given columns C1 and C2, rows may be classified as: C1 C2 a 1 1 b 1 0 c 0 1 d 0 0 • Also, a = # rows of type a , etc. • Note Sim (C1, C2) = a /(a +b +c ).

Minhashing • Imagine the rows permuted randomly. • Define “hash” function h (C ) = the number of the first (in the permuted order) row in which column C has 1. • Use several (100?) independent hash functions to create a signature with that number of integer hash-values.

Input matrix Signature matrix M 1 4 1 0 1 0 2 1 2 1 1 0 0 1 3 2 2 1 4 1 0 1 0 1 7 1 1 2 1 2 0 1 0 1 6 3 0 1 0 1 2 6 5 7 1 0 1 0 4 5 1 0 1 0 Minhashing Example

Surprising Property • The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2). • Both are a /(a +b +c )! Why? • Look down columns C1 and C2 (in permuted order) until we see a 1. • If it’s a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.

Similarity for Signatures • The similarity of signatures is the fraction of the rows in which they agree. • Remember, each row corresponds to a permutation or “hash function.”

Implementation – (1) • You can’t really permute rows physically. • Good approximation to permuting rows: pick 100 (?) hash functions. • For each column c and each hash function hi , keep a “slot” M (i, c ) for that minhash value.

Implementation – (2) for each row r for each column c if c has 1 in row r for each hash function hido ifhi (r ) is a smaller value than M (i, c ) then M (i, c ) := hi (r );

Example Sig1 Sig2 h(1) = 1 1 - g(1) = 3 3 - Row C1 C2 1 1 0 2 0 1 3 1 1 4 1 0 5 0 1 h(2) = 2 1 2 g(2) = 0 3 0 h(3) = 3 1 2 g(3) = 2 2 0 h(4) = 4 1 2 g(4) = 4 2 0 h(x) = x mod 5 g(x) = 2x +1 mod 5 h(5) = 0 1 0 g(5) = 1 2 0

Implementation – (3) • Often, data is given by column, not row. • E.g., columns = documents, rows = shingles. • If so, sort matrix once so it is by row. • And always compute hi (r ) only once for each row.

Locality-Sensitive Hashing The All-Pairs Problem Banding of Signature Matrices Other LSH Techniques

Finding Similar Sets • We can use minhashing to replace sets (columns of the matrix) by short lists of integers. • But we still need to compare each pair of signatures. • Example: 20 million Amazon customers; 2*1014 pairs of customers to evaluate.

Locality-Sensitive Hashing • What we want seems impossible. Map signatures to buckets so that: • Two similar signatures have a very good chance of appearing in the same bucket. • If two signatures are not very similar, they probably don’t appear in one bucket. • Then, we only have to compare bucket-mates (candidate pairs ).

LSH for Signatures • Think of the signature for each column (set) as a column of the signature matrix S. • Divide the rows of S into bbands of r rows each.

Partition Into Bands – (1) r rows per band b bands One signature Matrix S

Partition into Bands – (2) • For each band, hash its portion of each column to a hash table with many buckets. • Candidatecolumn pairs are those that hash to the same bucket for ≥ 1 band. • Tune b and r to catch most similar pairs, but few nonsimilar pairs.

Buckets Matrix S b bands r rows

Probability = 1 if s > t No chance if s < t Analysis of LSH – What We Want Probability of sharing a bucket t Similarity s of two sets

What One Band of One Row Gives You Remember: probability of equal hash-values = similarity Probability of sharing a bucket t Similarity s of two sets

No bands identical At least one band identical 1 - ( 1 - sr )b t ~ (1/b)1/r Some row of a band unequal All rows of a band are equal What b Bands of r Rows Gives You Probability of sharing a bucket t Similarity s of two sets

Example: b = 20; r = 5

Summary of Minhash/LSH • Represent the objects you are comparing by sets (e.g., shingling). • Represent the sets by signatures (minhashing). • Use LSH to create buckets; candidate pairs are those in the same bucket. • Evaluate only the candidate pairs.

Application: LSH for Fingerprints • Place a grid on a fingerprint. • Normalize so identical prints will overlap. • Set of grid points where minutiae are located represents the fingerprint. • Possibly, treat minutiae near a grid boundary as if also present in adjacent grid points.

Minutia located here Maybe pretend it is here also Discretizing Minutiae

Applying LSH to Fingerprints • We could minhash the bit-vectors to obtain signatures. • But since there probably aren’t too many grid points, we can work from the bit-vectors directly.

LSH/Fingerprints – (2) • Pick 100 (?) sets of 3 (?) grid points, randomly. • For each set of three grid points, those prints that have 1 for all three points are placed in a bucket. • All pairs in this bucket are candidates.

Application: Matching Customer Records • I once took a consulting job solving the following problem: • Company A agreed to solicit customers for Company B, for a fee. • They then argued over how many customers. • Neither recorded exactly which customers were involved.

Customer Records – (2) • Company B had about 1 million records of all its customers. • Company A had about 1 million records describing customers, some of which it had signed up for B. • Records had name, address, and phone, but for various reasons, they could be different for the same person.

Customer Records – (3) • Step 1: Design a measure (“score ”) of how similar records are: • E.g., deduct points for small misspellings (“Jeffrey” vs. “Geoffery”) or same phone with different area code. • Step 2: Score all pairs of records; report high scores as matches.

Customer Records – (4) • Problem: (1 million)2 is too many pairs of records to score. • Solution: A simple LSH. • Three hash functions: exact values of name, address, phone. • Compare iff records are identical in at least one. • Misses similar records with a small difference in all three fields.

Aside: Validation of Results • We were able to tell what values of the scoring function were reliable in an interesting way. • Identical records had a creation date difference of 10 days. • We only looked for records created within 90 days, so bogus matches had a 45-day average.

Validation – (2) • By looking at the pool of matches with a fixed score, we could compute the average time-difference, say x, and deduce that fraction (45-x )/35 of them were valid matches. • Alas, the lawyers didn’t think the jury would understand.

The End Thanks for Listening

My Favorite Algorithms for Large-Scale Data Mining

My Favorite Algorithms for Large-Scale Data Mining

Presentation Transcript

Algorithms for Large Data Sets

Algorithms for Large Data Sets

Algorithms for Large Data Sets

Algorithms for Large Data Sets

Algorithms for Large Data Sets

Algorithms for Large Data Sets

Data Mining Algorithms

Large scale genomic data mining

Algorithms for Large Data Sets

Large scale genomic data mining

Stability Analysis Algorithms for Large-Scale Applications

Large scale data processing

Efficient Algorithms for Large-Scale GIS Applications

Efficient Algorithms for Large-Scale Topology Discovery

STRING Large-scale data and text mining

Data Mining Algorithms for Large-Scale Distributed Systems

Large Scale Data Integration

Large Scale Data Analytics

large scale data analysis

Data Mining Algorithms

Efficient Algorithms for Large-Scale GIS Applications