Three Cool Algorithms You’ve Never Heard Of! Carey Nachenberg cnachenberg@symantec.com
Cool Data Structure: The Metric Tree

[Figure: a metric tree of cities. Each node holds a city and a distance threshold; one branch leads to cities within the threshold (<=) and the other to cities farther away (>). Nodes shown: LA (1500km), SF (100km), Austin (250km), San Jose (200km), Merced (70km), Providence (200km), New Orleans (300km), Atlanta (600km), NYC (1100km), Las Vegas (1000km), Boston (400km).]
Challenge: Building a Fuzzy Spell Checker

Imagine you’re building a word processor and you want to implement a spell checker that gives suggestions…

[Figure: the misspelled word “lobeky” with a Suggestions list: lonely, lovely, locale, …]

Of course it’s easy to tell the user that their word is misspelled…

Question: What data structure could we use to determine if a word is in a dictionary or not?

Right! A hash table or binary search tree could tell you if a word is spelled correctly.

But what if we want to efficiently provide the user with possible alternatives?
Providing Alternatives?

Before we can provide alternatives, we need a way to find close matches… One useful tool for this is the “edit distance” metric.

Edit Distance: How many letters must be added, deleted or replaced to get from word A to word B.

lobeky -> lovely has an edit distance of 2 (replace b with v, replace k with l).
lobeky -> lowly has an edit distance of 3 (replace b with w, replace e with l, delete k).

So given the user’s misspelled word, and this edit distance function… How can we use this to provide the user with spelling suggestions?
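The slides never show how e(x,y) is computed, so here is a minimal sketch of one common way to do it: the classic dynamic-programming (Levenshtein) algorithm. The function name editDistance is an assumption for illustration; the slides only call the function e().

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Edit distance: the fewest single-letter insertions, deletions or
// replacements needed to turn a into b. Only two rows of the usual
// DP table are kept in memory.
int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = static_cast<int>(j);          // "" -> b[0..j) needs j insertions
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);           // a[0..i) -> "" needs i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int replace = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            int remove  = prev[j] + 1;
            int insert  = cur[j - 1] + 1;
            cur[j] = std::min({replace, remove, insert});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

Calling editDistance("lobeky", "lovely") gives 2 and editDistance("lobeky", "lowly") gives 3, matching the examples above. The cost is proportional to the product of the two word lengths, which is why we want to call it as rarely as possible.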
Providing Alternatives?

Well, we could take our misspelled word and compute its edit distance to every word in the dictionary! And then give the user all words with an edit distance of <=3…

[Figure: “lobeky” compared against each dictionary word in turn: aardvark, ark, acorn, … bone, bonfire, … lonely, lonesome, …]

But that’s really, really slow!

There’s a better way! But before we talk about it, let’s talk about edit distance a bit more…
Edit Distance

As it turns out, the edit distance function, e(x,y), is what we call a “metric distance function.” What does that mean?

1. e(x,y) = e(y,x)
The edit distance from “foo” to “food” is the same as from “food” to “foo”.

2. e(x,y) >= 0
You can never have a negative edit distance… Well, that makes sense…

3. e(x,z) <= e(x,y) + e(y,z), aka “the triangle inequality”
It’s never cheaper to do two conversions than a direct conversion.

e(“foo”,“feed”) = 3 and e(“feed”,“goon”) = 4, for a total cost of 7, which is greater than the direct cost e(“foo”,“goon”) = 2.
Metric Distance Functions

Given some word w (e.g., pier), let’s say I happen to know all words with an edit distance of 1 from that word…

[Figure: a cloud of words around “pier”: tier, piper, peer, pie, pies]

Now, if my misspelled word m (e.g., zifs) has an edit distance of 3 from w, what does that guarantee about m to these other words?

Right: If e(“zifs”,“pier”) is 3, and all these other words are exactly 1 edit away from pier… Then by the triangle inequality, “zifs” must be at most 4 edits away from any word in this cloud!

e(“zifs”,“pier”) = 3
e(“pier”,“piper”) = 1
Total cost: 4
And directly: e(“zifs”,“piper”) = 4

But by the same reasoning, none of these words can be less than 2 edits away from “zifs”… Why? Because we know that all of these words have at most one character difference from “pier”… So if “pier” is 3 away from “zifs”, then in the best case these other words would be one letter closer to “zifs” (e.g., if one of pier’s letters was replaced by one of zifs’ letters)...

Let’s see: e(“zifs”,“pies”) = 2

Imagine if we had thousands of different clouds like this. We could compare your misspelled word to the center word of each cloud. If e(m,w) is less than some threshold edit distance, then the cloud’s other words are good suggestions…
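The two guarantees above can be written down as a tiny helper. This is only a sketch of the reasoning: cloudBounds and its parameter names are made up for illustration and do not appear in the slides.

```cpp
#include <algorithm>
#include <utility>

// Given e(m, w) = distToCenter and a cloud of words that are all within
// cloudRadius edits of the center word w, the triangle inequality bounds
// the distance from m to ANY word in the cloud.
// Returns {lowest possible distance, highest possible distance}.
std::pair<int, int> cloudBounds(int distToCenter, int cloudRadius) {
    int lowest  = std::max(0, distToCenter - cloudRadius);
    int highest = distToCenter + cloudRadius;
    return {lowest, highest};
}
```

For the “zifs”/“pier” example, cloudBounds(3, 1) yields {2, 4}: every word in pier’s cloud is between 2 and 4 edits from “zifs”, and we learned that with a single edit-distance computation.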
Metric Distance Functions

[Figure: “zifs” is compared against the center word of each of several word clouds. Words shown include rate, date, hate, gate, ate, gale, table, computer and pencil, plus the “pier” cloud (tier, piper, peer, pie, pies).]
A Better Way?

That works well, but then again, we’d still have to do thousands of comparisons (one to each cloud)… Hmmm. Can we figure out a more efficient way to do this? Say with log2(D) comparisons, where D is the number of words in your dictionary?

Duh… Well of course, we’ll need a tree!
The Metric Tree

The Metric Tree was invented in 1991 by Jeffrey Uhlmann of the Naval Research Labs.

Each node in a Metric Tree holds a word, an edit distance threshold value and left and right next pointers.

    struct MetricTreeNode
    {
        string word;
        unsigned int editThreshold;
        MetricTreeNode *left, *right;
    };

Building one is really slow, but once we build it, searching it is really fast! Let’s see how to build a Metric Tree!
The Metric Tree

SetOfWords: goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster

    Node *buildMTree(SetOfWords &S)
      (If S is empty, there is nothing to build: return NULL.)
      1. Pick a random word W from set S.
      2. Compute the edit distance for all other words in set S to your random word W.
      3. Sort all these words based on their edit distance di to your random word W.
      4. Select the median value of di, let dmed be this median edit distance.
      5. Now, create a root node N for our tree and put our word W in this node. Set its editThreshold value to dmed.
      6. N->left = buildMTree(subset of the other words in S with di <= dmed)
      7. N->right = buildMTree(subset of the other words in S with di > dmed)
      8. return N

    main()
    {
        Let S = {every word in the dictionary};
        Node *root = buildMTree(S);
    }
The Metric Tree

Suppose the random word picked from the set is “rooster”. The edit distances from “rooster” to the other words, sorted:

    roster    1
    oyster    2
    hamster   3
    mouse     4
    goat      6
    toad      6
    hippo     7
    chicken   7

dmed = 4, so the root node holds “rooster” with an editThreshold of 4.
The Metric Tree

Left subtree of “rooster” (the words within 4 edits): roster, oyster, hamster, mouse.

Picking “mouse” as the random word gives dmed = 4, so this node holds “mouse” with an editThreshold of 4. Recursing again produces the nodes “oyster” (editThreshold 4), “hamster” (editThreshold 0) and “roster” (editThreshold 0).
The Metric Tree

Right subtree of “rooster” (the words more than 4 edits away): goat, toad, hippo, chicken.

Picking “toad” as the random word, the distances are goat 2, hippo 5 and chicken 7, so dmed = 5 and this node holds “toad” with an editThreshold of 5. Recursing again produces the nodes “goat” (editThreshold 5), “chicken” (editThreshold 0) and “hippo” (editThreshold 0).
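The build steps can be sketched in C++ roughly as follows. Several things here are assumptions rather than part of the slides: the pivot is simply the first word instead of a random one (so the result is reproducible), nodes are owned through std::unique_ptr, the lower median is used when the number of distances is even (which matches dmed = 4 in the example), and the helper names editDistance and countNodes are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Levenshtein edit distance (two-row dynamic programming).
int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j)
            cur[j] = std::min({prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1),
                               prev[j] + 1, cur[j - 1] + 1});
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

struct MetricTreeNode {
    std::string word;
    int editThreshold = 0;                       // stays 0 for a leaf
    std::unique_ptr<MetricTreeNode> left, right;
};

std::unique_ptr<MetricTreeNode> buildMTree(std::vector<std::string> words) {
    if (words.empty()) return nullptr;           // nothing to build
    auto node = std::make_unique<MetricTreeNode>();
    node->word = words.front();                  // assumption: first word, not a random one
    words.erase(words.begin());
    if (words.empty()) return node;

    // Steps 2-3: distance from the pivot to every other word, sorted.
    std::vector<std::pair<int, std::string>> byDist;
    for (const std::string& w : words)
        byDist.emplace_back(editDistance(node->word, w), w);
    std::sort(byDist.begin(), byDist.end());

    // Step 4: (lower) median distance becomes the threshold.
    node->editThreshold = byDist[(byDist.size() - 1) / 2].first;

    // Steps 6-7: split the remaining words and recurse.
    std::vector<std::string> closeWords, farWords;
    for (const auto& entry : byDist)
        (entry.first <= node->editThreshold ? closeWords : farWords).push_back(entry.second);
    node->left = buildMTree(std::move(closeWords));
    node->right = buildMTree(std::move(farWords));
    return node;
}

// Helper for inspecting the result.
int countNodes(const MetricTreeNode* n) {
    return n ? 1 + countNodes(n->left.get()) + countNodes(n->right.get()) : 0;
}
```

Passing the nine words with “rooster” first produces a root holding “rooster” with an editThreshold of 4, as in the walkthrough. The deeper levels may differ from the slides, because the pivots there were picked at random.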
A Metric Tree

So now we have a metric tree! How do we interpret it?

    “rooster” 4
        left:  “mouse” 4
            left:  “oyster” 4
                left:  “roster” 0
            right: “hamster” 0
        right: “toad” 5
            left:  “goat” 5
                left:  “hippo” 0
            right: “chicken” 0

Every word to the left of rooster is guaranteed to be within 4 edits of it… And every word to the right of rooster is guaranteed to be more than 4 edits away… And this same structure is repeated recursively!
Searching

When you search a metric tree, you specify the word you’re looking for and an edit-distance radius, e.g., I want to find words within 2 edits of “roaster”.

Starting at the root, there are three cases to consider:

1. Your word and its search radius are totally inside the edit threshold.

In this case, all of your matches are guaranteed to be in our left subtree…

[Figure: “roaster” with a search radius of 2 falls entirely inside the threshold-4 circle around “rooster”.]
Searching

2. Your word and its search radius are partially inside and partially outside the edit threshold.

In this case, some matches will be in our left subtree and some in our right subtree…

[Figure: “goute” with a search radius of 2 straddles the edge of the threshold-4 circle around “rooster”.]
Searching

3. Your word and its search radius are completely outside the edit threshold.

In this case, all matches will be in our right subtree.

[Figure: “vhivken” with a search radius of 2 lies entirely outside the threshold-4 circle around “rooster”.]
Metric Tree: Search Algorithm

    PrintMatches(Node *cur, string misspell, int rad)
    {
        if e(misspell,cur->word) <= rad
            then print the current word
        if e(misspell,cur->word) <= cur->editThreshold
            then PrintMatches(cur->left, misspell, rad)
        if e(misspell,cur->word) > cur->editThreshold
            then PrintMatches(cur->right, misspell, rad);
    }

*This is a slight simplification…

Let’s trace PrintMatches(root,“chomster”,2):

At “rooster”: e(“chomster”,“rooster”) = 3. So rooster is outside of chomster’s radius of 2. It’s not a close enough match to print… Since 3 is less than our editThreshold of 4, let’s go left…

At “mouse”: e(“chomster”,“mouse”) = 5. So mouse is outside of chomster’s radius of 2. It’s not a close enough match to print… Since 5 is greater than our editThreshold of 4, we won’t go left. We will go right.

At “hamster”: e(“chomster”,“hamster”) = 2. So hamster is inside of chomster’s radius of 2. We’ve got a match! Print hamster!
Other Metric Tree Applications

Here’s the full search algorithm from the original paper (without my earlier simplifications):

    PrintMatches(Node *cur, string misspell, int rad)
    {
        if (cur == NULL)
            return;
        if ( e(cur->word, misspell) <= rad )
            cout << cur->word;
        if ( e(cur->word, misspell) - rad <= cur->editThreshold )
            PrintMatches(cur->left, misspell, rad);
        if ( e(cur->word, misspell) + rad >= cur->editThreshold )
            PrintMatches(cur->right, misspell, rad);
    }

In addition to spell checking, the Metric Tree can be used with virtually any application where the items obey metric rules! Pretty cool, huh?
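To see the full search at work, here is a self-contained sketch that hard-codes the example tree from the build walkthrough and collects matches into a vector instead of printing them. The names Node, mk, slideTree and findMatches are hypothetical helpers, and the tree shape is my reading of the slide figure rather than something the slides spell out.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Levenshtein edit distance (two-row dynamic programming).
int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j)
            cur[j] = std::min({prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1),
                               prev[j] + 1, cur[j - 1] + 1});
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

struct Node {
    std::string word;
    int editThreshold;
    std::unique_ptr<Node> left, right;
};

// Hypothetical convenience constructor for hand-building a tree.
std::unique_ptr<Node> mk(std::string word, int threshold,
                         std::unique_ptr<Node> left = nullptr,
                         std::unique_ptr<Node> right = nullptr) {
    auto n = std::make_unique<Node>();
    n->word = std::move(word);
    n->editThreshold = threshold;
    n->left = std::move(left);
    n->right = std::move(right);
    return n;
}

// The example tree: left children are within the threshold, right are beyond it.
std::unique_ptr<Node> slideTree() {
    return mk("rooster", 4,
              mk("mouse", 4, mk("oyster", 4, mk("roster", 0)), mk("hamster", 0)),
              mk("toad", 5, mk("goat", 5, mk("hippo", 0)), mk("chicken", 0)));
}

// Full search: visits a subtree whenever the search ball could overlap it.
void findMatches(const Node* cur, const std::string& misspell, int rad,
                 std::vector<std::string>& out) {
    if (cur == nullptr) return;
    int d = editDistance(cur->word, misspell);
    if (d <= rad) out.push_back(cur->word);
    if (d - rad <= cur->editThreshold) findMatches(cur->left.get(), misspell, rad, out);
    if (d + rad >= cur->editThreshold) findMatches(cur->right.get(), misspell, rad, out);
}
```

Searching this tree for “chomster” with a radius of 2 collects exactly one word, “hamster”. Unlike the simplified version, the full version also explores the right subtree of “rooster”, because 3 + 2 reaches past the threshold of 4, so a match could in principle be hiding there.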
Challenge: Space-efficient Set Membership

There are many problems where we want to maintain a set S of items and then check if a new item X is in the set, e.g.:

“Is ‘carey nachenberg’ a student at UCLA?”
“Is the phone number ‘424-750-7519’ known to be used by a terrorist cell?”

So, what data structures could you use for this?

Right! Both hash tables and binary search trees allow you to:
• Hold a bunch of items.
• Quickly search through them to see if they hold an item X.
So what’s the problem?

Well, binary search trees and hash tables are memory hogs!

But if I JUST want to do two things:
• Add new items to the set
• Check if an item was previously added to a set
I can actually create a much more memory efficient data structure!

In other words, if I never need to:
• Print the items of the set (after they’ve been added).
• Enumerate each value in the set.
• Erase items from the set.
Then we can do much better than our classic data structures!
But first… A hash* primer

A hash function is a function, y=f(x), that takes an input x (like a string) and returns an output number y for that input.

The ideal hash function returns entirely different values for each different input, even if two inputs are almost identical:

    int y, z;
    y = idealHashFunction("carey");
    cout << y;
    z = idealHashFunction("cArey");
    cout << z;

So even though these two strings are almost identical, a good hash function might return y=92629 and z=152.

* Not that kind of hash.
Hash Functions

Here’s a not-so-good hash function.

    int hashFunc(const string &name)
    {
        int i, total=0;
        for (i=0;i<name.length(); i++)
            total = total + name[i];
        return(total);
    }

Can anyone figure out why?

Right! Because inputs made of the same letters in a different order produce the same output:

    int y, z;
    y = hashFunc("bat");
    z = hashFunc("tab");   // y == z!!!! BAD!

How can we fix this? By changing our function! Replace the line inside the loop with:

    total = total + (name[i] * (i+1));

That’s a little better, although not great…
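Here is a small sketch that puts the two versions side by side so the difference can be checked. The names weakHash and betterHash are mine (the slides call both hashFunc), and I’ve switched to unsigned arithmetic so that long inputs wrap around instead of overflowing a signed int.

```cpp
#include <cstddef>
#include <string>

// Order-insensitive: just adds up the character codes.
unsigned weakHash(const std::string& name) {
    unsigned total = 0;
    for (unsigned char c : name)
        total += c;
    return total;
}

// Slightly better: weights each character by its (1-based) position.
unsigned betterHash(const std::string& name) {
    unsigned total = 0;
    for (std::size_t i = 0; i < name.size(); ++i)
        total += static_cast<unsigned char>(name[i]) * static_cast<unsigned>(i + 1);
    return total;
}
```

With ASCII input, weakHash gives 311 for both “bat” and “tab”, while betterHash gives 640 for “bat” and 604 for “tab”. The outputs are still close together and easy to collide on purpose, which is why this is only “a little better”.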
A Better Hash Function

The CRC or Cyclic Redundancy Check algorithm is an excellent hash function. This function was designed to check network packets for corruption. We won’t go into CRC’s details, but it’s a perfectly fine hashing algorithm…

Ok, so we have a good hash function, now what?
A Simple Set Membership Algorithm

Many hash functions take a seed (initialization) value as a parameter. Here’s how it might be used:

    unsigned CRC(unsigned seed, string &s)
    {
        unsigned crc = seed;
        for (int i=0;i<s.length();i++)
            crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
        return(crc);
    }

Typically you’d use a seed value of 0xFFFFFFFF with CRC. But you can change the seed if you like; this results in a (much) different hash value, even for the same input!

Imagine that I know I want to store up to 1 million items in my set… I could create an array of say… 100 million bits. And then do the following…

    class SimpleSet
    {
    public:
        …
        void insertItem(string &name)
        {
            // unsigned, so a large hash value can't turn into a negative slot
            unsigned slot = CRC(SEED, name);
            slot = slot % 100000000;
            m_arr[slot] = 1;
        }

        bool isItemInSet(string &name)
        {
            unsigned slot = CRC(SEED, name);
            slot = slot % 100000000;
            if (m_arr[slot] == 1)
                return(true);
            else
                return(false);
        }

    private:
        BitArray m_arr[100000000];
    };

    main()
    {
        SimpleSet s;
        s.insertItem("Carey");
        s.insertItem("Flint");
        if (s.isItemInSet("Flint") == true)
            cout << "Flint's in my set!";
    }

[Figure: “Carey” hashes to 3000012131, i.e. slot 12131, and “Flint” hashes to slot 9721; both bits are set to 1, so looking up “Flint” finds the 1 in slot 9721.]
A Simple Set Membership Algorithm

Ok, so what’s the problem with our SimpleSet?

    main()
    {
        SimpleSet coolPeople;
        coolPeople.insertItem("Carey");
        if (coolPeople.isItemInSet("Paul"))
            cout << "Paul Agbabian is cool!";
    }

Right! There’s a chance of collisions! What if two names happen to hash right to the same slot?

[Figure: “Carey” hashes to 3000012131 and “Paul” hashes to 11000012131; both map to slot 12131, so the lookup for “Paul” finds the bit that “Carey” set.]
A Simple Set Membership Algorithm

Ack! If we put 1 million items in our 100 million entry array… we’ll have a collision rate of about 1%! That is, roughly 1 lookup in 100 for an item we never added will land on a bit that some other item already set.

Actually, depending on your requirements, that might not be so bad…
A Simple Set Membership Algorithm Our simple set can hold about 1M items in just 12.5MB of memory! While it does have some false-positives, it’s much smaller than a hash table or binary search tree… But we’ve got to be able to do better… Right? Right! That’s where the Bloom Filter comes in! The Bloom Filter was invented by Burton Bloom in 1970. Let’s take a look!
The Bloom Filter

In a Bloom Filter, we use an array of bits just like our original algorithm! But instead of just using 1 hash function and setting just one bit for each insertion… We use K hash functions, compute K hash values and set K bits!

    const int K = 4;

    class BloomFilter
    {
    public:
        …
        void insertItem(string &name)
        {
            for (int i=0;i< K ;i++)
            {
                unsigned slot = CRC( i , name);
                slot = slot % 100000000;
                m_arr[slot] = 1;
            }
        }

    private:
        BitArray m_arr[100000000];
    };

We’ll see how K is chosen in a bit. It’s a constant and its value is computed from:
• The max # of items you want to add.
• The size of the array.
• Your desired false positive rate.

Notice that each time we call the CRC function, it starts with a different seed value:

    unsigned CRC(unsigned seed, string &s)
    {
        unsigned crc = seed;
        for (int i=0;i<s.length();i++)
            crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
        return(crc);
    }

(Passing K different seed values is the same as using K different hash functions…)

    main()
    {
        BloomFilter coolPeople;
        coolPeople.insertItem("Preston");
    }

[Figure: with K = 4, “Preston” hashes to 9000022531, 79929, 9197 and 3000000013, which set the bits in slots 22531, 79929, 9197 and 13.]
The Bloom Filter

Now to search, we do the same thing!

    bool isItemInSet(string &name)
    {
        for (int i=0;i< K ;i++)
        {
            unsigned slot = CRC( i , name);
            slot = slot % 100000000;
            if (m_arr[slot] == 0)
                return(false);
        }
        return(true);
    }

Note: We only say an item is a member of the set if all K bits are set to 1.
Note: If any bit that we check is 0, then we have a miss…

    main()
    {
        BloomFilter coolPeople;
        coolPeople.insertItem("Preston");
        if (coolPeople.isItemInSet("Carey"))
            cout << "I figured…";
    }
The Bloom Filter

Ok, so what’s the big deal? All we’re doing is checking K bits instead of 1?!!?

Well, it turns out that this dramatically reduces the false positive rate!

Ok… So the only questions are, how do we choose:
• The size of our bit-array?
• The value of K?
Let’s see!
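Since CONST1, CONST2 and crcTable are never defined in the slides, here is a runnable sketch of the same idea with a stand-in hash. The seeded FNV-1a function below is my substitution for the seeded CRC, std::vector<bool> stands in for BitArray, and the constructor parameters are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for the seeded CRC (assumption): FNV-1a with the seed mixed
// into the starting value, so each seed acts like a different hash function.
std::uint32_t seededHash(std::uint32_t seed, const std::string& s) {
    std::uint32_t h = 2166136261u ^ (seed * 0x9E3779B9u);
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

class BloomFilter {
public:
    BloomFilter(std::size_t numBits, unsigned numHashes)
        : bits_(numBits, false), k_(numHashes) {}

    // Set the K bits selected by the K seeded hashes.
    void insertItem(const std::string& name) {
        for (unsigned i = 0; i < k_; ++i)
            bits_[seededHash(i, name) % bits_.size()] = true;
    }

    // "Maybe present" only if all K bits are set; any 0 bit is a definite miss.
    bool isItemInSet(const std::string& name) const {
        for (unsigned i = 0; i < k_; ++i)
            if (!bits_[seededHash(i, name) % bits_.size()])
                return false;
        return true;
    }

private:
    std::vector<bool> bits_;
    unsigned k_;
};
```

A filter built with K = 1 behaves like the SimpleSet from earlier. Whatever K is, an item that was inserted is always reported as present; only the “yes” answers for items that were never inserted can be wrong.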
The Bloom Filter

If you want to store N items in your Bloom Filter… And you want a false positive rate of F (written as a fraction, e.g. 0.001 for .1%)... You’ll want to have M bits in your bit array:

$M = N \cdot \frac{\log(F)}{\log(0.6185)}$

And you’ll want to use K different hash functions:

$K = 0.7 \cdot \frac{M}{N}$

Let’s see some stats!

    To store N items   with this FP rate   use M bits (bytes)     and K hash fns
    1M                 .1%                 14.4M bits (1.79MB)    10
    100M               .001%               2.4B bits (299MB)      17
    100M               .00001%             3.4B bits (419MB)      23

Now you’ve got to admit, that’s pretty efficient!

Of course, unlike a hash table, there is some chance of having a false positive… But for many projects, this is not an issue, especially if you can guarantee that the FP rate stays below a certain level!

Now that’s COOL! And you’ve (hopefully) never heard about it!
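The two sizing formulas are easy to turn into code. This sketch assumes F is passed as a fraction (0.001 rather than .1%), rounds M up to a whole number of bits and K to the nearest integer; BloomParams and bloomParams are names I made up for illustration.

```cpp
#include <cmath>
#include <cstdint>

struct BloomParams {
    std::uint64_t bits;    // M: size of the bit array
    unsigned hashes;       // K: number of hash functions
};

// M = N * log(F) / log(0.6185),  K = 0.7 * M / N
BloomParams bloomParams(std::uint64_t numItems, double falsePositiveRate) {
    double bitsPerItem = std::log(falsePositiveRate) / std::log(0.6185);
    BloomParams p;
    p.bits = static_cast<std::uint64_t>(
        std::ceil(bitsPerItem * static_cast<double>(numItems)));
    p.hashes = static_cast<unsigned>(std::lround(0.7 * bitsPerItem));
    return p;
}
```

For 1 million items at a .1% false positive rate this comes out to roughly 14.4 bits per item and K = 10, in line with the first row of the table. Note that the bits per item depend only on F, not on N.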
Challenge: Constant-time searching for similar items (in a high-dimensional space) Problem: I’ve got a large collection C of existing web-pages, and I want to determine if a new web-page P is a close match to any pages in my existing collection. Obvious approach: I could iterate through all C of my existing pages and do a pair-wise comparison of page P to each page. But that’s inefficient! So how can we do it faster?
Answer: Use Locality Sensitive Hashing!

LSH has two operations:

1. Inserting items into the hash table: We add a bunch of items (e.g., web pages) into a locality-sensitive hash table.

2. Given an item, find closely-related items in the hash table: Once we have a filled locality-sensitive hash table, we want to search it for a new item and see if it contains anything similar.
LSH, Operation #1: Insertion

Here’s the Insertion algorithm:

Step #1: Take each input item (e.g., a web-page or an email) and convert it to a feature vector of size V.

What’s a feature vector? It’s a fixed-length array of floating point numbers that measure various attributes about each input item.

    const int V = 6;
    float fv[V];
    fv[0] = # of times the word "free" was used in the email
    fv[1] = # of times the word "viagra" was used in the email
    fv[2] = # of exclamation marks used in the email
    fv[3] = The length of the email in words
    fv[4] = The average length of each word found in the email
    fv[5] = The ratio of punctuation marks to letters in the email

(Another candidate feature: # of times the word “the” was used in the email.)

The items in the feature vector should be chosen to provide maximum differentiation between different categories of items (e.g., spam vs clean email)!
LSH, Operation #1: Insertion

Why compute a feature vector for each input item? The feature vector is a way of plotting each item into N-space (here, a space with V dimensions). In principle, items (e.g. emails) with similar content (i.e., similar feature vectors) should occupy similar regions of N-space.

Input #1: “Click here now for free viagra!!!!!”
fv1 = {1, 1, 5, 6, 4.17, 0.2}

Input #2: “Please come to the meeting at 5pm.”
fv2 = {0, 0, 1, 7, 3.71, 0.038}

[Figure: fv1 and fv2 plotted as two points far apart in feature space.]
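As a rough sketch, the six features could be extracted like this. The tokenization rules are my assumptions: words are split on whitespace, only letters count toward word length, and anything std::ispunct accepts counts as punctuation. The names lettersOnly and featureVector are illustrative.

```cpp
#include <array>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: lower-cases a token and keeps only its letters.
std::string lettersOnly(const std::string& token) {
    std::string out;
    for (unsigned char c : token)
        if (std::isalpha(c))
            out += static_cast<char>(std::tolower(c));
    return out;
}

// Builds the 6-element feature vector described above for one email.
std::array<float, 6> featureVector(const std::string& email) {
    std::array<float, 6> fv{};                  // all features start at 0
    int words = 0, letters = 0, punct = 0;
    std::istringstream in(email);
    std::string token;
    while (in >> token) {                       // whitespace-separated tokens
        for (unsigned char c : token) {
            if (std::ispunct(c)) ++punct;
            if (c == '!') fv[2] += 1;           // exclamation marks
        }
        std::string word = lettersOnly(token);
        if (word.empty()) continue;             // token contained no letters
        ++words;
        letters += static_cast<int>(word.size());
        if (word == "free") fv[0] += 1;
        if (word == "viagra") fv[1] += 1;
    }
    fv[3] = static_cast<float>(words);
    fv[4] = words ? static_cast<float>(letters) / words : 0.0f;     // avg word length
    fv[5] = letters ? static_cast<float>(punct) / letters : 0.0f;   // punctuation : letters
    return fv;
}
```

For “Click here now for free viagra!!!!!” this produces approximately {1, 1, 5, 6, 4.17, 0.2}: 6 words, 25 letters and 5 punctuation marks.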
LSH, Operation #1: Insertion

Step #2: Once you have a feature vector for each of your items, you determine the size of your hash table.

“I’m going to need to hold 100 million email feature vectors, so I’ll want an open hash table of size N = 1 million”

Note: N must be a power of 2, e.g., 65536, or 1,048,576

Wait! Why is our hash table smaller than the # of items we want to store? Because we want to put related items in the same bucket/slot of the table!

Step #3: Next compute the number of bits B required to represent N in binary. If N is 1 million (more precisely, 2^20 = 1,048,576), B will be log2(N), or 20.
LSH, Operation #1: Insertion

Step #4: Now, create B (e.g., 20) RANDOM feature vectors that are the same dimension as your input feature vectors.

    R1  = {.277, .891, 3, .32, 5.89, .136}
    R2  = {2.143, .073, 0.3, 4.9, .58, .252}
    …
    R19 = {.8, .425, 6.43, 5.6, .197, 1.43}
    R20 = {1.47, .256, 4.15, 5.6, .437, .075}
LSH, Operation #1: Insertion

What are these B random vectors for? Each of the B random vectors defines a hyper-plane in N-space! (Each hyper-plane is perpendicular to its random vector.)

    R1 = {1, 0, 1}
    R2 = {0, 0, 3}
    R3 = {0, 2.5, 0}

If we have B such random vectors, we essentially chop up N-space with B possibly overlapping slices!

So in our example, we’d have B=20 hyper-planes chopping up our V=6 dimensional space. (Chopping it up into 2^20 different regions!)
LSH, Operation #1: Insertion

Ok, let’s consider a single random vector, R1, and its hyper-plane for now. Now let’s consider a second vector, v1.

If the tips of those two vectors are on the same side of R1’s hyper-plane, then the dot-product of the two vectors will be positive: R1 · v1 > 0

On the other hand, if the tips of two vectors R1 and v2 are on opposite sides of R1’s hyper-plane, then the dot-product of the two vectors will be negative: R1 · v2 < 0

So this is useful: if we compute the dot product of two vectors R and v, the sign tells us which side of R’s hyper-plane v lies on. Two vectors that are close to each other in N-space will usually land on the same side.
LSH, Operation #1: Insertion

Step #5: Create an empty open hash table with 2^B buckets (e.g. 2^20 = 1M). Let’s label each bucket’s # using binary rather than decimal numbers. (You’ll see why soon.)

    000…0000
    000…0001
    000…0010
    000…0011
    …
    111…1110
    111…1111

Step #6: For each item we want to add to our hash table… Take the feature vector for the item... And dot-product multiply it by every one of our B random-valued vectors… Now convert every positive dot-product to a 1. And convert every negative dot-product into a 0.

“Click here now for free viagra!!!!!” = {1, 1, 5, 6, 4.17, 0.2}

    Random vector                                Dot product   Bit   The item is on the…
    R1  = {.277, .891, 3, .32, 5.89, .136}       -3.25         0     Opp. side of R1
    R2  = {2.143, .073, 0.3, 4.9, .58, .252}     -1.73         0     Opp. side of R2
    …                                            …             …     …
    R19 = {.8, .425, 6.43, 5.6, .197, 1.43}      .18           1     Same side as R19
    R20 = {1.47, .256, 4.15, 5.6, .437, .075}    5.24          1     Same side as R20

This basically tells us whether our feature vector is on the same side or the opposite side of the hyper-plane of every one of our random vectors.

And if we concatenate the 1s and 0s, this gives us a B-digit (e.g., 20 digit) binary number. Which we can use to compute a bucket number in our hash table and store our item!
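Step #6 boils down to a handful of dot products and bit shifts. In this sketch the first random vector supplies the most significant bit and a dot product of exactly 0 is treated as a 0 bit; both are my assumptions, since the slides don’t say. For negative dot products to occur at all, the random vectors (or the feature vectors) need components of both signs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec = std::vector<double>;

// Plain dot product; assumes both vectors have the same length.
double dot(const Vec& a, const Vec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// One bit per random vector: 1 if the feature vector is on the same side
// of that vector's hyper-plane (positive dot product), else 0.
// Assumes at most 32 random vectors.
std::uint32_t bucketIndex(const Vec& fv, const std::vector<Vec>& randomVectors) {
    std::uint32_t bucket = 0;
    for (const Vec& r : randomVectors)
        bucket = (bucket << 1) | (dot(r, fv) > 0.0 ? 1u : 0u);
    return bucket;
}
```

Using the three small vectors R1 = {1,0,1}, R2 = {0,0,3}, R3 = {0,2.5,0} from the earlier hyper-plane slide, the vector {1, -1, 2} has dot products 3, 6 and -2.5, so its bits are 110 and it lands in bucket 6.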
LSH, Operation #1: Insertion

Basically, every item in bucket 000…0000 will be on the opposite sides of the hyper-planes of all the random vectors.

And every item in bucket 111…1111 will be on the same side of the hyper-planes of all the random vectors.

And items in bucket 000…0001 will be on the same side as R20, but the opposite side of R1, R2… R19.

So each bucket essentially represents one of the 2^20 different regions of N-space, as divided by the 20 random hyper-plane slices.
LSH, Operation #2: Searching

Searching for closely-related items is the same as inserting!

Step #1: Compute the feature vector for your item.

Step #2: Dot-product multiply this vector by your B random vectors.

Step #3: Convert all positive dot-products to 1, and all negative dot-products to 0.

Step #4: Use the concatenated binary number to pick a bucket in your hash table.

And voilà, you’ve located similar feature vectors/items!
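Putting insertion and searching together, a single LSH table might look roughly like this. Several choices are assumptions on my part: the random vectors are drawn from a zero-mean normal distribution (so dot products come out with both signs), buckets live in a std::unordered_map rather than a 2^B-entry array, and the class and method names (LshTable, insert, findSimilar) are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

using Vec = std::vector<double>;

class LshTable {
public:
    // dims = V (feature vector size), numBits = B (at most 32 here).
    LshTable(std::size_t dims, unsigned numBits, unsigned seed) {
        std::mt19937 rng(seed);
        std::normal_distribution<double> gauss(0.0, 1.0);
        planes_.assign(numBits, Vec(dims));
        for (Vec& r : planes_)
            for (double& x : r)
                x = gauss(rng);                 // one random vector per bit
    }

    // Operation #1: store the item in the bucket its feature vector maps to.
    void insert(const Vec& fv, const std::string& item) {
        buckets_[bucketOf(fv)].push_back(item);
    }

    // Operation #2: return everything stored in the query's bucket.
    std::vector<std::string> findSimilar(const Vec& fv) const {
        auto it = buckets_.find(bucketOf(fv));
        return it == buckets_.end() ? std::vector<std::string>{} : it->second;
    }

private:
    std::uint32_t bucketOf(const Vec& fv) const {
        std::uint32_t bucket = 0;
        for (const Vec& r : planes_) {
            double d = 0.0;
            for (std::size_t i = 0; i < fv.size(); ++i)
                d += r[i] * fv[i];
            bucket = (bucket << 1) | (d > 0.0 ? 1u : 0u);
        }
        return bucket;
    }

    std::vector<Vec> planes_;
    std::unordered_map<std::uint32_t, std::vector<std::string>> buckets_;
};
```

A query with the same feature vector always finds the stored item, and so does a query that points in exactly the same direction (e.g. the same vector doubled), since every dot product keeps its sign. Vectors that are merely close usually share a bucket, but not always, which is what the next slide is about.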
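LSH, One Last Point…

Typically, we don’t just use one LSH hash table… But we use two or more, each with a different set of random vectors!

Why? Because two similar items can still end up on opposite sides of one of the random hyper-planes, which puts them in different buckets. With several independent tables, it’s much more likely that they share a bucket in at least one of them.

Then, when searching for a new vector V, we take the union of all buckets that V hashes to, from all hash tables, to obtain a list of matches.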