This project explores the use of LSH functions to cluster nouns efficiently. It introduces a fast search algorithm to reduce the complexity from n^2 to n, enhancing NLP applications. Learn about preserving cosine similarity and dimension reduction for high-speed word clustering experiments.
Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering Chen LUO, Mohamed AbdElRahman Instructor: Dr. Anshumali Shrivastava †Rice University November 7, 2019
Outline • Problem Background • Theory • LSH Function Preserving Cosine Similarity (Dimension Reduction) • Fast Search Algorithm (From $n^2$ to $n$) • Extending to NLP and Experiment
Motivation • What is the meaning of the word: “tezgüno” ?
Motivation • Consider the following Context: A bottle of tezgüno is on the table. Everyone likes tezgüno. Tezgüno makes you drunk. We make tezgüno out of corn. • Still not Sure?
Motivation • Consider the following context: A bottle of tezgüno is on the table. Everyone likes tezgüno. Tezgüno makes you drunk. We make tezgüno out of corn. A bottle of beer is on the table. Everyone likes beer. Beer makes you drunk. We make beer out of corn. “Beer” and “tezgüno” appear in similar contexts, so they likely have similar meanings.
Motivation • We want this process to be done automatically by a computer! • So, the main task here is to find similar nouns! Noun Clustering
Problem Background • Task: Clustering a very large set of nouns • n nodes (n nouns) • Each node has k features. (Details later) • Calculate the full similarity matrix • Complexity: $O(n^2 k)$ • Cannot be tolerated when n is very large!
Problem Background • By 2000 • Over 500 billion readily accessible words on the web • Now • A very, very, very large amount! • We want linear: $O(nk)$ • Hashing is a good way!
Outline • Problem Background • Theory • LSH Function Preserving Cosine Similarity (Dimension Reduction) • Fast Search Algorithm (From $n^2$ to $n$) • Extending to NLP and Experiment
LSH Function Preserving Cosine Similarity • The similarity measure between nodes is cosine similarity. • Cosine similarity: $\cos(\theta(u, v)) = \frac{u \cdot v}{|u|\,|v|}$ • We want to design a hash function that preserves this similarity.
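As a minimal sketch of the similarity measure named on this slide, the function below computes the cosine of the angle between two vectors using only the standard library. The function name `cosine_similarity` and the sample vectors are illustrative choices, not taken from the slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Orthogonal vectors have similarity 0; parallel vectors have similarity 1.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Each call costs O(k) for k features, which is why computing it for every pair of n nouns becomes expensive.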
LSH Function Preserving Cosine Similarity • From the paper, the hash function is defined as follows: $h_r(u) = 1$ if $r \cdot u \ge 0$, and $h_r(u) = 0$ if $r \cdot u < 0$ • Here, r is a spherically symmetric random vector of unit length.
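The hash function can be sketched as the sign of a dot product with a random vector. In this sketch the random vectors have independent Gaussian components, which gives a spherically symmetric direction; they are not normalized to unit length because only the sign of the dot product matters. The helper names `random_vector`, `hash_bit`, and `signature` are assumptions made for illustration.

```python
import random

def random_vector(k, rng):
    """A k-dimensional vector with a spherically symmetric random direction."""
    return [rng.gauss(0.0, 1.0) for _ in range(k)]

def hash_bit(r, u):
    """h_r(u): 1 if r . u >= 0, otherwise 0."""
    return 1 if sum(a * b for a, b in zip(r, u)) >= 0 else 0

def signature(u, random_vectors):
    """Bit stream for u: one bit per random vector."""
    return [hash_bit(r, u) for r in random_vectors]

rng = random.Random(0)          # fixed seed so the run is repeatable
d, k = 6, 4                     # d bits per signature, k features (illustrative)
rs = [random_vector(k, rng) for _ in range(d)]
print(signature([0.5, 1.0, 0.0, 2.0], rs))
```

All vectors must be hashed with the same d random vectors, otherwise their bit streams are not comparable.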
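LSH Function Preserving Cosine Similarity • Then, for vectors u and v, we have: $\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$ • Or: $\Pr[h_r(u) \ne h_r(v)] = \frac{\theta(u, v)}{\pi}$ • Directly proportional!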
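LSH Function Preserving Cosine Similarity • From the equation below: $\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}$ • We can have: $\theta(u, v) = (1 - \Pr[h_r(u) = h_r(v)])\,\pi$ • Then, we can estimate cosine similarity using: $\cos(\theta(u, v)) = \cos((1 - \Pr[h_r(u) = h_r(v)])\,\pi)$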
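The estimate can be checked empirically. The sketch below draws d Gaussian random vectors, measures the fraction of them on which the two hash bits agree, and plugs that fraction into the cosine formula. The function name `estimate_cosine`, the seed, and the choice of d are illustrative assumptions.

```python
import math
import random

def estimate_cosine(u, v, d, seed=0):
    """Estimate cos(theta(u, v)) from d random-hyperplane hash bits."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(d):
        r = [rng.gauss(0.0, 1.0) for _ in u]
        h_u = sum(a * b for a, b in zip(r, u)) >= 0
        h_v = sum(a * b for a, b in zip(r, v)) >= 0
        agree += (h_u == h_v)
    p_agree = agree / d                      # empirical Pr[h_r(u) = h_r(v)]
    return math.cos((1.0 - p_agree) * math.pi)

# The true cosine of [1, 0] and [1, 1] is sqrt(1/2); the estimate
# approaches it as d grows.
print(estimate_cosine([1.0, 0.0], [1.0, 1.0], d=20000))
```

A larger d lowers the estimation error but costs more time and longer bit streams.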
LSH Function Preserving Cosine Similarity • Each vector u can be represented by a bit stream of length d using the hash function (e.g. 001101 with d=6). • Then $\cos(\theta(u, v))$ is closely related to the Hamming distance between the bit streams of u and v.
LSH Function Preserving Cosine Similarity • For example: • Given: • Then: • So, the Hamming Distance:
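A small sketch of how a Hamming distance turns into a cosine estimate: the fraction of differing bits approximates $\theta(u, v)/\pi$. The two bit streams below are hypothetical values chosen for illustration, and the function names are assumptions.

```python
import math

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit streams differ."""
    if len(a) != len(b):
        raise ValueError("bit streams must have the same length")
    return sum(x != y for x, y in zip(a, b))

def cosine_from_signatures(sig_u, sig_v):
    """Estimate cos(theta) as cos(pi * hamming / d)."""
    d = len(sig_u)
    return math.cos(math.pi * hamming_distance(sig_u, sig_v) / d)

sig_u = "001101"   # hypothetical signature of u, d = 6
sig_v = "011100"   # hypothetical signature of v
print(hamming_distance(sig_u, sig_v))        # the streams differ in 2 positions
print(cosine_from_signatures(sig_u, sig_v))  # cos(pi * 2 / 6), about 0.5
```

Comparing two signatures costs O(d) bit comparisons instead of O(k) floating-point operations on the full feature vectors.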
LSH Function Preserving Cosine Similarity • Convert: • Finding the cosine distance of two vectors ⇒ • Finding the Hamming distance between bit streams • Dimension reduction! But the complexity is still $n^2$
Outline • Problem Background • Theory and Algorithm • LSH Function Preserving Cosine Similarity (Dimension Reduction) • Fast Search Algorithm (From $n^2$ to $n$) • Extending to NLP and Experiment
Fast Search Algorithm • Task: • Given the signature of each vector: • A bit stream (e.g. 1001) for each vector. • Find the nearest neighbors of each vector.
Fast Search Algorithm • Apply q random permutations to each bit stream. • We get q randomly permuted lists. • Complexity: O(n) • For example: • Given a bit stream , and two permutations (q=2). • Then
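The permutation step can be sketched as follows. Here each permutation is a shuffled list of bit positions produced with `random.shuffle`, which is a simplifying assumption of this sketch; the slides do not say how the permutations are generated. The bit streams and names are illustrative.

```python
import random

def random_permutations(d, q, seed=0):
    """q random orderings of the bit positions 0..d-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(q):
        perm = list(range(d))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def permute(bits, perm):
    """Reorder a bit stream: output position j takes input bit perm[j]."""
    return "".join(bits[i] for i in perm)

streams = ["1001", "1100", "0110"]            # hypothetical signatures
for perm in random_permutations(d=4, q=2):
    # The same permutation is applied to every stream in one list.
    print(perm, [permute(s, perm) for s in streams])
```

Because every stream in a list is reordered the same way, the Hamming distance between any two streams is unchanged; only their lexicographic order differs from list to list.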
Fast Search Algorithm • Sort each of the q randomly permuted lists, and find the nearest B neighbors in these sorted lists. • Complexity: O(n log n) • For example: • B=2, q=2 (constants)
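Putting the permutation and sorting steps together, a minimal sketch of the search might look like the following. It assumes shuffled position lists as permutations, a window of B elements on each side of every element in each sorted list, and a final ranking of the collected candidates by their Hamming distance; the function names and the toy signatures are hypothetical.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def fast_neighbors(signatures, q, B, seed=0):
    """Approximate nearest neighbors by Hamming distance.

    signatures: dict mapping a name to a bit string (all of length d).
    q: number of random permutations.
    B: number of sorted-list neighbors inspected on each side of an element.
    """
    rng = random.Random(seed)
    d = len(next(iter(signatures.values())))
    candidates = {name: set() for name in signatures}
    for _ in range(q):
        perm = list(range(d))
        rng.shuffle(perm)
        # Permute every bit stream with the same permutation, then sort.
        ordered = sorted(
            ("".join(sig[i] for i in perm), name)
            for name, sig in signatures.items()
        )
        for pos, (_, name) in enumerate(ordered):
            window = ordered[max(0, pos - B): pos + B + 1]
            candidates[name].update(o for _, o in window if o != name)
    # Rank each element's candidates by Hamming distance on the signatures.
    return {
        name: sorted(found, key=lambda o: (hamming(signatures[name], signatures[o]), o))
        for name, found in candidates.items()
    }

signatures = {"a": "00000000", "b": "00000001", "c": "11111111", "d": "11111110"}
print(fast_neighbors(signatures, q=2, B=1))
```

Each element is compared with at most 2·B·q candidates rather than with all n − 1 others, so the comparison work grows linearly in n once q and B are fixed; the sorting step contributes the O(n log n) term.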
Question • What is the Hamming distance between the two bit streams A=[0011000] and B=[1111001]? • Ans. Hamming(A,B)=3 • Suppose we have two 2-dimensional vectors u=[1,0], v=[0,1], and r is a spherically symmetric random vector. Then what is the value of $\Pr[h_r(u) = h_r(v)]$?
Outline • Problem Background • Theory and Algorithm • LSH Function Preserving Cosine Similarity (Dimension Reduction) • Fast Search Algorithm (From $n^2$ to $n$) • Extending to NLP and Experiment
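A worked answer to the second question, assuming it asks for the collision probability $\Pr[h_r(u) = h_r(v)]$ used earlier in the deck: the two vectors are orthogonal, so the angle between them is $\pi/2$.

```latex
\theta(u, v) = \arccos\frac{u \cdot v}{|u|\,|v|}
             = \arccos\frac{1 \cdot 0 + 0 \cdot 1}{1 \cdot 1}
             = \frac{\pi}{2},
\qquad
\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi}
                     = 1 - \frac{\pi/2}{\pi}
                     = \frac{1}{2}.
```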
Calculation of Feature Vectors • Mutual information vector: used to measure the association strength between two words. Here, it is used between a word (e) and a feature (f): $mi(e, f) = \log \frac{c_{ef}/N}{\left(\sum_{i=1}^{n} c_{if}/N\right) \times \left(\sum_{j=1}^{k} c_{ej}/N\right)}$ • $c_{ef}$ is the number of times word e occurred in context f • N is the total frequency count of all features of all words • n is the number of words, and k is the number of features • For each noun, we have $MI(e) = (mi(e, f_1), mi(e, f_2), \dots, mi(e, f_k))$
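A minimal sketch of the mutual information computation from a table of co-occurrence counts. The count table below is hypothetical, the base-10 logarithm is an assumption, and returning 0.0 for a pair that never co-occurs is a convention chosen for this sketch rather than something stated on the slide.

```python
import math

def mutual_information(counts, word, feature):
    """mi(e, f) from a dict mapping (word, feature) to a co-occurrence count."""
    N = sum(counts.values())                       # total count over all pairs
    c_ef = counts.get((word, feature), 0)
    if c_ef == 0:
        return 0.0                                 # assumed convention for unseen pairs
    word_total = sum(c for (e, _), c in counts.items() if e == word)
    feature_total = sum(c for (_, f), c in counts.items() if f == feature)
    return math.log10((c_ef / N) / ((feature_total / N) * (word_total / N)))

# Hypothetical counts for two nouns and two context features.
counts = {
    ("beer", "drink"): 3, ("beer", "bottle"): 1,
    ("wine", "drink"): 2, ("wine", "bottle"): 2,
}
features = ["drink", "bottle"]
mi_vector = [mutual_information(counts, "beer", f) for f in features]
print(mi_vector)
```

The resulting vector, one entry per feature, is what gets hashed into a bit stream in the earlier steps.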
Example • Soccer quotes from the internet: A soccer team is like a beautiful woman. When you do not tell her, she forgets she is beautiful. (Arsène Wenger) In his life, a man can change wives, political parties or religions but he cannot change his favorite soccer team. (Eduardo Hughes Galeano) “Don’t change your wife” • Removing stop words, and identifying nouns: Soccer team like beautiful woman tell forgets beautiful. Life man change wives political parties religions change favorite soccer team. • Features: for each noun, the 2 words to its left and the 2 words to its right • All Nouns (5): {Soccer team, woman, wives, parties, religions} • All Features (11): {like, beautiful, tell, forgets, man, change, political, parties, wives, religions, favorite}
Example Soccer team like beautiful woman tell forgets beautiful. Life man change wives political parties religions change favorite soccer team.
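The feature extraction for this example can be sketched as a window count over the two filtered sentences. The sketch assumes that "soccer team" is treated as a single token, that windows do not cross the sentence boundary, and that all tokens are lowercased; the function name `count_features` is hypothetical.

```python
from collections import Counter

sentences = [
    ["soccer team", "like", "beautiful", "woman", "tell", "forgets", "beautiful"],
    ["life", "man", "change", "wives", "political", "parties", "religions",
     "change", "favorite", "soccer team"],
]
nouns = {"soccer team", "woman", "wives", "parties", "religions"}

def count_features(sentences, nouns, window=2):
    """Count (noun, feature) pairs, where features are the words within
    `window` positions to the left and right of each noun occurrence."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token not in nouns:
                continue
            left = tokens[max(0, i - window): i]
            right = tokens[i + 1: i + 1 + window]
            for feature in left + right:
                counts[(token, feature)] += 1
    return counts

counts = count_features(sentences, nouns)
N = sum(counts.values())
word_total = sum(c for (e, _), c in counts.items() if e == "soccer team")
feature_total = sum(c for (_, f), c in counts.items() if f == "like")
features = {f for _, f in counts}
print(N, counts[("soccer team", "like")], word_total, feature_total, len(features))
# prints: 20 1 4 2 11
```

Under these assumptions the counts give a total of 20, a single co-occurrence of "soccer team" with "like", 4 feature occurrences for "soccer team", 2 occurrences of the feature "like", and 11 distinct features, matching the nouns and features listed for the example.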
Example • MI(soccer team) = (mi(soccer team, like), mi(soccer team, beautiful), …, mi(soccer team, favorite)) • mi(soccer team, like) = log [(1/20) / ((2/20) × (4/20))] = log 2.5 ≈ 0.4 (base-10 logarithm)
Evaluation: LSH Function • d↑, Error↓, Time↑ • Randomly choose 100 nouns (vectors) from the web collection (using the web corpus dataset) • (i) is for all pairs with CS(real, i) >= 0.15
Evaluation: Fast Hamming Distance • Randomly choose 100 nouns from the web corpus dataset. For each, calculate all pairwise Hamming distances exhaustively. Keep those >= 0.15 as the “gold standard test set”. • Obtain a list of bit streams for all nouns from the web corpus dataset for the Hamming distance calculation. • Compare the top N elements retrieved by the fast Hamming distance search against those in the gold standard test set (calculate the percentage overlap).
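One plausible reading of "percentage overlap" is the share of the gold-standard top N that also appears in the retrieved top N. The sketch below implements that reading; the exact definition used in the experiments is not given on the slide, and the word lists are hypothetical.

```python
def percentage_overlap(retrieved, gold, top_n):
    """Percentage of the gold top-N items that also appear in the retrieved top-N."""
    top_retrieved = set(retrieved[:top_n])
    top_gold = set(gold[:top_n])
    if not top_gold:
        return 0.0
    return 100.0 * len(top_retrieved & top_gold) / len(top_gold)

retrieved = ["beer", "wine", "ale", "tea"]    # hypothetical fast-search ranking
gold = ["beer", "ale", "cider", "wine"]       # hypothetical exhaustive ranking
print(percentage_overlap(retrieved, gold, top_n=3))   # 2 of 3 overlap
```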
Evaluation: Fast Hamming Distance B↑, q↑, Accuracy↑ , Search Time↑
Evaluation: Final Similarity Lists • Using the newspaper corpus. • Randomly choose 100 nouns, calculate the top N elements using the randomized algorithm, compare them with those produced by the Pantel and Lin (2002) system, and calculate the percentage overlap.
Summary • Using random vectors, we represent each noun as a bit stream of length d << number of features, which results in dimensionality reduction. • The proposed method reduces the running time from quadratic in n to kn, with a similarity accuracy of ~70%.