Scalable Techniques for Clustering the Web Taher H. Haveliwala Aristides Gionis Piotr Indyk Stanford University {taherh,gionis,indyk}@cs.stanford.edu
Project Goals • Generate fine-grained clustering of the web based on topic • Similarity search (“What’s Related?”) • Two major issues: • Develop appropriate notion of similarity • Scale up to millions of documents
Prior Work • Offline: detecting replicas • [Broder-Glassman-Manasse-Zweig’97] • [Shivakumar-G. Molina’98] • Online: finding/grouping related pages • [Zamir-Etzioni’98] • [Manjara] • Link based methods • [Dean-Henzinger’99, Clever]
Prior Work: Online, Link • Online: cluster results of search queries • does not work for clustering the entire web offline • Link based approaches are limited • What about relatively new pages? • What about less popular pages?
Prior Work: Copy detection • Designed to detect duplicates/near-replicas • Does not scale when the notion of similarity is relaxed to ‘topical’ similarity • Creation of the document-document similarity matrix is the core challenge: the join bottleneck
Pairwise similarity • Consider relation Docs(id, sentence) • Must compute: SELECT D1.id, D2.id FROM Docs D1, Docs D2 WHERE D1.sentence = D2.sentence GROUP BY D1.id, D2.id HAVING COUNT(*) > threshold • What if we change ‘sentence’ to ‘word’?
Pairwise similarity • Relation Docs(id, word) • Compute: SELECT D1.id, D2.id FROM Docs D1, Docs D2 WHERE D1.word = D2.word GROUP BY D1.id, D2.id HAVING COUNT(*) > threshold • For 25M urls, this join could take months to compute!
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Document representation • Bag of words model • Bag for each page p consists of • Title of p • Anchor text of all pages pointing to p (Also include window of words around anchors)
Bag Generation • [Diagram: anchor windows for http://www.foobar.com/ collected from linking pages such as http://www.music.com/ and http://www.baz.com/, e.g. “...click here for a great music page...”, “MusicWorld”, “Enter our site”, “...this music is great...”; surrounding text unrelated to the links (“...click here for great sports page...”, “...what I had for lunch...”) falls outside the windows]
Bag Generation • Union of ‘anchor windows’ is a concise description of a page. • Note that using anchor windows, we can cluster more documents than we’ve crawled: • In general, a set of N documents refers to cN urls
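As an illustration of bag generation, here is a minimal Python sketch; the function names, the window width of 8 words, and the tokenizer are assumptions for this example, not details from the talk.

```python
import re

def anchor_window(text, anchor, width=8):
    # Return the anchor text plus up to `width` words on each side
    # (the "window of words around anchors" from the slide above).
    words = re.findall(r"\w+", text.lower())
    anchor_words = re.findall(r"\w+", anchor.lower())
    n = len(anchor_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == anchor_words:
            return words[max(0, i - width):i + n + width]
    return anchor_words  # anchor not found verbatim: fall back to its own words

def build_bag(title, referring_pages):
    # referring_pages: (page_text, anchor_text) pairs for links pointing at the url.
    # The bag is the title's words plus every anchor window.
    bag = re.findall(r"\w+", title.lower())
    for text, anchor in referring_pages:
        bag.extend(anchor_window(text, anchor))
    return bag
```

Note that only the linking pages’ text is needed here, which is why urls that were never crawled can still be clustered.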
Standard IR • Remove stopwords (~ 750) • Remove high frequency & low frequency terms • Use stemming • Apply TFIDF scaling
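A minimal sketch of this preprocessing pipeline in Python (stemming is omitted, and the stopword list, frequency cutoffs, and TFIDF variant here are illustrative assumptions):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "for", "of", "and", "to", "in"}  # stand-in for the ~750-word list

def preprocess(bags, min_df=1, max_df_frac=0.9):
    # bags: {url: list of words}. Remove stopwords, drop terms whose
    # document frequency is too low or too high, then apply TFIDF scaling.
    n = len(bags)
    df = Counter()
    for words in bags.values():
        df.update({w for w in words if w not in STOPWORDS})
    keep = {t for t, c in df.items() if c >= min_df and c / n <= max_df_frac}
    vectors = {}
    for url, words in bags.items():
        tf = Counter(w for w in words if w in keep)
        vectors[url] = {t: f * math.log(n / df[t]) for t, f in tf.items()}
    return vectors
```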
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Similarity • Similarity metric for pages U1, U2, that were assigned bags B1, B2, respectively • sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2| • Threshold is set to 20%
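This is the Jaccard coefficient of the two bags, treated as sets; in Python:

```python
def jaccard(b1, b2):
    # sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|
    s1, s2 = set(b1), set(b2)
    return len(s1 & s2) / len(s1 | s2)
```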
Reality Check • Pages most similar to www.foodchannel.com: • www.epicurious.com/a_home/a00_home/home.html .37 • www.gourmetworld.com .36 • www.foodwine.com .325 • www.cuisinenet.com .3125 • www.kitchenlink.com .3125 • www.yumyum.com .3 • www.menusonline.com .3 • www.snap.com/directory/category/0,16,-324,00.html .2875 • www.ichef.com .2875 • www.home-canning.com .275
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Pair Generation • Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20% • Ignore all url pairs with sim < 20% • How do we avoid the join bottleneck?
Locality Sensitive Hashing • Idea: use a special kind of hashing • Locality Sensitive Hashing (LSH) provides a solution: • Min-wise hash functions [Broder’98] • LSH [Indyk, Motwani’98], [Cohen et al.’2000] • Properties: • Similar urls are hashed together w.h.p. • Dissimilar urls are not hashed together
Locality Sensitive Hashing • [Diagram: example urls music.com, opera.com, sing.com, sports.com, golf.com hashed into buckets, with similar urls colliding]
Hashing • Two steps • Min-hash (MH): a way to consistently sample words from bags • Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not
Step 1: Min-hash • Step 1: Generate m min-hash signatures for each url (m = 80) • For i = 1...m • Generate a random ordering hi on words • mhi(u) = argmin {hi(w) | w ∈ Bu} • Pr(mhi(u) = mhi(v)) = sim(u, v)
Step 1: Min-hash Round 1: ordering = [cat, dog, mouse, banana] Set A: {mouse, dog} MH-signature = dog Set B: {cat, mouse} MH-signature = cat
Step 1: Min-hash Round 2: ordering = [banana, mouse, cat, dog] Set A: {mouse, dog} MH-signature = mouse Set B: {cat, mouse} MH-signature = mouse
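The min-hash step above can be sketched in Python as follows (explicit random orderings over the whole vocabulary are used for clarity; a real implementation at this scale would use hash functions instead):

```python
import random

def minhash_signatures(bags, m=80, seed=0):
    # bags: {url: list of words}. Returns {url: list of m min-hash values}.
    rng = random.Random(seed)
    vocab = sorted({w for bag in bags.values() for w in bag})
    sigs = {u: [] for u in bags}
    for _ in range(m):
        # A random ordering h_i on words; mh_i(u) is the word of B_u
        # that comes first under this ordering.
        rank = {w: r for r, w in enumerate(rng.sample(vocab, len(vocab)))}
        for u, bag in bags.items():
            sigs[u].append(min(bag, key=lambda w: rank[w]))
    return sigs
```

The fraction of positions where two signatures agree is then an unbiased estimate of sim(u, v).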
Step 2: LSH • Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3) • For i = 1...l • Randomly select k min-hash indices and concatenate them to form i’th LSH signature
Step 2: LSH • Generate candidate pair if u and v have an LSH signature in common in any round • Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k = sim(u, v)^k
Step 2: LSH • Set A = {mouse, dog, horse, ant}: MH1 = horse, MH2 = mouse, MH3 = ant, MH4 = dog; LSH134 = horse-ant-dog, LSH234 = mouse-ant-dog • Set B = {cat, ice, shoe, mouse}: MH1 = cat, MH2 = mouse, MH3 = ice, MH4 = shoe; LSH134 = cat-ice-shoe, LSH234 = mouse-ice-shoe
Step 2: LSH • Bottom line, probability of collision: • 10% similarity → 0.1% • 1% similarity → 0.0001%
Step 2: LSH • [Diagram, Round 1: bucket sport-team-win contains sports.com, golf.com, party.com; bucket music-sound-play contains music.com, opera.com; bucket sing-music-ear contains sing.com]
Step 2: LSH • [Diagram, Round 2: bucket game-team-score contains sports.com, golf.com; bucket audio-music-note contains music.com, sing.com; bucket theater-luciano-sing contains opera.com]
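The rounds illustrated above can be sketched in Python, building on min-hash signatures computed as in Step 1 (function and parameter names are illustrative):

```python
import random
from collections import defaultdict

def lsh_candidate_pairs(sigs, l=125, k=3, seed=0):
    # sigs: {url: list of m min-hash values}. In each of l rounds, pick k
    # min-hash indices at random and concatenate those values into an LSH
    # signature; urls sharing a signature in any round become a candidate pair.
    rng = random.Random(seed)
    m = len(next(iter(sigs.values())))
    pairs = set()
    for _ in range(l):
        indices = rng.sample(range(m), k)
        buckets = defaultdict(list)
        for url, sig in sigs.items():
            buckets[tuple(sig[i] for i in indices)].append(url)
        for urls in buckets.values():
            for i in range(len(urls)):
                for j in range(i + 1, len(urls)):
                    pairs.add(tuple(sorted((urls[i], urls[j]))))
    return pairs
```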
Sort & Filter • Using all buckets from all LSH rounds, generate candidate pairs • Sort candidate pairs on first field • Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of their MH-signatures • Ready for “What’s Related?” queries...
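The filtering step can be sketched as follows, reusing the min-hash signatures to estimate similarity (a sketch under the m = 80 setup above; names are mine):

```python
def filter_candidates(pairs, sigs, threshold=0.20):
    # Keep (u, v) only if their min-hash signatures agree on at least
    # `threshold` of the positions, an unbiased estimate of sim(u, v).
    kept = []
    for u, v in sorted(pairs):  # sort on first field, as in the slide
        agree = sum(a == b for a, b in zip(sigs[u], sigs[v]))
        if agree / len(sigs[u]) >= threshold:
            kept.append((u, v))
    return kept
```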
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Clustering • The set of document pairs represents the document-document similarity matrix with 20% similarity threshold • Clustering algorithms • S-Link: connected components • C-Link: maximal cliques • Center: approximation to C-Link
Center • Scan through pairs (they are sorted on first component) • For each run [(u, v1), ... , (u, vn)] • if u is not marked • cluster = u + unmarked neighbors of u • mark u and all neighbors of u
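The Center pass above amounts to a single scan over the sorted pairs; a minimal Python sketch (the run-grouping via itertools.groupby is an implementation choice):

```python
from itertools import groupby

def center_clusters(sorted_pairs):
    # sorted_pairs: similarity pairs (u, v) sorted on the first component.
    # For each run [(u, v1), ..., (u, vn)]: if u is unmarked, u and its
    # unmarked neighbors form a cluster, then u and all neighbors are marked.
    marked = set()
    clusters = []
    for u, run in groupby(sorted_pairs, key=lambda p: p[0]):
        if u in marked:
            continue
        neighbors = [v for _, v in run]
        clusters.append([u] + [v for v in neighbors if v not in marked])
        marked.add(u)
        marked.update(neighbors)
    return clusters
```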
Results • 20 million urls on a Pentium-II 450
Sample Cluster • feynman.princeton.edu/~sondhi/205main.html • hep.physics.wisc.edu/wsmith/p202/p202syl.html • hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html • pdg.lbl.gov/mc_particle_id_contents.html • physics.ucsc.edu/courses/10.html • town.hall.org/places/SciTech/qmachine • www.as.ua.edu/physics/hetheory.html • www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html • www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html • www.phy.duke.edu/Courses/271/Synopsis.html • . . . (total of 27 urls) . . .
Ongoing/Future Work • Tune anchor-window length • Develop system to measure quality • What is ground truth? • How do you judge clustering of millions of pages?