Scalable Techniques for Clustering the Web Taher H. Haveliwala Aristides Gionis Piotr Indyk Stanford University {taherh,gionis,indyk}@cs.stanford.edu
Project Goals • Generate fine-grained clustering of the web based on topic • Similarity search (“What’s Related?”) • Two major issues: • Develop appropriate notion of similarity • Scale up to millions of documents
Prior Work • Offline: detecting replicas • [Broder-Glassman-Manasse-Zweig’97] • [Shivakumar-G. Molina’98] • Online: finding/grouping related pages • [Zamir-Etzioni’98] • [Manjara] • Link based methods • [Dean-Henzinger’99, Clever]
Prior Work: Online, Link • Online: cluster results of search queries • does not work for clustering the entire web offline • Link based approaches are limited • What about relatively new pages? • What about less popular pages?
Prior Work: Copy detection • Designed to detect duplicates/near-replicas • Does not scale when the notion of similarity is relaxed to ‘topical’ similarity • Creation of the document-document similarity matrix is the core challenge: the join bottleneck
Pairwise similarity • Consider relation Docs(id, sentence) • Must compute: SELECT D1.id, D2.id FROM Docs D1, Docs D2 WHERE D1.sentence = D2.sentence GROUP BY D1.id, D2.id HAVING COUNT(*) > threshold • What if we change ‘sentence’ to ‘word’?
Pairwise similarity • Relation Docs(id, word) • Compute: SELECT D1.id, D2.id FROM Docs D1, Docs D2 WHERE D1.word = D2.word GROUP BY D1.id, D2.id HAVING COUNT(*) > threshold • For 25M urls, this join could take months to compute!
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Document representation • Bag of words model • Bag for each page p consists of • Title of p • Anchor text of all pages pointing to p (Also include window of words around anchors)
Bag Generation • [Diagram: anchor windows for http://www.foobar.com/ collected from linking pages such as http://www.music.com/ and http://www.baz.com/, e.g. “...click here for a great music page...”, “MusicWorld”, “Enter our site”, “...this music is great...”; surrounding text unrelated to the links (“...click here for great sports page...”, “...what I had for lunch...”) falls outside the windows]
Bag Generation • Union of ‘anchor windows’ is a concise description of a page. • Note that using anchor windows, we can cluster more documents than we’ve crawled: • In general, a set of N documents refers to cN urls
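As an illustration of bag generation, here is a minimal Python sketch; the function names, the window width of 8 words, and the tokenizer are assumptions for this example, not details from the talk.

```python
import re

def anchor_window(text, anchor, width=8):
    # Return the anchor text plus up to `width` words on each side
    # (the "window of words around anchors" from the slide above).
    words = re.findall(r"\w+", text.lower())
    anchor_words = re.findall(r"\w+", anchor.lower())
    n = len(anchor_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == anchor_words:
            return words[max(0, i - width):i + n + width]
    return anchor_words  # anchor not found verbatim: fall back to its own words

def build_bag(title, referring_pages):
    # referring_pages: (page_text, anchor_text) pairs for links pointing at the url.
    # The bag is the title's words plus every anchor window.
    bag = re.findall(r"\w+", title.lower())
    for text, anchor in referring_pages:
        bag.extend(anchor_window(text, anchor))
    return bag
```

Note that only the linking pages’ text is needed here, which is why urls that were never crawled can still be clustered.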
Standard IR • Remove stopwords (~ 750) • Remove high frequency & low frequency terms • Use stemming • Apply TFIDF scaling
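A minimal sketch of this preprocessing pipeline in Python (stemming is omitted, and the stopword list, frequency cutoffs, and TFIDF variant here are illustrative assumptions):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "for", "of", "and", "to", "in"}  # stand-in for the ~750-word list

def preprocess(bags, min_df=1, max_df_frac=0.9):
    # bags: {url: list of words}. Remove stopwords, drop terms whose
    # document frequency is too low or too high, then apply TFIDF scaling.
    n = len(bags)
    df = Counter()
    for words in bags.values():
        df.update({w for w in words if w not in STOPWORDS})
    keep = {t for t, c in df.items() if c >= min_df and c / n <= max_df_frac}
    vectors = {}
    for url, words in bags.items():
        tf = Counter(w for w in words if w in keep)
        vectors[url] = {t: f * math.log(n / df[t]) for t, f in tf.items()}
    return vectors
```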
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Similarity • Similarity metric for pages U1, U2, that were assigned bags B1, B2, respectively • sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2| • Threshold is set to 20%
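This is the Jaccard coefficient of the two bags, treated as sets; in Python:

```python
def jaccard(b1, b2):
    # sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|
    s1, s2 = set(b1), set(b2)
    return len(s1 & s2) / len(s1 | s2)
```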
Reality Check • Pages most similar to www.foodchannel.com: • www.epicurious.com/a_home/a00_home/home.html .37 • www.gourmetworld.com .36 • www.foodwine.com .325 • www.cuisinenet.com .3125 • www.kitchenlink.com .3125 • www.yumyum.com .3 • www.menusonline.com .3 • www.snap.com/directory/category/0,16,-324,00.html .2875 • www.ichef.com .2875 • www.home-canning.com .275
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Pair Generation • Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20% • Ignore all url pairs with sim < 20% • How do we avoid the join bottleneck?
Locality Sensitive Hashing • Idea: use a special kind of hashing • Locality Sensitive Hashing (LSH) provides a solution: • Min-wise hash functions [Broder’98] • LSH [Indyk, Motwani’98], [Cohen et al.’2000] • Properties: • Similar urls are hashed together w.h.p. • Dissimilar urls are not hashed together
Locality Sensitive Hashing • [Diagram: example urls music.com, opera.com, sing.com, sports.com, golf.com hashed into buckets, with similar urls colliding]
Hashing • Two steps • Min-hash (MH): a way to consistently sample words from bags • Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not
Step 1: Min-hash • Step 1: Generate m min-hash signatures for each url (m = 80) • For i = 1...m • Generate a random ordering hi on words • mhi(u) = argmin {hi(w) | w ∈ Bu} • Pr(mhi(u) = mhi(v)) = sim(u, v)
Step 1: Min-hash Round 1: ordering = [cat, dog, mouse, banana] Set A: {mouse, dog} MH-signature = dog Set B: {cat, mouse} MH-signature = cat
Step 1: Min-hash Round 2: ordering = [banana, mouse, cat, dog] Set A: {mouse, dog} MH-signature = mouse Set B: {cat, mouse} MH-signature = mouse
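The min-hash step above can be sketched in Python as follows (explicit random orderings over the whole vocabulary are used for clarity; a real implementation at this scale would use hash functions instead):

```python
import random

def minhash_signatures(bags, m=80, seed=0):
    # bags: {url: list of words}. Returns {url: list of m min-hash values}.
    rng = random.Random(seed)
    vocab = sorted({w for bag in bags.values() for w in bag})
    sigs = {u: [] for u in bags}
    for _ in range(m):
        # A random ordering h_i on words; mh_i(u) is the word of B_u
        # that comes first under this ordering.
        rank = {w: r for r, w in enumerate(rng.sample(vocab, len(vocab)))}
        for u, bag in bags.items():
            sigs[u].append(min(bag, key=lambda w: rank[w]))
    return sigs
```

The fraction of positions where two signatures agree is then an unbiased estimate of sim(u, v).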
Step 2: LSH • Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3) • For i = 1...l • Randomly select k min-hash indices and concatenate them to form i’th LSH signature
Step 2: LSH • Generate candidate pair if u and v have an LSH signature in common in any round • Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k = sim(u, v)^k
Step 2: LSH • Set A = {mouse, dog, horse, ant}: MH1 = horse, MH2 = mouse, MH3 = ant, MH4 = dog; LSH134 = horse-ant-dog, LSH234 = mouse-ant-dog • Set B = {cat, ice, shoe, mouse}: MH1 = cat, MH2 = mouse, MH3 = ice, MH4 = shoe; LSH134 = cat-ice-shoe, LSH234 = mouse-ice-shoe
Step 2: LSH • Bottom line, probability of collision: • 10% similarity → 0.1% • 1% similarity → 0.0001%
Step 2: LSH • [Diagram, Round 1: bucket sport-team-win contains sports.com, golf.com, party.com; bucket music-sound-play contains music.com, opera.com; bucket sing-music-ear contains sing.com]
Step 2: LSH • [Diagram, Round 2: bucket game-team-score contains sports.com, golf.com; bucket audio-music-note contains music.com, sing.com; bucket theater-luciano-sing contains opera.com]
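The rounds illustrated above can be sketched in Python, building on min-hash signatures computed as in Step 1 (function and parameter names are illustrative):

```python
import random
from collections import defaultdict

def lsh_candidate_pairs(sigs, l=125, k=3, seed=0):
    # sigs: {url: list of m min-hash values}. In each of l rounds, pick k
    # min-hash indices at random and concatenate those values into an LSH
    # signature; urls sharing a signature in any round become a candidate pair.
    rng = random.Random(seed)
    m = len(next(iter(sigs.values())))
    pairs = set()
    for _ in range(l):
        indices = rng.sample(range(m), k)
        buckets = defaultdict(list)
        for url, sig in sigs.items():
            buckets[tuple(sig[i] for i in indices)].append(url)
        for urls in buckets.values():
            for i in range(len(urls)):
                for j in range(i + 1, len(urls)):
                    pairs.add(tuple(sorted((urls[i], urls[j]))))
    return pairs
```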
Sort & Filter • Using all buckets from all LSH rounds, generate candidate pairs • Sort candidate pairs on first field • Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of their MH-signatures • Ready for “What’s Related?” queries...
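The filtering step can be sketched as follows, reusing the min-hash signatures to estimate similarity (a sketch under the m = 80 setup above; names are mine):

```python
def filter_candidates(pairs, sigs, threshold=0.20):
    # Keep (u, v) only if their min-hash signatures agree on at least
    # `threshold` of the positions, an unbiased estimate of sim(u, v).
    kept = []
    for u, v in sorted(pairs):  # sort on first field, as in the slide
        agree = sum(a == b for a, b in zip(sigs[u], sigs[v]))
        if agree / len(sigs[u]) >= threshold:
            kept.append((u, v))
    return kept
```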
Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters
Clustering • The set of document pairs represents the document-document similarity matrix with 20% similarity threshold • Clustering algorithms • S-Link: connected components • C-Link: maximal cliques • Center: approximation to C-Link
Center • Scan through pairs (they are sorted on first component) • For each run [(u, v1), ... , (u, vn)] • if u is not marked • cluster = u + unmarked neighbors of u • mark u and all neighbors of u
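The Center pass above amounts to a single scan over the sorted pairs; a minimal Python sketch (the run-grouping via itertools.groupby is an implementation choice):

```python
from itertools import groupby

def center_clusters(sorted_pairs):
    # sorted_pairs: similarity pairs (u, v) sorted on the first component.
    # For each run [(u, v1), ..., (u, vn)]: if u is unmarked, u and its
    # unmarked neighbors form a cluster, then u and all neighbors are marked.
    marked = set()
    clusters = []
    for u, run in groupby(sorted_pairs, key=lambda p: p[0]):
        if u in marked:
            continue
        neighbors = [v for _, v in run]
        clusters.append([u] + [v for v in neighbors if v not in marked])
        marked.add(u)
        marked.update(neighbors)
    return clusters
```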
Results • 20 million urls on a Pentium-II 450
Sample Cluster • feynman.princeton.edu/~sondhi/205main.html • hep.physics.wisc.edu/wsmith/p202/p202syl.html • hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html • pdg.lbl.gov/mc_particle_id_contents.html • physics.ucsc.edu/courses/10.html • town.hall.org/places/SciTech/qmachine • www.as.ua.edu/physics/hetheory.html • www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html • www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html • www.phy.duke.edu/Courses/271/Synopsis.html • . . . (total of 27 urls) . . .
Ongoing/Future Work • Tune anchor-window length • Develop system to measure quality • What is ground truth? • How do you judge clustering of millions of pages?