Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
University of Maryland, College Park
Human Language Technology Center of Excellence and UMIACS CLIP Lab
ACL, June 2008

Abstract Problem
[Figure: a collection of documents is mapped to a matrix of pairwise similarity scores (sample values 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74).]
• Applications:
  • Clustering
  • Coreference resolution
  • “more-like-that” queries

Trivial Solution
• Load each vector O(N) times
• Load each term O(df_t^2) times
Goal: a scalable and efficient solution for large collections
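The trivial solution can be sketched as a loop over all document pairs. This is only an illustration: the function name `naive_pairwise`, the toy vectors, and the use of raw term counts as weights are assumptions, not taken from the talk.

```python
from itertools import combinations

def naive_pairwise(vectors):
    """Score every document pair by a dot product over shared terms."""
    scores = {}
    for (i, vi), (j, vj) in combinations(sorted(vectors.items()), 2):
        # Each vector is revisited once per partner document, i.e. O(N) times.
        score = sum(w * vj[t] for t, w in vi.items() if t in vj)
        if score:
            scores[(i, j)] = score
    return scores

# Hypothetical term weights (raw counts) for three tiny documents.
vectors = {
    1: {"clinton": 2, "obama": 1},
    2: {"clinton": 1, "cheney": 1},
    3: {"clinton": 1, "barack": 1, "obama": 1},
}
print(naive_pairwise(vectors))
```

Because every pair is visited explicitly, the work grows quadratically with the number of documents, which is what the rest of the talk sets out to avoid.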

Better Solution
Each term contributes only if it appears in both documents of a pair
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores

MapReduce Framework
• (a) Map: (k1, v1) → [(k2, v2)]
• (b) Shuffle: group values by keys
• (c) Reduce: (k2, [v2]) → [(k3, v3)]
• The framework handles low-level details transparently
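The three phases can be imitated in memory with a few lines of Python. This is a minimal single-process sketch, not Hadoop's API: `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names, and word counting is used only as a familiar example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory over (key, value) records."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):   # (a) map: (k1, v1) -> [(k2, v2)]
            groups[k2].append(v2)       # (b) shuffle: group values by key
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))  # (c) reduce: (k2, [v2]) -> [(k3, v3)]
    return output

def count_mapper(doc_id, text):
    """Emit (word, 1) for every token in the document."""
    return [(word, 1) for word in text.split()]

def count_reducer(word, ones):
    """Sum the counts collected for one word."""
    return [(word, sum(ones))]

docs = [(1, "a b a"), (2, "b c")]
print(run_mapreduce(docs, count_mapper, count_reducer))
```

A real framework distributes the map and reduce calls across machines and performs the shuffle over the network; the sketch keeps only the data flow.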

Decomposition
Each term contributes only if it appears in both documents of a pair
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores
[Equation annotated with "map" (per-term products) and "reduce" (sum over terms)]

Standard Indexing
• (a) Map: tokenize each doc
• (b) Shuffle: group values by terms
• (c) Reduce: combine values into a posting list

Indexing (3-doc toy collection)
[Figure: three documents, "Clinton Obama Clinton", "Clinton Cheney", and "Clinton Barack Obama", are indexed into posting lists of term counts for Clinton, Obama, Cheney, and Barack.]
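The toy indexing step might look as follows in Python. The document ids 1 to 3, the helper names `index_map` and `build_index`, and the choice of raw term frequency as the stored weight are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    """Tokenize one document and emit (term, (doc_id, tf)) pairs."""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def build_index(docs):
    """Group the emitted postings by term and sort each posting list."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            postings[term].append(posting)  # shuffle: group by term
    # reduce: combine each group into a posting list ordered by doc id
    return {term: sorted(plist) for term, plist in postings.items()}

# Assumed document ids for the three toy documents.
docs = {
    1: "Clinton Obama Clinton",
    2: "Clinton Cheney",
    3: "Clinton Barack Obama",
}
index = build_index(docs)
for term, plist in sorted(index.items()):
    print(term, plist)
```

Each posting is a (document id, term frequency) tuple, so "Clinton" ends up with one posting per document while "Cheney" and "Barack" have a single posting each.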

Pairwise Similarity
• (a) Generate pairs
• (b) Group pairs
• (c) Sum pairs
[Figure: the posting lists for Clinton, Cheney, Barack, and Obama are expanded into partial scores for document pairs, which are then grouped by pair and summed.]

Pairwise Similarity (abstract)
• (a) Generate pairs (map): multiply the term weights in each term's postings
• (b) Group pairs (shuffle): group values by pairs
• (c) Sum pairs (reduce): sum the grouped values into a similarity score
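The generate, group, and sum steps can be sketched in Python over a small inverted index. The names `pairs_map` and `pairwise_similarity`, the toy index, and the use of raw term counts as weights are assumptions; the experiments in the talk used Okapi BM25 weights.

```python
from collections import defaultdict
from itertools import combinations

def pairs_map(term, postings):
    """(a) Multiply the weights of every two postings of one term."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairwise_similarity(index):
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            grouped[pair].append(partial)   # (b) group partial scores by pair
    # (c) sum the partial scores of each pair into its similarity
    return {pair: sum(partials) for pair, partials in grouped.items()}

# Hypothetical index: term -> [(doc_id, weight)], with raw counts as weights.
index = {
    "Clinton": [(1, 2), (2, 1), (3, 1)],
    "Obama": [(1, 1), (3, 1)],
    "Cheney": [(2, 1)],
    "Barack": [(3, 1)],
}
print(pairwise_similarity(index))
```

Terms that occur in a single document generate no pairs at all, so only shared terms ever produce intermediate output.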

Experimental Setup
• Hadoop 0.16.0
  • Open source MapReduce implementation
• Cluster of 19 machines
  • Each w/ two processors (single core)
• Aquaint-2 collection
  • 906K documents
• Okapi BM25
• Subsets of collection

Efficiency (disk space)
[Figure: Aquaint-2 collection, ~906k docs; 8 trillion intermediate pairs]
Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Terms: Zipfian Distribution
• Each term t contributes O(df_t^2) partial results
• Very few terms dominate the computations:
  • most frequent term (“said”) → 3%
  • most frequent 10 terms → 15%
  • most frequent 100 terms → 57%
  • most frequent 1000 terms → 95%
• ~0.1% of total terms (99.9% df-cut)
[Figure: doc freq (df) plotted against term rank]
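The effect of a df-cut on the number of intermediate pairs can be illustrated with a small Python sketch. The document frequencies below are invented, and `partial_scores` and `df_cut` are hypothetical helper names; the count df(df-1)/2 is the number of unordered document pairs a term generates, which is O(df^2).

```python
def partial_scores(df):
    """Number of document pairs generated by a term with document frequency df."""
    return df * (df - 1) // 2

def df_cut(doc_freqs, keep_fraction):
    """Keep the keep_fraction of terms with the lowest document frequency."""
    ranked = sorted(doc_freqs, key=doc_freqs.get)
    n_keep = round(len(ranked) * keep_fraction)
    return {term: doc_freqs[term] for term in ranked[:n_keep]}

# Invented document frequencies, for illustration only.
doc_freqs = {"said": 1000, "year": 400, "obama": 50, "cheney": 20, "okapi": 2}

before = sum(partial_scores(df) for df in doc_freqs.values())
kept = df_cut(doc_freqs, 0.8)   # drop the most frequent 20% of terms
after = sum(partial_scores(df) for df in kept.values())
print(before, after)
```

Even in this tiny example the single most frequent term accounts for most of the pairs, so removing it shrinks the intermediate output far more than its share of the vocabulary suggests.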

Efficiency (disk space)
[Figure: Aquaint-2 collection, ~906k docs; 8 trillion intermediate pairs; 0.5 trillion intermediate pairs]
Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Effectiveness (recent work)
• Drop 0.1% of terms
• “Near-Linear” growth
• Fit on disk
• Cost 2% in effectiveness
Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Ivory
• Open source implementation
• Java 1.5, Hadoop 0.16.0
• Available soon …

Conclusion
• Simple and efficient MapReduce solution
• Many HLT problems can also be “hadoopified”
  • E.g., Statistical MT (see paper in StatMT workshop)
• Shuffling is critical
• df-cut controls efficiency vs. effectiveness tradeoff
  • 99.9% df-cut achieves 98% relative accuracy

Future work
• Apply to larger collections!
• Develop analytical model
• Measure effectiveness for different applications

Thank You!

Algorithm
• Matrix must fit in memory
• Works for small collections
• Otherwise: disk access optimization
