Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard. University of Maryland, College Park; Human Language Technology Center of Excellence and UMIACS CLIP Lab.
Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard University of Maryland, College Park Human Language Technology Center of Excellence and UMIACS CLIP Lab ACL, June 2008
Abstract Problem
[Figure: a document collection and its matrix of pairwise similarity scores]
• Applications:
  • Clustering
  • Coreference resolution
  • “more-like-that” queries
Trivial Solution
• Load each vector O(N) times
• Load each term O(df_t^2) times
Goal: a scalable and efficient solution for large collections
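The trivial solution above can be sketched as a loop over all document pairs. This is a minimal illustration, not the authors' code: the function name `brute_force_similarity`, the dict-of-dicts vector layout, and the use of raw term counts as weights are all assumptions made for the example.

```python
from itertools import combinations


def brute_force_similarity(vectors):
    """Score every document pair by the inner product of its two vectors.

    vectors: dict mapping doc_id -> {term: weight} (hypothetical layout).
    Returns a dict mapping (doc_i, doc_j) -> score for nonzero scores only.
    """
    scores = {}
    # Every vector is revisited once per other document, i.e. O(N) times.
    for (i, vec_i), (j, vec_j) in combinations(sorted(vectors.items()), 2):
        score = sum(w * vec_j[t] for t, w in vec_i.items() if t in vec_j)
        if score:
            scores[(i, j)] = score
    return scores


# Toy vectors with raw term counts as weights (an assumption for illustration).
toy = {
    1: {"clinton": 2, "obama": 1},
    2: {"clinton": 1, "cheney": 1},
    3: {"clinton": 1, "barack": 1, "obama": 1},
}
print(brute_force_similarity(toy))
```

The nested pairing is what makes this approach impractical at scale: the number of vector loads grows quadratically with the number of documents.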
Better Solution
Each term contributes only if it appears in both documents
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores
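The formula from this slide did not survive extraction. Assuming the usual inner-product similarity over term weights (the symbols $V$ for the vocabulary and $w_{t,d}$ for the weight of term $t$ in document $d$ are notation chosen here, not taken from the slide), the observation can be written as:

```latex
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i} \, w_{t,d_j}
```

A term that is missing from either document has weight zero there, so its product vanishes. A term that occurs in $df_t$ documents therefore yields one partial score per pair of those documents, $df_t (df_t - 1) / 2$ in total, which is the $O(df_t^2)$ figure on the slide.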
MapReduce Framework
(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by keys
(c) Reduce: (k2, [v2]) → [(k3, v3)]
The framework handles low-level details transparently
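To make the three phases concrete, here is a single-process sketch of the programming model. It is not Hadoop's API: `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names, and a real framework would distribute the work, sort and spill to disk, and handle failures.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Run map, shuffle, and reduce in memory over (k1, v1) input pairs."""
    # (a) Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    # (b) Shuffle: values are grouped by their k2 key.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def count_mapper(doc_id, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


result = run_mapreduce([(1, "a b a"), (2, "b c")], count_mapper, count_reducer)
print(result)
```

Word counting is used only because it is the smallest job that exercises all three phases.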
Decomposition
Each term contributes only if it appears in both documents
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores
• map: generate the per-term partial scores; reduce: sum them for each document pair
Standard Indexing
(a) Map: tokenize each doc
(b) Shuffle: group values by terms
(c) Reduce: combine into a posting list
Indexing (3-doc toy collection)
[Figure: three documents (“Clinton Obama Clinton”, “Clinton Cheney”, “Clinton Barack Obama”) are indexed into posting lists of term frequencies for the terms Clinton, Cheney, Barack, and Obama]
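The indexing job can be sketched in a few lines. This is an illustrative, single-process version under stated assumptions: the function names are hypothetical, tokenization is a plain lowercase whitespace split, the posting payload is the raw term frequency, and the document texts and their numbering are read off the garbled slide as best they can be.

```python
from collections import Counter, defaultdict


def index_map(doc_id, text):
    """Map: tokenize one document and emit (term, (doc_id, tf))."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def index_reduce(term, postings):
    """Reduce: combine all postings of one term into a sorted posting list."""
    return term, sorted(postings)


def build_index(docs):
    """Run map, an in-memory shuffle by term, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            grouped[term].append(posting)
    return dict(index_reduce(t, p) for t, p in grouped.items())


# Toy collection; texts and doc ids are assumptions based on the slide.
docs = {
    1: "Clinton Obama Clinton",
    2: "Clinton Cheney",
    3: "Clinton Barack Obama",
}
index = build_index(docs)
print(index)
```

Each posting list holds (doc_id, tf) pairs, so the later similarity step never needs to reread the documents themselves.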
Pairwise Similarity
(a) Generate pairs (b) Group pairs (c) Sum pairs
[Figure: the posting lists for Clinton, Cheney, Barack, and Obama from the toy collection are expanded into per-term document-pair scores, which are grouped by pair and summed]
Pairwise Similarity (abstract)
(a) Generate pairs: map multiplies term postings
(b) Group pairs: shuffle groups values by pairs
(c) Sum pairs: reduce sums similarity
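The three steps above can be sketched over an in-memory index. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the shuffle is a plain dictionary, and the hard-coded posting lists use raw term frequencies as weights for the toy documents.

```python
from collections import defaultdict
from itertools import combinations


def pairs_map(term, postings):
    """(a) Generate pairs: multiply the weights of every two postings."""
    for (doc_i, w_i), (doc_j, w_j) in combinations(sorted(postings), 2):
        yield (doc_i, doc_j), w_i * w_j


def pairs_reduce(pair, partial_scores):
    """(c) Sum pairs: add up the partial scores of one document pair."""
    return pair, sum(partial_scores)


def pairwise_similarity(index):
    """Run map, an in-memory shuffle by document pair, then reduce."""
    grouped = defaultdict(list)  # (b) Group pairs
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            grouped[pair].append(partial)
    return dict(pairs_reduce(p, v) for p, v in grouped.items())


# Posting lists as (doc_id, weight); values are assumptions for illustration.
index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "cheney": [(2, 1)],
    "barack": [(3, 1)],
    "obama": [(1, 1), (3, 1)],
}
scores = pairwise_similarity(index)
print(scores)
```

Terms with a single posting (here "cheney" and "barack") emit nothing, which is why only terms shared by two documents cost any work.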
Experimental Setup
• Hadoop 0.16.0
  • Open source MapReduce implementation
• Cluster of 19 machines
  • Each w/ two processors (single core)
• Aquaint-2 collection
  • 906K documents
• Okapi BM25
• Subsets of collection
Efficiency (disk space)
• Aquaint-2 collection, ~906K docs
• 8 trillion intermediate pairs
• Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk
Terms: Zipfian Distribution
[Figure: doc freq (df) plotted against term rank]
• Each term t contributes O(df_t^2) partial results
• Very few terms dominate the computations:
  • most frequent term (“said”): 3%
  • most frequent 10 terms: 15%
  • most frequent 100 terms: 57%
  • most frequent 1000 terms: 95%
• ~0.1% of total terms (99.9% df-cut)
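A small sketch shows why dropping the few most frequent terms removes most of the intermediate pairs. The document frequencies below are made up for illustration and are not Aquaint-2 statistics; `intermediate_pairs` and `apply_df_cut` are hypothetical helper names, and the cut is expressed as the fraction of terms to drop.

```python
def intermediate_pairs(doc_freqs):
    """Count the partial scores generated: df * (df - 1) / 2 per term."""
    return sum(df * (df - 1) // 2 for df in doc_freqs.values())


def apply_df_cut(doc_freqs, drop_fraction):
    """Drop the given fraction of terms, highest document frequency first."""
    ranked = sorted(doc_freqs, key=doc_freqs.get, reverse=True)
    n_drop = round(len(ranked) * drop_fraction)
    return {term: doc_freqs[term] for term in ranked[n_drop:]}


# Made-up document frequencies with one dominant term (illustrative only).
doc_freqs = {"said": 100, "clinton": 10, "obama": 5, "cheney": 2}

before = intermediate_pairs(doc_freqs)
after = intermediate_pairs(apply_df_cut(doc_freqs, drop_fraction=0.25))
print(before, after)
```

In this toy setting a single dropped term accounts for almost all generated pairs, because the cost of a term grows with the square of its document frequency.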
Efficiency (disk space)
• Aquaint-2 collection, ~906K docs
• 8 trillion intermediate pairs
• 0.5 trillion intermediate pairs
• Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk
Effectiveness (recent work)
• Drop 0.1% of terms
• “Near-Linear” growth
• Fit on disk
• Cost 2% in effectiveness
• Hadoop, 19 PCs, each w/ 2 single-core processors, 4GB memory, 100GB disk
Ivory
• Open source implementation
• Java 1.5, Hadoop 0.16.0
• Available soon …
Conclusion
• Simple and efficient MapReduce solution
• Many HLT problems can also be “hadoopified”
  • E.g., Statistical MT (see paper in StatMT workshop)
• Shuffling is critical
• df-cut controls efficiency vs. effectiveness tradeoff
  • 99.9% df-cut achieves 98% relative accuracy
Future work
• Apply to larger collections!
• Develop analytical model
• Measure effectiveness for different applications
Thank You!
Algorithm
• Matrix must fit in memory
• Works for small collections
• Otherwise: disk access optimization