Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard. University of Maryland, College Park; Human Language Technology Center of Excellence and UMIACS CLIP Lab.

Presentation Transcript


  1. Pairwise Document Similarity in Large Collections with MapReduce
     Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
     University of Maryland, College Park
     Human Language Technology Center of Excellence and UMIACS CLIP Lab
     ACL, June 2008

  2. Abstract Problem
     [Figure: a collection of documents and a matrix of pairwise similarity scores]
     • Applications:
     • Clustering
     • Coreference resolution
     • “more-like-that” queries

  3. Trivial Solution
     • Load each vector O(N) times
     • Load each term O(df_t^2) times
     Goal: a scalable and efficient solution for large collections

  4. Better Solution
     Each term contributes only if it appears in both documents of a pair
     • Load weights for each term once
     • Each term contributes O(df_t^2) partial scores
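The formula on this slide did not survive transcription. The following is a plausible reconstruction, not the slide's own notation: it assumes the similarity is an inner product of term weights, with $w_{t,d}$ the weight of term $t$ in document $d$, $V$ the vocabulary, and $df_t$ the document frequency of $t$.

```latex
\[
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
\]
% A term t appears in df_t documents, so it contributes one partial
% score to each of the document pairs drawn from its posting list:
\[
\binom{df_t}{2} = \frac{df_t\,(df_t - 1)}{2} = O(df_t^2)
\]
```

Under these assumptions, terms absent from either document have zero weight there and drop out of the sum, which is why each term's weights need to be loaded only once.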

  5. MapReduce Framework
     (a) Map: input (k1, v1) → [(k2, v2)]
     (b) Shuffle: group values by keys
     (c) Reduce: (k2, [v2]) → output [(k3, v3)]
     The framework handles low-level details transparently
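The three stages can be sketched in a few lines of single-process Python. This is an illustrative toy, not Hadoop's API: the names `run_mapreduce`, `count_mapper`, and `sum_reducer` are hypothetical, and a real framework would distribute the work and handle failures.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Run one map/shuffle/reduce pass in memory (illustrative only)."""
    # (a) Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))

    # (b) Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def count_mapper(doc_id, text):
    # Hypothetical example mapper: emit (word, 1) for every token.
    return [(word, 1) for word in text.split()]


def sum_reducer(word, counts):
    # Hypothetical example reducer: total the counts for one word.
    return [(word, sum(counts))]


word_counts = run_mapreduce([(1, "a b a"), (2, "b c")], count_mapper, sum_reducer)
print(word_counts)
```

Word counting is used here only because it is the smallest job that exercises all three stages.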

  6. Decomposition
     Each term contributes only if it appears in both documents of a pair
     • Load weights for each term once
     • Each term contributes O(df_t^2) partial scores
     [Diagram labels: map, reduce]

  7. Standard Indexing
     (a) Map: tokenize each doc
     (b) Shuffle: group values by terms
     (c) Reduce: combine into posting lists

  8. Indexing (3-doc toy collection)
     Documents:
     • doc 1: Clinton Obama Clinton
     • doc 2: Clinton Cheney
     • doc 3: Clinton Barack Obama
     Index (term: term frequency per document):
     • Clinton: doc 1 = 2, doc 2 = 1, doc 3 = 1
     • Cheney: doc 2 = 1
     • Barack: doc 3 = 1
     • Obama: doc 1 = 1, doc 3 = 1
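The toy index can be reproduced with a minimal sketch of the indexing job. The document numbering (1 to 3) and the function names `index_map` and `build_index` are assumptions made for illustration, and whitespace tokenization stands in for whatever tokenizer the authors used.

```python
from collections import Counter, defaultdict

# Toy collection from the slide; the doc ids 1..3 are assumed labels.
docs = {
    1: "Clinton Obama Clinton",
    2: "Clinton Cheney",
    3: "Clinton Barack Obama",
}


def index_map(doc_id, text):
    # Map: tokenize one document and emit (term, (doc_id, tf)).
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]


def build_index(docs):
    # Shuffle: group the emitted postings by term.
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            postings[term].append(posting)
    # Reduce: combine each group into a posting list sorted by doc id.
    return {term: sorted(plist) for term, plist in postings.items()}


index = build_index(docs)
for term, plist in sorted(index.items()):
    print(term, plist)
```

Each posting is a `(doc_id, term_frequency)` tuple; raw term frequency is used as the weight only to keep the example small.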

  9. Pairwise Similarity
     (a) Generate pairs (b) Group pairs (c) Sum pairs
     [Figure: worked example on the toy index (Clinton, Cheney, Barack, Obama)]

  10. Pairwise Similarity (abstract)
     (a) Generate pairs: map multiplies term postings
     (b) Group pairs: shuffle groups values by pairs
     (c) Sum pairs: reduce sums the partial scores into a similarity
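The generate, group, and sum steps can be sketched on the toy index as follows. This is a single-process illustration under stated assumptions: document ids 1 to 3 are assumed labels, raw term frequencies stand in for term weights, and the names `pairs_map` and `pairwise_similarity` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy index: term -> posting list of (doc_id, weight).
# Weights are raw term frequencies here, purely for illustration.
index = {
    "Clinton": [(1, 2), (2, 1), (3, 1)],
    "Cheney": [(2, 1)],
    "Barack": [(3, 1)],
    "Obama": [(1, 1), (3, 1)],
}


def pairs_map(term, postings):
    # (a) Generate pairs: multiply the weights of every two documents
    # that share this term, keyed by the document pair.
    return [
        ((di, dj), wi * wj)
        for (di, wi), (dj, wj) in combinations(postings, 2)
    ]


def pairwise_similarity(index):
    # (b) Group pairs: collect partial scores by document pair.
    grouped = defaultdict(list)
    for term, postings in index.items():
        for pair, partial in pairs_map(term, postings):
            grouped[pair].append(partial)
    # (c) Sum pairs: add up the partial scores for each pair.
    return {pair: sum(partials) for pair, partials in grouped.items()}


similarity = pairwise_similarity(index)
for pair, score in sorted(similarity.items()):
    print(pair, score)
```

Terms with a single posting (Cheney, Barack) generate no pairs, so they cost nothing in this step.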

  11. Experimental Setup
     • Hadoop 0.16.0
     • Open source MapReduce implementation
     • Cluster of 19 machines
     • Each w/ two processors (single core)
     • Aquaint-2 collection
     • 906K documents
     • Okapi BM25
     • Subsets of collection

  12. Efficiency (disk space)
     Aquaint-2 collection, ~906k docs
     8 trillion intermediate pairs
     Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

  13. Terms: Zipfian Distribution
     Each term t contributes O(df_t^2) partial results
     Very few terms dominate the computations:
     • most frequent term (“said”) → 3%
     • most frequent 10 terms → 15%
     • most frequent 100 terms → 57%
     • most frequent 1000 terms → 95%
     ~0.1% of total terms (99.9% df-cut)
     [Figure axes: doc freq (df) vs. term rank]
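The effect of a df-cut on the number of intermediate pairs can be sketched as below. The document frequencies are made-up toy numbers, not figures from Aquaint-2, and the helper names `intermediate_pairs` and `df_cut` are hypothetical.

```python
def intermediate_pairs(doc_freqs):
    # A term with document frequency df generates df * (df - 1) / 2
    # document pairs, i.e. O(df^2) partial scores.
    return sum(df * (df - 1) // 2 for df in doc_freqs.values())


def df_cut(doc_freqs, cut_fraction):
    # Drop the given fraction of terms, taking the most frequent first.
    ranked = sorted(doc_freqs.items(), key=lambda item: item[1], reverse=True)
    n_drop = round(len(ranked) * cut_fraction)
    return dict(ranked[n_drop:])


# Made-up, roughly Zipfian document frequencies for ten terms.
doc_freqs = {
    "t0": 100, "t1": 50, "t2": 33, "t3": 25, "t4": 20,
    "t5": 16, "t6": 14, "t7": 12, "t8": 11, "t9": 10,
}

before = intermediate_pairs(doc_freqs)
after = intermediate_pairs(df_cut(doc_freqs, 0.10))
print(before, after, f"{after / before:.0%} of pairs remain")
```

Even in this tiny example, removing the single most frequent term removes most of the pairs, which is the intuition behind cutting only a very small fraction of terms.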

  14. Efficiency (disk space)
     Aquaint-2 collection, ~906k docs
     8 trillion intermediate pairs
     0.5 trillion intermediate pairs
     Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

  15. Effectiveness (recent work)
     • Drop 0.1% of terms
     • “Near-Linear” growth
     • Fit on disk
     • Cost 2% in effectiveness
     Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

  16. Ivory
     • Open source implementation
     • Java 1.5, Hadoop 0.16.0
     • Available soon …

  17. Conclusion
     • Simple and efficient MapReduce solution
     • Many HLT problems can also be “hadoopified”
     • E.g., Statistical MT (see paper in StatMT workshop)
     • Shuffling is critical
     • df-cut controls efficiency vs. effectiveness tradeoff
     • 99.9% df-cut achieves 98% relative accuracy

  18. Future work
     • Apply to larger collections!
     • Develop analytical model
     • Measure effectiveness for different applications

  19. Thank You!

  20. Algorithm
     • Matrix must fit in memory
     • Works for small collections
     • Otherwise: disk access optimization