On Large-Scale Retrieval Tasks with Ivory and MapReduce
Nov 7th, 2012
My Field …
Information Retrieval (IR) is … finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
• Quite effective (at some things)
• Highly visible (mostly)
• Commercially successful (some of them)
IR is not just “Document Retrieval” • Clustering and Classification • Question answering • Filtering, tracking, routing • Recommender systems • Leveraging XML and other Metadata • Text mining • Novelty identification • Meta-search (multi-collection searching) • Summarization • Cross-language mechanisms • Evaluation techniques • Multimedia retrieval • Social media analysis • …
My Research …
[Diagram: large-scale processing over text, from Enron emails (~500,000 documents) to ClueWeb web pages (~1,000,000,000 documents), feeding user applications such as identity resolution and web search]
Back in 2009 …
• Before 2009, only small text collections were available
• Largest: ~1M documents
• ClueWeb09
• Crawled by CMU in 2009
• ~1B documents!
• Need to move to cluster environments
• MapReduce/Hadoop seems like a promising framework
Ivory
• End-to-end (E2E) search toolkit using MapReduce
• Designed completely for the Hadoop environment
• Experimental platform for research
• Supports common text collections, plus ClueWeb09
• Open-source release
• Implements state-of-the-art retrieval models
http://ivory.cc
MapReduce Framework
(a) Map  (b) Shuffle  (c) Reduce
map: (k1, v1) → [(k2, v2)]
shuffle: group values by key
reduce: (k2, [v2]) → [(k3, v3)]
The framework handles "everything else"!
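To make the data flow concrete, here is a minimal single-machine sketch of the map → shuffle → reduce contract (plain Python, not the Hadoop API; the word-count job below is an illustrative assumption, not part of the talk):

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    # (a) Map: each input record yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: the framework groups values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) group yields output (k3, v3) pairs.
    output = []
    for k2, values in sorted(groups.items()):
        output.extend(reducer(k2, values))
    return output

# Word count, the canonical example:
def wc_map(doc_id, text):
    return [(term, 1) for term in text.split()]

def wc_reduce(term, counts):
    return [(term, sum(counts))]

docs = [("A", "clinton obama clinton"), ("B", "clinton romney")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('clinton', 3), ('obama', 1), ('romney', 1)]
```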
The IR Black Box
[Diagram: Documents and a Query go into the black box; Hits come out]
Inside the IR Black Box
[Diagram: offline, a representation function turns documents into document representations, which are stored in an index; online, a representation function turns the query into a query representation; a comparison function matches the two against the index to produce hits]
Indexing
Collection → Inverted Index (Documents → IDs; Terms → Postings Lists)
Documents:
  A: Clinton Obama Clinton
  B: Clinton Romney
  C: Clinton Barack Obama
Postings:
  Clinton → (A, 2), (B, 1), (C, 1)
  Obama → (A, 1), (C, 1)
  Romney → (B, 1)
  Barack → (C, 1)
Indexing with MapReduce
(a) Map  (b) Shuffle  (c) Reduce
• Map: for each document (A: Clinton Obama Clinton; B: Clinton Romney; C: Clinton Barack Obama), emit (term, (doc id, term frequency)) pairs
• Shuffle: group postings by term
• Reduce: sort each term's postings and write out its postings list
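Reusing the run_mapreduce harness from the sketch above, inverted indexing is one map and one reduce. This is a toy sketch; Ivory's real Java implementation adds postings compression, partitioning, and document-length bookkeeping, which are omitted here:

```python
from collections import Counter

def index_map(doc_id, text):
    # Emit (term, (doc_id, term_frequency)) for each unique term in the doc.
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def index_reduce(term, postings):
    # Sort each term's postings by document id to form its postings list.
    return [(term, sorted(postings))]

docs = [("A", "clinton obama clinton"),
        ("B", "clinton romney"),
        ("C", "clinton barack obama")]
index = run_mapreduce(docs, index_map, index_reduce)
# [('barack', [('C', 1)]),
#  ('clinton', [('A', 2), ('B', 1), ('C', 1)]),
#  ('obama', [('A', 1), ('C', 1)]),
#  ('romney', [('B', 1)])]
```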
Retrieval Directly from HDFS!
• Cute hack: use Hadoop to launch partition servers
• Embed an HTTP server inside each mapper
• Mappers start up, initialize servers, and enter an infinite service loop!
• Why do this?
• Unified Hadoop ecosystem
• Simplifies data management issues
[Diagram: a search client talks to a retrieval broker, which fans out to partition servers, one per mapper; the partition servers read index data from HDFS datanodes rather than local disk; used at TREC '09 and TREC '10]
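A minimal sketch of the trick, with Python's http.server standing in for both the Hadoop map task and Ivory's real partition server (the handler, the in-memory toy index, and the /search endpoint are all illustrative assumptions):

```python
import http.server
import json

class PartitionServer(http.server.BaseHTTPRequestHandler):
    # One index partition, loaded at mapper startup. A real partition server
    # would read Ivory's index structures out of HDFS; a toy dict stands in.
    index = {"clinton": [["A", 2], ["B", 1], ["C", 1]],
             "obama": [["A", 1], ["C", 1]]}

    def do_GET(self):
        # e.g. GET /search?q=clinton -> this partition's postings for the term
        term = self.path.split("q=")[-1]
        body = json.dumps(self.index.get(term, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def mapper_run(port=8080):
    # Inside the real map task: initialize the server, then enter an
    # infinite service loop so the mapper never "finishes" and Hadoop
    # keeps the partition server alive for the retrieval broker.
    http.server.HTTPServer(("", port), PartitionServer).serve_forever()

if __name__ == "__main__":
    mapper_run()  # blocks forever, like the mapper's service loop
```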
Roadmap
[Overview: Ivory (TREC 2009, TREC 2010) and the follow-on work covered next: ACL 2008, SIGIR 2011, CIKM 2011, CloudCom 2011]
Roadmap: ACL 2008, SIGIR 2011
Abstract Problem
[Figure: a collection of documents and, for every pair, a similarity score, i.e. a dense pairwise similarity matrix]
• Applications:
• Clustering
• Coreference resolution
• "More-like-that" queries
Decomposition
sim(d_i, d_j) = Σ_{t ∈ d_i ∩ d_j} w(t, d_i) · w(t, d_j)
Each term contributes only if it appears in both documents, so the computation can be driven from the inverted index: map over each term's postings list to generate partial products, then reduce by document pair to sum them.
Pairwise Similarity
(a) Generate pairs  (b) Group pairs  (c) Sum pairs
[Figure: from each term's postings list (Clinton, Romney, Barack, Obama), the mapper generates a partial weight product for every pair of documents in the list; the shuffle groups them by document pair; the reducer sums them into final similarity scores]
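A sketch of the postings-driven computation, using the toy index from the indexing sketch above. Term weights here are raw term frequencies purely for illustration; the papers use proper term weighting:

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(index):
    # (a) Generate pairs: each postings list (term -> [(doc, weight)])
    # emits a partial product for every pair of documents it contains.
    partials = defaultdict(float)
    for term, postings in index:
        for (d1, w1), (d2, w2) in combinations(postings, 2):
            partials[(d1, d2)] += w1 * w2  # (b) group + (c) sum
    return dict(partials)

# index as produced by the indexing sketch above:
index = [("barack", [("C", 1)]),
         ("clinton", [("A", 2), ("B", 1), ("C", 1)]),
         ("obama", [("A", 1), ("C", 1)]),
         ("romney", [("B", 1)])]
print(pairwise_similarity(index))
# {('A', 'B'): 2.0, ('A', 'C'): 3.0, ('B', 'C'): 1.0}
```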
Terms: Zipfian Distribution
• Each term t contributes O(df_t²) partial results, so very few terms dominate the computation:
• Most frequent term ("said"): 3%
• Most frequent 10 terms: 15%
• Most frequent 100 terms: 57%
• Most frequent 1,000 terms: 95%
[Plot: document frequency (df) vs. term rank; the dominating terms are only ~0.1% of total terms, hence a 99.9% df-cut]
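The df-cut itself is simple: drop the postings lists of the most frequent terms before generating any pairs. A sketch under the paper's 99.9% cut point; function and variable names are mine:

```python
def df_cut(index, keep_fraction=0.999):
    # Sort terms by document frequency (postings-list length), ascending,
    # and drop the top (1 - keep_fraction); e.g. a 99.9% df-cut removes
    # the 0.1% most frequent terms, which dominate the O(df^2) cost.
    ranked = sorted(index, key=lambda entry: len(entry[1]))
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep]

# On the 4-term toy index, a 75% cut drops "clinton", the highest-df term:
print([term for term, _ in df_cut(index, keep_fraction=0.75)])
# ['barack', 'romney', 'obama']
```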
Efficiency (Disk Space)
• Aquaint-2 collection, ~906k docs
• 8 trillion intermediate pairs without the df-cut vs. 0.5 trillion intermediate pairs with it
• Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Effectiveness (ACL '08)
• Drop 0.1% of terms: "near-linear" growth, output fits on disk, costs 2% in effectiveness
• Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Cross-Lingual Pairwise Similarity
• Find similar document pairs in different languages
• Multilingual text mining, machine translation
• Application: automatic generation of potential "interwiki" language links
• More difficult than monolingual!
Vocabulary Space Matching
[Diagram: two ways to put a German document A and an English document B into the same vocabulary space. MT: translate document A into English, then build both document vectors in English. CLIR: build document A's vector in German, then project it into the English vocabulary space]
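A sketch of the CLIR-style projection: each source-language term's weight is spread over its translations according to a translation probability table. The tiny table below is a made-up stand-in for one learned from parallel text:

```python
def clir_project(doc_vector, translation_table):
    # Project a German document vector into the English vocabulary space:
    # w(e) = sum over German terms f of w(f) * P(e | f)
    projected = {}
    for f_term, weight in doc_vector.items():
        for e_term, prob in translation_table.get(f_term, {}).items():
            projected[e_term] = projected.get(e_term, 0.0) + weight * prob
    return projected

# Illustrative translation probabilities (not real values):
table = {"preis": {"prize": 0.7, "price": 0.3}, "buch": {"book": 1.0}}
print(clir_project({"preis": 0.5, "buch": 0.2}, table))
# {'prize': 0.35, 'price': 0.15, 'book': 0.2}
```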
Locality-Sensitive Hashing (LSH)
• Cosine score is a good similarity measure, but expensive!
• LSH is a method for effectively reducing the search space when looking for similar pairs
• Each vector is converted into a compact representation, called a signature
• A sliding-window algorithm uses these signatures to search for similar articles in the collection
• Vectors close to each other are likely to have similar signatures
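A sketch of signature generation via random projection, one of the three schemes named on the overview slide that follows (the vocabulary, bit count, and vectors are illustrative): each of the random hyperplanes contributes one bit, and vectors with a small angle between them tend to agree on most bits.

```python
import random

def make_hyperplanes(vocabulary, n_bits, seed=0):
    # One random Gaussian hyperplane per signature bit.
    rng = random.Random(seed)
    return [{term: rng.gauss(0, 1) for term in vocabulary} for _ in range(n_bits)]

def random_projection_signature(vector, hyperplanes):
    # One bit per hyperplane: 1 if the vector lies on its positive side.
    bits = []
    for plane in hyperplanes:
        dot = sum(vector.get(term, 0.0) * w for term, w in plane.items())
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)

vocab = ["nobel", "prize", "book"]
planes = make_hyperplanes(vocab, n_bits=8)
sig_a = random_projection_signature({"nobel": 0.324, "prize": 0.227}, planes)
sig_b = random_projection_signature({"nobel": 0.310, "prize": 0.250}, planes)
# Similar vectors agree on most bits; Hamming distance estimates the angle.
print(sig_a, sig_b)
```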
Solution Overview
[Pipeline: N_f German articles are projected via CLIR and N_e English articles are preprocessed, giving N_e + N_f English document vectors (e.g. <nobel=0.324, prize=0.227, book=0.01, …>); signature generation (random projection / minhash / simhash) turns these into N_e + N_f signatures (e.g. 11100001010); a sliding-window algorithm over the signatures produces similar article pairs]
MapReduce 1: Table Generation Phase
[Diagram: the signature list is permuted Q times (permutations p_1 … p_Q); each permuted table S_1' … S_Q' is then sorted, yielding Q sorted signature tables]
MapReduce 2: Detection Phase
[Diagram: each sorted table is split into chunks; within each chunk, a window slides over neighboring signatures, and nearby signatures are checked as candidate similar pairs]
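A single-machine sketch of the two phases; the number of tables, window size, and Hamming threshold below are arbitrary choices of mine, and the real implementation distributes the table generation and detection with MapReduce:

```python
import random

def detect_similar(signatures, n_tables=4, window=3, max_hamming=2, seed=0):
    rng = random.Random(seed)
    n_bits = len(next(iter(signatures.values())))
    candidates = set()
    for _ in range(n_tables):
        # Phase 1: permute the bit positions of every signature, then sort.
        perm = rng.sample(range(n_bits), n_bits)
        table = sorted((tuple(sig[i] for i in perm), doc)
                       for doc, sig in signatures.items())
        # Phase 2: slide a window over the sorted table; nearby rows are
        # candidate pairs, verified by Hamming distance on the signatures.
        for i, (sig_i, doc_i) in enumerate(table):
            for sig_j, doc_j in table[i + 1:i + window]:
                if sum(a != b for a, b in zip(sig_i, sig_j)) <= max_hamming:
                    candidates.add(tuple(sorted((doc_i, doc_j))))
    return candidates

sigs = {"de_doc": (1, 0, 1, 1, 0, 0, 1, 0),
        "en_doc": (1, 0, 1, 1, 0, 1, 1, 0),
        "other":  (0, 1, 0, 0, 1, 1, 0, 1)}
print(detect_similar(sigs))  # {('de_doc', 'en_doc')}
```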
Evaluation
• Ground truth: sample 1,064 German articles with cosine score >= 0.3
• Compare the sliding-window algorithm with the brute-force approach
• Brute force is required for the exact solution
• It serves as an upper-bound reference for recall and running time
Evaluation
• 95% recall at 39% of the brute-force cost
• 99% recall at 62% of the cost
• No free lunch!
Contribution to Wikipedia (SIGIR '11)
• Identify links between German and English Wikipedia articles
• "Metadaten" → "Metadata", "Semantic Web", "File Format"
• "Pierre Curie" → "Marie Curie", "Pierre Curie", "Helene Langevin-Joliot"
• "Kirgisistan" → "Kyrgyzstan", "Tulip Revolution", "2010 Kyrgyzstani uprising", "2010 South Kyrgyzstan riots", "Uzbekistan"
• Results are poor when the two articles differ significantly in length.
Roadmap: CIKM 2011
Approximate Positional Indexes
• "Learning to rank" models learn effective ranking functions from term positions and proximity features
• Exact term positions: large index, slow query evaluation (✗)
• Approximate positions: smaller index, faster query evaluation (✓)
• Is close enough good enough?
Variable-Width Buckets
• 5 buckets per document: every document is split into the same number of buckets, so bucket width varies with document length
[Diagram: documents d1 and d2, each divided into buckets 1 to 5]
Fixed-Width Buckets
• Buckets of length W: every document is split into fixed-width buckets, so the number of buckets varies with document length
[Diagram: d1 divided into buckets 1 to 3, d2 into buckets 1 to 5]
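A sketch of the two approximations: a term's exact position is replaced by its bucket id, which is all the index then needs to store. The bucket counts mirror the slides; function and variable names are mine:

```python
def variable_width_bucket(position, doc_length, n_buckets=5):
    # Same number of buckets per document; width scales with doc length.
    return position * n_buckets // doc_length  # bucket id in [0, n_buckets)

def fixed_width_bucket(position, width):
    # Same bucket width everywhere; longer documents get more buckets.
    return position // width

# Term at position 37 of a 100-term document:
print(variable_width_bucket(37, 100))    # 1 (second of 5 buckets)
print(fixed_width_bucket(37, width=25))  # 1 (covers positions 25..49)
```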
Effectiveness (CIKM '11)
Roadmap: SIGIR '11, iHadoop
Test Collections
• Documents, queries, and relevance judgments
• An important driving force behind IR innovation
• Without test collections, it's impossible to:
• Evaluate search systems
• Tune ranking functions / train models
• Traditional methodologies: exhaustive judging, pooling
• Recent methodologies: behavioral logging (query logs, click logs, etc.), minimal test collections, crowdsourcing
Web Graph
[Diagram: pages P1 to P7 connected by hyperlinks, many carrying the anchor text "web search"; example links point to a SIGIR 2012 page and to Google]
Queries and Judgments?
• Anchor text lines ≈ pseudo queries
• Target pages ≈ relevant candidates
• Plus noise reduction
[Diagram: the same web graph; anchor text such as "web search" pointing at pages like Google and Bing yields (pseudo query, relevant page) pairs]
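A sketch of the idea: mine (anchor text, target page) pairs from the link graph, aggregate them, and keep only well-supported pairs as pseudo judgments. The minimum-support threshold here is an illustrative stand-in for noise reduction, not the paper's actual scheme:

```python
from collections import Counter

def pseudo_test_collection(links, min_support=2):
    # links: (source_page, anchor_text, target_page) triples from the graph.
    support = Counter((anchor.lower(), target) for _, anchor, target in links)
    # Noise reduction (simplified): keep pairs endorsed by enough links.
    return [(anchor, target) for (anchor, target), n in support.items()
            if n >= min_support]

links = [("P1", "web search", "Google"), ("P2", "web search", "Google"),
         ("P4", "web search", "Bing"), ("P3", "SIGIR 2012", "P5")]
print(pseudo_test_collection(links))
# [('web search', 'Google')]  ->  pseudo query + relevant candidate
```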
Roadmap: CloudCom 2011
Iterative MapReduce Applications
• Many machine learning and data mining applications: PageRank, k-means, HITS, …
• Every iteration has to wait until the previous iteration has written its output completely to the DFS (unnecessary waiting time)
• Every iteration starts by reading from the DFS what was just written by the previous iteration (wastes CPU time, I/O, and bandwidth)
• MapReduce is not designed to run iterative applications efficiently
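A sketch of the pattern being criticized: a driver that runs one job per iteration, with a synchronous write and re-read between iterations (local files stand in for HDFS, and the update rule is illustrative, not a real PageRank):

```python
import json

def run_iteration(state):
    # One "MapReduce job": an illustrative PageRank-style score update.
    return {page: 0.5 * score + 0.5 for page, score in state.items()}

state = {"P1": 1.0, "P2": 1.0}
for i in range(3):
    # Barrier: iteration i+1 cannot start until iteration i has fully
    # written its output to the DFS...
    with open(f"iter_{i}.json", "w") as f:
        json.dump(run_iteration(state), f)
    # ...and then begins by reading back exactly what was just written,
    # which is the wasted I/O the asynchronous pipeline eliminates.
    with open(f"iter_{i}.json") as f:
        state = json.load(f)
print(state)
```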
Asynchronous Pipeline (CloudCom '11)
Conclusion
• MapReduce allows large-scale processing over web data
• Ivory: E2E open-source IR engine for research
  • Completely on Hadoop; even retrieval is served from HDFS
  • Efficiency-effectiveness tradeoff
• Cross-lingual pairwise similarity
  • Efficient implementation using MapReduce
  • Efficiency-effectiveness tradeoff
• Approximate positional indexes: efficient, and as effective as exact positions
• Pseudo test collections: possible! Effective for training L2R models
• MapReduce is not good for iterative algorithms
http://ivory.cc
Collaborators: Jimmy Lin, Don Metzler, Doug Oard, Ferhan Ture, Nima Asadi, Lidan Wang, Eslam Elnikety, Hany Ramadan
Thank You! Questions?