On Large-Scale Retrieval Tasks with Ivory and MapReduce
Nov 7th, 2012
My Field …
Information Retrieval (IR) is … finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
• Quite effective (at some things)
• Highly visible (mostly)
• Commercially successful (some of them)
IR is not just “Document Retrieval” • Clustering and Classification • Question answering • Filtering, tracking, routing • Recommender systems • Leveraging XML and other Metadata • Text mining • Novelty identification • Meta-search (multi-collection searching) • Summarization • Cross-language mechanisms • Evaluation techniques • Multimedia retrieval • Social media analysis • …
My Research …
[Diagram: large-scale processing over text, from Enron emails (~500,000 documents) to ClueWeb web pages (~1,000,000,000 documents), feeding user applications such as identity resolution and web search]
Back in 2009 …
• Before 2009, only small text collections were available
• Largest: ~1M documents
• ClueWeb09
• Crawled by CMU in 2009
• ~1B documents!
• Need to move to cluster environments
• MapReduce/Hadoop seems like a promising framework
Ivory
• End-to-end (E2E) search toolkit using MapReduce
• Designed completely for the Hadoop environment
• Experimental platform for research
• Supports common text collections, plus ClueWeb09
• Open-source release
• Implements state-of-the-art retrieval models
http://ivory.cc
MapReduce Framework
(a) Map  (b) Shuffle  (c) Reduce
map: (k1, v1) → [(k2, v2)]
shuffle: group values by key
reduce: (k2, [v2]) → [(k3, v3)]
The framework handles "everything else"!
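To make the data flow concrete, here is a minimal single-machine sketch of the map → shuffle → reduce contract (plain Python, not the Hadoop API; the word-count job below is an illustrative assumption, not part of the talk):

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    # (a) Map: each input record yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: the framework groups values by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) group yields output (k3, v3) pairs.
    output = []
    for k2, values in sorted(groups.items()):
        output.extend(reducer(k2, values))
    return output

# Word count, the canonical example:
def wc_map(doc_id, text):
    return [(term, 1) for term in text.split()]

def wc_reduce(term, counts):
    return [(term, sum(counts))]

docs = [("A", "clinton obama clinton"), ("B", "clinton romney")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('clinton', 3), ('obama', 1), ('romney', 1)]
```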
The IR Black Box
[Diagram: Documents and a Query go into the black box; Hits come out]
Inside the IR Black Box
[Diagram: offline, a representation function turns documents into document representations, which are stored in an index; online, a representation function turns the query into a query representation; a comparison function matches the two against the index to produce hits]
Indexing
Collection → Inverted Index (Documents → IDs; Terms → Postings Lists)
Documents:
  A: Clinton Obama Clinton
  B: Clinton Romney
  C: Clinton Barack Obama
Postings:
  Clinton → (A, 2), (B, 1), (C, 1)
  Obama → (A, 1), (C, 1)
  Romney → (B, 1)
  Barack → (C, 1)
Indexing with MapReduce
(a) Map  (b) Shuffle  (c) Reduce
• Map: for each document (A: Clinton Obama Clinton; B: Clinton Romney; C: Clinton Barack Obama), emit (term, (doc id, term frequency)) pairs
• Shuffle: group postings by term
• Reduce: sort each term's postings and write out its postings list
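Reusing the run_mapreduce harness from the sketch above, inverted indexing is one map and one reduce. This is a toy sketch; Ivory's real Java implementation adds postings compression, partitioning, and document-length bookkeeping, which are omitted here:

```python
from collections import Counter

def index_map(doc_id, text):
    # Emit (term, (doc_id, term_frequency)) for each unique term in the doc.
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def index_reduce(term, postings):
    # Sort each term's postings by document id to form its postings list.
    return [(term, sorted(postings))]

docs = [("A", "clinton obama clinton"),
        ("B", "clinton romney"),
        ("C", "clinton barack obama")]
index = run_mapreduce(docs, index_map, index_reduce)
# [('barack', [('C', 1)]),
#  ('clinton', [('A', 2), ('B', 1), ('C', 1)]),
#  ('obama', [('A', 1), ('C', 1)]),
#  ('romney', [('B', 1)])]
```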
Retrieval Directly from HDFS!
• Cute hack: use Hadoop to launch partition servers
• Embed an HTTP server inside each mapper
• Mappers start up, initialize servers, and enter an infinite service loop!
• Why do this?
• Unified Hadoop ecosystem
• Simplifies data management issues
[Diagram: a search client talks to a retrieval broker, which fans out to partition servers, one per mapper; the partition servers read index data from HDFS datanodes rather than local disk; used at TREC '09 and TREC '10]
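A minimal sketch of the trick, with Python's http.server standing in for both the Hadoop map task and Ivory's real partition server (the handler, the in-memory toy index, and the /search endpoint are all illustrative assumptions):

```python
import http.server
import json

class PartitionServer(http.server.BaseHTTPRequestHandler):
    # One index partition, loaded at mapper startup. A real partition server
    # would read Ivory's index structures out of HDFS; a toy dict stands in.
    index = {"clinton": [["A", 2], ["B", 1], ["C", 1]],
             "obama": [["A", 1], ["C", 1]]}

    def do_GET(self):
        # e.g. GET /search?q=clinton -> this partition's postings for the term
        term = self.path.split("q=")[-1]
        body = json.dumps(self.index.get(term, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def mapper_run(port=8080):
    # Inside the real map task: initialize the server, then enter an
    # infinite service loop so the mapper never "finishes" and Hadoop
    # keeps the partition server alive for the retrieval broker.
    http.server.HTTPServer(("", port), PartitionServer).serve_forever()

if __name__ == "__main__":
    mapper_run()  # blocks forever, like the mapper's service loop
```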
Roadmap
[Overview: Ivory (TREC 2009, TREC 2010) and the follow-on work covered next: ACL 2008, SIGIR 2011, CIKM 2011, CloudCom 2011]
Roadmap: ACL 2008, SIGIR 2011
Abstract Problem
[Figure: a collection of documents and, for every pair, a similarity score, i.e. a dense pairwise similarity matrix]
• Applications:
• Clustering
• Coreference resolution
• "More-like-that" queries
Decomposition
sim(d_i, d_j) = Σ_{t ∈ d_i ∩ d_j} w(t, d_i) · w(t, d_j)
Each term contributes only if it appears in both documents, so the computation can be driven from the inverted index: map over each term's postings list to generate partial products, then reduce by document pair to sum them.
Pairwise Similarity
(a) Generate pairs  (b) Group pairs  (c) Sum pairs
[Figure: from each term's postings list (Clinton, Romney, Barack, Obama), the mapper generates a partial weight product for every pair of documents in the list; the shuffle groups them by document pair; the reducer sums them into final similarity scores]
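A sketch of the postings-driven computation, using the toy index from the indexing sketch above. Term weights here are raw term frequencies purely for illustration; the papers use proper term weighting:

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(index):
    # (a) Generate pairs: each postings list (term -> [(doc, weight)])
    # emits a partial product for every pair of documents it contains.
    partials = defaultdict(float)
    for term, postings in index:
        for (d1, w1), (d2, w2) in combinations(postings, 2):
            partials[(d1, d2)] += w1 * w2  # (b) group + (c) sum
    return dict(partials)

# index as produced by the indexing sketch above:
index = [("barack", [("C", 1)]),
         ("clinton", [("A", 2), ("B", 1), ("C", 1)]),
         ("obama", [("A", 1), ("C", 1)]),
         ("romney", [("B", 1)])]
print(pairwise_similarity(index))
# {('A', 'B'): 2.0, ('A', 'C'): 3.0, ('B', 'C'): 1.0}
```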
Terms: Zipfian Distribution
• Each term t contributes O(df_t²) partial results, so very few terms dominate the computation:
• Most frequent term ("said"): 3%
• Most frequent 10 terms: 15%
• Most frequent 100 terms: 57%
• Most frequent 1,000 terms: 95%
[Plot: document frequency (df) vs. term rank; the dominating terms are only ~0.1% of total terms, hence a 99.9% df-cut]
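The df-cut itself is simple: drop the postings lists of the most frequent terms before generating any pairs. A sketch under the paper's 99.9% cut point; function and variable names are mine:

```python
def df_cut(index, keep_fraction=0.999):
    # Sort terms by document frequency (postings-list length), ascending,
    # and drop the top (1 - keep_fraction); e.g. a 99.9% df-cut removes
    # the 0.1% most frequent terms, which dominate the O(df^2) cost.
    ranked = sorted(index, key=lambda entry: len(entry[1]))
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep]

# On the 4-term toy index, a 75% cut drops "clinton", the highest-df term:
print([term for term, _ in df_cut(index, keep_fraction=0.75)])
# ['barack', 'romney', 'obama']
```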
Efficiency (Disk Space)
• Aquaint-2 collection, ~906k docs
• 8 trillion intermediate pairs without the df-cut vs. 0.5 trillion intermediate pairs with it
• Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Effectiveness (ACL '08)
• Drop 0.1% of terms: "near-linear" growth, output fits on disk, costs 2% in effectiveness
• Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Cross-Lingual Pairwise Similarity
• Find similar document pairs in different languages
• Multilingual text mining, machine translation
• Application: automatic generation of potential "interwiki" language links
• More difficult than monolingual!
Vocabulary Space Matching
[Diagram: two ways to put a German document A and an English document B into the same vocabulary space. MT: translate document A into English, then build both document vectors in English. CLIR: build document A's vector in German, then project it into the English vocabulary space]
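A sketch of the CLIR-style projection: each source-language term's weight is spread over its translations according to a translation probability table. The tiny table below is a made-up stand-in for one learned from parallel text:

```python
def clir_project(doc_vector, translation_table):
    # Project a German document vector into the English vocabulary space:
    # w(e) = sum over German terms f of w(f) * P(e | f)
    projected = {}
    for f_term, weight in doc_vector.items():
        for e_term, prob in translation_table.get(f_term, {}).items():
            projected[e_term] = projected.get(e_term, 0.0) + weight * prob
    return projected

# Illustrative translation probabilities (not real values):
table = {"preis": {"prize": 0.7, "price": 0.3}, "buch": {"book": 1.0}}
print(clir_project({"preis": 0.5, "buch": 0.2}, table))
# {'prize': 0.35, 'price': 0.15, 'book': 0.2}
```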
Locality-Sensitive Hashing (LSH)
• Cosine score is a good similarity measure, but expensive!
• LSH is a method for effectively reducing the search space when looking for similar pairs
• Each vector is converted into a compact representation, called a signature
• A sliding-window algorithm uses these signatures to search for similar articles in the collection
• Vectors close to each other are likely to have similar signatures
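A sketch of signature generation via random projection, one of the three schemes named on the overview slide that follows (the vocabulary, bit count, and vectors are illustrative): each of the random hyperplanes contributes one bit, and vectors with a small angle between them tend to agree on most bits.

```python
import random

def make_hyperplanes(vocabulary, n_bits, seed=0):
    # One random Gaussian hyperplane per signature bit.
    rng = random.Random(seed)
    return [{term: rng.gauss(0, 1) for term in vocabulary} for _ in range(n_bits)]

def random_projection_signature(vector, hyperplanes):
    # One bit per hyperplane: 1 if the vector lies on its positive side.
    bits = []
    for plane in hyperplanes:
        dot = sum(vector.get(term, 0.0) * w for term, w in plane.items())
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)

vocab = ["nobel", "prize", "book"]
planes = make_hyperplanes(vocab, n_bits=8)
sig_a = random_projection_signature({"nobel": 0.324, "prize": 0.227}, planes)
sig_b = random_projection_signature({"nobel": 0.310, "prize": 0.250}, planes)
# Similar vectors agree on most bits; Hamming distance estimates the angle.
print(sig_a, sig_b)
```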
Solution Overview
[Pipeline: N_f German articles are projected via CLIR and N_e English articles are preprocessed, giving N_e + N_f English document vectors (e.g. <nobel=0.324, prize=0.227, book=0.01, …>); signature generation (random projection / minhash / simhash) turns these into N_e + N_f signatures (e.g. 11100001010); a sliding-window algorithm over the signatures produces similar article pairs]
MapReduce 1: Table Generation Phase
[Diagram: the signature list is permuted Q times (permutations p_1 … p_Q); each permuted table S_1' … S_Q' is then sorted, yielding Q sorted signature tables]
MapReduce 2: Detection Phase
[Diagram: each sorted table is split into chunks; within each chunk, a window slides over neighboring signatures, and nearby signatures are checked as candidate similar pairs]
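A single-machine sketch of the two phases; the number of tables, window size, and Hamming threshold below are arbitrary choices of mine, and the real implementation distributes the table generation and detection with MapReduce:

```python
import random

def detect_similar(signatures, n_tables=4, window=3, max_hamming=2, seed=0):
    rng = random.Random(seed)
    n_bits = len(next(iter(signatures.values())))
    candidates = set()
    for _ in range(n_tables):
        # Phase 1: permute the bit positions of every signature, then sort.
        perm = rng.sample(range(n_bits), n_bits)
        table = sorted((tuple(sig[i] for i in perm), doc)
                       for doc, sig in signatures.items())
        # Phase 2: slide a window over the sorted table; nearby rows are
        # candidate pairs, verified by Hamming distance on the signatures.
        for i, (sig_i, doc_i) in enumerate(table):
            for sig_j, doc_j in table[i + 1:i + window]:
                if sum(a != b for a, b in zip(sig_i, sig_j)) <= max_hamming:
                    candidates.add(tuple(sorted((doc_i, doc_j))))
    return candidates

sigs = {"de_doc": (1, 0, 1, 1, 0, 0, 1, 0),
        "en_doc": (1, 0, 1, 1, 0, 1, 1, 0),
        "other":  (0, 1, 0, 0, 1, 1, 0, 1)}
print(detect_similar(sigs))  # {('de_doc', 'en_doc')}
```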
Evaluation
• Ground truth: sample 1,064 German articles with cosine score >= 0.3
• Compare the sliding-window algorithm with the brute-force approach
• Brute force is required for the exact solution
• It serves as an upper-bound reference for recall and running time
Evaluation
• 95% recall at 39% of the brute-force cost
• 99% recall at 62% of the cost
• No free lunch!
Contribution to Wikipedia (SIGIR '11)
• Identify links between German and English Wikipedia articles
• "Metadaten" → "Metadata", "Semantic Web", "File Format"
• "Pierre Curie" → "Marie Curie", "Pierre Curie", "Helene Langevin-Joliot"
• "Kirgisistan" → "Kyrgyzstan", "Tulip Revolution", "2010 Kyrgyzstani uprising", "2010 South Kyrgyzstan riots", "Uzbekistan"
• Results are poor when the two articles differ significantly in length.
Roadmap: CIKM 2011
Approximate Positional Indexes
• "Learning to rank" models learn effective ranking functions from term positions and proximity features
• Exact term positions: large index, slow query evaluation (✗)
• Approximate positions: smaller index, faster query evaluation (✓)
• Is close enough good enough?
Variable-Width Buckets
• 5 buckets per document: every document is split into the same number of buckets, so bucket width varies with document length
[Diagram: documents d1 and d2, each divided into buckets 1 to 5]
Fixed-Width Buckets
• Buckets of length W: every document is split into fixed-width buckets, so the number of buckets varies with document length
[Diagram: d1 divided into buckets 1 to 3, d2 into buckets 1 to 5]
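A sketch of the two approximations: a term's exact position is replaced by its bucket id, which is all the index then needs to store. The bucket counts mirror the slides; function and variable names are mine:

```python
def variable_width_bucket(position, doc_length, n_buckets=5):
    # Same number of buckets per document; width scales with doc length.
    return position * n_buckets // doc_length  # bucket id in [0, n_buckets)

def fixed_width_bucket(position, width):
    # Same bucket width everywhere; longer documents get more buckets.
    return position // width

# Term at position 37 of a 100-term document:
print(variable_width_bucket(37, 100))    # 1 (second of 5 buckets)
print(fixed_width_bucket(37, width=25))  # 1 (covers positions 25..49)
```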
Effectiveness (CIKM '11)
Roadmap: SIGIR '11, iHadoop
Test Collections
• Documents, queries, and relevance judgments
• An important driving force behind IR innovation
• Without test collections, it's impossible to:
• Evaluate search systems
• Tune ranking functions / train models
• Traditional methodologies: exhaustive judging, pooling
• Recent methodologies: behavioral logging (query logs, click logs, etc.), minimal test collections, crowdsourcing
Web Graph
[Diagram: pages P1 to P7 connected by hyperlinks, many carrying the anchor text "web search"; example links point to a SIGIR 2012 page and to Google]
Queries and Judgments?
• Anchor text lines ≈ pseudo queries
• Target pages ≈ relevant candidates
• Plus noise reduction
[Diagram: the same web graph; anchor text such as "web search" pointing at pages like Google and Bing yields (pseudo query, relevant page) pairs]
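A sketch of the idea: mine (anchor text, target page) pairs from the link graph, aggregate them, and keep only well-supported pairs as pseudo judgments. The minimum-support threshold here is an illustrative stand-in for noise reduction, not the paper's actual scheme:

```python
from collections import Counter

def pseudo_test_collection(links, min_support=2):
    # links: (source_page, anchor_text, target_page) triples from the graph.
    support = Counter((anchor.lower(), target) for _, anchor, target in links)
    # Noise reduction (simplified): keep pairs endorsed by enough links.
    return [(anchor, target) for (anchor, target), n in support.items()
            if n >= min_support]

links = [("P1", "web search", "Google"), ("P2", "web search", "Google"),
         ("P4", "web search", "Bing"), ("P3", "SIGIR 2012", "P5")]
print(pseudo_test_collection(links))
# [('web search', 'Google')]  ->  pseudo query + relevant candidate
```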
Roadmap: CloudCom 2011
Iterative MapReduce Applications
• Many machine learning and data mining applications: PageRank, k-means, HITS, …
• Every iteration has to wait until the previous iteration has written its output completely to the DFS (unnecessary waiting time)
• Every iteration starts by reading from the DFS what was just written by the previous iteration (wastes CPU time, I/O, and bandwidth)
• MapReduce is not designed to run iterative applications efficiently
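A sketch of the pattern being criticized: a driver that runs one job per iteration, with a synchronous write and re-read between iterations (local files stand in for HDFS, and the update rule is illustrative, not a real PageRank):

```python
import json

def run_iteration(state):
    # One "MapReduce job": an illustrative PageRank-style score update.
    return {page: 0.5 * score + 0.5 for page, score in state.items()}

state = {"P1": 1.0, "P2": 1.0}
for i in range(3):
    # Barrier: iteration i+1 cannot start until iteration i has fully
    # written its output to the DFS...
    with open(f"iter_{i}.json", "w") as f:
        json.dump(run_iteration(state), f)
    # ...and then begins by reading back exactly what was just written,
    # which is the wasted I/O the asynchronous pipeline eliminates.
    with open(f"iter_{i}.json") as f:
        state = json.load(f)
print(state)
```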
Asynchronous Pipeline (CloudCom '11)
Conclusion
• MapReduce allows large-scale processing over web data
• Ivory: E2E open-source IR engine for research
  • Completely on Hadoop; even retrieval is served from HDFS
  • Efficiency-effectiveness tradeoff
• Cross-lingual pairwise similarity
  • Efficient implementation using MapReduce
  • Efficiency-effectiveness tradeoff
• Approximate positional indexes: efficient, and as effective as exact positions
• Pseudo test collections: possible! Effective for training L2R models
• MapReduce is not good for iterative algorithms
http://ivory.cc
Collaborators: Jimmy Lin, Don Metzler, Doug Oard, Ferhan Ture, Nima Asadi, Lidan Wang, Eslam Elnikety, Hany Ramadan
Thank You! Questions?