Predictive Parallelization: Taming Tail Latencies in Web Search
Myeongjae Jeon, Saehoon Kim, Seung-won Hwang, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner
Microsoft Research, POSTECH, Rice University
Performance of Web Search
1) Query response time
• Answer users quickly (e.g., within 300 ms)
2) Response quality (relevance)
• Provide highly relevant web pages
• Quality improves with the resources and time consumed
Focus: improving response time without compromising quality
Background: Query Processing Stages
Pipeline: Query → doc index search (100s–1000s of good matching docs) → 2nd-phase ranking (10s of the best matching docs) → snippet generator (a few sentences for each doc) → Response
Example SLA: 300 ms latency
Focus: stage 1 (doc index search)
Goal
Speed up index search (stage 1) without compromising result quality
• Improve user experience
• Serve a larger index
• Afford a more sophisticated 2nd phase
How Index Search Works
• Partition all web pages across index servers (massively parallel)
• Distribute query processing across the partitions (embarrassingly parallel)
• Aggregate the per-server top-k pages into the global top-k
Problem: a single slow server makes the entire cluster slow
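The scatter-gather pattern on this slide can be sketched as follows. This is a toy model, not the production system: the `score` function, the data layout, and the thread-based fan-out are all illustrative assumptions.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def score(query, doc):
    # Toy relevance score: number of shared terms (illustrative only).
    return len(set(query.split()) & set(doc.split()))

def search_partition(partition, query, k):
    # Each index server scores its own shard and returns a local top-k.
    return heapq.nlargest(k, ((score(query, d), d) for d in partition))

def aggregate_top_k(partitions, query, k):
    # The aggregator fans the query out to all partitions in parallel
    # and merges the per-server top-k lists into a global top-k.
    with ThreadPoolExecutor() as pool:
        locals_ = pool.map(lambda p: search_partition(p, query, k), partitions)
    return heapq.nlargest(k, (pair for local in locals_ for pair in local))
```

Because `aggregate_top_k` waits for every partition, its latency is the maximum of the per-server latencies, which is exactly why one slow server slows the whole cluster.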
Observation
• A query is processed on every server, so response time is determined by the slowest one
• We need to reduce the tail latencies of individual servers
Examples
• A long query (outlier) at one index server delays the aggregator, turning a fast response into a slow one
• Terminating the long query in the middle of processing → fast response, but quality drops
Parallelism for Tail Reduction
• Opportunity: tails are few, and idle cores are available
• Challenge: tails are very long, and the workload is CPU-intensive
(Figures: latency distribution; latency breakdown at the 99th percentile)
Predictive Parallelism for Tail Reduction
• Short queries: many, but almost no speedup from parallelization
• Long queries: few, and good speedup
Predictive Parallelization Workflow
• At each index server, an execution-time predictor predicts the (sequential) execution time of the query with high accuracy
Predictive Parallelization Workflow
• Using the predicted time, a resource manager selectively parallelizes long queries and runs short queries sequentially
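The resource-manager decision described above amounts to a threshold test. A minimal sketch, where the 80 ms cutoff matches the long-query definition used elsewhere in the talk, but the parallelism degree of 8 is an assumption for illustration:

```python
def plan_query(query, predictor, threshold_ms=80, degree=8):
    # Resource-manager sketch: only queries predicted to run longer
    # than the threshold get multiple cores; the rest run sequentially.
    # `predictor` returns an estimated sequential execution time in ms.
    return degree if predictor(query) > threshold_ms else 1
```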
Predictive Parallelization
Focus of today's talk:
• Prediction: identifying long queries through machine learning
• Parallelization: running long queries in parallel with high efficiency
Brief Overview of the Predictor
• In our workload, 4% of queries take > 80 ms
• At least 3% of queries must be identified as long (75% recall), with high precision
• Prediction overhead must be 0.75 ms or less
• Existing approaches: lower accuracy and higher cost
Accuracy: Predicting Early Termination
• Documents in each inverted index (e.g., for "SIGIR") are sorted by static rank, highest to lowest
• Only a limited prefix of the list contributes to the top-k relevant results; the rest is not evaluated
• The size of that prefix depends on the keyword (more exactly, on its score distribution)
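Early termination over a static-rank-ordered posting list can be sketched like this. The fixed `budget` stands in for the keyword-dependent stopping rule the slide alludes to; `score_fn` and the budget are illustrative assumptions, not the engine's actual scorer.

```python
import heapq

def top_k_with_early_termination(postings, k, score_fn, budget):
    # `postings` is sorted by static rank (best first); traversal
    # stops after `budget` documents, so low-static-rank docs at the
    # tail of the list are never evaluated.
    heap = []  # min-heap holding the current top-k (score, doc_id)
    for i, doc_id in enumerate(postings):
        if i >= budget:
            break  # early termination: remaining docs not evaluated
        s = score_fn(doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True)
```

The predictor's job, in these terms, is to estimate how much work such a traversal will do for a given keyword before it actually runs.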
Space of Features
• Term features [Macdonald et al., SIGIR 12]
• IDF, NumPostings
• Score aggregates (arithmetic, geometric, and harmonic means; max; variance; gradient)
• Query features
• NumTerms (before and after rewriting)
• Relaxed
• Language
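As a concrete example of the score-aggregate term features listed above, per-term IDF values can be collapsed into query-level statistics. This is a sketch in the spirit of those features; the geometric and harmonic means assume positive inputs, and the exact feature definitions are the paper's, not this code's.

```python
import math
from statistics import mean, variance

def term_features(idfs):
    # Aggregate per-term IDF values into query-level features:
    # arithmetic/geometric/harmonic means, max, and variance.
    geo = math.exp(mean(math.log(x) for x in idfs))
    har = len(idfs) / sum(1.0 / x for x in idfs)
    return {
        "arith_mean": mean(idfs),
        "geo_mean": geo,
        "harm_mean": har,
        "max": max(idfs),
        "var": variance(idfs) if len(idfs) > 1 else 0.0,
    }
```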
New Features: Query
• Queries in modern search engines carry rich clues:
<Fields related to query execution plan> rank=BM25F enablefresh=1 partialmatch=1 language=en location=us ….
<Fields related to search keywords> SIGIR (Queensland or QLD)
Space of Features
• All features are cached to ensure responsiveness (avoiding disk access)
• Term features require a 4.47 GB memory footprint (for 100M terms)
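A quick back-of-envelope check of the quoted footprint (assuming GB here means GiB, which is an assumption on my part):

```python
# 4.47 GiB of cached term features spread over 100M terms
# works out to roughly 48 bytes per term.
bytes_total = 4.47 * 2**30            # 4.47 GiB in bytes
per_term = bytes_total / 100_000_000  # bytes per term
assert 47 < per_term < 49             # ~48 bytes/term
```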
Feature Analysis and Selection
• The accuracy gain of each feature in the boosted regression tree suggests a cheaper feature subset
Prediction Performance
• Query features are important
• Using cheap features is advantageous: IDF (from the term features) plus query features gives much smaller overhead (90+% less) with accuracy similar to using all features
(A = actual long queries, P = predicted long queries)
Algorithms
• Classification vs. regression: comparable accuracy, but regression offers more flexibility (a continuous time estimate supports any threshold)
• Algorithms evaluated:
• Linear regression
• Gaussian process regression
• Boosted regression tree
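A minimal sketch of training a boosted-regression-tree predictor, the third algorithm above. The features and latencies here are synthetic, scikit-learn is assumed as the library, and the hyperparameters are illustrative, not the paper's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic training data: 3 query-level features per query,
# with latency (ms) driven mostly by the first feature plus noise.
X = rng.random((500, 3))
y = 20 + 200 * X[:, 0] + rng.normal(0, 5, 500)

# Boosted regression tree mapping features -> predicted latency (ms).
model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
model.fit(X, y)
pred = model.predict(X[:1])
```

Because the model outputs a continuous time estimate, the long-query threshold (e.g., 80 ms) can be applied after training, which is the flexibility argument for regression over classification.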
Accuracy of Algorithms
Summary:
• 80% of long queries (> 80 ms) identified
• 0.6% of short queries mispredicted
• 0.55 ms prediction time, with low memory overhead
Predictive Parallelism
• Key idea: parallelize only long queries, using a threshold on predicted execution time
• Evaluation: compare Predictive against three baselines
• Sequential
• Fixed
• Adaptive
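The four policies compared above can be contrasted in one function. This is an illustrative sketch: the threshold, the parallelism degree `p`, and the load cutoff for the adaptive baseline are assumptions, not the evaluated configurations.

```python
def degree_of_parallelism(policy, predicted_ms, load, threshold=80, p=8):
    # Illustrative comparison of the four evaluated policies.
    if policy == "sequential":
        return 1                        # never parallelize
    if policy == "fixed":
        return p                        # parallelize every query
    if policy == "adaptive":
        return p if load < 0.5 else 1   # react to system load only
    if policy == "predictive":
        return p if predicted_ms > threshold else 1  # this paper
    raise ValueError(policy)
```

The fixed and adaptive baselines pay parallelization overhead on the many short queries; the predictive policy spends cores only where they buy real speedup.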
99th-Percentile Response Time
• 50% throughput increase
• Outperforms "parallelize all"
Related Work
• Search query parallelism
• Fixed parallelization [Frachtenberg, WWWJ 09]
• Adaptive parallelization using system load only [Raman et al., PLDI 11]
→ High overhead due to parallelizing all queries
• Execution time prediction
• Keyword-specific features only [Macdonald et al., SIGIR 12]
→ Lower accuracy and high memory overhead for our target problem
Thank You! Your query to Bing is now parallelized if predicted as long. Execution time predictor query Resource manager long short