
ApproxHadoop: Bringing Approximations to MapReduce Frameworks

This paper introduces ApproxHadoop, an extension of Hadoop that brings approximate computing to MapReduce frameworks. It presents approximation mechanisms and error bounds grounded in statistical theory, and demonstrates how ApproxHadoop can meet target error bounds online, yielding significant time and energy savings with high accuracy.


Presentation Transcript


  1. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen

  2. Approximate computing
  • We’re producing more data than we can analyze
  • Many applications do not require precise outputs
  • Being precise is expensive
  • Approximate computation
    • Time and/or energy vs. accuracy
  [Charts: data warehouse growth in TB, growth rate = 173%; technology scaling, IEEE Design 2014]

  3. Data analytics using MapReduce
  • Example: process web access logs to extract top pages
  • MapReduce is a popular framework
    • User provides code (map and reduce)
    • Framework manages data access and parallel execution
    • Higher-level languages on top: Pig, Hive, …
  • Hadoop is deployed widely at large scale
    • Facebook: 30PB Hadoop clusters
    • Yahoo: 16 Hadoop clusters, >42,000 nodes
  [Photo: Yahoo Computing Coop]

  4. Our contributions
  • Approximations in MapReduce
    • Approximation mechanisms
    • Error bounds based on statistical theories
  • ApproxHadoop: implementation for Hadoop
    • Approximate common applications
    • Achieve target error bounds online
    • Large execution time and energy savings with high accuracy

  5. Approximations in MapReduce
  Why can we approximate with MapReduce?
  • Lines in a block have similarities
  • Blocks have similarities
  [Diagram: Blocks 1-4 feed Maps 1-4, which feed Reduces 1-2 producing Outputs 1-2]
  Example application: What is the average length of the lines of each color?

  6. Mechanisms and error bounds
  • Similarities allow for accurate approximations
  • Approximation mechanisms for MapReduce:
    • Drop map tasks
    • Sample input data
    • User-defined approximations (technical report)
  • Bound approximation errors using:
    • Multistage sampling for aggregation applications (e.g., sum, average, ratio)
    • Extreme value theory for extreme-value computations (e.g., min, max)

  7. Multistage sampling and MapReduce
  • Combines inter-/intra-cluster sampling techniques
    • Simple random sampling inside a block → data sampling
    • Cluster sampling between blocks → task dropping
  • Given sampling/dropping ratios and variances, compute error bounds at a given confidence level
  [Diagram: population divided into clusters]
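The bound described on this slide can be sketched numerically. The following is a minimal, self-contained two-stage sampling estimator (our sketch, not ApproxHadoop's code), under the simplifying assumption that every block holds the same number of items M and that at least two items are sampled per block: sampled blocks yield a population-total estimate plus a ±bound scaled by a user-supplied t-value for the desired confidence level.

```java
import java.util.List;

// Sketch of two-stage (multistage) sampling: blocks are clusters
// (stage 1, dropped blocks = task dropping), items inside a block are
// subsampled (stage 2, data sampling). Class and method names are ours.
class MultistageEstimate {
    // N: total blocks; M: items per block (assumed equal for simplicity).
    // sampledBlocks: for each processed block, the values of its sampled
    // items (each array must hold at least 2 items).
    // Returns {estimated population total, error bound at the given t}.
    public static double[] estimateSumWithBound(int N, int M,
                                                List<double[]> sampledBlocks,
                                                double tValue) {
        int n = sampledBlocks.size();
        double[] blockEstimates = new double[n];
        double withinTerm = 0.0;
        for (int i = 0; i < n; i++) {
            double[] sample = sampledBlocks.get(i);
            int m = sample.length;
            double mean = 0.0;
            for (double v : sample) mean += v;
            mean /= m;
            blockEstimates[i] = M * mean;        // expand to a block total
            double s2 = 0.0;                     // within-block variance
            for (double v : sample) s2 += (v - mean) * (v - mean);
            s2 /= (m - 1);
            withinTerm += (double) M * (M - m) * s2 / m;
        }
        double grand = 0.0;
        for (double e : blockEstimates) grand += e;
        double clusterMean = grand / n;
        double su2 = 0.0;                        // between-block variance
        for (double e : blockEstimates) su2 += (e - clusterMean) * (e - clusterMean);
        su2 /= (n - 1);
        double total = (double) N / n * grand;   // expand to the population
        // Standard two-stage cluster-sampling variance:
        //   N(N-n)*su2/n  +  (N/n) * sum_i M(M-m_i)*s_i^2/m_i
        double variance = (double) N * (N - n) * su2 / n
                        + (double) N / n * withinTerm;
        return new double[]{total, tValue * Math.sqrt(variance)};
    }
}
```

Note that when every block is processed (n = N) and fully read (m = M), both variance terms vanish and the "estimate" collapses to the exact sum, which matches the intuition that precise execution is the zero-error endpoint of the approximation spectrum.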

  8. Mapping multistage sampling to MapReduce
  • Block → cluster
  • Intra-cluster sampling = data sampling; inter-cluster sampling = task dropping
  • Track sampling ratios; use inter-/intra-cluster variances for each line color
  • Result: approximation with error bounds (Y±X%)
  [Diagram: Blocks 1-4 feed Maps 1-4, which feed Reduces 1-2 producing Outputs 1-2]
  Example application: What is the approximate average length of the lines of each color?
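As a concrete illustration of the intra-cluster stage, here is a minimal sketch of per-line sampling inside a map task. SampledMapper, sample(), and realizedRatio() are hypothetical names for illustration, not ApproxHadoop's actual classes; the key point is that the realized sampling ratio must be tracked and shipped to the reducer so results can be scaled back up and variances computed.

```java
import java.util.Random;

// Stage-2 sampling inside a map task: process each input line with
// probability p, and record the realized sampling ratio for the reducer.
class SampledMapper {
    private final double p;          // intra-block sampling ratio
    private final Random rng;
    private long seen = 0, kept = 0;

    SampledMapper(double p, long seed) {
        this.p = p;
        this.rng = new Random(seed);
    }

    // Returns true when the line should be passed to the user's map().
    boolean sample(String line) {
        seen++;
        if (rng.nextDouble() < p) {
            kept++;
            return true;
        }
        return false;
    }

    // The reducer uses the realized ratio to expand counts and to feed the
    // within-block variance term of the multistage-sampling bound.
    double realizedRatio() {
        return seen == 0 ? 0.0 : (double) kept / seen;
    }
}
```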

  9. Our contributions
  • Approximations in MapReduce
    • Approximation mechanisms
    • Error bounds based on statistical theories
  • ApproxHadoop: implementation for Hadoop
    • Approximate common applications
    • Achieve target error bounds online
    • Large execution time and energy savings with high accuracy

  10. Example: Using ApproxHadoop

  class WordCount:
    class WCMapper extends Mapper:
      void map(String key, String value):
        foreach word w in value:
          context.write(w, 1);
    class WCReducer extends Reducer:
      void reduce(String key, Iterator values):
        int result = 0;
        foreach int v in values:
          result += v;
        context.write(key, result);
    void main():
      setInputFormat(TextInputFormat);
      run();

  class ApproxWordCount:
    class ApproxWCMapper extends MultiStageSamplingMapper:
      void map(String key, String value):
        foreach word w in value:
          context.write(w, 1);
    class ApproxWCReducer extends MultiStageSamplingReducer:
      void reduce(String key, Iterator values):
        int result = 0;
        foreach int v in values:
          result += v;
        context.write(key, result);
    void main():
      setInputFormat(ApproxTextInputFormat);
      run();

  11. How to specify approximations?
  • User specifies the dropping/sampling ratios; ApproxHadoop calculates the error bound
  • Or the user specifies the target error bound
    • Example: maximum error (±1%) with a confidence level (95% confidence)
  [Flowchart: select dropping/sampling ratios → run first subset of tasks → target bound met? If no, run next subset of tasks and re-check; if yes, calculate final error bounds]
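The loop on this slide can be sketched as a simple feedback controller. The simulation below is ours, not ApproxHadoop's algorithm: it assumes a pilot wave at a 1% sampling ratio and an error bound that shrinks as 1/sqrt(ratio) (the usual behavior of sampling error as the sample grows), then solves for the ratio expected to meet the target and re-checks until it does.

```java
// Sketch of picking sampling ratios to hit a user-specified error target.
// errorBound() is a stand-in model, not a measurement.
class TargetErrorLoop {
    // Simulated bound: baseBound is the bound observed at 1% sampling.
    static double errorBound(double samplingRatio, double baseBound) {
        return baseBound * Math.sqrt(0.01 / samplingRatio);
    }

    // Run a 1% pilot wave, then repeatedly scale the ratio toward the
    // target (bound ~ 1/sqrt(ratio) implies ratio scales with the square
    // of the bound gap), capping at 1.0, i.e., falling back to precise.
    static double chooseRatio(double targetBound, double baseBound) {
        double ratio = 0.01;                   // pilot wave: 1% sample
        double bound = errorBound(ratio, baseBound);
        while (bound > targetBound && ratio < 1.0) {
            double scale = (bound / targetBound) * (bound / targetBound);
            ratio = Math.min(1.0, ratio * scale);
            bound = errorBound(ratio, baseBound);
        }
        return ratio;
    }
}
```

With a 5% bound observed at the 1% pilot wave, a ±1% target resolves to a 25% sampling ratio, while an unreachably tight target saturates at ratio 1.0 (precise execution), mirroring the "no sampling" regime in the evaluation.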

  12. Implementation: ApproxHadoop
  • Extends Hadoop 1.2.1
  • Implements approximation mechanisms
  • Extended reducers
    • Bound estimation
  • Incremental reducers
    • Tune sampling ratios
  • New data types
    • ApproxInteger
  [Diagram: Blocks 1-4 feed Maps 1-4, which feed Reduces 1-2 producing Outputs 1-2 with bounds Y±X%]

  13. Evaluation methodology
  • Datasets
    • Wikipedia access logs: 1 week with 4 billion accesses (216.9GB)
    • Wikipedia articles: 40GB in XML
    • Other applications and datasets in the paper
  • Metrics
    • Actual % error (approximation vs. precise)
    • Approximation with 95% confidence interval (e.g., 10±1%)
    • Run time
  • 20 runs reporting min, max, and average
  • Executions on 10- and 60-node clusters

  14. Example: Precise and approximate processing
  [Charts: Wikipedia project popularity with 1% sampling; Wikipedia article length with 1% sampling]
  • 1% input sampling introduces different errors in different applications
  • Actual values fall within the computed bounds

  15. User-specified input sampling ratio
  [Chart: Wikipedia project popularity, input sampling with no task dropping]
  • More than 30% run-time reduction for less than 0.1% error
  • Applications exhibit different speedups for the same ratios

  16. User-specified dropping/sampling ratios
  [Chart: Wikipedia project popularity with 25% task dropping]
  • More than 55% run-time reduction for less than 1% error
  • Task dropping increases errors significantly but also decreases run time

  17. User-specified target error
  [Chart: Wikipedia project popularity vs. target error; regions labeled no sampling, input data sampling, maximum sampling, and task dropping]
  • ApproxHadoop tunes the sampling/dropping ratios depending on the target

  18. Impact of input data size
  [Chart: run time (seconds) vs. compressed log size (GB); Wikipedia project popularity from 1 day (27GB) to 1 year (12.5TB)]
  • Larger input data brings larger savings (up to 32x)

  19. Conclusions
  • Apply statistical theories to MapReduce
    • Approximation mechanisms, such as input data sampling and task dropping
    • Applicable to (large) classes of analytics applications
  • Achieve target error bounds online with ApproxHadoop
    • Tradeoff between execution time and accuracy
  • Significant execution time reduction with high accuracy
  • Scales well for large datasets

  20. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen
