Peek Partitioner: Leveraging Sampling to Improve TotalOrderPartitioner
Alex Edelsburg, Eric Wheeler
Problems and Background
• Sampling Harness
• TotalOrderPartitioner samples on K1 (the map input key)
• Accuracy and load balancing demand sampling on K2 (the map output key)
Goals
• Use preliminary sampling job to profile normal operation of mappers and construct partitions
• Discover possible crossover points with runtime vs. sampling fraction
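As a rough illustration of the first goal, the sketch below shows one way partition boundaries could be built from a sample of map output keys (K2) and then used to route keys to reducers. It is plain Python rather than Hadoop code. The names `split_points` and `partition_for` are hypothetical, and picking boundaries at evenly spaced ranks of the sorted sample is an assumption, not necessarily the rule the authors used.

```python
import bisect


def split_points(sampled_keys, num_reducers):
    """Pick num_reducers - 1 boundaries at evenly spaced ranks of the sorted sample.

    Assumption: evenly spaced ranks are a reasonable stand-in for the
    partitioning rule; the slides do not spell out the exact rule.
    """
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    ordered = sorted(sampled_keys)
    if not ordered:
        return []
    points = []
    for i in range(1, num_reducers):
        # Rank of the i-th boundary within the sorted sample.
        rank = min(len(ordered) - 1, (i * len(ordered)) // num_reducers)
        points.append(ordered[rank])
    return points


def partition_for(key, points):
    """Return the index of the reducer whose key range contains `key`.

    Keys below the first boundary go to reducer 0; a key equal to a
    boundary goes to the reducer on the right of that boundary.
    """
    return bisect.bisect_right(points, key)


if __name__ == "__main__":
    sample = list(range(100))          # stand-in for sampled K2 values
    points = split_points(sample, 4)
    print(points)                      # [25, 50, 75]
    print(partition_for(10, points))   # 0
    print(partition_for(99, points))   # 3
```

Because the boundaries are sorted, every key sent to reducer i is no larger than any key sent to reducer i + 1, so concatenating the reducer outputs in order gives a totally ordered result.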
Results – Worst Case
• Naïve: 559 seconds
• Sampler: 778 seconds
Results – Best Case
• Naïve: 109 seconds
• Sampler: 92 seconds
Results – Teragen (Duke)
• Naïve: 77 seconds
• Sampler: 123 seconds
Results – Teragen (EC2)
• Naïve: 395 seconds
• Sampler: 924 seconds
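To make the four results easier to compare, the short script below computes the sampler-to-naïve runtime ratio for each case. The runtimes are copied from the results slides; the helper name `sampler_to_naive_ratio` is hypothetical, and the ratio is only one possible way to summarize the numbers.

```python
# Runtimes in seconds from the results slides: (naive, sampler).
RUNTIMES = {
    "Worst case": (559, 778),
    "Best case": (109, 92),
    "Teragen (Duke)": (77, 123),
    "Teragen (EC2)": (395, 924),
}


def sampler_to_naive_ratio(naive_seconds, sampler_seconds):
    """Ratio above 1.0 means the sampler run was slower than the naive run."""
    return sampler_seconds / naive_seconds


for name, (naive, sampler) in RUNTIMES.items():
    print(f"{name}: {sampler_to_naive_ratio(naive, sampler):.2f}x")

# Prints:
# Worst case: 1.39x
# Best case: 0.84x
# Teragen (Duke): 1.60x
# Teragen (EC2): 2.34x
```

On these figures the sampler run is faster only in the best case, where it takes roughly 16% less time than the naïve run.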
Conclusions
• Improved load balance
• Comparable runtimes
• Room at the bottom (bi-level sampling)
• Tend to do better with more reducers (parallelism)
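The slides do not say how load balance was measured. One simple way to quantify it, shown here purely as an assumption, is the ratio of the heaviest reducer's record count to the mean record count. The names `reducer_loads` and `imbalance` are hypothetical, and the key lists are toy data, not results from the experiments.

```python
import bisect
from collections import Counter


def reducer_loads(keys, points):
    """Count how many keys land on each reducer for the given sorted boundaries."""
    counts = Counter(bisect.bisect_right(points, key) for key in keys)
    return [counts.get(i, 0) for i in range(len(points) + 1)]


def imbalance(loads):
    """Heaviest load divided by mean load; 1.0 means a perfectly even split."""
    total = sum(loads)
    if total == 0:
        return 1.0
    return max(loads) / (total / len(loads))


if __name__ == "__main__":
    keys = [1, 2, 3, 4, 5, 6, 7, 8]           # toy map output keys
    even = reducer_loads(keys, [3, 5, 7])     # boundaries that fit the data
    skewed = reducer_loads(keys, [7, 8, 9])   # boundaries that do not
    print(even, imbalance(even))              # [2, 2, 2, 2] 1.0
    print(skewed, imbalance(skewed))          # [6, 1, 1, 0] 3.0
```

A metric like this makes the trade-off explicit: the slowest reducer bounds the job's finish time, so boundaries that better match the K2 distribution push the ratio toward 1.0.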
Future Work
• Investigate other variable axes (row sampling fraction etc.)
• Improve code space efficiency
• Leverage sampling job to investigate mapper behavior
• Comparison against custom InputFormat/InputSampler
• Local/Cluster Mode
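The slides mention bi-level sampling and a row sampling fraction without defining them. One plausible reading, offered only as an assumption, is to first sample a fraction of the input splits and then sample a fraction of the rows inside each chosen split. The function `bilevel_sample` and its parameters are hypothetical and model splits as plain Python lists.

```python
import random


def bilevel_sample(splits, split_fraction, row_fraction, seed=0):
    """Sample splits first, then rows within each chosen split.

    Assumption: this two-stage scheme is one interpretation of
    "bi-level sampling"; the slides do not define the term.
    """
    if not 0.0 <= split_fraction <= 1.0 or not 0.0 <= row_fraction <= 1.0:
        raise ValueError("fractions must lie in [0, 1]")
    if not splits:
        return []
    rng = random.Random(seed)
    # Level 1: choose which splits to read at all (at least one).
    num_splits = min(len(splits), max(1, round(split_fraction * len(splits))))
    chosen = rng.sample(splits, num_splits)
    # Level 2: keep each row of a chosen split with probability row_fraction.
    sample = []
    for rows in chosen:
        sample.extend(row for row in rows if rng.random() < row_fraction)
    return sample


if __name__ == "__main__":
    splits = [list(range(i * 10, i * 10 + 10)) for i in range(4)]
    everything = bilevel_sample(splits, 1.0, 1.0)
    print(len(everything))                        # 40
    print(len(bilevel_sample(splits, 0.5, 1.0)))  # 20
```

Treating the two fractions as separate knobs matches the idea of exploring runtime against sampling fraction along more than one axis: the split fraction controls how much input is opened, while the row fraction controls how much of it is read.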