Meeting Service Level Objectives of Pig Programs • Zhuoyao Zhang, Ludmila Cherkasova, Abhishek Verma, Boon Thau Loo • University of Pennsylvania • Hewlett-Packard Labs
Advantages • Large amount of resources • Elasticity • Pay-as-you-go pricing model • Challenges • Distributed resources • Error-prone Cloud Environment
MapReduce and Pig • MapReduce: Simple and fault-tolerant framework for data processing in the cloud • Pig • Advanced MapReduce-based platform • Widely used: Yahoo!, Twitter, LinkedIn • Pig Latin: A high-level declarative language for expressing data analysis tasks as Pig programs • (Figure: a Pig program as a workflow of MapReduce jobs j1 to j7)
Motivation • Latency-sensitive applications • Personalized advertising • Spam and fraud detection • Real-time log analysis • How many resources does an application need to meet its deadline?
Contributions • Performance modeling for Pig programs • Given a Pig program, estimate its completion time as a function of the assigned resources • Deadline-driven resource allocation estimates for Pig programs • Given a completion time target, determine the amount of resources a Pig program needs to achieve it
Outline • Introduction • Building block • Performance model for single MapReduce jobs • Resource allocation for Pig programs • Evaluation • Conclusion and ongoing work
Theoretical Makespan Bounds • Bounds-based makespan estimates • n tasks, k servers • avg: average duration of the n tasks • max: maximum duration of the n tasks • Lower bound • Upper bound
Illustration • Schedule 1: 1432312 • Makespan = 4 • Lower bound = 4 • Schedule 2: 3123214 • Makespan = 7 • Upper bound = 8 • (Figure: both schedules drawn on rows 1 to 4)
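The bound formulas themselves did not survive on the slides, so the sketch below assumes the standard bounds for greedy task assignment: a lower bound of n·avg/k and an upper bound of (n−1)·avg/k + max. It also assumes that the digits of Schedule 2 are task durations in dispatch order and that there are 4 servers; under that reading the simulated greedy makespan is 7, as on the slide. The function names are hypothetical.

```python
import heapq

def greedy_makespan(durations, k):
    """Dispatch tasks in order, each to the server that becomes idle first."""
    free_at = [0.0] * k              # time at which each server becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)       # earliest idle server
        heapq.heappush(free_at, start + d)   # it is busy until start + d
    return max(free_at)

def makespan_bounds(durations, k):
    """Assumed greedy bounds: n*avg/k and (n-1)*avg/k + max."""
    n = len(durations)
    avg = sum(durations) / n
    lower = n * avg / k
    upper = (n - 1) * avg / k + max(durations)
    return lower, upper

# Assumption: Schedule 2 read as durations in dispatch order, 4 servers.
durations = [3, 1, 2, 3, 2, 1, 4]
makespan = greedy_makespan(durations, 4)
lower, upper = makespan_bounds(durations, 4)
print(makespan, lower, upper)
```

With these inputs the assumed upper-bound formula evaluates to roughly 7.43, which is below the value of 8 shown on the slide, so the slide may use a coarser or rounded form of the bound.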
Estimate Completion Time for Single MR Job • Estimate the bounds of the job completion time based on the job profile • Most production jobs are executed routinely on new data sets • Job profile built from previous runs • Map stage: Mavg, Mmax, AvgInputSize, Selectivity • Reduce stage: Shavg, Shmax, Ravg, Rmax, Selectivity • Predict the completion time of future runs from the profile
Estimate Completion Time for Single MR Job • Estimating bounds on the duration of map and reduce stages • Map stage duration depends on: • NM -- the number of map tasks • SM -- the number of map slots • Reduce stage duration depends on: • NR -- the number of reduce tasks • SR -- the number of reduce slots • Job duration TJlow, TJup, TJavg • Sum of the map and reduce stage durations
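A minimal sketch of how the profile statistics could feed the stage bounds. It assumes that each phase (map, shuffle, reduce) is bounded with the same greedy formulas, n·avg/slots and (n−1)·avg/slots + max, that the shuffle and reduce phases both run on the reduce slots, and that TJavg is the midpoint of the two bounds. The `JobProfile` class and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class JobProfile:
    n_map: int       # NM, number of map tasks
    n_reduce: int    # NR, number of reduce tasks
    m_avg: float     # Mavg
    m_max: float     # Mmax
    sh_avg: float    # Shavg
    sh_max: float    # Shmax
    r_avg: float     # Ravg
    r_max: float     # Rmax

def phase_bounds(n_tasks, slots, avg, mx):
    """Assumed greedy bounds for one phase run on `slots` slots."""
    low = n_tasks * avg / slots
    up = (n_tasks - 1) * avg / slots + mx
    return low, up

def job_bounds(p, s_map, s_reduce):
    """Sum the phase bounds; the average is taken as the midpoint (assumption)."""
    phases = [
        phase_bounds(p.n_map, s_map, p.m_avg, p.m_max),         # map stage
        phase_bounds(p.n_reduce, s_reduce, p.sh_avg, p.sh_max),  # shuffle phase
        phase_bounds(p.n_reduce, s_reduce, p.r_avg, p.r_max),    # reduce phase
    ]
    low = sum(lo for lo, _ in phases)
    up = sum(hi for _, hi in phases)
    return low, up, (low + up) / 2

# Hypothetical profile, durations in seconds.
profile = JobProfile(n_map=200, n_reduce=50, m_avg=10, m_max=20,
                     sh_avg=5, sh_max=8, r_avg=6, r_max=12)
t_low, t_up, t_avg = job_bounds(profile, s_map=20, s_reduce=10)
print(t_low, t_up, t_avg)
```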
Resource Allocation for Single MR Job • Given a deadline D and the job profile, find the minimal resources needed to complete the job within D • Inputs: the given number of map/reduce tasks and the statistics from the job profile • Find the values of SMJ and SRJ that minimize SMJ + SRJ, using Lagrange multipliers
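The slide's equation is not preserved, so the following sketch assumes a completion time model of the form T = a/SM + b/SR + c, where a and b collect the task counts and profile statistics of the map and reduce stages and c is a slot-independent term. Minimizing SM + SR subject to a/SM + b/SR = D − c with a Lagrange multiplier gives SM = √a(√a + √b)/(D − c) and SR = √b(√a + √b)/(D − c). The function name and the numbers are hypothetical.

```python
import math

def min_slots(a, b, c, deadline):
    """Closed-form minimizer of s_map + s_reduce under a/s_map + b/s_reduce + c = deadline."""
    budget = deadline - c            # time left for the slot-dependent terms
    if budget <= 0:
        raise ValueError("deadline is not reachable with any number of slots")
    root = math.sqrt(a) + math.sqrt(b)
    s_map = math.sqrt(a) * root / budget
    s_reduce = math.sqrt(b) * root / budget
    return s_map, s_reduce

# Hypothetical coefficients (seconds of work) and deadline (seconds).
s_map, s_reduce = min_slots(a=1600.0, b=400.0, c=20.0, deadline=120.0)
achieved = 1600.0 / s_map + 400.0 / s_reduce + 20.0
print(s_map, s_reduce, achieved)
```

The result is fractional in general; in practice each value would be rounded up to a whole number of slots, which can only shorten the estimated completion time.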
Outline • Introduction • Building block • Performance model for single MapReduce jobs • Resource allocation for Pig programs • Evaluation • Conclusion and ongoing work
Performance Model for Pig Programs • Let P = {J1, J2, …, JN}; extract the job profile of each job contained in P • Assign a unique name to each job within a program • The program completion time is the sum of the completion times of all the jobs contained in P
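The sum over jobs can be sketched as follows, assuming each job's completion time is modeled as a/SM + b/SR + c (an assumed form, with hypothetical coefficients) and that the jobs run one after another on the same slots. The dictionary keys stand in for the unique job names.

```python
def program_completion_time(job_models, s_map, s_reduce):
    """Sum the per-job estimates a/s_map + b/s_reduce + c over all jobs in P."""
    return sum(a / s_map + b / s_reduce + c
               for a, b, c in job_models.values())

# Hypothetical program with two uniquely named jobs: name -> (a, b, c).
program = {
    "P_J1": (1600.0, 400.0, 20.0),
    "P_J2": (800.0, 200.0, 10.0),
}
total = program_completion_time(program, s_map=24, s_reduce=12)
print(total)
```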
Resource Allocation for Pig Programs • Possible strategy: find an appropriate pair of map and reduce slots for each job in the program • Problem: difficult to implement and manage by the scheduler with
Resource Allocation for Pig Programs • A simpler and more elegant solution • Allocate the same set of resources to the entire program instead of to each job • Rewrite the previous equations in terms of the whole program • Find the minimum set of map and reduce slots (SMP, SRP) for the entire Pig program
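One way to read the program-level allocation, as a sketch under stated assumptions: if every job is modeled as a/SM + b/SR + c and all jobs share the same slots, the program time is (Σa)/SMP + (Σb)/SRP + Σc, so the single-job Lagrange solution applies to the summed coefficients. The model form, the function name, and the numbers are assumptions, not the slide's own equations.

```python
import math

def program_min_slots(job_models, deadline):
    """Minimize s_map + s_reduce for the whole program under a shared allocation."""
    a = sum(m[0] for m in job_models)    # summed map-side coefficients
    b = sum(m[1] for m in job_models)    # summed reduce-side coefficients
    c = sum(m[2] for m in job_models)    # summed slot-independent terms
    budget = deadline - c
    if budget <= 0:
        raise ValueError("deadline is not reachable with any number of slots")
    root = math.sqrt(a) + math.sqrt(b)
    return math.sqrt(a) * root / budget, math.sqrt(b) * root / budget

# Hypothetical per-job coefficients (a, b, c) and a program deadline in seconds.
jobs = [(1600.0, 400.0, 20.0), (800.0, 200.0, 10.0)]
s_map_p, s_reduce_p = program_min_slots(jobs, deadline=180.0)
program_time = sum(a / s_map_p + b / s_reduce_p + c for a, b, c in jobs)
print(s_map_p, s_reduce_p, program_time)
```

A single pair of slot counts keeps the scheduler's job simple: it reserves one allocation for the program's lifetime instead of resizing it between jobs.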
Experiment Setup • 66-node cluster in 2 racks • 4 AMD 2.39GHz cores • 8 GB RAM • two 160GB hard disks • Configuration • 1 jobtracker, 1 namenode, 64 worker nodes • 2 map slots and 1 reduce slot for each node
Benchmark • PigMix benchmark • 17 programs • 8 tables as the input data • Dataset • Test dataset • Generated with the PigMix data generator • Total size around 1TB • Experimental dataset • Same layout as the test dataset • 20% larger in size
Model Accuracy • How well does our performance model capture Pig program completion time? • (Figure: normalized results for predicted and measured completion time)
Meeting Deadlines • Are we meeting deadlines with our resource allocation model? • (Figure: PigMix executed on the experimental dataset: do we meet deadlines?)
Conclusion • Conclusion • The performance model can accurately estimate the completion time of MapReduce workflows • Enables automatic resource provisioning for MapReduce workflows with deadlines • Ongoing work • Refine the performance model for workflows with concurrent jobs • Incorporate failure scenarios in the current model