Survey on Programming and Tasking in Cloud Computing Environments PhD Qualifying Exam Zhiqiang Ma Supervisor: Lin Gu Feb. 18, 2011
Outline • Introduction • Approaches • Application framework level approach • Language level approach • Instruction level approach • Our work: MRlite • Conclusion
Cloud computing • Internet services are the most popular applications nowadays • Millions of users • Computation is large and complex • Google already processed 20 TB of data in 2004 • Cloud computing provides massive computing resources • Available on demand • A promising model for processing large datasets housed on clusters
How to program and task? • Challenges • Parallelizing the execution • Scheduling the large-scale distributed computation • Handling faults • Achieving high performance • Ensuring fairness • Programming models for the Grid • Do not automatically parallelize users' programs • Pass the fault-tolerance work on to applications
Outline • Introduction • Approaches • Application framework level approach • Language level approach • Instruction level approach • Our work: MRlite • Conclusion
Approaches Application framework level
MapReduce • MapReduce: a parallel computing framework for large-scale data processing • Successfully used in datacenters comprising commodity computers • A fundamental piece of software in the Google architecture for many years • An open-source variant already exists: Hadoop • Widely used in solving data-intensive problems • [Diagram: MapReduce inside Google; Hadoop or its variants elsewhere]
MapReduce • Map and Reduce are higher-order functions • Map: apply an operation to all elements in a list • Reduce: like "fold"; aggregate the elements of a list • Example: 1² + 2² + 3² + 4² + 5² = ? • Map (m: x → x²) turns the list 1 2 3 4 5 into 1 4 9 16 25 • Reduce (r: +) folds over the squares starting from the initial value 0, producing the partial results 1, 5, 14, 30 and the final value 55
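As an illustration (a sketch added here, not from the original slides), the same map-then-fold computation written with Java's stream API:

```java
import java.util.stream.IntStream;

public class SquareSum {
  public static void main(String[] args) {
    // map: x -> x*x over 1..5, then reduce (fold) with +, starting from 0
    int total = IntStream.rangeClosed(1, 5)
                         .map(x -> x * x)          // 1 4 9 16 25
                         .reduce(0, Integer::sum); // partial sums 1, 5, 14, 30, 55
    System.out.println(total);                     // prints 55
  }
}
```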
MapReduce: massive parallel processing made simple • Example: word count • Map: parse a document and generate <word, 1> pairs • Reduce: receive all pairs for a specific word, and count

Map:  // D is a document
  for each word w in D:
    output <w, 1>

Reduce:  // for key w
  count = 0
  for each input item:
    count = count + 1
  output <w, count>
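For concreteness, a minimal sketch of this word count against Hadoop's Java MapReduce API (the classic org.apache.hadoop.mapreduce interfaces; the job-driver setup is omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);        // emit <word, 1>
      }
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();  // count all <word, 1> pairs for this word
      context.write(key, new IntWritable(sum));     // emit <word, count>
    }
  }
}
```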
MapReduce easily scales up • [Diagram: input files → map phase → intermediate files → reduce phase → output files]
Dryad • A general-purpose execution environment for distributed, data-parallel applications • Concentrates on throughput, not latency • An application written in Dryad is modeled as a directed acyclic graph (DAG) • Many programs can be represented as a distributed execution graph
Dryad • [Diagram: a Dryad job graph — inputs flow through processing vertices connected by channels (files, pipes, shared memory) to outputs]
Dryad • Concurrency arises from vertices running simultaneously across multiple machines • Vertex subroutines are usually quite simple sequential programs • Users have control over the communication graph • Each vertex can have multiple inputs and outputs
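Dryad's actual C++ graph-composition API is not shown in these slides; as a rough sketch (hypothetical Java types standing in for Dryad's interface), the vertex-and-channel model might be represented like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Dryad's model: a job is a DAG of vertices joined by channels.
class Vertex {
  final String name;                               // e.g. "grep", "sort"
  final List<Vertex> inputs = new ArrayList<>();   // upstream vertices
  final List<Vertex> outputs = new ArrayList<>();  // downstream vertices
  Vertex(String name) { this.name = name; }
}

class JobGraph {
  final List<Vertex> vertices = new ArrayList<>();
  Vertex add(String name) { Vertex v = new Vertex(name); vertices.add(v); return v; }
  // Connect two vertices with a channel (realized at runtime as a file, pipe, or shared memory).
  void channel(Vertex from, Vertex to) { from.outputs.add(to); to.inputs.add(from); }
}
```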
Approaches • Application framework level: automatically parallelizes users' programs, but programs must follow the specific model; users are relieved of the details of distributing the execution
Tasking of execution • Performance • Locality is crucial • Speculative execution • Fairness • The same cluster is shared by multiple users • Small jobs require short response times, while throughput matters for big jobs • Correctness • Fault tolerance
Locality and fairness • Locality is crucial • Bandwidth is a scarce resource • Input data, stored with replicas, resides on the same cluster that runs the computation • Fairness • Short jobs require short response times • Locality and fairness conflict with each other
FIFO scheduler in Hadoop • Jobs wait in a queue in priority order • FIFO by default • When there are available slots • Assign slots to tasks that have local data, in priority order • Limit the assignment of non-local tasks to optimize locality (sketched after the diagrams below)
FIFO scheduler • [Diagram: a JobQueue holding a 2-task job ahead of a 1-task job; free slots on Nodes 1–4 are assigned in FIFO order]
FIFO scheduler – locality optimization • Only dispatch one non-local task at a time • [Diagram: a 4-task job ahead of a 1-task job in the JobQueue; Node 4 is far away in the network topology]
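A minimal sketch of this policy (hypothetical Job/Task/Node interfaces for illustration, not Hadoop's actual JobTracker code):

```java
import java.util.List;

// Hypothetical sketch of FIFO scheduling with a cap on non-local dispatch.
interface Task {}

interface Job {
  Task pendingTaskWithDataOn(Node n); // null if no pending task has local data on n
  Task nextPendingTask();             // null if the job has no pending tasks
}

interface Node {
  int nonLocalAssignedRecently();     // how many non-local tasks this node took lately
}

class FifoScheduler {
  private static final int NON_LOCAL_CAP = 1; // dispatch at most one non-local task at a time

  Task assignSlot(List<Job> fifoQueue, Node freeNode) {
    // Pass 1: in FIFO/priority order, prefer a task whose input data is local to freeNode.
    for (Job job : fifoQueue) {
      Task local = job.pendingTaskWithDataOn(freeNode);
      if (local != null) return local;
    }
    // Pass 2: fall back to a non-local task, but limit how many are handed out.
    if (freeNode.nonLocalAssignedRecently() < NON_LOCAL_CAP) {
      for (Job job : fifoQueue) {
        Task any = job.nextPendingTask();
        if (any != null) return any;
      }
    }
    return null; // keep the slot idle rather than break locality further
  }
}
```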
Problem: fairness • [Diagram: under FIFO, a 3-task job at the head of the JobQueue occupies Nodes 1–4 while another 3-task job waits]
Problem: response time • [Diagram: a small job with only 1 task waits in the JobQueue behind two 3-task jobs occupying Nodes 1–4]
Fair scheduling • Assign free slots to the job that has the fewest running tasks • Strict fairness • Running jobs get a nearly equal number of slots • Small jobs finish quickly
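A greedy sketch of this rule (a simplified, hypothetical Job interface, added here for illustration):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of greedy fair sharing: each free slot goes to the
// runnable job that currently holds the fewest slots.
interface Job {
  int runningTasks();
  boolean hasPendingTask();
}

class FairScheduler {
  Optional<Job> pickJob(List<Job> jobs) {
    return jobs.stream()
               .filter(Job::hasPendingTask)
               .min(Comparator.comparingInt(Job::runningTasks));
  }
}
```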
Fair scheduling • [Diagram: slots on Nodes 1–4 are split nearly evenly between the jobs in the JobQueue]
Problem: locality • [Diagram: fair sharing forces a job to launch tasks on nodes that do not hold its input data]
Delay scheduling • Skip a job that cannot launch a local task • Relax fairness slightly • Allow a job to launch non-local tasks if it has been skipped long enough • Avoids starvation
Delay scheduling • Threshold: 2 • [Diagram: the job at the head of the JobQueue has its skip count rise 0 → 1 → 2 as other jobs launch local tasks on Nodes 1–4; at the threshold it may launch non-local tasks] • Waiting times are short because tasks finish quickly
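A sketch of the skip-count mechanism (hypothetical interfaces; the threshold value is illustrative):

```java
import java.util.List;

// Hypothetical sketch of delay scheduling: skip a job that cannot run a local
// task, but only up to a threshold, so no job starves.
interface Task {}
interface Node {}

interface Job {
  boolean hasLocalTaskOn(Node n);
  Task launchLocalTaskOn(Node n);
  Task launchAnyTask();
  int skipCount();
  void incrementSkipCount();
  void resetSkipCount();
}

class DelayScheduler {
  private static final int SKIP_THRESHOLD = 2; // times a job may be passed over

  Task assign(List<Job> jobsInFairnessOrder, Node freeNode) {
    for (Job job : jobsInFairnessOrder) {
      if (job.hasLocalTaskOn(freeNode)) {
        job.resetSkipCount();                 // got a local slot
        return job.launchLocalTaskOn(freeNode);
      } else if (job.skipCount() >= SKIP_THRESHOLD) {
        job.resetSkipCount();                 // relax locality to avoid starvation
        return job.launchAnyTask();
      } else {
        job.incrementSkipCount();             // skip this job for now
      }
    }
    return null;
  }
}
```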
“Fault” tolerance • Nodes fail • Re-run tasks • Nodes are slow (stragglers) • Run backup tasks (speculative execution) • To minimize the job's response time • Especially important for short jobs
Speculative execution • The scheduler schedules backup executions of the remaining in-progress tasks • A task is marked as completed whenever either the primary or the backup execution completes • Improves job response time by 44% according to Google's experiments
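The "whichever attempt finishes first wins" rule can be sketched with a single atomic flag (an illustrative fragment, not Google's or Hadoop's actual code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a task is "done" when either its primary or its backup
// attempt completes; the losing attempt's result is simply discarded.
class SpeculativeTask {
  private final AtomicBoolean completed = new AtomicBoolean(false);

  // Called by whichever attempt (primary or backup) finishes first.
  // Returns true only for the attempt whose result is kept.
  boolean tryComplete(String attemptId) {
    boolean won = completed.compareAndSet(false, true);
    if (won) System.out.println("attempt " + attemptId + " wins; task marked completed");
    return won;
  }
}
```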
Speculative execution mechanism • Seems a simple problem, but • Resources for speculative tasks are not free • How to choose nodes to run speculative tasks? • How to distinguish "stragglers" from nodes that are only slightly slower? • Stragglers should be identified early
Hadoop’s scheduler • Starts speculative tasks based on a simple heuristic • Compares each task's progress to the average • Assumes a homogeneous environment, where the default scheduler works well • The assumption breaks in virtualized "utility computing" environments, such as EC2 • How to robustly perform speculative execution (backup tasks) in heterogeneous environments?
Speculative execution in Hadoop • When there are no "higher priority" tasks, look for a task to execute speculatively • Assumption: there is no cost to launching a speculative task • Compare each task's progress to the average progress • Assumption: nodes perform similarly ("a slow node is faulty"; "nodes that ask for new tasks are fast") • In "utility computing", nodes may be only slightly (2–3x) slower, which need not hurt response time, and a node asking for tasks is not necessarily fast
Speculative execution in Hadoop • Threshold for speculative execution: (average progress score of each category of tasks) – 0.2 • Tasks below the threshold are treated as "equally slow" • Candidates are ranked by locality • The wrong task may be chosen: a 35%-completed 2x-slower task with data on an idle node, or a 5%-completed 10x-slower task? • Too many speculative tasks cause thrashing, taking resources away from useful tasks
Speculative execution in Hadoop • Progress score • Map: the fraction of input data read • Reduce: three phases (1/3 each), scaled by the fraction of data processed • Incorrect speculation of reduce tasks • The copy phase takes most of the time, but accounts for only 1/3 of the score • If 30% of the tasks finish quickly and 70% are still in the copy phase: avg. progress = 30%×1 + 70%×1/3 ≈ 53%, so the threshold is 33%; the copy-phase tasks score exactly 1/3 ≈ 33%, sit right at the threshold, and are never speculated
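A quick numeric check of that pitfall (an illustrative fragment reproducing the arithmetic above):

```java
// Reduce-task progress-score pitfall: copy-phase tasks land right at the threshold.
public class ProgressScore {
  public static void main(String[] args) {
    double finished = 0.30 * 1.0;        // 30% of reduce tasks are done (score 1.0)
    double copying  = 0.70 * (1.0 / 3);  // 70% are still in the copy phase (score 1/3)
    double avg = finished + copying;     // ≈ 0.533
    double threshold = avg - 0.2;        // ≈ 0.333
    // Copy-phase tasks score exactly 1/3 ≈ 0.333, so they are (wrongly)
    // never treated as stragglers.
    System.out.printf("avg=%.3f threshold=%.3f copyPhaseScore=%.3f%n",
                      avg, threshold, 1.0 / 3);
  }
}
```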
LATE • Longest Approximate Time to End • Principles • Rank candidates by the longest time to end • Choose the task that hurts the job's response time most; slow nodes can be utilized as long as that does not hurt the response time • Only launch speculative tasks on fast nodes • Not every node that asks for a task is fast • Cap speculative tasks • Limits resource contention and thrashing
LATE algorithm • If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running: • Ignore the request if the node's total progress is below SlowNodeThreshold (only launch speculative tasks on fast nodes) • Rank currently running tasks by estimated time left (longest time to end first) • Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
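A compact sketch of the policy (simplified, hypothetical interfaces; the time-left estimate follows the heuristic timeLeft = (1 − progress) / progressRate):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of LATE's selection rule.
interface RunningTask {
  double progressScore();   // in [0, 1]
  double elapsedSeconds();

  default double progressRate() { return progressScore() / elapsedSeconds(); }
  // If progress is 0 the rate is 0 and time left is +Infinity, which still ranks sensibly.
  default double estimatedTimeLeft() { return (1.0 - progressScore()) / progressRate(); }
}

class LateScheduler {
  int speculativeCap;        // max concurrent speculative tasks
  double slowNodeThreshold;  // requests from nodes slower than this are ignored
  double slowTaskThreshold;  // only tasks with a progress rate below this get a backup

  Optional<RunningTask> pickBackup(List<RunningTask> running, double nodeTotalProgress,
                                   int speculativeRunning) {
    if (speculativeRunning >= speculativeCap) return Optional.empty();  // cap reached
    if (nodeTotalProgress < slowNodeThreshold) return Optional.empty(); // slow node: ignore
    return running.stream()
        .filter(t -> t.progressRate() < slowTaskThreshold)              // genuinely slow tasks only
        .max(Comparator.comparingDouble(RunningTask::estimatedTimeLeft)); // longest time to end
  }
}
```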
Approaches • Application framework level: automatically parallelizes users' programs, but programs must follow the specific model; users are relieved of the details of distributing the execution • Language level
Language level approach • Programming frameworks • Still not clear and compact enough • Traditional programming languages • Give no special attention to high parallelism on large computing clusters • A new language • Clear, compact and expressive • Automatically parallelizes "normal" programs • A comfortable way for users to think about data processing problems on large distributed datasets
Sawzall • An interpreted, procedural, high-level programming language • Exploits high parallelism • Automates the analysis of very large data sets • Gives users a way to design distributed data processing programs clearly and expressively
Overall flow • Filtering • Analyzes each record individually • Expressed in Sawzall • Aggregation • Collates and reduces the intermediate values • Predefined aggregators • (Filtering corresponds to the Map phase; aggregation to the Reduce phase)
An example: find the most-linked-to page of each domain

max_pagerank_url: table maximum(1)[domain: string] of url: string weight pagerank: int;
doc: Document = input;
emit max_pagerank_url[domain(doc.url)] <- doc.url weight doc.pagerank;

• The aggregator keeps the highest-weighted value: it stores a url, indexed by domain, weighted by pagerank • input: a pre-defined variable initialized by Sawzall and interpreted into the Document type • emit: sends an intermediate value to the aggregator
Unusual features • Sawzall runs on one record at a time • Nothing in the language lets one input record influence another • The emit statement is the only output primitive • There is an explicit line between filtering and aggregation • This enables a high degree of parallelism, even though it is hidden from the language
Approaches • Application framework level: automatically parallelizes users' programs, but programs must follow the specific model; users are relieved of the details of distributing the execution • Language level: clearer and more expressive, with a more restrictive programming model; a comfortable way of programming • Instruction level
Instruction level approach • Provides an instruction-level abstraction and compatibility for users' applications • May choose a traditional ISA such as x86/x86-64 • Runs traditional applications without any modification • Makes it easier to migrate applications to cloud computing environments
Amazon Elastic Compute Cloud (EC2) • Provides virtual machines that run traditional OSes • Traditional programs work on EC2 unchanged • Amazon Machine Image (AMI) • Instances boot from an AMI • The unit of deployment: a packaged-up environment • Users design and implement the application logic in the AMI; EC2 handles the deployment and resource allocation
vNUMA • A virtual shared-memory multiprocessor machine built from commodity workstations • Makes the aggregated computational power available to legacy applications and OSes • [Diagram: virtualization maps VMs onto physical machines; vNUMA presents one shared-memory VM spanning several PMs]
Architecture • A hypervisor runs on each node • CPU: virtual CPUs are mapped to real CPUs on the nodes • Memory: divided between the nodes in equal-sized portions • Each node manages a subset of the pages
Memory mapping • The VM reads *a, where a is an address in the application's virtual address space • The OS translates a to the VM's physical address b • The VMM maps b to the real physical address c on the node that owns the page and performs the access (*c)
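A toy sketch of this two-level translation (hypothetical names and slice sizes; vNUMA's real implementation works in the hypervisor, not in Java):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: guest-virtual -> guest-physical (OS page table), then
// guest-physical -> owning node (hypervisor), since each node manages an
// equal-sized slice of the guest-physical address space.
public class VnumaTranslation {
  static final long PAGE = 4096;
  static final Map<Long, Long> guestPageTable = new HashMap<>(); // virtual page -> physical page
  static final int NODES = 4;
  static final long PAGES_PER_NODE = 1 << 20;                    // illustrative slice size

  static long toGuestPhysical(long va) {
    Long ppage = guestPageTable.get(va / PAGE);
    if (ppage == null) throw new IllegalStateException("page fault");
    return ppage * PAGE + va % PAGE;
  }

  // The hypervisor locates the node that owns a guest-physical page.
  static int owningNode(long gpa) {
    return (int) ((gpa / PAGE) / PAGES_PER_NODE) % NODES;
  }

  public static void main(String[] args) {
    guestPageTable.put(0x42L, 0x1234L);              // map one virtual page for the demo
    long gpa = toGuestPhysical(0x42L * PAGE + 8);
    System.out.println("guest-physical = " + gpa + ", served by node " + owningNode(gpa));
  }
}
```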
Approaches • Application framework level: automatically parallelizes users' programs, but programs must follow the specific model; users are relieved of the details of distributing the execution • Language level: clearer and more expressive, with a more restrictive programming model; a comfortable way of programming • Instruction level: supports traditional applications, but users handle the tasking themselves; hard to scale up