Robust Resource Allocation of DAGs in a Heterogeneous Multi-core System • Luis Diego Briceño, Jay Smith, H. J. Siegel, Anthony A. Maciejewski, Paul Maxwell, Russ Wakefield, Abdulla Al-Qawasmeh, Ron C. Chiang, and Jiayin Li • supported by the NSF under grants CNS-0615170 and CNS-0905399 • outline • motivation and introduction • system model • robustness • example of heuristic • results and conclusions
Motivation • need to execute applications on satellite data • satellite data is processed in a heterogeneous computing system • results are needed before a deadline • [Figure: satellite data enters a multi-core heterogeneous data processing system, where applications (app1, app2, ...) produce results before the deadline]
Problem Statement • multiple applications (for this presentation consider one) • each application is a DAG of tasks • a set of applications must complete before a deadline Δ • completion time of an application must be robust against uncertainties in the estimated execution time of its tasks • actual time is data dependent • goal: robust resource allocation of data and tasks to a heterogeneous multi-core system so that the applications meet the deadline Δ • [Figure: tasks t1,α to t7,α of application α drawn on a time axis against the deadline Δ]
Environment • consider a heterogeneous environment used to analyze satellite imaging • based on commodity hardware • these environments require analysis of large data sets • environment similar to systems in use at • National Center for Atmospheric Research (NCAR) • DigitalGlobe • static resource allocation • estimated time to compute a task is known in advance
Contributions • model and simulation of a complex multi-core-based data processing environment that executes data-intensive applications • multi-core machines • RAM management • hard drive management • parallel tasks • satellite data placement • a robustness metric for this environment • resource allocation heuristics to maximize robustness using this metric
System Model — Satellite Data Placement • satellite data is split into smaller subsets and distributed among the hard drives of the compute nodes • processing element (PE) is a core • PEj,k: PE k on compute node j (1 to 8 PEs per node) • PEs within a compute node are homogeneous • no multi-tasking within a PE • [Figure: satellite data distributed across the multi-core heterogeneous data processing system; compute node j contains HDj, RAMj, and PEj,1 … PEj,8]
System Model — Processing • tasks execute on processing elements (PEs) • required input data must be present in RAM to execute a task • input data sets are staged to RAM • task 1 (t1) can start execution • result is stored in RAMj • RAM space is limited • [Figure: compute node j with HDj, RAMj, and PEj,1 … PEj,8; satellite data on HDj and the input data sets are staged to RAMj, t1 runs on PEj,1, and the result of task 1 is kept in RAMj]
System Model — RAM Management • RAM has a fixed capacity • 160 Gbytes (based on the DigitalGlobe computer center) • assume 152 Gbytes are available for data • typical data sets range from 1 Gbyte to 32 Gbytes • data sets can be swapped out of RAM and back in if needed later • all input data sets must be in RAM before task execution • data sets must remain in RAM until execution is finished • space for the result must be reserved in RAM
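The RAM rules above can be checked with a few lines of Python. This is only a minimal sketch: the function name `fits_in_ram`, the dictionary layout, and the data set names are assumptions made for illustration; only the 152 Gbyte figure comes from the slide.

```python
def fits_in_ram(resident, inputs, result_gb, capacity_gb=152.0):
    """Return True if a task could start right now on this compute node.

    resident  -- {data set name: size in Gbytes} already held in RAM
    inputs    -- {data set name: size in Gbytes} required by the task
    result_gb -- space that must be reserved for the task's result
    """
    # Inputs already resident do not need extra space.
    missing = {name: size for name, size in inputs.items() if name not in resident}
    used = sum(resident.values())
    needed = sum(missing.values()) + result_gb
    return used + needed <= capacity_gb


# Hypothetical data sets: 132 Gbytes resident, 16 Gbytes still to stage.
resident = {"sat1": 32.0, "data2": 100.0}
inputs = {"sat1": 32.0, "data3": 16.0}
print(fits_in_ram(resident, inputs, result_gb=4.0))
print(fits_in_ram(resident, inputs, result_gb=5.0))
```

The check counts only the inputs that still have to be staged, because data sets that are already resident stay in RAM until execution finishes.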
System Model — Storage • satellite data sets allocated prior to task execution • two scenarios for satellite data allocation • determined by the heuristic • randomly assigned (pre-determined) • inter-task data is transmitted if destination is not equal to source
System Model — Applications • each application appα must complete before Δ • appα is divided into Tα tasks (tasks form a DAG) • each task requires satellite data sets or produced inter-task data sets • ti,α is the ith task in application α • each task produces other data items (e.g., data 7) • last task produces a result • [Figure: example DAG for appα with tasks t1,α, t2,α, and t3,α, satellite data sets (sat. data 1, 4, 6), inter-task data sets (data 2, 3, 7), and the result]
System Model — Computation Parallelism • 50% of tasks are parallelizable • only parallelizable on PEs in the same compute node • parallel time = sequential time / divider • the divider is used to model different speedups • two types of parallelizable tasks • 25% good parallel tasks • 25% average parallel tasks • divider values chosen arbitrarily for the simulation study
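As a worked instance of the formula above, the sketch below applies a divider to a sequential time. The slides do not list the divider values, so the numbers used here are hypothetical placeholders, as is the function name.

```python
def parallel_time(sequential_time, divider):
    """Parallel execution time of a task: sequential time / divider."""
    if divider < 1.0:
        raise ValueError("a divider below 1 would make the parallel run slower")
    return sequential_time / divider


# Hypothetical values: a 120 s task with an assumed divider of 4.0.
print(parallel_time(120.0, 4.0))
```

A "good" parallel task would be modeled with a larger divider than an "average" one for the same number of PEs.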
Robustness — Three Questions • What behavior of the system makes it robust? • all applications finish before Δ • What uncertainties is the system robust against? • differences between actual and estimated task execution times • assume communication times are fixed • Quantitatively, exactly how robust is the system? • smallest common percentage increase (ρ) for all task execution times that causes the makespan to exceed Δ • note: in a real system, the execution times of all tasks will not increase by the same common percentage • ρ is just a mathematical value used as a robustness measure
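One way to evaluate such a metric for a fixed mapping is to scale every estimated execution time by a common factor and search for the factor at which the makespan reaches Δ. The sketch below is an illustration under stated assumptions, not the authors' implementation: the schedule format, the function names, and the use of bisection are all choices made here. Communication times stay fixed, as the slide assumes.

```python
def makespan(schedule, scale):
    """Makespan when every execution time is multiplied by `scale`.

    schedule -- list of (task, pe, est_exec_time, [(pred, comm_time), ...]),
                given in an order that respects both the DAG and the
                per-PE execution order (an assumption of this sketch).
    """
    finish, pe_free = {}, {}
    for task, pe, exec_time, preds in schedule:
        # A task waits for its inputs (fixed comm times) and for its PE.
        ready = max((finish[p] + comm for p, comm in preds), default=0.0)
        start = max(ready, pe_free.get(pe, 0.0))
        finish[task] = start + scale * exec_time
        pe_free[pe] = finish[task]
    return max(finish.values())


def robustness_factor(schedule, deadline, tol=1e-9):
    """Common multiplier at which the makespan reaches the deadline."""
    if makespan(schedule, 1.0) > deadline:
        return None  # deadline already missed with the estimated times
    if all(exec_time == 0 for _, _, exec_time, _ in schedule):
        return float("inf")  # scaling cannot change the makespan
    lo, hi = 1.0, 2.0
    while makespan(schedule, hi) <= deadline:  # grow until the deadline breaks
        lo, hi = hi, hi * 2.0
    while hi - lo > tol:  # bisection: makespan is non-decreasing in scale
        mid = (lo + hi) / 2.0
        if makespan(schedule, mid) <= deadline:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical two-task example: t1 on PE A feeds t2 on PE B.
schedule = [("t1", "A", 2.0, []), ("t2", "B", 3.0, [("t1", 1.0)])]
factor = robustness_factor(schedule, deadline=11.0)
print(factor, (factor - 1.0) * 100.0)  # multiplier and percentage increase
```

The function returns a multiplier; the corresponding common percentage increase is `(factor - 1) * 100`. Any increase beyond that value pushes the makespan past Δ.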
Robustness — Example • assume 3 applications • blue (b, d, g, and h), green (a, e, and i), and pink (c and f) • [Figure: two charts of completion time per PE (PE1,1, PE2,1, PE3,1) against the deadline Δ: the makespan based on estimated task times (tasks a to i), and the makespan when task times = ρ ∙ estimated task time (tasks a′ to i′)]
Related Work • significant amount of research • assign a DAG to a heterogeneous computing system • several critical path heuristics • robustness in resource allocation • our research considers the robustness of the allocation in DAGs • two heuristics from the literature were adapted for this work • heuristics originally meant to minimize makespan • adapted heuristics can handle memory, satellite data placement, and robustness • Dynamic Available Tasks Critical Path (DATCP) heuristic • will be explained today
Dynamic Available Tasks Critical Path (DATCP) • outline • (1) calculate the critical path for each application • for each task, from texit to tentry • edge labels are (average transfer time per byte between any two nodes) ∙ (data size) • determine the maximum time from any successor (child) node to texit (maxtime) • critical path value is the sum of task data and satellite data transfer times, maxtime, and average execution time of ti • [Figure: example DAG in which each task is labeled with its critical path value and its average execution time, and each edge with its transfer time]
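The backward pass from texit to tentry can be sketched as a memoized recursion. This is one plausible reading of the slide, in which the transfer time on an edge is added to the successor's critical path value before the maximum is taken; the function name, input layout, and example numbers are assumptions, and satellite data transfer times are left out for brevity.

```python
from functools import lru_cache


def critical_path_values(avg_exec, edges):
    """Critical path value of every task in one application DAG.

    avg_exec -- {task: average execution time over the PEs}
    edges    -- {(src, dst): average transfer time per byte * data size}
    """
    successors = {task: [] for task in avg_exec}
    for (src, dst), transfer in edges.items():
        successors[src].append((dst, transfer))

    @lru_cache(maxsize=None)
    def cp(task):
        # Longest remaining time through any child; 0 for the exit task.
        longest = max(
            (transfer + cp(child) for child, transfer in successors[task]),
            default=0.0,
        )
        return avg_exec[task] + longest

    return {task: cp(task) for task in avg_exec}


# Hypothetical four-task DAG: a -> b -> d and a -> c -> d.
avg_exec = {"a": 3.0, "b": 5.0, "c": 4.0, "d": 6.0}
edges = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "d"): 3.0, ("c", "d"): 5.0}
print(critical_path_values(avg_exec, edges))
```

Memoization keeps the pass linear in the number of edges, since each task's value is computed once even when it has several parents.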
Dynamic Available Tasks Critical Path (DATCP) • outline • (1) calculate the critical path for each application • (2) dynamically create a list of all tasks available for mapping • (3) determine the task with the longest critical path from the list of available tasks • (4) task ti determined in (3) is assigned to the PE that gives the maximum system robustness based on the partial mapping • (5) repeat steps (2)–(4) until all tasks are mapped • [Figure: example DAG in which each task is labeled with its critical path value and its average execution time]
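The mapping loop outlined above can be sketched as follows. It is a simplified illustration, not the authors' code: the robustness evaluation is passed in as a callback, and `toy_robustness` is a stand-in invented here (deadline divided by the busiest PE's load) that ignores precedence, RAM, and data transfers.

```python
def datcp_map(cp, preds, pes, robustness_of):
    """Greedy mapping loop: returns {task: PE} in assignment order.

    cp            -- {task: critical path value}, computed beforehand
    preds         -- {task: list of predecessor tasks}
    pes           -- list of PE identifiers
    robustness_of -- callback scoring a partial mapping {task: PE}
    """
    mapping = {}
    unmapped = set(cp)
    while unmapped:
        # A task is available once all of its predecessors are mapped.
        available = [t for t in unmapped
                     if all(p in mapping for p in preds.get(t, ()))]
        if not available:
            raise ValueError("dependency cycle: no task is available")
        # Longest critical path first (task name breaks ties).
        task = max(available, key=lambda t: (cp[t], t))
        # Pick the PE that maximizes robustness of the partial mapping.
        best_pe = max(pes, key=lambda pe: robustness_of({**mapping, task: pe}))
        mapping[task] = best_pe
        unmapped.remove(task)
    return mapping


def toy_robustness(partial, exec_time, deadline):
    """Stand-in metric (an assumption): deadline / load of the busiest PE."""
    load = {}
    for task, pe in partial.items():
        load[pe] = load.get(pe, 0.0) + exec_time[task]
    return deadline / max(load.values())


# Hypothetical inputs for a four-task DAG and two PEs.
exec_time = {"a": 3.0, "b": 5.0, "c": 4.0, "d": 6.0}
cp = {"a": 19.0, "b": 14.0, "c": 15.0, "d": 6.0}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
mapping = datcp_map(cp, preds, ["PE1", "PE2"],
                    lambda m: toy_robustness(m, exec_time, deadline=20.0))
print(mapping)
```

Because the available list is rebuilt on every iteration, a task becomes eligible as soon as its last predecessor is mapped, which is what makes the ordering dynamic.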
DATCP — Memory Management • determine available space in RAM • decide if the required task and the input data can be stored in RAM immediately • if there is not enough space • heuristic checks when the task's input data sets can be moved into memory • heuristic schedules task to start execution at that time • if incoming data is from another compute node • send it to destination compute node’s RAM • if there is no space in RAM then send to the HD
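The "when can the input data be moved into memory" check can be sketched as a scan over future RAM releases. The release list, the function name, and the numbers are assumptions for illustration; the slide does not say how the heuristic tracks freed space.

```python
def earliest_start(now, free_gb, needed_gb, releases):
    """Earliest time at which `needed_gb` of RAM is available.

    now      -- current time
    free_gb  -- RAM currently free on the compute node
    needed_gb-- space for the missing inputs plus the reserved result
    releases -- list of (time, gbytes_freed) for data sets that will
                leave RAM in the future (hypothetical bookkeeping)
    """
    if needed_gb <= free_gb:
        return now  # the task can be staged immediately
    for time, freed in sorted(releases):
        free_gb += freed
        if needed_gb <= free_gb:
            return max(now, time)
    return None  # never fits with the known releases


# Hypothetical node: 10 Gbytes free, 30 Gbytes needed.
print(earliest_start(0.0, 10.0, 30.0, [(9.0, 16.0), (5.0, 8.0)]))
```

The heuristic would then schedule the task to start at the returned time instead of rejecting the PE outright.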
DATCP — Parallelizable Tasks • two approaches are studied • no parallelization • “max” approach • heuristic always parallelizes across multiple PEs within a compute node • determine the system robustness for each possible assignment • determine the compute node with the most PEs that share the same maximum robustness • map the task to all PEs within this compute node that have this robustness value
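The selection rule of the "max" approach can be sketched as below, assuming the robustness of each candidate assignment has already been computed. The function name and the input layout are hypothetical, and exact equality is used to compare robustness values, which a real implementation might replace with a tolerance.

```python
def max_parallel_choice(robustness):
    """Pick the compute node and PEs for a parallelizable task.

    robustness -- {(node, pe): system robustness if the task runs there}
    Returns (node, sorted list of PEs on that node sharing the maximum).
    """
    best = max(robustness.values())
    pes_at_best = {}
    for (node, pe), value in robustness.items():
        if value == best:
            pes_at_best.setdefault(node, []).append(pe)
    # Node with the most PEs that reach the maximum robustness.
    node = max(pes_at_best, key=lambda n: len(pes_at_best[n]))
    return node, sorted(pes_at_best[node])


# Hypothetical values: node 1 has two PEs at the maximum, node 2 has one.
values = {(1, 1): 2.0, (1, 2): 2.0, (1, 3): 1.5, (2, 1): 2.0, (2, 2): 1.0}
print(max_parallel_choice(values))
```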
DATCP — Satellite Data Placement • two methods • random placement • placement determined by the heuristic • the first time a satellite data set is required, that data set and the task that requires it are mapped together • task is assigned to the PE that maximizes robustness • storage location of the satellite data set has not been previously determined • satellite data set is stored on the HD of this PE's compute node
Results • DATCP 1: max parallel with satellite mapping • DATCP 2: max parallel with random satellite mapping • DATCP 3: no parallelism with random satellite mapping • HRD 1: satellite data (SD) placement based on first task placement with duplication • HRD 2: SD placement based on first task placement with no duplication • HRD 3: SD placement based on reference count with no duplication • HRD 4: random SD placement with duplication • HRD 5: random SD placement with no duplication
Conclusions • derived a metric to measure robustness • interdependency of tasks within applications complicates the derivation of a robustness metric • DATCP has the highest average robustness values • initial ordering created by DATCP is much better than the order created by HRD • if the DATCP order is used in HRD, then the results of HRD are significantly improved • satellite data placement did not have any apparent effect on robustness