Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks Authors: P. Kokkinos, K. Christodoulopoulos, A. Kretsis, and E. Varvarigos Department of Computer Engineering and Informatics, University of Patras, Greece, and Research Academic Computer Technology Institute, Patras, Greece Conference: CCGRID 2008
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Introduction • Many applications benefit from Grid computing: • Computation-intensive applications: solve computationally intensive problems on relatively small datasets. • Data-intensive applications: perform computations on large datasets stored at geographically distributed resources. (Such a Grid is usually referred to as a Data Grid.)
Introduction • We evaluate a task scheduling and data migration problem called data consolidation (DC).
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Previous Work • Most related works assume that each task needs only a single, large piece of data for its execution. As a result, the scenario where a task requires several datasets is ignored in most related works.
Previous Work • In “Intelligent Scheduling and Replication in Datagrids: a Synergistic Approach”: • Each task needs one or more pieces of data for its execution. • A Tabu-search scheduler is used. • It optimizes execution time and system utilization.
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Problem Formulation • A Grid network consists of a set R of N sites: each r∈R contains at least one of the following entities: a computation resource, a storage resource, a network resource. • Each computation resource has a local scheduler and a queue. • There is a central scheduler responsible for task scheduling and data management. (This scheduler has complete knowledge of the static and dynamic characteristics of the sites.)
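A minimal sketch of how the central scheduler might represent this model; all names (Site, cpu_capacity, free_storage) are hypothetical illustrations, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    """One Grid site: it combines at least one of a computation
    resource (with its local queue), a storage resource, and
    network connectivity."""
    name: str
    cpu_capacity: float = 0.0      # 0 -> no computation resource
    free_storage: float = 0.0      # 0 -> no storage resource (MB)
    queue: list = field(default_factory=list)    # local task queue
    datasets: set = field(default_factory=set)   # dataset replicas held

# The central scheduler keeps complete (static and dynamic)
# knowledge of the sites, e.g. as a simple registry:
sites: dict[str, Site] = {}
```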
Problem Formulation • On receiving a user's request, the central scheduler examines the computation- and data-related characteristics of the task. • Based on the DC algorithm used, the central scheduler selects: • the sites that hold replicas of the datasets the task needs. • the site where these datasets will be consolidated and the task will be executed (this site is called the DC site). NOTE: The storage inequality must be satisfied: the DC site must have enough free storage capacity for all the datasets to be consolidated.
Problem Formulation • The scheduler orders the data-holding sites to transfer their datasets to the DC site, and orders the user to transfer the task there. • After the task finishes execution, the results are returned to the originating user.
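The end-to-end flow could look like the following sketch; `dc_algorithm` and `transfer` are hypothetical stand-ins for the concrete DC technique and the network layer:

```python
def schedule_task(task, user_site, candidates, dc_algorithm, transfer):
    """Hypothetical end-to-end DC flow. `task.datasets` is a list of
    (dataset_id, size) pairs; `dc_algorithm` returns the chosen DC
    site plus a plan mapping each dataset to a holding site."""
    dc_site, plan = dc_algorithm(task, candidates)

    # Storage inequality: the DC site needs room for every dataset
    # it does not already hold.
    needed = sum(size for d, size in task.datasets
                 if d not in dc_site.datasets)
    assert needed <= dc_site.free_storage, "DC site lacks storage"

    # Order the holding sites to send their datasets to the DC site.
    for d, size in task.datasets:
        if d not in dc_site.datasets:
            transfer(plan[d], dc_site, d, size)

    # Order the user to send the task; after execution the results
    # are returned to user_site.
    dc_site.queue.append(task)
```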
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Theoretical Analysis • Assume that the scheduler has selected the data-holding sites rk∈R for all datasets Ik, k=1,2,…,L, and the DC site. • The DC site may already hold some of the datasets, in which case no transfer is required for them.
Theoretical Analysis • In general, the data-intensive task experiences • communication delay (Dcomm) • processing delay (Dproc)
Theoretical Analysis • Communication delay (Dcomm): Dcomm = Dcons + Doutput, where Dcons = max over k of D(rk, rj) is the time until the last of the dataset transfers to the DC site rj completes, and Doutput is the time to transfer the task's output from rj back to the originating user.
Theoretical Analysis • Processing delay (Dproc): Dproc = the queueing delay the task experiences at the DC site's computation resource, plus its execution time there (the task's workload W divided by the computation capacity of the DC site).
Theoretical Analysis • The total delay suffered by a task is DDC=Dcomm+Dproc.
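As a worked illustration of this delay model under the stated assumptions (transfers proceed in parallel; `transfer_delay` and `queueing_delay` are hypothetical helpers):

```python
def total_task_delay(task, plan, dc_site, user_site, output_size,
                     transfer_delay, queueing_delay):
    """D_DC = Dcomm + Dproc for one candidate assignment."""
    # Dcons: the parallel transfers end when the slowest one does;
    # datasets the DC site already holds cost nothing.
    d_cons = max(
        0.0 if d in dc_site.datasets
        else transfer_delay(plan[d], dc_site, size)
        for d, size in task.datasets
    )
    # Doutput: return the task's output to the originating user.
    d_output = transfer_delay(dc_site, user_site, output_size)
    d_comm = d_cons + d_output

    # Dproc: wait in the DC site's queue, then execute the workload.
    d_proc = queueing_delay(dc_site) + task.workload / dc_site.cpu_capacity

    return d_comm + d_proc
```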
Proposed Techniques • We propose a number of categories of DC algorithms: • Time: ConsCost, ExecCost, TotalCost • Traffic: SmallTrans • Random: Rand, RandOrig
Time • Consolidation-Cost (ConsCost) algorithm: we select the replicas and the DC site that minimize the data consolidation time (Dcons). • Given a candidate DC site rj, for each dataset Ik we search for the site ri holding Ik such that the transfer delay D(ri, rj) is minimum; hence the data consolidation time of rj is Dcons(rj) = max over k of min over i of D(ri, rj). • Finally, we determine the DC site as the candidate rj with the minimum Dcons(rj).
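A minimal sketch of ConsCost under these definitions (helper names are assumptions):

```python
def cons_cost(candidates, task, replica_sites, transfer_delay):
    """ConsCost: choose replicas and the DC site minimizing
    Dcons(rj) = max_k min_i D(ri, rj)."""
    best = (None, None, float("inf"))           # (site, plan, cost)
    for rj in candidates:
        plan, worst = {}, 0.0
        for d, size in task.datasets:
            # Cheapest replica of this dataset w.r.t. candidate rj
            # (transfer_delay is assumed to return 0 if ri is rj).
            ri = min(replica_sites[d],
                     key=lambda r: transfer_delay(r, rj, size))
            plan[d] = ri
            worst = max(worst, transfer_delay(ri, rj, size))
        if worst < best[2]:
            best = (rj, plan, worst)
    return best[0], best[1]                     # DC site, replica plan
```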
Time • Execution-Cost (ExecCost) algorithm: we select the DC site that minimizes the task's execution time, i.e., the estimated processing delay Dproc, while the data replicas are randomly chosen. NOTE: Dproc(ri) is difficult to calculate exactly, but we can estimate it based on: • the tasks already assigned to ri. • the average delay the tasks executed on ri have experienced.
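A sketch of ExecCost using the estimate described above; the queue contents and observed averages are assumed to be tracked by the scheduler:

```python
import random

def estimated_proc_delay(site, avg_observed_delay):
    """Estimate Dproc from the workload already queued at the site
    plus the average delay tasks executed there have experienced."""
    backlog = sum(t.workload for t in site.queue)
    return backlog / site.cpu_capacity + avg_observed_delay

def exec_cost(candidates, task, replica_sites, avg_delay):
    """ExecCost: the DC site minimizes the estimated execution
    delay; the replicas are chosen at random."""
    dc = min(candidates,
             key=lambda r: estimated_proc_delay(r, avg_delay[r]))
    plan = {d: random.choice(list(replica_sites[d]))
            for d, _ in task.datasets}
    return dc, plan
```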
Time • Total-Cost (TotalCost) algorithm: we select the replicas and the DC site that minimize the total task delay; that is, the algorithm combines the two algorithms above (ConsCost and ExecCost).
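Combining the two, a TotalCost sketch (reusing the hypothetical helpers from the ConsCost and ExecCost sketches):

```python
def total_cost(candidates, task, replica_sites, transfer_delay, avg_delay):
    """TotalCost: minimize Dcons(rj) plus estimated Dproc(rj)."""
    def cost(rj):
        d_cons = max(min(transfer_delay(ri, rj, size)
                         for ri in replica_sites[d])
                     for d, size in task.datasets)
        return d_cons + estimated_proc_delay(rj, avg_delay[rj])
    return min(candidates, key=cost)
```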
Traffic • Smallest-Data Transfer (SmallTrans) algorithm: We select the DC site for which the smallest number of datasets (or the datasets with the smallest total size) need to be consolidated for the task’s execution.
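SmallTrans reduces to sizing what is missing at each candidate site; a sketch under the same assumed interfaces:

```python
def small_trans(candidates, task):
    """SmallTrans: prefer the candidate that already holds the most
    of the task's data, i.e. minimize the total size still to move."""
    def missing_size(rj):
        return sum(size for d, size in task.datasets
                   if d not in rj.datasets)
    return min(candidates, key=missing_size)
```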
Random • Random-Random (Rand) algorithm: the data replicas used by the task and the DC site are randomly chosen. • Random-Origin (RandOrig) algorithm: the data replicas used by the task are randomly chosen, and the DC site is the site where the task originated.
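The two random baselines, sketched with the same hypothetical interfaces:

```python
import random

def rand(candidates, task, replica_sites):
    """Rand: the replicas and the DC site are all drawn uniformly."""
    plan = {d: random.choice(list(replica_sites[d]))
            for d, _ in task.datasets}
    return random.choice(list(candidates)), plan

def rand_orig(user_site, task, replica_sites):
    """RandOrig: random replicas; the DC site is the task's origin."""
    plan = {d: random.choice(list(replica_sites[d]))
            for d, _ in task.datasets}
    return user_site, plan
```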
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Simulation • We use the NSFNET topology, which contains: • 14 nodes, of which only 5 are equipped with a computation and a storage resource (such nodes are called sites); each site has equal storage and computation capacity. • One further node acts as a Tier 0 site and holds all the datasets. • 21 links (all link capacities are equal to 1 Gbps).
Assumptions • Only one transmission is possible at a time over a link. • Propagation delay is not taken into account. • 50 datasets exist in the network initially, and two copies exist of each dataset (one is distributed among the 5 sites, the other is placed at the Tier 0 site). • In each experiment, users generate a total of 50,000 tasks. • We keep the average total data size S constant (15,000 MB): S = L · I (L: number of datasets a task requests; I: the average size of each dataset). We examine the following (L, I) pairs: (2,7500), (3,5000), (4,3750), (6,2500), (8,1875), (10,1500). • The workload W of a task correlates with the average total data size, W = a · S (a is a parameter such that tasks are more data-intensive as a decreases).
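The constant-S design can be checked directly; the workload form W = a·S is our reading of the elided formula, not a quoted one:

```python
S = 15_000  # average total data size per task, in MB (held constant)

# The (L, I) pairs satisfy S = L * I: more but smaller datasets as L grows.
pairs = [(2, 7500), (3, 5000), (4, 3750), (6, 2500), (8, 1875), (10, 1500)]
assert all(L * I == S for L, I in pairs)

def workload(a, total_size=S):
    """Assumed correlation W = a * S: as a decreases, the computation
    per byte shrinks and the task becomes more data-intensive."""
    return a * total_size
```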
Simulations • DC probability: the probability that the DC site will not have all the required datasets.
Simulations • Task delay: the time between a task's creation and its completion.
Simulations • Network load depends on: 1. the size of the datasets transferred, and 2. the number of hops these datasets traverse.
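The measured load then follows directly from these two factors; a sketch with a hypothetical `hop_count` routing helper:

```python
def network_load(completed_transfers, hop_count):
    """Sum over all dataset transfers of (size x hops traversed).
    `completed_transfers` is an iterable of (src, dst, size)."""
    return sum(size * hop_count(src, dst)
               for src, dst, size in completed_transfers)
```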
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Conclusion • If DC is performed efficiently, important benefits can be obtained in terms of task delay, network load and other performance parameters of interest.