Data Consolidation: A Task Scheduling and Data Migration Technique for Grid Networks Authors: P. Kokkinos, K. Christodoulopoulos, A. Kretsis, and E. Varvarigos Department of Computer Engineering and Informatics, University of Patras, Greece, and Research Academic Computer Technology Institute, Patras, Greece Conference: CCGRID 2008
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Introduction • Many applications benefit from Grid computing: • Computation-intensive applications: solve computationally intensive problems on relatively small datasets. • Data-intensive applications: perform computations on large datasets stored at geographically distributed resources. (Such a Grid is usually referred to as a Data Grid.)
Introduction • We evaluate a task scheduling and data migration problem called data consolidation (DC).
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Previous Work • Most related works assume that each task needs only a single, large piece of data for its execution. As a result, the scenario where a task requires several datasets is ignored in most related works.
Previous Work • In “Intelligent Scheduling and Replication in Datagrids: a Synergistic Approach”: • Each task needs one or more pieces of data for its execution. • A Tabu-search scheduler is used. • It optimizes execution time and system utilization.
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Problem Formulation • A Grid network consists of a set R of N sites: each r∈R contains at least one of the following entities: a computation resource, a storage resource, a network resource. • Each computation resource has a local scheduler and a queue. • There is a central scheduler responsible for task scheduling and data management. (This scheduler has complete knowledge of the static and dynamic characteristics of the sites.)
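A minimal sketch of how the central scheduler might represent this model; all names (Site, cpu_capacity, free_storage) are hypothetical illustrations, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    """One Grid site: it combines at least one of a computation
    resource (with its local queue), a storage resource, and
    network connectivity."""
    name: str
    cpu_capacity: float = 0.0      # 0 -> no computation resource
    free_storage: float = 0.0      # 0 -> no storage resource (MB)
    queue: list = field(default_factory=list)    # local task queue
    datasets: set = field(default_factory=set)   # dataset replicas held

# The central scheduler keeps complete (static and dynamic)
# knowledge of the sites, e.g. as a simple registry:
sites: dict[str, Site] = {}
```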
Problem Formulation • On receiving a user's request, the central scheduler examines the computation- and data-related characteristics of the task. • Based on the DC algorithm used, the central scheduler selects: • the sites that hold replicas of the datasets the task needs. • the site where these datasets will be consolidated and the task will be executed (this site is called the DC site). NOTE: The storage inequality must be satisfied: the DC site must have enough free storage capacity for all the datasets to be consolidated.
Problem Formulation • The scheduler orders the data-holding sites to transfer their datasets to the DC site, and orders the user to transfer the task there. • After the task finishes execution, the results are returned to the originating user.
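The end-to-end flow could look like the following sketch; `dc_algorithm` and `transfer` are hypothetical stand-ins for the concrete DC technique and the network layer:

```python
def schedule_task(task, user_site, candidates, dc_algorithm, transfer):
    """Hypothetical end-to-end DC flow. `task.datasets` is a list of
    (dataset_id, size) pairs; `dc_algorithm` returns the chosen DC
    site plus a plan mapping each dataset to a holding site."""
    dc_site, plan = dc_algorithm(task, candidates)

    # Storage inequality: the DC site needs room for every dataset
    # it does not already hold.
    needed = sum(size for d, size in task.datasets
                 if d not in dc_site.datasets)
    assert needed <= dc_site.free_storage, "DC site lacks storage"

    # Order the holding sites to send their datasets to the DC site.
    for d, size in task.datasets:
        if d not in dc_site.datasets:
            transfer(plan[d], dc_site, d, size)

    # Order the user to send the task; after execution the results
    # are returned to user_site.
    dc_site.queue.append(task)
```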
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Theoretical Analysis • Assume that the scheduler has selected the data-holding sites rk∈R for all datasets Ik, k=1,2,…,L, and the DC site. • The DC site may already hold some of the datasets, in which case no transfer is required for them.
Theoretical Analysis • In general, the data-intensive task experiences • communication delay (Dcomm) • processing delay (Dproc)
Theoretical Analysis • Communication delay (Dcomm): Dcomm = Dcons + Doutput, where Dcons = max over k of D(rk, rj) is the time until the last of the dataset transfers to the DC site rj completes, and Doutput is the time to transfer the task's output from rj back to the originating user.
Theoretical Analysis • Processing delay (Dproc): Dproc = the queueing delay the task experiences at the DC site's computation resource, plus its execution time there (the task's workload W divided by the computation capacity of the DC site).
Theoretical Analysis • The total delay suffered by a task is DDC=Dcomm+Dproc.
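As a worked illustration of this delay model under the stated assumptions (transfers proceed in parallel; `transfer_delay` and `queueing_delay` are hypothetical helpers):

```python
def total_task_delay(task, plan, dc_site, user_site, output_size,
                     transfer_delay, queueing_delay):
    """D_DC = Dcomm + Dproc for one candidate assignment."""
    # Dcons: the parallel transfers end when the slowest one does;
    # datasets the DC site already holds cost nothing.
    d_cons = max(
        0.0 if d in dc_site.datasets
        else transfer_delay(plan[d], dc_site, size)
        for d, size in task.datasets
    )
    # Doutput: return the task's output to the originating user.
    d_output = transfer_delay(dc_site, user_site, output_size)
    d_comm = d_cons + d_output

    # Dproc: wait in the DC site's queue, then execute the workload.
    d_proc = queueing_delay(dc_site) + task.workload / dc_site.cpu_capacity

    return d_comm + d_proc
```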
Proposed Techniques • We propose a number of categories of DC algorithms: • Time: ConsCost, ExecCost, TotalCost • Traffic: SmallTrans • Random: Rand, RandOrig
Time • Consolidation-Cost (ConsCost) algorithm: we select the replicas and the DC site that minimize the data consolidation time (Dcons). • Given a candidate DC site rj, for each dataset Ik we search for the site ri holding Ik such that the transfer delay D(ri, rj) is minimum; hence the data consolidation time of rj is Dcons(rj) = max over k of min over i of D(ri, rj). • Finally, we determine the DC site as the candidate rj with the minimum Dcons(rj).
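A minimal sketch of ConsCost under these definitions (helper names are assumptions):

```python
def cons_cost(candidates, task, replica_sites, transfer_delay):
    """ConsCost: choose replicas and the DC site minimizing
    Dcons(rj) = max_k min_i D(ri, rj)."""
    best = (None, None, float("inf"))           # (site, plan, cost)
    for rj in candidates:
        plan, worst = {}, 0.0
        for d, size in task.datasets:
            # Cheapest replica of this dataset w.r.t. candidate rj
            # (transfer_delay is assumed to return 0 if ri is rj).
            ri = min(replica_sites[d],
                     key=lambda r: transfer_delay(r, rj, size))
            plan[d] = ri
            worst = max(worst, transfer_delay(ri, rj, size))
        if worst < best[2]:
            best = (rj, plan, worst)
    return best[0], best[1]                     # DC site, replica plan
```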
Time • Execution-Cost (ExecCost) algorithm: we select the DC site that minimizes the task's execution time, i.e., the estimated processing delay Dproc, while the data replicas are randomly chosen. NOTE: Dproc(ri) is difficult to calculate exactly, but we can estimate it based on: • the tasks already assigned to ri. • the average delay the tasks executed on ri have experienced.
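A sketch of ExecCost using the estimate described above; the queue contents and observed averages are assumed to be tracked by the scheduler:

```python
import random

def estimated_proc_delay(site, avg_observed_delay):
    """Estimate Dproc from the workload already queued at the site
    plus the average delay tasks executed there have experienced."""
    backlog = sum(t.workload for t in site.queue)
    return backlog / site.cpu_capacity + avg_observed_delay

def exec_cost(candidates, task, replica_sites, avg_delay):
    """ExecCost: the DC site minimizes the estimated execution
    delay; the replicas are chosen at random."""
    dc = min(candidates,
             key=lambda r: estimated_proc_delay(r, avg_delay[r]))
    plan = {d: random.choice(list(replica_sites[d]))
            for d, _ in task.datasets}
    return dc, plan
```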
Time • Total-Cost (TotalCost) algorithm: we select the replicas and the DC site that minimize the total task delay; that is, the algorithm combines the two algorithms above (ConsCost and ExecCost).
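Combining the two, a TotalCost sketch (reusing the hypothetical helpers from the ConsCost and ExecCost sketches):

```python
def total_cost(candidates, task, replica_sites, transfer_delay, avg_delay):
    """TotalCost: minimize Dcons(rj) plus estimated Dproc(rj)."""
    def cost(rj):
        d_cons = max(min(transfer_delay(ri, rj, size)
                         for ri in replica_sites[d])
                     for d, size in task.datasets)
        return d_cons + estimated_proc_delay(rj, avg_delay[rj])
    return min(candidates, key=cost)
```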
Traffic • Smallest-Data Transfer (SmallTrans) algorithm: We select the DC site for which the smallest number of datasets (or the datasets with the smallest total size) need to be consolidated for the task’s execution.
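SmallTrans reduces to sizing what is missing at each candidate site; a sketch under the same assumed interfaces:

```python
def small_trans(candidates, task):
    """SmallTrans: prefer the candidate that already holds the most
    of the task's data, i.e. minimize the total size still to move."""
    def missing_size(rj):
        return sum(size for d, size in task.datasets
                   if d not in rj.datasets)
    return min(candidates, key=missing_size)
```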
Random • Random-Random (Rand) algorithm: the data replicas used by the task and the DC site are randomly chosen. • Random-Origin (RandOrig) algorithm: the data replicas used by the task are randomly chosen, and the DC site is the site where the task originated.
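The two random baselines, sketched with the same hypothetical interfaces:

```python
import random

def rand(candidates, task, replica_sites):
    """Rand: the replicas and the DC site are all drawn uniformly."""
    plan = {d: random.choice(list(replica_sites[d]))
            for d, _ in task.datasets}
    return random.choice(list(candidates)), plan

def rand_orig(user_site, task, replica_sites):
    """RandOrig: random replicas; the DC site is the task's origin."""
    plan = {d: random.choice(list(replica_sites[d]))
            for d, _ in task.datasets}
    return user_site, plan
```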
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Simulation • We use the NSFNET topology, which contains: • 14 nodes, of which only 5 are equipped with a computation and a storage resource (such nodes are called sites); each site has equal storage and computation capacity. • One further node acts as a Tier 0 site and holds all the datasets. • 21 links (all link capacities are equal to 1 Gbps).
Assumptions • Only one transmission is possible at a time over a link. • Propagation delay is not taken into account. • 50 datasets exist in the network initially, and two copies exist of each dataset (one is distributed among the 5 sites, the other is placed at the Tier 0 site). • In each experiment, users generate a total of 50,000 tasks. • We keep the average total data size S constant (15,000 MB): S = L · I (L: number of datasets a task requests; I: the average size of each dataset). We examine the following (L, I) pairs: (2,7500), (3,5000), (4,3750), (6,2500), (8,1875), (10,1500). • The workload W of a task correlates with the average total data size, W = a · S (a is a parameter such that tasks are more data-intensive as a decreases).
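The constant-S design can be checked directly; the workload form W = a·S is our reading of the elided formula, not a quoted one:

```python
S = 15_000  # average total data size per task, in MB (held constant)

# The (L, I) pairs satisfy S = L * I: more but smaller datasets as L grows.
pairs = [(2, 7500), (3, 5000), (4, 3750), (6, 2500), (8, 1875), (10, 1500)]
assert all(L * I == S for L, I in pairs)

def workload(a, total_size=S):
    """Assumed correlation W = a * S: as a decreases, the computation
    per byte shrinks and the task becomes more data-intensive."""
    return a * total_size
```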
Simulations • DC probability: the probability that the DC site will not have all the required datasets.
Simulations • Task delay: the time between a task's creation and its completion.
Simulations • Network load depends on: 1. the size of the datasets transferred, and 2. the number of hops these datasets traverse.
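The measured load then follows directly from these two factors; a sketch with a hypothetical `hop_count` routing helper:

```python
def network_load(completed_transfers, hop_count):
    """Sum over all dataset transfers of (size x hops traversed).
    `completed_transfers` is an iterable of (src, dst, size)."""
    return sum(size * hop_count(src, dst)
               for src, dst, size in completed_transfers)
```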
Outline • Introduction • Previous Work • Problem Formulation • Data Consolidation Techniques • Simulation • Conclusion
Conclusion • If DC is performed efficiently, important benefits can be obtained in terms of task delay, network load and other performance parameters of interest.