Chameleon: A Resource Scheduler in A Data Grid Environment

Sang Min Park and Jai-Hoon Kim, Ajou University, South Korea

Contents • Introduction to Data Grid • Related Works • Scheduling Model • Scheduler Implementation • Testbed and Application • Results • Conclusions

Introduction to Data Grid • Data Grid Motivations • Petabyte-scale data production • Distributed data storage, with each site storing part of the data • Distributed computing resources that process the data • Two Most Important Approaches for Data Grid • Secure, reliable, and efficient data transport protocol (e.g., GridFTP) • Replication (e.g., Replica Catalog) • Replication • Large files are partially replicated among sites • Reduces data access time • Application scheduling and dynamic replication are emerging issues

Related Works • Data Grid • Replica catalog – maps a logical file name to its physical instances • GridFTP – secure, reliable, and efficient file transfer protocol • Job Scheduling • Various scheduling algorithms for the computational Grid • Application Level Scheduling (AppLeS) • Large data collections have not been considered • Job Scheduling in Data Grid • Only rough analytical and simulation studies have been presented • Our work defines a more in-depth scheduling model

Scheduling Model - Assumptions • Assumptions • Each site has both data storage and computing facilities • Files are replicated at a subset of the Grid sites • Sites differ in computational capability • Grid users request job execution through job schedulers

Scheduling Model - System Factors • Dynamic system factors - factors that change over time • Network bandwidth • Data transfer time is inversely proportional to network bandwidth • NWS - tool for measuring and forecasting network bandwidth • Available computing nodes • Determine the execution time of jobs • Their number depends on the job load at a site • System attributes • Machine architecture (clusters, MPPs, etc.) • Processor speed, available memory, I/O performance, etc.
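The relation between bandwidth and transfer time can be illustrated with a minimal sketch. The function name, the units (megabytes and megabits per second), and the neglect of latency and protocol overhead are assumptions made for illustration; they are not taken from the slides. In practice the bandwidth value would come from a forecast such as the one NWS provides.

```python
def transfer_time_s(size_mb: float, bandwidth_mbps: float) -> float:
    """Estimate the seconds needed to move size_mb megabytes over a link.

    bandwidth_mbps is the forecast bandwidth in megabits per second.
    Latency and protocol overhead are ignored (simplifying assumption).
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_mb * 8 / bandwidth_mbps


# Halving the bandwidth doubles the estimated transfer time.
fast = transfer_time_s(500, 100)
slow = transfer_time_s(500, 50)
print(fast, slow)
```

With these assumed units, a 500 MB replica over a 100 Mbit/s link takes about 40 seconds, which is why the location of large replicas matters to the scheduler.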

Scheduling Model - System Factors • Application-specific factors - factors unique to Data Grid applications • Size of input data (replica) • If the data is not at the computing site, it must be fetched • Transferring large data takes considerable time • Size of application code • Application code must be migrated to the sites that perform the computation • Not critical to overall performance (small size) • Size of produced output data • When the job runs at a remote site, the result data must be returned to the local site • Strongly related to the size of the input data

Scheduling Model - application scenarios • The model consists of 5 distinct application scenarios • Local Data and Local Execution • Local Data and Remote Execution • Remote Data and Local Execution • Remote Data and Same Remote Execution • Remote Data and Different Remote Execution
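The five scenarios are determined by where the replica is and where the job runs, relative to the local (submitting) site. A minimal sketch of that classification follows; the enum and function names are hypothetical and only illustrate the case split.

```python
from enum import Enum


class Scenario(Enum):
    LOCAL_DATA_LOCAL_EXEC = 1
    LOCAL_DATA_REMOTE_EXEC = 2
    REMOTE_DATA_LOCAL_EXEC = 3
    REMOTE_DATA_SAME_REMOTE_EXEC = 4
    REMOTE_DATA_DIFFERENT_REMOTE_EXEC = 5


def classify(local: str, data_site: str, exec_site: str) -> Scenario:
    """Map a (replica site, execution site) pair to one of the five scenarios."""
    data_is_local = data_site == local
    exec_is_local = exec_site == local
    if data_is_local and exec_is_local:
        return Scenario.LOCAL_DATA_LOCAL_EXEC
    if data_is_local:
        return Scenario.LOCAL_DATA_REMOTE_EXEC
    if exec_is_local:
        return Scenario.REMOTE_DATA_LOCAL_EXEC
    if data_site == exec_site:
        return Scenario.REMOTE_DATA_SAME_REMOTE_EXEC
    return Scenario.REMOTE_DATA_DIFFERENT_REMOTE_EXEC


print(classify("A", "B", "C").name)
```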

Scheduling Model - application scenarios • Terms in the scenarios

Scheduling Model - application scenarios 1. Local Data and Local Execution • Input data (replica) is located at the local site, and processing is performed on locally available processors • Data moved consists of • Input data (replica) • Application code • Output data • Cost consists of • Data transfer time between master and computing nodes via LAN • Job execution time on local processors

Scheduling Model - application scenarios 2. Local Data and Remote Execution • The local replica is transferred to the remote computation site • Cost consists of • Data (input + code + output) movement time via WAN between the local and remote sites • Data movement time via LAN within the remote site • Job execution time at the remote site

Scheduling Model - application scenarios 3. Remote Data and Local Execution • The remote replica is copied to the local site, and processing is performed locally • Cost consists of • Input data movement time via WAN between the local and remote sites • Data movement time via LAN within the local site • Job execution time on local processors

Scheduling Model - application scenarios 4. Remote Data and Same Remote Execution • The remote site holding the replica performs the computation • Cost consists of • Data (code + output) movement time via WAN between the local and remote sites • Data movement time via LAN within the remote site • Job execution time at the remote site

Scheduling Model - application scenarios 5. Remote Data and Different Remote Execution • Remote site j performs the computation with a replica copied from remote site i • Cost consists of • Input replica movement time via WAN between remote sites i and j • Data (code + output) movement time via WAN between the local site and remote site j • Data movement time via LAN within remote site j • Job execution time at remote site j
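The cost components listed for the five scenarios can be folded into a single response-time estimate. The sketch below is one possible reading, not the authors' formulation: the class names, the field names, the units, and the assumption that input, code, and output all cross the execution site's LAN in every scenario are introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    input_mb: float   # size of the input replica
    code_mb: float    # size of the application code
    output_mb: float  # size of the produced output


@dataclass
class Site:
    name: str
    lan_mbps: float     # LAN bandwidth between master and computing nodes
    exec_time_s: float  # predicted execution time on currently available nodes


def move_s(size_mb: float, mbps: float) -> float:
    """Seconds to move size_mb megabytes at mbps megabits per second."""
    return size_mb * 8 / mbps


def response_time_s(job: Job, local: Site, data_site: Site, exec_site: Site,
                    wan_mbps: dict) -> float:
    """Predicted response time; wan_mbps maps frozenset({a, b}) to bandwidth."""
    total = 0.0
    # Input replica crosses the WAN only if it is not already at the execution site.
    if data_site.name != exec_site.name:
        link = frozenset((data_site.name, exec_site.name))
        total += move_s(job.input_mb, wan_mbps[link])
    # Code goes out and output comes back over the WAN for remote execution.
    if exec_site.name != local.name:
        link = frozenset((local.name, exec_site.name))
        total += move_s(job.code_mb + job.output_mb, wan_mbps[link])
    # LAN movement inside the execution site, then the job itself.
    total += move_s(job.input_mb + job.code_mb + job.output_mb, exec_site.lan_mbps)
    total += exec_site.exec_time_s
    return total


job = Job(input_mb=100, code_mb=1, output_mb=9)
a = Site("A", lan_mbps=800, exec_time_s=100)
b = Site("B", lan_mbps=800, exec_time_s=50)
wan = {frozenset(("A", "B")): 80}
print(response_time_s(job, a, a, a, wan))  # scenario 1
print(response_time_s(job, a, a, b, wan))  # scenario 2
```

Under these made-up numbers, shipping the replica to the faster remote site (scenario 2) beats running locally (scenario 1) because the saved execution time outweighs the WAN transfer.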

Scheduling Model - scheduler • Operations of the scheduler • Predict the response time of each scenario • Compare the response times of the scenarios • Choose the best scenario, along with the site holding the data and the site that will execute the job • Request data movement and job execution
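The predict, compare, and choose steps amount to taking the minimum predicted response time over all candidate (replica site, execution site) pairs. The sketch below assumes a caller-supplied `predict` function and uses hypothetical names throughout; the final step of requesting data movement and job execution is left out because it depends on the Grid middleware.

```python
from itertools import product
from typing import Callable, Iterable, NamedTuple


class Plan(NamedTuple):
    response_time_s: float
    data_site: str
    exec_site: str


def choose_plan(replica_sites: Iterable[str], compute_sites: Iterable[str],
                predict: Callable[[str, str], float]) -> Plan:
    """Return the (replica site, execution site) pair with the lowest prediction."""
    plans = [Plan(predict(d, e), d, e)
             for d, e in product(replica_sites, compute_sites)]
    if not plans:
        raise ValueError("no candidate sites")
    return min(plans, key=lambda p: p.response_time_s)


# Illustrative predictions in seconds (made-up numbers).
table = {("A", "A"): 101.1, ("A", "B"): 62.1, ("B", "A"): 111.1, ("B", "B"): 52.1}
best = choose_plan(["A", "B"], ["A", "B"], lambda d, e: table[(d, e)])
print(best)
```

Exhaustive enumeration is adequate here because the number of replica sites and compute sites in a testbed is small; a larger Grid might need pruning.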

Scheduler Implementation • Developed a scheduler prototype, called Chameleon, to evaluate the scheduling model • Built on top of services provided by Globus • GRAM • MDS • GridFTP • Replica Catalog • NWS is used for measuring and forecasting network bandwidth • Scheduling algorithms are based on the scheduling model presented

Testbed for experiments

Applications • Gene sequence comparison applications (bioinformatics) • Computationally intensive analysis of large protein databases • Bio-scientists predict the structure and functions of a newly found protein by comparing it with well-known protein databases • The size of a database exceeds 500 MB • There are various versions of the protein databases • Large databases are replicated in the Data Grid • Two well-known applications, BLAST and FASTA, are executed

Applications - parameters

Experimental Results (1) • Replication scenario • Results when executing PSI-BLAST

Experimental Results (2) • Results when executing FASTA in the replication scenario of the previous slide

Experimental Results (3) • No replication takes place • Results when executing PSI-BLAST

Experimental Results (4) • Increasing the number of replicas decreases response time

Conclusions • Job scheduling model for the Data Grid • The model consists of 5 distinct scenarios • A scheduler prototype, called Chameleon, was developed based on the presented scheduling model • Experiments were performed with Chameleon on a constructed Grid testbed • Better performance is achieved by considering data locations as well as computational capabilities

