Data Grids Darshan R. Kapadia Gregor von Laszewski http://grid.rit.edu
Grids
• We’ve seen computational grids: collections of computing clusters, together with the protocols and software used to submit jobs, distribute work, schedule jobs, monitor status, etc.
• But how do we manage collections of data on a grid, and not just the computations and programs themselves?
Data Grid
Source: Lothar A. T. Bauerdick (2003). Grid Tools and the LHC Data Challenges. LHC Symposium, May 3, 2003.
Why data grids?
• The immense computational demands of many scientific applications are often coupled with massive amounts of data.
• These data sets must be shared by a virtual organization (or multiple VOs) for a variety of computations.
• Distributing jobs to diverse geographic computing resources also requires distributing data collections for processing and storing output.
Data Grid Challenges
• Storage capacity for massive quantities of data
• Distributing data sets to dispersed geographic locations to complete jobs in a grid
• Maximizing the computation-to-communication ratio
• Aggregation of results, data coherency
• Who has “the” copy of the data set?
• Need to do all of this securely and robustly
Functions of a Data Grid
• Data Access
  • How do we access and manage data?
    • Storage Resource Brokers
    • UNIX file systems, distributed file systems, HTTP servers, etc.
  • How do we transfer data?
• Metadata Access
  • Data about data!
• Replica Management
  • Create/delete copies of data
  • Replica “catalogs”
• Replica Selection
  • Locating the best data replica to use for an application
  • Determining the subset of data required for a job
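The replica management and replica selection functions listed above can be sketched in a few lines of Python. This is only an illustrative model, not the API of any real grid middleware: the `Replica` and `ReplicaCatalog` classes, the site names, the `example.org` URLs, and the cost model (latency plus size divided by bandwidth) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str               # hypothetical site name
    url: str                # physical location of this copy
    bandwidth_mbps: float   # assumed bandwidth from the client to the site
    latency_ms: float       # assumed round-trip latency to the site


class ReplicaCatalog:
    """Maps a logical file name to the physical replicas that hold it."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, replica):
        # Replica management: record a newly created copy.
        self._replicas.setdefault(logical_name, []).append(replica)

    def unregister(self, logical_name, site):
        # Replica management: forget a deleted copy.
        kept = [r for r in self._replicas.get(logical_name, []) if r.site != site]
        self._replicas[logical_name] = kept

    def lookup(self, logical_name):
        return list(self._replicas.get(logical_name, []))


def estimated_transfer_seconds(replica, size_mb):
    # Assumed cost model: latency plus size (converted to megabits) over bandwidth.
    return replica.latency_ms / 1000.0 + (size_mb * 8.0) / replica.bandwidth_mbps


def select_replica(catalog, logical_name, size_mb):
    # Replica selection: pick the copy with the lowest estimated transfer time.
    candidates = catalog.lookup(logical_name)
    if not candidates:
        raise LookupError(f"no replica registered for {logical_name!r}")
    return min(candidates, key=lambda r: estimated_transfer_seconds(r, size_mb))


catalog = ReplicaCatalog()
catalog.register("climate/run42.nc",
                 Replica("site-a", "gsiftp://site-a.example.org/run42.nc", 100.0, 80.0))
catalog.register("climate/run42.nc",
                 Replica("site-b", "gsiftp://site-b.example.org/run42.nc", 1000.0, 150.0))

best = select_replica(catalog, "climate/run42.nc", size_mb=500)
print(best.site)  # site-b: higher latency, but much higher bandwidth
```

A real selection service would likely use measured network and storage performance rather than fixed numbers, and could also restrict the transfer to the subset of data a job needs.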
Earth System Grid
The Earth System Grid (ESG) integrates supercomputers with large-scale data and analysis servers located at numerous national labs and research centers to create a powerful environment for next-generation climate research.
Participating organizations:
• Argonne National Laboratory
• Lawrence Berkeley National Laboratory
• Lawrence Livermore National Laboratory
• Los Alamos National Laboratory
• National Center for Atmospheric Research
• Oak Ridge National Laboratory
• University of Southern California/Information Sciences Institute
http://www.earthsystemgrid.org/
High Energy Physics Application
Source: Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing Journal, 28(5), 749-771.
Data Grid Architecture
Source: Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., & Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Comput., 29(10), 1335-1356.
Data Grid Design
• Mechanism Neutrality
• Policy Neutrality
• Compatibility with Grid Infrastructure
• Uniformity of Information Infrastructure
Core Data Grid Services
• Storage System and Data Access
  • Data Abstraction: Storage System
  • Data Access
• Metadata Services
High-Level Data Grid Components
• Replica Management
• Replica Selection and Data Filtering
GASS
• Globus Access to Secondary Storage [5]
• NOT a distributed file system
• Files are accessed through Unix (C-style) fopen/fclose calls
• Default behavior is to transfer the entire file from the remote site into a local cache when the file is opened
• GASS also provides finer-tuned control:
  • Pre-stage/post-stage file accesses
  • Cache management
• No cache coherency (changes made to the remote file are not propagated to caches)
GASS (contd.): Commands
• globus_gass_fopen
• globus_gass_fclose
• File names are URLs
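The default GASS behavior described above (fetch the whole file into a local cache on open, address files by URL, no cache coherency) can be illustrated with a small Python model. This is a sketch of the idea only, not the Globus C API: the `WholeFileCache` class and its `fopen` method are hypothetical names, and a local `file://` URL stands in for a remote site.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


class WholeFileCache:
    """Hypothetical model of open-time whole-file caching."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self._local = {}  # URL -> path of the cached copy

    def fopen(self, url, mode="r"):
        # First open of a URL: transfer the entire file into the local cache.
        if url not in self._local:
            fd, path = tempfile.mkstemp(dir=self.cache_dir)
            with os.fdopen(fd, "wb") as dst, urllib.request.urlopen(url) as src:
                shutil.copyfileobj(src, dst)
            self._local[url] = path
        # Later opens reuse the cached copy without checking the remote file.
        return open(self._local[url], mode)


workdir = tempfile.mkdtemp()
remote = pathlib.Path(workdir, "remote.txt")   # stands in for the remote file
remote.write_text("v1")
url = remote.as_uri()                          # file names are URLs

cache = WholeFileCache(os.path.join(workdir, "cache"))
with cache.fopen(url) as f:
    first = f.read()

remote.write_text("v2")                        # the "remote" copy changes
with cache.fopen(url) as f:
    second = f.read()

print(first, second)  # v1 v1: the change is not propagated to the cache
```

The second read still returns the old contents, which is the lack of cache coherency noted on the GASS slide.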
GridFTP
• GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth, wide-area networks.
• Based on FTP (RFC 959)
• Extended for higher performance, flexibility, and robustness
  • Parallel data sources, parallel transfers
  • Partial file transfers
  • Transfer restart capabilities
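To make the ideas of partial transfers, parallel transfers, and restart concrete, here is a minimal Python sketch that copies a local file in independent byte ranges using several worker threads. It does not speak the GridFTP protocol; local files stand in for the two endpoints, and the function names, chunk size, and stream count are assumptions chosen for the example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size, chunk):
    # Split [0, size) into half-open (start, end) ranges of at most `chunk` bytes.
    return [(start, min(start + chunk, size)) for start in range(0, size, chunk)]


def copy_range(src_path, dst_path, start, end):
    # Partial transfer: move only the bytes in [start, end).
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(start)
        dst.seek(start)
        dst.write(src.read(end - start))
    return (start, end)


def parallel_copy(src_path, dst_path, chunk=1024, streams=4, done=None):
    # `done` holds ranges that already arrived; passing it back in after a
    # failure restarts the transfer without resending those ranges.
    size = os.path.getsize(src_path)
    done = set() if done is None else done
    if not os.path.exists(dst_path):
        with open(dst_path, "wb") as dst:
            dst.truncate(size)  # preallocate so ranges can land in any order
    pending = [r for r in byte_ranges(size, chunk) if r not in done]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_range, src_path, dst_path, s, e)
                   for s, e in pending]
        for future in futures:
            done.add(future.result())
    return done


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "source.bin")
dst = os.path.join(workdir, "copy.bin")
data = os.urandom(10_000)
with open(src, "wb") as f:
    f.write(data)

completed = parallel_copy(src, dst, chunk=1024, streams=4)
with open(dst, "rb") as f:
    copied = f.read()
print(len(completed), copied == data)  # 10 True
```

Because each range is written at its own offset, ranges can be transferred in any order, over any number of streams, and the set of completed ranges is all the state needed to resume.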
GridFTP (contd.)
• Can use GSI (Grid Security Infrastructure) for security.
• TeraGrid has three clients which utilize GridFTP:
  • UberFTP (recommended)
  • globus-url-copy (preferred for scripting)
  • tgcp (deprecated)
Amazon Simple Storage Service (Amazon S3™)
Amazon S3 is storage for the Internet. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.
AWS S3 Functionality
• Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each. The number of objects you can store is unlimited.
• Each object is stored in a bucket and retrieved via a unique, developer-assigned key.
• Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users.
• Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
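The bucket/key model and the access rules described above can be mimicked with a small in-memory Python class. This is a toy model for illustration only, not the S3 REST or SOAP interface and not any AWS SDK: the `Bucket` class, its method names, and the user names are hypothetical, and the size limit is simply the 5-gigabyte figure quoted on the slide.

```python
MAX_OBJECT_BYTES = 5 * 1024**3  # per-object limit as quoted on the slide


class AccessDenied(Exception):
    pass


class Bucket:
    """Hypothetical in-memory stand-in for a bucket of keyed objects."""

    def __init__(self, owner):
        self.owner = owner
        self._objects = {}  # key -> (data, public flag, users granted read access)

    def put(self, user, key, data, public=False, grants=()):
        if user != self.owner:
            raise AccessDenied(f"{user} may not write to this bucket")
        if not 1 <= len(data) <= MAX_OBJECT_BYTES:
            raise ValueError("object size must be between 1 byte and 5 gigabytes")
        self._objects[key] = (bytes(data), public, frozenset(grants))

    def get(self, user, key):
        data, public, grants = self._objects[key]
        # Readable if public, owned by the caller, or explicitly granted.
        if public or user == self.owner or user in grants:
            return data
        raise AccessDenied(f"{user} may not read {key!r}")

    def delete(self, user, key):
        if user != self.owner:
            raise AccessDenied(f"{user} may not delete {key!r}")
        del self._objects[key]


bucket = Bucket(owner="alice")
bucket.put("alice", "results/run1.csv", b"x,y\n1,2\n", grants=["bob"])
bucket.put("alice", "readme.txt", b"hello", public=True)

print(bucket.get("bob", "results/run1.csv"))  # granted user can read
print(bucket.get("carol", "readme.txt"))      # public object, anyone can read
```

In this model an object is private unless it is marked public or the reader appears in its grant list, which mirrors the private/public/granted distinction on the slide.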
Replica Management
Source: Srikumar Venugopal, Rajkumar Buyya, and Kotagiri Ramamohanarao. A Taxonomy of Data Grids for Distributed Data Sharing, Management, and Processing.
Conclusion
• A data grid must maintain large amounts of data, which gives it a distinctive architecture.
• Data grids will become increasingly important as future applications require ever larger amounts of data.
References
• http://www.earthsystemgrid.org/
• Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., & Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Comput., 29(10), 1335-1356.
• Allcock, W., Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23, 187-200.
• Bester, J., Foster, I., Kesselman, C., Tedesco, J., & Tuecke, S. (1999). GASS: A Data Movement and Access Service for Wide Area Computing Systems. Paper presented at the Proceedings of IOPADS'99.
• Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing Journal, 28(5), 749-771.