
Data Grids



Presentation Transcript


  1. Data Grids Darshan R. Kapadia Gregor von Laszewski http://grid.rit.edu

  2. Grids • We’ve seen computational grids: collections of computing clusters, together with the protocols and software used to submit jobs, distribute work, schedule jobs, monitor status, and so on. • But how do we manage collections of data on a grid, not just the computations and programs themselves?

  3. Data Grid • Source: Bauerdick, L. A. T. (2003). Grid Tools and the LHC Data Challenges. LHC Symposium, May 3, 2003.

  4. Why data grids? • The immense computational demands of many scientific applications are often coupled with massive amounts of data. • These data sets must be shared by a virtual organization (VO), or by multiple VOs, for a variety of computations. • Distributing jobs to geographically diverse computing resources also requires distributing data collections for processing and for storing output.

  5. Data Grid Challenges • Storage capacity for massive quantities of data • Distributing data sets to dispersed geographic locations to complete jobs in a grid • Maximizing the computation-to-communication ratio • Aggregation of results and data coherency • Who has “the” copy of the data set? • All of this must be done securely and robustly

  6. Functions of a Data Grid • Data Access: how do we access and manage data (storage resource brokers; UNIX file systems, distributed file systems, HTTP servers, etc.), and how do we transfer it? • Metadata Access: data about data • Replica Management: create and delete copies of data; maintain replica “catalogs” • Replica Selection: locate the best data replica to use for an application; determine the subset of data required for a job
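
The replica management and replica selection functions on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class and function names, the example URLs, and the cost model (latency plus size divided by bandwidth) are hypothetical and do not come from any particular grid toolkit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    """One physical copy of a logical file (all fields are illustrative)."""
    url: str               # physical location of this copy
    bandwidth_mbps: float  # assumed throughput estimate to the site
    latency_ms: float      # assumed round-trip latency estimate


@dataclass
class ReplicaCatalog:
    """Maps a logical file name to the replicas that hold its contents."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, replica):
        """Record a newly created copy."""
        self.entries.setdefault(logical_name, []).append(replica)

    def unregister(self, logical_name, url):
        """Forget a deleted copy."""
        kept = [r for r in self.entries.get(logical_name, []) if r.url != url]
        self.entries[logical_name] = kept

    def lookup(self, logical_name):
        """Return every known replica of a logical file."""
        return list(self.entries.get(logical_name, []))


def estimated_seconds(replica, size_mb):
    """Rough transfer-time estimate: latency plus size over bandwidth."""
    return replica.latency_ms / 1000.0 + (size_mb * 8.0) / replica.bandwidth_mbps


def select_replica(catalog, logical_name, size_mb):
    """Replica selection: pick the copy with the lowest estimated transfer time."""
    candidates = catalog.lookup(logical_name)
    if not candidates:
        raise KeyError(f"no replica registered for {logical_name!r}")
    return min(candidates, key=lambda r: estimated_seconds(r, size_mb))


catalog = ReplicaCatalog()
catalog.register("climate/run42.nc",
                 Replica("gsiftp://site-a.example.org/run42.nc", 100.0, 80.0))
catalog.register("climate/run42.nc",
                 Replica("gsiftp://site-b.example.org/run42.nc", 1000.0, 150.0))

best = select_replica(catalog, "climate/run42.nc", size_mb=500.0)
print(best.url)
```

A real selection service would probably draw on measured network and storage performance rather than fixed estimates; the sketch only shows how a catalog lookup and a cost function fit together.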

  7. Earth System Grid • The Earth System Grid (ESG) integrates supercomputers with large-scale data and analysis servers located at numerous national labs and research centers to create a powerful environment for next-generation climate research. • Participating organizations: Argonne National Laboratory, Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, Los Alamos National Laboratory, National Center for Atmospheric Research, Oak Ridge National Laboratory, University of Southern California/Information Sciences Institute. http://www.earthsystemgrid.org/

  8. High Energy Physics Application • Source: Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing, 28(5), 749-771.

  9. Data Grid Architecture • Source: Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Computing, 29(10), 1335-1356.

  10. Data Grid Design • Mechanism Neutrality • Policy Neutrality • Compatibility with Grid Infrastructure • Uniformity of Information Infrastructure

  11. Core Data GRID services • Storage System and Data Access • Data Abstraction: Storage System • Data Access • Metadata Services

  12. High Level Data Grid Components • Replica Management • Replica Selection and Data Filtering

  13. GASS • Globus Access to Secondary Storage [5] • NOT a distributed file system • Unix (C-style) fopen/fclose interface • By default, the entire file is transferred from the remote site into a local cache when the file is opened • GASS also provides finer-grained control: pre-staging and post-staging of file accesses, and cache management • No cache coherency (changes made to the remote file are not propagated to caches)

  14. GASS (contd.): Commands • globus_gass_fopen • globus_gass_fclose • File names are URLs
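
The caching behaviour described for GASS (fetch the whole file on open, optional pre-staging, explicit cache management, no coherency) can be mimicked in a few lines. This is a toy model only, not the Globus API: the `WholeFileCache` class, its methods, and the `fetch` callable are hypothetical names, and a plain dictionary stands in for the remote site.

```python
import os
import tempfile
from hashlib import sha256


class WholeFileCache:
    """Toy model of open-time whole-file caching keyed by URL.

    `fetch` is a hypothetical callable (url -> bytes) standing in for the
    remote transfer; it is not part of any Globus interface.
    """

    def __init__(self, fetch, cache_dir=None):
        self.fetch = fetch
        self.cache_dir = cache_dir or tempfile.mkdtemp(prefix="gass-sketch-")

    def _local_path(self, url):
        # One cache file per URL, named by a hash of the URL.
        return os.path.join(self.cache_dir, sha256(url.encode()).hexdigest())

    def prestage(self, url):
        """Copy the entire remote file into the cache ahead of use."""
        path = self._local_path(url)
        with open(path, "wb") as f:
            f.write(self.fetch(url))
        return path

    def open(self, url, mode="rb"):
        """Open by URL: fetch the whole file on a cache miss, then open locally."""
        path = self._local_path(url)
        if not os.path.exists(path):
            self.prestage(url)
        return open(path, mode)

    def evict(self, url):
        """Cache management: drop the local copy so the next open refetches."""
        path = self._local_path(url)
        if os.path.exists(path):
            os.remove(path)


# A dictionary stands in for the remote site.
url = "https://data.example.org/input.dat"
remote = {url: b"version 1"}
cache = WholeFileCache(fetch=lambda u: remote[u])

with cache.open(url) as f:
    first = f.read()

remote[url] = b"version 2"   # the remote file changes...
with cache.open(url) as f:
    stale = f.read()         # ...but the cached copy is not updated

cache.evict(url)
with cache.open(url) as f:
    fresh = f.read()         # refetched only after explicit eviction
```

The second read returns the old contents, which is the "no cache coherency" point from the slide: nothing tells the cache that the remote file changed.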

  15. GridFTP • GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth, wide-area networks. • Based on FTP (RFC 959) • Extended for higher performance, flexibility, and robustness • Parallel data sources and parallel transfers • Partial file transfers • Transfer restart capabilities
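
Three of the extensions listed here (partial transfers, parallel transfers, and restart) can be sketched together without any network code. The sketch below is not GridFTP and uses none of its commands: the function names, the block size, and the `fail_at` parameter are assumptions made for illustration, and an in-memory byte string stands in for the remote file.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, block):
    """Cut [0, size) into (offset, length) blocks."""
    return [(off, min(block, size - off)) for off in range(0, size, block)]


def read_range(source, offset, length):
    """Partial transfer: stands in for fetching one byte range remotely."""
    return source[offset:offset + length]


def transfer(source, dest, done, block=4, streams=3, fail_at=None):
    """Copy the blocks not yet in `done`, using `streams` parallel workers.

    `done` is a set of completed offsets that acts as a restart marker.
    `fail_at` is a hypothetical offset at which a dropped connection is simulated.
    """
    pending = [r for r in split_ranges(len(source), block) if r[0] not in done]

    def copy_block(rng):
        offset, length = rng
        if offset == fail_at:
            raise ConnectionError(f"simulated failure at offset {offset}")
        dest[offset:offset + length] = read_range(source, offset, length)
        done.add(offset)

    # Leaving the `with` block waits for every submitted block to finish.
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_block, rng) for rng in pending]

    errors = [f.exception() for f in futures if f.exception() is not None]
    if errors:
        raise errors[0]


source = b"0123456789abcdefghij"
dest = bytearray(len(source))
done = set()

try:
    transfer(source, dest, done, fail_at=8)   # first attempt loses one block
except ConnectionError:
    pass

missing_after_failure = sorted(
    off for off, _ in split_ranges(len(source), 4) if off not in done
)
transfer(source, dest, done)                  # restart sends only what is missing
```

Because completed offsets are recorded, the second call transfers only the block that failed instead of starting over, which is the idea behind restartable transfers.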

  16. GridFTP • Can use GSI for security. • TeraGrid has three clients that use GridFTP: UberFTP (recommended), globus-url-copy (preferred for scripting), and tgcp (deprecated).

  17. Amazon Simple Storage Service (Amazon S3™) Amazon S3 is storage for the Internet. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.

  18. AWS S3 Functionalities • Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each. The number of objects you can store is unlimited. • Each object is stored in a bucket and retrieved via a unique, developer-assigned key. • Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users. • Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
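
The object model on this slide (buckets, developer-assigned keys, a size range per object, and private, public, or per-user access) can be pictured with a small in-memory sketch. It is not the Amazon S3 API and makes no web service calls: the `Bucket` class, its methods, the user names, and the owner-only write rule are all assumptions made for illustration.

```python
MIN_OBJECT_BYTES = 1
MAX_OBJECT_BYTES = 5 * 1024**3   # the 5-gigabyte ceiling quoted on the slide (binary gigabytes assumed)


class Bucket:
    """Toy in-memory bucket: key -> (data, public flag, granted users)."""

    def __init__(self, owner):
        self.owner = owner
        self._objects = {}

    def put(self, user, key, data, public=False, grants=()):
        """Write an object under a developer-assigned key."""
        if user != self.owner:
            raise PermissionError("only the owner may write in this sketch")
        if not MIN_OBJECT_BYTES <= len(data) <= MAX_OBJECT_BYTES:
            raise ValueError("object size outside the allowed range")
        self._objects[key] = (data, public, set(grants))

    def get(self, user, key):
        """Read an object if it is public, owned by, or granted to `user`."""
        data, public, grants = self._objects[key]
        if public or user == self.owner or user in grants:
            return data
        raise PermissionError(f"{user!r} may not read {key!r}")

    def delete(self, user, key):
        """Remove an object; restricted to the owner in this sketch."""
        if user != self.owner:
            raise PermissionError("only the owner may delete in this sketch")
        del self._objects[key]


bucket = Bucket(owner="alice")
bucket.put("alice", "results/run42.csv", b"t,temp\n0,288.1\n", grants={"bob"})
print(bucket.get("bob", "results/run42.csv"))
```

Here `bob` can read the object because he was granted access, while any other user would be refused unless the object were stored with `public=True`.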

  19. Replica Management • Source: Venugopal, S., Buyya, R., & Ramamohanarao, K. A Taxonomy of Data Grids for Distributed Data Sharing, Management, and Processing.

  20. Conclusion • A data grid must maintain large amounts of data, which gives it a distinctive architecture. • Data grids will be very important in the future, as applications will require ever larger amounts of data.

  21. References • Earth System Grid. http://www.earthsystemgrid.org/ • Chervenak, A., Deelman, E., Kesselman, C., Allcock, B., Foster, I., Nefedova, V., et al. (2003). High-performance remote access to climate simulation data: a challenge problem for data grid technologies. Parallel Computing, 29(10), 1335-1356. • Allcock, W., Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., & Tuecke, S. (2001). The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 23, 187-200. • Bester, J., Foster, I., Kesselman, C., Tedesco, J., & Tuecke, S. (1999). GASS: A Data Movement and Access Service for Wide Area Computing Systems. Proceedings of IOPADS'99. • Allcock, B., Bester, J., Bresnahan, J., Chervenak, A. L., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., & Tuecke, S. (2002). Data Management and Transfer in High Performance Computational Grid Environments. Parallel Computing, 28(5), 749-771.
