Data Services Task Team Discussions on GRID and GRIDftp
Stuart Doescher, USGS
WGISS-15, May 2003, Toulouse, France
The Grid Problem
• Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources (from “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”)
• Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, assuming the absence of:
  • central location
  • central control
  • omniscience
  • existing trust relationships
The Data Grid Problem
“Enable a geographically distributed community [of thousands] to perform sophisticated, computationally intensive analyses on Petabytes of data”
• Sounds like a separate class of problem, but is actually a superset.
• So all work done on “Grid Problems” applies to “DataGrid Problems”. We just need some additional tools.
Globus Approach
• Software toolkit addressing key technical areas
  • Offer a modular “bag of technologies”
  • Enable incremental development of grid-enabled tools and applications
  • Define and standardize grid protocols and APIs (our software development supports this goal)
• Focus is on inter-domain issues, not clustering
  • Supports collaborative resource use spanning multiple organizations
  • Integrates cleanly with intra-domain services
  • Creates a “collective” service layer
Major Data Grid Projects
• Earth System Grid (DOE Office of Science)
  • DG technologies, climate applications
• European Data Grid (EU)
  • DG technologies & deployment in EU
• GriPhyN – Grid Physics Network (NSF ITR)
  • Investigation of “Virtual Data” concept
• Particle Physics Data Grid (DOE Science)
  • DG applications for HENP experiments
Basic Data Grid Services
1. GridFTP: Data Transfer and Access
  • Common protocol for data movement
  • Secure, efficient, reliable, flexible, extensible, etc.
  • Grid Forum (Internet) Draft
  • Family of tools supporting this protocol
    • Wu-ftpd, ncftp, Globus Toolkit SDKs, etc.
2. Replica Management Architecture
  Simple scheme for managing:
  • multiple copies of files
  • collections of files
GridFTP: Basic Approach
• FTP is defined by several IETF RFCs
• Start with most commonly used subset
  • Standard FTP: get/put etc., 3rd-party transfer
• Implement standard but often unused features
  • GSS binding, extended directory listing, simple restart
• Extend in various ways, while preserving interoperability with existing servers
Features of GridFTP
• Grid Security Infrastructure and Kerberos support: robust and flexible authentication, integrity, and confidentiality
• Third-party control of data transfer: a user or application at one site initiates, monitors, and controls a data transfer between two other sites
• Parallel data transfer: on wide-area links, use multiple TCP streams in parallel between the same source and destination
• Striped data transfer: use multiple TCP streams to transfer data that is striped or interleaved across multiple servers
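The parallel data transfer idea can be illustrated locally. The following is a minimal sketch, not GridFTP itself: it splits a file into contiguous blocks and copies each block with its own worker thread, where each worker stands in for one TCP stream. The function names (`split_ranges`, `copy_block`, `parallel_copy`) are hypothetical and chosen only for this illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Divide [0, size) into at most `streams` contiguous (offset, length) blocks."""
    if size <= 0 or streams <= 0:
        return []
    chunk = -(-size // streams)  # ceiling division
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def copy_block(src_path, dst_path, offset, length):
    """Copy one block. Each worker opens its own handles, standing in for one stream."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        dst.write(src.read(length))
    return length


def parallel_copy(src_path, dst_path, streams=4):
    """Copy src to dst using `streams` concurrent workers; return bytes copied."""
    size = os.path.getsize(src_path)
    # Pre-size the destination so blocks can be written in any order.
    with open(dst_path, "wb") as dst:
        dst.truncate(size)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_block, src_path, dst_path, off, length)
                   for off, length in split_ranges(size, streams)]
        return sum(f.result() for f in futures)
```

Because every block carries its own offset, blocks may arrive in any order, which is the property that lets several streams share one transfer.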
Features of GridFTP (cont.)
• Partial file transfer: standard FTP allows transfer of the remainder of a file starting at an offset. GridFTP supports transfers of arbitrary subsets or regions of a file
• Automatic negotiation of TCP buffer/window sizes: optimal settings for TCP buffer/window sizes can dramatically improve performance
• Support for reliable and restartable data transfer: the FTP standard includes basic features for restart that are not widely implemented. GridFTP exploits these features and extends them
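Partial and restartable transfers both come down to bookkeeping over byte ranges: record which regions have arrived, then request only what is missing. The sketch below shows that bookkeeping under stated assumptions. Ranges are half-open `(start, end)` pairs, and the helper names are hypothetical. It does not reproduce the GridFTP wire format.

```python
def merge_ranges(ranges):
    """Coalesce overlapping or adjacent half-open (start, end) byte ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def missing_ranges(received, total_size):
    """Return the half-open ranges of [0, total_size) not yet received."""
    gaps, cursor = [], 0
    for start, end in merge_ranges(received):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_size:
        gaps.append((cursor, total_size))
    return gaps
```

After an interruption, a client could call `missing_ranges` on the ranges it has stored and request only those regions instead of restarting from byte zero.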
GridFTP for Efficient WAN Data Transfer
• Secure authentication
• Parallel transfer gets job done quickly
• Partial file access gets only required data
• Up to 2.8 Gb/s using a striped server architecture
Parallel Transfer: fully utilizes bandwidth of network interface on single nodes.
Striped Transfer: fully utilizes bandwidth of Gb+ WAN using multiple nodes.
(Diagram labels: Parallel Filesystem, Parallel Filesystem)
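TCP buffer sizes matter on a WAN because of the bandwidth-delay product: the sender must keep roughly bandwidth × round-trip time bytes in flight to fill the link. The sketch below works through that arithmetic. The 100 ms round-trip time and the even split across streams are illustrative assumptions, not figures from the slides.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes: bits/s * seconds / 8, using integer math."""
    return bandwidth_bits_per_s * rtt_ms // 8000


def per_stream_buffer(bandwidth_bits_per_s, rtt_ms, streams):
    """Rule-of-thumb buffer per stream if the window is split evenly across streams."""
    return bdp_bytes(bandwidth_bits_per_s, rtt_ms) // streams


# 2.8 Gb/s link with an assumed 100 ms round-trip time
total = bdp_bytes(2_800_000_000, 100)              # 35,000,000 bytes in flight
each = per_stream_buffer(2_800_000_000, 100, 8)    # 4,375,000 bytes per stream
```

This is one reason parallel streams help: each stream needs only a fraction of the total window, which is easier to obtain from default operating system limits.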
Current Data Delivery Process: FTP-Based
• Pull: semi-anonymous FTP
  • Product ready
  • Email sent to user with instructions and password
  • User connects via FTP as “anonymous” with the provided password
  • FTP daemon positions user in the appropriate directory
  • User pulls data
• Push: routine data flows to high-volume users
  • Account provided on remote system
  • When data is available, it is pushed to the remote system
Potential Future Data Delivery: GridFTP-Based
• For routine multiple-usage customers
  • Establish “certificate process” with customer
    • Self-signed certificate authority
    • Customer generates private/public key pair
    • Generate user certificate with public key
    • Add user certificate to list of trusted users
  • Customer must install GridFTP client
    • Globus Toolkit data management client bundle
    • Gsincftp
    • Java Commodity Grid Kit for Windows
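The “list of trusted users” step can be pictured as a lookup table from certificate subject names to local accounts. The sketch below assumes the list is kept in the style of a Globus grid-mapfile, with one quoted distinguished name and one local user name per line. The slides do not say how the list is stored, so that format is an assumption, and the names and accounts in the example are hypothetical.

```python
import shlex


def parse_gridmap(text):
    """Parse lines of the form: "<certificate subject DN>" <local user>."""
    mapping = {}
    for line in text.splitlines():
        # shlex handles the quoted DN; comments=True skips '#' comment text.
        fields = shlex.split(line, comments=True)
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def authorize(subject_dn, mapping):
    """Return the local account for a trusted subject, or None if untrusted."""
    return mapping.get(subject_dn)


# Hypothetical entries for illustration only
example = '''
# routine customers
"/O=Grid/OU=Example/CN=Jane Doe" jdoe
"/O=Grid/OU=Example/CN=Data Center Push" push01
'''
trusted = parse_gridmap(example)
```

Adding a customer then amounts to appending one line, and removing that line revokes access without touching the certificate authority.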
Potential Future Data Delivery: GridFTP-Based
• For routine multiple-usage customers
  • Pull
    • Product ready
    • Email notifies user that data is ready
    • User authenticates with GridFTP and the user certificate, is granted access, and pulls data
  • Push
    • Account provided on remote system with host certificate and our user certificate
    • These Grid certificates establish a Virtual Organization between the two parties
    • When data is available, GridFTP is used to push it to the remote system
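A pull step like the one above might be scripted on the customer side. The sketch below only builds the argument list for `globus-url-copy`, the command-line client shipped with the Globus Toolkit, using its `-p` (parallel streams) and `-tcp-bs` (TCP buffer size) options. It does not run the command and assumes credentials are already in place. The host name and paths are hypothetical, and `build_copy_command` is a name chosen for this illustration.

```python
def build_copy_command(host, remote_path, local_path, streams=4, tcp_buffer_bytes=None):
    """Return an argv list that pulls one file over GridFTP to a local path."""
    if not remote_path.startswith("/") or not local_path.startswith("/"):
        raise ValueError("paths must be absolute")
    cmd = ["globus-url-copy", "-p", str(streams)]
    if tcp_buffer_bytes is not None:
        cmd += ["-tcp-bs", str(tcp_buffer_bytes)]
    # Source is a gsiftp:// URL; destination is a local file:// URL.
    cmd += [f"gsiftp://{host}{remote_path}", f"file://{local_path}"]
    return cmd


# Hypothetical host and paths for illustration only
argv = build_copy_command("data.example.org", "/products/scene.tar",
                          "/tmp/scene.tar", streams=4, tcp_buffer_bytes=1048576)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the client is installed. For the push case, the source and destination URLs would swap roles.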
Potential Future Data Delivery: GridFTP-Based
• For single-usage customers
Process to
  • Establish “certificate process” with customer
  • Customer must install GridFTP client
Currently seems too complex (not worth the effort)
Would like to have a simplified method such as
  • Email a one-time-use “user certificate”
  • A GridFTP client built into the browser