
Wide Area Data Replication for Scientific Collaborations

This presentation describes a higher-level Grid data management service, the Data Replication Service (DRS), which provides an application-independent way to replicate large datasets and register them in catalogs for scientific collaborations. DRS functionality is based on the publication capability of the Lightweight Data Replicator (LDR) system developed by the Laser Interferometer Gravitational Wave Observatory (LIGO) collaboration, and the service is implemented in the Globus Toolkit 4 Web services environment. Performance measurements of DRS in a wide area Grid are also reported.


Presentation Transcript


  1. Wide Area Data Replication for Scientific Collaborations Ann Chervenak, Robert Schuler, Carl Kesselman USC Information Sciences Institute Scott Koranda Univa Corporation Brian Moe University of Wisconsin Milwaukee

  2. Motivation • Scientific application domains spend considerable effort managing large amounts of experimental and simulation data • Have developed customized, higher-level Grid data management services • Examples: • Laser Interferometer Gravitational Wave Observatory (LIGO) Lightweight Data Replicator System • High Energy Physics projects: EGEE system, gLite, LHC Computing Grid (LCG) middleware • Portal-based coordination of services (e.g., Earth System Grid)

  3. Motivation (cont.) • Data management functionality varies by application • Share several requirements: • Publish and replicate large datasets (millions of files) • Register data replicas in catalogs and discover them • Perform metadata-based discovery of datasets • May require ability to validate correctness of replicas • In general, data updates and replica consistency services not required (i.e., read-only accesses) • Systems provide production data management services to individual scientific domains • Each project spends considerable resources to design, implement & maintain data management system • Typically cannot be re-used by other applications

  4. Motivation (cont.) • Long-term goals: • Generalize functionality provided by these data management systems • Provide suite of application-independent services • Paper describes one higher-level data management service: the Data Replication Service (DRS) • DRS functionality based on publication capability of the LIGO Lightweight Data Replicator (LDR) system • Ensures that a set of files exists on a storage site • Replicates files as needed, registers them in catalogs • DRS builds on lower-level Grid services, including: • Globus Reliable File Transfer (RFT) service • Replica Location Service (RLS)

  5. Outline • Description of LDR data publication capability • Generalization of this functionality • Define characteristics of an application-independent Data Replication Service (DRS) • DRS Design • DRS Implementation in GT4 environment • Evaluation of DRS performance in a wide area Grid • Related work • Future work

  6. A Data-Intensive Application Example: The LIGO Project • Laser Interferometer Gravitational Wave Observatory (LIGO) collaboration • Seeks to measure gravitational waves predicted by Einstein • Collects experimental datasets at two LIGO instrument sites in Louisiana and Washington State • Datasets are replicated at other LIGO sites • Scientists analyze the data and publish their results, which may be replicated • Currently LIGO stores more than 40 million files across ten locations

  7. The Lightweight Data Replicator • LIGO scientists developed the Lightweight Data Replicator (LDR) System for data management • Built on top of standard Grid data services: • Globus Replica Location Service • GridFTP data transport protocol • LDR provides a rich set of data management functionality, including • a pull-based model for replicating necessary files to a LIGO site • efficient data transfer among LIGO sites • a distributed metadata service architecture • an interface to local storage systems • a validation component that verifies that files on a storage system are correctly registered in a local RLS catalog

  8. LIGO Data Publication and Replication Two types of data publishing 1. Detectors at Livingston and Hanford produce data sets • Approx. a terabyte per day during LIGO experimental runs • Each detector produces a file every 16 seconds • Files range in size from 1 to 100 megabytes • Data sets are copied to main repository at CalTech, which stores them in tape-based mass storage system • LIGO sites can acquire copies from CalTech or one another 2. Scientists also publish new or derived data sets as they perform analysis on existing data sets • E.g., data filtering or calibration may create new files • These new files may also be replicated at LIGO sites

  9. Some Terminology • A logical file name (LFN) is a unique identifier for the contents of a file • Typically, a scientific collaboration defines and manages the logical namespace • Guarantees uniqueness of logical names within that organization • A physical file name (PFN) is the location of a copy of the file on a storage system. • The physical namespace is managed by the file system or storage system • The LIGO environment currently contains: • More than six million unique logical files • More than 40 million physical files stored at ten sites
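
The relationship between the two namespaces can be pictured in a few lines of Python. This is only an illustration of the LFN/PFN distinction using plain dictionaries; the file names and URLs are invented examples, not LIGO's actual namespace or the RLS API.

      # Illustrative sketch: a logical-to-physical mapping of the kind a Local
      # Replica Catalog maintains, modeled with plain Python data structures.
      from collections import defaultdict

      lrc = defaultdict(set)   # LFN -> set of PFNs known at this site

      def register(lfn, pfn):
          """Record that a physical copy of the logical file exists at pfn."""
          lrc[lfn].add(pfn)

      def lookup(lfn):
          """Return all known physical locations for a logical file name."""
          return lrc.get(lfn, set())

      # One logical file, two physical replicas on different storage systems
      # (hypothetical names and URLs).
      register("H-R-815750464-16.gwf", "gsiftp://storage.site-a.example/data/H-R-815750464-16.gwf")
      register("H-R-815750464-16.gwf", "file:///archive/frames/H-R-815750464-16.gwf")
      print(lookup("H-R-815750464-16.gwf"))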

  10. Components at Each LDR Site • Local storage system • GridFTP server for file transfer • Metadata Catalog: associations between logical file names and metadata attributes • Replica Location Service: • Local Replica Catalog (LRC) stores mappings from logical names to storage locations • Replica Location Index (RLI) collects state summaries from LRCs • Scheduler and transfer daemons • Prioritized queue of requested files

  11. LDR Data Publishing • Scheduling daemon runs at each LDR site • Queries site’s metadata catalog to identify logical files with specified metadata attributes • Checks RLS Local Replica Catalog to determine whether copies of those files already exist locally • If not, puts logical file names on priority-based scheduling queue • Transfer daemon also runs at each site • Checks queue and initiates data transfers in priority order • Queries RLS Replica Location Index to find sites where desired files exist • Randomly selects source file from among available replicas • Uses GridFTP transport protocol to transfer file to local site • Registers newly-copied file in RLS Local Replica Catalog
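
The two daemons described above amount to a simple pull loop. The sketch below models that loop with plain Python structures standing in for the metadata catalog, the Local Replica Catalog, the Replica Location Index, and the priority queue; the GridFTP transfer is reduced to a print statement, and all file, site, and attribute names are hypothetical.

      # Sketch of an LDR-style pull publishing loop; not the real LDR, RLS,
      # or GridFTP APIs.
      import heapq
      import random

      metadata_catalog = {                       # LFN -> metadata attributes
          "frame-001.gwf": {"run": "S5", "detector": "H1"},
          "frame-002.gwf": {"run": "S5", "detector": "L1"},
      }
      local_replicas = {"frame-001.gwf"}         # LRC stand-in: LFNs already held locally
      remote_index = {                           # RLI stand-in: LFN -> sites with a copy
          "frame-002.gwf": ["gsiftp://site-b.example", "gsiftp://site-c.example"],
      }

      def schedule(wanted_attrs, queue):
          """Scheduling daemon: queue files that match the metadata query
          and do not yet have a local replica."""
          for lfn, attrs in metadata_catalog.items():
              if all(attrs.get(k) == v for k, v in wanted_attrs.items()):
                  if lfn not in local_replicas:
                      heapq.heappush(queue, (0, lfn))   # priority 0 = highest

      def transfer(queue):
          """Transfer daemon: pull queued files from a randomly chosen source
          and register the new local copy."""
          while queue:
              _, lfn = heapq.heappop(queue)
              source = random.choice(remote_index[lfn])          # select among replicas
              print(f"GridFTP transfer {source}/{lfn} -> local storage")  # placeholder
              local_replicas.add(lfn)                            # register in local catalog

      queue = []
      schedule({"run": "S5"}, queue)
      transfer(queue)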

  12. Generalizing the LDR Publication Scheme • Want to provide a similar capability that is • Independent of LIGO infrastructure • Useful for a variety of application domains • Capabilities include: • Interface to specify which files are required at local site • Use of Globus RLS to discover whether replicas exist locally and where they exist in the Grid • Use of a selection algorithm to choose among available replicas • Use of Globus Reliable File Transfer service and GridFTP data transport protocol to copy data to local site • Use of Globus RLS to register new replicas

  13. Relationship to Other Globus Services At requesting site, deploy: • WS-RF Services • Data Replication Service • Delegation Service • Reliable File Transfer Service • Pre WS-RF Components • Replica Location Service (Local Replica Catalog, Replica Location Index) • GridFTP Server

  14. DRS Functionality • Initiate a DRS Request • Create a delegated credential • Create a Replicator resource • Monitor Replicator resource • Discover replicas of desired files in RLS, select among replicas • Transfer data to local site with Reliable File Transfer Service • Register new replicas in RLS catalogs • Allow client inspection of DRS results • Destroy Replicator resource DRS implemented in Globus Toolkit Version 4, complies with Web Services Resource Framework (WS-RF)
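
The steps above can be outlined as client-side code. The classes and methods below are minimal stand-ins written for this sketch only; they are not the actual GT4 DRS, Delegation, or RFT client interfaces.

      # Hypothetical outline of a DRS request, mirroring the steps on the slide.
      class DelegatedCredential:
          def __init__(self, lifetime_hours):
              self.epr = "credential-epr"          # placeholder endpoint reference
          def destroy(self):
              print("credential destroyed")

      class Replicator:
          """Stand-in for the WS-RF Replicator resource created by DRS."""
          def __init__(self, files, credential_epr):
              self.files, self.status, self.stage = files, "Pending", None
          def run(self):
              # Driven by the service itself: RLS discovery, RFT/GridFTP
              # transfer, then RLS registration.
              for stage in ("discover", "transfer", "register"):
                  self.stage = stage
                  print("stage:", stage)
              self.status = "Finished"
          def results(self):                        # client inspection of results
              return {lfn: "replicated" for lfn in self.files}
          def destroy(self):                        # resource termination
              print("replicator destroyed")

      credential = DelegatedCredential(lifetime_hours=12)      # delegated credential
      replicator = Replicator(["frame-001.gwf", "frame-002.gwf"], credential.epr)
      replicator.run()                                          # monitorable stages
      print(replicator.status, replicator.results())
      replicator.destroy()
      credential.destroy()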

  15. WSRF in a Nutshell • Service state management: Resource, Resource Property (RP) • State identification: Endpoint Reference (EPR) • State interfaces: GetRP, QueryRPs, GetMultipleRPs, SetRP • Lifetime interfaces: SetTerminationTime, ImmediateDestruction • Notification interfaces: Subscribe, Notify • ServiceGroups
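
The WS-RF pattern named on this slide can be illustrated with plain Python objects standing in for Web service resources: a resource carries resource properties, is identified by an EPR, has a termination time, and notifies subscribers when a property changes. This mirrors the interfaces listed above; it is not the WS-RF specification or GT4 code.

      # Minimal sketch of the WS-RF resource pattern (illustration only).
      import time
      import uuid

      class Resource:
          def __init__(self, properties):
              self.epr = str(uuid.uuid4())      # stand-in for an endpoint reference
              self.rps = dict(properties)       # resource properties
              self.termination_time = None
              self.subscribers = []

          def get_rp(self, name):               # GetRP / GetMultipleRPs analogue
              return self.rps[name]

          def set_rp(self, name, value):        # SetRP analogue; notifies subscribers
              self.rps[name] = value
              for callback in self.subscribers: # Notify analogue
                  callback(name, value)

          def subscribe(self, callback):        # Subscribe analogue
              self.subscribers.append(callback)

          def set_termination_time(self, seconds_from_now):  # SetTerminationTime analogue
              self.termination_time = time.time() + seconds_from_now

      replicator = Resource({"Status": "Pending", "Stage": "discover"})
      replicator.subscribe(lambda rp, value: print(f"RP {rp} changed to {value}"))
      replicator.set_termination_time(3600)
      replicator.set_rp("Stage", "transfer")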

  16. Create Delegated Credential • Client initializes a user proxy certificate • Client creates a delegated credential resource in the Delegation service and sets its termination time • Credential EPR is returned to the client (Slides 16–23 annotate the same architecture diagram: the Data Replication client and the Delegation, DRS, and RFT services in a GT4 service container at the requesting site, together with RLS Replica Catalogs, a Replica Index, GridFTP servers, and MDS)

  17. Create Replicator Resource • Client creates a Replicator resource, passing the delegated credential EPR and setting a termination time • Replicator EPR is returned to the client • Replicator accesses the delegated credential resource

  18. Monitor Replicator Resource • Client subscribes to ResourceProperty changes for the “Status” RP and “Stage” RP • Replicator resource is added to the MDS Index information service • The Index periodically polls Replicator RPs via GetRP or GetMultRP • Conditions may trigger alerts or other actions (Trigger service not pictured)

  19. Query Replica Information • Replicator receives notification that the “Stage” RP value has changed to “discover” • Replicator queries the RLS Replica Index to find catalogs that contain the desired replica information • Replicator queries the RLS Replica Catalog(s) to retrieve mappings from logical names to target names (URLs)
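
The two-level lookup on this slide, index first and then the catalogs it points to, can be sketched with dictionaries standing in for the RLS components; the catalog names, file names, and URLs are invented.

      # Sketch of the index-then-catalog discovery step (illustration only).
      replica_index = {                          # RLI stand-in: LFN -> catalogs
          "frame-002.gwf": ["lrc.site-b.example", "lrc.site-c.example"],
      }
      replica_catalogs = {                       # LRC stand-ins: catalog -> {LFN -> [URLs]}
          "lrc.site-b.example": {"frame-002.gwf": ["gsiftp://site-b.example/data/frame-002.gwf"]},
          "lrc.site-c.example": {"frame-002.gwf": ["gsiftp://site-c.example/store/frame-002.gwf"]},
      }

      def discover(lfn):
          """Return every known physical location (URL) of a logical file."""
          urls = []
          for catalog in replica_index.get(lfn, []):               # query the index
              urls.extend(replica_catalogs[catalog].get(lfn, []))  # query each catalog
          return urls

      print(discover("frame-002.gwf"))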

  20. Transfer Data • Replicator receives notification that the “Stage” RP value has changed to “transfer” • Replicator creates a Transfer resource in the RFT service, passing the credential EPR and setting a termination time; the Transfer resource EPR is returned • RFT accesses the delegated credential resource and sets up GridFTP transfer of the file(s) • Data is transferred between GridFTP server sites • Replicator periodically polls the “ResultStatus” RP via GetRP; when it reads “Done”, it gets state information for each file transfer
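
The transfer stage described above (create a Transfer resource, poll its “ResultStatus” RP, then read per-file state) can be sketched as follows. The class is a stand-in written for illustration, not the GT4 RFT API, and the GridFTP activity is simulated.

      # Sketch of polling a transfer resource until its result status is "Done".
      import time

      class TransferResource:
          def __init__(self, transfers):
              self.pending = list(transfers)     # (source URL, destination URL) pairs
              self.file_state = {}

          def result_status(self):               # "ResultStatus" RP analogue
              return "Done" if not self.pending else "Active"

          def progress(self):                    # placeholder for GridFTP activity
              source, destination = self.pending.pop(0)
              self.file_state[source] = ("Finished", destination)

      transfer = TransferResource([
          ("gsiftp://site-b.example/data/frame-002.gwf", "file:///local/frame-002.gwf"),
      ])
      while transfer.result_status() != "Done":  # periodic polling via GetRP analogue
          transfer.progress()
          time.sleep(0.1)
      print(transfer.file_state)                 # per-file transfer state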

  21. Register Replica Information • Replicator receives notification that the “Stage” RP value has changed to “register” • Replicator registers the new file mappings in the RLS Replica Catalog • The RLS Replica Catalog sends an update of the new replica mappings to the Replica Index
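
The registration stage can be sketched the same way: new logical-to-physical mappings go into the local catalog, and the catalog then reports to the index which logical names it now holds. Dictionaries again stand in for the RLS components, and the names are hypothetical.

      # Sketch of replica registration plus the catalog-to-index update.
      local_catalog = {}                         # LRC stand-in: LFN -> [URLs]
      replica_index = {}                         # RLI stand-in: LFN -> [catalogs]

      def register_replica(lfn, url, catalog_name="lrc.local.example"):
          local_catalog.setdefault(lfn, []).append(url)       # register new mapping
          catalogs = replica_index.setdefault(lfn, [])        # update the index: this
          if catalog_name not in catalogs:                    # catalog now holds the LFN
              catalogs.append(catalog_name)

      register_replica("frame-002.gwf", "file:///local/frame-002.gwf")
      print(local_catalog, replica_index)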

  22. Client Inspection of State • Client receives notification that the “Status” RP value has changed to “Finished” • Client inspects the Replicator state information for each replication in the request

  23. Resource Termination • The termination times set by the client eventually expire • The Credential, Transfer, and Replicator resources are destroyed

  24. Performance Measurements: Wide Area Testing • The destination for the pull-based transfers is located in Los Angeles • Dual-processor, 1.1 GHz Pentium III workstation with 1.5 GBytes of memory and a 1 Gbit/sec Ethernet connection • Runs a GT4 container and deploys services including RFT and DRS as well as GridFTP and RLS • The remote site where desired data files are stored is located at Argonne National Laboratory in Illinois • Dual-processor, 3 GHz Intel Xeon workstation with 2 gigabytes of memory and 1.1 terabytes of disk • Runs a GT4 container as well as GridFTP and RLS services

  25. DRS Operations Measured • Create the DRS Replicator resource • Discover source files for replication using local RLS Replica Location Index and remote RLS Local Replica Catalogs • Initiate a Reliable File Transfer operation by creating an RFT resource • Perform RFT data transfer(s) • Register the new replicas in the RLS Local Replica Catalog
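
One generic way to collect per-component timings like those reported on the next two slides is to wrap each operation in a timer and record the elapsed milliseconds. This is a sketch under that assumption, not the instrumentation actually used for the measurements; the operations here are placeholders.

      # Generic per-operation timing harness (placeholder operations).
      import time

      def timed(label, operation, timings):
          start = time.perf_counter()
          operation()
          timings[label] = (time.perf_counter() - start) * 1000.0   # milliseconds

      timings = {}
      timed("Create Replicator Resource", lambda: time.sleep(0.01), timings)
      timed("Discover Files in RLS",      lambda: time.sleep(0.01), timings)
      print(timings)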

  26. Experiment 1: Replicate 10 Files of Size 1 Gigabyte

      Component of Operation        Time (milliseconds)
      Create Replicator Resource                  317.0
      Discover Files in RLS                       449.0
      Create RFT Resource                         808.6
      Transfer Using RFT                      1186796.0
      Register Replicas in RLS                   3720.8

  • Data transfer time dominates
  • Wide area data transfer rate of 67.4 Mbits/sec
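
The reported rate follows directly from the table, assuming 1 gigabyte = 10^9 bytes:

      \[
        \frac{10 \times 10^{9}\ \text{bytes} \times 8\ \text{bits/byte}}{1186.796\ \text{s}}
        \approx 6.74 \times 10^{7}\ \text{bits/s} = 67.4\ \text{Mbits/s}
      \]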

  27. Experiment 2: Replicate 1000 Files of Size 10 Megabytes

      Component of Operation        Time (milliseconds)
      Create Replicator Resource                 1561.0
      Discover Files in RLS                          9.8
      Create RFT Resource                        1286.6
      Transfer Using RFT                       963456.0
      Register Replicas in RLS                  11278.2

  • Time to create the Replicator and RFT resources is larger than in Experiment 1, since state must be stored for 1000 outstanding transfers
  • Data transfer time still dominates
  • Wide area data transfer rate of 85 Mbits/sec
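
A back-of-the-envelope reading of the table (simple division of the reported times over the 1000 files, no new measurements):

      \[
        \frac{963456\ \text{ms}}{1000\ \text{files}} \approx 963\ \text{ms per 10-MB file},
        \qquad
        \frac{11278.2\ \text{ms}}{1000\ \text{files}} \approx 11.3\ \text{ms per replica registration}
      \]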

  28. Future Work • We will continue performance testing of DRS: • Increasing the size of the files being transferred • Increasing the number of files per DRS request • Add and refine DRS functionality as it is used by applications • E.g., add a push-based replication capability • We plan to develop a suite of general, configurable, composable, high-level data management services
