STORK: A Scheduler for Data Placement Activities in the Grid Tevfik Kosar University of Wisconsin-Madison kosart@cs.wisc.edu
Some Remarkable Numbers: Characteristics of four physics experiments targeted by GriPhyN. (Source: GriPhyN Proposal, 2000)
Even More Remarkable… “…the data volume of CMS is expected to subsequently increase rapidly, so that the accumulated data volume will reach 1 Exabyte (1 million Terabytes) by around 2015.” (Source: PPDG Deliverables to CMS)
Other Data Intensive Applications • Genomic information processing applications • Biomedical Informatics Research Network (BIRN) applications • Cosmology applications (MADCAP) • Methods for modeling large molecular systems • Coupled climate modeling applications • Real-time observatories, applications, and data-management (ROADNet)
Need to Deal with Data Placement • Data need to be moved, staged, replicated, cached, and removed; storage space for data needs to be allocated and de-allocated. • We refer to all of these data-related activities in the Grid as Data Placement (DaP) activities.
State of the Art • Data placement activities in the Grid are performed either manually or by simple scripts. • Data placement activities are simply regarded as “second-class citizens” of the computation-dominated Grid world.
Our Goal • Our goal is to make data placement activities “first-class citizens” in the Grid, just like computational jobs! • They need to be queued, scheduled, monitored, managed, and even checkpointed.
Outline • Introduction • Grid Challenges • Stork Solutions • Case Study: SRB-UniTree Data Pipeline • Conclusions & Future Work
Grid Challenges • Heterogeneous Resources • Limited Resources • Network/Server/Software Failures • Different Job Requirements • Scheduling of Data & CPU together
Stork • Intelligently and reliably schedules, runs, monitors, and manages Data Placement (DaP) jobs in a heterogeneous Grid environment, and ensures that they complete. • What Condor is for computational jobs, Stork is for DaP jobs. • Just submit a bunch of DaP jobs and then relax.
Stork Solutions to Grid Challenges • Specialized in Data Management • Modularity & Extendibility • Failure Recovery • Global & Job Level Policies • Interaction with Higher Level Planners/Schedulers
Already Supported URLs • file:/ -> Local File • ftp:// -> FTP • gsiftp:// -> GridFTP • nest:// -> NeST (chirp) protocol • srb:// -> SRB (Storage Resource Broker) • srm:// -> SRM (Storage Resource Manager) • unitree:// -> UniTree server • diskrouter:// -> UW DiskRouter
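Each URL scheme selects the corresponding transfer protocol, so moving data between any two of these systems comes down to the pair of URLs in a submit file (the full submit file format appears on a later slide). A minimal sketch, with a hypothetical host and paths:
Src_Url  = "file:/tmp/x.dat";
Dest_Url = "gsiftp://gridftp.example.edu/data/x.dat";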
Architecture overview (diagram): Higher Level Planners and DAGMan sit at the top; computational jobs flow through Condor-G to the Gate Keeper and StartD, while DaP jobs flow through Stork to data services such as RFT, GridFTP, SRM, SRB, and NeST.
Interaction with DAGMan: a DAG file can mix computational jobs and DaP jobs, e.g.:
Job A A.submit
DaP X X.submit
Job C C.submit
Parent A child C, X
Parent X child B
…..
DAGMan sends the computational jobs (A, C, …) to the Condor job queue and the DaP jobs (X, Y, …) to the Stork job queue.
Sample Stork submit file:
[
  Type = "Transfer";
  Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  ……
  Max_Retry = 10;
  Restart_in = "2 hours";
]
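Once a description like this is saved to a file, it can be handed to the Stork server with the stork_submit client tool and tracked with stork_status; the exact command-line arguments below are an assumption, not taken from the talk:
$ stork_submit sample.stork   # queue the DaP job with the Stork server
$ stork_status 15             # check on the transfer (15 is a hypothetical job id)
Max_Retry bounds how many times Stork will retry the job; Restart_in appears to control when an unfinished transfer is restarted.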
Case Study: SRB-UniTree Data Pipeline • We transferred ~3 TB of DPOSS data (2611 x 1.1 GB files) from SRB to UniTree using 3 different pipeline configurations. • The pipelines were built using Condor and Stork scheduling technologies, and the whole process was managed by DAGMan.
Pipeline configurations (diagrams), all driven from the submit site:
• Configuration 1: SRB get (SRB server → NCSA cache), then UniTree put (NCSA cache → UniTree server).
• Configuration 2: SRB get (SRB server → SDSC cache), GridFTP transfer (SDSC cache → NCSA cache), then UniTree put (NCSA cache → UniTree server).
• Configuration 3: SRB get (SRB server → SDSC cache), DiskRouter transfer (SDSC cache → NCSA cache), then UniTree put (NCSA cache → UniTree server).
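For each DPOSS file, the hops of a configuration can be chained as DaP nodes in a small DAG, using the DAG syntax shown earlier. A minimal sketch for configuration 3 follows; the submit file names are hypothetical, and the real pipelines may have included extra steps (e.g., cache cleanup) not shown here:
DaP A srb_get.submit        # SRB get: SRB server -> SDSC cache
DaP B diskrouter.submit     # DiskRouter: SDSC cache -> NCSA cache
DaP C unitree_put.submit    # UniTree put: NCSA cache -> UniTree server
Parent A child B
Parent B child C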
Outcomes of the Study 1. Stork interacted easily and successfully with different underlying systems: SRB, UniTree, GridFTP, and DiskRouter.
Outcomes of the Study (2) 2. We had the chance to compare different pipeline topologies and configurations:
Outcomes of the Study (3) 3. Almost all possible network, server, and software failures were recovered from automatically.
Failure Recovery: failures handled automatically during the transfers included: DiskRouter reconfigured and restarted; UniTree not responding; SDSC cache reboot & UW CS network outage; SRB server maintenance.
For more information on the results of this study, please check: http://www.cs.wisc.edu/condor/stork/
Conclusions • Stork makes data placement a “first-class citizen” in the Grid. • Stork is the Condor of the data placement world. • Stork is fault tolerant, easy to use, modular, extendible, and very flexible.
Future Work • More intelligent scheduling • Data level management instead of file level management • Checkpointing for transfers • Security
You don’t have to FedEx your data anymore… Stork delivers it for you! • For more information: • Drop by my office anytime: Room 3361, Computer Science & Statistics Bldg. • Email: kosart@cs.wisc.edu