Lightweight Replication of Heavyweight Data
Scott Koranda
University of Wisconsin-Milwaukee & National Center for Supercomputing Applications
www.griphyn.org
Heavyweight Data from LIGO
• Sites at Livingston, LA (LLO) and Hanford, WA (LHO)
• 2 interferometers at LHO, 1 at LLO
• 1000s of channels recorded at rates of 16 kHz, 16 Hz, 1 Hz, …
• Output is binary “frame” files holding 16 seconds of data with GPS timestamp
  • ~ 100 MB from LHO
  • ~ 50 MB from LLO
• ~ 1 TB/day in total
• S1 run ~ 2 weeks
• S2 run ~ 8 weeks
[Figure: 4 km LIGO interferometer at Livingston, LA]
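As a rough check on the ~ 1 TB/day figure, the frame sizes above can be multiplied out. This is only a sketch: it assumes the ~ 100 MB and ~ 50 MB figures are each site's output per 16-second frame interval (the slide does not say so explicitly), it uses decimal units, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the raw data rate quoted on the slide.
# Assumption: ~100 MB (LHO) and ~50 MB (LLO) are produced per 16 s frame interval.
SECONDS_PER_DAY = 24 * 60 * 60
FRAME_SECONDS = 16
MB_PER_FRAME_INTERVAL = {"LHO": 100, "LLO": 50}

frames_per_day = SECONDS_PER_DAY // FRAME_SECONDS
raw_gb_per_day = frames_per_day * sum(MB_PER_FRAME_INTERVAL.values()) / 1000  # decimal GB

print(f"{frames_per_day} frame intervals/day, ~{raw_gb_per_day:.0f} GB/day raw")
```

Under these assumptions the total comes to roughly 0.8 TB/day, which is in line with the quoted ~ 1 TB/day.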
Networking to IFOs Limited
• LIGO IFOs remote, making bandwidth expensive
• Couple of T1 lines for email/administration only
• Ship tapes to Caltech (SAM-QFS)
• Reduced data sets (RDS) generated and stored on disk
  • ~ 20% size of raw data
  • ~ 200 GB/day
GridFedEx protocol
Replication to University Sites
[Figure: sites labeled Cardiff, MIT, AEI, UWM, PSU, CIT, UTB]
Why Bulk Replication to University Sites?
• Each has compute resources (Linux clusters)
  • Early plan was to provide one or two analysis centers
  • Now everyone has a cluster
• Cheap storage is cheap
  • $1/GB for drives
  • TB RAID-5 < $10K
  • Throw more drives into your cluster
• Analysis applications read a lot of data
  • Different ways to slice some problems, but most want access to large sets of data for a particular instance of search parameters
LIGO Data Replication Challenge
• Replicate 200 GB/day of data to multiple sites securely, efficiently, robustly (no babysitting…)
• Support a number of storage models at sites
  • CIT → SAM-QFS (tape) and large IDE farms
  • UWM → 600 partitions on 300 cluster nodes
  • PSU → multiple 1 TB RAID-5 servers
  • AEI → 150 partitions on 150 nodes with redundancy
• Coherent mechanism for data discovery by users and their codes
• Know what data we have, where it is, and replicate it fast and easy
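To put the 200 GB/day target in network terms, the sketch below converts it to an average sustained rate per destination site and to hours of transfer time at a given link rate. The function names are illustrative, decimal units are assumed, and the 10 MB/s used in the example is the sustained Caltech to UWM rate reported elsewhere in this talk.

```python
SECONDS_PER_DAY = 86_400

def required_rate_mb_s(gb_per_day: float) -> float:
    """Average rate (MB/s) needed to keep up with a daily volume."""
    return gb_per_day * 1000 / SECONDS_PER_DAY

def hours_to_transfer(gb: float, sustained_mb_s: float) -> float:
    """Hours needed to move `gb` gigabytes at a sustained rate in MB/s."""
    return gb * 1000 / sustained_mb_s / 3600

print(f"{required_rate_mb_s(200):.2f} MB/s average needed per site")
print(f"{hours_to_transfer(200, 10):.1f} h to move one day's data at 10 MB/s")
```

On these numbers a site needs a little over 2 MB/s on average just to keep pace, so a sustained 10 MB/s leaves headroom for catching up after outages.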
Prototyping “Realizations”
• Need to keep “pipe” full to achieve desired transfer rates
  • Mindful of overhead of setting up connections
  • Set up GridFTP connection with multiple channels, tuned TCP windows and I/O buffers, and leave it open
  • Sustained 10 MB/s between Caltech and UWM, peaks up to 21 MB/s
• Need cataloging that scales and performs
  • Globus Replica Catalog (LDAP) scales to < 10^5, which is not acceptable
  • Need a solution with a relational database backend that scales to 10^7 with fast updates/reads
• No need for “reliable file transfer” (RFT)
  • Problem with any single transfer? Forget it, come back later…
• Need robust mechanism for selecting collections of files
  • Users/sites demand flexibility in choosing what data to replicate
• Need to get network people interested
  • Do your homework, then challenge them to make your data flow faster
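The “forget it, come back later” policy can be sketched as a simple queue: a failed file is not retried on the spot but pushed to the back, and the loop moves on to the next file. This is a minimal illustration, not LDR's actual code; `transfer` is a hypothetical stand-in for the real copy operation, and `max_attempts` is an assumed parameter.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

def replicate(files: Iterable[str],
              transfer: Callable[[str], None],
              max_attempts: int = 3) -> Tuple[List[str], List[str]]:
    """Try each file once per pass; on failure, move on and revisit it later.

    `transfer` is a hypothetical callable that raises an exception when a
    single transfer fails. Returns (done, abandoned) lists of file names.
    """
    pending = deque((name, 1) for name in files)
    done: List[str] = []
    abandoned: List[str] = []
    while pending:
        name, attempt = pending.popleft()
        try:
            transfer(name)
            done.append(name)
        except Exception:
            if attempt < max_attempts:
                pending.append((name, attempt + 1))  # come back later
            else:
                abandoned.append(name)  # leave it for a future run
    return done, abandoned
```

Because a failure never blocks the queue, one bad file or a flaky source does not stall the rest of the day's transfers.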
LIGO, err… Lightweight Data Replicator (LDR)
• What data we have…
  • Globus Metadata Catalog Service (MCS)
• Where data is…
  • Globus Replica Location Service (RLS)
• Replicate it fast…
  • Globus GridFTP protocol
  • What client to use? Right now we use our own
• Replicate it easy…
  • Logic we added
  • Is there a better solution?
Lightweight Data Replicator
• Replicated 20 TB to UWM thus far
• Just deployed at MIT, PSU, AEI
• Deployment in progress at Cardiff
• LDRdataFindServer running at UWM
Lightweight Data Replicator
• “Lightweight” because we think it is the minimal collection of code needed to get the job done
• Logic coded in Python
  • Use SWIG to wrap Globus RLS
  • Use pyGlobus from LBL elsewhere
• Each site is any combination of publisher, provider, subscriber
  • Publisher populates metadata catalog
  • Provider populates location catalog (RLS)
  • Subscriber replicates data using information provided by publishers and providers
• Take “Condor” approach with small, independent daemons that each do one thing
  • LDRMaster, LDRMetadata, LDRSchedule, LDRTransfer, …
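The three roles can be illustrated with two toy catalogs. This is a sketch under loud assumptions: it uses Python's built-in `sqlite3` as a stand-in for a relational backend and does not call the Globus MCS or RLS APIs; the table layout, function names, file names, and host names are all hypothetical.

```python
import sqlite3

def open_catalogs() -> sqlite3.Connection:
    """Create toy metadata and location catalogs in one in-memory database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE metadata (lfn TEXT PRIMARY KEY, ifo TEXT, gps_start INTEGER);
        CREATE TABLE location (lfn TEXT, site TEXT, url TEXT,
                               PRIMARY KEY (lfn, site));
    """)
    return db

def publish(db, lfn, ifo, gps_start):
    """Publisher role: record what data exists (metadata catalog)."""
    db.execute("INSERT INTO metadata VALUES (?, ?, ?)", (lfn, ifo, gps_start))

def provide(db, lfn, site, url):
    """Provider role: record where a copy lives (location catalog)."""
    db.execute("INSERT INTO location VALUES (?, ?, ?)", (lfn, site, url))

def to_fetch(db, local_site, ifo, gps_min, gps_max):
    """Subscriber role: wanted files not yet held locally, with one source URL each."""
    rows = db.execute("""
        SELECT m.lfn, MIN(l.url)
        FROM metadata AS m JOIN location AS l ON l.lfn = m.lfn
        WHERE m.ifo = ? AND m.gps_start BETWEEN ? AND ?
          AND m.lfn NOT IN (SELECT lfn FROM location WHERE site = ?)
        GROUP BY m.lfn
        ORDER BY m.lfn
    """, (ifo, gps_min, gps_max, local_site))
    return rows.fetchall()

# Hypothetical example data: three published files, all held at CIT,
# one of which has already been replicated to UWM.
db = open_catalogs()
publish(db, "H-0001", "LHO", 1000)
publish(db, "H-0002", "LHO", 1016)
publish(db, "L-0001", "LLO", 1000)
for lfn in ("H-0001", "H-0002", "L-0001"):
    provide(db, lfn, "CIT", f"gsiftp://cit.example/{lfn}")
provide(db, "H-0001", "UWM", "gsiftp://uwm.example/H-0001")

# UWM subscribes to LHO data in a GPS range; only the missing file is listed.
print(to_fetch(db, "UWM", "LHO", 1000, 2000))
```

Keeping “what exists” and “where it is” in separate tables mirrors the split between publisher and provider: a site can hold and serve copies without being the one that describes the data.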
Future?
• LDR is a tool that works now for LIGO
• Still, we recognize a number of projects need bulk data replication
  • There has to be common ground
• What middleware can be developed and shared?
  • We are looking for “opportunities”
  • Code for “solve our problems for us…”
• Want to investigate Stork, DiskRouter, ?
• Do contact me if you do bulk data replication…