Computing Sector, Fermi National Accelerator Laboratory
Experience with Globus Online at Fermilab
GlobusWorld 2012: Experience with GO@Fermilab
Overview
• Integration of Workload Management and Data Movement Systems with GO
• Center for Enabling Distributed Petascale Science (CEDPS): GO integration with glideinWMS
• Data Handling prototype for the Dark Energy Survey (DES)
• Performance tests of GO over 100 Gbps networks
• GO on the Advanced Network Initiative (ANI) testbed
• Data Movement on OSG for end users
• Network for Earthquake Engineering Simulation (NEES)
Fermilab’s interest in GO
• Data Movement service for end users
• Supporting user communities on the Grid
• Evaluating GO services in the workflows of our stakeholders
• Data Movement service integration
• Evaluating GO as a component of middleware systems, e.g. the glidein Workload Management system
• Evaluating the performance of GO for exascale networks (100 GE)
1. CEDPS
• CEDPS: a five-year project (2006-2011) funded by the Department of Energy (DOE)
• Goals
• Produce technical innovations for rapid and dependable data placement within a distributed high-performance environment, and for the construction of scalable science services for data and computing from many clients
• Address performance and functionality troubleshooting of these and other related distributed activities
• Collaborative Research
• Mathematics & Computer Science Division, Argonne National Laboratory
• Computing Division, Fermi National Accelerator Laboratory
• Lawrence Berkeley National Laboratory
• Information Sciences Institute, University of Southern California
• Dept. of Computer Science, University of Wisconsin-Madison
• Collaborative work done by Fermi National Lab, Argonne National Lab, and the University of Wisconsin
• Supporting the integration of data movement mechanisms with the scientific glidein workload management system
• Integration of asynchronous data stage-out mechanisms in overlay workload management systems
glideinWMS
• Pilot-based WMS that creates, on demand, a dynamically sized overlay Condor batch system on Grid resources to address the complex needs of VOs in running application workflows
• User Communities
• CMS
• Communities at Fermilab
• CDF
• DZero
• Intensity Frontier experiments (Minos, Minerva, Nova, …)
• OSG Factory at UCSD & Indiana Univ
• Serves OSG VO Frontends, including ICECube, Engage, LSST, …
• CoralWMS - Frontend for the TeraGrid community
• Atlas - evaluating glideinWMS interfaced with the Panda framework for their analysis framework
• User community growing rapidly
glideinWMS Scale of Operations
• CMS Factory@CERN serving ~400K jobs
• OSG Factory@UCSD serving ~200K jobs
• CMS Analysis Frontend@UCSD serving pool with ~25K jobs
• CMS Frontend@CERN serving pool with ~50K jobs
• CMS Production Factory (up) & Frontend at CERN
• OSG Factory & CMS Analysis at UCSD
Integrating glideinWMS with GO
• Goals:
• Middleware handles data movement, rather than the application
• Middleware optimizes use of computing resources (CPUs do not block on data movement)
• Users provide data movement directives in the Job Description File (e.g. storage services for IO)
• glideinWMS procures resources on the Grid and runs jobs using Condor
• Data movement is delegated to the underlying Condor system
• globusconnect is instantiated and the GO plug-in is invoked using the directives in the JDF
• Condor optimizes resources
[Architecture diagram: Glidein Factory / WMS Pool; VO infrastructure with VO Frontend, Condor Central Manager, and Condor Scheduler (Job); Grid site worker node running a glidein with Condor Startd, talking to globusonline.org]
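The delegation described above can be sketched as a Condor file-transfer plugin that hands the output sandbox to Globus Online instead of copying it in-process. This is a hypothetical illustration: the real glideinWMS/GO plugin's interface, attribute names, and directives differ, and `plugin_capabilities`/`transfer` are assumed names, not the production code.

```python
# Hypothetical sketch of a Condor file-transfer plugin delegating a copy
# to Globus Online. Names and the capability attributes are illustrative
# assumptions, not the actual glideinWMS plugin interface.

def plugin_capabilities():
    """ClassAd-style attributes a transfer plugin would advertise to
    Condor when probed, declaring which transfer method it handles."""
    return {
        "PluginVersion": "0.1",
        "PluginType": "FileTransfer",
        "SupportedMethods": "globusonline",
    }

def transfer(src_url, dest_url):
    """Delegate the copy to GO so the worker-node CPU is not blocked on
    data movement. A real plugin would start globusconnect, register it
    as a dynamic GO endpoint, submit src -> dest to the GO service, and
    poll until the transfer completes; here we just build the request."""
    task = {"src": src_url, "dest": dest_url, "service": "globusonline.org"}
    return task  # stand-in for the submitted transfer request
```

The point of the sketch is the division of labor: the job writes its output locally, and the middleware (Condor plus the plugin) owns the wide-area movement.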
Validation Test Results
• Tests - modified Intensity Frontier experiment (Minerva) jobs to transfer the output sandbox to a GO endpoint using the transfer plugin
• Jobs: 2636, with 500 running at a time
• Total files transferred: 16359
• Up to 500 dynamically created GO endpoints at a given time
• Lessons Learned
• Integration tests successful, with a 95% transfer success rate - stressing the scalability of GO in an unintended way
• GO team working on the scalability issues identified
• Efficiency and scalability can be increased by modifying the plugin to reuse GO endpoints and by transferring multiple files at the same time
2. Prototype integration of GO with the DES Data Access Framework
• Motivation
• Support Dark Energy Survey preparation for data taking
• See Don Petravick’s talk on Wed
• The DES Data Access Framework (DAF) uses a network of GridFTP servers to reliably move data across sites
• In Mar 2011, we investigated the integration of DAF with GO to address two issues:
• DAF data transfer parameters were not optimal for both small and large files
• Reliability was implemented inefficiently, by sequentially verifying the real file size against the DB catalogue
Results and improvements
• Tested DAF moving 31,000 files (184 GB) with GO vs. UberFTP
• Results
• Time for Transfer + Verification is the same (~100 min)
• Transfer time is 27% faster with GO than with UberFTP
• Verification time is 50% slower with GO than sequentially with UberFTP
• Proposed Improvements:
• Allow specification of src / dest transfer reliability semantics (e.g. same size, same CRC, etc.) - implemented for size
• Allow a finer-grained failure model (e.g. specify number of transfer retries instead of a time deadline)
• Provide an interface for efficient (pipelined) ls of src / dest files
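The reliability semantics proposed above (same size, same CRC) can be sketched as a post-transfer check against the DB catalogue. This is an illustrative sketch, not DAF code: the catalogue format and function names are assumptions, and a CRC32 stands in for whatever checksum a real deployment would use.

```python
# Sketch of src/dest reliability checks: compare each transferred file's
# size (and optionally CRC) against a catalogue of expected values.
# Catalogue layout and names are illustrative assumptions, not DAF's.
import os
import zlib

def crc32_of(path, chunk=1 << 20):
    """CRC32 of a file, streamed in chunks so large files use little memory."""
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF

def verify(paths, catalogue, check_crc=False):
    """Return the paths whose size (or CRC, if requested) disagrees with
    the catalogue, i.e. the candidates for re-transfer."""
    bad = []
    for p in paths:
        expected = catalogue[p]  # e.g. {"size": 184, "crc": 0x1A2B3C4D}
        if os.path.getsize(p) != expected["size"]:
            bad.append(p)
        elif check_crc and crc32_of(p) != expected["crc"]:
            bad.append(p)
    return bad
```

Checking size alone is cheap (one stat per file); the CRC check is the stronger semantic but costs a full read of each file, which is why making the level of verification selectable matters for 31,000-file transfers.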
3. GO on the ANI Testbed
• Motivation: testing Grid middleware readiness to interface with 100 Gbit/s links on the Advanced Network Initiative (ANI) Testbed
• Characteristics:
• GridFTP data transfers (small, medium, large, all sizes)
• 300 GB of data split into 42432 files (8 KB - 8 GB)
• Network: aggregate 3 x 10 Gbit/s to the bnl-1 test machine
• Local tests (reference) initiated on bnl-1
• FNAL and GO tests: initiated on the “FNAL initiator”; GridFTP control forwarded through the “VPN gateway”
Work by Dave Dykstra with contributions by Raman Verma & Gabriele Garzoglio
Test results
• GO (yellow) does almost as well as the practical maximum (red) for medium-size files
• Working with GO to improve transfer parameters for big and small files
• Small files have very high overhead over wide-area control channels
• GO auto-tuning works better for medium files than for large files
• Counterintuitively, increasing concurrency and pipelining on small files reduced the transfer throughput
Work by Dave Dykstra with contributions by Raman Verma & Gabriele Garzoglio
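The small-file overhead noted above can be illustrated with a toy model: each file pays a fixed control-channel cost (connection setup, round trips) before any payload moves, so tiny files never approach line rate no matter how fast the link is. The numbers below are illustrative assumptions, not ANI testbed measurements.

```python
# Toy model of per-file overhead on a wide-area link: effective
# throughput = payload / (payload time + fixed per-file overhead).
# link_gbps and per_file_overhead_s are illustrative, not measured.

def effective_gbps(file_size_bytes, link_gbps=10.0, per_file_overhead_s=0.5):
    """Achieved throughput (Gbit/s) for a single file transfer."""
    bits = file_size_bytes * 8
    payload_s = bits / (link_gbps * 1e9)      # time on the wire
    total_s = payload_s + per_file_overhead_s  # plus control overhead
    return bits / (total_s * 1e9)

# An 8 KB file is dominated entirely by the fixed overhead, while an
# 8 GB file amortizes it and approaches the 10 Gbit/s link rate.
small = effective_gbps(8 * 1024)
large = effective_gbps(8 * 1024 ** 3)
```

This is also consistent with the counterintuitive result on the slide: once the control channel is the bottleneck, adding concurrency multiplies contention on that channel rather than moving more payload.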
4. Data Movement on OSG for NEES
A. R. Barbosa, J. P. Conte, J. I. Restrepo, UCSD
• Motivation
• Supporting the NEES group at UCSD to run computations on the Open Science Grid (OSG)
• Goal
• Perform parametric studies that involve large-scale nonlinear models of structure or soil-structure systems, with a large number of parameters and OpenSees runs
• Application example
• Nonlinear time-history (NLTH) analyses of an advanced nonlinear finite element (FE) model of a building
• Probabilistic seismic demand hazard analysis making use of the “cloud method”: 90 bi-directional historical earthquake records
• Sensitivity of probabilistic seismic demand to FE model parameters
• 30 days on OSG vs. 12 yrs on a desktop
Success and challenges
• Jobs submitted from RENCI (NC) to ~20 OSG sites; output collected at RENCI
• NEES scientist moved 12 TB from the RENCI server to the user’s desktop at UCSD using GO
• Operations: every day, set up the data transfer update for the day: fire and forget… almost…
• …there is still no substitute for a good network administrator
• Initially we had 5 Mbps, eventually 200 Mbps (over a 600 Mbps link). Improvements:
• Upgrade the eth card on the user desktop
• Migrate from Windows to Linux
• Work with the user to use GO
• Find a good net admin to find and fix a broken fiber at RENCI, when nothing else worked
• Better use of GO on OSG: integrate GO with the Storage Resource Manager (SRM)
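The "fire and forget" daily operation above can be sketched as a small driver that queues the day's output for transfer and retries on failure. This is a hypothetical sketch: `submit_transfer` stands in for whatever GO submission mechanism (CLI or API) the operator scripts around, and the retry policy is an assumption, not the NEES team's actual procedure.

```python
# Sketch of a daily "fire and forget" transfer driver with retries.
# submit_transfer is a caller-supplied stand-in for the real GO
# submission call; names and policy here are illustrative assumptions.
import time

def daily_sync(day_dir, submit_transfer, max_retries=3, backoff_s=60):
    """Submit one day's data directory for transfer, retrying a few
    times with linear backoff; return the attempt that succeeded."""
    for attempt in range(1, max_retries + 1):
        if submit_transfer(day_dir):   # True = transfer accepted/completed
            return attempt
        time.sleep(backoff_s * attempt)  # wait longer after each failure
    raise RuntimeError("transfer of %s failed after %d attempts"
                       % (day_dir, max_retries))
```

The value of GO in this workflow is that the retrying and restarting lives in the service, so the operator-side script stays this thin; the slide's caveat still applies, since no retry loop fixes a broken fiber.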
Conclusions
• Fermilab has worked with the GO team to improve the system for several use cases:
• Integration with glidein Workload Management - stress the “many-globusconnect” dimension
• Integration with Data Handling for DES - new requirements on reliability semantics
• Evaluation of performance over 100 Gbps networks - verify transfer parameter auto-tuning at extreme scale
• Integration of GO with NEES for regular operations on OSG - usability for GO’s intended usage
Acknowledgments
• The Globus Online team for their support in all of these activities
• Integration of glideinWMS and globusonline.org was done as part of the CEDPS project
• The glideinWMS infrastructure is developed at Fermilab in collaboration with the Condor team from Wisconsin and High Energy Physics experiments
• Most of the glideinWMS development work is funded by the USCMS (part of CMS) experiment
• Currently used in production by CMS, CDF, DZero, MINOS, and ICECube, with several other VOs evaluating it for their use cases
• The Open Science Grid (OSG)
• Fermilab is operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy
References
• CEDPS Report: GO Stress Test Analysis
https://cd-docdb.fnal.gov:440/cgi-bin/RetrieveFile?docid=4474;filename=GlobusOnline%20PluginAnalysisReport.pdf;version=1
• DES DAF Integration with GO
https://www.opensciencegrid.org/bin/view/Engagement/DESIntegrationWithGlobusonline
• GridFTP & GO on the ANI Testbed
https://docs.google.com/document/d/1tFBg7QVVFu8AkUt5ico01vXcFsgyIGZH5pqbbGeI7t8/edit?hl=en_US&pli=1
• OSG User Support of NEES
https://www.opensciencegrid.org/bin/view/Engagement/EngageOpenSeesProductionDemo