
Computing Strategy DOE Annual Review Patricia McBride Fermilab Computing Division


Presentation Transcript


  1. Computing Strategy DOE Annual Review Patricia McBride Fermilab Computing Division September 26, 2007

  2. DOE Annual Program Review • Computing is a fundamental part of the laboratory infrastructure and the scientific program. • Three main components: • Core Computing • Computing in support of the Scientific Program • Computing R&D • Strategy: • Support the scientific program with state-of-the-art computing facilities • Maximize connections between the experiments and the computer facilities • Research and development aimed at scientific discovery and the facilities of tomorrow P L McBride Computing Strategy

  3. Computing in Support of the Scientific Program • Broad scientific program with diverse needs: • Run II, CMS (and the LHC), Astrophysics, Neutrino program, Theory, Accelerator • Future program: ILC, Project X, NOvA, SNAP • Challenges: • Experiments require more and more computing resources • Experiments and users are spread over the globe. • Look for common solutions: • Grid computing so resources can be shared • Common storage solutions (expertise from Run II) • Connectivity between global computing resources is increasingly important. • We need to move data seamlessly throughout widely distributed computing systems. • Accomplishing this takes experience, expertise, and R&D. • Fermilab has become a leader in HEP computing facilities through Run II experience and continues this tradition with the CMS Tier-1 center. P L McBride Computing Strategy

  4. Facilities - Current Status • Computing Division operates computing hardware and provides and manages the needed computer infrastructure, i.e., space, power & cooling • Computer rooms are located in 3 different buildings (FCC, LCC and GCC) • Mainly 4 types of hardware: • Computing boxes (multi-CPU, multi-core) • Disk servers • Tape robots with tape drives • Network equipment P L McBride Computing Strategy

  5. Planning for future facilities • The scientific program will continue to demand increasing amounts of computing power and storage. • More computers are purchased each year (~1,000/yr) • The power required per new computer increases • More computers per sq. ft. of floor space • Advances in computing may not be as predictable in the future. • Multi-core processors of the future may not be so easy to exploit for our applications. • R&D in these areas may be needed. • R&D will be needed to learn to cope with increasing demands in the next decade. P L McBride Computing Strategy

  6. Facilities - Future • To ensure sufficient provisioning of computing power, electricity and cooling, the following new developments are under discussion: • Water-cooled racks (instead of air-cooled racks) • Blade server designs • Vertical arrangement of server units in the rack • Common power supply instead of individual power supplies per unit • Higher density, lower power consumption • Multi-core processors enabled by smaller chip manufacturing processes • Same computing power at reduced power consumption • Major topic of discussion at CHEP in Sept. ‘07 P L McBride Computing Strategy

  7. Strategies to support Run II Computing • Computing Division FTEs dedicated to Run II Computing will be held constant through FY08 and increase by 2 FTEs in FY09. • 2 Application Physicists were hired in FY07 explicitly for Computing Operations • Scientist leadership expected to remain in place through FY09 • Continuing Guest Scientist and visitor positions • Increased use of shared scientific services (common across Run II, CMS, MINOS, others) • This has been a fairly successful strategy for the past 3+ years • Use of Grid resources for computing • All computational resources at Fermilab are part of FermiGrid and are thus available for sharing in times of low use by the primary customer • Being Grid-enabled allows for opportunistic use of computing resources world-wide and will facilitate large Run II special processing efforts. • Experiments and Computing Division (CD) are undertaking campaigns together to improve automation and operational efficiency P L McBride Computing Strategy

  8. FermiGrid: a strategy for success • A set of common services for the Fermilab site: • The site Globus gateway. • The site Virtual Organization support services (VOMRS and VOMS). • The site Grid User Mapping Service (GUMS), which routinely handles well in excess of 500K transactions/day. • The Site AuthoriZation Service (SAZ), which allows us to implement a consistent and enforceable site security policy (whitelist and blacklist). • The public interface between the Open Science Grid and Fermilab computing resources: • Also provides interoperation with EGEE and LCG. • Approximately 10% of our total computing resources have been used opportunistically by members of various OSG Virtual Organizations. • A consistent interface to our disparate collections of compute and storage resources: • CDF, D0 and GPFARM clusters & worker nodes. • SRM, dCache, STKen. P L McBride Computing Strategy
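To illustrate the whitelist/blacklist policy that a site authorization service such as SAZ enforces, here is a minimal Python sketch; the DN strings, policy lists, and function names are hypothetical and do not reflect the actual SAZ implementation.

```python
# Minimal sketch of whitelist/blacklist site authorization, in the spirit of SAZ.
# All names and policy entries are hypothetical, not the real SAZ interface.

BLACKLIST = {"/DC=org/DC=example/CN=banned.user"}    # explicitly denied DNs
ALLOWED_VOS = {"cms", "dzero", "cdf", "fermilab"}    # VOs permitted on site

def authorize(user_dn: str, vo: str) -> bool:
    """Return True if the (user DN, VO) pair may run on site resources."""
    if user_dn in BLACKLIST:          # blacklist always wins
        return False
    return vo in ALLOWED_VOS          # otherwise require a whitelisted VO

if __name__ == "__main__":
    print(authorize("/DC=org/DC=example/CN=jane.doe", "cms"))     # True
    print(authorize("/DC=org/DC=example/CN=banned.user", "cms"))  # False
```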

  9. Data Handling: Tape systems Fermilab CD has extensive operational experience with tape systems. Operational success is achieved through frequent interactions with stakeholders. The integrity of archived data is assured through checksums, reading of unaccessed tapes, migration to newer media, and other measures. Tape systems are a mature technology; improvements will come from increases in tape capacity and robot technology. • Data handling for experiments: • Storage: “active library-style” archiving on tapes in tape robots • Access: disk-based system (dCache) to cache sequential/random access patterns to archived data samples P L McBride Computing Strategy
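As a simple illustration of the checksum-based integrity checking mentioned above, here is a Python sketch that recomputes digests of archived files and reports mismatches; the file paths and catalog format are hypothetical and this is not the Enstore implementation.

```python
# Sketch of checksum-based integrity verification for archived files.
# Paths and the catalog format are hypothetical; Enstore's real bookkeeping differs.
import hashlib, json

def file_checksum(path: str, algo: str = "md5", chunk: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks and return its hex digest."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(catalog_path: str) -> list[str]:
    """Compare stored digests against freshly computed ones; return mismatching paths."""
    with open(catalog_path) as f:
        catalog = json.load(f)   # e.g. {"/archive/run2/file.root": "d41d8..."}
    return [p for p, digest in catalog.items() if file_checksum(p) != digest]
```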

  10. Central Storage Management Central Storage aims to reduce the cost of user storage. • Fileserver consolidation • BlueArc NAS filers • Tiered storage • HDS, 3PAR, Nexsan • Thin provisioning • Dual SAN fabric • 272 ports • 32 storage arrays • ~800 TB raw total (Chart: NAS storage growth, year 1)

  11. Networking for FNAL Strategy for networking: • Deploy additional capacity at incremental cost through an over-provisioned fiber plant on-site and leased dark fiber off-site. • Build and share expertise in high-performance data movement from application to application. Data transfers in preparation for CMS data operations: • In the last two years, outbound traffic from Fermilab has grown from 94.3 TB/month (July 2005) to 2.15 PB/month (July 2007). The CMS exercise "CSA06" accounts for the bump in June/July 2006. (Chart full scale = 2.5 PB/month.) P L McBride Computing Strategy
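A quick back-of-the-envelope check of the growth figures quoted above, converting the monthly volumes into a growth factor and an average sustained rate; the snippet only restates the slide's numbers.

```python
# Back-of-the-envelope check of the outbound-traffic growth quoted on the slide.
TB = 1e12
PB = 1e15
SECONDS_PER_MONTH = 30 * 24 * 3600

july_2005 = 94.3 * TB          # bytes/month
july_2007 = 2.15 * PB          # bytes/month

growth = july_2007 / july_2005                             # ~23x in two years
avg_gbps_2007 = july_2007 * 8 / SECONDS_PER_MONTH / 1e9    # ~6.6 Gbit/s sustained
print(f"growth factor ~{growth:.0f}x, average rate ~{avg_gbps_2007:.1f} Gbit/s")
```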

  12. Distributed Computing: Networking and data transfers • Computing on the Grid requires excellent and reliable network connections. • Network research insights have proved invaluable for real-world CMS data movement performance tuning and problem solving. • Fermilab network and storage teams provide expertise for CMS data operations. • Over a 90-day period of mock-production data transfer testing for CMS, 77% of all data delivered from CMS Tier-1 centers was delivered by Fermilab. (Chart full scale = 5 PB.) CMS computing tier structure: 20% at the T0 at CERN, 40% at T1s and 40% at T2s. Fermilab is the largest CMS T1. P L McBride Computing Strategy

  13. Open Science Grid • Fermilab is a major contributor to world-wide Grid computing and a consortium member and leader in the Open Science Grid • OSG is a collaboration between physics (high energy, astro, nuclear, gravitational wave-LIGO), computer science, IT facilities and non-physical sciences. • The OSG Project is funded cross-agency - DOE SciDAC-2 and NSF - for 5 years (from 10/06) for 33 FTE of effort: • Fermilab staff members have project leadership roles: • Ruth Pordes - Executive Director, • Don Petravick - Security Officer, • Chris Green - co-leader of the Users Group, • Ted Hesselroth - Storage Software Coordinator. • OSG currently provides access to 77 compute clusters (~30,000 cores) and 15 disk caches (~4 PB), as well as mass storage at BNL, LBNL, and Fermilab. Throughput is currently 60K application jobs/day and 10K CPU-days/day. The overall success rate on the grid is ~80%. P L McBride Computing Strategy

  14. Run II already benefits from OSG (Chart: event throughput, D0 reprocessing) • D0 used Grid-accessible farms in the US and South America (on OSG) and Europe “opportunistically” for more than half of their full dataset reprocessing, Feb-May 2007. • More than 12 lab and university clusters on OSG sites were used; they processed 286 million events and transferred 70 TB of data from/to Fermilab. • CDF has Grid-enabled all Monte Carlo production. (Chart: D0 throughput on OSG, CPU hours/week) P L McBride Computing Strategy

  15. Fermilab CMS Tier 1 facility • Fermilab’s T1 facility is in the 3rd year of a 4-year procurement period and within budget • The number of CPUs doubled to ~900 nodes, corresponding to 5.5 MSI2k • Disk space increased to 1.7 PB (one of the world’s largest dCache installations) • Wide area network connection currently 20 Gbit/s • Import to FNAL in August: daily peak of more than 250 MB/s • Export from FNAL in August: daily peak of more than 1 GB/s (from O. Gutsche, CMS)
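A quick unit check relating the quoted 20 Gbit/s WAN connection to the observed export peak above 1 GB/s; the snippet only restates the slide's numbers.

```python
# Relate the 20 Gbit/s WAN link to the observed 1 GB/s export peak.
wan_gbit_s = 20.0             # quoted WAN capacity
export_peak_gbyte_s = 1.0     # observed daily export peak

export_peak_gbit_s = export_peak_gbyte_s * 8           # 8 Gbit/s
utilization = export_peak_gbit_s / wan_gbit_s          # ~40% of the link at peak
print(f"peak export ~{export_peak_gbit_s:.0f} Gbit/s, ~{utilization:.0%} of the WAN link")
```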

  16. Fermilab T1 facility usage • Fermilab contributes significantly to the overall CMS computing • Major contribution to MC production (own production and archiving of samples produced at US T2s) • Major contribution to standard operations (re-reconstruction, skimming, etc.) • User analysis contribution goes beyond the T1 facility • Large user analysis activity not only on the T1 facility • LPC-CAF used extensively by Fermilab, US and international collaborators for various analysis purposes • Operation and extension of the facility are manpower intensive • Admin staff continuously maintain the systems • Scaling issues frequently arise as the facility grows • The 4-year ramp-up plan helps solve scaling problems in a timely manner • Strong support will be required in the future for successful operation • Successful production jobs in August: more than 100,000 (dark green in chart) • Successful analysis jobs in August: more than 50,000 (dark green in chart) (from O. Gutsche, CMS)

  17. LHC@FNAL: remote operations • Remote operations for CMS and the LHC have been an R&D effort for several years. • Collaborated with CERN to set the scope of remote operations and the development plan • Expect to participate in CMS shifts, DQM, data operations and LHC studies. • Established a collaboration with the plasma physics community. • Joint proposal submitted to SciDAC (not funded). • Collaborative tools, including high-quality communication equipment, are an important part of remote operations. • FNAL development efforts for remote operations: Role-Based Access (RBAC) for the LHC, Screen Shot Service (already in use by CDF and CMS global runs) • Working with the ILC community to develop plans for remote operations capabilities. RBAC: 1 CD developer at CERN for 9 months; SSS: developer at FNAL (~0.3 FTE) P L McBride Computing Strategy
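As a toy illustration of the role-based access idea behind the RBAC development mentioned above, here is a minimal Python sketch; the roles, permissions, and function names are hypothetical and unrelated to the actual LHC controls implementation.

```python
# Toy role-based access control (RBAC) check.
# Roles and permissions are hypothetical, not the LHC controls system.
ROLE_PERMISSIONS = {
    "lhc-operator":   {"read", "set"},   # may read and change device settings
    "remote-shifter": {"read"},          # remote monitoring only
}

def can(role: str, action: str) -> bool:
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert can("lhc-operator", "set")
assert not can("remote-shifter", "set")  # remote shifters monitor, they do not control
```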

  18. GEANT4 Development @ FNAL Fermilab joined the Geant4 Collaboration in June 2007. Contributions to time performance & reliability (M. Fischer, J. Kowalkowski, M. Paterno) and hadronic physics (S. Banerjee, D. Elvira, J. Yarba). This work has already resulted in: • A 5% improvement to the G4 code • Validation of low-energy p/p thin-target data to be used in a re-parameterization of the Low Energy Parameterization • A re-design of the repository of validation results and collection of information The Fermilab group also has responsibilities in the CMS simulation group management, and it is the core of the CMS simulation infrastructure development group. CMS Simulation at the LHC Physics Center P L McBride Computing Strategy

  19. Computing for the ILC • Detector Simulations (~3 FTE) • CD has established detector simulation activities, working with PPD, other labs and universities. This effort will grow and evolve to meet the needs of the detector studies. • We also plan to focus on infrastructure and tools. • A combination of computer professionals and physicists will be involved in this effort. • Tools include simulation package support, algorithm development, a code repository, and other expertise. GEANT4 support from CD will aid in the efforts. • Accelerator simulations (~2 FTE) • CD will provide accelerator simulation tools and accelerator simulations for the ILC as part of the APC. • Contributions include Synergia and related tools and work on the linac studies. • Damping ring studies are ramping down for now. • Other efforts will be taken up as required. P L McBride Computing Strategy

  20. Computing for the ILC • Large scale simulation support (will be 1-2 FTE) • There is a proposal to provide large scale facilities for tightly coupled calculations for the ILC and other scientific efforts at the lab. One possible application is the simulation of the ILC RF system for the input coupler and higher-order-mode couplers for the cavities. This facility could provide support for many applications that require tightly coupled parallel computing, including accelerator simulations. • Test beam (1-2 FTE for computing; 1-2 for engineering) • CD is planning to provide support for data storage and analysis for test beam studies for the ILC. The support is similar to the type of support provided for the experimental program, and it is an important service for detector R&D groups without access to significant computing resources. • Accelerator Controls (~5-7 FTE from CD / ~5 from AD) • Includes LLRF, test facility controls (high availability, timing), RDR, EDR P L McBride Computing Strategy

  21. Accelerator Simulations: SciDAC2 and COMPASS • The COMPASS (Community Petascale project for Accelerator Science and Simulation) collaboration won a SciDAC2 award in April 2007. • Project funded by HEP, NP, BES, and ASCR, at ~$3M/year for 5 years • COMPASS is the successor of the AST SciDAC1 project • Includes more activities & participants • Panagiotis Spentzouris from Fermilab is the PI for COMPASS. (See his presentation for more details.) P L McBride Computing Strategy

  22. Computing in ILC Accelerator R&D Successful test of ILC Low Level RF control at DESY-FLASH, Sept. 2007 Joint Computing Division / Accelerator Division team 10-channel LLRF controller noise measurements: SFDR = -81.8 dB Superconducting RF 8-cavity vector-sum gradient measurements Field regulation: 0.006% Phase regulation: 0.042º • Noise measured using cavity probe splits while the existing DESY LLRF was controlling the cryomodule. Noise figures include cavity probes, cables, analog and digital electronics. • RMS error: • Amplitude: -84 dB. • Phase: 0.042 degrees. P L McBride Computing Strategy
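A quick consistency check between the two ways the amplitude regulation is quoted above (0.006% of the field and -84 dB RMS error); the snippet simply converts the fractional error to decibels.

```python
# Check that 0.006% field regulation is consistent with the quoted -84 dB RMS amplitude error.
import math

fractional_error = 0.006 / 100          # 0.006% of the vector-sum gradient
db = 20 * math.log10(fractional_error)  # amplitude ratio in decibels
print(f"{db:.1f} dB")                   # ~ -84.4 dB, matching the quoted -84 dB
```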

  23. Computing for Astrophysics The computing strategy is to leverage existing infrastructure (tape robots, FermiGrid, etc.) already in place for CDF/D0 and CMS. Fermilab is hosting a public archive for SDSS, plus two copies as backups in storage. (Table: resource planning for astrophysics experiments, 5 years) P L McBride Computing Strategy

  24. SNAP Instrument Electronics R&D • Reading a billion pixels in space --> computing R&D • Fermilab (CD) has contributed to the definition of the partitioned DAQ architecture for SNAP. • Designed a firmware development board and a test stand to test communications, instrument control, data compression and data communications capabilities. (Figure: FNAL flash memory test system for SNAP: DAQ slice firmware development board with FPGA (Actel A3P1000), SDRAM, flash memory test board, and flash memory under test) P L McBride Computing Strategy

  25. Lattice QCD facility Fermilab is a member of the SciDAC-2 Computational Infrastructure for LQCD Project. The facility is distributed across 3 labs: • Custom-built QCDOC at BNL • Specially configured clusters at JLab and Fermilab • At Fermilab: • “QCD” (2004) – 128 processors coupled with a Myrinet 2000 network, sustaining 150 GFlop/sec • “Pion” (2005) – 520 processors coupled with an Infiniband fabric, sustaining 850 GFlop/sec • “Kaon” (2006) – 2400 processor cores coupled with an Infiniband fabric, sustaining 2.56 TFlop/sec Many scientific papers have been produced using this facility. P L McBride Computing Strategy
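Summing the sustained performance of the three Fermilab clusters quoted above gives the lab's aggregate LQCD capacity; the snippet only adds up the slide's numbers.

```python
# Aggregate sustained LQCD performance at Fermilab from the figures on the slide.
clusters_gflops = {"QCD": 150, "Pion": 850, "Kaon": 2560}   # GFlop/sec sustained
total = sum(clusters_gflops.values())
print(f"total ~{total / 1000:.1f} TFlop/sec sustained")     # ~3.6 TFlop/sec
```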

  26. Lattice QCD - plans • As part of the DOE 4-year USQCD project, Fermilab is scheduled to build: • a 4.2 TFlop/sec system in late 2008 • a 3.0 TFlop/sec system in late 2009 • Software projects: • new and improved libraries for LQCD computations • multicore optimizations • automated workflows • reliability and fault tolerance • visualizations (Chart: Kaon on the TOP 500 supercomputer list) P L McBride Computing Strategy

  27. Cosmological Computing - Plans • CD currently maintains a small 8-core cluster for cosmology • Other groups have many more resources: • Virgo Consortium: 670 CPU SparcIII, 816 CPU Power4 • ITC, Harvard: 316 CPU Opteron, 264 CPU Athlon • LANL: 294 CPU Pentium4 (just for cosmology) • CITA: 270 CPU Xeon cluster • SLAC: 72 CPU SGI Altix, 128 CPU Xeon cluster • Princeton: 188 CPU Xeon cluster • UWash: 64 CPU Pentium cluster • Using an FRA grant and contributions from KICP (UC) and Fermilab, CD will host/maintain a 560-core cluster for cosmology by December. • Gnedin (FNAL/PPD) developed the Adaptive Refinement Tree (ART) code • Implementation of the Adaptive Mesh Refinement method • Refines on a cell-by-cell basis • Fully 4D adaptive • Includes • - dark matter dynamics • - gas dynamics and chemistry • - radiative transfer • - star formation, etc. • The new cluster is crucial to extract fundamental physics from astrophysical observations (SDSS, DES, SNAP) → Complements/enhances the experimental program P L McBride Computing Strategy
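To illustrate the cell-by-cell refinement idea behind adaptive mesh refinement codes such as ART, here is a toy Python sketch that splits cells when a density criterion is exceeded; the threshold, data structure, and damping rule are invented for illustration and have nothing to do with the real ART implementation.

```python
# Toy cell-by-cell adaptive mesh refinement step.
# The refinement criterion and data layout are invented for illustration only;
# the real ART code is far more sophisticated (fully 4D adaptive, gas dynamics, etc.).
from dataclasses import dataclass, field

@dataclass
class Cell:
    density: float                 # matter density in the cell (arbitrary units)
    level: int = 0                 # refinement level
    children: list = field(default_factory=list)

def refine(cell: Cell, threshold: float = 8.0, max_level: int = 5) -> None:
    """Split a cell into 8 children when it is overdense, then recurse."""
    if cell.density > threshold and cell.level < max_level:
        # Halve the child density in this toy model so the recursion terminates.
        cell.children = [Cell(cell.density / 2, cell.level + 1) for _ in range(8)]
        for child in cell.children:
            refine(child, threshold, max_level)

root = Cell(density=20.0)
refine(root)
print(f"root refined into {len(root.children)} children at level 1")
```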

  28. Cosmological Computing Cluster • Modern state-of-the art cosmological simulations require even more inter-communication between processes than Lattice QCD and: • ≥ 100,000 CPU-hours (130 CPU-months). Biggest ones take > 1,000,000 CPU-hours. • computational platforms with wide (multi-CPU), large-memory nodes. • New cluster plans based on: • “Computational Cosmology Initiative: Task Force Report” • CD participating in a Task Force to unify cosmological computing on a national scale • Equipment for cluster • AMD Barcelona, 552 Cores • Cosmological calculations involve substantial amounts of data: • The full system will involve the FNAL Enstore Mass Storage System. P L McBride Computing Strategy
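A quick unit conversion behind the cost figures quoted above, turning 100,000 CPU-hours into CPU-months; the snippet only restates the slide's numbers.

```python
# Convert the quoted simulation cost from CPU-hours to CPU-months.
hours_per_month = 30 * 24                  # ~720 hours in a month
typical = 100_000 / hours_per_month        # ~139 CPU-months, close to the ~130 quoted
biggest = 1_000_000 / hours_per_month      # ~1,400 CPU-months for the largest runs
print(f"typical ~{typical:.0f} CPU-months, biggest ~{biggest:.0f} CPU-months")
```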

  29. Summary and Conclusions • Computing is an integral part of the Fermilab scientific program and must adapt to new demands and an ever-changing environment. • Fermilab is a leader in Grid computing, storage and networking for HEP applications and provides facilities for the Fermilab experiments, the astrophysics program and CMS. • R&D efforts in network, storage and other core systems have been key to smooth operations of the Fermilab computing facilities. • The Fermilab Tier-1 facility is a leader within CMS, and the computing and scientific staff associated with the center provide expertise and leadership to the collaboration. • We have begun a program to address computing issues for the ILC. • Advanced Scientific Computing R&D is a vital part of the computing strategy at the lab: accelerator simulations, Lattice QCD, cosmological computing. These applications need state-of-the-art facilities to be competitive. P L McBride Computing Strategy

  30. Backup slides P L McBride Computing Strategy

  31. FNAL Computer Security The challenge in computer security is to maintain a balance of security and openness in support of open science. • Risk-based program follows NIST standards • An array of scanners and detectors with a central database (NIMI): • Tracks every system connected to the FNAL network • Identifies the sysadmin of every system • Scans continuously & periodically for services and vulnerabilities • Detects network anomalies • Notifies and blocks non-compliant systems • Central laboratory-wide authentication system: • Kerberos- & Windows-based • Kerberos-derived X.509 certificates The thing standing between us and millions of attacks a day is the computer security team… P L McBride Computing Strategy
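A minimal sketch of the scan-then-notify-then-block workflow described above, assuming a hypothetical format for scan results and the sysadmin registry; it is not the NIMI implementation.

```python
# Minimal sketch of a scan/notify/block workflow for non-compliant systems.
# The scan-result format, sysadmin registry and block action are all hypothetical;
# this is not the NIMI implementation.
CRITICAL = {"unpatched-ssh", "open-mail-relay"}     # findings that trigger a block

def triage(scan_results: dict[str, set[str]], sysadmins: dict[str, str]) -> list[str]:
    """Return the hosts to block, notifying their registered sysadmins first."""
    blocked = []
    for host, findings in scan_results.items():
        if findings & CRITICAL:
            print(f"notify {sysadmins.get(host, 'unknown')}: {host} -> {findings & CRITICAL}")
            blocked.append(host)                    # a real system would push a network block
    return blocked

hosts = {"node01.fnal.gov": {"unpatched-ssh"}, "node02.fnal.gov": set()}
admins = {"node01.fnal.gov": "jdoe"}
print(triage(hosts, admins))                        # ['node01.fnal.gov']
```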

  32. Mass Storage & Data Movement • Enstore is used as the tape backend for storage of scientific data: • Presents a file-system view of tape storage • Routinely moves >30 TB per day in and out of Enstore • dCache is used as a high-performance disk cache for transient data: • May be used w/ or w/o Enstore • Provides Grid interfaces • Supports many replicas for performance or reliability • Built from commodity disk arrays (SATABeast) Both are joint projects involving High-Energy Physics and Grid collaborators. P L McBride Computing Strategy
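For scale, converting the quoted >30 TB per day through Enstore into an average sustained rate; the snippet only restates the slide's number.

```python
# Convert the quoted Enstore traffic (>30 TB/day) into an average sustained rate.
tb_per_day = 30
bytes_per_day = tb_per_day * 1e12
mb_per_s = bytes_per_day / 86_400 / 1e6     # ~350 MB/s averaged over the day
print(f"~{mb_per_s:.0f} MB/s sustained")
```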

  33. LHC and OSG • The OSG facility provides access to all US LHC resources (Tier-1s, 12 Tier-2 sites and >10 Tier-3s), contributing more than the agreed-upon data and job throughput for experiment data challenges and event simulations in 2006 and 2007. • OSG provides common shared grid services: operations, security, monitoring, accounting, testing, etc. It partners with the European Grid infrastructure (EGEE), the Nordic DataGrid Facility (NDGF) and other national infrastructures to form the Worldwide LHC Computing Grid. • The OSG Virtual Data Toolkit (VDT) provides common grid middleware and support for both OSG and EGEE. Fermilab CD, US CMS and US ATLAS software and computing organizations develop common and reusable software through joint projects facilitated by OSG. from Ruth Pordes - OSG

  34. Scientific Linux at Fermilab Scientific Linux was born to enable HEP computer centers to continue to use open-source Linux distributions with support and security patches. • Scientific Linux (SL) is a joint project between Fermilab, CERN and other contributors which provides an open-source distribution of Linux for the scientific (primarily High-Energy Physics) community • Scientific Linux Fermi (SLF) provides Fermilab-specific customizations • SL is installed at EGEE and OSG sites for LHC computing • SL and SLF are community-supported (primarily via mailing lists) • SLF provides infrastructure for patching, inventory and configuration management • Some applications (primarily Oracle) require commercially-supported Red Hat Linux See: https://www.scientificlinux.org/ P L McBride Computing Strategy

  35. Credits • Many thanks to the many people from CD who contributed to this presentation. • Apologies for all the work I could not show. P L McBride Computing Strategy
