Take-home messages from Lecture 1 • LHC computing has been well sized to handle the production and analysis needs of the LHC (very high data rates and throughputs) • It is based on the hierarchical MONARC model • It has been very successful • WLCG operates smoothly and reliably • Data is transferred reliably and made available to everybody within a very short time • The Higgs boson discovery was announced within a week of the latest data being processed! • The network has worked well and now allows for changes to the computing models
Grid computing enables the rapid delivery of physics results (Ian.Bird@cern.ch, August 2012)
Computing Model Evolution • Evolution of the computing models: from a strict hierarchy to a mesh (figure: hierarchy vs. mesh topologies)
Evolution • During its development, the WLCG production grid has oscillated between structure and flexibility • Driven by the capabilities of the infrastructure and the needs of the experiments (figure milestones: ALICE remote access, PD2P/popularity, CMS full mesh)
Data Management Evolution • Data management in the WLCG has been moving to a less deterministic system as the software has improved • It started with deterministic pre-placement of data on disk storage for all samples (ATLAS) • Then subscriptions driven by physics groups (CMS) • Then dynamic placement of data based on access, so that only samples that were actually going to be looked at were replicated (ATLAS) • Once I/O is optimized and network links improve, data can be sent over the wide area so jobs can run anywhere and access it (ALICE, ATLAS, CMS) • Good for opportunistic resources, load balancing, clouds, or any other case where a sample will be accessed only once (figure axis: from structured to less deterministic)
Scheduling Evolution • The scheduling evolution has similar drivers • We started with a very deterministic system where jobs were sent directly to a specific site • This led to early binding of jobs to resources: requests sat idle in long queues, with no ability to reschedule • All four experiments evolved to use pilot jobs to make better scheduling decisions based on current information • The pilot system is now evolving further to allow submission to additional resources such as clouds • What began as a deterministic system has evolved towards flexibility in scheduling and resources (figure axis: from structured to less deterministic)
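To make the late-binding idea concrete, here is a minimal sketch of the pilot model: the payload job is matched to a resource only once a pilot is already running on it. The class and function names are hypothetical illustrations, not the actual PanDA, DIRAC, AliEn or glideinWMS interfaces.

```python
# Minimal sketch of the late-binding pilot model (illustrative only; the
# names here are hypothetical, not the real experiment frameworks' APIs).
from collections import deque

class TaskQueue:
    """Central queue: payload jobs wait here instead of in a site batch queue."""
    def __init__(self, tasks):
        self.tasks = deque(tasks)

    def fetch_matching_task(self, site, cores):
        # Match work to a resource only when a pilot actually asks for it.
        for _ in range(len(self.tasks)):
            task = self.tasks.popleft()
            if task["cores"] <= cores:
                return task
            self.tasks.append(task)   # cannot run here, leave it for another pilot
        return None

def pilot(site, cores, queue):
    """A pilot starts on a worker node, then pulls payloads until none match."""
    while (task := queue.fetch_matching_task(site, cores)) is not None:
        print(f"{site}: running {task['name']} on {task['cores']} cores")

queue = TaskQueue([{"name": "reco_001", "cores": 8}, {"name": "ana_042", "cores": 1}])
pilot("CERN-T0", cores=8, queue=queue)
```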
Data Access Frequency • More dynamic data placement is needed • Fewer restrictions on where the data comes from • But data is still pushed to sites (figure: ATLAS data flow between Tier-1 and Tier-2 sites)
Popularity • Services like the Data Popularity Service track all file accesses and can show which data is accessed and for how long • Over a year, popular data stays popular for reasonably long periods of time (figure: CMS Data Popularity Service)
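As an illustration of what such a service does, the sketch below ranks datasets by recent accesses and distinct users from a toy access log. The log format and the 90-day window are assumptions made for the example, not the real schema of the CMS Data Popularity Service.

```python
# Rough sketch of how a popularity service can rank datasets from access logs.
from collections import Counter
from datetime import date, timedelta

access_log = [
    # (dataset, user, access_date) -- hypothetical records
    ("/Higgs/Run2012A/AOD",   "alice", date(2012, 8, 1)),
    ("/Higgs/Run2012A/AOD",   "bob",   date(2012, 8, 3)),
    ("/MinBias/Run2011B/AOD", "carol", date(2012, 5, 2)),
]

window_start = date(2012, 8, 20) - timedelta(days=90)
recent = [(ds, user) for ds, user, day in access_log if day >= window_start]

accesses = Counter(ds for ds, _ in recent)        # how often each dataset is read
users = Counter(ds for ds, _ in set(recent))      # by how many distinct users

for ds in accesses:
    print(ds, accesses[ds], "accesses by", users[ds], "users")
```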
Dynamic Data Placement • ATLAS uses the central queue and popularity information to understand how heavily a dataset is used • Additional copies of the data are made • Jobs are re-brokered to use them • Unused copies are cleaned up (figure: requests flowing through PanDA to Tier-1 and Tier-2 sites)
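A rough sketch of this kind of decision logic is shown below: replicate datasets that are popular and have jobs queued behind them, and clean replicas that are no longer accessed. The thresholds and field names are illustrative assumptions, not the actual ATLAS PD2P algorithm.

```python
# Sketch of a dynamic-placement decision: add replicas for hot, oversubscribed
# datasets; mark unused secondary copies for cleanup. All numbers are made up.

def plan_replication(datasets, max_new_replicas=2):
    """datasets: dicts with current replica count, queued jobs and recent accesses."""
    actions = []
    for ds in datasets:
        if ds["queued_jobs"] > 100 and ds["recent_accesses"] > 10:
            # Popular and oversubscribed: add replicas so jobs can be
            # re-brokered to less loaded sites.
            actions.append(("replicate", ds["name"],
                            min(max_new_replicas, ds["queued_jobs"] // 100)))
        elif ds["recent_accesses"] == 0 and ds["replicas"] > 1:
            # Unused secondary copies are candidates for cleanup.
            actions.append(("clean", ds["name"], ds["replicas"] - 1))
    return actions

plan = plan_replication([
    {"name": "data12_8TeV.Egamma.AOD", "replicas": 1, "queued_jobs": 450, "recent_accesses": 37},
    {"name": "mc11_7TeV.minbias.AOD",  "replicas": 3, "queued_jobs": 0,   "recent_accesses": 0},
])
print(plan)
```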
Wide Area Access • With optimized I/O, other methods of managing the data and the storage become available • Data can be sent directly to applications over the WAN • This allows users to open any file regardless of their location or the file's source • Sites deploy at least one xrootd server that acts as a proxy/door
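For example, from PyROOT a job can open a file through an xrootd redirector without knowing which site holds a replica, since ROOT's TFile.Open understands root:// URLs. The redirector hostname, file path and tree name below are hypothetical.

```python
# Sketch of reading a file over the WAN through an xrootd federation from PyROOT.
# The redirector and path are placeholders; a real job would use the
# experiment's federation endpoint.
import ROOT

# TFile.Open handles root:// URLs, so the application does not need to know
# which site actually holds a replica -- the redirector locates one.
f = ROOT.TFile.Open("root://redirector.example.org//store/data/Run2012A/file.root")
if f and not f.IsZombie():
    tree = f.Get("Events")          # assumes the file contains a TTree "Events"
    print("entries:", tree.GetEntries())
    f.Close()
```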
Transparent Access to Data • Once we have a combination of dynamic placement, wide area access to data, and reasonable networking, facilities can be treated as parts of one coherent system • This also opens the door to new kinds of resources (opportunistic resources, commercial clouds, data centres, ...)
Example: Expanding the CERN Tier-0 • CERN is deploying a remote computing facility in Budapest • 200 Gb/s of networking between the centres, at a 35 ms ping time • As experiments, we cannot really tell where the resources are installed (figure: CERN and Budapest connected by two 100 Gb/s links)
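A quick back-of-the-envelope check shows why 35 ms of latency is workable: the bandwidth-delay product gives the amount of data in flight per link, which in turn sets the TCP window size (or number of parallel streams) needed to keep it full. The numbers below simply restate the link parameters quoted above.

```python
# Bandwidth-delay product for one CERN-Wigner link.
link_gbps = 100          # one of the two 100 Gb/s links
rtt_s = 0.035            # ~35 ms round-trip (ping) time

bdp_bits = link_gbps * 1e9 * rtt_s
print(f"in-flight data per link: {bdp_bits / 8 / 1e6:.0f} MB")   # ~440 MB
```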
Tier-0: Wigner Data Centre, Budapest • New facility due to be ready at the end of 2012 • 1100 m² (725 m²) in an existing building, but new infrastructure • 2 independent HV lines • Full UPS and diesel coverage for all IT load (and cooling) • Maximum 2.7 MW
Networks • These 100 Gb/s links are the first in production for WLCG • Other sites will soon follow • We have reduced the differences in site functionality • And then reduced even the perception that two sites are separate • We can begin to think of the facility as one big centre rather than a cluster of centres • This concept can be extended to many facilities
Changing the Services • The WLCG service architecture has been reasonably stable for over a decade • This is beginning to change with new middleware for resource provisioning • A variety of sites are opening their resources to cloud-style provisioning • From a site perspective this is often chosen for cluster-management and flexibility reasons • Everything is virtualized and the services are put on top
Clouds vs Grids • Grids primarily offer standard services with agreed protocols • Designed to be generic, but each executes a particular task • Clouds offer the ability to build custom services and functions • More flexible, but also more work for the users
Trying this out • CMS and ATLAS are trying to provision resources this way with the High Level Trigger farms • OpenStack is interfaced to the pilot systems • In CMS we reached 6000 running cores, and the facility looks like just another destination, even though no grid CE exists • It will be used for large-scale production running in a few weeks • Several sites have already requested similar connections to local resources
WLCG will remain a Grid • We have a grid because we need to collaborate and share resources • Thus we will always have a "grid" • Our network of trust is of enormous value for us and for (e-)science in general • We also need distributed data management that supports very high data rates and throughputs • We will continue to work on these tools • We are now working on how to integrate cloud infrastructures into WLCG
Need for Common Solutions • Computing infrastructure is a necessary piece of the core mission of the HEP experiments • But the available development effort is steadily decreasing • Common solutions try to take advantage of the similarities in the experiments' activities • They optimize development effort and offer lower long-term maintenance and support costs • Helped by the willingness of the experiments to work together • Successful examples in distributed data management, data analysis and monitoring (HammerCloud, Dashboards, Data Popularity, the Common Analysis Framework, ...) • Taking advantage of Long Shutdown 1 (LS1)
Evolution of Capacity: CERN & WLCG • Modest growth until 2014 • Anticipate x2 in 2015 • Anticipate x5 after 2018 (figure annotations: what we thought was needed at LHC start vs. what we actually used at LHC start!)
CMS Resource Utilization • Resource Utilization was highest in 2012 for both Tier-1 and Tier-2 sites
CMS Resource Utilization • Growth curves for resources
Conclusions • First years of LHC data: WLCG has helped deliver physics rapidly • Data available everywhere within 48 h • This is just the start of decades of exploration of new physics • Sustainable solutions are needed! • Entering a phase of consolidation and, at the same time, evolution • LS1 is an opportunity for disruptive changes and scale testing of new technologies • Wide area access, dynamic data placement, new analysis tools, clouds • The challenges for computing (scale and complexity) will continue to increase
Evolving the Infrastructure • In the new resource provisioning model the pilot infrastructure communicates with the resource provisioning tools directly • Requesting groups of machines for periods of time (figure: resource requests to a cloud interface provisioning VMs with pilots, alongside the traditional CE and batch queue provisioning worker nodes with pilots)
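A simplified sketch of that provisioning loop is given below: the pilot factory compares the queued payload with the machines it already holds and asks a cloud API for whole VMs for a fixed lease, instead of submitting pilots through a CE and batch queue. The FakeCloud class, its methods, the image name and the lease length are hypothetical stand-ins for an OpenStack/EC2-style interface, not a real client.

```python
# Sketch of pilot-driven cloud provisioning: request or release whole machines
# based on the amount of queued work. All names and numbers are illustrative.

def provision(cloud, queued_payloads, running_vms,
              payloads_per_vm=8, lease_hours=24):
    wanted = max(0, -(-queued_payloads // payloads_per_vm))  # ceiling division
    if wanted > running_vms:
        # Boot VMs from an image whose contextualisation starts a pilot.
        cloud.start_instances(image="pilot-vm", count=wanted - running_vms,
                              lease=lease_hours * 3600)
    elif wanted < running_vms:
        # Return idle machines so the lease is not wasted.
        cloud.stop_idle_instances(count=running_vms - wanted)

class FakeCloud:
    def start_instances(self, image, count, lease):
        print(f"requesting {count} x {image} for {lease}s")
    def stop_idle_instances(self, count):
        print(f"releasing {count} idle VMs")

provision(FakeCloud(), queued_payloads=100, running_vms=4)
```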