openlab BoS meeting, CERN, 4th May 2011. WLCG after 1 year with data: Prospects for the future. Ian Bird, WLCG Project Leader. Overview: Quick review of WLCG. Summary of 1st year with data. Achievements, successes, lessons. Outlook for the next 3 years. What are our challenges?
openlab BoS meeting, CERN, 4th May 2011
WLCG after 1 year with data: Prospects for the future
Ian Bird, WLCG Project Leader
Overview
• Quick review of WLCG
• Summary of 1st year with data
• Achievements, successes, lessons
• Outlook for the next 3 years
• What are our challenges?
Ian.Bird@cern.ch
The LHC Computing Challenge
• Signal/Noise: 10^-13 (10^-9 offline)
• Data volume
  • High rate × large number of channels × 4 experiments
  • 15 petabytes of new data each year
• Compute power
  • Event complexity × number of events × thousands of users
  • 200k of (today's) fastest CPUs (>250k cores today)
  • 45 PB of disk storage (100 PB disk today!)
• Worldwide analysis & funding
  • Computing funding locally in major regions & countries
  • Efficient analysis everywhere
  • GRID technology
Ian Bird, CERN
WLCG – what and why?
• A distributed computing infrastructure to provide the production and analysis environments for the LHC experiments
• Managed and operated by a worldwide collaboration between the experiments and the participating computer centres
• The resources are distributed, for funding and sociological reasons
• Our task was to make use of the resources available to us, no matter where they are located
• Tier-0 (CERN):
  • Data recording
  • Initial data reconstruction
  • Data distribution
• Tier-1 (11 centres):
  • Permanent storage
  • Re-processing
  • Analysis
• Tier-2 (~130 centres):
  • Simulation
  • End-user analysis
Worldwide resources
• Today >140 sites
• >250k CPU cores
• >100 PB disk
WLCG Collaboration Status
• Tier 0; 11 Tier 1s; 68 Tier 2 federations
• Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep., Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA.
1st year of LHC data
[Figure: data written to tape (GB/day)]
• Stored ~15 PB in 2010
• Writing up to 70 TB/day to tape (~70 tapes per day)
• >5 GB/s to tape during heavy-ion (HI) running
• ~2 PB/month to tape during proton-proton (pp) running
• ~4 PB to tape in HI running
[Figure: disk servers (GB/s)]
• Tier 0 storage:
  • Accepts data at an average of 2.6 GB/s; peaks >11 GB/s
  • Serves data at an average of 7 GB/s; peaks >25 GB/s
• CERN Tier 0 moves >1 PB of data per day
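The rates on this slide are quoted in a mix of per-second, per-day and per-month units. The sketch below is a rough cross-check that converts them to a common basis. It assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and a 30-day month, neither of which is stated in the slides, and the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def gb_per_s_to_tb_per_day(rate_gb_s: float) -> float:
    """Daily volume in TB for a sustained rate in GB/s (decimal units assumed)."""
    return rate_gb_s * SECONDS_PER_DAY / 1000.0

def tb_per_day_to_gb_per_s(volume_tb: float) -> float:
    """Average rate in GB/s needed to move volume_tb in one day."""
    return volume_tb * 1000.0 / SECONDS_PER_DAY

# Figures quoted on the slide
accept_tb_day = gb_per_s_to_tb_per_day(2.6)   # Tier 0 average ingest
serve_tb_day = gb_per_s_to_tb_per_day(7.0)    # Tier 0 average serving
tape_gb_s = tb_per_day_to_gb_per_s(70.0)      # peak daily tape volume as a rate
pp_tape_tb_day = 2.0 * 1000.0 / 30            # 2 PB/month, 30-day month assumed

print(f"accept: {accept_tb_day:.0f} TB/day, serve: {serve_tb_day:.0f} TB/day")
print(f"70 TB/day to tape is about {tape_gb_s:.2f} GB/s on average")
print(f"2 PB/month is about {pp_tape_tb_day:.0f} TB/day")
```

Under these assumptions the average accept and serve rates together come to roughly 0.8 PB/day, the same order of magnitude as the ">1 PB per day" quoted for Tier 0, and 2 PB/month works out close to the 70 TB/day tape figure.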
Grid Usage
• Use remains consistently high: >1 M jobs/day; ~150k CPU
[Figure: CPU used at Tier 1s + Tier 2s (HS06.hrs/month), last 12 months, with 100k CPU-days/day marked; inset shows build-up over previous years]
• At the end of 2010 we saw all Tier 1 and Tier 2 job slots being filled
• CPU usage is now well over double that of mid-2010
• As well as LHC data, large simulation productions are always ongoing
• Large numbers of analysis users:
  • ATLAS, CMS ~800
  • LHCb, ALICE ~250
• In 2010 WLCG delivered ~80-100 CPU-millennia!
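The "CPU-millennia" figure follows from the daily usage level by simple arithmetic, sketched below. The 100k CPU-days/day value is the level marked on the slide's plot; the 80k value is an assumption chosen here only to bracket the lower end of the quoted range.

```python
DAYS_PER_YEAR = 365

def cpu_millennia(cpu_days_per_day: float, days: int = DAYS_PER_YEAR) -> float:
    """CPU-millennia delivered by a sustained load held for `days` days."""
    total_cpu_days = cpu_days_per_day * days
    cpu_years = total_cpu_days / DAYS_PER_YEAR
    return cpu_years / 1000.0

# 100k CPU-days/day sustained for a full year
print(cpu_millennia(100_000))  # 100.0
# Hypothetical lower average of 80k CPU-days/day
print(cpu_millennia(80_000))   # 80.0
```

In other words, a load of N thousand CPU-days per day held for a year delivers N CPU-millennia, which is how a usage level approaching 100k CPU-days/day maps onto the 80-100 range.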
CPU – around the Tiers
• The grid really works
• All sites, large and small, can contribute
  • And their contributions are needed!
• Significant use of Tier 2s for analysis
• Tier 0 usage peaks when the LHC is running; the average is much less
• Jan 2011 was the highest use month ever … so far
Data transfers
[Figure: data transfers over time; plot annotations: LHC running, April – Sept 2010; re-processing 2010 data; CMS HI data zero suppression & FNAL; ALICE HI data, Tier 1s; 2011 data, Tier 1s; & the academic/research networks for Tier 1/2!]
Successes:
• We have a working grid infrastructure
• Experiments have truly distributed models
• Has enabled physics output in a very short time
• Network traffic is close to that planned, and the network is extremely reliable
• Significant numbers of people doing analysis (at Tier 2s)
• Today resources are plentiful, and no contention seen ... yet
• Support levels manageable ... just
2011+2012 running
• LHC schedule now has continuous running in 2011 + 2012, with expected high integrated luminosity (== lots of interesting data)
• Impacts:
  • Resources: funding agencies asked to fund more resources in 2012 (had previously expected an “off” year)
  • Push back upgrades, or upgrade during running
    • Oracle 11g, network switches, online clusters, OS versions, etc.
    • Mostly an issue for accelerator or experiment control-related systems; for WLCG there is NO downtime, ever.
• … and
  • The number of events per collision is much higher than anticipated for now
  • Larger event sizes (hence more data volume), more processing time
Evolution of requirements
Some areas where openlab partners have contributed to this success … (in no particular order)
Databases
• Databases everywhere (LHC, experiments, offline, remote)
• Large scale deployment and distributed databases: e.g. Streams for data replication
CPU & performance
• CPU/machines: evaluation of new generations
• Performance optimisation: how to use many-core machines
Monitoring
• New ways to view monitoring data
• Gridmaps now appear everywhere
• This was a good example of tapping into expertise and experience within the company
Networking
• Technology evaluations (e.g. 10 Gb)
• Campus networking and security: essential for physics analysis at CERN
… and some challenges for the future …
Challenges:
• Resource efficiency
  • Behaviour with resource contention
  • Efficient use: experiments struggle to live within resource expectations; physics is potentially limited by resources now!
• Changing models, to use what we have more effectively
  • Evolving data management
  • Evolving network model
  • Integrating other federated identity management schemes
• Sustainability
  • Grid middleware: has it a future?
  • Sustainability of operations
  • Is (commodity) hardware reliable enough?
• Changing technology
  • Using “clouds”
  • Other things: NoSQL, etc.
  • Move away from “special” solutions
Grids → clouds??
• We have a grid because:
  • We need to collaborate and share resources
  • Thus we will always have a “grid”
  • Our network of trust is of enormous value for us and for (e-)science in general
• We also need distributed data management
  • That supports very high data rates and throughputs
  • We will continually work on these tools
• But the rest can be more mainstream (open source, commercial, …)
  • We use message brokers more and more for inter-process communication
  • Virtualisation of our grid sites is happening
    • Many drivers: power, dependencies, provisioning, …
  • Remote job submission … could be cloud-like
  • Interest in making use of commercial cloud resources, especially for peak demand
• We should invest effort only where we need to
Virtualisation and clouds
• Is clearly of great interest
• CERN has several threads:
  • Service consolidation of “VO managed services”
  • “Kiosk”: request a VM via a web interface
  • Batch service:
    • Tested Platform ISF and OpenNebula
    • Did very large scaling tests
• Very interested in OpenStack
  • Both for cluster management and storage system
  • Potentially a large community behind it
  • Could be leading towards (de-facto) standards for clouds
• Questions:
  • Is S3 a possible alternative as a storage interface?
  • Can we virtualise (most of) our computing infrastructure?
    • Have far fewer types of hardware purchase?
    • Remove the distinction between CPU and disk servers?
  • Do we still need a traditional batch scheduler?
  • How easy is it to burst out to commercial clouds?
  • How feasible is it to use cloud interfaces for distributed job management between grid (cloud) sites?
  • How much grid middleware can we make obsolete?
Resource efficiency
• Resource contention (see also sustainable ops)
  • Need better “monitoring”; we have lots of information, but:
    • Really need the ability to mine and analyse monitoring data, within and across services: trends, correlations
    • Need warnings of problems before they happen
    • Can this lead to automated actions/reactions/recovery?
• Efficiency of use
  • Many-core CPU & other architectures
  • CPU efficiency: jobs wait for data? How important is it? (CPU is cheap…)
  • Does a virtualised infrastructure help?
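One simple form of "warnings of problems before they happen" is to fit a trend to a monitored quantity and estimate when it will cross a limit. The sketch below uses an ordinary least-squares line in plain Python. The metric (a disk pool fill fraction), the sample values and the 90% threshold are all hypothetical, and a real monitoring system would need something more robust than a straight-line fit.

```python
from typing import Optional, Sequence, Tuple

def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hours_until_threshold(hours: Sequence[float], values: Sequence[float],
                          threshold: float) -> Optional[float]:
    """Hours after the last sample at which the fitted trend reaches threshold.

    Returns None when the trend is flat or falling, and 0.0 when the fitted
    line has already crossed the threshold.
    """
    slope, intercept = linear_fit(hours, values)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return max(crossing_time - hours[-1], 0.0)

# Hypothetical hourly samples of a disk pool's fill fraction
sample_hours = [0, 1, 2, 3, 4]
fill_fraction = [0.70, 0.72, 0.74, 0.76, 0.78]
eta = hours_until_threshold(sample_hours, fill_fraction, threshold=0.90)
if eta is not None:
    print(f"warning: pool projected to reach 90% in about {eta:.1f} hours")
```

With these made-up samples the pool fills by 2 percentage points per hour, so the projected crossing is about six hours after the last sample. An estimate like this could feed the kind of automated reaction the slide asks about.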
Computing model evolution
• Recognise the network as a resource
• Data on-demand will augment data pre-placement
• Storage systems will become more dynamic caches
• Allow remote data access
  • Fetch files when needed
  • I/O over WAN
• Network usage will (eventually) increase & be more dynamic (less predictable)
Evolution of data management
• A consequence of the computing model evolution
• Data caching rather than organised data placement
• Distinguish between data archives and data caches
  • Only allow organised access to archives
  • Simplifies interfaces: no need for full SRM
  • Potential to replace archives with commercial back-up solutions (that scale sufficiently!)
• Tools to support:
  • Remote data access (all aspects)
  • Reliable transfer (we have this, but it clearly needs reworking)
  • Cache management
  • Low latency, high throughput file access (for reading)
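The shift from pre-placement to caching can be illustrated with a small fetch-on-demand cache that evicts the least recently used file when space runs out. This is only a sketch of the general idea and not a description of any WLCG storage system: the `FileCache` class, its byte-count capacity and the `fake_remote_fetch` stand-in for remote access are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, List

class FileCache:
    """Fetch-on-demand file cache with least-recently-used eviction."""

    def __init__(self, capacity_bytes: int, fetch: Callable[[str], bytes]):
        self._capacity = capacity_bytes
        self._fetch = fetch          # called on a miss, e.g. a remote read
        self._files: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0
        self.hits = 0
        self.misses = 0

    def read(self, name: str) -> bytes:
        if name in self._files:
            self._files.move_to_end(name)   # mark as most recently used
            self.hits += 1
            return self._files[name]
        self.misses += 1
        data = self._fetch(name)            # fetch only when needed
        self._store(name, data)
        return data

    def _store(self, name: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return                          # too large to cache; pass through
        while self._used + len(data) > self._capacity:
            _, evicted = self._files.popitem(last=False)  # drop oldest entry
            self._used -= len(evicted)
        self._files[name] = data
        self._used += len(data)

    def cached_names(self) -> List[str]:
        """Names currently cached, least recently used first."""
        return list(self._files)

def fake_remote_fetch(name: str) -> bytes:
    """Stand-in for a remote read; returns dummy content."""
    return name.encode() * 4

cache = FileCache(capacity_bytes=20, fetch=fake_remote_fetch)
cache.read("a")     # miss: 4 bytes cached
cache.read("bb")    # miss: 8 bytes cached
cache.read("a")     # hit: "a" becomes most recently used
cache.read("ccc")   # miss: 12 bytes, so "bb" is evicted to make room
print(cache.cached_names(), cache.hits, cache.misses)
```

In this picture the archive is whatever sits behind the fetch function, and only the cache layer needs low-latency, high-throughput read access.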
Network evolution
Evolution of computing models also requires evolution of network infrastructure
• Open exchange points built in carrier-neutral facilities: any connector can connect with their own fiber or using circuits provided by any telecom provider
  • Enables T2s and T3s to obtain their data from any T1 or T2
• Use of LHCONE will relieve the load on the general R&E IP infrastructure
• LHCONE provides connectivity directly to T1s, T2s, and T3s, and to various aggregation networks, such as the European NRENs, GÉANT, etc.
Sustainability: Service incidents (outage/degradation)
• Service incidents, last 2 quarters: any service degradation generates a Service Incident Report (SIR == post-mortem)
• This illustrates quite strongly that the majority (>~75%) of the problems experienced are not related to the distributed nature of the WLCG (or to grid middleware) at all
• How can we make the effect of outages less intrusive?
• Can we automate recovery (or management)?
• Does the user community have reasonable expectations? (no…)
• Not unique to WLCG!
Everyone has service failures…
• Forgot the part about keeping their customers informed … Where are the SIRs???
• Some inform their customers … and some don’t!
• Failures can and do happen, but these incidents raise many questions for cloud services:
  • How safe is my data? … where is it?
  • Privacy? Who checks?
  • Dependencies?
Summary and conclusions
• WLCG has been a great success and has been key in the rapid delivery of physics from the LHC
• The challenge now is to be more effective and efficient: computing should limit physics as little as possible