This presentation discusses the challenges and perspectives of computing for the Large Hadron Collider (LHC): the success of Run 1, the computing scale in Run 1, the computing challenges for Run 2, upgrading LHC computing during LS1, the computing strategy for Run 2, and access to new resources for Run 2.
Perspectives on LHC Computing
José M. Hernández (CIEMAT, Madrid)
On behalf of the Spanish LHC Computing community
Jornadas CPAN 2013, Santiago de Compostela
The LHC Computing Challenge
• The Large Hadron Collider (LHC) delivered billions of recorded collisions to the experiments in Run 1 (2010-2012)
  • ~100 PB of data stored on tape at CERN
• The Worldwide LHC Computing Grid (WLCG) provides compute and storage resources for data processing, simulation and analysis
  • ~300k cores, ~200 PB disk, ~200 PB tape
• The computing challenge turned into a great success
  • Unprecedented data volumes analyzed in record time, delivering great scientific results (e.g. the Higgs boson discovery)
Global effort, global success
Computing is part of the global effort
WLCG (initial) computing model
• Distributed computing resources managed using Grid technologies that had to be developed
• Centers interconnected via dedicated and national high-capacity networks
• Centers provide mass storage (disk/tape servers) and CPU resources (x86 CPUs)
• Hierarchical tiered structure (sketched below)
  • Prompt reconstruction and calibration of detector data at the Tier-0 at CERN
  • Data-intensive processing at Tier-1s
  • User analysis and simulation production at Tier-2s (LHCb: only simulation)
  • Data tape archival at the Tier-0 and Tier-1s
  • Data caches at Tier-2s (except LHCb)
All available WLCG resources were used intensively during LHC Run 1
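As a rough illustration of this hierarchical division of labour, the minimal Python sketch below maps workflow types onto the tiers allowed to run them. The tier roles and workflow names follow the bullets above, but the code itself is purely illustrative and not any experiment's actual workload management system.

# Hypothetical sketch of the initial hierarchical workflow-to-tier mapping.
# Tier roles and workflow names illustrate the model described above; they are
# not taken from any experiment's actual workload management system.

TIER_ROLES = {
    "T0": {"prompt_reconstruction", "calibration", "tape_archive"},
    "T1": {"reprocessing", "tape_archive"},          # data-intensive processing
    "T2": {"simulation", "user_analysis"},           # plus data caches
}

def eligible_tiers(workflow: str) -> list[str]:
    """Return the tiers allowed to run a given workflow in the initial model."""
    return [tier for tier, roles in TIER_ROLES.items() if workflow in roles]

if __name__ == "__main__":
    for wf in ("prompt_reconstruction", "reprocessing", "simulation", "user_analysis"):
        print(f"{wf:>22} -> {eligible_tiers(wf)}")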
ATLAS computing scale in LHC Run 1
• 150k slots continuously utilized
• ~1.4M jobs/day completed (back-of-envelope job length estimate below)
• More than 5 GB/s transfer rate worldwide
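A back-of-envelope estimate, using only the numbers quoted above, of the average job length these figures imply:

# Back-of-envelope estimate derived from the numbers quoted above:
# 150k continuously occupied slots completing ~1.4M jobs per day.
slots = 150_000
jobs_per_day = 1.4e6
seconds_per_day = 86_400

slot_seconds = slots * seconds_per_day          # total CPU wall time available per day
avg_job_seconds = slot_seconds / jobs_per_day   # implied mean job length
print(f"average job length ~ {avg_job_seconds / 3600:.1f} hours")  # ~2.6 h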
CMS computing scale in LHC Run 1
• ~100 PB transferred between sites
  • ~2/3 for data analysis at Tier-2s
• Resource usage saturation. In 2012:
  • 70k slots continuously utilized
  • ~500k jobs/day completed
Computing challenges for Run 2
• Computing in LHC Run 1 was very successful, but Run 2, starting in 2015, poses new challenges
• Increased energy and luminosity delivered by the LHC in Run 2
  • More complex events to process
  • Longer event reconstruction time (~2x for CMS)
• Higher output rate to record
  • Maintain similar trigger thresholds and sensitivity to Higgs physics and to potential new physics
  • ATLAS and CMS event rate to storage increases by ~2.5x
• A substantial increase of computing resources is needed, which we probably cannot afford (see the naive scaling estimate below)
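A purely illustrative scaling estimate of what these two factors mean for prompt-reconstruction CPU if nothing else changes; it ignores software speed-ups, pile-up details and any change in live time.

# Purely illustrative scaling estimate for prompt-reconstruction CPU in Run 2,
# assuming the need scales linearly with trigger rate and per-event time.
reco_time_factor = 2.0    # ~2x longer per-event reconstruction (CMS estimate above)
rate_factor = 2.5         # ~2.5x higher event rate to storage
cpu_factor = reco_time_factor * rate_factor
print(f"naive prompt-reconstruction CPU increase: ~{cpu_factor:.0f}x")  # ~5x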
Upgrading LHC Computing in LS1
• The shutdown period is a valuable opportunity to assess
  • Lessons and operational experience from Run 1
  • The computing demands of Run 2
  • The technical and cost evolution of computing
• Undertake intensive planning and development to prepare LHC computing for 2015 and beyond
  • While sustaining steady-state, full-scale operations
  • Under the assumption of constrained funding
• This has been happening internally in the experiments and collaboratively with CERN IT, WLCG and common software and computing projects
• The computing upgrade proceeds in parallel with the accelerator and detector upgrades to push the frontiers of HEP
Computing strategy for Run 2
• Increase resources in WLCG as much as possible
  • Try to conform to the constrained budget situation
• Make more efficient and flexible use of the available resources
• Reduce CPU and storage needs
  • Fewer reprocessing passes, fewer simulated events, more compact data formats, reduced data replication factor
• Intelligent dynamic data placement (sketched below)
  • Automatic replication of hot data and deletion of cold data
• Break down the boundaries between the computing tiers
  • Run reconstruction, simulation and analysis at Tier-1s and Tier-2s indistinctly
  • Tier-1s as an extension of the Tier-0
  • Keep the higher service level and custodial tape storage at Tier-1s
• Centralized production of group analysis datasets
  • Shrink 'chaotic analysis' to only what really is user specific
  • Remove redundancies in processing and storage, reducing operational workloads while improving turnaround for users
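As an illustration of what "intelligent dynamic data placement" can mean in practice, here is a minimal sketch of a popularity-driven policy. The Dataset structure and the thresholds are invented for the example and do not correspond to any experiment's actual data-management service.

# Minimal sketch of a popularity-driven placement policy of the kind described
# above: replicate datasets accessed often, clean up replicas of cold ones.
# Thresholds and the Dataset structure are illustrative only.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    accesses_last_month: int
    replicas: int

HOT_THRESHOLD = 100    # accesses/month above which we add a replica
COLD_THRESHOLD = 1     # accesses/month at or below which extra replicas go
MAX_REPLICAS = 4
MIN_REPLICAS = 1       # keep at least the custodial/primary copy

def placement_decision(ds: Dataset) -> str:
    if ds.accesses_last_month >= HOT_THRESHOLD and ds.replicas < MAX_REPLICAS:
        return "replicate"
    if ds.accesses_last_month <= COLD_THRESHOLD and ds.replicas > MIN_REPLICAS:
        return "delete extra replica"
    return "keep"

for ds in [Dataset("hot_higgs_skim", 450, 2), Dataset("old_2010_raw", 0, 3)]:
    print(ds.name, "->", placement_decision(ds))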
Access to new resources for Run 2
• Access to opportunistic resources
  • HPC clusters, academic or commercial clouds, volunteer computing
  • Significant increase in capacity at low cost (to satisfy capacity peaks)
• Use the HLT farm for offline data processing
  • A significant resource (>10k slots)
  • During extended periods with no data taking, and even during inter-fill periods
• Adopt advanced architectures
  • Processing in Run 1 was done under Enterprise Linux on x86 CPUs
  • Many-core processors, low-power CPUs, GPU environments
  • A challenging heterogeneous environment
  • Parallelization of the processing applications will be key
Computing resources increase
• ~25% yearly growth in the preliminary resource requests for Run 2 (CPU in HS06, storage in PB)
• Benefit from technology evolution to buy more capacity with the same money (see the flat-budget arithmetic below)
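The flat-budget arithmetic behind the ~25% yearly growth figure, as a one-liner: technology evolution of roughly that order lets the same money buy about twice the capacity over a three-year period.

# Illustration of the "flat budget" arithmetic behind ~25% yearly growth.
yearly_growth = 1.25
years = 3
print(f"capacity factor after {years} years: {yearly_growth**years:.2f}x")  # ~1.95x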
Processing evolution
• Transistor count growth is holding up, but clock-speed growth suffered a heat death
• Throughput growth is sustained by replacing ever-faster processors with higher core counts, co-processors and concurrency features
• New environment: high concurrency, modest memory per core, GPUs
• Multi-core now, many-core soon: finer-grained parallelism needed (see the sketch below)
• Many or most of our codes require extensive overhauls
  • Being adapted: Geant4, ROOT, reconstruction code, experiment frameworks
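A minimal sketch of the kind of event-level parallelism this implies, using a simple process pool; the reconstruct() function is a stand-in for real reconstruction code, not an excerpt from any experiment framework.

# Minimal sketch of event-level parallelism: process events concurrently
# instead of relying on ever-faster clocks.
from multiprocessing import Pool

def reconstruct(event_id: int) -> int:
    # Placeholder for CPU-heavy per-event reconstruction.
    return sum(i * i for i in range(10_000)) % (event_id + 1)

if __name__ == "__main__":
    events = range(1_000)
    with Pool(processes=8) as pool:          # e.g. one worker per core
        results = pool.map(reconstruct, events)
    print(f"reconstructed {len(results)} events in parallel")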
Data Management: where is the LHC in Big Data terms?
• Current LHC data set, all data products: ~250 PB, growing by ~15 PB/year of new LHC data
• Comparable to other 2012 "Big Data" volumes such as Facebook uploads (~180 PB/year) or Google search (~100 PB); business emails amount to ~3000 PB/year but are not managed as a coherent data set (comparison chart after Wired Magazine, 4/2013)
• Reputed capacity of the NSA's new Utah data center: 5000 PB (50-100 MW, $2 billion)
• We are big...
Data Management evolution
• Data access model during LHC Run 1
  • Pre-locate and replicate data at sites, send jobs to the data
• We need more efficient distributed data handling, lower disk storage demands and better use of the available CPU resources
• The network has been very reliable and has experienced a large increase in bandwidth
• (Aspire to) send only the data you need, only where you need it (and cache it when it arrives)
  • Towards transparent distributed data access enabled by the network
  • Industry has followed this approach for years with content delivery networks
• There were already successful examples of this approach during Run 1...
Data Management evolution in Run 1
• Scalable access to conditions data
  • Frontier for scalable distributed database access
  • Caching web proxies provide hierarchical, highly scalable cache-based data access (pattern sketched below)
• Experiment software provisioning to the worker nodes
  • CernVM File System (CVMFS)
• Evolve towards a distributed data federation...
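The caching pattern shared by Frontier and CVMFS can be sketched generically as "ask the nearest cache first, fall back outwards on a miss, populate the caches on the way back". The code below illustrates that pattern only; it is not either project's API.

# Generic sketch of a hierarchical caching lookup, in the spirit of the
# Frontier / CVMFS pattern described above (illustration only, not their API).
def lookup(key: str, caches: list[dict], origin: dict) -> str:
    for level, cache in enumerate(caches):
        if key in cache:
            return f"{cache[key]} (hit at cache level {level})"
    value = origin[key]                  # authoritative source (central DB / repo)
    for cache in caches:
        cache[key] = value               # populate caches on the way back
    return f"{value} (fetched from origin)"

worker_node_cache: dict = {}
site_proxy_cache: dict = {}
central_server = {"conditions/run2012A": "alignment-v42"}

print(lookup("conditions/run2012A", [worker_node_cache, site_proxy_cache], central_server))
print(lookup("conditions/run2012A", [worker_node_cache, site_proxy_cache], central_server))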
Data Management evolution
• Distributed data federation
  • A collection of disparate storage resources transparently accessible across a wide area via a common namespace (CMS AAA, ATLAS FAX)
  • Needs efficient remote I/O
  • CMS has invested heavily in I/O optimizations within the application to allow efficient reading of data over the (long-latency) network using the xrootd technology while maintaining a high CPU efficiency
  • Extending the initial use cases: fallback on local access failure (sketched below), overflow from busy sites, interactive access to data, diskless sites
• An interesting approach: the ATLAS event service
  • Ask for exactly what you need and have it delivered by a service that knows how to get it to you efficiently
  • Return the outputs in a roughly steady stream, such that a worker node can be lost with little lost processing
  • Well suited to transient opportunistic resources and volunteer computing, where preemption cannot be avoided
  • Well suited to high-CPU, low-I/O workflows
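The "fallback on local access failure" use case can be sketched as follows; the redirector URL and the open_via_xrootd() helper are hypothetical placeholders, not the actual CMS AAA or ATLAS FAX client code.

# Sketch of the fallback use case: try the local replica first, then read the
# same file through the federation redirector. URL and helper are hypothetical.
import os

FEDERATION_REDIRECTOR = "root://example-redirector.cern.ch/"   # hypothetical

def open_via_xrootd(url: str):
    raise NotImplementedError("placeholder for a real xrootd client open")

def open_input(lfn: str, local_prefix: str = "/storage"):
    local_path = os.path.join(local_prefix, lfn.lstrip("/"))
    if os.path.exists(local_path):
        return open(local_path, "rb")                  # fast local read
    # Local replica missing or unreadable: fall back to a remote read.
    return open_via_xrootd(FEDERATION_REDIRECTOR + lfn)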
From Grid to Clouds
• Turning computing into a utility providing infrastructure as a service
• Clouds evolve, complement and extend the Grid
  • Decrease the heterogeneity seen by the user (hardware virtualization)
  • VMs provide a uniform user interface to resources
  • Integrate diverse resources manageably
  • Isolate software from the physical hardware
  • Dynamic provisioning of resources
  • New resources (commercial and research clouds)
  • Huge community behind Cloud software
• A Grid of clouds is already used by the LHC experiments
  • Several sites provide a Cloud interface
  • ATLAS ran ~450k production jobs on Google resources over a few weeks
  • Tests on Amazon EC2 spot pricing look roughly economically viable (toy bursting illustration below)
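A toy illustration of the cloud-bursting decision behind such tests: burst to spot instances only when the backlog exceeds owned capacity and the spot price stays under a cost ceiling. All numbers are invented for the example; they are not the actual ATLAS or EC2 figures.

# Toy illustration of when bursting to cloud spot instances is attractive.
# All numbers are made up for the example.
def should_burst(pending_jobs: int, owned_free_slots: int,
                 spot_price_per_core_hour: float, max_price: float) -> bool:
    backlog = pending_jobs - owned_free_slots
    return backlog > 0 and spot_price_per_core_hour <= max_price

print(should_burst(pending_jobs=50_000, owned_free_slots=5_000,
                   spot_price_per_core_hour=0.012, max_price=0.02))   # True
print(should_burst(pending_jobs=2_000, owned_free_slots=5_000,
                   spot_price_per_core_hour=0.012, max_price=0.02))   # False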
Conclusions
• LHC computing performed extremely well at all levels in Run 1
  • We know how to deliver, adapting where necessary
  • Excellent networks and flexible, adaptable computing models and software systems paid off in exploiting the resources
• LHC computing faces new challenges for LHC Run 2
  • A large increase of computing resources is required from 2015
  • Live within constrained budgets
  • Use the resources we own as fully and efficiently as possible
  • Support the major development program required
  • Access opportunistic and cloud resources, explore new computer and processing architectures
  • Evolve towards dynamic data access and distributed parallel computing
• The explosive growth in data and (highly granular) processors in the wider world gives us powerful grounds for success in our evolution path
  • Evolve towards a more dynamic, efficient and flexible system