Grid Computing in the CMS Experiment at the LHC José M. Hernández, CIEMAT Jornada de usuarios de Infraestructuras Grid, 19-20 January 2012, CIEMAT, Madrid
The CMS Experiment at the LHC • The Large Hadron Collider: p-p collisions, 7 TeV, 40 MHz • The Compact Muon Solenoid: precision measurements, search for new phenomena
LHC: a challenge for computing • The Large Hadron Collider at CERN is the largest scientific instrument on the planet • Unprecedented data handling scale • 40 MHz event rate (~1 GHz collision rate) → ~100 TB/s → online filtering to ~300 Hz (~300 MB/s) → ~3 PB/year (10^7 s of data taking/year; see the back-of-the-envelope check below) • Need large computing power to process the data • Complex events • Many interesting signals at rates << Hz • Thousands of scientists around the world access and analyze the data • Need a computing infrastructure able to store, move around the globe, process, simulate and analyze data at the Petabyte scale [O(10) PB/year]
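The quoted numbers are mutually consistent; a minimal sketch that reproduces them, assuming a raw event size of ~1 MB (implied by 300 Hz → ~300 MB/s):

```python
# Back-of-the-envelope check of the data rates quoted above.
# The ~1 MB raw event size is an assumption implied by 300 Hz -> ~300 MB/s.

TRIGGER_RATE_HZ = 300        # events/s after online filtering
EVENT_SIZE_MB = 1.0          # assumed raw event size
LIVE_SECONDS_PER_YEAR = 1e7  # ~10^7 s of data taking per year

throughput_mb_s = TRIGGER_RATE_HZ * EVENT_SIZE_MB
yearly_pb = throughput_mb_s * LIVE_SECONDS_PER_YEAR / 1e9  # 1 PB = 10^9 MB

print(f"throughput: {throughput_mb_s:.0f} MB/s")  # ~300 MB/s
print(f"per year:   {yearly_pb:.1f} PB")          # ~3 PB
```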
The LHC Computing Grid • The LHC Computing Grid provides the distributed computing infrastructure • Computing resources (CPU, storage, networking) • Computing services (data and job management, monitoring, etc.) • Integrated to provide a single LHC computing service • Using Grid technologies • Transparent and reliable access to heterogeneous, geographically distributed computing resources over the Internet • High-capacity wide area networking • LCG: 300+ centers, 50+ countries, ~100k CPUs, ~100 PB disk/tape, 10k users
The CMS Computing Model • Distributed computing model for data storage, processing and analysis • Grid technologies (Worldwide LHC Computing Grid, WLCG) • Tiered architecture of computing resources • ~20 Petabytes of data (real and simulated) every year • About 200k jobs (data processing, simulation production and analysis) per day
WLCG network infrastructure • T0-T1 and T1-T1 interconnected via LHCOPN (10 Gbps links) • T1-T2 and T2-T2 using general-purpose research networks • Dedicated network infrastructure (LHCONE) being deployed
Grid services in WLCG • Middleware providers: gLite/EMI, OSG, ARC • Global services: data transfer and job management, authentication/authorization, information system • Compute elements (gateway, local batch system, worker nodes) and storage elements (gridftp servers, disk servers, mass storage system) at the sites • Experiment-specific services (a sketch of how a job reaches a site through these services follows below)
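For illustration, a minimal sketch of direct job submission through the gLite WMS: a JDL file describing the executable and its sandboxes, submitted from a user interface with the standard CLI. The wrapper script, tarball and requirement shown are hypothetical, and a valid Grid proxy is assumed.

```python
# Minimal sketch of direct gLite job submission (assumptions: the
# glite-wms-job-* CLI is available on the UI and a valid proxy exists;
# file names and the requirement value are hypothetical).
import subprocess, textwrap

jdl = textwrap.dedent("""\
    Executable    = "job_wrapper.sh";
    StdOutput     = "job.out";
    StdError      = "job.err";
    InputSandbox  = {"job_wrapper.sh", "user_code.tar.gz"};
    OutputSandbox = {"job.out", "job.err"};
    Requirements  = other.GlueCEPolicyMaxWallClockTime > 720;
""")

with open("job.jdl", "w") as f:
    f.write(jdl)

# Submit through the gLite WMS; the returned job identifier is used
# later with glite-wms-job-status / glite-wms-job-output.
subprocess.run(["glite-wms-job-submit", "-a", "job.jdl"], check=True)
```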
CMS Data and Workload Management • Experiment-specific DMWM services built on top of basic Grid services • Pilot-based WMS • Data bookkeeping, location and transfer systems • Data is pre-located; jobs go to the data (see the matching sketch below) • Experiment software pre-installed at sites [Diagram: operators drive the Production System (WMAgent) and users the Analysis System (CRAB); both submit via the pilot-based and gLite WMS to site CEs and local batch systems, while the Data Bookkeeping & location system (DBS), the Data Transfer System (PhEDEx) and the File Transfer System manage data on site SEs and mass storage]
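The "jobs go to data" rule combined with a pilot-based WMS amounts to late binding: a pilot starting on a worker node reports its site, and the central queue hands it a job whose input data is already located there. A minimal sketch of that matching step, with invented job and site names:

```python
# Minimal sketch of late-binding pilot matching (all names invented):
# jobs are queued centrally together with the sites hosting their input
# data, and a pilot is handed a job whose data is local to its site.
from dataclasses import dataclass, field

@dataclass
class Job:
    job_id: str
    dataset: str
    sites_with_data: set = field(default_factory=set)

QUEUE = [
    Job("prod-001", "/Example/DatasetA/RECO", {"T1_ES_PIC", "T2_ES_CIEMAT"}),
    Job("ana-042",  "/Example/DatasetB/AOD",  {"T2_ES_IFCA"}),
]

def match_job(pilot_site):
    """Return the first queued job whose input data is at the pilot's site."""
    for job in QUEUE:
        if pilot_site in job.sites_with_data:
            QUEUE.remove(job)
            return job
    return None  # nothing local: the pilot idles or exits

print(match_job("T2_ES_CIEMAT"))  # -> Job(job_id='prod-001', ...)
```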
CMS Grid Operations - Jobs • Large scale data processing & analysis • ~50k slots in use, 300k jobs/day • Plots correspond to Aug 2011 – Jan 2012
Spanish contribution to CMS Computing Resources • Spain contributes ~5% of the CMS computing resources • PIC Tier-1 • ~1/2 of an average Tier-1 • 3000 cores, 4 PB disk, 6 PB tape • IFCA Tier-2 • ~2/3 of an average Tier-2 (~3% of T2 resources) • 1000 cores, 600 TB disk • CIEMAT Tier-2 • ~2/3 of an average Tier-2 (~3% of T2 resources) • 1000 cores, 600 TB disk
Contribution from Spanish sites • CPU delivered Feb 2011 – Jan 2012: ~5% of total CPU delivered for CMS
CMS Grid Operations - Data • Large scale data replication • 1-2 GB/s throughput CMS-wide • ~1 PB/week data transfers (consistent with the sustained rate; see the check below) • Full mesh among 50+ sites: T0→T1, T1↔T1, T1→T2, T2↔T2 [Plots: production and debug transfers, each sustaining ~1 GB/s]
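A quick consistency check of the two figures, 1-2 GB/s sustained against ~1 PB moved per week:

```python
# Sustained CMS-wide transfer rate vs. weekly transferred volume.
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800 s

for rate_gb_s in (1.0, 2.0):
    weekly_pb = rate_gb_s * SECONDS_PER_WEEK / 1e6  # 1 PB = 10^6 GB
    print(f"{rate_gb_s:.0f} GB/s -> {weekly_pb:.1f} PB/week")
# 1 GB/s -> 0.6 PB/week, 2 GB/s -> 1.2 PB/week, i.e. ~1 PB/week
```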
Site monitoring/readiness
Lessons learnt • Porting the production and analysis applications to the Grid was easy • Package the job wrapper and user libraries into the input sandbox • Experiment software pre-installed at the sites • Job wrapper sets up the environment, runs the job and stages out the output (a sketch follows below) • When running at large scale in WLCG, additional services are needed • Job and data management services on top of Grid services • Data bookkeeping and location • Monitoring
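The job wrapper is conceptually simple; a minimal sketch, in which the payload script, the software area path and the storage endpoint are hypothetical:

```python
# Minimal sketch of a job wrapper (names are hypothetical): set up the
# pre-installed experiment environment, run the payload, then stage the
# output out to the site storage element.
import os, subprocess

env = dict(os.environ)
# Conventional variable pointing at the software area pre-installed at the site
env.setdefault("VO_CMS_SW_DIR", "/opt/cms")

# Run the payload shipped in the input sandbox
subprocess.run(["bash", "run_payload.sh"], check=True, env=env)

# Stage out the produced file to the local SE (hypothetical SRM endpoint)
subprocess.run(["lcg-cp",
                "file://" + os.path.abspath("output.root"),
                "srm://srm.example.org/cms/store/user/output.root"],
               check=True)
```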
Lessons learnt • Monitoring is essential • Multi-layer complex system (experiment, Grid and site layers) • Monitor workflows, services and sites • Experiment services should be robust • Deal with the (inherent) unreliability of the Grid • Be prepared for retries and cool-offs (see the sketch below) • Pilot-based WMS • gLite BDII and WMS were not reliable enough • Pilots bring smaller overhead, verify the node environment, enforce global priorities, etc. • Isolate users from the Grid; Grid operations team • Lots of manpower needed to operate the system • Central operations team (~20 FTE) • Contacts at sites (50+)
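The retry-with-cool-off behaviour mentioned above is a generic pattern; a minimal sketch with illustrative parameters:

```python
# Minimal sketch of retry-with-cool-off for absorbing transient Grid
# failures (attempt count and cool-off time are illustrative).
import time

def with_retries(action, max_attempts=3, cooloff_seconds=60):
    """Run `action`, retrying after a cool-off on transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except RuntimeError as err:  # stand-in for a transient Grid error
            if attempt == max_attempts:
                raise
            print(f"attempt {attempt} failed ({err}); cooling off...")
            time.sleep(cooloff_seconds * attempt)  # back off progressively
```

The same pattern applies to submissions, transfers and stage-outs alike, which is what keeps users isolated from individual Grid failures.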
Future developments • Dynamic data placement/deletion • Most of the pre-located data is rarely accessed • Investigating automatic replication of hot data and deletion of cold data (see the sketch below) • Replicate data when accessed by jobs and cache it locally • Remote data access • Jobs go to free slots and access data remotely • CMS has greatly improved read performance over the WAN • At the moment only used for fail-over and overflow • Service to asynchronously copy user data • Remote stage-out from the WN is a bad idea • Multi-core processing • More efficient use of multi-core nodes, savings in RAM, far fewer jobs to handle
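A minimal sketch of what popularity-driven placement could look like; the thresholds, dataset names and access counts are invented for illustration:

```python
# Minimal sketch of popularity-driven replication/cleanup (thresholds
# and the access-count source are invented for illustration).

ACCESSES_LAST_MONTH = {          # dataset -> number of job accesses
    "/Hot/DatasetA/AOD": 5400,
    "/Cold/DatasetB/RECO": 2,
}
HOT_THRESHOLD, COLD_THRESHOLD = 1000, 10

def placement_actions(accesses):
    """Yield (action, dataset) pairs: replicate hot data, delete cold replicas."""
    for dataset, n in accesses.items():
        if n >= HOT_THRESHOLD:
            yield ("add_replica", dataset)     # spread load over more sites
        elif n <= COLD_THRESHOLD:
            yield ("remove_replica", dataset)  # reclaim disk for new data

for action, ds in placement_actions(ACCESSES_LAST_MONTH):
    print(action, ds)
```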
Future developments • Virtualization of WNs / Cloud computing • Decouple the node OS from the application environment using VMs or chroot • Allow use of opportunistic resources • CernVM-FS (CVMFS) for distributing experiment software
Summary • CMS has been very successful in using the LHC Computing Grid at large scale • A lot of work was needed to make the system efficient, reliable and scalable • Some developments are in the pipeline to make CMS distributed computing more dynamic and transparent