This article discusses the computing challenges and architecture of the Circular Electron-Positron Collider (CEPC), including data volume, distributed computing, network resources, software distribution, workload management, data management, and monitoring.
Computing challenges and architecture Xiaomei Zhang, IHEP April 2019 CEPC workshop, Oxford, UK
Time Line • Higgs factory in 2030, Z factory in ~2037 • Higgs: ~4 years after HL-LHC • Z: ~11 years after HL-LHC
Data Volume • Readout in DAQ from the CDR • Maximum event rate: ~100 kHz at the Z peak • Data rate to trigger: ~2 TB/s • Trigger rate not clear yet • Event size from simulation • Size of signal event: ~500 KB/event for Z, ~1 MB/event for Higgs • Adding the background, this will likely increase to 5~10 MB/event for Z and 10~20 MB/event for Higgs • Estimated data rate output to disk: • Higgs/W factory (8 years) with 10^8 events: 1.5~3 PB/year • Z factory (2 years) with 10^11~10^12 events: 0.5~5 EB/year
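The Z-factory figure can be cross-checked with a quick back-of-the-envelope calculation using the event counts and the ~5 MB/event size quoted above; the short Python sketch below just multiplies them out.

```python
# Back-of-the-envelope check of the Z-factory disk estimate, using the
# 10^11~10^12 event count and the ~5 MB/event size quoted above.
MB = 1e6          # bytes
EB = 1e18         # bytes
SIZE_MB = 5       # Z event size including background, from the slide above

for n_events in (1e11, 1e12):
    total_eb = n_events * SIZE_MB * MB / EB
    print("%.0e events x %d MB/event = %.1f EB" % (n_events, SIZE_MB, total_eb))
# prints 0.5 EB and 5.0 EB, i.e. the quoted 0.5~5 EB range
```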
CEPC computing challenges • Two stages: • Higgs/W factory: 1.5~3 PB/year (starts in >10 years) • Z factory: 0.5~5 EB/year (starts in >17 years) • Data volume at LHC and HL-LHC for comparison: • LHC (2016): 50 PB raw, 80 PB derived (1 copy) • HL-LHC (2027): ~600 PB raw, ~900 PB derived (1 copy) • No major computing problem expected for the Higgs/W factory • Benefit from WLCG experience • Try to do it at lower cost • The challenging part would be the Z factory • EB scale, the same data volume level as HL-LHC, but ~11 years later
Computing requirements • CEPC simulation for detector design needs ~2K CPU cores and ~2 PB of storage each year • Current funding is not enough to meet these requirements • Distributed computing is the main way to collect and organize resources for R&D • Dedicated resources from funding • Contributions from collaborators • Shared IHEP resources from other experiments through HTCondor and IHEPCloud • Commercial clouds, supercomputing centers…
Distributed computing • The CEPC distributed computing system was built on DIRAC in 2015 • DIRAC provides a framework and solutions for experiments to set up their own distributed computing systems • Originally from LHCb, now widely used by other communities, such as Belle II, ILC, CTA, EGI, etc. • Good cooperation with the DIRAC community • Joined the DIRAC consortium in 2016 • Joining efforts on common needs • The system takes into account current CEPC computing requirements, the resource situation and the available manpower • Use existing grid solutions from WLCG as much as possible • Keep the system as simple as possible for users and sites
Computing model • IHEP as the central site • Event Generation (EG) and analysis • Holds the central storage for all data • Holds the central database for detector geometry • Remote sites • MC production including Mokka simulation + Marlin reconstruction • No storage requirements • Data flow • IHEP -> sites: stdhep files from EG are distributed to the sites • Sites -> IHEP: output MC data is transferred directly back to IHEP by the jobs • Simple but extensible • No problem adding more sites and resources • Storage can be extended to a multi-tier infrastructure with more distributed SEs
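To make the data flow concrete, the fragment below is a minimal sketch of how a remote-site MC job could be described with the DIRAC Python API, with its output registered straight back to an IHEP storage element. The wrapper script, LFN paths and the SE name "IHEP-STORM" are assumptions for illustration, not the production configuration.

```python
# Minimal sketch of a remote-site MC production job submitted through the
# DIRAC Python API. Site, SE, script and file names below are illustrative only.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)  # initialize the DIRAC environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("cepc_mc_higgs_0001")
# Run simulation + reconstruction via a wrapper script shipped with the job
# (assumed name; in production the software itself comes from CVMFS)
job.setExecutable("run_mokka_marlin.sh", arguments="higgs.stdhep")
job.setInputSandbox(["run_mokka_marlin.sh"])
# Input stdhep file distributed from IHEP, output sent straight back to the
# central IHEP storage element (LFN and SE name are assumptions)
job.setInputData(["/cepc/mc/higgs/stdhep/higgs_0001.stdhep"])
job.setOutputData(["higgs_0001_rec.slcio"], outputSE="IHEP-STORM")

dirac = Dirac()
print(dirac.submitJob(job))
```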
Network • The IHEP international network provides a good basis for distributed computing • 20 Gbps outbound, 10 Gbps to Europe/USA • In 2018 IHEP joined LHCONE, together with CSTNet and CERNET • LHCONE is a virtual network dedicated to LHC traffic • CEPC international cooperation can also benefit from LHCONE
Resources • 6 active sites • England, Taiwan, Chinese universities (4) • QMUL from England and IPAS from Taiwan play a major role • Resources: ~2500 CPU cores, shared with other experiments • Resource types include Cluster, Grid and Cloud • This year ~500 dedicated CPU cores will be added from IHEP • QMUL: Queen Mary University of London • IPAS: Institute of Physics, Academia Sinica
Software distribution with CVMFS • CVMFS is a global, HTTP-based file system that distributes software in a fast and reliable way • CEPC uses CVMFS to distribute its software • The IHEP CVMFS service gradually joined the global federation • In 2014, the IHEP CVMFS Stratum 0 (S0) was created • In 2017, the IHEP CVMFS Stratum 1 (S1) was created, replicating both the IHEP S0 and the CERN S0 • In 2018, the RAL S1 began replicating the IHEP S0 to speed up access to CEPC software for European collaborators
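As a small illustration of how a job can rely on CVMFS at a site, the sketch below checks that the CEPC repository is reachable on the worker node before sourcing the experiment setup; the repository name and setup-script path are assumptions.

```python
# Minimal sketch: verify the CEPC CVMFS repository is reachable on a worker
# node before using it. Repository name and setup path are assumptions.
import os
import subprocess
import sys

CEPC_REPO = "/cvmfs/cepc.ihep.ac.cn"                          # assumed repository name
SETUP_SCRIPT = os.path.join(CEPC_REPO, "software/setup.sh")   # assumed setup script

def cvmfs_available(repo_path):
    """Listing the path lets autofs mount the repository on demand."""
    try:
        return len(os.listdir(repo_path)) > 0
    except OSError:
        return False

if not cvmfs_available(CEPC_REPO):
    sys.exit("CVMFS repository %s is not reachable on this node" % CEPC_REPO)

# Source the experiment environment and run the payload inside it
subprocess.check_call(["bash", "-c", "source %s && echo environment ready" % SETUP_SCRIPT])
```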
Workload management • DIRAC WMS • Middle layer between jobs and the various resources • JSUB • Self-developed, general-purpose tool for massive job submission and management • Production system • Will highly automate MC simulation submission for upcoming production plans
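The loop below is only a rough sketch of the kind of bulk submission that JSUB and the production system automate, written against the plain DIRAC API (it is not JSUB's own interface); the wrapper script and seed handling are assumptions.

```python
# Rough sketch of bulk MC submission of the kind JSUB / the production system
# automate, using the plain DIRAC API. Script name and seeds are assumptions.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)   # initialize the DIRAC environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

dirac = Dirac()
for seed in range(1, 101):                   # 100 statistically independent jobs
    job = Job()
    job.setName("cepc_mc_z_%04d" % seed)
    job.setExecutable("run_mokka_marlin.sh", arguments="--seed %d" % seed)
    job.setInputSandbox(["run_mokka_marlin.sh"])
    result = dirac.submitJob(job)
    print(seed, result.get("Value"))
```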
Data management • Central Storage Element based on StoRM to share experiment data among sites • Lustre (/cefs) as its backend • The frontend provides SRM, HTTP and xrootd access • A file catalogue, metadata catalogue and dataset catalogue will provide a global view of CEPC datasets • DFC (DIRAC File Catalogue) could be one of the solutions • Data movement system among sites • DIRAC solution: Transformation + Request Management System + FTS • The prototype is ready
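For illustration, the snippet below sketches how one output file could be uploaded to the central SE and replicated to a remote site with the DIRAC data-management client; the LFN, local file name and SE names are assumptions.

```python
# Minimal sketch of upload + replication with the DIRAC data-management client.
# LFN, local file name and SE names are illustrative assumptions.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.DataManagementSystem.Client.DataManager import DataManager

dm = DataManager()
lfn = "/cepc/mc/higgs/rec/higgs_0001_rec.slcio"

# Upload the local file to the central IHEP SE and register it in the catalogue
print(dm.putAndRegister(lfn, "higgs_0001_rec.slcio", "IHEP-STORM"))

# Create (and register) a replica at a remote site's SE
print(dm.replicateAndRegister(lfn, "QMUL-SE"))
```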
Monitoring • To ensure site stability, a monitoring system has been developed and is in production • A global view of site status • Sends regular SAM tests to the sites • Collects site status information • Takes action when sites fail • DIRAC system and job monitoring use Elasticsearch (ES) and Kibana • More monitoring can be considered on the ES and Grafana infrastructure
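As a sketch of how a SAM-style test result could feed such monitoring, the snippet below posts one site-status document to an Elasticsearch index over its REST API; the endpoint URL, index name and document fields are assumptions.

```python
# Minimal sketch: push one SAM-style site-status record into Elasticsearch
# over its REST API. URL, index name and fields are illustrative assumptions.
import datetime
import requests

ES_URL = "http://monitor.example.ihep.ac.cn:9200"   # assumed endpoint
doc = {
    "site": "GRID.QMUL.uk",
    "test": "cvmfs-access",
    "status": "OK",
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
}

# _doc is the standard document endpoint; Kibana/Grafana can then plot the index
resp = requests.post("%s/cepc-site-status/_doc" % ES_URL, json=doc, timeout=10)
resp.raise_for_status()
print(resp.json().get("result"))
```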
Central services at IHEP • Basic services are in operation • DIRAC (WMS, DMS) • Central storage server (1.1 PB) • CVMFS and Squid servers • Grid CA and VOMS services for user authentication on the grid • Monitoring • All running stably
Current production status • Distributed computing has been handling the full load of CEPC massive simulation for the last four years • 2015~2018: about 3 million jobs, about 2 PB of data exchanged
On-going research (near future)
Multi-core support • Parallelism is being considered for future CEPC software • Exploit multi-core CPU architectures and improve performance • Decrease memory usage per core • A study of multi-core scheduling in the distributed computing system started in 2017 • The prototype is successful • Multi-core scheduling with different pilot modes has been developed • Scheduling efficiency has been studied and improved • Tuning with real CEPC use cases can follow once they are ready
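To illustrate the memory-per-core argument, the toy payload below processes events with a pool of worker processes instead of independent single-core jobs; the event loop and per-event work are invented for illustration and are not CEPC software.

```python
# Toy multi-core payload: a pool of worker processes handles events in
# parallel so the application footprint is shared across the allocated cores.
# Event count and the per-event work are invented for illustration only.
import multiprocessing as mp
import os

def process_event(event_id):
    # Stand-in for per-event simulation/reconstruction work
    return sum(i * i for i in range(10000)) % (event_id + 1)

if __name__ == "__main__":
    n_cores = int(os.environ.get("NUM_CORES", mp.cpu_count()))
    with mp.Pool(processes=n_cores) as pool:
        results = pool.map(process_event, range(1000))
    print("processed %d events on %d cores" % (len(results), n_cores))
```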
Light-weight virtualization - Singularity • Singularity, based on container technology, provides • Good OS portability and isolation in a very lightweight way • Enough flexibility for sites to choose their own OS • Singularity has been embedded transparently in the current system • Singularity can be started by pilots to provide the OS the payload wants • Sites are still allowed to add another container layer outside • Tests have been done successfully on the IHEP local HTCondor site • Ready for wider usage to come • [Diagram: on a worker node, the pilot job runs in the site's container and launches each payload inside its own Singularity container]
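The fragment below is a minimal sketch of how a pilot could wrap a payload in Singularity on the worker node; the image location, bind mounts and payload command are assumptions.

```python
# Minimal sketch of a pilot wrapping a payload inside Singularity.
# The image location and payload command are assumptions for illustration.
import subprocess

IMAGE = "/cvmfs/unpacked.example/cepc/sl6:latest"   # assumed image path
PAYLOAD = ["bash", "run_payload.sh"]                # assumed payload command

cmd = [
    "singularity", "exec",
    "--bind", "/cvmfs:/cvmfs",   # expose CVMFS inside the container
    IMAGE,
] + PAYLOAD

subprocess.check_call(cmd)
```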
Commercial Cloud integration • Commercial clouds, with their practically unlimited capacity, would be a good potential resource for urgent, CPU-intensive CEPC tasks • Clouds can be integrated well into the current distributed computing system and used in an elastic way • Cloud resources can be acquired and released in real time according to actual CEPC job requirements • With the support of the Amazon AWS China region, trials have been done successfully with CEPC simulation jobs • Well connected, ran smoothly, and returned results back to IHEP • The cost pattern needs further study, depending on future usage
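The snippet below sketches the elastic acquire-and-release idea with the boto3 EC2 client; the region, AMI, instance type and tags are assumptions, and in practice the cloud is driven through the distributed computing system rather than direct calls like these.

```python
# Minimal sketch of elastic acquire/release of cloud worker nodes with boto3.
# Region, AMI, instance type and tags are assumptions; the real system drives
# clouds through the distributed computing layer rather than direct calls.
import boto3

ec2 = boto3.client("ec2", region_name="cn-north-1")   # assumed AWS China region

# Acquire one worker node when the task queue grows
started = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",    # assumed CEPC worker-node image
    InstanceType="c5.xlarge",           # assumed instance type
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "cepc-simulation"}],
    }],
)
instance_id = started["Instances"][0]["InstanceId"]
print("started", instance_id)

# ... pilots on the node pull CEPC jobs; release it once the queue is drained
ec2.terminate_instances(InstanceIds=[instance_id])
```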
HPC federation • HPC is becoming more and more important for data processing • HTC (High Throughput Computing) is the main resource type for HEP • Many HPC computing centers are being built, e.g. at IHEP, JINR… • An HPC federation is planned to build a "grid" of HPC • Integrate HTC and HPC resources as a whole • A preliminary study has been done with GPUs • With "tags" in DIRAC, GPU and CPU jobs can easily find their proper resources
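A minimal sketch of tag-based matching is shown below, assuming a DIRAC release whose Job API exposes setTag; the tag value and GPU payload script are assumptions.

```python
# Minimal sketch of tag-based matching between jobs and resources, assuming a
# DIRAC release whose Job API exposes setTag. Tag name and script are assumed.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("cepc_gpu_training_0001")
job.setExecutable("run_gpu_training.sh")   # assumed GPU payload
job.setTag(["GPU"])                        # only sites/queues advertising "GPU" match

print(Dirac().submitJob(job))
```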
Scaling up • Even at the scale of CERN today, to meet the needs of HL-LHC • Resources would still need to grow x50~100 with current hardware • Past experience shows that scaling up is not just an issue of increasing capacity, but also about • Manpower and costs • Performance and efficiency of data access and processing • Complexity of resource provisioning and management • Will future evolutions help? • Technology • Infrastructure
Technology evolution • Huge data processing is also a challenge faced by industry, especially the internet companies • Google search: ~98 PB; Google internet archive: ~18 EB • The convergence of big data and deep learning with HPC is the technology trend in industry • Creating a promising outlook for the exabyte era • HEP is trying to catch up • Both in software and in computing • These trends are visible at the CHEP, WLCG, ACAT and HSF workshops
Infrastructure evolution • Distributed computing will remain the main way to organize resources • HL-LHC, CEPC • It will no longer be a simple grid system, but • A mixture of all possible resources, highly heterogeneous • Constantly changing with technology evolution in networking, storage, software, analysis techniques… • Could supercomputing centers become a dominant resource in the future? • CMS has started looking into it • "A single exascale system could process the whole of HL-LHC with no R&D or model changes"
Unexpected evolutions • Quantum computing is coming closer to us • Universal hardware in products is expected within ~10 years (Intel) • CERN Openlab held its first Quantum Computing for High Energy Physics workshop on November 5-6, 2018 • "A breakthrough in the number of qubits could emerge at any time, although there are challenges at different levels"
International cooperation with HSF • To face these challenges, international cooperation is needed more than ever • HSF (HEP Software Foundation) provides a platform for experts across the HEP communities to work together on future software and computing • IHEP cooperates with the international communities and benefits from common efforts through HSF • IHEP was invited to give two plenary talks at HSF workshops
Summary • The CEPC distributed computing system is working well for current CEPC R&D tasks • New technology studies have been carried out to meet CEPC requirements in the near future • Looking further ahead: cooperate closely with the HEP community and follow technology trends