This article discusses the computing challenges and architecture of the Circular Electron-Positron Collider (CEPC), including data volume, distributed computing, network resources, software distribution, workload management, data management, and monitoring.
Computing challenges and architecture Xiaomei Zhang, IHEP April 2019 CEPC workshop, Oxford, UK
Time Line • Higgs factory in 2030, Z factory in ~2037 • Higgs: ~4 years after HL-LHC • Z: ~11 years after HL-LHC
Data Volume • Readout in DAQ from the CDR • Maximum event rate: ~100 kHz at the Z peak • Data rate to trigger: ~2 TB/s • Trigger rate not clear yet • Event size from simulation • Size of signal event: ~500 KB/event for Z, ~1 MB/event for Higgs • Adding the background, this will likely increase to 5~10 MB/event for Z and 10~20 MB/event for Higgs • Estimated data rate output to disk: • Higgs/W factory (8 years) with 10^8 events: 1.5~3 PB/year • Z factory (2 years) with 10^11~10^12 events: 0.5~5 EB/year
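The Z-factory figure can be cross-checked with a quick back-of-the-envelope calculation using the event counts and the ~5 MB/event size quoted above; the short Python sketch below just multiplies them out.

```python
# Back-of-the-envelope check of the Z-factory disk estimate, using the
# 10^11~10^12 event count and the ~5 MB/event size quoted above.
MB = 1e6          # bytes
EB = 1e18         # bytes
SIZE_MB = 5       # Z event size including background, from the slide above

for n_events in (1e11, 1e12):
    total_eb = n_events * SIZE_MB * MB / EB
    print("%.0e events x %d MB/event = %.1f EB" % (n_events, SIZE_MB, total_eb))
# prints 0.5 EB and 5.0 EB, i.e. the quoted 0.5~5 EB range
```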
CEPC computing challenges • Two stages: • Higgs/W factory: 1.5~3 PB/year (starts in >10 years) • Z factory: 0.5~5 EB/year (starts in >17 years) • Data volume at LHC and HL-LHC for comparison: • LHC (2016): 50 PB raw, 80 PB derived (1 copy) • HL-LHC (2027): ~600 PB raw, ~900 PB derived (1 copy) • No major computing problem expected for the Higgs/W factory • Benefit from WLCG experience • Try to do it at lower cost • The challenging part would be the Z factory • EB scale, the same data volume level as HL-LHC, but ~11 years later
Computing requirements • CEPC simulation for detector design needs ~2K CPU cores and ~2 PB of storage each year • Current funding is not enough to meet these requirements • Distributed computing is the main way to collect and organize resources for R&D • Dedicated resources from funding • Contributions from collaborators • Shared IHEP resources from other experiments through HTCondor and IHEPCloud • Commercial clouds, supercomputing centers…
Distributed computing • The CEPC distributed computing system was built on DIRAC in 2015 • DIRAC provides a framework and solutions for experiments to set up their own distributed computing systems • Originally from LHCb, now widely used by other communities, such as Belle II, ILC, CTA, EGI, etc. • Good cooperation with the DIRAC community • Joined the DIRAC consortium in 2016 • Joining efforts on common needs • The system takes into account current CEPC computing requirements, the resource situation and the available manpower • Use existing grid solutions from WLCG as much as possible • Keep the system as simple as possible for users and sites
Computing model • IHEP as the central site • Event Generation (EG) and analysis • Holds the central storage for all data • Holds the central database for detector geometry • Remote sites • MC production including Mokka simulation + Marlin reconstruction • No storage requirements • Data flow • IHEP -> sites: stdhep files from EG are distributed to the sites • Sites -> IHEP: output MC data is transferred directly back to IHEP by the jobs • Simple but extensible • No problem adding more sites and resources • Storage can be extended to a multi-tier infrastructure with more distributed SEs
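To make the data flow concrete, the fragment below is a minimal sketch of how a remote-site MC job could be described with the DIRAC Python API, with its output registered straight back to an IHEP storage element. The wrapper script, LFN paths and the SE name "IHEP-STORM" are assumptions for illustration, not the production configuration.

```python
# Minimal sketch of a remote-site MC production job submitted through the
# DIRAC Python API. Site, SE, script and file names below are illustrative only.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)  # initialize the DIRAC environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("cepc_mc_higgs_0001")
# Run simulation + reconstruction via a wrapper script shipped with the job
# (assumed name; in production the software itself comes from CVMFS)
job.setExecutable("run_mokka_marlin.sh", arguments="higgs.stdhep")
job.setInputSandbox(["run_mokka_marlin.sh"])
# Input stdhep file distributed from IHEP, output sent straight back to the
# central IHEP storage element (LFN and SE name are assumptions)
job.setInputData(["/cepc/mc/higgs/stdhep/higgs_0001.stdhep"])
job.setOutputData(["higgs_0001_rec.slcio"], outputSE="IHEP-STORM")

dirac = Dirac()
print(dirac.submitJob(job))
```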
Network • The IHEP international network provides a good basis for distributed computing • 20 Gbps outbound, 10 Gbps to Europe/USA • In 2018 IHEP joined LHCONE, together with CSTNet and CERNET • LHCONE is a virtual network dedicated to LHC traffic • CEPC international cooperation can also benefit from LHCONE
Resources • 6 active sites • England, Taiwan, Chinese universities (4) • QMUL from England and IPAS from Taiwan play a major role • Resources: ~2500 CPU cores, shared with other experiments • Resource types include Cluster, Grid and Cloud • This year ~500 dedicated CPU cores will be added from IHEP • QMUL: Queen Mary University of London • IPAS: Institute of Physics, Academia Sinica
Software distribution with CVMFS • CVMFS is a global, HTTP-based file system that distributes software in a fast and reliable way • CEPC uses CVMFS to distribute its software • The IHEP CVMFS service gradually joined the global federation • In 2014, the IHEP CVMFS Stratum 0 (S0) was created • In 2017, the IHEP CVMFS Stratum 1 (S1) was created, replicating both the IHEP S0 and the CERN S0 • In 2018, the RAL S1 began replicating the IHEP S0 to speed up access to CEPC software for European collaborators
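As a small illustration of how a job can rely on CVMFS at a site, the sketch below checks that the CEPC repository is reachable on the worker node before sourcing the experiment setup; the repository name and setup-script path are assumptions.

```python
# Minimal sketch: verify the CEPC CVMFS repository is reachable on a worker
# node before using it. Repository name and setup path are assumptions.
import os
import subprocess
import sys

CEPC_REPO = "/cvmfs/cepc.ihep.ac.cn"                          # assumed repository name
SETUP_SCRIPT = os.path.join(CEPC_REPO, "software/setup.sh")   # assumed setup script

def cvmfs_available(repo_path):
    """Listing the path lets autofs mount the repository on demand."""
    try:
        return len(os.listdir(repo_path)) > 0
    except OSError:
        return False

if not cvmfs_available(CEPC_REPO):
    sys.exit("CVMFS repository %s is not reachable on this node" % CEPC_REPO)

# Source the experiment environment and run the payload inside it
subprocess.check_call(["bash", "-c", "source %s && echo environment ready" % SETUP_SCRIPT])
```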
Workload management • DIRAC WMS • Middle layer between jobs and the various resources • JSUB • Self-developed, general-purpose tool for massive job submission and management • Production system • Will highly automate MC simulation submission for upcoming production plans
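The loop below is only a rough sketch of the kind of bulk submission that JSUB and the production system automate, written against the plain DIRAC API (it is not JSUB's own interface); the wrapper script and seed handling are assumptions.

```python
# Rough sketch of bulk MC submission of the kind JSUB / the production system
# automate, using the plain DIRAC API. Script name and seeds are assumptions.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)   # initialize the DIRAC environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

dirac = Dirac()
for seed in range(1, 101):                   # 100 statistically independent jobs
    job = Job()
    job.setName("cepc_mc_z_%04d" % seed)
    job.setExecutable("run_mokka_marlin.sh", arguments="--seed %d" % seed)
    job.setInputSandbox(["run_mokka_marlin.sh"])
    result = dirac.submitJob(job)
    print(seed, result.get("Value"))
```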
Data management • Central Storage Element based on StoRM to share experiment data among sites • Lustre (/cefs) as its backend • The frontend provides SRM, HTTP and xrootd access • A file catalogue, metadata catalogue and dataset catalogue will provide a global view of CEPC datasets • DFC (DIRAC File Catalogue) could be one of the solutions • Data movement system among sites • DIRAC solution: Transformation + Request Management System + FTS • The prototype is ready
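For illustration, the snippet below sketches how one output file could be uploaded to the central SE and replicated to a remote site with the DIRAC data-management client; the LFN, local file name and SE names are assumptions.

```python
# Minimal sketch of upload + replication with the DIRAC data-management client.
# LFN, local file name and SE names are illustrative assumptions.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.DataManagementSystem.Client.DataManager import DataManager

dm = DataManager()
lfn = "/cepc/mc/higgs/rec/higgs_0001_rec.slcio"

# Upload the local file to the central IHEP SE and register it in the catalogue
print(dm.putAndRegister(lfn, "higgs_0001_rec.slcio", "IHEP-STORM"))

# Create (and register) a replica at a remote site's SE
print(dm.replicateAndRegister(lfn, "QMUL-SE"))
```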
Monitoring • To ensure site stability, a monitoring system has been developed and is in production • A global view of site status • Sends regular SAM tests to the sites • Collects site status information • Takes action when sites fail • DIRAC system and job monitoring use Elasticsearch (ES) and Kibana • More monitoring can be considered on the ES and Grafana infrastructure
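As a sketch of how a SAM-style test result could feed such monitoring, the snippet below posts one site-status document to an Elasticsearch index over its REST API; the endpoint URL, index name and document fields are assumptions.

```python
# Minimal sketch: push one SAM-style site-status record into Elasticsearch
# over its REST API. URL, index name and fields are illustrative assumptions.
import datetime
import requests

ES_URL = "http://monitor.example.ihep.ac.cn:9200"   # assumed endpoint
doc = {
    "site": "GRID.QMUL.uk",
    "test": "cvmfs-access",
    "status": "OK",
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
}

# _doc is the standard document endpoint; Kibana/Grafana can then plot the index
resp = requests.post("%s/cepc-site-status/_doc" % ES_URL, json=doc, timeout=10)
resp.raise_for_status()
print(resp.json().get("result"))
```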
Central services at IHEP • Basic services are in operation • DIRAC (WMS, DMS) • Central storage server (1.1 PB) • CVMFS and Squid servers • Grid CA and VOMS services for user authentication on the grid • Monitoring • All running stably
Current production status • Distributed computing has been handling the full load of CEPC massive simulation for the last four years • 2015~2018: about 3 million jobs, about 2 PB of data exchanged
On-going research (near future)
Multi-core support • Parallelism is being considered for future CEPC software • Exploit multi-core CPU architectures and improve performance • Decrease memory usage per core • A study of multi-core scheduling in the distributed computing system started in 2017 • The prototype is successful • Multi-core scheduling with different pilot modes has been developed • Scheduling efficiency has been studied and improved • Tuning with real CEPC use cases can follow once they are ready
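To illustrate the memory-per-core argument, the toy payload below processes events with a pool of worker processes instead of independent single-core jobs; the event loop and per-event work are invented for illustration and are not CEPC software.

```python
# Toy multi-core payload: a pool of worker processes handles events in
# parallel so the application footprint is shared across the allocated cores.
# Event count and the per-event work are invented for illustration only.
import multiprocessing as mp
import os

def process_event(event_id):
    # Stand-in for per-event simulation/reconstruction work
    return sum(i * i for i in range(10000)) % (event_id + 1)

if __name__ == "__main__":
    n_cores = int(os.environ.get("NUM_CORES", mp.cpu_count()))
    with mp.Pool(processes=n_cores) as pool:
        results = pool.map(process_event, range(1000))
    print("processed %d events on %d cores" % (len(results), n_cores))
```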
Light-weight virtualization - Singularity • Singularity, based on container technology, provides • Good OS portability and isolation in a very lightweight way • Enough flexibility for sites to choose their own OS • Singularity has been embedded transparently in the current system • Singularity can be started by pilots to provide the OS the payload wants • Sites are still allowed to add another container layer outside • Tests have been done successfully on the IHEP local HTCondor site • Ready for wider usage to come • [Diagram: on a worker node, the pilot job runs in the site's container and launches each payload inside its own Singularity container]
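The fragment below is a minimal sketch of how a pilot could wrap a payload in Singularity on the worker node; the image location, bind mounts and payload command are assumptions.

```python
# Minimal sketch of a pilot wrapping a payload inside Singularity.
# The image location and payload command are assumptions for illustration.
import subprocess

IMAGE = "/cvmfs/unpacked.example/cepc/sl6:latest"   # assumed image path
PAYLOAD = ["bash", "run_payload.sh"]                # assumed payload command

cmd = [
    "singularity", "exec",
    "--bind", "/cvmfs:/cvmfs",   # expose CVMFS inside the container
    IMAGE,
] + PAYLOAD

subprocess.check_call(cmd)
```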
Commercial Cloud integration • Commercial clouds, with their practically unlimited capacity, would be a good potential resource for urgent, CPU-intensive CEPC tasks • Clouds can be integrated well into the current distributed computing system and used in an elastic way • Cloud resources can be acquired and released in real time according to actual CEPC job requirements • With the support of the Amazon AWS China region, trials have been done successfully with CEPC simulation jobs • Well connected, ran smoothly, and returned results back to IHEP • The cost pattern needs further study, depending on future usage
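The snippet below sketches the elastic acquire-and-release idea with the boto3 EC2 client; the region, AMI, instance type and tags are assumptions, and in practice the cloud is driven through the distributed computing system rather than direct calls like these.

```python
# Minimal sketch of elastic acquire/release of cloud worker nodes with boto3.
# Region, AMI, instance type and tags are assumptions; the real system drives
# clouds through the distributed computing layer rather than direct calls.
import boto3

ec2 = boto3.client("ec2", region_name="cn-north-1")   # assumed AWS China region

# Acquire one worker node when the task queue grows
started = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",    # assumed CEPC worker-node image
    InstanceType="c5.xlarge",           # assumed instance type
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "purpose", "Value": "cepc-simulation"}],
    }],
)
instance_id = started["Instances"][0]["InstanceId"]
print("started", instance_id)

# ... pilots on the node pull CEPC jobs; release it once the queue is drained
ec2.terminate_instances(InstanceIds=[instance_id])
```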
HPC federation • HPC is becoming more and more important for data processing • HTC (High Throughput Computing) is the main resource type for HEP • Many HPC computing centers are being built, e.g. at IHEP, JINR… • An HPC federation is planned to build a "grid" of HPC • Integrate HTC and HPC resources as a whole • A preliminary study has been done with GPUs • With "tags" in DIRAC, GPU and CPU jobs can easily find their proper resources
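A minimal sketch of tag-based matching is shown below, assuming a DIRAC release whose Job API exposes setTag; the tag value and GPU payload script are assumptions.

```python
# Minimal sketch of tag-based matching between jobs and resources, assuming a
# DIRAC release whose Job API exposes setTag. Tag name and script are assumed.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("cepc_gpu_training_0001")
job.setExecutable("run_gpu_training.sh")   # assumed GPU payload
job.setTag(["GPU"])                        # only sites/queues advertising "GPU" match

print(Dirac().submitJob(job))
```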
Scaling up • Even at the scale of CERN today, to meet the needs of HL-LHC • Resources would still need to grow x50~100 with current hardware • Past experience shows that scaling up is not just an issue of increasing capacity, but also about • Manpower and costs • Performance and efficiency of data access and processing • Complexity of resource provisioning and management • Will future evolutions help? • Technology • Infrastructure
Technology evolution • Huge data processing is also a challenge faced by industry, especially the internet companies • Google search: ~98 PB; Google internet archive: ~18 EB • The convergence of big data and deep learning with HPC is the technology trend in industry • Creating a promising outlook for the exabyte era • HEP is trying to catch up • Both in software and in computing • These trends are visible at the CHEP, WLCG, ACAT and HSF workshops
Infrastructure evolution • Distributed computing will remain the main way to organize resources • HL-LHC, CEPC • It will no longer be a simple grid system, but • A mixture of all possible resources, highly heterogeneous • Constantly changing with technology evolution in networking, storage, software, analysis techniques… • Could supercomputing centers become a dominant resource in the future? • CMS has started looking into it • "A single exascale system could process the whole of HL-LHC with no R&D or model changes"
Unexpected evolutions • Quantum computing is coming closer to us • Universal hardware in products is expected within ~10 years (Intel) • CERN Openlab held its first Quantum Computing for High Energy Physics workshop on November 5-6, 2018 • "A breakthrough in the number of qubits could emerge at any time, although there are challenges at different levels"
International cooperation with HSF • To face these challenges, international cooperation is needed more than ever • HSF (HEP Software Foundation) provides a platform for experts across the HEP communities to work together on future software and computing • IHEP cooperates with the international communities and benefits from common efforts through HSF • IHEP was invited to give two plenary talks at HSF workshops
Summary • The CEPC distributed computing system is working well for current CEPC R&D tasks • New technology studies have been carried out to meet CEPC requirements in the near future • Looking further ahead: cooperate closely with the HEP community and follow technology trends