The CERN Cloud Computing Project
William Lu, Ph.D., Platform Computing
LHC Computing Hierarchy (Markus Schulz, CERN)
Emerging Vision: A Richly Structured, Global Dynamic System
[Diagram: tiered data flow, top to bottom]
• Experiment to Online System: ~PByte/sec
• Online System to Tier 0+1 (CERN Center: PBs of disk; tape robot): ~100-1500 MBytes/sec
• Tier 0+1 to Tier 1 (FNAL Center, IN2P3 Center, INFN Center, RAL Center): 10 Gbps
• Tier 1 to Tier 2 (Tier2 Centers): 2.5-10 Gbps
• Tier 2 to Tier 3 (Institutes, physics data cache): ~2.5-10 Gbps
• Tier 3 to Tier 4 (Workstations): 0.1 to 10 Gbps
CERN/Outside Resource Ratio ~1:2; Tier0/(Tier1)/(Tier2) ~1:1:1
Tens of Petabytes by 2010. An Exabyte ~5-7 years later.
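To put the link rates in the hierarchy in perspective, here is a rough back-of-the-envelope sketch of how long one petabyte would take to move over a single link. It assumes a decimal petabyte (10^15 bytes), a fully utilized link, and no protocol overhead, none of which the slides state; `transfer_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 1e15  # assumption: decimal petabyte, in bytes


def transfer_days(num_bytes: float, link_gbps: float) -> float:
    """Idealized transfer time in days over a link of link_gbps gigabits/sec."""
    bits = num_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # The two ends of the 2.5-10 Gbps range shown between tiers.
    for gbps in (2.5, 10.0):
        days = transfer_days(PETABYTE, gbps)
        print(f"1 PB over {gbps} Gbps: about {days:.1f} days")
```

Under these idealized assumptions, a single petabyte occupies a 10 Gbps link for more than a week, which suggests why the data is distributed across tiers rather than served from one site.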
Environment
Computers:
• 40,000 CPU cores used by multiple experiments
Storage:
• Disks + tapes
• Storage management system (CASTOR) is tightly integrated with workload management (Platform LSF)
Software:
• Apps: open source, home-grown
• OS: Scientific Linux, other Linux
• VMs: open source Xen, KVM
Challenges
IT serves users manually
• User requests for resources, OS, software stack, etc. are handled manually, which is slow
Users circumvent scheduling policies
• Users are not satisfied with the centrally managed scheduling policies because of their unique needs
• They submit a pilot job to occupy resources, then run scripts to prepare the application environment and schedule jobs within that resource block. This causes low resource utilization
Legacy application issues
• Legacy applications need a legacy OS, which does not run on the latest hardware
Batch Virtualization Requirements
• How to:
  • Isolate the application environment
  • Increase security
• How to:
  • Automate resource provisioning and management
  • Keep management practice scalable
Virtualization
Solution
Platform ISF + Platform ISF Adaptive Cluster
• Integration with Platform LSF to provision VMs based on workload
• Integration with the provisioning system Quattor
• Each experiment is able to schedule its own VM clusters with a unique application environment
• VM cluster capacity is elastic, based on workload
How It Works
1 HPC administrator sets up VM resource pools, one for each experiment
2 HPC administrator also sets up the minimum and maximum number of VMs within each pool
3 User submits a workload that cannot be met by their VM resource pool
4 Platform ISF AC interacts with Platform ISF to adjust the size of the resource pool
[Diagram labels: Platform LSF (x3), Platform ISF AC (x3), Platform ISF, shared pool of resources, external provider]
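The four steps above can be sketched as a small simulation. This is only an illustration of elastic pool sizing between a minimum and a maximum, not the Platform ISF or Platform ISF AC API: `VMPool`, `target_size`, and `resize` are hypothetical names, demand is assumed to be expressed as a number of VMs, and the external provider is not modeled.

```python
from dataclasses import dataclass


@dataclass
class VMPool:
    """Hypothetical per-experiment VM pool (steps 1 and 2)."""
    name: str
    min_vms: int  # lower bound set by the administrator
    max_vms: int  # upper bound set by the administrator
    size: int     # VMs currently in the pool


def target_size(pool: VMPool, demand: int) -> int:
    """Clamp the VM demand of the submitted workload to the pool's bounds."""
    return max(pool.min_vms, min(pool.max_vms, demand))


def resize(pool: VMPool, demand: int, shared_free: int) -> tuple[int, int]:
    """Return (new pool size, remaining free VMs in the shared pool).

    Growing draws VMs from the shared pool (steps 3 and 4) and is limited
    by what is free; shrinking returns VMs to the shared pool.
    """
    target = target_size(pool, demand)
    if target > pool.size:
        grant = min(target - pool.size, shared_free)
        return pool.size + grant, shared_free - grant
    released = pool.size - target
    return target, shared_free + released


if __name__ == "__main__":
    pool = VMPool(name="experiment-a", min_vms=2, max_vms=10, size=4)
    # Workload needs 8 VMs, but only 3 are free in the shared pool.
    print(resize(pool, demand=8, shared_free=3))
    # Workload drains: the pool shrinks to its minimum.
    print(resize(pool, demand=1, shared_free=5))
```

Clamping to the administrator's bounds keeps each experiment's pool from starving the others, while the shared pool absorbs whatever an idle experiment releases.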
Results
Increase user service level
• Each experiment can control its own application stack and resource allocation policies
Redeploy servers quickly and efficiently
• Reduce cost and save power
• Share batch compute servers with data management and database servers
• Automated administration
• Allow scalability
• No hypervisor lock-in
• Freedom to choose among multiple VM hypervisors