Fabric Management for CERN Experiments: Past, Present, and Future
Tim Smith, CERN/IT
Contents
• The Fabric of CERN today
• The new challenges of LHC computing
• What has this got to do with the GRID?
• Fabric Management solutions of tomorrow?
• The DataGRID Project
Fabric Elements
Functionalities:
• Batch and interactive services
• Disk servers
• Tape servers + devices
• Stage servers
• Home directory servers
• Application servers
• Backup service
Infrastructure:
• Job scheduler
• Authentication
• Authorisation
• Monitoring
• Alarms
• Console managers
• Networks
Fabric Technology at CERN
[Chart: multiplicity scale (1 to 10,000) versus year, 1989–2005, showing the evolution from mainframes (IBM, Cray) through RISC workstations, SMPs (SGI, DEC, HP, SUN) and scalable systems (SP2, CS2) to PC farms.]
Architecture Considerations
• Physics applications have ideal data parallelism:
  • a mass of independent problems
  • no message passing
  • throughput rather than performance
  • resilience rather than ultimate reliability
• Can build hierarchies of mass market components
• High Throughput Computing (sketched below)
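To make the "mass of independent problems" point concrete, here is a minimal, hypothetical Python sketch of throughput-oriented processing: events are farmed out to worker processes with no communication between them, so total throughput scales with the number of workers rather than with per-job speed. The function and event contents are illustrative stand-ins, not any experiment's actual code.

```python
# Hypothetical sketch of High Throughput Computing on independent events:
# each event is processed in isolation, with no inter-worker message passing.
from multiprocessing import Pool

def reconstruct(event):
    """Stand-in for an independent physics job."""
    return sum(hit * hit for hit in event)  # dummy per-event computation

if __name__ == "__main__":
    events = [[1.0, 2.0, 3.0]] * 10_000          # a mass of independent problems
    with Pool(processes=8) as pool:               # throughput scales with workers
        results = pool.map(reconstruct, events, chunksize=100)
    print(len(results), "events processed")
```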
Component Architecture
[Diagram: a high-capacity backbone switch interconnects application servers and their CPU nodes (behind 100/1000baseT switches), disk servers, and a row of tape servers (behind a 1000baseT switch).]
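One rough way to picture the hierarchy of mass-market building blocks is as a nested description like the one below. This is purely an illustration of the slide's diagram, with invented counts and labels, not an actual CERN configuration.

```python
# Illustrative (not actual) description of the farm component architecture.
fabric = {
    "backbone_switch": "high capacity",
    "application_servers": [
        {"uplink": "100/1000baseT switch", "cpus": 5},   # CPU nodes behind an edge switch
    ],
    "disk_servers": [{"uplink": "backbone"}],
    "tape_servers": [{"uplink": "1000baseT switch"} for _ in range(4)],
}

# e.g. count the boxes that would need managing
n_boxes = (sum(s["cpus"] for s in fabric["application_servers"])
           + len(fabric["disk_servers"]) + len(fabric["tape_servers"]))
print(n_boxes, "fabric elements")
```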
Analysis Chain: Farms
[Diagram: the detector delivers raw data to the event filter (selection & reconstruction); event reconstruction produces event summary data and processed data; batch physics analysis extracts analysis objects by physics topic, which feed interactive physics analysis; event simulation feeds the same chain.]
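The chain is a sequence of largely independent transformations over the event sample. The hypothetical Python sketch below shows the shape of such a pipeline; the function names, cuts and data fields are illustrative only, not the experiments' actual software.

```python
# Hypothetical sketch of the analysis chain; names and cuts are illustrative only.
def event_filter(raw):          # selection & reconstruction at the detector
    return [e for e in raw if e["energy"] > 10.0]

def reconstruct(selected):      # raw data -> event summary data (ESD)
    return [{"esd": e["energy"] * 0.98} for e in selected]

def batch_analysis(esd):        # ESD -> analysis objects by physics topic
    return [e for e in esd if e["esd"] > 50.0]

raw_data = [{"energy": float(i)} for i in range(100)]   # stand-in for detector output
analysis_objects = batch_analysis(reconstruct(event_filter(raw_data)))
print(len(analysis_objects), "analysis objects ready for interactive analysis")
```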
Multiplication!
[Chart: number of CPUs in CERN farms from Jul-97 to Jan-00, growing towards ~1200, stacked by experiment/service: alice, atlas, ccf, cms, eff, ion, l3c, lhcb, lxbatch, lxplus, mta, na45, na48, na49, nomad, pcsf, tapes, tomog.]
PC Farms
Shared Facilities
LHC Computing Challenge
• The scale will be different:
  • CPU: 10k SI95 → 1M SI95
  • Disk: 30 TB → 3 PB
  • Tape: 600 TB → 9 PB
• The model will be different: there are compelling reasons why some of the farms and some of the capacity will not be located at CERN
[Charts: estimated disk storage capacity and estimated CPU capacity at CERN, split into LHC and non-LHC, tracked against Moore's Law; today's CPU capacity is ~10K SI95 across 1200 processors.]
• Bad news, tapes: less than a factor-2 price reduction in 8 years, and a significant fraction of the total cost
• Bad news, IO: in 1996 a 4 GB disk streamed 10 MB/s, so 1 TB aggregated 2500 MB/s; in 2000 a 50 GB disk streams 20 MB/s, so 1 TB aggregates only 400 MB/s (arithmetic sketched below)
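The IO figures follow from simple arithmetic: a terabyte built from many small disks aggregates many spindles' bandwidth, while the same terabyte on fewer, larger disks does not. A quick check of the slide's numbers (the helper function is just for illustration):

```python
# Reproduce the slide's IO arithmetic: aggregate bandwidth of 1 TB of disk.
def tb_bandwidth(disk_gb, disk_mb_s, total_tb=1):
    n_disks = total_tb * 1000 / disk_gb        # spindles needed for the capacity
    return n_disks * disk_mb_s                 # each spindle streams independently

print(tb_bandwidth(4, 10))    # 1996: 4 GB @ 10 MB/s  -> 2500 MB/s per TB
print(tb_bandwidth(50, 20))   # 2000: 50 GB @ 20 MB/s ->  400 MB/s per TB
```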
Regional Centres: a Multi-Tier Model
[Diagram: CERN as Tier 0 connects to Tier 1 regional centres (IN2P3, RAL, FNAL), which connect onward to Tier 2 universities and labs, then departments and desktops; link speeds range from 2.5 Gbps at the top down to 622 Mbps and 155 Mbps further out.]
MONARC: http://cern.ch/MONARC
More realistically: a Grid Topology
[Diagram: the same centres (CERN Tier 0; Tier 1s IN2P3, RAL, FNAL; Tier 2 universities and labs; departments and desktops) interconnected as a grid rather than a strict hierarchy, with the same 155 Mbps to 2.5 Gbps links.]
DataGRID: http://cern.ch/grid
Can we build LHC farms?
• Positive predictions:
  • CPU and disk price/performance trends suggest that the raw processing and disk storage capacities will be affordable
  • raw data rates and volumes look manageable (perhaps not today for ALICE)
  • space, power and cooling issues?
• So probably yes… but can we manage them?
  • Understand costs: 1 PC is cheap, but managing 10,000 is not!
  • Building and managing coherent systems from such large numbers of boxes will be a challenge
(1999: CDR @ 45 MB/s for NA48! 2000: CDR @ 90 MB/s for ALICE!)
Management Tasks I
Supporting adaptability:
• Configuration Management (illustrated in the sketch below)
  • machine / service hierarchy
  • automated registration / insertion / removal
  • dynamic reassignment
• Automatic Software Installation and Management (OS and applications)
  • version management
  • application dependencies
  • controlled (re)deployment
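As an illustration only (this is not the DataGRID WP4 tooling), a configuration-management view of a node might couple its place in the machine/service hierarchy with the software versions and dependencies it should run, so that registration, reassignment and redeployment can be automated. All names and fields below are invented.

```python
# Hypothetical node description for automated configuration management.
node = {
    "hostname": "lxbatch042",            # illustrative name
    "cluster": "lxbatch",                # machine / service hierarchy
    "services": ["batch-worker"],
    "os": {"name": "linux", "version": "2.2"},
    "applications": {
        "batch-worker": {"version": "1.4", "depends": ["afs-client"]},
    },
    "state": "production",               # registration / insertion / removal
}

def reassign(node, new_cluster, new_services):
    """Dynamic reassignment: change the desired state; deployment tools converge on it."""
    node["cluster"], node["services"] = new_cluster, new_services
    return node

reassign(node, "lxplus", ["interactive-login"])
print(node["cluster"], node["services"])
```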
Management Tasks II
Controlling Quality of Service:
• System Monitoring (see the sketch below)
  • orientation to the service, NOT the machine
  • uniform access to diverse fabric elements
  • integrated with configuration (change) management
• Problem Management
  • identification of root causes (faults + performance)
  • correlate network / system / application data
  • highly automated
  • adaptive: integrated with configuration management
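A sketch of what "orientation to the service, not the machine" could mean in practice: per-node measurements are rolled up into a service-level view before alarms are raised. This is an illustration under invented data and thresholds, not the actual monitoring system.

```python
# Hypothetical service-oriented monitoring: aggregate node metrics per service.
node_metrics = [
    {"node": "disk01", "service": "stage", "up": True,  "load": 0.7},
    {"node": "disk02", "service": "stage", "up": False, "load": 0.0},
    {"node": "cpu001", "service": "batch", "up": True,  "load": 0.9},
]

def service_view(metrics):
    services = {}
    for m in metrics:
        s = services.setdefault(m["service"], {"nodes": 0, "up": 0})
        s["nodes"] += 1
        s["up"] += m["up"]
    return services

for name, s in service_view(node_metrics).items():
    ok = s["up"] / s["nodes"]
    # Alarm on degraded *service* capacity, not on any single machine.
    print(f"{name}: {ok:.0%} of nodes available", "ALARM" if ok < 0.8 else "OK")
```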
Relevance to the GRID?
• Scalable solutions needed in absence of the GRID!
• For the GRID to work it must be presented with information and opportunities
• Coordinated and efficiently run centres
• Presentable as a guaranteed-quality resource
• 'GRID'ification: the interfaces
Mgmt Tasks: A GRID Centre
• GRID enable: support external requests and services
  • publication: coordinated + 'map'able (example below)
  • security: authentication / authorisation
  • policies: allocation / priorities / estimation / cost
  • scheduling
  • reservation
  • change management
• Guarantees: resource availability / QoS
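What "publication" might look like: a centre advertises its capacity, policies and QoS guarantees in a form that grid schedulers can map and reserve against. The record below is purely illustrative, with invented site names and numbers; the real DataGRID interfaces were still being defined at the time.

```python
# Hypothetical resource advertisement a fabric might publish to the grid.
import json

advertisement = {
    "site": "example-tier1",                       # illustrative site name
    "capacity": {"cpu_si95": 20_000, "disk_tb": 100, "tape_tb": 500},
    "qos": {"availability": 0.95, "max_job_hours": 48},
    "policies": {"allocation": "per-experiment quotas", "priority": "fair-share"},
    "interfaces": {"authentication": "x509", "scheduling": "reservation supported"},
}

print(json.dumps(advertisement, indent=2))        # what a scheduler would consume
```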
Existing Solutions?
• The world outside is moving fast!
• Dissimilar problems:
  • virtual supercomputers (~200 nodes)
  • MPI, latency, interconnect topology and bandwidth
  • Roadrunner, LosLobos, Cplant, Beowulf
• Similar problems:
  • ISPs / ASPs (~200 nodes)
  • clustering: high availability / mission critical
• The DataGRID: Fabric Management WP4
WP4 Partners
• CERN (CH): Tim Smith
• ZIB (D): Alexander Reinefeld
• KIP (D): Volker Lindenstruth
• NIKHEF (NL): Kors Bos
• INFN (I): Michele Michelotto
• RAL (UK): Andrew Sansum
• IN2P3 (Fr): Denis Linglin
Concluding Remarks
• Years of experience in exploiting inexpensive mass market components
• But we need to marry these with inexpensive, highly scalable management tools
• Build the components back together as a resource for the GRID