
Fabric Management for CERN Experiments Past, Present, and Future



Presentation Transcript


  1. Fabric Management for CERN Experiments: Past, Present, and Future. Tim Smith, CERN/IT

  2. Contents • The Fabric of CERN today • The new challenges of LHC computing • What has this got to do with the GRID? • Fabric Management solutions of tomorrow? • The DataGRID Project

  3. Fabric Elements • Functionalities: batch and interactive services, disk servers, tape servers + devices, stage servers, home directory servers, application servers, backup service • Infrastructure: job scheduler, authentication, authorisation, monitoring, alarms, console managers, networks
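One way to picture these fabric elements is as a simple role inventory mapping machines to the functional and infrastructure services they provide. The sketch below is purely illustrative, in Python; the role names follow the slide, while the node names and data structure are assumptions, not anything from CERN's actual tools.

```python
# Illustrative sketch only: modelling the fabric element inventory as data.
# Role names follow the slide; node names and structure are assumed.
FUNCTIONALITIES = [
    "batch", "interactive", "disk-server", "tape-server", "stage-server",
    "home-directory-server", "application-server", "backup-service",
]
INFRASTRUCTURE = [
    "job-scheduler", "authentication", "authorisation", "monitoring",
    "alarms", "console-manager", "network",
]

# Each fabric node is tagged with the roles it provides (hypothetical names).
node_roles = {
    "lxbatch001": ["batch"],
    "diskserv042": ["disk-server", "stage-server"],
}

def nodes_providing(role, roles=node_roles):
    """Return every node that currently provides the given role."""
    return [node for node, rs in roles.items() if role in rs]

print(nodes_providing("disk-server"))   # ['diskserv042']
```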

  4. Fabric Technology at CERN • [Chart: multiplicity (1 to 10000, log scale) vs. year (1989–2005): mainframes (IBM, Cray), then RISC workstations, SMPs (SGI, DEC, HP, SUN) and scalable systems (SP2, CS2), and from the late 1990s PC farms at the largest multiplicities]

  5. Architecture Considerations • Physics applications have ideal data parallelism • mass of independent problems • No message passing • throughput rather than performance • resilience rather than ultimate reliability • Can build hierarchies of mass market components • High Throughput Computing
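The "throughput rather than performance" point is what makes farms of mass market components viable: every event can be processed independently, with no message passing between workers. A minimal sketch of that pattern in Python follows; reconstruct_event is a hypothetical stand-in for the real physics application.

```python
# Minimal sketch of throughput-oriented ("embarrassingly parallel") event
# processing: events are independent, so workers never exchange messages.
# reconstruct_event is a hypothetical stand-in for the physics application.
from multiprocessing import Pool

def reconstruct_event(raw_event):
    # Placeholder for an independent per-event computation.
    return sum(raw_event) % 251

if __name__ == "__main__":
    raw_events = [[i, i + 1, i + 2] for i in range(10000)]
    with Pool(processes=8) as pool:
        # Throughput, not per-event latency: any free worker takes the next
        # chunk of events; a failure affects only the events in that chunk.
        results = pool.map(reconstruct_event, raw_events, chunksize=100)
    print(len(results), "events processed")
```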

  6. Component Architecture • [Diagram: a high-capacity backbone switch interconnecting application servers (groups of CPUs behind 100/1000baseT switches), disk servers, and tape servers behind a 1000baseT switch]

  7. Analysis Chain: Farms • [Diagram: detector → event filter (selection & reconstruction) → raw data → event reconstruction → event summary data → batch physics analysis → analysis objects (extracted by physics topic) → interactive physics analysis, with event simulation feeding in alongside the detector]
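Read left to right, the chain is a pipeline of independent stages, each consuming the previous stage's output. The sketch below shows that shape in Python; the stage names follow the diagram, and the function bodies are placeholders rather than real reconstruction or analysis code.

```python
# Illustrative sketch of the analysis chain as a pipeline of stages; the
# stage names follow the diagram, the bodies are placeholders only.
def event_filter(detector_readout):
    """Selection & reconstruction at the event filter: keep 'good' hits."""
    return [hit for hit in detector_readout if hit > 0]          # raw data

def event_reconstruction(raw_data):
    return {"tracks": len(raw_data)}                             # event summary data

def batch_physics_analysis(event_summary_data):
    return {"topic_A": event_summary_data["tracks"] % 7}         # analysis objects

def interactive_physics_analysis(analysis_objects):
    return max(analysis_objects.values())

detector_readout = [3, -1, 5, 0, 2]                              # toy input
print(interactive_physics_analysis(
    batch_physics_analysis(
        event_reconstruction(
            event_filter(detector_readout)))))                   # -> 3
```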

  8. Multiplication! • [Chart: installed #CPUs (0–1200) vs. time (Jul-97 to Jan-00), stacked by farm: alice, atlas, ccf, cms, eff, ion, l3c, lhcb, lxbatch, lxplus, mta, na45, na48, na49, nomad, pcsf, tapes, tomog]

  9. PC Farms

  10. Shared Facilities

  11. LHC Computing Challenge • The scale will be different • CPU: 10k SI95 → 1M SI95 • Disk: 30 TB → 3 PB • Tape: 600 TB → 9 PB • i.e. roughly a factor 100 in CPU and disk, and 15 in tape • The model will be different • There are compelling reasons why some of the farms and some of the capacity will not be located at CERN

  12. Estimated disk storage and CPU capacity at CERN • [Charts: estimated disk storage capacity and estimated CPU capacity at CERN, non-LHC vs. LHC, compared with Moore's Law; current batch capacity ~10k SI95 on 1200 processors] • Bad news, tapes: less than a factor 2 cost reduction in 8 years, and a significant fraction of the total cost • Bad news, I/O: 1996: 4 GB disks @ 10 MB/s, so 1 TB delivers 2500 MB/s; 2000: 50 GB disks @ 20 MB/s, so 1 TB delivers only 400 MB/s
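The I/O numbers follow from simple arithmetic: as individual disks grow, a terabyte is spread over fewer spindles, so its aggregate streaming bandwidth falls even though each disk gets faster. The short Python check below reproduces the slide's figures.

```python
# Reproduce the slide's "Bad news: I/O" arithmetic: the aggregate streaming
# bandwidth of 1 TB of storage built from disks of a given size and speed.
def bandwidth_per_tb(disk_gb, mb_per_s):
    n_disks = 1000 / disk_gb      # disks needed to hold ~1 TB
    return n_disks * mb_per_s     # aggregate MB/s across those disks

print(bandwidth_per_tb(4, 10))    # 1996:  4 GB disks @ 10 MB/s -> 2500.0 MB/s
print(bandwidth_per_tb(50, 20))   # 2000: 50 GB disks @ 20 MB/s ->  400.0 MB/s
```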

  13. Regional Centres: a Multi-Tier Model • [Diagram: a tiered topology with CERN as Tier 0, Tier 1 regional centres (IN2P3, RAL, FNAL), Tier 2 centres (Lab a, Uni b, Lab c, ...), then departments and desktops, connected by links of 2.5 Gbps, 622 Mbps and 155 Mbps] • MONARC: http://cern.ch/MONARC

  14. More realistically: a Grid Topology • [Diagram: the same sites (CERN Tier 0; Tier 1: IN2P3, RAL, FNAL; Tier 2 universities and labs; departments; desktops) and link speeds (2.5 Gbps, 622 Mbps, 155 Mbps), but interconnected as a grid rather than a strict hierarchy] • DataGRID: http://cern.ch/grid

  15. Can we build LHC farms? • Positive predictions • CPU and disk price/performance trends suggest that the raw processing and disk storage capacities will be affordable, and • raw data rates and volumes look manageable • perhaps not today for ALICE • Space, power and cooling issues? • So probably yes… but can we manage them? • Understand costs: 1 PC is cheap, but managing 10000 is not! • Building and managing coherent systems from such large numbers of boxes will be a challenge • (1999: CDR @ 45 MB/s for NA48; 2000: CDR @ 90 MB/s for ALICE)

  16. Management Tasks I • Supporting adaptability • Configuration Management • Machine / Service hierarchy • Automated registration / insertion / removal • Dynamic reassignment • Automatic Software Installation and Management (OS and applications) • Version management • Application dependencies • Controlled (re)deployment
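As an illustration of the "automated registration / insertion / removal" and "dynamic reassignment" items, the sketch below shows the kind of minimal configuration registry they imply. It is an illustrative toy in Python, under assumed names, not the actual WP4 design.

```python
# Illustrative toy (not the WP4 design): a configuration registry supporting
# automated registration, removal, and dynamic reassignment of nodes.
class ConfigRegistry:
    def __init__(self):
        self.services = {}                    # service name -> set of nodes

    def register(self, node, service):
        """Automated registration / insertion of a node into a service."""
        self.services.setdefault(service, set()).add(node)

    def remove(self, node):
        """Automated removal, e.g. when hardware is retired or fails."""
        for nodes in self.services.values():
            nodes.discard(node)

    def reassign(self, node, new_service):
        """Dynamic reassignment, e.g. interactive -> batch overnight."""
        self.remove(node)
        self.register(node, new_service)

registry = ConfigRegistry()                   # node names are hypothetical
registry.register("lxplus001", "interactive")
registry.reassign("lxplus001", "batch")
print(registry.services)                      # {'interactive': set(), 'batch': {'lxplus001'}}
```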

  17. Management Tasks II • Controlling Quality of Service • System Monitoring • Orientation to the service NOT the machine • Uniform access to diverse fabric elements • Integrated with configuration (change) management • Problem Management • Identification of root causes (faults + performance) • Correlate network / system / application data • Highly automated • Adaptive - Integrated with configuration management
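The key shift here is monitoring "the service NOT the machine": raw samples come from individual boxes, but health and alarms are evaluated per service. A minimal sketch of that aggregation in Python, with made-up machine names, metrics, and thresholds:

```python
# Sketch (illustrative assumptions only) of service-oriented monitoring:
# raw samples come from machines, but health is evaluated per service.
from collections import defaultdict

samples = [                        # (machine, service, metric, value) - all made up
    ("lxbatch001", "batch", "load", 0.95),
    ("lxbatch002", "batch", "load", 0.92),
    ("diskserv042", "disk", "load", 0.10),
]

def service_view(samples):
    """Aggregate per-machine samples into a per-service view."""
    per_service = defaultdict(list)
    for machine, service, metric, value in samples:
        per_service[service].append(value)
    return {svc: sum(vals) / len(vals) for svc, vals in per_service.items()}

# Alarm on the service, not the machine.
for svc, load in service_view(samples).items():
    if load > 0.9:
        print(f"ALARM: service '{svc}' degraded (mean load {load:.2f})")
```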

  18. Relevance to the GRID? • Scalable solutions needed in absence of the GRID! • For the GRID to work it must be presented with information and opportunities • Coordinated and efficiently run centres • Presentable as a guaranteed quality resource • 'GRID'ification: the interfaces

  19. Management Tasks: a GRID centre • GRID enable • Support external requests: services • Publication • Coordinated + 'map'able • Security: Authentication / Authorisation • Policies: Allocation / Priorities / Estimation / Cost • Scheduling • Reservation • Change Management • Guarantees • Resource availability / QoS
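"Publication" means the centre must advertise what it can offer in a form a Grid scheduler can use: capacity, current availability, estimates, and policy or cost hints. The sketch below shows one hypothetical shape such a record could take; the field names, site name, and JSON format are assumptions, not a DataGRID interface.

```python
# Hypothetical sketch of the kind of resource record a centre might publish
# to the Grid: capacity, current availability, an estimate for scheduling,
# and a policy hint. Field names and format are assumptions, not DataGRID.
import json
import time

def publish_resource_record(total_cpus, free_cpus, free_disk_tb, queue_wait_s):
    record = {
        "site": "example-centre",                    # hypothetical site name
        "timestamp": time.time(),
        "cpus": {"total": total_cpus, "free": free_cpus},
        "disk_tb_free": free_disk_tb,
        "estimated_queue_wait_s": queue_wait_s,      # estimation for schedulers
        "policy": {"max_reservation_cpus": total_cpus // 10},
    }
    return json.dumps(record)

print(publish_resource_record(total_cpus=1000, free_cpus=120,
                              free_disk_tb=4.5, queue_wait_s=600))
```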

  20. Existing Solutions? • The world outside is moving fast! • Dissimilar problems • Virtual supercomputers (~200 nodes) • MPI, latency, interconnect topology and bandwidth • Roadrunner, LosLobos, Cplant, Beowulf • Similar problems • ISPs / ASPs (~200 nodes) • Clustering: high availability / mission critical • The DataGRID: Fabric Management WP4

  21. WP4 Partners • CERN (CH) Tim Smith • ZIB (D) Alexander Reinefeld • KIP (D) Volker Lindenstruth • NIKHEF (NL) Kors Bos • INFN (I) Michele Michelotto • RAL (UK) Andrew Sansum • IN2P3 (Fr) Denis Linglin

  22. Concluding Remarks • Years of experience in exploiting inexpensive mass market components • But we need to marry these with inexpensive, highly scalable management tools • Build the components back together as a resource for the GRID
