Physics Analysis on RISC Machines: Experiences at CERN
Saclay, 20 June 1994
Frédéric Hemmer, Computing & Networks Division, CERN, Geneva, Switzerland
CERN - The European Laboratory for Particle Physics
• Fundamental research in particle physics
• Designs, builds & operates large accelerators
• Financed by 19 European countries
• SFR 950M budget - operation + new accelerators
• 3,000 staff
• Experiments conducted by a small number of large collaborations: 400 physicists, 50 institutes, 18 countries, using experimental apparatus costing 100s of MSFR
Computing at CERN
• computers are everywhere: embedded microprocessors
• 2,000 personal computers
• 1,400 scientific workstations
• RISC clusters, even mainframes
• estimated 40 MSFR per year (+ staff)
Central Computing Services
• 6,000 users
• Physics data processing traditionally: mainframes + batch, with emphasis on reliability and utilisation level
• Tapes: 300,000 active volumes, 22,000 tape mounts per week
Application Characteristics
• inherent coarse-grain parallelism (at event or job level)
• Fortran
• modest floating-point content
• high data volumes: disks; tapes, tape robots
• moderate, but respectable, data rates - a few MB/sec per fast RISC cpu
An obvious candidate for RISC clusters - and a major challenge.
CORE - Centrally Operated RISC Environment
• Single management domain
• Services configured for specific applications and groups, but common system management
• Focus on data - external access to tape and disk services from the CERN network, or even outside CERN
CERN CORE Physics Services - equipment installed or on order, January 1994
• CSF - Simulation Facility: 25 H-P 9000-735, H-P 9000-750
• SHIFT - Data-intensive services: 24 SGI, 11 DEC Alpha, 9 H-P, 2 SUN, 1 IBM processors; 1.1 TeraBytes embedded disk
• Central Data Services: shared tape servers (3 tape robots, 21 tape drives, 6 EXABYTEs, 7 IBM and SUN servers); shared disk servers (6 SGI, DEC, IBM servers, 260 GBytes)
• PIAF - Interactive Analysis Facility: 5 H-P 9000-755, 100 GB RAID disk
• Home directories & registry: SPARCservers, Baydel RAID disks, tape juke box
• Scalable Parallel Processors: 8-node SPARCcenter, 32-node Meiko CS-2 (early 1994)
• All interconnected by the CERN Network; consoles & monitors on SPARCstations
CSF - Central Simulation Facility
• second generation, joint project with H-P
• H-P 750 interactive host (job queues; shared, load-balanced), with Ethernet and FDDI connections to the tape servers
• 25 H-P 735s - 48 MB memory, 400 MB disk
• one job per processor
• generates data on local disk, staged out to tape at end of job
• long jobs (4 to 48 hours)
• very high cpu utilisation: >97%
• very reliable: >1 month MTBI
SHIFT - Scalable, Heterogeneous, Integrated Facility
• Designed in 1990
• fast access to large amounts of disk data
• good tape support
• cheap & easy to expand
• vendor independent
• mainframe quality
• First implementation in production within 6 months
Design choices
• Unix + TCP/IP
• system-wide batch job queues - "single system image", targeting Cray-style service quality
• pseudo distributed file system - assumes no read/write file sharing
• distributed tape staging model (disk cache of tape files)
• the tape access primitives are: copy disk file to tape, copy tape file to disk
The Software Model
• disk servers, cpu servers, stage servers, queue servers and tape servers connected by an IP network
• define functional interfaces ---- scalable, heterogeneous, distributed
Basic Software
• Unix Tape Subsystem (multi-user, labels, multi-file operation)
• Fast Remote File Access System
• Remote Tape Copy System
• Disk Pool Manager
• Tape Stager
• Clustered NQS batch system
• Integration with standard I/O packages - FATMEN, RZ, FZ, EPIO, ..
• Network Operation
• Monitoring
Unix Tape Control
• tape daemon
• operator interface / robot interface
• tape unit allocation / deallocation
• label checking, writing
Remote Tape Copy System
• selects a suitable tape server
• initiates the tape-disk copy
Examples:
  tpread -v CUT322 -g SMCF -q 4,6 pathname
  tpwrite -v IX2857 -q 3-5 file3 file4 file5
  tpread -v UX3465 `sfget -p opaldst file34`
Remote File Access System - RFIO
High performance, reliability (improve on NFS)
• C I/O compatibility library; Fortran subroutine interface
• rfio daemon started by open on the remote machine
• optimised for specific networks
• asynchronous operation (read ahead)
• optional vector pre-seek - an ordered list of the records which will probably be read next
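To make the C compatibility interface concrete, here is a minimal sketch of a client reading a disk-pool file through RFIO in 32 KB records. It assumes the rfio_open / rfio_read / rfio_close / rfio_perror entry points and a <shift.h> header; the pool path reuses the hypothetical example from the Disk Pool Management slide. None of this code comes from the original slides.

  /* Minimal RFIO read loop (sketch - header name and exact
     signatures assumed from the SHIFT RFIO C compatibility API) */
  #include <stdio.h>
  #include <fcntl.h>
  #include <shift.h>                 /* assumed RFIO declarations */

  int main(void)
  {
      char buf[32768];               /* 32 KB records, as in the FDDI tests below */
      int  n, fd;

      /* hypothetical disk-pool file created earlier with sfget */
      fd = rfio_open("/shift/shd01/data6/ws/panzer/file26", O_RDONLY, 0);
      if (fd < 0) { rfio_perror("rfio_open"); return 1; }

      while ((n = rfio_read(fd, buf, sizeof(buf))) > 0)
          ;                          /* process one record here */

      rfio_close(fd);
      return 0;
  }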
A disk pool is a collection of Unix file systems, possibly on several nodes (e.g. sgi1, dec24, sun5), viewed as a single chunk of allocatable space.
Disk Pool Management
• allocation of files to pools and filesystems
• pools can be public or private
• pools can be temporary or permanent
• capacity management
• name server
• garbage collection
• example: sfget -p opaldst file26 may create a file like /shift/shd01/data6/ws/panzer/file26
Tape Stager
• implements a disk cache of magnetic tape files
• integrates the Remote Tape Copy System & Disk Pool Management
• queues concurrent requests for the same tape file
• provides full error recovery - restage &/or operator control on hardware/system error
• initiates garbage collection if the disk is full
• supports disk pools & single (private) file systems
• available from any workstation
Tape Stager - flow (diagram): independent stage control for each disk pool. A user job on a cpu server issues "stagein tape, file"; the stage control allocates a pool file with "sfget file" on a disk server and requests "rtcopy/tpread tape, file" from a tape server; the cpu server then accesses the staged file via RFIO.
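In practice a job combines these pieces at the command level. The two lines below are a sketch only: stagein is the command named in the diagram above, but its option letters are assumed here to follow the tpread/sfget conventions of the earlier slides, and myanalysis is a hypothetical user program.

  stagein -v CUT322 -q 4 -p opaldst file26     # copy the tape file into the opaldst pool, unless already cached
  myanalysis `sfget -p opaldst file26`         # the job then reads the staged disk copy, locally or via RFIO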
SHIFT Status - equipment installed or on order, January 1994

CERN group   Configuration                                      cpu (CU*)  disk (GB)
OPAL         SGI Challenge 4-cpu + 8-cpu (R4400 - 150 MHz),
             two SGI 340S 4-cpu (R3000 - 33 MHz)                      290        590
ALEPH        SGI Challenge 4-cpu (R4400 - 150 MHz),
             eight DEC 9000-400                                       216        200
DELPHI       Two H-P 9000/735                                          52        200
L3           SGI Challenge 4-cpu (R4400 - 150 MHz)                     80        300
ATLAS        H-P 9000/755                                              26         23
CMS          H-P 9000/735                                              26         23
SMC          SUN SPARCserver10, 4/630                                  22          4
CPLEAR       DEC 3000-300AXP, 500AXP                                   29         10
CHORUS       IBM RS/6000-370                                           15         15
NOMAD        DEC 3000-500 AXP                                          19         15
Totals                                                                775       1380
(CERN IBM mainframe, for comparison                                   120        600)

* CERN-Units: one CU equals approx. 4 SPECints
Current SHIFT Usage
• 60% cpu utilisation
• 9,000 tape mounts per week, 15% of them writes - still some way from holding the active data on disk
• MTBI for cpu and disk servers: 400 hours for an individual server
• MTBF for disks: 160K hours
A maturing service, but it does not yet surpass the quality of the mainframe.
CORE Networking
• Ethernet + Fibronics hubs - aggregate 2 MBytes/sec sustained
• FDDI + GigaSwitch - 2-3 MBytes/sec sustained
• UltraNet 1 Gbps backbone - 6 MBytes/sec sustained
(Diagram: these networks interconnect the simulation service, the IBM mainframe, the SHIFT cpu, disk and tape servers, the home directory servers, and the connection to CERN & external networks.)
FDDI Performance (September 1993)
100 MByte disk file read/written sequentially using 32 KB records
client: H-P 735; server: SGI Crimson, SEAGATE Wren 9 disk

            read          write
NFS         1.6 MB/sec    300 KB/sec
RFIO        2.7 MB/sec    1.7 MB/sec
PIAF - Parallel Interactive Data Analysis Facility (R. Brun, A. Nathaniel, F. Rademakers, CERN)
• the data is "spread" across the interactive server cluster
• the user formulates a transaction on his personal workstation
• the transaction is executed simultaneously on all servers
• the partial results are combined and returned to the user's workstation (see the sketch after the architecture diagram below)
PIAF Architecture (diagram): the user's personal workstation runs the display manager and PIAF client; the PIAF Service consists of a PIAF server and five PIAF workers.
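The essence of the model can be shown without any PIAF code: each worker scans only its slice of the events and fills a partial histogram, and the client merges the partial histograms. The C sketch below is purely illustrative - all names and data are hypothetical, and the per-worker loops, which run concurrently on the real cluster, are written sequentially for brevity.

  /* Conceptual sketch of the PIAF query model: partition, partial fill, merge.
     Not actual PIAF code - names and data are hypothetical. */
  #include <stdio.h>

  #define NWORKERS 5
  #define NEVENTS  1000
  #define NBINS    10

  int main(void)
  {
      double events[NEVENTS];                  /* stand-in for the ntuple column */
      long   partial[NWORKERS][NBINS] = {{0}}; /* one partial histogram per worker */
      long   total[NBINS] = {0};
      int    w, i, b;

      for (i = 0; i < NEVENTS; i++)            /* fake data in [0,1) */
          events[i] = (i % 97) / 97.0;

      /* each "worker" scans only its slice of the data
         (in PIAF these loops run concurrently on the worker nodes) */
      for (w = 0; w < NWORKERS; w++)
          for (i = w * (NEVENTS / NWORKERS); i < (w + 1) * (NEVENTS / NWORKERS); i++) {
              b = (int)(events[i] * NBINS);
              partial[w][b]++;
          }

      /* the client merges the partial results */
      for (w = 0; w < NWORKERS; w++)
          for (b = 0; b < NBINS; b++)
              total[b] += partial[w][b];

      for (b = 0; b < NBINS; b++)
          printf("bin %d: %ld\n", b, total[b]);
      return 0;
  }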
Scalable Parallel Processors
• embarrassingly parallel application - therefore in competition with workstation clusters
• SMPs and SPPs should do a better job for SHIFT than loosely coupled clusters
• computing requirements will increase by three orders of magnitude over the next ten years
• R&D project started, funded by ESPRIT (GPMIMD2): 32-processor Meiko CS-2, 25 man-years of development
Conclusion
• Workstation clusters have replaced mainframes at CERN for physics data processing
• For the first time, we see computing budgets come within reach of the requirements
• Very large, distributed & scalable disk and tape configurations can be supported
• Mixed-manufacturer environments work, and allow smooth expansion of the configuration
• Network performance is the biggest weakness in scalability
• Requires a different operational style & organisation from mainframe services
Operating RISC machines
• SMPs are easier to manage
• SMPs require less manpower
• Distributed management is not yet robust
• The network is THE problem
• Much easier than mainframes, and
• ... cost effective