Some Flavours of Computing at DESY Rainer Mankel DESY Hamburg
DESY in General • National center of basic research in physics • Member of HGF • Sites: Hamburg + Zeuthen (near Berlin) • About 1600 employees, including 400 scientists • 1200 users in particle physics from 25 countries • 2200 users in HASYLAB from 33 countries ... and almost everybody needs computing
DESY in a Nutshell • Four HERA experiments: H1 (ep), ZEUS (ep), HERMES (eN), HERA-B (pN): reconstruction, analysis, ... • Accelerators: machine controls • HASYLAB: synchrotron radiation • TTF (TESLA Test Facility)
DESY: Future Projects • PETRA as a New High Brilliance Synchrotron Radiation Source: DESY plans to convert the PETRA storage ring into a new high brilliance third generation synchrotron radiation source. 1.4 MEUR from Federal Ministry of Education and Research for design phase • design report end 2003 • construction start in 2007? • TESLA: • e+e- Superconducting Linear Collider (0.5 ... 1 TeV) • integrated X-ray laser • this Monday 11:00: very positive recommendation from German Science Council (Wissenschaftsrat)
DESY Computing • Impossible to cover this large topic in a short talk • Restrict to some particular areas of interest
General Transitions [diagram]: Technologies: Mainframe, SMP → commodity hardware; IRIX, Solaris; DM, Lit, Pta, ...
DESY Central Computing (IT Division) • O(70) people • Operating ~all imaginable services (mail, web, registry, databases, AFS, HSM, backup (Tivoli), Windows, networks, firewalls, dCache...) • Tape storage: 4 STK Powderhorn tape silos (interconnected) • media: 9840 cartridges (old, 20 GB), 9940B (new, 200 GB)
Supported Operating Systems • IRIX 6.5, HP-UX on their way out • Alpha-OSF, AIX were never really supported but still used by some groups • Long-term support for • Linux • DESY Linux 3 (based on SuSE-Linux 6.3) until end of the year • DESY Linux 4 (based on SuSE-Linux 7.2) • Solaris • Linux installation/support service • YAST for initial installation • SALAD / BOOM (homemade tools) for dynamic software updates
HERA Experiments and Computing • HERA delivered record ep luminosity of 50 pb-1 in 2000 • luminosity upgrade during 2001 • intended improvement factor of 5 • 1 fb-1 planned until 2006 • major detector upgrades in experiments • HERA experiments have their own expertise in computing • closer look at ZEUS computing
Computing of a HERA Experiment: ZEUS • General purpose ep collider experiment • About 450 physicists • Expect 20-40 TB/year of RAW data after luminosity upgrade • whole of DESY approaches PB regime during HERA-II lifetime • O(100) modern processors in farms for reconstruction & batch analysis • MC production distributed world-wide („funnel“), O(3-5 M events/week) routinely • funnel is an early computing grid [figure: new ZEUS event display — new vertex detector within the calorimeter]
General Challenge (ZEUS) [diagram]: MC production, data processing/reprocessing, data mining and interactive data analysis for ~450 users; 50 M / 200 M events/year; tape storage increment 20-40 TB/year; disk storage 3-5 TB/year
HERA-II Challenges (cont’d) • Data processing should be closely linked to data-taking • sufficient capacity for reprocessing • Analysis facility • increased standards of • reliability • transparency • turnaround • high level of approval from users essential • Interactive environment
Hardware • ZEUS phased out SGI Challenge SMPs in Feb 2002 • After first PC farm in 1997, computing has completely moved to PCs • Computing nodes Intel Pentium 350 MHz – 1.2 GHz, mostly dual processors • a new farm with dual Xeon 2.2 GHz just ordered • Fast Ethernet • central farm server with Gb Ethernet • Workgroup servers • SUN Sparcs phased out in April • new system: DELFI1 cluster (PC Intel/Linux) • File servers • SGI Origin with 8.5 TB of SCSI/FC (partially RAID5) • by now 7 commodity PC-based DELFI3 servers (12 TB)
Network Structure [diagram]: central switch connecting HSM systems (x5), file servers and the farm server via 1 Gb/s links; PC farm (2 x 48 nodes) attached via 100 Mb/s links
Performance of Reconstruction Farm [plot]: throughput of old farm vs. new farm vs. new farm + tuning; the 2 M events/day level is marked
DELFI1* (19” rack-mount) [diagram, two configurations]: (a) 2x 40 GB system (mirrored), 2x 80 GB workgroup space, 3Ware 7850 controller; (b) 2x 40 GB system (mirrored), 6x 80 GB workgroup space (stripe or RAID5), 3Ware 7850 controller • for high-availability applications (workgroup servers) *DESY Linux File Server
Commodity File Servers DELFI3 • custom built (Invention: F. Collin / CERN) • 2x 40 GB system(EIDE) • 20x 120 GB data • 3 RAID controllers • Gb ethernet • 2.4 TB of storage for 13000 EUR
Commodity File Servers (cont’d) DELFI2 • 12 EIDE disks • 2 RAID controllers • 19” rack mount • more economical in terms of floor space • only a few units deployed so far
Batch System • ZEUS uses LSF 4.1 as the underlying batch system, with a custom front end & user interface [H1: PBS] • originally introduced to integrate different batch systems (NQS, LSF) • Each job is executed in its own spool directory • no conflicts between several parallel jobs of the same user • User can specify required resources (e.g. “SuSE 7.2 operating system only”) • Our LSF 4.1 scheduler uses the fair share policy • ensures that occasional users also get their fair share of the system • no hard queue limits needed (such as #jobs per user and queue) • “power users” can take ~unlimited resources when the system has capacity to spare [plot: priority vs. history]
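The fair-share idea is easy to picture with a minimal sketch (illustrative only; LSF's actual formula and parameters are internal to the scheduler, and the numbers below are made up): dispatch priority falls with recently accumulated usage, and the usage record decays over time, so occasional users overtake heavy users when they submit.

```c
/* Minimal fair-share sketch (illustrative only, not LSF's internal formula):
 * a user's dispatch priority falls with recently accumulated CPU usage,
 * and the usage record decays over time, so occasional users are not
 * starved by heavy users.  Compile with:  cc fairshare.c -lm */
#include <stdio.h>
#include <math.h>

struct user_share {
    const char *name;
    double assigned_share;   /* configured share of the system */
    double recent_cpu_hours; /* decayed record of recent usage */
};

/* Higher value = dispatched first. */
static double fairshare_priority(const struct user_share *u)
{
    return u->assigned_share / (1.0 + u->recent_cpu_hours);
}

/* Exponential decay of the usage history between scheduling passes. */
static void decay_usage(struct user_share *u, double hours, double half_life)
{
    u->recent_cpu_hours *= pow(0.5, hours / half_life);
}

int main(void)
{
    struct user_share heavy = { "power_user", 1.0, 500.0 };
    struct user_share light = { "occasional", 1.0,   2.0 };

    printf("%s: %.4f\n", heavy.name, fairshare_priority(&heavy));
    printf("%s: %.4f\n", light.name, fairshare_priority(&light));

    decay_usage(&heavy, 24.0, 12.0);  /* one day later, 12 h half-life */
    printf("%s after decay: %.4f\n", heavy.name, fairshare_priority(&heavy));
    return 0;
}
```

With these toy numbers the occasional user starts with the higher priority, and the power user's priority recovers only as the usage history decays, which is exactly why no hard per-user queue limits are needed.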
ZEUS Monitoring • Efficient monitoring is key to reliable operation of a complex system • Three independent monitoring systems introduced in ZEUS Computing during the shutdown: • LSF-embedded monitoring • statistics on the time each job spends in queued/running/system-suspended/user-suspended state • quantitative information for queue optimization, etc. • SNMP • I/O traffic and CPU efficiency • web interface • history • NetSaint, now called Nagios • availability of various services on various hosts • notification • automated trouble-shooting
Example of SNMP-based Monitoring [plots]: ~90% CPU efficiency, 1-3 MB/s input rate
NetSaint Monitoring system • Hosts, network devices, services (e.g. web server), disk space,… • thresholds configurable • Web interface • Notification (normally Email, if necessary SMS to cellular phone) • History
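The heart of such a service check is small. The sketch below is not NetSaint code; it is a hypothetical probe (host and port are placeholders) that simply tries to open a TCP connection to a service and reports the result, using the plugin exit-code convention of 0 = OK and 2 = CRITICAL.

```c
/* Minimal sketch of a NetSaint/Nagios-style TCP service check
 * (hypothetical host/port; real plugins add timeouts, warning levels
 * and performance data).  Exit code 0 = OK, 2 = CRITICAL. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "www.example.org";  /* placeholder */
    const char *port = (argc > 2) ? argv[2] : "80";

    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        printf("CRITICAL - cannot resolve %s\n", host);
        return 2;
    }
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        printf("CRITICAL - %s:%s not reachable\n", host, port);
        freeaddrinfo(res);
        return 2;
    }
    printf("OK - %s:%s accepts connections\n", host, port);
    close(fd);
    freeaddrinfo(res);
    return 0;
}
```

A scheduler that runs such probes periodically, records the history, and sends email or SMS on a state change is essentially what NetSaint provides out of the box.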
Reliability Issues • Tight monitoring of system is one key to reliability, but... • Typical analysis user needs to access huge amounts of data • In large systems, there will always be a certain fraction of • servers which are down or unreachable • disks which are broken • files which are corrupt • It is hopeless to operate a large system on the assumption that everything is always working • this is even more true for commodity hardware • Ideally, the user should not even notice that a certain disk has died, etc • jobs should continue
The ZEUS Event Store [flow]: user query (“Gimme NC events, Q2>1000, at least one D* candidate with pT>5”) → tag database (Objectivity/DB 7.0) → generic filename & event address (MDST3.D000731.T011837.cz) → filename de-referencing (/acs/zeus/mini/00/D000731/MDST3.D000731.T011837.cz) → to I/O subsystem
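The de-referencing step in this example amounts to building a path from pieces of the generic file name. The helper below is a hypothetical sketch that reproduces only the mapping visible above (year digits and day directory taken from the name); it is not the actual ZEUS code, which may carry further rules.

```c
/* Sketch of the filename de-referencing visible in the example above:
 *   MDST3.D000731.T011837.cz -> /acs/zeus/mini/00/D000731/MDST3.D000731.T011837.cz
 * The year ("00") and day directory ("D000731") are taken from the generic
 * file name.  Hypothetical helper, not the actual ZEUS implementation. */
#include <stdio.h>
#include <string.h>

static int dereference(const char *generic, char *path, size_t len)
{
    /* Expect "<class>.D<yymmdd>.T<hhmmss>.<ext>" */
    const char *d = strstr(generic, ".D");
    if (!d || strlen(d) < 9)
        return -1;
    char daydir[8] = {0};                       /* "D000731" */
    memcpy(daydir, d + 1, 7);
    char year[3] = { daydir[1], daydir[2], 0 }; /* "00" */
    snprintf(path, len, "/acs/zeus/mini/%s/%s/%s", year, daydir, generic);
    return 0;
}

int main(void)
{
    char path[256];
    if (dereference("MDST3.D000731.T011837.cz", path, sizeof path) == 0)
        printf("%s\n", path);  /* /acs/zeus/mini/00/D000731/MDST3.D000731.T011837.cz */
    return 0;
}
```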
Addressing Events on Datasets [flowchart]: Disk copy existing? Disk copy still valid? Server up? — if no to any of these, select a disk cache server & stage the file from the mass storage system; if yes, establish an RFIO connection to the file and analyze the event
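Read as code, the flowchart is a simple fallback: stage a disk copy whenever there is no usable one, then open it via RFIO. The sketch below uses trivial stub helpers in place of the real cache-selection, staging and RFIO calls; only the decision structure follows the diagram.

```c
/* Pseudocode-level sketch of the access flow above.  The helpers are stubs
 * standing in for the real cache-selection, staging and RFIO calls; only
 * the decision structure follows the flowchart. */
#include <stdio.h>

static int staged = 0;                          /* toy state: disk copy present? */

static int disk_copy_exists(const char *p)      { (void)p; return staged; }
static int disk_copy_valid(const char *p)       { (void)p; return staged; }
static int cache_server_up(const char *p)       { (void)p; return 1; }
static const char *select_cache_server(const char *p) { (void)p; return "cache01"; }
static void stage_from_mass_storage(const char *srv, const char *p)
{
    printf("staging %s via %s\n", p, srv);
    staged = 1;
}
static int rfio_open_file(const char *p)        { printf("rfio open %s\n", p); return 3; }

/* Decision structure of the flowchart: stage if there is no usable disk
 * copy (missing, stale, or its server down), then open via RFIO. */
static int open_event_file(const char *path)
{
    if (!disk_copy_exists(path) || !disk_copy_valid(path) || !cache_server_up(path)) {
        const char *server = select_cache_server(path);
        stage_from_mass_storage(server, path);
    }
    return rfio_open_file(path);                /* ready to analyze events */
}

int main(void)
{
    int fd = open_event_file("/acs/zeus/mini/00/D000731/MDST3.D000731.T011837.cz");
    printf("fd = %d\n", fd);
    return 0;
}
```

The point of this structure is that a dead disk or server only triggers another staging pass; the analysis job itself does not need to know where the file physically lives.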
“The mass storage system is fundamental to the success of theexperiment” – Ian Bird at CHEP01 in Beijing
Cache within a cache within a cache — classical picture [diagram]: CPU → primary cache → 2nd level cache → memory cache → disk controller cache → disk → tape library, with access times ranging from 10^-9 s (CPU/cache) through 10^-3 s (disk) to 10^2 s (tape library)
All-Cache Picture • Disk files are only cached images of files in the tape library • Files are accessed via a unique path, regardless of server name etc. • optimized I/O protocols [diagram: CPU → main board cache → 2nd level cache → memory cache → disk controller cache → fabric → disk cache → tape library]
dCache (cont’d) • Mass storage I/O subsystem should provide • transparent access to disk & tape data • smart caching of tape datasets • efficient I/O transfer protocol • Idea of dCache: distributed, centrally maintained system, joint DESY/FNAL development [timeline 1997-2002: ZEUS: tpfs → SSF → dCache; H1: (none) → (none) → dCache; HERA-B: (none) → (none) → dCache; FNAL: [GRID] dCache]
dCache Features • Optimised usage of tape robot by coordinated read and write requests (read ahead, deferred writes) • Better usage of network bandwidth by exploring the best location for data • Ensure efficient usage of available resources • Robot, drives, tapes, server resources, cpu time • Minimize the service downtime due to hardware failure • Monitored by DESY-IT operator • No NFS access to disk pools required - access proceeds via the dcap API (dc_open, dc_read, dc_write, …) • Particularly intriguing features: • retry-feature during read access – job does not crash even if file or server become unavailable (as already in ZEUS-SSF) • “Write pool” could be used by online chain (reduces #tape writes) • reconstruction could read RAW data directly from disk pool (no staging)
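Since the dcap calls mirror their POSIX counterparts, a minimal client might look like the sketch below (assuming libdcap's POSIX-like signatures and an illustrative pnfs path; the retry-on-failure behaviour mentioned above lives inside the library, not in the application).

```c
/* Minimal dcap read sketch.  Assumes libdcap's POSIX-like calls
 * (dc_open/dc_read/dc_close) and an illustrative pnfs path.
 * Compile with:  cc read_dcap.c -ldcap */
#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <dcap.h>

int main(void)
{
    const char *path = "/pnfs/desy.de/zeus/some/dataset/file.cz";  /* illustrative */
    char buf[65536];

    int fd = dc_open(path, O_RDONLY);
    if (fd < 0) {
        perror("dc_open");
        return 1;
    }

    ssize_t n, total = 0;
    while ((n = dc_read(fd, buf, sizeof buf)) > 0)
        total += n;                 /* process the event data here */

    dc_close(fd);
    printf("read %zd bytes via dcap\n", total);
    return 0;
}
```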
dCache Perspectives • dCache has been jointly developed by DESY & FNAL • DESY uses OSM as underlying HSM system, FNAL ENSTORE • Experiments using dCache: • ZEUS • H1 • HERA-B • CDF • MINOS • SDSS • CMS • GRID relevance • dCache is an integral part of a Java-based GridFTP server, completed & announced last week • successful inter-operation with globus-url-copy client http://www-dcache.desy.de
Future: Will We Continue To Use Tapes? • Tape • 100 $ per cartridge (200 GB), 5000 cartridges per silo • 100 k$ per silo • 30 k$ per drive (typical number: 10) • 0.9 $ / GB • Disk • 8 k$ per DELFI2 Server (1 TB) • 8 $ / GB (from V. Gülzow) Yes !
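A quick cross-check of the quoted per-GB figures (assuming one fully loaded silo with the typical 10 drives): tape comes to (5000 x 100 $ + 100 k$ + 10 x 30 k$) / (5000 x 200 GB) = 900 k$ / 1,000,000 GB ≈ 0.9 $/GB, while disk is 8 k$ / 1000 GB = 8 $/GB — so tape remains almost an order of magnitude cheaper per byte.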
General Lab Culture • A high degree of consensus between providers & users is essential • At DESY, communication proceeds through • Computer Users Committee (daily business) • Computing Review Board (long-range planning, projects) • Computer Security Council (security issues) • Network Committee (networking issues) • Topical meetings • Linux Users Meeting • Windows Users Meeting • ... • direct communication between experiments' offline coordinators & IT • CUC & CRB are chaired by members of the physics community
Summary • Only a glimpse of some facets of computing at DESY • Commodity equipment gives unprecedented power, but requires a dedicated fabric to work reliably