From Bytes to Physics: Computing in the HERA-II Era
Rainer Mankel, DESY Hamburg
Outline
• Challenges in HERA-II Computing
• Scaling Compute Nodes & File Servers
• Mass Storage
• Application Software: a Modern Event Display
• Collaborative Tools: the ZEMS System
Computing of a HERA Experiment: ZEUS
• General purpose experiment at the ep collider HERA (e 27 GeV, p 920 GeV)
• About 450 physicists
• HERA-II run targets a 4-5 fold increase of luminosity compared to HERA-I
[Event display: neutral current event from the 2002 run]
Physics Drives Computing
• For a total HERA-II integrated luminosity of 1 fb⁻¹ per experiment, the total data size for all experiments will approach the petabyte scale
Computing Challenge of a HERA Experiment (ZEUS)
• O(1 M) detector channels, 50-200 M events/year
• Tape storage increase: 20-40 TB/year
• Disk storage: 3-5 TB/year
• ~450 users
• Workflows: MC production, data processing/reprocessing, data mining, interactive data analysis
Computing Challenges (Cont'd)
• Efficient analysis with access to hundreds of TB
• Reconstruction right after data-taking (DQM)
• Excellent scalability
• High availability
• Limited maintenance effort
Transition to Commodity Hardware
Vendor system (SGI, IBM, ...):
• Hardware made for professional use
• Performance goals far beyond "normal household"
• Components fit together
• Reputation at stake
• Service
• Price
Commodity system (PC):
• Hardware made for home or small business use
• Performance goals set e.g. by the video games industry
• Usually no vendor guarantee for the whole system
• Needs local support
• Price
Commodity (cont'd)
Lesson learned from HERA-I:
• use of commodity hardware and software gives access to enormous computing power, but
• much effort is required to build reliable systems
In large systems, there will always be a certain fraction of
• servers which are down or unreachable
• disks which are broken
• files which are corrupt
It is impossible to operate a complex system on the assumption that normally all systems are working; this is even more true for commodity hardware. Ideally, the user should not even notice that a certain disk has died, etc.; jobs should continue.
Compute Server Nodes
Batch Farm Nodes are Redundant
• Each individual farm node has the same functionality
• Not critical if ~3 nodes out of 100 are down
• Allows the use of "cheap" PCs
• Jobs are distributed by a scheduler (LSF 4.1) to the functional nodes, bypassing currently broken nodes
Present installation:
• 50 Intel P-III processors (650 + 800 MHz) for reconstruction
• 40 Intel P-III (1 GHz) and 40 Intel Xeon (2.2 GHz) for analysis
Development of Reconstruction Power
[Plot comparing the throughput of the old farm, the new farm, and the new farm after tuning (2000); reaching 2 M events/day]
New: 19" Rack-Mounted CPU Servers
• 1U dual-Xeon servers provide a very dense concentration of CPU power (2 x 2.4 GHz per node)
• Installation by memory stick
• Issues to watch: power consumption, heat / cooling
[Photo: H1 farm]
File Servers
• Originally, a clear separation between desktop use (EIDE disks) and server use (SCSI, FibreChannel)
• Performance & redundancy benefit greatly from the RAID technique
• Example: RAID5 = striping with dynamically assigned parity disk (see the parity sketch below)
• failure of one disk can be compensated
• no individual disk can become a bottleneck
• The advent of fast RAID controllers for EIDE disks opened the path to commodity (EIDE) disks for server use
• Example: 3ware Escalade 78xx series controller, largely in use at DESY
• up to 12 disks per controller, up to three controllers per box
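To illustrate the RAID5 idea, here is a minimal, hypothetical C++ sketch (not ZEUS or controller firmware code): the parity block is the XOR of the data blocks, so the content of any single failed disk can be recomputed from the surviving blocks.

  #include <cstddef>
  #include <vector>
  #include <iostream>

  // XOR-combine the blocks of all surviving disks to rebuild the missing one.
  // In RAID5 the parity block itself is just another block in this XOR sum.
  std::vector<unsigned char> rebuild(const std::vector<std::vector<unsigned char>>& surviving,
                                     std::size_t blockSize) {
    std::vector<unsigned char> lost(blockSize, 0);
    for (const auto& block : surviving)
      for (std::size_t i = 0; i < blockSize; ++i)
        lost[i] ^= block[i];
    return lost;
  }

  int main() {
    const std::size_t blockSize = 4;
    std::vector<unsigned char> d0 = {1, 2, 3, 4};
    std::vector<unsigned char> d1 = {5, 6, 7, 8};
    // Parity block written when the stripe is created: p = d0 ^ d1.
    std::vector<unsigned char> p = rebuild({d0, d1}, blockSize);
    // Disk 1 fails: recover its block from d0 and the parity block.
    std::vector<unsigned char> d1again = rebuild({d0, p}, blockSize);
    std::cout << (d1again == d1 ? "recovered" : "mismatch") << "\n";
  }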
Commodity File Servers
• Affordable disk storage decreases the number of accesses to the tape library
• ZEUS disk space (event data):
   begin 2000: 3 TB (fraction FC+SCSI: 100%)
   mid 2001: 9 TB (67%)
   mid 2002: 18 TB (47%)
• The necessary growth is only possible with commodity components
• Commodity components need a "fabric" to cope with failures
[Photos: file servers with 22 disks (up to 2.4 TB) and 12 disks (up to 2 TB); DELFI2 (2002-?), DELFI3 (2000-2002)]
Scaling Aspects
• The classical picture does not scale [diagram: 1997-2002 data volumes leading to server congestion and disk congestion]
• A sophisticated mass storage concept is needed to avoid bottlenecks
Mass Storage: Should We Still Use Tapes?
• Tape: 100 $ per cartridge (200 GB), 5000 cartridges per silo; 100 k$ per silo; 30 k$ per drive (typical number: 10) → 0.9 $/GB
• Disk: 8 k$ per DELFI2 server (1 TB) → 8 $/GB
• Both media benefit from advances in magnetic write density; the relation is not expected to change very much in the future (note: prices change quickly...)
• But: ideally the physicist should not notice that he is reading a tape
(a short cost check follows below)
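As a check on the quoted per-GB figures, straightforward arithmetic on the numbers above for one full silo:

  \[
  \frac{100\,\mathrm{k\$} + 10\times 30\,\mathrm{k\$} + 5000\times 100\,\$}{5000\times 200\,\mathrm{GB}}
    = \frac{900\,\mathrm{k\$}}{10^{6}\,\mathrm{GB}} \approx 0.9\,\$/\mathrm{GB},
  \qquad
  \frac{8\,\mathrm{k\$}}{1\,\mathrm{TB}} = 8\,\$/\mathrm{GB}.
  \]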
Mass Storage
• DESY-HH uses 4 STK Powderhorn tape libraries (connected to each other)
• New media offer 200 GB per cartridge (instead of 20 GB)
• Loading times will increase
• Middleware is important
How a (ZEUS) Physicist Accesses Data
• Query: "Gimme DIS events, Q² > 1000, at least one D* candidate with pT > 3"
• The tag database (Objectivity/DB 7.0) returns a generic filename & event address: MDST3.D000731.T011837.cz
• Filename de-referencing maps this to /acs/zeus/mini/00/D000731/MDST3.D000731.T011837.cz, which is passed to the I/O subsystem
Addressing Events on Datasets (mass storage system)
• Disk copy existing? If not, select a disk cache server & stage the file
• Disk copy still valid? Server up? If not, stage again
• Then establish an RFIO-like connection to the file and analyze the event
(a sketch of this access logic follows below)
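A minimal, hypothetical C++ sketch of this decision flow; all helper names (diskCopyUsable, stageToCacheServer, openRfioLike) are invented for illustration and are not the actual ZEUS or dCache API, whose real logic lives inside the middleware.

  #include <string>
  #include <stdexcept>
  #include <iostream>

  // Placeholder helpers standing in for the tag-DB / dCache machinery (invented names).
  bool diskCopyUsable(const std::string&) { return false; }            // exists, valid, server up?
  void stageToCacheServer(const std::string& f) { std::cout << "staging " << f << "\n"; }
  int  openRfioLike(const std::string&) { return 42; }                 // pretend connection handle

  // Decision flow of the slide: make sure a usable disk copy exists, then open it;
  // on trouble, stage onto a working cache server and retry instead of failing the job.
  int openEventFile(const std::string& file, int maxRetries = 3) {
    for (int attempt = 0; attempt < maxRetries; ++attempt) {
      if (!diskCopyUsable(file)) stageToCacheServer(file);
      try {
        return openRfioLike(file);     // the physicist never handles the tape directly
      } catch (const std::exception& e) {
        std::cerr << "retry after error: " << e.what() << "\n";
      }
    }
    throw std::runtime_error("could not open " + file);
  }

  int main() {
    int handle = openEventFile("/acs/zeus/mini/00/D000731/MDST3.D000731.T011837.cz");
    std::cout << "analyzing events via handle " << handle << "\n";
  }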
Classical Picture: a cache within a cache within a cache
• CPU → primary cache → 2nd level cache → memory cache → disk controller cache → disk → tape library
• Access times range from ~10⁻⁹ s at the CPU, via ~10⁻³ s at the disk, to ~10² s at the tape library
dCache Picture
• Disk files are only cached images of files in the tape library
• Files are accessed via a unique path, regardless of server name etc.
• Optimized I/O protocols
• The hierarchy is extended: CPU → main board cache → 2nd level cache → memory cache → disk controller cache → fabric disk cache → tape library
dCache
• High performance mass storage middleware for all DESY groups
• has replaced ZEUS tpfs (1997) and SSF (2000)
• joint development of DESY and FNAL (also used by CDF, MINOS, SDSS, CMS)
• Optimised usage of the tape robot by coordinated read and write requests (read ahead, deferred writes)
• allows building large, distributed cache server systems
Particularly intriguing features:
• retry feature during read access: the job does not crash even if a file or server becomes unavailable (as already in ZEUS-SSF)
• "write pool" used by the online chain (reduces the number of tape writes)
• reconstruction reads RAW data directly from the disk pool (no staging)
• grid-enabled
• randomized distribution of files across the distributed file servers: avoids hot spots and congestion, scalable (see the placement sketch below)
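To illustrate the randomized-placement idea only (this is not the actual dCache pool-selection algorithm, which also considers pool cost and free space), a toy C++ sketch: each new file goes to a randomly chosen cache server that is currently up, so no single server becomes a hot spot.

  #include <random>
  #include <string>
  #include <vector>
  #include <stdexcept>
  #include <iostream>

  struct CacheServer { std::string name; bool up; };

  // Pick a random server among those currently alive (toy version of pool selection).
  const CacheServer& pickServer(const std::vector<CacheServer>& pools, std::mt19937& rng) {
    std::vector<const CacheServer*> alive;
    for (const auto& s : pools)
      if (s.up) alive.push_back(&s);
    if (alive.empty()) throw std::runtime_error("no cache server available");
    std::uniform_int_distribution<std::size_t> pick(0, alive.size() - 1);
    return *alive[pick(rng)];
  }

  int main() {
    std::mt19937 rng(std::random_device{}());
    // Hypothetical pool list, loosely named after the monitoring plot on the next slide.
    std::vector<CacheServer> pools = {{"bigos01", true}, {"bigos02", true},
                                      {"bigos03", false}, {"bigos04", true}};
    for (int i = 0; i < 5; ++i)
      std::cout << "file" << i << " -> " << pickServer(pools, rng).name << "\n";
  }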
I/O Performance with Distributed Cache Servers
[Monitoring plots for the cache servers doener and bigos01-bigos07: about 8 MB/s reading (analysis output) versus 0.2 MB/s staging (input), i.e. a large cache hit rate]
dCache layering: tape library ↔ HSM ↔ distributed cache servers (dCache) ↔ analysis platform ↔ physicists
• Built-in redundancy at all levels
• On failure of a cache server, the requested file is staged automatically to another server
Simulation: the Funnel System
• Automated distributed production
• Production reached 240 M events in 2002 (status: Nov 2002)
• Funnel is an early computing grid
Event Visualization in the Era of the Grid
The ZeVis event display
Application Software: New ZEUS Event Display (ZeVis = Zeus Visualization)
• During the HERA luminosity upgrade, new detector components have been installed which require extended 3D capabilities
• Example: micro-vertex detector (MVD), straw tube tracker (STT)
• Needed a non-monolithic solution which includes the analysis software as a modular component
• The client/server approach allows a clear separation:
• lightweight client (purely C++, no experiment libraries)
• server application (encapsulates legacy code)
Integration of 2D/3D Views
• Perspective views with hidden surface removal are useful for understanding the geometry, but do not show the event well
• Straight-forward "projections" of the 3D representation can be complex & confusing
• Layered projections with proper ordering make it possible to really analyze the event
Architecture
• ZClient: main client application & GUI
• ZEvtBuild: ZEUS event to ROOT conversion
• ZGeomBuild: ZEUS geometry to ROOT conversion
• ZesAgent: event agent for the server
• ZevLib: detector objects
• ZevLibX: ZEUS DDL interfaces (to other ZEUS software)
• ZClient is a lightweight client application which only depends on ZevLib and ROOT
Data Model
• Clear separation of geometry and event information to minimize the amount of data transfer
Geometry: only needs to be loaded once (or a few times at most)
• optimized for fast drawing; all ZEUS geometry classes derive from the corresponding shapes in ROOT
• geometry can draw itself independently
• present size 450 kB
Event: many ROOT files to be transferred & stored (cached)
• file contains a tree of event objects
• objects contain drawable event information (e.g. track objects); they refer to geometry information during drawing and cannot be drawn independently
• optimized for size; data structure similar to the ZEUS miniDST
• about 30 kB/evt
Implementation With ROOT
• Parallel implementation of the 3D display and the layered 2D projections is achieved by overloading the Paint() member function:
• in 3D mode, use the Paint() function of the basic ROOT class
• in 2D mode, the 3D object contains a list of 2D-representation objects, and Paint() draws this list according to the desired projected shape
• Fish-eye view in the XY and ZR projections allows simultaneous inspection of the inner (MVD) and outer detector (muon system); achieved by overloading the TView::WCtoNDC function
• Extensive use of picking functionality and "describe object below mouse pointer"
• Extension of the TWebFile class to integrate feedback from the server
• For the control room, a special mode periodically loads the next online event
(a schematic Paint() dispatch is sketched below)
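A schematic C++ sketch of the Paint() dispatch described above; the class and member names (ZevTube, fShapes2D, gZevProjection) are invented for illustration and are not the actual ZeVis classes. It only assumes a ROOT shape base class such as TTUBE and the virtual TObject::Paint() interface.

  #include "TTUBE.h"
  #include "TObjArray.h"

  enum class ZevProjection { k3D, kXY, kZR };
  static ZevProjection gZevProjection = ZevProjection::k3D;   // chosen from the GUI

  class ZevTube : public TTUBE {
  public:
    ZevTube(const char* name, Float_t rmin, Float_t rmax, Float_t dz)
      : TTUBE(name, name, "void", rmin, rmax, dz) { fShapes2D.SetOwner(kTRUE); }

    // The same geometry object serves both display modes.
    void Paint(Option_t* option = "") override {
      if (gZevProjection == ZevProjection::k3D) {
        TTUBE::Paint(option);                 // 3D: let the ROOT base class draw the shape
      } else {
        // 2D: draw the pre-built list of projected representations (layered, proper order)
        TIter next(&fShapes2D);
        while (TObject* shape = next()) shape->Paint(option);
      }
    }

  private:
    TObjArray fShapes2D;   // 2D-representation objects for the layered projections
  };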
Client-Server Motivation
• The event display needs access to the experiment's event store (tag DB, dCache) and to the online system
• Idea: separation of
• a portable lightweight client, which can run wherever ROOT can run
• a central geometry & event server on the DESY site (1 dual-Xeon 2 GHz, with access to mass storage)
• connected via the HTTP protocol (ROOT's TWebFile) to pass the firewall
• Clients run in the control room (LAN), in on-site offices (LAN) and off-site / at home (WAN); the server sits in the DESY computer center
(a client-side sketch follows below)
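A minimal sketch of what the HTTP transport amounts to on the client side. The URL and the object name "ZevGeometry" are invented for illustration; TWebFile is the ROOT class named on these slides for reading ROOT files over HTTP.

  #include "TWebFile.h"
  #include "TObject.h"
  #include <iostream>

  void fetchGeometry() {
    // The geometry file (~450 kB) only has to be downloaded once per session.
    TWebFile geom("http://zevis-server.example.desy.de/geometry.root");   // hypothetical URL
    if (geom.IsZombie()) { std::cerr << "cannot reach ZeVis server\n"; return; }
    TObject* g = geom.Get("ZevGeometry");   // hypothetical top-level geometry object
    if (g) std::cout << "geometry loaded: " << g->GetName() << "\n";
  }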
Internal Structure of The ZeVis Server
• Apache + ROOT module + PHP4; a dispatcher (PHP4) routes requests to the backends:
• geometry request → geometry builder (constants DB) → static geometry files
• single event request → tag DB → event store → event agents (ROOT conversion)
• online event request → online reconstruction → online events
• EOTW request → "events of the week" (PR) files
Components of The ZeVis Server
Single event server:
• random access to events from the ZEUS event store and conversion to ROOT format on the fly
• normally based on run/event number, optionally with a query on about 300 tag variables per event (both modes involve the tag database, Objectivity/DB 7.0)
• staging handled by the dCache system if necessary
• for low latency, the "event agents" are pre-initialized and wait for requests passed to them (daemon fashion); 10 event agent processes per software release (pro + dev), 20 in total
Online event server:
• the online system splits RAW events off the event stream at about 0.1 Hz
• on-the-fly reconstruction and ROOT conversion on a dedicated server
Geometry and propaganda events:
• static ROOT files on the server, produced offline
• download of the geometry file takes << 1 s
(a toy agent loop is sketched below)
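To make the "pre-initialized agents waiting in daemon fashion" idea concrete, a toy C++ sketch (invented, not the ZesAgent code): the expensive initialization happens once at startup, after which the agent only blocks on a request queue and answers each request quickly.

  #include <iostream>
  #include <queue>
  #include <string>

  // Toy stand-ins for the costly parts: loading experiment libraries, opening
  // the tag-DB connection, etc. (invented names, not the real ZesAgent API).
  void initializeOnce()                             { std::cout << "init libraries + DB (slow)\n"; }
  std::string convertToRoot(const std::string& req) { return req + ".root"; }

  int main() {
    initializeOnce();                       // done once, before any request arrives
    std::queue<std::string> requests;       // stands in for the dispatcher's request channel
    requests.push("run 42763 event 1001");
    requests.push("run 42763 event 1002");
    while (!requests.empty()) {             // daemon-style loop: wait, serve, repeat
      std::string req = requests.front(); requests.pop();
      std::cout << "served " << convertToRoot(req) << " with low latency\n";
    }
  }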
Server Performance
• Less than 2 s average response time (without server cache), even faster on cached events
• This includes event analysis and conversion to ROOT format, but excludes the network
Stress test:
• In reality, simultaneous requests will hardly occur at all
• The server can stand at least 4 requests in parallel without noticeable performance loss
Graphical User Interface
[Annotated screenshot: input modes, zoom controls, option tabs, status information, canvas pads, object information]
GUI: Single Event Server
GUI: Download file mode
GUI: Online Mode
Projections XY and RZ
• These are the main views for event analysis
• Projections complement the geometry of the detector
Fish-eye View
• Allows details in the vertex detector to be seen simultaneously with the outer detectors
• Also available in the ZR view
MVD Display
• MVD clusters
• ROOT allows a 3D view within a pad (wire frame only)
Drift Chamber Raw Hits
• Normally, the CTD display shows assigned hits, taking the drift distance into account
• A special mode shows the cell structure and raw hits
Muon Chambers and Hits
[Display of muon hits; not yet finalized]
Collaborative Tools: the ZEMS System
• Electronic slide presentations have become a standard
• Remote participation at meetings is indispensable in the large collaborations of the HEP community
• However:
• too much time is spent on meeting organization, cabling notebooks, collecting & archiving files
• capturing the speaker and slides in a video stream is not trivial & often unsatisfactory
• sharing documents between all meeting participants is not simple and often creates unexpected problems
What a Collaboration Needs
Presentation:
• excellent slide reproduction for hundreds of listeners
• support for people at other labs who cannot attend the meeting in person
• access to remote resources in the auditorium
Meeting organization:
• simple, comfortable interface for agenda generation
• delegation of sub-tasks (e.g. to session chairmen)
Slides submission:
• easy, fast method for slides submission
• last-minute changes, additional material for discussion
• access control list
Interaction between presentations:
• summary speakers collect material from previous presentations
ZEMS = Zeus Electronic Meeting Management System
• Integrated system based on web technology
• Strong emphasis on security
Components
[Architecture diagram: clients 1…N connect via Apache (document root) to the repository disk]
Delegation of privileges