BaBar Storage at Lyon. HEPIX and Mass Storage SLAC, California, U.S.A. 8 October 1999. Rolf Rumler, John O’Neall, Philippe Gaillardon, Internal Group IN2P3 Computing Center Villeurbanne, France URL http://www.in2p3.fr/CC. BABAR Experiment.
BaBar Experiment
• High-energy-physics experiment, started in July at SLAC.
• The IN2P3 Computing Center is the “mirror” computing site for BaBar computing.
• We will receive a copy of (almost) all BaBar data.
• We will also produce simulated data, which will be stored as well as sent to SLAC.
• Estimated data rate is on the order of 350 TB per year.
• SLAC has chosen HPSS to store this data; the CCIN2P3 is following their example.
• Our initial goal is to do the same thing as SLAC for BaBar.
• File sizes: about 2 GB or more.
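To put the 350 TB per year estimate in perspective, the short calculation below converts it into an average sustained rate and an approximate file count. It is only a back-of-the-envelope sketch: it assumes decimal units and a perfectly even data flow over the year, neither of which is stated on the slide.

```python
# Back-of-the-envelope sizing from the figures on the slide.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9, 1 MB = 10**6).
TB = 10**12
GB = 10**9
MB = 10**6

yearly_volume_bytes = 350 * TB        # estimated yearly data volume
typical_file_bytes = 2 * GB           # files are about 2 GB or larger
seconds_per_year = 365 * 24 * 3600

# Average rate if the data arrived evenly over the whole year (an assumption).
average_rate_mb_s = yearly_volume_bytes / seconds_per_year / MB

# Upper bound on the number of files, since files may be larger than 2 GB.
max_files_per_year = yearly_volume_bytes // typical_file_bytes

print(f"Average sustained rate: {average_rate_mb_s:.1f} MB/sec")
print(f"At most {max_files_per_year} files per year")
```

Under these assumptions the average works out to roughly 11 MB/sec, sustained around the clock, which is a useful yardstick for the performance figures later in the talk.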
How it works
[Diagram not recoverable from the extracted text. Labels that survive: file, file.lock, control, data, Objectivity, amshpss, ooss_Mig, ooss_Pur, ooss_Stage, HPSS, (pftp). Arrow legend: C = Creation, L = Lecture (read), M = Migration, P = Purge, R(1) to R(3) = Recovery.]
HPSS Configuration
• For the moment, BaBar only ==> like SLAC
• One single Storage Class in one single COS
• Tape only: Storagetek Redwoods; 9840 and MAGSTARs under study
• No mirroring
• All access to data via pftp_client
• Additional tools from SLAC (Andy Hanushevsky)
Objectivity Configuration Summary
• 1 SUN E4500 (4 CPUs) + 2 SUN A3500, in total about 1.1 TB RAID 5, under Veritas VM/FS, holding actual BaBar data
• 1 SUN E4500 + 2 SUN A3500 as above, no data yet
• 1 SUN E450 (4 CPUs) linked to IBM VSS disk space, about 400 GB RAID 5, with Veritas: tests starting next week
• Intention: to have different Objectivity servers for different types of data
HPSS Core Server
• RS/6000 F50
• 4 CPUs, 1 GB memory
• 2 x 4.5 GB mirrored system disks
• 24 GB internal SSA disks for SFS (mirrored)
• AIX 4.3.2
• Ethernet (control network)
• DCE, Encina, SAMMI
• OMI driver for Redwoods
• Access to Storagetek ACL by ACSLS
HPSS Movers
• Preliminary configuration, while waiting for the choice of the best machine to use with Gigabit Ethernet; we also lack a BaBar usage profile
• (Historical problem: we changed from ATM to high-speed Ethernet just as HPSS was arriving)
• RS/6000 390, replacement under study (43P260?)
• 1 CPU, 256 MB memory
• 2 x 4.5 GB mirrored system disks
• AIX 4.3.2
• Ethernet control network, Fast Ethernet data network
Performance
• Reminder: temporary mover/network configuration
• Performance limited by:
  • Fast Ethernet data path (100 Mbps ==> < 8 MB/sec)
  • Mover CPUs: ~ 50 % occupied
• Single transfer: ~ 5 MB/sec per tape
• Overall rate per tape is lower because of cartridge mount and positioning time: ~ 3.5 MB/sec
• Aggregate maximum transfer rate: > 16 MB/sec (write), ~ 3 MB/sec (read)
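The gap between the per-tape transfer rate (~ 5 MB/sec) and the lower rate once mount and positioning are included (~ 3.5 MB/sec) can be turned into an approximate per-file overhead. The sketch below assumes that one file of about 2 GB is transferred per mount, which the slide does not state, so the result is only indicative.

```python
# Rough arithmetic on the performance figures.
# Assumptions: decimal units, and one ~2 GB file transferred per cartridge mount.
link_mbps = 100                          # Fast Ethernet data path
raw_link_mb_s = link_mbps / 8            # 12.5 MB/sec before any protocol overhead

file_mb = 2000                           # assumed file size, ~2 GB
streaming_rate_mb_s = 5.0                # rate while data is flowing to or from tape
effective_rate_mb_s = 3.5                # rate including mount and positioning

streaming_time_s = file_mb / streaming_rate_mb_s   # time spent moving data
total_time_s = file_mb / effective_rate_mb_s       # time including overhead
overhead_s = total_time_s - streaming_time_s       # implied mount + positioning time

print(f"Raw link capacity: {raw_link_mb_s} MB/sec")
print(f"Implied overhead per file: {overhead_s:.0f} s")
```

With these inputs the implied mount and positioning overhead is a little under three minutes per file; a different file size or several files per mount would change that figure.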
Particular problem: Tape errors
• HPSS and Redwood cartridges do not seem to work well together, at least with our test usage pattern, especially for random reading of ~ 2-GB files.
• Redwoods need regular maintenance (every 100 hours or less) ==> it needs to be scheduled. We need statistics from the controllers.
• Need effective maintenance from Storagetek.
• Need tools to monitor volume and drive errors.
• Need HPSS to react automatically to volume and drive errors. (Examples: when a cartridge cannot be dismounted, HPSS keeps trying indefinitely; drive errors during writing can turn a drive into a “black hole”.)
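The kind of monitoring tool asked for above could start as a simple tally of errors per volume and per drive. The sketch below is hypothetical throughout: the record layout, the identifiers and the thresholds are illustrative assumptions, not the format of any HPSS or Storagetek log.

```python
from collections import Counter

def flag_suspects(error_records, volume_threshold=3, drive_threshold=3):
    """Return (volumes, drives) whose error counts reach the given thresholds.

    error_records is a list of dicts with hypothetical keys "volume" and "drive".
    """
    volume_errors = Counter(rec["volume"] for rec in error_records)
    drive_errors = Counter(rec["drive"] for rec in error_records)
    bad_volumes = sorted(v for v, n in volume_errors.items() if n >= volume_threshold)
    bad_drives = sorted(d for d, n in drive_errors.items() if n >= drive_threshold)
    return bad_volumes, bad_drives

# Made-up sample records, for illustration only.
records = [
    {"volume": "VOL001", "drive": "drive_a"},
    {"volume": "VOL001", "drive": "drive_b"},
    {"volume": "VOL001", "drive": "drive_a"},
    {"volume": "VOL002", "drive": "drive_a"},
]

bad_volumes, bad_drives = flag_suspects(records)
print("Suspect volumes:", bad_volumes)
print("Suspect drives:", bad_drives)
```

Counting per volume and per drive separately helps distinguish a bad cartridge, which fails on several drives, from a bad drive, which fails on several cartridges.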
The good(?) news
• Storagetek is taking our problems seriously.
• They have adopted several measures to “minimize our dissatisfaction” (through the end of 1999):
  • Maintenance presence > 1 hour/day
  • Check cartridges to see if any are from known-bad batches
  • Problem “PINNACLE”, maximum severity, to handle problems
  • Procedure to follow up on all tapes and drives sent to Storagetek for analysis or repair
  • Permanent spare SD-3 at IN2P3 + replacement priority
  • Daily log analysis, to monitor errors and report them back to us
• Goal: anticipate bad volumes or drives and replace them before they break
Other problem: HPSS manageability
• SAMMI does not meet our needs.
• We need to receive a user-configurable subset of the “alarms and events” messages in a script, which can then take the appropriate actions.
• The “appropriate actions” require that appropriate commands be available in command-line form:
  • lock a volume or device;
  • forward a message via e-mail, Patrol, beeper or other means.
• Many messages are not sufficiently precise, or information is lacking.
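A minimal sketch of the requested script-level handling is shown below: a user-configurable rule table selects messages and maps each to an action. Everything here is hypothetical, including the message wording, the rule patterns and the action names; no real HPSS command or message format is implied, and the actions are only planned, not executed.

```python
import re

# Hypothetical, user-configurable rules: (pattern on the message text, action name).
RULES = [
    (re.compile(r"unable to dismount", re.IGNORECASE), "lock_device"),
    (re.compile(r"write error", re.IGNORECASE), "lock_volume"),
]

def plan_actions(message, rules=RULES):
    """Return the action names for one message; notify whenever a rule matched."""
    actions = [action for pattern, action in rules if pattern.search(message)]
    if actions:
        # Forward every matched message to an operator (e-mail, beeper, ...).
        actions.append("notify_operator")
    return actions

# Made-up example messages.
print(plan_actions("Drive 3: unable to dismount cartridge"))
print(plan_actions("Routine event, nothing to report"))
```

Keeping the rules in a table rather than in code is one way to make the monitored subset of messages configurable without touching the dispatch logic.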
Summary
• Greatest current problem is due to errors from Redwood drives; we are studying this problem with Storagetek France. This problem is exacerbated by the next one.
• Greatest long-term problem is manageability, specifically, the lack of adequate non-graphic interfaces to HPSS to permit effective, automatic error detection, performance monitoring and alarm propagation.