Tape Monitoring Vladimír Bahyl IT DSS TAB Storage Analytics Seminar February 2011
Overview • From low level • Tape drives; libraries • Via middle layer • LEMON • Tape Log DB • To high level • Tape Log GUI • SLS • TSMOD • What is missing? • Conclusion
Low level – towards the vendors • Oracle Service Delivery Platform (SDP) • Automatically opens tickets with Oracle • We also receive notifications • Requires “hole” in the firewall, but quite useful • IBM TS3000 console • Central point collecting all information from 4 (out of 5) libraries • Call home via Internet (not modem) • Engineers come on site to fix issues
Low level – CERN usage • SNMP • Using it (traps) whenever available • Need MIB files with SNMPTT actuators: • IBM libraries send traps on errors • ACSLS sends activity traps • ACSLS • Event log messages on multiple lines are concatenated into one • Forwarded via syslog to a central store • Useful for tracking issues with library components (PTP) • Example SNMPTT event definition (a sketch of a handler it could call follows below):

EVENT ibm3584Trap004 .1.3.6.1.4.1.2.6.182.1.0.4 ibm3584Trap CRITICAL
FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4
SDESC
Trap for library TapeAlert 004.
EDESC
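Below is a minimal, hypothetical sketch of the kind of handler the EXEC line above could invoke (the real /usr/local/sbin/ibmlib-report-problem.sh is a shell script; this Python version, the mail addresses and the message format are illustrative assumptions only):

#!/usr/bin/env python3
# Hypothetical trap handler called with the library name and severity,
# e.g.: report-problem.py ibmlib2 CRITICAL
# Mail addresses below are placeholders, not the production setup.
import sys
import smtplib
from email.mime.text import MIMEText

def report(library, severity):
    # Build and send a short notification about the received TapeAlert trap.
    body = "SNMP trap received from library %s with severity %s" % (library, severity)
    msg = MIMEText(body)
    msg["Subject"] = "[tape monitoring] %s trap from %s" % (severity, library)
    msg["From"] = "tape-monitoring@example.invalid"   # placeholder sender
    msg["To"] = "tape-operators@example.invalid"      # placeholder recipient
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    lib = sys.argv[1] if len(sys.argv) > 1 else "unknown-library"
    sev = sys.argv[2] if len(sys.argv) > 2 else "CRITICAL"
    report(lib, sev)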
Middle layer – LEMON • Actuators constantly check local log files (a sketch of such a check follows below) • 4 situations covered: • Tape drive not operational • Request stuck for at least 3600 seconds • Cartridge is write protected • Bad MIR (Media Information Record) • Ticket is created = an email is sent • All relevant information is provided within the ticket to speed up the resolution • A workflow is followed to find a solution • Example notification e-mail:

Dear SUN Tape Drive maintainer team,
this is to report that a tape drive T10B661D@tpsrv963 has become non-operational. Tape T05653 has been disabled.

PROBABLE ERRORS
01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0
01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error
01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963.cern.ch not operational

IDENTIFICATION
Drive Name: T10B661D
Location: acs0,6,1,13
Serial Nr:
Volume ID: T05653
Library: SL8600_1
Model: T10000
Producer: STK
Density: 1000GC
Free Space: 0
Nb Files: 390
Status: FULL|DISABLED
Pool Name: compass7_2
Tape Server: tpsrv963
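As a rough illustration, an actuator for the first situation ("tape drive not operational") could boil down to scanning the local tape daemon log for the corresponding message and raising an alarm. The log path, message pattern and reporting are assumptions derived from the log excerpt above, not the actual actuator code:

# Minimal sketch of a log-scanning check for non-operational drives.
# Path and pattern are assumptions based on the sample messages above.
import re

LOGFILE = "/var/log/castor/taped.log"   # assumed log location
PATTERN = re.compile(r"drive (\S+)@(\S+) not operational")

def drives_not_operational(logfile=LOGFILE):
    """Return (drive, tape_server) pairs reported as non-operational."""
    hits = []
    with open(logfile) as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                hits.append((match.group(1), match.group(2)))
    return hits

if __name__ == "__main__":
    for drive, server in drives_not_operational():
        print("ALARM: tape drive %s on %s is not operational" % (drive, server))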
Middle layer – Tape Log DB • CASTOR log messages from all tape servers are processed and forwarded to a central database • Allows correlation of independent errors (not a complete list; a sketch of one such check follows below): • X input/output errors with Y tapes on 1 drive • X write errors on Y tapes on 1 drive • X positioning errors on Y tapes on 1 drive • X bad MIRs for 1 tape on Y drives • X write/read errors on 1 tape on Y drives • X positioning errors on 1 tape on Y drives • Too many errors on a library • All logs are archived for 120 days, split by VID and tape server • Q: What happened to this tape?
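A sketch of the first correlation listed above ("X input/output errors with Y tapes on 1 drive"), assuming the error records have already been parsed into dictionaries; the field names and thresholds are illustrative assumptions, the real logic runs against the Tape Log DB:

# Flag drives that accumulated I/O errors across several different tapes,
# which points at the drive rather than at the media.
from collections import defaultdict

ERROR_THRESHOLD = 5   # X: assumed minimum number of errors
TAPE_THRESHOLD = 3    # Y: assumed minimum number of distinct tapes

def suspicious_drives(error_records):
    """error_records: iterable of dicts with 'drive', 'vid' and 'type' keys."""
    per_drive = defaultdict(lambda: {"errors": 0, "tapes": set()})
    for record in error_records:
        if record["type"] == "io":
            stats = per_drive[record["drive"]]
            stats["errors"] += 1
            stats["tapes"].add(record["vid"])
    return [drive for drive, stats in per_drive.items()
            if stats["errors"] >= ERROR_THRESHOLD
            and len(stats["tapes"]) >= TAPE_THRESHOLD]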
Tape Log – the data • Origin: rtcpd & taped log messages • All tape servers send data in parallel • Content: various file state information • Volume: • Depends on the activity of the tape infrastructure • Past 7 days: ~30 GB of text files (raw data) • Frequency: • Depends on the activity of the tape infrastructure • Easily > 1000 lines / second • Format: plain text
Tape Log – data transport • Protocol: (r)syslog log messages • Volume: ~150 KB/second • Accepted delays: YES/NO • YES: If the tape log server cannot upload processed data into the database, it will retry later as it keeps a local text log file • NO: If the rsyslog daemon is not running on the tape log server, lost messages will not be processed • Losses acceptable: YES (to a small extent) • The system is only used for statistics or slow reactive monitoring • A serious problem will reoccur elsewhere • We use TCP in order not to lose messages (see the sketch below)
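Forwarding over syslog with TCP can be reproduced with standard tooling; a minimal sketch using Python's logging module follows (the hostname, port and facility are placeholders, not the actual tape log server configuration):

# Send log messages to a central (r)syslog server over TCP so that
# messages are not silently dropped the way they can be with UDP.
import logging
import logging.handlers
import socket

handler = logging.handlers.SysLogHandler(
    address=("tapelog.example.invalid", 514),             # placeholder server
    facility=logging.handlers.SysLogHandler.LOG_LOCAL3,   # placeholder facility
    socktype=socket.SOCK_STREAM,                          # TCP instead of UDP
)
logger = logging.getLogger("taped")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("VID=T05653 drive=T10B661D@tpsrv963 status=mounted")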
Tape Log – data storage • Medium: Oracle database • Data structure: 3 main tables • Accounting • Errors • Tape history • Amount of data in store: • 2 GB • 15-20 million records (2 years' worth of data) • Aging: no, data is kept forever (a sketch of storing one record follows below)
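A sketch of how one error record could end up in the database, using the cx_Oracle module; the connect string, table name and columns are assumptions for illustration, not the actual schema of the three tables above:

# Store a single tape error record in an (assumed) Oracle table.
import cx_Oracle

def store_error(vid, drive, tape_server, message):
    connection = cx_Oracle.connect("tapelog/secret@tapedb")  # placeholder credentials
    try:
        cursor = connection.cursor()
        cursor.execute(
            "INSERT INTO tape_errors (vid, drive, tape_server, message, logged_at) "
            "VALUES (:vid, :drive, :srv, :msg, SYSTIMESTAMP)",
            vid=vid, drive=drive, srv=tape_server, msg=message)
        connection.commit()
    finally:
        connection.close()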
Tape Log – data processing • No additional post-processing once the data is stored in the database • Data mining and visualization are done online • Can take up to a minute
High level – Tape Log GUI • Oracle APEX on top of data in DB • Trends • Accounting • Errors • Media issues • Graphs • Performance • Problems • http://castortapeweb
Tape Log – pros and cons • Pros • Used by DG in his talk! • Using standard transfer protocol • Only uses in-house supported tools • Developed quickly; requires little/no support • Cons • Charting limitations • Can live with that; see point 1 – not worth supporting something special • Does not really scale • OK if only looking at last year’s data
High level – SLS • Service view for users • Live availability information as well as capacity/usage trends • Partially reuses Tape Log DB data • Information organized per VO • Text and graphs • Per day/week/month
TSMOD • Tape Service Manager on Duty • Weekly rotating role to • Resolve issues • Talk to vendors • Supervise interventions • Acts on a twice-daily summary e-mail which monitors: • Drives stuck in (dis-)mounting • Drives not in production for no recorded reason • Requests running or queued for too long • Queue size too large • Supply tape pools running low • Too many tapes disabled since the last run • Goal: have one common place to watch (a sketch of one such check follows below)
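One of the summary checks ("requests running or queued for too long") could conceptually look like the sketch below, reusing the 3600-second limit mentioned on the LEMON slide; the queue representation is an assumption, not the actual TSMOD report code:

# Flag requests that have been sitting in the queue longer than the limit.
import time

QUEUE_LIMIT_SECONDS = 3600   # assumed limit for "queued for too long"

def stale_requests(queue):
    """queue: iterable of dicts with 'request_id' and 'submitted' (epoch seconds)."""
    now = time.time()
    return [entry["request_id"] for entry in queue
            if now - entry["submitted"] > QUEUE_LIMIT_SECONDS]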
What is missing? • We often need the full chain • When was the tape last successfully read? • On which drive? • What was the firmware of that drive? • Users hidden within upper layers • We do not know exactly which user is reading/writing right now • The only information we have is the experiment name, and that is deduced from the stager hostname • Detailed investigations often require the request ID
Conclusion • CERN has extensive tape monitoring covering all layers • The monitoring is fully integrated with the rest of the infrastructure • It is flexible enough to support new hardware (e.g. higher capacity media) • The system is being improved as new requirements arise