Tape Monitoring Vladimír Bahyl IT DSS TAB Storage Analytics Seminar February 2011
Overview • From low level • Tape drives; libraries • Via middle layer • LEMON • Tape Log DB • To high level • Tape Log GUI • SLS • TSMOD • What is missing? • Conclusion
Low level – towards the vendors • Oracle Service Delivery Platform (SDP) • Automatically opens tickets with Oracle • We also receive notifications • Requires “hole” in the firewall, but quite useful • IBM TS3000 console • Central point collecting all information from 4 (out of 5) libraries • Call home via Internet (not modem) • Engineers come on site to fix issues
Low level – CERN usage • SNMP • Using it (traps) whenever available • Need MIB files with SNMPTT actuators: • IBM libraries send traps on errors • ACSLS sends activity traps • ACSLS • Event log messages on multiple lines are concatenated into one • Forwarded via syslog to a central store • Useful for tracking issues with library components (PTP) • Example SNMPTT event definition (a sketch of a handler it could call follows below):

EVENT ibm3584Trap004 .1.3.6.1.4.1.2.6.182.1.0.4 ibm3584Trap CRITICAL
FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4
SDESC
Trap for library TapeAlert 004.
EDESC
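Below is a minimal, hypothetical sketch of the kind of handler the EXEC line above could invoke (the real /usr/local/sbin/ibmlib-report-problem.sh is a shell script; this Python version, the mail addresses and the message format are illustrative assumptions only):

#!/usr/bin/env python3
# Hypothetical trap handler called with the library name and severity,
# e.g.: report-problem.py ibmlib2 CRITICAL
# Mail addresses below are placeholders, not the production setup.
import sys
import smtplib
from email.mime.text import MIMEText

def report(library, severity):
    # Build and send a short notification about the received TapeAlert trap.
    body = "SNMP trap received from library %s with severity %s" % (library, severity)
    msg = MIMEText(body)
    msg["Subject"] = "[tape monitoring] %s trap from %s" % (severity, library)
    msg["From"] = "tape-monitoring@example.invalid"   # placeholder sender
    msg["To"] = "tape-operators@example.invalid"      # placeholder recipient
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    lib = sys.argv[1] if len(sys.argv) > 1 else "unknown-library"
    sev = sys.argv[2] if len(sys.argv) > 2 else "CRITICAL"
    report(lib, sev)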
Middle layer – LEMON • Actuators constantly check local log files (a sketch of such a check follows below) • 4 situations covered: • Tape drive not operational • Request stuck for at least 3600 seconds • Cartridge is write protected • Bad MIR (Media Information Record) • Ticket is created = an email is sent • All relevant information is provided within the ticket to speed up the resolution • A workflow is followed to find a solution • Example notification e-mail:

Dear SUN Tape Drive maintainer team,
this is to report that a tape drive T10B661D@tpsrv963 has become non-operational. Tape T05653 has been disabled.

PROBABLE ERRORS
01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0
01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error
01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963.cern.ch not operational

IDENTIFICATION
Drive Name: T10B661D
Location: acs0,6,1,13
Serial Nr:
Volume ID: T05653
Library: SL8600_1
Model: T10000
Producer: STK
Density: 1000GC
Free Space: 0
Nb Files: 390
Status: FULL|DISABLED
Pool Name: compass7_2
Tape Server: tpsrv963
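As a rough illustration, an actuator for the first situation ("tape drive not operational") could boil down to scanning the local tape daemon log for the corresponding message and raising an alarm. The log path, message pattern and reporting are assumptions derived from the log excerpt above, not the actual actuator code:

# Minimal sketch of a log-scanning check for non-operational drives.
# Path and pattern are assumptions based on the sample messages above.
import re

LOGFILE = "/var/log/castor/taped.log"   # assumed log location
PATTERN = re.compile(r"drive (\S+)@(\S+) not operational")

def drives_not_operational(logfile=LOGFILE):
    """Return (drive, tape_server) pairs reported as non-operational."""
    hits = []
    with open(logfile) as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                hits.append((match.group(1), match.group(2)))
    return hits

if __name__ == "__main__":
    for drive, server in drives_not_operational():
        print("ALARM: tape drive %s on %s is not operational" % (drive, server))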
Middle layer – Tape Log DB • CASTOR log messages from all tape servers are processed and forwarded to a central database • Allows correlation of independent errors (not a complete list; a sketch of one such check follows below): • X input/output errors with Y tapes on 1 drive • X write errors on Y tapes on 1 drive • X positioning errors on Y tapes on 1 drive • X bad MIRs for 1 tape on Y drives • X write/read errors on 1 tape on Y drives • X positioning errors on 1 tape on Y drives • Too many errors on a library • All logs are archived for 120 days, split by VID and tape server • Q: What happened to this tape?
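A sketch of the first correlation listed above ("X input/output errors with Y tapes on 1 drive"), assuming the error records have already been parsed into dictionaries; the field names and thresholds are illustrative assumptions, the real logic runs against the Tape Log DB:

# Flag drives that accumulated I/O errors across several different tapes,
# which points at the drive rather than at the media.
from collections import defaultdict

ERROR_THRESHOLD = 5   # X: assumed minimum number of errors
TAPE_THRESHOLD = 3    # Y: assumed minimum number of distinct tapes

def suspicious_drives(error_records):
    """error_records: iterable of dicts with 'drive', 'vid' and 'type' keys."""
    per_drive = defaultdict(lambda: {"errors": 0, "tapes": set()})
    for record in error_records:
        if record["type"] == "io":
            stats = per_drive[record["drive"]]
            stats["errors"] += 1
            stats["tapes"].add(record["vid"])
    return [drive for drive, stats in per_drive.items()
            if stats["errors"] >= ERROR_THRESHOLD
            and len(stats["tapes"]) >= TAPE_THRESHOLD]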
Tape Log – the data • Origin: rtcpd & taped log messages • All tape servers send data in parallel • Content: various file state information • Volume: • Depends on the activity of the tape infrastructure • Past 7 days: ~30 GB of text files (raw data) • Frequency: • Depends on the activity of the tape infrastructure • Easily > 1000 lines / second • Format: plain text
Tape Log – data transport • Protocol: (r)syslog log messages • Volume: ~150 KB/second • Accepted delays: YES/NO • YES: If the tape log server cannot upload processed data into the database, it will retry later as it keeps a local text log file • NO: If the rsyslog daemon is not running on the tape log server, lost messages will not be processed • Losses acceptable: YES (to a small extent) • The system is only used for statistics or slow reactive monitoring • A serious problem will reoccur elsewhere • We use TCP in order not to lose messages (see the sketch below)
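Forwarding over syslog with TCP can be reproduced with standard tooling; a minimal sketch using Python's logging module follows (the hostname, port and facility are placeholders, not the actual tape log server configuration):

# Send log messages to a central (r)syslog server over TCP so that
# messages are not silently dropped the way they can be with UDP.
import logging
import logging.handlers
import socket

handler = logging.handlers.SysLogHandler(
    address=("tapelog.example.invalid", 514),             # placeholder server
    facility=logging.handlers.SysLogHandler.LOG_LOCAL3,   # placeholder facility
    socktype=socket.SOCK_STREAM,                          # TCP instead of UDP
)
logger = logging.getLogger("taped")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("VID=T05653 drive=T10B661D@tpsrv963 status=mounted")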
Tape Log – data storage • Medium: Oracle database • Data structure: 3 main tables • Accounting • Errors • Tape history • Amount of data in store: • 2 GB • 15-20 million records (2 years' worth of data) • Aging: no, data is kept forever (a sketch of storing one record follows below)
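A sketch of how one error record could end up in the database, using the cx_Oracle module; the connect string, table name and columns are assumptions for illustration, not the actual schema of the three tables above:

# Store a single tape error record in an (assumed) Oracle table.
import cx_Oracle

def store_error(vid, drive, tape_server, message):
    connection = cx_Oracle.connect("tapelog/secret@tapedb")  # placeholder credentials
    try:
        cursor = connection.cursor()
        cursor.execute(
            "INSERT INTO tape_errors (vid, drive, tape_server, message, logged_at) "
            "VALUES (:vid, :drive, :srv, :msg, SYSTIMESTAMP)",
            vid=vid, drive=drive, srv=tape_server, msg=message)
        connection.commit()
    finally:
        connection.close()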
Tape Log – data processing • No additional post-processing once the data is stored in the database • Data mining and visualization are done online • Can take up to a minute
High level – Tape Log GUI • Oracle APEX on top of data in DB • Trends • Accounting • Errors • Media issues • Graphs • Performance • Problems • http://castortapeweb
Tape Log – pros and cons • Pros • Used by DG in his talk! • Using standard transfer protocol • Only uses in-house supported tools • Developed quickly; requires little/no support • Cons • Charting limitations • Can live with that; see point 1 – not worth supporting something special • Does not really scale • OK if only looking at last year’s data
High level – SLS • Service view for users • Live availability information as well as capacity/usage trends • Partially reuses Tape Log DB data • Information organized per VO • Text and graphs • Per day/week/month
TSMOD • Tape Service Manager on Duty • Weekly rotating role to • Resolve issues • Talk to vendors • Supervise interventions • Acts on a twice-daily summary e-mail which monitors: • Drives stuck in (dis-)mounting • Drives not in production for no recorded reason • Requests running or queued for too long • Queue size too large • Supply tape pools running low • Too many tapes disabled since the last run • Goal: have one common place to watch (a sketch of one such check follows below)
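One of the summary checks ("requests running or queued for too long") could conceptually look like the sketch below, reusing the 3600-second limit mentioned on the LEMON slide; the queue representation is an assumption, not the actual TSMOD report code:

# Flag requests that have been sitting in the queue longer than the limit.
import time

QUEUE_LIMIT_SECONDS = 3600   # assumed limit for "queued for too long"

def stale_requests(queue):
    """queue: iterable of dicts with 'request_id' and 'submitted' (epoch seconds)."""
    now = time.time()
    return [entry["request_id"] for entry in queue
            if now - entry["submitted"] > QUEUE_LIMIT_SECONDS]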
What is missing? • We often need the full chain • When was the tape last successfully read? • On which drive? • What was the firmware of that drive? • Users hidden within upper layers • We do not know exactly which user is reading/writing right now • The only information we have is the experiment name, and that is deduced from the stager hostname • Detailed investigations often require the request ID
Conclusion • CERN has extensive tape monitoring covering all layers • The monitoring is fully integrated with the rest of the infrastructure • It is flexible enough to support new hardware (e.g. higher capacity media) • The system is being improved as new requirements arise