Tape Monitoring

  1. Tape Monitoring Vladimír Bahyl IT DSS TAB Storage Analytics Seminar February 2011

  2. Overview • From low level • Tape drives; libraries • Via middle layer • LEMON • Tape Log DB • To high level • Tape Log GUI • SLS • TSMOD • What is missing? • Conclusion

  3. Low level – towards the vendors • Oracle Service Delivery Platform (SDP) • Automatically opens tickets with Oracle • We also receive notifications • Requires “hole” in the firewall, but quite useful • IBM TS3000 console • Central point collecting all information from 4 (out of 5) libraries • Call home via Internet (not modem) • Engineers come on site to fix issues

  4. Low level – CERN usage • SNMP • Using it (traps) whenever available • Need MIB files with SNMPTT actuators: • IBM libraries send traps on errors • ACSLS sends activity traps • ACSLS • Event log messages on multiple lines concatenated into one • Forwarded via syslog to central store • Useful for tracking issues with library components (PTP)

EVENT ibm3584Trap004 .1.3.6.1.4.1.2.6.182.1.0.4 ibm3584Trap CRITICAL
FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4
SDESC
Trap for library TapeAlert 004.
EDESC
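For illustration only, a minimal Python sketch of what an EXEC hook such as the ibmlib-report-problem.sh referenced above could do (build a short report from the trap arguments and mail it to the tape team); the addresses and wording are assumptions, not the actual CERN script:

#!/usr/bin/env python3
# Hypothetical sketch of an SNMPTT EXEC hook: takes the reporting library host
# and a severity from the command line, builds a short report and e-mails it.
# Addresses and wording are assumptions, not the actual CERN tooling.
import smtplib
import sys
from email.mime.text import MIMEText

def report_problem(agent, severity):
    body = "Library %s raised a TapeAlert trap (severity %s)." % (agent, severity)
    msg = MIMEText(body)
    msg["Subject"] = "[tape-monitoring] trap from %s" % agent
    msg["From"] = "tape-monitoring@example.org"   # assumption
    msg["To"] = "tape-operations@example.org"     # assumption
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    report_problem(sys.argv[1], sys.argv[2])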

  5. Middle layer – LEMON • Actuators constantly check local log files • 4 situations covered: • Tape drive not operational • Request stuck for at least 3600 seconds • Cartridge is write protected • Bad MIR (Media Information Record) • Ticket is created = email is sent • All relevant information is provided within the ticket to speed up the resolution • Workflow is followed to find a solution

Dear SUN Tape Drive maintainer team, this is to report that a tape drive T10B661D@tpsrv963 has became non-operational. Tape T05653 has been disabled.

PROBABLE ERRORS
01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0
01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error
01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963.cern.ch not operational

IDENTIFICATION
Drive Name: T10B661D
Location: acs0,6,1,13
Serial Nr:
Volume ID: T05653
Library: SL8600_1
Model: T10000
Producer: STK
Density: 1000GC
Free Space: 0
Nb Files: 390
Status: FULL|DISABLED
Pool Name: compass7_2
Tape Server: tpsrv963
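A minimal Python sketch of the kind of check such a LEMON actuator performs, assuming log lines like the excerpt above; the file path and the follow-up action are hypothetical:

import re

# Hypothetical log location; the real actuator and its configuration are not
# shown in the slides.
TAPE_LOG = "/var/log/castor/taped.log"

# Matches the "drive ... not operational" message seen in the excerpt above.
NOT_OPERATIONAL = re.compile(r"TP033 - drive (\S+) not operational")

def scan_for_dead_drives(path=TAPE_LOG):
    """Return the set of drives reported as non-operational in the local log."""
    drives = set()
    with open(path) as log:
        for line in log:
            match = NOT_OPERATIONAL.search(line)
            if match:
                drives.add(match.group(1))
    return drives

# A real actuator would then raise a ticket (send the e-mail shown above) for
# each affected drive, including the identification details.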

  6. Middle layer – Tape Log DB • CASTOR log messages from all tape servers are processed and forwarded to a central database • Allows correlation of independent errors (not a complete list): • X input/output errors with Y tapes on 1 drive • X write errors on Y tapes on 1 drive • X positioning errors on Y tapes on 1 drive • X bad MIRs for 1 tape on Y drives • X write/read errors on 1 tape on Y drives • X positioning errors on 1 tape on Y drives • Too many errors on a library • Archive for 120 days all logs split by VID and tape server • Q: What happened to this tape?
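A minimal Python sketch of one of the correlation rules listed above (many errors spread over several different tapes on a single drive point at the drive rather than the media); the thresholds and the record format are assumptions:

import collections

# Illustrative thresholds; the real values are not given in the slides.
ERROR_THRESHOLD = 5
TAPE_THRESHOLD = 3

def suspect_drives(errors):
    """errors: iterable of (drive, vid) pairs taken from the error records.

    Flag drives that accumulated many errors over several different tapes,
    a hint that the drive rather than the media is at fault."""
    per_drive = collections.defaultdict(list)
    for drive, vid in errors:
        per_drive[drive].append(vid)
    return [drive for drive, vids in per_drive.items()
            if len(vids) >= ERROR_THRESHOLD and len(set(vids)) >= TAPE_THRESHOLD]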

  7. Tape Log – the data • Origin: rtcpd & taped log messages • All tape servers sending data in parallel • Content: various file state information • Volume: • Depends on the activity of the tape infrastructure • Past 7 days: ~30 GB of text files (raw data) • Frequency: • Depends on the activity of the tape infrastructure • Easily > 1000 lines / second • Format: plain text

  8. Tape Log – data transport • Protocol: (r)syslog log messages • Volume: ~150 KB/second • Accepted delays: YES/NO • YES: If the tape log server cannot upload processed data into the database, it will try later as it has a local text log file • NO: If the rsyslog daemon is not running on the tape log server, lost messages will not be processed • Losses acceptable: YES (to some small extent) • The system is only used for statistics or slow reactive monitoring • A serious problem will reoccur elsewhere • We use TCP in order not to lose messages
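For illustration, a small Python sketch of shipping log messages over TCP syslog, the transport choice mentioned above; the host name and port are assumptions, and in the real setup the forwarding is done by the tape servers' (r)syslog daemons rather than by application code:

import logging
import logging.handlers
import socket

# Hypothetical central tape log host; SOCK_STREAM selects TCP so that messages
# are not silently dropped on the way (the "not to lose messages" point above).
handler = logging.handlers.SysLogHandler(
    address=("tapelogserver.example.org", 514),   # assumption
    socktype=socket.SOCK_STREAM)

logger = logging.getLogger("taped")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("TP033 - drive T10B661D@tpsrv963.cern.ch not operational")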

  9. Tape Log – data storage • Medium: Oracle database • Data structure: 3 main tables • Accounting • Errors • Tape history • Amount of data in store: • 2 GB • 15-20 million records (2 years' worth of data) • Aging: no, data kept forever
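A hedged sketch of what the three tables might hold, based only on the fields visible in the ticket example earlier; the real Oracle schema is not shown in the slides and certainly differs (SQLite is used here only so the sketch runs without an Oracle instance):

import sqlite3

# Hypothetical, heavily simplified layout of the three main tables named above.
SCHEMA = """
CREATE TABLE accounting   (ts TEXT, tape_server TEXT, drive TEXT, vid TEXT,
                           bytes_transferred INTEGER, direction TEXT);
CREATE TABLE errors       (ts TEXT, tape_server TEXT, drive TEXT, vid TEXT,
                           error_code TEXT, message TEXT);
CREATE TABLE tape_history (ts TEXT, vid TEXT, event TEXT, drive TEXT);
"""

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)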

  10. Tape Log – data processing • No additional post-processing once data is stored in the database • Data mining and visualization done online • Can take up to a minute

  11. High level – Tape Log GUI • Oracle APEX on top of data in DB • Trends • Accounting • Errors • Media issues • Graphs • Performance • Problems • http://castortapeweb

  12. High level – Tape Log GUI

  13. Tape Log – pros and cons • Pros • Used by DG in his talk! • Using standard transfer protocol • Only uses in-house supported tools • Developed quickly; requires little/no support • Cons • Charting limitations • Can live with that; see point 1 – not worth supporting something special • Does not really scale • OK if only looking at last year’s data

  14. High level – SLS • Service view for users • Live availability information as well as capacity/usage trends • Partially reuses Tape Log DB data • Information organized per VO • Text and graphs • Per day/week/month

  15. High level – SLS

  16. TSMOD • Tape Service Manager on Duty • Weekly changing role to • Resolve issues • Talk to vendors • Supervise interventions • Acts on twice-daily summary e-mail which monitors: • Drives stuck in (dis-)mounting • Drives not in production without any reason • Requests running or queued for too long • Queue size too long • Supply tape pools running low • Too many disabled tapes since the last run • Goal: have one common place to watch
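As an illustration of one of the checks feeding that summary e-mail, a Python sketch of the "requests queued for too long" test; the threshold and the way requests are represented are assumptions:

import time

# Illustrative threshold; the actual limit used for the TSMOD summary is not
# given in the slides.
MAX_QUEUE_SECONDS = 4 * 3600

def overdue_requests(requests, now=None):
    """requests: iterable of (request_id, submit_time_epoch_seconds) pairs.

    Return the IDs of requests queued longer than the threshold, to be listed
    in the twice-daily summary e-mail."""
    now = time.time() if now is None else now
    return [req_id for req_id, submitted in requests
            if now - submitted > MAX_QUEUE_SECONDS]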

  17. What is missing? • We often need the full chain • When was the tape last successfully read? • On which drive? • What was the firmware of that drive? • Users hidden within upper layers • We do not know which exact user is reading/writing right now • The only information we have is the experiment name, and that is deduced from the stager hostname • Detailed investigations often require the request ID

  18. Conclusion • CERN has extensive tape monitoring covering all layers • The monitoring is fully integrated with the rest of the infrastructure • It is flexible enough to support new hardware (e.g. higher capacity media) • The system is being improved as new requirements arise
