Tape Operations Update
Vladimír Bahyl, IT FIO-TSI, CERN

Agenda
• Progress on issues (since the last meeting)
• Current equipment and challenges
• Development changes
• Operational changes
• Conclusion

Progress on issues
• NI_FAILURE
  • Problem still present
  • A simple procedure exists, so there is no need to reinstall
• tplabel command
  • By default, existing labels are not overwritten
  • -f option introduced to force relabelling
• Cmonitd
  • No longer used at CERN

Equipment today
• 25 PB total (around 50% free)
• IBM
  • 2 libraries
  • ~12 000 slots; 700 GB each
  • 60 TS1120 drives
• Sun
  • 4 libraries
  • ~36 000 slots; 500 GB each
  • 60 T10000A drives

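
As a rough cross-check of the figures on this slide, the nominal capacity can be estimated from the slot counts and cartridge sizes. This is only a sketch: it assumes every slot holds one full-size cartridge and uses decimal units (1 PB = 1 000 000 GB), neither of which the slide states. Since the slot counts are approximate, the result should be read as an upper bound in the neighbourhood of the quoted 25 PB.

```python
# Nominal capacity estimate from the slide's approximate figures.
# Assumptions (not from the slide): one cartridge per slot, 1 PB = 1e6 GB.
LIBRARIES = {
    "IBM": {"slots": 12_000, "gb_per_cartridge": 700},
    "Sun": {"slots": 36_000, "gb_per_cartridge": 500},
}


def nominal_capacity_pb(slots, gb_per_cartridge):
    """Return the capacity in PB if every slot held one cartridge."""
    return slots * gb_per_cartridge / 1_000_000


capacity_pb = {
    vendor: nominal_capacity_pb(**spec) for vendor, spec in LIBRARIES.items()
}
total_pb = sum(capacity_pb.values())

for vendor, pb in capacity_pb.items():
    print(f"{vendor}: {pb:.1f} PB")
print(f"Total: {total_pb:.1f} PB")
```

The estimate comes out slightly above 25 PB, which is consistent with the slot counts being rounded and not every slot necessarily being populated.
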
Equipment near future
• Tape space sufficient for 2008
  • Unbalanced
• New drives
  • IBM TS1130: ~160 MB/s, 1 TB cartridges
  • Sun T10000B: ~130 MB/s, 1 TB cartridges
• IBM High density frame

Challenges
• Low ATLAS write rate, partially caused by additional mounts due to a CASTOR policy bug
• ALICE rate affected by small files from users writing to the default pool

Development 1/3
• Patch-free kernel version (2.1.6-8)
  • Goal: by SLC5, do not use any CASTOR-specific kernel patches
  • All necessary settings moved to the CASTOR tape layer
  • New SCSI tape driver options introduced:
      TAPE ST_ASYNC_WRITES 0
      TAPE ST_BUFFER_WRITES 0
      TAPE ST_LONG_TIMEOUT 3600
      TAPE ST_READ_AHEAD 0
      TAPE ST_TIMEOUT 900
  • Testing on a few machines already on SLC4

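
The options on this slide follow a "CATEGORY NAME VALUE" layout. A minimal sketch of reading such lines into a dictionary is given here, for instance to audit that a tape server carries the expected settings. The function name `parse_options` and the treatment of `#` as a comment marker are assumptions made for illustration; this is not the CASTOR configuration reader.

```python
def parse_options(text):
    """Parse 'CATEGORY NAME VALUE' lines into {category: {name: value}}."""
    options = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment, as in many Unix config files.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 2)
        if len(parts) < 3:
            raise ValueError(f"expected CATEGORY NAME VALUE, got: {raw!r}")
        category, name, value = parts
        options.setdefault(category, {})[name] = value
    return options


SAMPLE = """\
TAPE ST_ASYNC_WRITES 0
TAPE ST_BUFFER_WRITES 0
TAPE ST_LONG_TIMEOUT 3600
TAPE ST_READ_AHEAD 0
TAPE ST_TIMEOUT 900
"""

tape = parse_options(SAMPLE)["TAPE"]
timeouts = {name: int(value) for name, value in tape.items()
            if name.endswith("TIMEOUT")}
print(timeouts)
```

Values are kept as strings and converted only where needed, because other options in the same layout (such as the failure-handling ones) carry several words as their value.
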
Development 2/3
• Library failure handling (2.1.7-3)
  • Now possible to overcome short temporary failures of Sun libraries
  • Options introduced:
      TAPE ACS_MOUNT_LIBRARY_FAILURE_HANDLING retry 3 300
      TAPE ACS_UNMOUNT_LIBRARY_FAILURE_HANDLING retry 3 300
• Use non-labeled tapes (2.1.7-3)
  • By default, we use AUL (American National Standard label and American National Standard user label) tape labels
  • NL tapes are now also supported

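
The idea of riding out a short library failure can be illustrated with a generic retry loop. The sketch below reads the value `retry 3 300` as "retry up to 3 times, waiting 300 seconds between attempts"; that reading is an assumption, since the slide does not define the two numbers. `LibraryUnavailable`, `parse_failure_handling` and `call_with_retry` are hypothetical names, not part of CASTOR or ACSLS.

```python
import time


class LibraryUnavailable(Exception):
    """Hypothetical error for a temporary library failure."""


def parse_failure_handling(value):
    """Split e.g. 'retry 3 300' into (action, retries, interval_s)."""
    action, retries, interval_s = value.split()
    return action, int(retries), int(interval_s)


def call_with_retry(operation, value, sleep=time.sleep):
    """Run operation(); on LibraryUnavailable, retry as 'value' describes."""
    action, retries, interval_s = parse_failure_handling(value)
    if action != "retry":
        raise ValueError(f"unsupported action: {action!r}")
    attempt = 0
    while True:
        try:
            return operation()
        except LibraryUnavailable:
            if attempt >= retries:
                raise  # give up after the configured number of retries
            attempt += 1
            sleep(interval_s)


# Usage with a fake mount that fails twice; sleeping is stubbed out.
state = {"calls": 0}


def fake_mount():
    state["calls"] += 1
    if state["calls"] < 3:
        raise LibraryUnavailable("library not responding")
    return "mounted"


result = call_with_retry(fake_mount, "retry 3 300", sleep=lambda s: None)
print(result, "after", state["calls"], "attempts")
```

Passing `sleep` as a parameter keeps the example quick to run and makes the waiting behaviour easy to check in isolation.
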
Development 3/3
• Option to log to SysLog (2.1.7-4)
  • See the talk of Giuseppe Lo Re
  • Can log to DLF since the last meeting
  • SysLog now also supported
  • Uses local0 and local1 facilities
  • Options needed:
      TAPE TPLOGGER SYSLOG
      local0.info;local1.info @castortapelog
      local0.*;local1.* /var/log/castor-tape.log
  • Log example:
      Jun 6 15:52:23 tpsrv623 rtcpd[16828]: "TYPE"="RT044 – Request statistics", "FUNC"="rtcpd_FreeResources", "MESSAGE"="Request statistics", "REQUESTTYPE"="READ", "VID"="T07106", "MOUNTTIME"="163", "SERVICETIME"="209", "WAITTIME"="164", "TRANSFERTIME"="7", "POSITIONTIME"="36", "DATAVOLUMEMB"="115.570068", "DATARATEMBS"="16.510010", "FILES"="1", "DGN"="T10KR1", "VOLREQID"="77219", "CLIENTNAME"="stage", "CLIENTUID"="14029", "CLIENTGID"="1474", "CLIENTHOST"="c2publicsrv102.cern.ch", "TPVID"="T07106", "REQUESTSTATE"="successful"

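
Because the log example consists of `"KEY"="VALUE"` pairs, such lines are straightforward to post-process. The sketch below pulls the pairs out with a regular expression, using an excerpt of the fields from the example. The function name `parse_tape_log` is hypothetical, and the normalisation of typographic quotes is an assumption added to cope with text copied from slides.

```python
import re

# One "KEY"="VALUE" pair; values may be empty but contain no double quote.
PAIR = re.compile(r'"([^"]+)"="([^"]*)"')

# Typographic double quotes (U+201C, U+201D) mapped to plain ASCII quotes.
CURLY_TO_STRAIGHT = str.maketrans("\u201c\u201d", '""')


def parse_tape_log(line):
    """Return the KEY/VALUE pairs of one log line as a dict of strings."""
    return dict(PAIR.findall(line.translate(CURLY_TO_STRAIGHT)))


LINE = (
    'Jun 6 15:52:23 tpsrv623 rtcpd[16828]: '
    '"FUNC"="rtcpd_FreeResources", "REQUESTTYPE"="READ", "VID"="T07106", '
    '"WAITTIME"="164\u201c, "TRANSFERTIME"="7", '
    '"DATAVOLUMEMB"="115.570068", "DATARATEMBS"="16.510010", '
    '"DGN"="T10KR1", "REQUESTSTATE"="successful"'
)

fields = parse_tape_log(LINE)
rate = float(fields["DATAVOLUMEMB"]) / float(fields["TRANSFERTIME"])
print(fields["VID"], fields["REQUESTSTATE"], f"{rate:.6f} MB/s")
```

In this one sample, the data volume divided by the transfer time reproduces the logged data rate to within rounding; whether that relation holds for every request type is not stated in the slides.
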
Operational changes 1/2
• RTCPD self monitor enabled
  • RTCP daemon sometimes gets stuck
  • Self monitor terminates the job and does proper cleanup
      RTCOPYD SELF_MONITOR YES
      RTCOPYD MOUNT_TIME 900
• SNMP traps handling
  • IBM libraries send SNMP traps directly
      Volser CLN168JA, A Enterprise Tape cleaning cartridge has expired.
  • ACSLS sends traps on behalf of Sun libraries
      ACSLS info Lsm 0,7 number of drives changed from 6 to 7. Lsm will be updated.
  • LEMON creates alarms

Operational changes 2/2
• TSMOD (Tape Service Manager on Duty)
  • Receives daily report
      TD01E | Drive Down Without Reason | DN 3592B2 35922005@tpsrv135 DOWN 20530 (No_dedication) None
      TD03E | Job running for too long | DA 994BR0 994B0618@tpsrv635 RUNNING 27769 (No_dedication) P17080 P17080 R 30726 (stage,st)@lxmrrk2707.cern.ch
      TQ01E | DGN Queue Wait Time Long | Average queue wait time in T10KR1 is 14729 seconds
      TQ02E | Queue Request Too Old | Q T10KR1 T13388 R 143229 (stage,st)@c2cmssrv102.cern.ch 37990
  • Follows procedures according to the error code
  • Handles most other common issues
    • E.g. contacting vendors for problems
  • Weekly rotation

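
Since each report line has the shape "code | description | details", the entries can be grouped by error code before the matching procedure is looked up. This is a minimal sketch under that assumption; `parse_report` is a hypothetical helper and not the tool that produces or consumes the TSMOD report.

```python
from collections import defaultdict


def parse_report(text):
    """Group 'CODE | description | details' lines by error code."""
    by_code = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two '|' only, so details may contain anything.
        code, description, details = (p.strip() for p in line.split("|", 2))
        by_code[code].append((description, details))
    return dict(by_code)


REPORT = """\
TD01E | Drive Down Without Reason | DN 3592B2 35922005@tpsrv135 DOWN 20530 (No_dedication) None
TD03E | Job running for too long | DA 994BR0 994B0618@tpsrv635 RUNNING 27769 (No_dedication) P17080 P17080 R 30726 (stage,st)@lxmrrk2707.cern.ch
TQ01E | DGN Queue Wait Time Long | Average queue wait time in T10KR1 is 14729 seconds
TQ02E | Queue Request Too Old | Q T10KR1 T13388 R 143229 (stage,st)@c2cmssrv102.cern.ch 37990
"""

entries = parse_report(REPORT)
for code in sorted(entries):
    for description, _details in entries[code]:
        print(code, "->", description)
```

Grouping by code mirrors the way the report is used on the slide: the code selects the procedure, and the details identify the drive, job or queue concerned.
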
Conclusion
• Tape capacity sufficient for 2008
• New tape-related CASTOR features are constantly being put into production
• We are trying to simplify our setup and automate the problem handling