Tivoli Storage Manager: Troubleshooting Case Studies

andie
Presentation Transcript


  1. Tivoli Storage Manager: Troubleshooting Case Studies

  2. Introduction

  3. Agenda • A Troubleshooting Methodology • Some Points to Ponder • The Case of the Subtle Problem • The Case of the Black Hole • The Case of the Red Herring • The Case of the Hardware Headache

  4. Troubleshooting Methodology • Gather the Known Facts • It’s important to know what is normal in your environment • Catalogue the suspects • TSM Client • TSM Server • Network • Library • Eliminate Suspects • Via logic • Via testing • Form a tentative conclusion • Verify the conclusion

  5. Points to Ponder • Backups place uniquely large strains on the environment • Heavy throughput can overstress marginal hardware • Heavy throughput can also reveal otherwise-invisible operating system limits • Hardware can pass self-test or routine diagnostics and still break down under severe load. • Tivoli Storage Manager is for the most part mature code • Bugs tend to appear mostly in new functionality • Don’t fall victim to the tendency to blame the software first

  6. The Case of the Subtle Problem--Background • The problem: Random occurrences of corrupted backup tapes • Known Facts • Netfinity TSM Server, four SCSI tape drives on independent channels • Some backup tapes are corrupted, apparently at random • No errors in the activity log • Seems to happen only on large (> 10 MB) files • Tape drives had been replaced without resolving problem • Unknown (but discoverable, had anyone been looking) • Corrupted backups took 10 times longer than expected to complete • Corrupted backups always happened when writing to tape drive on channel C • Initial Premise: TSM code is breaking down

  7. The Case of the Subtle Problem--Troubleshooting • Only the clients could be eliminated as suspects, leading to: • The Brute Force Approach • Perform backups to all possible drive combinations while tracing activity • Force specific drive usage by updating others off-line • Perform restorations to verify good backups • Test Results: • Channel A (Drive 1)—OK • Channel B (Drive 2)—OK • Channel C (Drive 3)—Backup time excessive, test file corrupted • Channel D (Drive 4)—OK
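
The verification step in the brute-force approach (restore the test file and confirm it matches the original) can be sketched in Python. This is a minimal illustration, not part of the original presentation: the helper names, the `slow_factor` threshold, and the shape of the `results` dictionary are all assumptions made for the example.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


def suspect_drives(results, expected_seconds, slow_factor=3.0):
    """Return drives whose test backup was corrupted or unusually slow.

    results maps a drive label to (elapsed_seconds, restore_ok).
    slow_factor is an arbitrary illustrative threshold, not a TSM setting.
    """
    return sorted(
        drive
        for drive, (elapsed, restore_ok) in results.items()
        if not restore_ok or elapsed > slow_factor * expected_seconds
    )
```

Recording both elapsed time and restore result per drive matters here because, in this case, the excessive backup time was as much a symptom as the corruption itself.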

  8. The Case of the Subtle Problem--Troubleshooting • The Next Step • Trace a backup to Drive 3 using debug server • Test Results: • Channel C (Drive 3)—Backup time still excessive, test file OK • ????!! • Obvious Solution • Leave the debug server in place (Just kidding…) • Time to think • What does the debug code do differently? • Writes lots of logs to disk, which slows throughput to tape, otherwise no change • Tentative Conclusion: Could there be a load-related hardware breakdown?

  9. The Case of the Subtle Problem--Troubleshooting • The Next Step • Figure out what component is breaking down • Tape Drive • SCSI Card • Network Card • Test Sequence • Tape-to-tape copy from the drive on Channel A to the drive on Channel C (eliminates the network) • Swap the cables on the Channel C drive to a known good bus (eliminates the drive) • Results • Tape-to-tape copy corrupted, so the network card is not the problem • After the cable swap, the Channel C drive is OK, so the drive is not the problem • Solution: Replace SCSI card C

  10. The Case of the Black Hole -- Background • Reported problem: recent data restoration attempts encountered unexpected tape read errors. (Sound familiar?) • Two identical AIX TSM Servers, sharing one IBM 3494 library. • Four SCSI tape drives accessed via SAN Data Gateway on Server A • Two SCSI tape drives accessed via LAN on Server B • Some backup tapes are corrupted on Server A (discovered during attempted data restoration--ouch!!!) • Known (but disregarded) • Scratch tape usage was much higher than expected on Server A • Unknown (but shouldn’t have been) • Tape write errors during backups were commonplace • Initial Premise: TSM code is breaking down

  11. The Case of the Black Hole --Troubleshooting • Data Gathering • Errors in the activity log are primarily ANR8311E, I/O error accessing the tape device for WRITE. Almost no READ errors. • errpt -a showed 324 tape device error entries over an eight-day period • Over 200 volumes with access set to readonly, most only 2% used • TSM sets access to readonly when a tape exhibits write problems • Initial Conclusion • Read errors are an artifact of previous write problems • Write problems are happening at O/S or hardware level • Next Steps • Isolate the failure
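
The data-gathering step above amounts to two tallies: which message IDs dominate the activity log, and how many read-only volumes are nearly empty. A rough Python sketch follows. It assumes the log has already been exported as plain text lines and that volume data is available as simple dictionaries; the field names `name`, `access`, and `pct_util` and the 5% cutoff are hypothetical, chosen only for illustration, and do not reflect actual TSM output formats.

```python
import re
from collections import Counter

# Server message IDs look like ANR8311E: "ANR", four digits, a severity letter.
MESSAGE_ID = re.compile(r"\bANR\d{4}[A-Z]\b")


def count_message_ids(lines):
    """Count occurrences of each ANR message ID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(MESSAGE_ID.findall(line))
    return counts


def underused_readonly(volumes, max_pct=5.0):
    """Return names of read-only volumes at or below max_pct utilization.

    Each volume is a dict with hypothetical keys: name, access, pct_util.
    """
    return [
        vol["name"]
        for vol in volumes
        if vol["access"].lower() == "readonly" and vol["pct_util"] <= max_pct
    ]
```

A large count for one write-error message alongside a long list of barely used read-only volumes is the pattern that pointed away from read problems in this case.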

  12. The Case of the Black Hole --Troubleshooting • Fault Isolation • Failures appear to occur only on the server using fibre-accessed drives • SAN Data Gateway had out-of-support firmware • Hardware engineer found records of similar problems with old firmware in the database • Conclusion • Write problems are due to old SAN gateway firmware • Solution • Update firmware.

  13. The Case of the Red Herring -- Background • Filesystem backup on a Tru64 Server taking an excessively long time • Known Facts • Four-processor Tru64 client • Single filesystem, 167 MB in size, over 2,000,000 small files • Incremental backup taking over 24 hours, getting ANS1074I User abort error • Initial Premise: TSM client code problem • Focus immediately went to the User Abort error • Extended conversations with TSM Support had not isolated the problem

  14. The Case of the Red Herring -- Troubleshooting • The Suspects • Client Configuration • Client Tuning • Server Tuning • Network Bandwidth • Client Load • TSM Code • Initial Troubleshooting consisted of performing a monitored backup • After 28 hours, still going with no problems (about 50% complete)

  15. The Case of the Red Herring -- Troubleshooting • Inspection of the Client O/S revealed: • uptime showed that the machine had a persistent load average between 12 and 16, or about 3-4 per processor • vmstat output confirmed that the system was consistently CPU-bound, with CPU pegged at 100% (but no swapping, indicating no memory issues) • iostat showed no evidence of disk performance issues • The TSM client was using about 50% of one processor. • Inspection of client tuning revealed: • Client Compression turned ON • Conclusion • The client machine is overloaded, and the one TSM parameter that would make things worse is set incorrectly.
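
The "3-4 per processor" figure is simply the load average divided by the processor count. A small Python sketch of that arithmetic is shown here; the function names and the threshold of 1.0 runnable process per processor are assumptions for illustration, and `os.getloadavg()` is available only on Unix-like systems.

```python
import os


def load_per_processor(load_average, processors):
    """Divide a system load average by the number of processors."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return load_average / processors


def is_cpu_saturated(load_average, processors, threshold=1.0):
    """True when the per-processor load exceeds the (illustrative) threshold."""
    return load_per_processor(load_average, processors) > threshold


def current_load_per_processor():
    """Per-processor 1-minute load on the local machine (Unix-like systems only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_processor(one_minute, os.cpu_count() or 1)
```

With the figures from this case, `load_per_processor(12, 4)` gives 3.0 and `load_per_processor(16, 4)` gives 4.0, so several processes were queued for every processor before the backup client and its compression were added to the load.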

  16. The Case of the Hardware Headache--Background • Errors in the activity log when performing backups. • ANR8302E I/O error on drive • ANR8359E Media fault detected • Six possible causes immediately suggest themselves: • Dirty tape read/write heads • Out-dated Atape driver or firmware • Defective media • Defective or out-of-alignment tape drives • Defective hardware in the communication path. • A defect in TSM

  17. The Case of the Hardware Headache--Troubleshooting • Three possible causes could be logically eliminated: • Dirty tape read/write heads are unlikely in this case, as self-cleaning media is in use and the library is new • Out-dated Atape driver or firmware was quickly eliminated by checking the installed version • Defective media is possible, but a 30% failure rate with quality media is not to be believed. • Leaving us with three prime suspects • Defective or out-of-alignment tape drives • Defective hardware in the communication path. • A code defect.

  18. The Case of the Hardware Headache--Troubleshooting • Attempts to verify the tape drives using AIX’s mksysb utility seemed to indicate no defects • This cost some time and confusion. mksysb seems to ignore hardware errors, and usually results in a usable backup! • If attempting to make a non-TSM backup as a hardware test, use a utility like tar or cpio • Piecemeal hardware replacements eventually resulted in replacing the library SCSI card, after which functions returned to normal. • In retrospect, more attention should have been paid to the fact that both tape drives, plus their replacements, exhibited identical behavior.

  19. Summary • First, it helps to know what’s normal • How long should a backup take? • What is normal scratch tape usage? • Second, don’t assume all backup problems are a software issue. Sit down and list all the elements involved. • Third, follow a logical problem-solving sequence to eliminate possible causes.
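
Knowing what is normal can be made concrete by comparing each new measurement against a baseline built from past runs. The Python sketch below flags a backup duration (or scratch tape count) that is well above the historical median; the function name and the `factor` of 2.0 are assumptions chosen for illustration, not values from the presentation.

```python
from statistics import median


def is_abnormal(history, latest, factor=2.0):
    """True when latest exceeds factor times the median of past values.

    history: past measurements, e.g. backup durations in minutes.
    factor: illustrative multiplier; tune it to the environment.
    """
    if not history:
        raise ValueError("history must contain at least one value")
    return latest > factor * median(history)
```

In the Subtle Problem case, corrupted backups took about ten times longer than expected, so even a generous factor would have flagged them had durations been tracked.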
