pSeries Hardware Support Center Error Log Analysis and Coordination: 12 Years of rsense

The slides that follow (excepting this one) are meant for poster board display, to be arranged on a tri-fold poster. rsense: pSeries Support Center Error Log Decode/Analysis and Correlation. Dan Henderson, p and i Series Availability Lead.

Presentation Transcript


    1. pSeries Hardware Support Center Error Log Analysis and Coordination: 12 Years of rsense. Daniel J. Henderson, p and i Series HW Availability Lead

    4. Traditional HW Error Logging in IBM RS/6000 Systems. HW platform and device driver errors are logged in the OS error log. Information essential for repair is logged in customer/servicer-readable form: a Service Request Code (SRC) number for lookup in service publications; a general description of the type of failure; and FRU numbers telling what parts to replace for the failure. Detailed information, known as “sense data”, explaining the exact nature of the failure and the associated hardware state, is logged in an ASCII hex format. Error Log Analysis programs in the OS identified log entries and reported SRC and FRU callouts sufficient to direct service repair, but did very little decoding of sense data and very little, if any, correlation of multiple errors in the log to either modify the hardware action plan or threshold recoverable errors.

    5. Sample AIX Log Entry

    6. Support Center Error Log Analysis: dsense. In a hardware support center, decoding of sense data originally proceeded manually in order to: determine the pattern of errors across multiple systems to look for pervasive issues; and provide additional fault detection/isolation when the original hardware action plan did not satisfy customer needs. dsense was created in the early ’90s to automate the translation of hex bytes to give: a human-readable description of the pertinent data in each byte; a “bottom-line” analysis of each error log entry, creating a one-line description of each error; and a summarization of multiple errors using the one-line descriptions, giving a much more accurate picture of system behavior over multiple days and multiple log entries. dsense eventually shipped as part of AIX diagnostics to allow on-the-spot analysis of errors rather than waiting for data to be transmitted to a support center.
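The three dsense outputs described above (per-byte decode, one-line description, summary across entries) can be illustrated with a minimal Python sketch. The byte layout, the `ERROR_TYPES` table, and all function names are invented for illustration; the slides do not give any real sense data layout, and this is not dsense's actual implementation.

```python
from collections import Counter

# Hypothetical byte layout, invented for illustration: byte 0 is an error
# type code and byte 1 is a unit number. Real layouts are device-specific.
ERROR_TYPES = {0x01: "recoverable memory error", 0x02: "I/O bus timeout"}


def parse_sense(hex_text):
    """Turn ASCII hex sense data (e.g. '0203 00FF') into raw bytes."""
    return bytes.fromhex("".join(hex_text.split()))


def bottom_line(sense):
    """Return a one-line description of a single log entry."""
    kind = ERROR_TYPES.get(sense[0], f"unknown error type 0x{sense[0]:02X}")
    return f"{kind} on unit {sense[1]}"


def summarize(entries):
    """Count how often each one-line description occurs across a log."""
    return Counter(bottom_line(parse_sense(e)) for e in entries)


log = ["0203 0000", "0203 00FF", "0101 0000"]
for line, count in summarize(log).most_common():
    print(f"{count} x {line}")
```

Counting identical one-line descriptions is what turns many raw entries into a short picture of system behavior; a real decoder would carry far more per-byte detail than this sketch does.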

    7. pSeries Environment Error Log Analysis Challenge: rsense Response. In pSeries, a single hardware platform can host multiple OS images and I/O virtualization. On high-end systems, a hardware management console consolidates error logs from multiple OS images to report basic error information for service, but not detailed support center information. The requirements for detailed support center analysis are even greater than before. The rsense program was created concurrently with pSeries to provide that level of support center analysis. Its functionality is expanding with advances in partitioning and virtualization to provide: cross-OS and cross-platform summarization of multiple logs; correlation of log entries to modify parts replacement strategies; thresholding of soft errors; and pervasive issue detection.
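As one way to picture thresholding of soft errors, the sketch below flags a resource only when it logs more than a set number of recoverable errors inside a sliding time window. The function name, the event tuple shape, and the limit and window values are assumptions made for this example; the slides do not describe how rsense implements thresholding.

```python
from collections import defaultdict, deque


def over_threshold(events, limit, window):
    """Return resources with more than `limit` soft errors within `window` seconds.

    `events` is an iterable of (timestamp_seconds, resource) pairs sorted by
    time, for example merged from the logs of several OS images.
    """
    recent = defaultdict(deque)  # resource -> timestamps still inside the window
    flagged = set()
    for ts, resource in events:
        q = recent[resource]
        q.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while q and ts - q[0] >= window:
            q.popleft()
        if len(q) > limit:
            flagged.add(resource)
    return flagged


# Hypothetical resource names and times, for illustration only.
events = [(0, "dimm0"), (10, "dimm0"), (20, "dimm0"), (30, "dimm1"), (5000, "dimm1")]
print(over_threshold(events, limit=2, window=3600))
```

Here `dimm0` is flagged because three errors fall within one hour, while the two `dimm1` errors are too far apart to count together.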

    8. Same Log Filtered Through Rsense (Abbreviated, 1 of 3)

    9. Same Log Filtered Through Rsense (Abbreviated, 2 of 3)

    10. Same Log Filtered Through Rsense (Abbreviated, 3 of 3)

    11. rsense internals

    12. rsense Scripting Language

    13. Sample rsense Customized Summary

    14. Multiple Log Coordination: Application One. Common hardware is shared across two “systems” or operating system images. The shared hardware is unable to communicate error information directly, and any single OS instance is unable to localize the source of the fault. Error log coordination could determine whether the fault is with one node or the other, or with the device in the middle.
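A deliberately simplified Python sketch of this kind of coordination follows. The decision rule is an assumption made for illustration (errors seen by both images point at the shared device, errors seen by only one image point at that node); the slides do not state the rule rsense actually applies, and the function and label names are hypothetical.

```python
def localize(errors_a, errors_b):
    """Guess the fault location from two logs of errors against a shared device.

    `errors_a` and `errors_b` are the lists of error entries that OS image A
    and OS image B each logged against the shared hardware. The rule below is
    an assumed heuristic, not a documented rsense algorithm.
    """
    if errors_a and errors_b:
        return "shared device"  # both images see trouble: suspect the middle
    if errors_a:
        return "node A"         # only A complains: suspect A's side
    if errors_b:
        return "node B"         # only B complains: suspect B's side
    return "no fault observed"


print(localize(["timeout"], ["timeout"]))
print(localize(["timeout"], []))
```

Neither log alone supports the first conclusion; it only becomes available once the two logs are examined together.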

    15. Multiple Log Coordination: Application Two. A fault encountered at IPL must be coordinated with a previous run-time event. Graphically, the previous example log for a system showed:

    16. rsense in Product Engineering: PFA and Data Mining. In a support center, rsense scripts are easily written to mine called-home error log entries to investigate pervasive issues and to make ad hoc studies very quickly. Some advantages of using rsense over other scripting methods: in-depth analysis of sense data for many different platforms and devices has been written for pSeries; since rsense separates the decoding of the log format from the decoding of sense data, the same script can be written to analyze Linux, AIX and service processor firmware logs; and the built-in functions of rsense simplify the process of script writing.
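The separation of log format decoding from sense data decoding can be sketched in Python as follows. Both line formats, the `Record` type, and the meaning given to byte 0 are invented for this example; they are not the real AIX, Linux, or service processor log formats, and this is not rsense's scripting language.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str   # which log format the entry came from
    label: str    # error label
    sense: bytes  # raw sense data, decoded separately from the log format


# Both line formats below are made up for illustration only.
def parse_format_a(line):
    """Parse a hypothetical 'LABEL|HEX' line."""
    label, hex_text = line.split("|")
    return Record("A", label, bytes.fromhex(hex_text))


def parse_format_b(line):
    """Parse a hypothetical 'label=... sense=...' line."""
    fields = dict(item.split("=") for item in line.split())
    return Record("B", fields["label"], bytes.fromhex(fields["sense"]))


def decode_sense(record):
    """Format-independent decode (assumption: byte 0 is a severity code)."""
    return f"{record.label}: severity {record.sense[0]}"


records = [parse_format_a("MEM_ERR|02FF"), parse_format_b("label=BUS_ERR sense=01AA")]
for r in records:
    print(decode_sense(r))
```

Because every parser produces the same `Record`, the analysis step never needs to know which log the entry came from, which is the property the slide credits for letting one script serve several log types.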

    17. rsense Future. rsense continues to be enhanced to meet support center needs. Possible future activities for consideration: parsing and decoding the Service Action Event log of the Hardware Management Console, Summer 2005; support for Linux evlog as it becomes adopted for logging on pSeries; support for analysis of system soft errors as these are called home in pSeries; and incorporation of rsense capabilities for decoding logs generated for the xSeries product space.
