
Developing For The Windows Hardware Error Architecture

Developing For The Windows Hardware Error Architecture. John Strange Software Design Engineer Windows Kernel Microsoft Corporation. Outline. Windows Hardware Error Architecture (WHEA) Overview What does it mean to develop for WHEA? Implement Required Functionality


Presentation Transcript


  1. Developing For The Windows Hardware Error Architecture • John Strange, Software Design Engineer, Windows Kernel, Microsoft Corporation

  2. Outline • Windows Hardware Error Architecture (WHEA) Overview • What does it mean to develop for WHEA? • Implement Required Functionality • Extending WHEA with Platform Specific Hardware Error Driver (PSHED) Plug-ins • Implementing Firmware Error Handlers • Baseboard Management Controller (BMC) Interaction • Status and Roadmap • Call to Action

  3. Terminology • BERT – Boot Error Record Table • ERST – Error Record Serialization Table • Error Packet – Structure describing the error information extracted from a specific error source • Error Record – Full description of an error event • Error Source – Hardware resource that notifies software of errors • ETW – Event Tracing for Windows • HEST – Hardware Error Source Table • LLHEH – Low-Level Hardware Error Handler • PSHED – Platform-Specific Hardware Error Driver • WHEA – Windows Hardware Error Architecture • WMI – Windows Management Instrumentation

  4. WHEA – In A Nutshell • Common error record format • Management applications benefit • Pre-boot and out-of-band applications • Error source discovery • Fine-grained control of error sources • Common error handling flow • All hardware errors processed by same code path • Hardware error abstractions become operating system first-class citizens • Enables error source management

  5. WHEA – Overview [Architecture diagram. Components shown, from top to bottom: Management/Reporting Applications; ETW Error Notifications; WMI Management Interface; Kernel; HAL; PCI.SYS; LLHEH (two instances); Platform-Specific Hardware Error Driver; Plug-in; Hardware/Firmware. Legend ("Provided by"): Microsoft, ISV/IHV, Code Gen.]

  6. WHEA Error Flow [Flow diagram. Elements shown: WheaReportHwError; WHEA_ERROR_PACKET; PSHED; Plug-in; Build Error Record; Process Error; Corrected Errors; Schedule Work Item; Worker Thread; Send Notifications; Contained; Recovered; Persist Error Record; Bugcheck; End.]
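The flow diagram on slide 6 did not survive transcription, so the following is only one plausible reading of it, written as a small user-mode C model. Every type and function name here is hypothetical and simplified; none of them are the WDK definitions. It assumes that corrected errors are deferred to a worker thread and reported, that uncorrected errors are reported if recovery succeeds, and that anything else is persisted before a bugcheck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the WDK definitions. */
typedef enum { SEV_CORRECTED, SEV_UNCORRECTED } severity_t;

typedef struct {
    severity_t severity;
    bool recoverable;      /* assumption: recovery outcome is known up front */
} error_packet_t;

typedef enum { OUTCOME_NOTIFIED, OUTCOME_BUGCHECK } outcome_t;

typedef struct {
    int records_built;
    int work_items;
    int notifications;
    int records_persisted;
} flow_stats_t;

/* Models the path taken once a low-level handler reports an error packet. */
outcome_t report_hw_error(const error_packet_t *pkt, flow_stats_t *st)
{
    assert(pkt != NULL && st != NULL);
    st->records_built++;               /* PSHED and plug-in build the record */

    if (pkt->severity == SEV_CORRECTED) {
        st->work_items++;              /* deferred to a worker thread */
        st->notifications++;           /* worker sends notifications */
        return OUTCOME_NOTIFIED;
    }
    if (pkt->recoverable) {
        st->notifications++;           /* recovered: report and continue */
        return OUTCOME_NOTIFIED;
    }
    st->records_persisted++;           /* saved for processing on next boot */
    return OUTCOME_BUGCHECK;
}
```

The counters exist only so the model can be inspected; a real implementation would perform the corresponding work at each step.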

  7. Developing For WHEA • Implementing required functionality • Error Source Enumeration • Error Record Persistence • Error Injection • Adding value by extending WHEA with PSHED Plug-ins • Firmware Error Handlers • Error Reporting/Management Applications

  8. x86/x64 Default Error Source Support • Machine Check Exceptions • Corrected Machine Checks • Non-Maskable Interrupt • PCI Express AER • BOOT Error

  9. Itanium Default Error Source Support • Machine Check Exceptions • Corrected Machine Checks • Corrected Platform Errors • PCI Express AER • INIT

  10. Default Error Source Support • For complete details on the default error source support, including default parameters, implemented by the PSHED, see the PSHED Plug-in Developer’s Guide

  11. Required Functionality: Error source enumeration • PSHED implements support for default error sources • Configured with default parameters • Platform overrides default error sources only if necessary

  12. Required Functionality: Error source enumeration • If the default error source list or control parameters need to be augmented, there are two complementary ways to achieve this • Implement an ACPI HEST • Implement Error Source Discovery support in a PSHED Plug-In

  13. Required Functionality: Error record persistence • Platform must implement support for error record persistence • Read/write/clear error records from persistent store • Single unrecoverable error record written prior to bugchecking the system • WHEA reads, processes, and then clears any existing error records during the subsequent boot • Pre-boot and out-of-band (OOB) applications can also consume error records • WHEA requires that the platform implement support for writing at least one error record • x86/x64 requires at least 1 KB • Consider the limitations a 1 KB store may impose • Itanium requires at least 100 KB
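The read/write/clear cycle on slide 13 can be sketched as a minimal user-mode C model. The store here is an in-memory buffer standing in for non-volatile storage, and all names (record_store_t, store_write, process_on_boot, and so on) are hypothetical. Only the 1 KB capacity and the "read, process, then clear on the next boot" ordering come from the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STORE_CAPACITY 1024   /* x86/x64 minimum from the slide: 1 KB */

/* In-memory stand-in for a single-record persistent store. */
typedef struct {
    unsigned char data[STORE_CAPACITY];
    size_t length;            /* 0 means the store is empty */
} record_store_t;

bool store_write(record_store_t *s, const void *rec, size_t len)
{
    assert(s != NULL && rec != NULL);
    if (len == 0 || len > STORE_CAPACITY)
        return false;         /* record does not fit in the store */
    memcpy(s->data, rec, len);
    s->length = len;
    return true;
}

size_t store_read(const record_store_t *s, void *buf, size_t buf_len)
{
    assert(s != NULL && buf != NULL);
    if (s->length == 0 || s->length > buf_len)
        return 0;
    memcpy(buf, s->data, s->length);
    return s->length;
}

void store_clear(record_store_t *s)
{
    assert(s != NULL);
    s->length = 0;
}

/* Boot-time pass: read, process, then clear. Returns true if a record was found. */
bool process_on_boot(record_store_t *s, void (*process)(const void *, size_t))
{
    unsigned char buf[STORE_CAPACITY];
    size_t n = store_read(s, buf, sizeof buf);
    if (n == 0)
        return false;
    if (process != NULL)
        process(buf, n);
    store_clear(s);
    return true;
}
```

Because the store holds a single record, a second fatal error before the next boot would overwrite the first; that is one of the limitations a small store implies.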

  14. Required Functionality: Error record persistence • x86/x64 • Preferred solution: Implement support for PSHED’s hardware persistence interface • Implement ACPI ERST • Implement PSHED Plug-In • Itanium • Preferred solution: Implement error record serialization support in Get/SetVariable • Modify the EFI GetVariable/SetVariable routines so they can support error record persistence • Implement PSHED Plug-In • WHEA error record persistence model coexists with Itanium SAL error record persistence

  15. Hardware Serialization

  16. Hardware Serialization: Error log address range • Range of memory specifically designated to be used for error record serialization • INT 15H E820H on PCAT • Identified as AddressRangeMemory with new extended attribute AddressRangeErrorLog (0x03) • GetMemoryMap() boot services function on EFI-based systems • Identified via an Extended Address Space Descriptor with new resource-type-specific attribute called ACPI_MEMORY_LOG (0x0000000000002000) • Range must be large enough to hold at least one error record • WHEA ACPI constructs are expected to be part of the standard, but they are currently provisional

  17. Hardware Serialization: ERST • Describes the platform’s error record serialization (i.e. persistence) interface to the operating system • Serialization instruction entries describe how to execute the required serialization actions

  18. Hardware Serialization: Serialization action entry

  19. Hardware Serialization: Sample action entries
      Command Register: 0xFEA0 (IO)
      Status Register: 0x00000000AAFF0000 (Memory Mapped)
      {
        0x00, // BEGIN_WRITE_OPERATION
        0x03, // WRITE_REGISTER_VALUE
        0x00, // Flags
        0x00, // Reserved
        0x00, 0x01, 0x00, 0x03,
        0x00000000AAFF0000,
        0x0000000000000080,
        0x00000000000000FF
      }
      {
        0x05, // EXECUTE_OPERATION
        0x03, // WRITE_REGISTER_VALUE
        0x00, // Flags
        0x00, // Reserved
        0x01, 0x01, 0x00, 0x03,
        0x000000000000FEA0,
        0x0000000000000040,
        0x00000000000000FF
      }
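To make the sample entries on slide 19 more concrete, here is a hedged C sketch that gives the fields names and "executes" a WRITE_REGISTER_VALUE entry against simulated registers. The field names are assumptions: the four unlabeled bytes are read as an ACPI Generic Address Structure header (address space ID, bit width, bit offset, access size), and the last three values as address, value, and mask. The register table and execute_entry are hypothetical helpers; no real hardware or firmware is touched, and the write semantics are simplified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WRITE_REGISTER_VALUE = 0x03 };           /* instruction code from the slide */
enum { SPACE_MEMORY = 0x00, SPACE_IO = 0x01 };  /* ACPI address space IDs */

/* Assumed field layout for one serialization instruction entry. */
typedef struct {
    uint8_t  action;         /* e.g. 0x00 BEGIN_WRITE_OPERATION */
    uint8_t  instruction;    /* e.g. 0x03 WRITE_REGISTER_VALUE */
    uint8_t  flags;
    uint8_t  reserved;
    uint8_t  address_space;  /* assumed: Generic Address Structure fields */
    uint8_t  bit_width;
    uint8_t  bit_offset;
    uint8_t  access_size;
    uint64_t address;
    uint64_t value;
    uint64_t mask;
} instruction_entry_t;

/* Simulated register, keyed by address space and address. */
typedef struct {
    uint8_t  space;
    uint64_t address;
    uint64_t value;
} sim_register_t;

/* Executes one entry against the simulated registers.
 * Returns 0 on success, -1 for an unsupported instruction,
 * -2 if no simulated register matches the entry. */
int execute_entry(const instruction_entry_t *e, sim_register_t *regs, size_t n)
{
    assert(e != NULL && regs != NULL);
    if (e->instruction != WRITE_REGISTER_VALUE)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (regs[i].space == e->address_space && regs[i].address == e->address) {
            /* Simplified: write the masked value at the given bit offset. */
            regs[i].value = (e->value & e->mask) << e->bit_offset;
            return 0;
        }
    }
    return -2;
}
```

Under this reading, the first sample entry writes 0x80 to the memory-mapped register at 0xAAFF0000 and the second writes 0x40 to the I/O port at 0xFEA0.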

  20. Required Functionality: Error injection • WHEA requires that platforms implement support for injecting at least one corrected and one uncorrected error • This interface is used for end-to-end error handling flow validation (operating system code, system firmware, plug-ins) • Itanium implements built-in error injection support • x86/x64 requires a PSHED Plug-In for error injection • This Plug-In should not ship to customers

  21. Extending WHEA With PSHED Plug-Ins • Error Source Enumeration • Error Source Control • Error Record Persistence • Error Information Retrieval • Error Recovery • Error Injection

  22. PSHED Plug-Ins: Error source enumeration • GetAllErrorSources allows the Plug-In to supply error source information • 1. Kernel calls PSHED (PshedGetAllErrorSources) • 2. PSHED creates the error source table • 3. PSHED calls the Plug-In (GetAllErrorSources); the Plug-In augments the error source table as necessary
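The three steps on slide 22 can be sketched as a callback that receives a table already populated with defaults and appends to it. This is a user-mode illustration with hypothetical types and names (source_table_t, sample_plugin_get_all, the source IDs); it is not the WDK plug-in interface, and the default entries are just three of the x86/x64 sources listed on slide 8.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SOURCES 16

/* Hypothetical, simplified error source table. */
typedef struct { int id; const char *name; } error_source_t;
typedef struct { error_source_t entries[MAX_SOURCES]; size_t count; } source_table_t;

/* Shape of the plug-in callback in this model. */
typedef int (*get_all_sources_fn)(source_table_t *table);

static int table_add(source_table_t *t, int id, const char *name)
{
    if (t->count >= MAX_SOURCES)
        return -1;
    t->entries[t->count].id = id;
    t->entries[t->count].name = name;
    t->count++;
    return 0;
}

/* Hypothetical plug-in: adds one platform-specific source. */
int sample_plugin_get_all(source_table_t *t)
{
    return table_add(t, 100, "Chipset error source (hypothetical)");
}

/* Steps 2 and 3: build the default table, then let the plug-in augment it. */
int pshed_get_all_error_sources(source_table_t *t, get_all_sources_fn plugin)
{
    assert(t != NULL);
    t->count = 0;
    table_add(t, 0, "Machine Check Exception");
    table_add(t, 1, "Corrected Machine Check");
    table_add(t, 2, "Non-Maskable Interrupt");
    return plugin != NULL ? plugin(t) : 0;
}
```

Passing NULL for the plug-in models a platform that relies on the defaults alone.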

  23. PSHED Plug-Ins: Error source enumeration • GetErrorSourceInfo allows the Plug-In to supply information about a specific error source • 1. PCI Bus Driver creates an error source descriptor • 2. PCI Bus Driver calls PSHED (PshedGetErrorSourceInfo) • 3. PSHED calls the Plug-In (GetErrorSourceInfo); the Plug-In supplies the error source info • If there is no Plug-In, default error info is returned to the caller • If the Plug-In does not supply information for the specified error source, it returns an error and PSHED tries to fulfill the request

  24. PSHED Plug-Ins: Error source control • Start/StopErrorSource allows the Plug-In to carry out the operations associated with starting/stopping a given error source • 1. Kernel calls PSHED to start/stop a given error source, passing its error source descriptor (PshedStart/StopErrorSource) • 2. PSHED delegates the start/stop operation to the Plug-In (Start/StopErrorSource); the Plug-In interacts with the error source as necessary to start/stop it • If the Plug-In cannot start/stop the specified error source, it returns an error and PSHED will try to start/stop it
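Slides 23 through 25 share one pattern: PSHED offers the operation to the Plug-In first and falls back to its own handling if the Plug-In returns an error. A minimal sketch of that delegation for the start operation follows. The types, status codes, and the rule that the sample plug-in only handles source 100 are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { STATUS_OK = 0, STATUS_NOT_HANDLED = -1 } status_t;

/* Hypothetical error source descriptor. */
typedef struct {
    int  id;
    bool running;
    bool started_by_plugin;
} source_desc_t;

typedef status_t (*start_source_fn)(source_desc_t *src);

/* Hypothetical plug-in: only knows how to start source 100. */
status_t sample_plugin_start(source_desc_t *src)
{
    if (src->id != 100)
        return STATUS_NOT_HANDLED;
    src->running = true;
    src->started_by_plugin = true;
    return STATUS_OK;
}

/* Default handling used when the plug-in declines or is absent. */
static status_t pshed_default_start(source_desc_t *src)
{
    src->running = true;
    src->started_by_plugin = false;
    return STATUS_OK;
}

/* Delegate to the plug-in first; fall back to the default on error. */
status_t pshed_start_error_source(source_desc_t *src, start_source_fn plugin)
{
    assert(src != NULL);
    if (plugin != NULL && plugin(src) == STATUS_OK)
        return STATUS_OK;
    return pshed_default_start(src);
}
```

The same shape would apply to stop and control operations, with a different callback type per operation.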

  25. PSHED Plug-Ins: Error source control • ErrorSourceControl allows the Plug-In to carry out control operations for a given error source (set error thresholds, set error severity levels, etc.) • 1. Management application invokes a WMI method through the kernel's WMI interface • 2. Kernel calls PSHED with the error source descriptor and control packet (PshedErrorSourceControl) • 3. PSHED delegates the control operation to the Plug-In (ErrorSourceControl); the Plug-In interacts with the error source as necessary to carry out the control operation • If the Plug-In cannot perform the control operation on the specified error source, it returns an error and PSHED will try

  26. PSHED Plug-Ins: Error record persistence • WriteErrorRecord: Plug-In writes an error record to the persistent store • 1. Kernel calls PSHED with the error record (PshedWriteErrorRecord) • 2. PSHED calls the Plug-In (WriteErrorRecord) • 3. Plug-In writes the error record to the non-volatile store; the Plug-In may invoke firmware or execute hardware commands

  27. PSHED Plug-Ins: Error record persistence • ReadErrorRecord: Plug-In reads an error record from the persistent store • 1. Kernel calls PSHED with a buffer (PshedReadErrorRecord) • 2. PSHED calls the Plug-In (ReadErrorRecord) • 3. Plug-In reads the specified error record from the non-volatile store and copies the record into the supplied buffer; the Plug-In may invoke firmware or execute hardware commands

  28. PSHED Plug-Ins: Error record persistence • ClearErrorRecord: Plug-In clears an error record from the persistent store • 1. Kernel calls PSHED (PshedClearErrorRecord) • 2. PSHED calls the Plug-In (ClearErrorRecord) • 3. Plug-In clears the specified error record from the non-volatile store; the Plug-In may invoke firmware or execute hardware commands

  29. PSHED Plug-Ins: Error information retrieval • RetrieveErrorInfo: Plug-In retrieves error information and populates the error packet • 1. LLHEH extracts error data and creates an error packet • 2. LLHEH calls PSHED (PshedRetrieveErrorInfo) • 3. PSHED calls the Plug-In (RetrieveErrorInfo); the Plug-In augments/extends the error packet as necessary using its knowledge of the hardware

  30. PSHED Plug-Ins: Error information retrieval • FinalizeErrorRecord: Plug-In can add additional error sections to the error record • 1. Kernel creates the error record using info in the error packet • 2. Kernel calls PSHED (PshedFinalizeErrorRecord) • 3. PSHED calls the Plug-In (FinalizeErrorRecord); the Plug-In adds sections to the error record as necessary to fully describe the error condition

  31. PSHED Plug-Ins: Error information retrieval • ClearErrorStatus allows the Plug-In to clear private error status • 1. Kernel calls PSHED (PSHEDClearErrorStatus) • 2. PSHED calls the Plug-In (ClearErrorStatus); the Plug-In clears any private error status that it is using and/or responsible for

  32. PSHED Plug-Ins: Error recovery • AttemptErrorRecovery allows the Plug-In to perform any platform-level operations required to recover from the specified error condition • 1. Kernel calls PSHED with the error record (PSHEDAttemptErrorRecovery) • 2. PSHED may attempt recovery • 3. PSHED calls the Plug-In (AttemptErrorRecovery); the Plug-In attempts recovery and returns success/failure to PSHED • 4. If recovery was successful, PSHED updates the error record to inform the kernel and prevent a bugcheck
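Step 4 on slide 32 is the important one: the kernel decides whether to bugcheck by looking at the error record after the recovery attempt. A hedged sketch follows, with a hypothetical record type that carries only a severity field; PSHED's own recovery attempt in step 2 is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced error record: severity is all this model tracks. */
typedef enum { RECORD_FATAL, RECORD_RECOVERED } record_severity_t;
typedef struct { record_severity_t severity; } error_record_t;

/* Plug-in callback in this model: returns true if recovery succeeded. */
typedef bool (*attempt_recovery_fn)(const error_record_t *rec);

/* Steps 3 and 4: ask the plug-in, then update the record on success. */
bool pshed_attempt_error_recovery(error_record_t *rec, attempt_recovery_fn plugin)
{
    assert(rec != NULL);
    bool recovered = (plugin != NULL) && plugin(rec);
    if (recovered)
        rec->severity = RECORD_RECOVERED;   /* informs the kernel */
    return recovered;
}

/* Kernel side: bugcheck only if the record is still marked fatal. */
bool kernel_must_bugcheck(const error_record_t *rec)
{
    assert(rec != NULL);
    return rec->severity == RECORD_FATAL;
}

/* Hypothetical plug-ins used to exercise both outcomes. */
bool plugin_recovery_succeeds(const error_record_t *rec) { (void)rec; return true; }
bool plugin_recovery_fails(const error_record_t *rec)    { (void)rec; return false; }
```

Keeping the decision in the record, rather than in a return value alone, means the recovered status also travels with the record to logging and management consumers.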

  33. PSHED Plug-Ins: Error injection • GetErrorInjectionCaps/InjectError: the Plug-In is responsible for carrying out these operations • 1. Management application invokes a WMI method through the kernel's WMI interface • 2. Kernel calls PSHED (PshedGetInjectionCapabilities/PshedInjectError) • 3. PSHED delegates the operation to the Plug-In (GetInjectionCapabilities/InjectError) • For a capabilities query, the Plug-In returns the error injection capabilities of the platform • For an injection request, the Plug-In interacts with the error source as necessary to cause the specified error to occur
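The capabilities query and the injection request on slide 33 pair naturally: a caller checks what the platform can inject before asking for it. The sketch below models capabilities as a bitmask. The flag names, the callback table, and the sample plug-in are hypothetical; the only requirement taken from the slides is that at least one corrected and one uncorrected error must be injectable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability flags. */
enum { INJECT_CORRECTED = 1u << 0, INJECT_UNCORRECTED = 1u << 1 };

/* Hypothetical callback table a plug-in would supply. */
typedef struct {
    unsigned (*get_caps)(void);
    int (*inject)(unsigned error_type);
} injection_plugin_t;

/* Sample plug-in that records the last requested injection. */
static unsigned last_injected;
static unsigned sample_get_caps(void) { return INJECT_CORRECTED | INJECT_UNCORRECTED; }
static int sample_inject(unsigned error_type) { last_injected = error_type; return 0; }

const injection_plugin_t sample_plugin = { sample_get_caps, sample_inject };

/* Slide 20 requirement: one corrected and one uncorrected error. */
bool meets_injection_minimum(const injection_plugin_t *p)
{
    assert(p != NULL && p->get_caps != NULL);
    unsigned both = INJECT_CORRECTED | INJECT_UNCORRECTED;
    return (p->get_caps() & both) == both;
}

/* Rejects requests the plug-in did not advertise; returns -1 in that case. */
int inject_error(const injection_plugin_t *p, unsigned error_type)
{
    assert(p != NULL && p->get_caps != NULL && p->inject != NULL);
    if (error_type == 0 || (p->get_caps() & error_type) != error_type)
        return -1;
    return p->inject(error_type);
}
```

Checking capabilities before injecting keeps a validation tool from asking a platform for an error type it cannot produce.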

  34. Firmware Error Handlers • Some errors may be handled in firmware • Errata management • Error containment • If the platform must handle any of the required error sources in firmware, it must implement a mechanism for notifying the operating system of the error subsequent to processing • It must implement an alternative error source for the operating system • Operating system can be notified via an interrupt or by configuring a polling mechanism • This may require an accompanying PSHED Plug-In to interact with this error source

  35. Error Reporting And Management Interface • Error events are reported to applications via Event Tracing for Windows (ETW) • Microsoft will implement a logging service that listens for error events and writes entries to the system event log • WHEA Whitepaper ties together reporting and management scenarios • Two WMI Classes defined for WHEA management • Error Injection Interface • Error Source Interface

  36. WMI Management Interface • WHEAErrorInjectionMethods • GetErrorInjectionCapabilities • InjectError • WHEAErrorSourceMethods • GetAllErrorSources • GetErrorSourceInfo • SetErrorSourceInfo

  37. BMC Interaction • WHEA co-exists with BMC-based error handling/reporting solutions • Duplicate error events may show up in the event log • WHEA-aware management applications will consume WHEA events • Existing management applications that depend on BMC-originated events can continue to consume their respective events • Management applications can be consumers of both WHEA events and BMC-originated events • Tighter and better integration moving forward

  38. Status And Roadmap • Core of Windows Vista hardware error handling is based on WHEA • Implements native support for PCI Express AER • Error log entries use the common error record format • ETW-based logging replaces WMI-based error log entries • Windows Server codenamed “Longhorn” • Core operating system implementation nearing completion • Validating hardware vendor error record persistence and error source discovery implementations • Validating hardware vendor PSHED Plug-Ins • Future • Standardize error record format • Standardize ACPI mechanisms • Working with processor vendors on better error reporting architectures • Working with hardware vendors on better OS/firmware integration architectures • Extend WHEA to endpoint device stacks

  39. Call To Action • Work with us to get builds and resources needed for developing PSHED Plug-Ins • Work with us to validate WHEA platform support • Provide BIOS with WHEA ACPI support • Provide BIOS with error record persistence support • Work with us to develop WHEA-based management applications • Powerful health monitoring and error recovery features are possible with the common error record format and ETW • Error source control applications allow for very fine-grained control over error source operational parameters • Pre-boot and out-of-band error record processing applications

  40. Additional Resources • Specifications • WHEA Whitepaper • PSHED Plug-In Developer’s Guide • PSHED Interface Specification • WHEA Error Record Persistence Specification • WHEA BOOT Error Record Specification • WHEA ACPI Specification • Longhorn Server WHEA Logo Requirements • Related WinHEC Presentations • CPA070 – PCI Express in Depth for Windows Vista and Beyond • SER121 – Windows Server Platform Directions • Send feedback/comments/questions to our feedback alias wheafb @ microsoft.com

  41. © 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
