Developing For The Windows Hardware Error Architecture. John Strange, Software Design Engineer, Windows Kernel, Microsoft Corporation.
Outline • Windows Hardware Error Architecture (WHEA) Overview • What does it mean to develop for WHEA? • Implement Required Functionality • Extending WHEA with Platform Specific Hardware Error Driver (PSHED) Plug-ins • Implementing Firmware Error Handlers • Baseboard Management Controller (BMC) Interaction • Status and Roadmap • Call to Action
Terminology • BERT – Boot Error Record Table • ERST – Error Record Serialization Table • Error Packet – Structure describing the error information extracted from a specific error source • Error Record – Full description of an error event • Error Source – Hardware resource that notifies software of errors • ETW – Event Tracing For Windows • HEST – Hardware Error Source Table • LLHEH – Low-level Hardware Error Handler • PSHED – Platform-Specific Hardware Error Driver • WHEA – Windows Hardware Error Architecture • WMI – Windows Management Instrumentation
WHEA – In A Nutshell • Common error record format • Management applications benefit • Pre-boot and out-of-band applications • Error source discovery • Fine-grained control of error sources • Common error handling flow • All hardware errors processed by same code path • Hardware error abstractions become operating system first-class citizens • Enables error source management
WHEA – Overview (architecture diagram) • Legend (provided by): Microsoft, ISV/IHV, Code Gen • Components shown, top to bottom: Management/Reporting Applications; ETW Error Notifications and WMI Management Interface; Kernel; HAL and PCI.SYS, with two LLHEH instances; Platform-Specific Hardware Error Driver and Plug-in; Hardware/Firmware
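WHEA Error Flow (diagram) • Entry point: WheaReportHwError, with a WHEA_ERROR_PACKET • Processing stages shown: PSHED and Plug-in, Build Error Record, Process Error • Decision points shown: Corrected Errors, Contained, Recovered • Deferred steps shown: Schedule Work Item, Worker Thread, Send Notifications, End • Fatal steps shown: Persist Error Record, Bugcheck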
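The arrows of the flow diagram did not survive in this transcript, so the following is only one plausible reading of it, written as a small Python simulation. The class names, the fields of the packet, and the exact ordering of steps are assumptions made for illustration. Only WheaReportHwError and WHEA_ERROR_PACKET are names taken from the slide, and the "Contained" decision is left out because its outcome cannot be recovered from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    CORRECTED = "corrected"
    UNCORRECTED = "uncorrected"


@dataclass
class ErrorPacket:
    """Stand-in for WHEA_ERROR_PACKET; the fields are illustrative only."""
    source: str
    severity: Severity
    recoverable: bool = False


@dataclass
class Platform:
    """Records what the simulated flow did, in order."""
    log: list = field(default_factory=list)

    def report_hw_error(self, packet: ErrorPacket) -> str:
        # Stand-in for the WheaReportHwError entry point named on the slide.
        self.log.append("build error record")
        if packet.severity is Severity.CORRECTED:
            # Assumed: corrected errors are deferred to a worker thread.
            self.log.append("schedule work item")
            self.log.append("send notifications")
            return "end"
        if packet.recoverable:
            # Assumed: a recovered error is reported but does not bugcheck.
            self.log.append("mark recovered")
            self.log.append("send notifications")
            return "end"
        # Not recovered: save the record so the next boot can report it.
        self.log.append("persist error record")
        return "bugcheck"
```

For example, `Platform().report_hw_error(ErrorPacket("MCE", Severity.UNCORRECTED))` persists the record and returns `"bugcheck"`, while a corrected error ends with notifications being sent.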
Developing For WHEA • Implementing required functionality • Error Source Enumeration • Error Record Persistence • Error Injection • Adding value by extending WHEA with PSHED Plug-ins • Firmware Error Handlers • Error Reporting/Management Applications
x86/x64 Default Error Source Support • Machine Check Exceptions • Corrected Machine Checks • Non-Maskable Interrupt • PCI Express AER • BOOT Error
Itanium Default Error Source Support • Machine Check Exceptions • Corrected Machine Checks • Corrected Platform Errors • PCI Express AER • INIT
Default Error Source Support • For complete details on the default error source support, including default parameters, implemented by the PSHED, see the PSHED Plug-in Developer’s Guide
Required Functionality: Error source enumeration • PSHED implements support for default error sources • Configured with default parameters • Platform overrides default error sources only if necessary
Required Functionality: Error source enumeration • If default error source list or control parameters need to be augmented, there are two complementary ways to achieve this • Implement an ACPI HEST • Implement Error Source Discovery support in a PSHED Plug-In
Required Functionality: Error record persistence • Platform must implement support for error record persistence • Read/write/clear error records from persistent store • Single unrecoverable error record written prior to bugchecking system • WHEA reads, processes, and then clears any existing error records during subsequent boot • Pre-boot and OOB applications can also consume error records • WHEA requires that platform implement support for writing at least one error record • x86/x64 requires at least 1 KB • Consider possible limitations on 1 KB • Itanium requires at least 100 KB
Required Functionality: Error record persistence • x86/x64 • Preferred solution: Implement support for PSHED’s hardware persistence interface • Implement ACPI ERST • Implement PSHED Plug-In • Itanium • Preferred solution: Implement error record serialization support in Get/SetVariable • Modify the EFI GetVariable/SetVariable routines so they can support error record persistence • Implement PSHED Plug-In • WHEA error record persistence model coexists with Itanium SAL error record persistence
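The persistence life cycle described on these slides (write one record before the bugcheck, then read, process, and clear it on the next boot) can be sketched with a toy in-memory store. This is a minimal Python model, not the PSHED interface: the class name, method names, and record IDs are hypothetical, and the 1 KB capacity simply mirrors the x86/x64 minimum quoted above.

```python
from typing import Dict, List, Optional


class ErrorRecordStore:
    """Toy persistent store that holds serialized error records by ID."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self._records: Dict[int, bytes] = {}

    def used_bytes(self) -> int:
        return sum(len(r) for r in self._records.values())

    def write(self, record_id: int, record: bytes) -> bool:
        # Reject a record that would not fit; overwriting an existing ID
        # frees the space that ID was using.
        existing = len(self._records.get(record_id, b""))
        if self.used_bytes() - existing + len(record) > self.capacity_bytes:
            return False
        self._records[record_id] = record
        return True

    def read(self, record_id: int) -> Optional[bytes]:
        return self._records.get(record_id)

    def clear(self, record_id: int) -> None:
        self._records.pop(record_id, None)

    def record_ids(self) -> List[int]:
        return sorted(self._records)


def process_records_at_boot(store: ErrorRecordStore) -> List[bytes]:
    """Read, process, and then clear every record left from the previous boot."""
    processed = []
    for record_id in store.record_ids():
        record = store.read(record_id)
        if record is not None:
            processed.append(record)  # "processing" is just collecting here
        store.clear(record_id)
    return processed
```

With a 1024-byte store, a 600-byte record fits but a second 600-byte record is rejected, which is the kind of limitation the "Consider possible limitations on 1 KB" bullet hints at.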
Hardware Serialization: Error log address range • Range of memory specifically designated to be used for error record serialization • INT 15H E820H on PCAT • Identified as AddressRangeMemory with new extended attribute AddressRangeErrorLog (0x03) • GetMemoryMap() boot services function on EFI-based systems • Identified via an Extended Address Space Descriptor with new resource type specific attribute called ACPI_MEMORY_LOG (0x0000000000002000) • Range must be large enough to hold at least one error record • WHEA ACPI constructs are expected to be part of the standard, but they are currently provisional
Hardware Serialization: ERST • Describes platform’s error record serialization (i.e. persistence) interface to the operating system • Serialization instruction entries describe how to execute the required serialization actions
Hardware Serialization: Sample action entries
Command Register: 0xFEA0 (IO)
Status Register: 0x00000000AAFF0000 (Memory Mapped)
{
    0x00, // BEGIN_WRITE_OPERATION
    0x03, // WRITE_REGISTER_VALUE
    0x00, // Flags
    0x00, // Reserved
    0x00, 0x01, 0x00, 0x03,
    0x00000000AAFF0000,
    0x0000000000000080,
    0x00000000000000FF
}
{
    0x05, // EXECUTE_OPERATION
    0x03, // WRITE_REGISTER_VALUE
    0x00, // Flags
    0x00, // Reserved
    0x01, 0x01, 0x00, 0x03,
    0x000000000000FEA0,
    0x0000000000000040,
    0x00000000000000FF
}
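To make the idea of "serialization instruction entries" concrete, here is a rough Python sketch that interprets the two sample entries above against a fake register file. The action and instruction codes come from the comments in the sample. Everything else is an assumption for illustration: the field names, reading the first unlabeled byte as an address-space selector (0 for memory mapped, 1 for I/O, matching the register descriptions on the slide), ignoring the three bytes that follow it, and treating the last two values as value and mask.

```python
from dataclasses import dataclass

# Codes taken from the comments in the sample entries.
BEGIN_WRITE_OPERATION = 0x00
EXECUTE_OPERATION = 0x05
WRITE_REGISTER_VALUE = 0x03

# Assumed address-space selectors (0 = memory mapped, 1 = I/O).
SPACE_MEMORY = 0x00
SPACE_IO = 0x01


@dataclass(frozen=True)
class ActionEntry:
    action: int
    instruction: int
    flags: int
    space: int
    address: int
    value: int
    mask: int


SAMPLE_ENTRIES = [
    ActionEntry(BEGIN_WRITE_OPERATION, WRITE_REGISTER_VALUE, 0x00,
                SPACE_MEMORY, 0x00000000AAFF0000, 0x80, 0xFF),
    ActionEntry(EXECUTE_OPERATION, WRITE_REGISTER_VALUE, 0x00,
                SPACE_IO, 0x000000000000FEA0, 0x40, 0xFF),
]


def run_action(entries, action, registers):
    """Execute every entry listed for `action` against a fake register file."""
    for entry in entries:
        if entry.action != action:
            continue
        if entry.instruction != WRITE_REGISTER_VALUE:
            raise NotImplementedError(hex(entry.instruction))
        # Assumed semantics: only the bits selected by the mask are written.
        registers[(entry.space, entry.address)] = entry.value & entry.mask
    return registers
```

Running `run_action(SAMPLE_ENTRIES, BEGIN_WRITE_OPERATION, {})` writes 0x80 to the memory-mapped register at 0xAAFF0000, and a later EXECUTE_OPERATION writes 0x40 to I/O port 0xFEA0. The point of the table-driven design is that the operating system needs no platform-specific code: the firmware describes each action as data.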
Required Functionality: Error injection • WHEA requires that platforms implement support for injecting at least one corrected and one uncorrected error • This interface is used for end-to-end error handling flow validation (operating system code, system firmware, plug-ins) • Itanium implements built-in error injection support • x86/x64 requires a PSHED Plug-In for error injection • This Plug-In should not ship to customers
Extending WHEA With PSHED Plug-Ins • Error Source Enumeration • Error Source Control • Error Record Persistence • Error Information Retrieval • Error Recovery • Error Injection
PSHED Plug-Ins: Error source enumeration • GetAllErrorSources: Allows Plug-In to supply error source information • 1. Kernel calls PSHED (PshedGetAllErrorSources) • 2. PSHED creates error source table • 3. PSHED calls Plug-In (GetAllErrorSources) • Plug-In augments the error source table as necessary • Diagram components: Kernel, PSHED, Error Source Table, Plug-In
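The three enumeration steps can be sketched as follows. This is a simplified Python model rather than the real PSHED interface: the `ErrorSource` type, the callback signature, and the extra "Chipset Memory Controller" source are hypothetical, while the default list is copied from the x86/x64 default error source slide.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ErrorSource:
    name: str
    enabled: bool = True


# Default x86/x64 sources as listed on the earlier slide.
DEFAULT_X86_SOURCES = [
    "Machine Check Exceptions",
    "Corrected Machine Checks",
    "Non-Maskable Interrupt",
    "PCI Express AER",
    "BOOT Error",
]

PlugIn = Callable[[List[ErrorSource]], None]


def get_all_error_sources(plug_in: Optional[PlugIn] = None) -> List[ErrorSource]:
    """Build the default table, then let the plug-in (if any) augment it."""
    table = [ErrorSource(name) for name in DEFAULT_X86_SOURCES]
    if plug_in is not None:
        plug_in(table)
    return table


def example_plug_in(table: List[ErrorSource]) -> None:
    # Hypothetical platform-specific source, added for illustration.
    table.append(ErrorSource("Chipset Memory Controller"))
```

Calling `get_all_error_sources()` with no plug-in returns only the five defaults; passing `example_plug_in` returns the same table with one platform-specific entry appended.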
PSHED Plug-Ins: Error source enumeration • GetErrorSourceInfo: Allows Plug-In to supply info about a specific error source • 1. PCI Bus Driver creates descriptor • 2. PCI Bus Driver calls PSHED (PshedGetErrorSourceInfo) • 3. PSHED calls Plug-In (GetErrorSourceInfo) • Plug-In supplies error source info • If no Plug-In, default error info is returned to the caller • If Plug-In does not supply information for the specified error source, it returns an error and PSHED tries to fulfill the request • Diagram components: PCI Bus Driver, Error Source Descriptor, PSHED, Plug-In
PSHED Plug-Ins: Error source control • Start/StopErrorSource: Allows Plug-In to carry out the operations associated with starting/stopping a given error source • 1. Kernel calls PSHED to start/stop a given error source (PshedStart/StopErrorSource) • 2. PSHED delegates start/stop operation to the Plug-In (Start/StopErrorSource) • Plug-In interacts with error source as necessary to start/stop it • If Plug-In cannot start/stop the specified error source, it returns an error and PSHED will try to start/stop it • Diagram components: Kernel, Error Source Descriptor, PSHED, Plug-In
PSHED Plug-Ins: Error source control • ErrorSourceControl: Allows Plug-In to carry out control operations associated with a given error source (set error thresholds, set error severity levels, etc.) • 1. Management application invokes WMI method • 2. Kernel calls PSHED (PshedErrorSourceControl) • 3. PSHED delegates control operation to the Plug-In (ErrorSourceControl) • Plug-In interacts with error source as necessary to carry out control operation • If Plug-In cannot perform control operation on the specified error source, it returns an error and PSHED will try to perform it • Diagram components: Management Application, WMI Interface, Kernel, Error Source Descriptor, Control Pkt, PSHED, Plug-In
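The enumeration and control slides share one pattern: PSHED offers the operation to the Plug-In first and falls back to its own handling when the Plug-In returns an error. A minimal Python sketch of that delegation follows. The function names, the boolean success convention, and the example source names are assumptions for illustration and do not reflect the actual callback signatures.

```python
from typing import Callable, Optional

Handler = Callable[[str], bool]  # returns True when the operation succeeded


def delegate(source: str, plug_in: Optional[Handler], default: Handler) -> str:
    """Try the plug-in first; fall back to the built-in handler on failure."""
    if plug_in is not None and plug_in(source):
        return "plug-in"
    if default(source):
        return "pshed"
    return "failed"


def pshed_default_start(source: str) -> bool:
    # Assumption: the built-in code only knows some default error sources.
    return source in {"Machine Check Exceptions", "PCI Express AER"}


def plug_in_start(source: str) -> bool:
    # Hypothetical plug-in that only handles one platform-specific source.
    return source == "Chipset Memory Controller"
```

Here `delegate("PCI Express AER", plug_in_start, pshed_default_start)` reports that PSHED handled the request after the plug-in declined it, and the same call with `plug_in=None` shows that a platform without a plug-in still gets default behavior.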
PSHED Plug-Ins: Error record persistence • WriteErrorRecord: Plug-In writes error record to persistent store • 1. Kernel calls PSHED (PshedWriteErrorRecord) • 2. PSHED calls Plug-In (WriteErrorRecord) • 3. Plug-In writes the error record to non-volatile store • Plug-In may invoke firmware or execute hardware commands • Diagram components: Kernel, Error Record, PSHED, Plug-In, Persistent Error Record
PSHED Plug-Ins: Error record persistence • ReadErrorRecord: Plug-In reads error record from persistent store • 1. Kernel calls PSHED (PshedReadErrorRecord) • 2. PSHED calls Plug-In (ReadErrorRecord) • 3. Plug-In reads the specified error record from non-volatile store and copies the record into the supplied buffer • Plug-In may invoke firmware or execute hardware commands • Diagram components: Kernel, Buffer, PSHED, Plug-In, Persistent Error Record
PSHED Plug-Ins: Error record persistence • ClearErrorRecord: Plug-In clears error record from persistent store • 1. Kernel calls PSHED (PshedClearErrorRecord) • 2. PSHED calls Plug-In (ClearErrorRecord) • 3. Plug-In clears the specified error record from non-volatile store • Plug-In may invoke firmware or execute hardware commands • Diagram components: Kernel, PSHED, Plug-In, Persistent Error Record
PSHED Plug-Ins: Error information retrieval • RetrieveErrorInfo: Plug-In retrieves error information and populates error packet • 1. LLHEH extracts error data and creates an error packet • 2. LLHEH calls PSHED (PshedRetrieveErrorInfo) • 3. PSHED calls Plug-In (RetrieveErrorInfo) • Plug-In augments/extends the error packet as necessary using its knowledge of the hardware • Diagram components: LLHEH, Error Packet, PSHED, Plug-In
PSHED Plug-Ins: Error information retrieval • FinalizeErrorRecord: Plug-In can add additional error sections to the error record • 1. Kernel creates error record using info in the error packet • 2. Kernel calls PSHED (PshedFinalizeErrorRecord) • 3. PSHED calls Plug-In (FinalizeErrorRecord) • Plug-In adds additional sections to error record as necessary to fully describe the error condition • Diagram components: Kernel, Error Record Section, PSHED, Plug-In
PSHED Plug-Ins: Error information retrieval • ClearErrorStatus: Allows Plug-In to clear private error status • 1. Kernel calls PSHED (PSHEDClearErrorStatus) • 2. PSHED calls Plug-In (ClearErrorStatus) • Plug-In clears any private error status that it is using and/or responsible for • Diagram components: Kernel, PSHED, Plug-In
PSHED Plug-Ins: Error recovery • AttemptErrorRecovery: Allows Plug-In to perform any platform-level operations required to recover from the specified error condition • 1. Kernel calls PSHED (PSHEDAttemptErrorRecovery) • 2. PSHED may attempt recovery • 3. PSHED calls Plug-In (AttemptErrorRecovery) • Plug-In attempts recovery and returns success/failure to PSHED • 4. If recovery was successful, PSHED updates error record to inform kernel and prevent bugcheck • Diagram components: Kernel, Error Record, PSHED, Plug-In
PSHED Plug-Ins: Error injection • GetErrorInjectionCaps/InjectError: Plug-In is responsible for carrying out these operations • 1. Management application invokes WMI method • 2. Kernel calls PSHED (PshedGetInjectionCapabilities/PshedInjectError) • 3. PSHED delegates control operation to the Plug-In (GetInjectionCapabilities/InjectError) • Plug-In returns the error injection capabilities of the platform • Plug-In interacts with error source as necessary to cause the specified error to occur • Diagram components: Management Application, WMI Interface, Kernel, PSHED, Plug-In
Firmware Error Handlers • Some errors may be handled in firmware • Errata management • Error containment • If platform must handle any of the required error sources in firmware, it must implement a mechanism for notifying the operating system of the error subsequent to processing • It must implement an alternative error source for the operating system • Operating system can be notified via an interrupt or by configuring a polling mechanism • This may require an accompanying PSHED Plug-In to interact with this error source
Error Reporting And Management Interface • Error events are reported to applications via Event Tracing for Windows (ETW) • Microsoft will implement a logging service that listens for error events and writes entries to the system event log • WHEA Whitepaper ties together reporting and management scenarios • Two WMI Classes defined for WHEA management • Error Injection Interface • Error Source Interface
WMI Management Interface • WHEAErrorInjectionMethods • GetErrorInjectionCapabilities • InjectError • WHEAErrorSourceMethods • GetAllErrorSources • GetErrorSourceInfo • SetErrorSourceInfo
BMC Interaction • WHEA co-exists with BMC-based error handling/reporting solutions • Duplicate error events may show up in the event log • WHEA-aware management applications will consume WHEA events • Existing management applications that depend on BMC-originated events can continue to consume their respective events • Management applications can be consumers of both WHEA events and BMC-originated events • Tighter and better integration moving forward
Status And Roadmap • Core of Windows Vista hardware error handling is based on WHEA • Implements native support for PCI Express AER • Error log entries use the common error record format • ETW-based logging replaces WMI-based error log entries • Windows Server codenamed “Longhorn” • Core operating system implementation nearing completion • Validating hardware vendor error record persistence and error source discovery implementations • Validating hardware vendor PSHED Plug-Ins • Future • Standardize error record format • Standardize ACPI mechanisms • Working with processor vendors on better error reporting architectures • Working with hardware vendors on better OS/firmware integration architectures • Extend WHEA to endpoint device stacks
Call To Action • Work with us to get builds and resources needed for developing PSHED Plug-Ins • Work with us to validate WHEA platform support • Provide BIOS with WHEA ACPI support • Provide BIOS with error record persistence support • Work with us to develop WHEA-based management applications • Powerful health monitoring and error recovery features are possible with common error record format and ETW • Error source control applications allow for very fine-grained control over error source operational parameters • Pre-boot and out-of-band error record processing applications
Additional Resources • Specifications • WHEA Whitepaper • PSHED Plug-In Developer’s Guide • PSHED Interface Specification • WHEA Error Record Persistence Specification • WHEA BOOT Error Record Specification • WHEA ACPI Specification • Longhorn Server WHEA Logo Requirements • Related WinHEC Presentations • CPA070 – PCI Express in Depth for Windows Vista and Beyond • SER121 – Windows Server Platform Directions • Send feedback/comments/questions to our feedback alias wheafb @ microsoft.com
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.