Fault Resilient Drivers For Longhorn Server Sandy Arthur, Program Manager, Windows Server Group sandyar@microsoft.com Microsoft Corporation
Agenda • History and goals of resilience • Technical justification for fault resilient drivers • Business justification for fault resilient drivers • General guidance for storage and network drivers • Longhorn Driver Kit future tests
History And Goals • History of driver resilience and improvement • Windows 2000 device tests include Verifier • WinHEC 2002 presentation on driver hardening • Windows Server 2003 device tests include Device Path Exerciser • Goals of fault resilience for Longhorn Server • Reduce server customer unplanned downtime • Reduce support cost for hardware vendors
Adapter Reboot And Failure Data • Windows 2000 Server unplanned reboots* • 7% of all reboots due to “adapter/driver failure” • 2% of all reboots due to “hardware/firmware failure” • Software improvements to the operating system alone cannot directly impact HW failures • Most of the measured systems have only one storage or network adapter • This is not true for large servers used for consolidation and scalable applications • Larger systems with more adapters are therefore at greater risk of failure • Rigorous testing, like that for Datacenter, can reduce risk, but not eliminate it • Operating system level testing can find HW design flaws, but will not reliably and repeatably expose HW edge case behavior or circuit production process flaws that cause transient or permanent adapter failure • Even good HW quality assurance processes are not perfect • Adapter vendors report that the substantial majority of returned adapters, when tested, are found to operate correctly • Confirming that many returned adapters have not permanently failed, fault tolerant system vendors report only a ~3% network and storage adapter failure rate, not 8% • Thus, it makes sense to design the driver to ‘mask’ the effects of device failure *Data Source: Data collected from 579 production servers running Windows 2000 Server for six months in 2001.
Fault Tolerant Systems And Adapter Failure • Fault Tolerant (FT) systems rarely crash due to adapter failures • Windows FT server systems use the same adapters as other servers • FT server systems do have fault resilient drivers • Transient HW failures are contained by the driver, then the adapter is re-initialized for continuing operation • Permanent HW failures are contained by the driver, either MPIO or teaming maintains connectivity, and there is no application impact • In both cases, the system should not crash • This is a large part of why fault tolerant platforms can achieve 99.9997% uptime • Hardened, fault resilient drivers keep the system operational • Only those HW devices that fail ‘hard’, or that have high rates of transient (but recovered) failures, need replacement • And the replacement can become a planned outage, not an unplanned one
Impact On Server Customers Of Fault Resilient Drivers • Customers will have improved Mean Time Between Failure (MTBF) • Fewer reboots due to uncontained, but survivable, HW failures • Customers will gain reduced Mean Time To Repair (MTTR) • The failures are contained, recovered and logged • Customer can determine the failure point easily, instead of spending time swapping parts • System is back on line faster
Partner Impact Of Fault Resilient Drivers • Fewer “Severity A, Server Down” support calls • Drivers should contain most HW errors, not bugcheck or hang the system • Shorter support calls • Logging points to the offending adapter and the driver’s attempts to recover • Support Engineers can more quickly resolve the issue • Fewer shipped bugs and lower ‘fix’ cost • Much more expensive to fix bugs once a product is released • Failures generated by the ‘Fault Resilience Test’ are reproducible and occur ‘in house’ • Don’t occur at customer sites, where the problem is very hard to reproduce • Improved customer satisfaction with vendor’s product
Handle Hardware Error Conditions • Don't halt/bugcheck the system if at all possible • Handle HW error codes, don’t ignore them or automatically bugcheck • Have a periodic, frequent “LooksAlive” test • Test should be a fast, lightweight check, like a read/write of a private memory-mapped register • Execute an “IsAlive” test only as needed • If the “LooksAlive” test has failed 1-N times • Test should be a quick, but more thorough, check of the adapter functionality, if possible • Make recovery as simple as possible • Don’t try to recover without informing and involving the operating system • Use standard Windows Driver Model capabilities and interfaces if at all possible • Log all errors and actions taken (see the sketch below)
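A minimal sketch of this LooksAlive/IsAlive escalation, in WDM-style C: the device extension layout, scratch register, threshold, and the AdapterIsAlive, LogAdapterError, and ScheduleAdapterRecovery helpers are all hypothetical illustrative names, not part of any Microsoft interface; only READ_REGISTER_ULONG/WRITE_REGISTER_ULONG are standard WDK calls.

```c
#include <ntddk.h>

#define LOOKS_ALIVE_THRESHOLD 3   /* illustrative escalation threshold */

typedef struct _DEVICE_EXTENSION {
    PULONG ScratchRegister;       /* hypothetical mapped device register */
    ULONG  LooksAliveFailures;    /* consecutive lightweight-check failures */
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

/* Vendor-specific routines; hypothetical declarations for this sketch. */
BOOLEAN AdapterIsAlive(PDEVICE_EXTENSION Ext);
VOID    LogAdapterError(PDEVICE_EXTENSION Ext);
VOID    ScheduleAdapterRecovery(PDEVICE_EXTENSION Ext);

/* Fast, lightweight "LooksAlive": write a pattern, read it back. */
static BOOLEAN AdapterLooksAlive(PDEVICE_EXTENSION Ext)
{
    ULONG pattern = 0x5A5A5A5A;

    WRITE_REGISTER_ULONG(Ext->ScratchRegister, pattern);
    return (READ_REGISTER_ULONG(Ext->ScratchRegister) == pattern);
}

/* Called periodically (e.g., from a timer DPC); escalates on failure. */
VOID PeriodicHealthCheck(PDEVICE_EXTENSION Ext)
{
    if (AdapterLooksAlive(Ext)) {
        Ext->LooksAliveFailures = 0;
        return;
    }

    if (++Ext->LooksAliveFailures >= LOOKS_ALIVE_THRESHOLD) {
        /* Escalate to the heavier "IsAlive" diagnostic. */
        if (!AdapterIsAlive(Ext)) {
            LogAdapterError(Ext);          /* log before acting */
            ScheduleAdapterRecovery(Ext);  /* reset via standard WDM paths */
        }
        Ext->LooksAliveFailures = 0;
    }
}
```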
Recover Quickly, Don’t Stall The OS • Use KeStallExecutionProcessor() sparingly • Avoid doing so within a loop • Do not call KeDelayExecutionThread() • Unless a guaranteed private thread is used • Never busy-wait • State transitions should be event driven by interrupts or periodic polling • Never hold locks or block interrupts for long periods • Be careful of explicit and implicit synchronization objects • Explicit – events, semaphores, mutexes, spinlocks, etc. • Implicit • Reference counts in data structures • Worker threads if they are part of a fixed pool
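One way to honor these rules is to replace a stall loop with a timer-driven re-poll. The sketch below assumes hypothetical CheckResetComplete and AdvanceResetStateMachine helpers and presumes the timer and DPC were initialized once at device start with KeInitializeTimer and KeInitializeDpc; KTIMER, KDPC, and KeSetTimer are the standard kernel interfaces.

```c
#include <ntddk.h>

typedef struct _DEVICE_EXTENSION {
    KTIMER PollTimer;   /* initialized once with KeInitializeTimer() */
    KDPC   PollDpc;     /* initialized once with KeInitializeDpc() */
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

/* Hypothetical helpers: query the HW, then advance the driver's state. */
BOOLEAN CheckResetComplete(PDEVICE_EXTENSION Ext);
VOID    AdvanceResetStateMachine(PDEVICE_EXTENSION Ext);

/* DPC that re-polls instead of spinning in KeStallExecutionProcessor(). */
VOID PollDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)Context;
    LARGE_INTEGER dueTime;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    if (CheckResetComplete(ext)) {
        AdvanceResetStateMachine(ext);   /* event-driven transition */
    } else {
        /* Re-arm for 10 ms from now (negative = relative, 100 ns units). */
        dueTime.QuadPart = -100000LL;
        KeSetTimer(&ext->PollTimer, dueTime, &ext->PollDpc);
    }
}
```

The DPC either advances the state machine or re-arms itself, so no CPU is stalled and no lock is held while waiting on the hardware.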
Storage Drivers • All code that accesses the hardware should check for hardware failure by testing for returned data of all -1s – if present, assume the HW is ‘gone’ • Error recovery should clean up all pending requests to the storage adapter with SRB_STATUS_NO_HBA status • If the storage adapter indicates the link is down, call ScsiPortNotification to initiate a timely rescan to detect the lost devices; do the same when processing link-up indications • If this is not done, serious customer data corruption issues have occurred • When handling bus reset requests, it is necessary to call ScsiPortNotification(NextLuRequest, …)
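The sketch below illustrates these points for a ScsiPort-style miniport, simplified to a single outstanding request; the device extension layout, StatusRegister, and routine names are hypothetical, while ScsiPortReadRegisterUlong, ScsiPortNotification, and SRB_STATUS_NO_HBA are the standard DDK interfaces.

```c
#include <miniport.h>
#include <scsi.h>

#define REGISTER_GONE 0xFFFFFFFF   /* PCI reads return all ones when the
                                      device has vanished from the bus */

typedef struct _HW_DEVICE_EXTENSION {
    PULONG StatusRegister;            /* hypothetical mapped register */
    PSCSI_REQUEST_BLOCK PendingSrb;   /* simplified: one outstanding SRB */
} HW_DEVICE_EXTENSION, *PHW_DEVICE_EXTENSION;

/* All -1s from a memory-mapped register means the adapter is gone. */
BOOLEAN AdapterIsGone(PHW_DEVICE_EXTENSION Ext)
{
    return (ScsiPortReadRegisterUlong(Ext->StatusRegister) == REGISTER_GONE);
}

/* Fail outstanding work and keep the request stream moving. */
VOID FailPendingRequests(PHW_DEVICE_EXTENSION Ext)
{
    PSCSI_REQUEST_BLOCK srb = Ext->PendingSrb;

    if (srb != NULL) {
        /* Complete the request with NO_HBA so upper layers see the loss. */
        srb->SrbStatus = SRB_STATUS_NO_HBA;
        ScsiPortNotification(RequestComplete, Ext, srb);
        Ext->PendingSrb = NULL;

        /* Ask the port driver for the next request on this logical unit. */
        ScsiPortNotification(NextLuRequest, Ext,
                             srb->PathId, srb->TargetId, srb->Lun);
    }

    /* A lost adapter or link-down also warrants a timely rescan. */
    ScsiPortNotification(BusChangeDetected, Ext);
}
```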
Network Drivers • MiniportCheckForHang() needs to detect hardware failures and, if the adapter has failed, return all pending send buffers and requests to the upper layers • NDIS will not call back into a registered Reset() function if a halt is pending, so the MiniportHalt() function needs to do the same cleanup as the Reset() function • Make sure that all timers are properly canceled in MiniportHalt() • Upper level drivers don't always return buffers handed to them • NDIS drivers often try to optimize common buffers for DMA by allocating a big chunk, then breaking it down for their own use, which complicates cleanup when the adapter fails • Hardware error detection should be immediately communicated to NDIS via NdisMIndicateStatus()
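As a sketch of the first and last bullets, an NDIS 5.x-style MiniportCheckForHang handler might look like the following; ADAPTER_CONTEXT, AdapterFailed, and MiniportHandle are hypothetical driver-defined names, while NdisMIndicateStatus, NdisMIndicateStatusComplete, and NDIS_STATUS_MEDIA_DISCONNECT are standard NDIS interfaces. Returning TRUE asks NDIS to call the miniport's Reset handler, where outstanding sends are then completed back to the upper layers.

```c
#include <ndis.h>

typedef struct _ADAPTER_CONTEXT {
    NDIS_HANDLE MiniportHandle;   /* handle supplied at initialization */
    BOOLEAN     AdapterFailed;    /* set by the driver's health checks */
} ADAPTER_CONTEXT, *PADAPTER_CONTEXT;

/* NDIS calls this periodically (every two seconds by default). */
BOOLEAN MiniportCheckForHang(NDIS_HANDLE MiniportAdapterContext)
{
    PADAPTER_CONTEXT adapter = (PADAPTER_CONTEXT)MiniportAdapterContext;

    if (adapter->AdapterFailed) {
        /* Tell upper layers immediately rather than waiting for reset. */
        NdisMIndicateStatus(adapter->MiniportHandle,
                            NDIS_STATUS_MEDIA_DISCONNECT, NULL, 0);
        NdisMIndicateStatusComplete(adapter->MiniportHandle);
        return TRUE;   /* NDIS will schedule a call to MiniportReset */
    }

    return FALSE;
}
```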
Overview • Fault resilience does not require new technology or features • No new technology required in Longhorn Server for fault resilience • No new technology required on the part of drivers for fault resilience • Does require focus on building drivers that can survive HW failures • Information and testing focus will be on storage and network • Microsoft will publish ‘best practice’ through the LDK • Establish requirements through Server Device Logo standards • Test through standard Server device tests
Example Device ‘Kill’ Methods • ‘Surprise’ removal of power from a Peripheral Component Interconnect (PCI) slot • Done by the test without informing or involving the driver • ‘Hide’ the configuration space for the PCI adapter • The fault resilience test will return all ‘Fs’ to the driver when scanned • Reprogram the Advanced Programmable Interrupt Controller (APIC) to stop interrupts from the adapter • Again, done by the test without informing or involving the driver • Note these tests require a Server class system, especially for the APIC ‘kill’ method
Testing • ‘Adapter Kill Simulator’ will be included in the LDK • Vendors can run tests on their own before the Logo Test • ‘Adapter Kill Simulator’ will be part of relevant Device Logo Tests • Storage and network • These will be runtime tests • The Device Test will call the fault resilience test with the specific type of kill and length of time needed to simulate the desired failures • Some example cases and conditions the Simulator can cause • “LooksAlive” may fail one to a few times before succeeding • “LooksAlive” may fail N times – the driver may need to execute the “IsAlive” test • “IsAlive” may fail one to a few times before succeeding • “IsAlive” may fail N times – the driver may need to execute recovery • Recovery may fail one or more times before succeeding • Recovery may fail every time – this is also a test case for teaming and MPIO
Call To Action • Determine how to handle adapter HW errors in the driver • Read the WinHEC 2002 whitepaper • Follow ‘Best Practices’ in LDK • Test using other methods in the interim • Test using the network and storage device tests which will include fault resilience tests in the future
Resources • Driver Hardening Whitepaper • “Writing Drivers for Reliability, Robustness and Fault Tolerant Systems” http://www.microsoft.com/whdc/system/platform/server/FTdrv.mspx • Best Practices in the LDK • “Creating Reliable Kernel-Mode Drivers” • “Creating Reliable and Secure Drivers” • “Device Path Exerciser” • “Driver Programming Techniques” • Look for more info later at WHDC site
Community Resources • Community Sites • http://www.microsoft.com/communities/default.mspx • List of Newsgroups • http://communities2.microsoft.com/communities/newsgroups/en-us/default.aspx • Attend a free chat or webcast • http://www.microsoft.com/communities/chats/default.mspx • http://www.microsoft.com/seminar/events/webcasts/default.mspx • Locate a local user group(s) • http://www.microsoft.com/communities/usergroups/default.mspx • Non-Microsoft Community Sites • http://www.microsoft.com/communities/related/default.mspx