Driver Hangs – Detection And Prevention Gerald Maffeo Program Manager Windows Reliability Team geraldm@microsoft.com Narayanan Ganapathy Architect Windows Device Experience narg@microsoft.com
Session Goals • Attendees should leave this session with the following • A better understanding of driver hangs and how to find and prevent them • Knowledge of where to find resources for I/O completion/cancellation guidelines and programming information for driver cancellation • The industry will benefit from this session in the following ways • Improved driver quality will result in fewer failed application terminations • Improved driver cancellation will enable a new application cancellation paradigm
Session Outline • Why Hangs? What Causes Them? • Windows Initiatives • I/O Cancellation Overview and Examples • I/O Completion/Cancellation Guidelines • Techniques for Canceling I/O Requests • Cancel-Safe IRP Queues • Compliance Test Tools • Call To Action • Question and Answer
Why Hangs? • A hang is a loss of responsiveness from an application or system for as little as a few seconds: enough to cause the user to attempt recovery, including rebooting • Hangs are a top cause of loss of productivity and customer pain • Numerous surveys and customer satisfaction studies for consumer and corporate desktop environments confirm this • New instrumentation and Windows Error Reports provide hard data • Driver hangs cause applications or systems to hang • Especially serious, since they usually cause reboots • Impact is roughly the same as Online Crash Analysis (OCA) driver crashes • Driver hangs can cause apps to appear to hang randomly
How Do Drivers Cause Hangs? Operations do not complete in a timely fashion and do not adequately support cancellation. The most common causes include • Drivers block user-mode threads with kernel-mode waits • For programmer convenience or to minimize development effort • Due to lack of familiarity with programming asynchronous I/O • Drivers do not implement cancellation • Faulty assumptions about completion times • Difficulty passing cancellation down the stack • Queuing and synchronization logic is complex
Longhorn Initiatives • New Windows operating system support • Cancelable create requests (IRP_MJ_CREATE) • New Win32 cancellation Application Programming Interfaces (APIs) • Kernel Hang Reporting • Driver Hang Verifier option to test completion/cancellation compliance • New I/O completion/cancellation DFW / logo requirements • New Cancellation Support for Applications (“stop button”) • Enables users to cancel operations on demand • Solid driver support for cancellation now far more important
I/O Cancellation Overview • A mechanism to cancel already-issued I/O requests (IRPs) • IRP cancellation scenarios • Thread / process termination • Application wants to cancel an I/O request • Asynchronous I/O (CancelIo, CancelIoEx) • Synchronous I/O (CancelSynchronousIo)
Process Termination Example • Application issues I/O call(s); the process is then terminated • The system cancels all I/O associated with the process • The I/O Manager invokes the Cancel routine(s) of the driver(s) holding the IRPs • Process cleanup occurs only after all IRPs complete or are canceled
Longhorn Cancellation – Synchronous I/O • Thread 1 (T1) issues a synchronous call such as CreateFile() and waits for the I/O to complete • Another thread (T2) requests cancellation by calling CancelSynchronousIo(), passing T1’s handle; the call returns immediately • The I/O Manager tries to cancel T1’s synchronous I/O, and the driver's Cancel routine is invoked • The driver completes the IRP with STATUS_CANCELLED • The status is returned to the application
Longhorn Cancellation – Asynchronous I/O • The application issues asynchronous I/O such as ReadFileEx() • Some thread in the process requests cancellation of all pending file I/O on a specified handle by calling CancelIoEx(), passing the file handle; the call returns immediately • The I/O Manager tries to cancel all pending I/O on this handle, and the driver's Cancel routine(s) are invoked • The driver completes each IRP with STATUS_CANCELLED • The status is returned to the application
Key Takeaways • Kernel-mode waits that block user-mode threads are bad • Cannot be interrupted • Don’t wait inside driver (Return STATUS_PENDING) • Always set a Cancel routine on IRPs that are held in a queue
I/O Completion/Cancellation Guidelines • Published at WinHEC 2003 • Apply to drivers intended for use with Longhorn • Apply also to IRP_MJ_CREATE, which now supports cancellation • Existing drivers may not automatically be compliant • Compliance with I/O completion/cancellation guidelines is essential • To not block application termination or system shutdown • To support an application’s ability to cancel I/O operations • Proposed requirement for DFW Program, 0.45 draft • Some Definitions • Reasonable Period (TR): <<10 seconds from the initial request • Long-term Request: A request that can take > TR to complete • Pend: A driver marks the IRP pending and returns STATUS_PENDING
I/O Completion/Cancellation Guidelines Walkthrough (1/2) • All driver paths must either ensure timely completion or support cancellation • Any request that takes an indefinite time to complete must be cancelable • Waits that block on user-initiated events, e.g., keyboard reads • Drivers should not block user-mode threads inside dispatch routines for >TR • All long-term requests should be pended • When a driver pends an IRP it must • Support IRP cancellation; or • Complete the operation by TR (possibly using timeouts)
I/O Completion/Cancellation Guidelines Walkthrough (2/2) • Close/cleanup requests should never block >TR • A driver that creates new requests to pass to other drivers must also pass on cancellation or be able to disassociate them from the original IRP issued by the I/O Manager • All requests should complete by TR after being canceled • A driver about to complete an IRP for anything other than the current thread must be suspension-proof • A driver should never pend a canceled request
Techniques For Canceling I/O Requests Cancellation is inherently asynchronous and therefore difficult to implement correctly. Race conditions can lead to obscure and infrequent failures. • Use the system-wide cancel spin lock and include a Cancel routine • Not recommended for drivers that do frequent I/O • Use a driver-supplied locking mechanism and include a Cancel routine • Tricky to write and test due to race conditions • No longer recommended for drivers that do frequent I/O • Use cancel-safe IRP queuing (CSQ) • Cancel routine not required • Strongly recommended for existing drivers (supports all Windows versions) • New drivers should use the Windows Driver Foundation • Strongly recommended for new drivers
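The guideline rules above can be read as a small decision procedure. The following is a minimal sketch of that reading in portable user-mode C, not WDK code: the names `RequestModel`, `DispatchAction`, and `decide_dispatch` are hypothetical, and the value of TR is passed in by the caller because the slides only bound it (well under 10 seconds).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-mode model of the guidelines; not WDK code. */
typedef enum {
    ACTION_COMPLETE_CANCELED,  /* a canceled request is never pended */
    ACTION_FINISH_WITHIN_TR,   /* complete inline, or pend with a timeout */
    ACTION_PEND_CANCELABLE     /* long-term request: pend and support cancel */
} DispatchAction;

typedef struct {
    bool     already_canceled; /* cancellation requested before dispatch */
    bool     bounded;          /* false for waits on user events (keyboard reads) */
    unsigned worst_case_ms;    /* meaningful only when bounded is true */
} RequestModel;

/* tr_ms: the Reasonable Period (TR) in milliseconds, chosen by the caller. */
DispatchAction decide_dispatch(const RequestModel *req, unsigned tr_ms)
{
    assert(req != NULL);

    /* "A driver should never pend a canceled request." */
    if (req->already_canceled)
        return ACTION_COMPLETE_CANCELED;

    /* Bounded and short: the driver must still finish by TR. */
    if (req->bounded && req->worst_case_ms <= tr_ms)
        return ACTION_FINISH_WITHIN_TR;

    /* Indefinite or longer than TR: pend it and make it cancelable. */
    return ACTION_PEND_CANCELABLE;
}
```

In this model a keyboard read (unbounded) always maps to `ACTION_PEND_CANCELABLE`, while a short, bounded control request maps to `ACTION_FINISH_WITHIN_TR`. A real driver makes this decision per code path at design time rather than at run time.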
Cancel-Safe IRP Queues – Overview • CSQ provides a framework to implement IRP queuing • CSQ framework handles cancellation to prevent race conditions • CSQ support is available now for Windows 2000 and later • CSQ allows drivers to implement simple queue logic and leave the complex cancellation logic to CSQ library • CSQ library provides the following routines • IoCsqInitialize / IoCsqInitializeEx – Initializes the driver's CSQ dispatch table • IoCsqInsertIrp / IoCsqInsertIrpEx – Inserts an IRP into the driver's queue • IoCsqRemoveIrp – Removes a particular IRP from the queue • IoCsqRemoveNextIrp – Removes the next matching IRP in the queue • Drivers must provide • Callback routines to manage queues and locks – these are protected by the framework from collisions with IRP cancellation code • A lock with which to lock the queue • Storage for the queue of pending IRPs
Cancel-Safe IRP Queues – Driver Callbacks • Driver-provided callback routines • XxxInsertIrp / XxxInsertIrpEx – Inserts the IRP into the queue • XxxRemoveIrp – Removes the matching IRP from the queue • XxxPeekNextIrp – Returns the next IRP in the queue • XxxAcquireLock – Locks the queue • XxxReleaseLock – Unlocks the queue • XxxCompleteCanceledIrp – Cancels and completes the IRP
Two Usage Models • Requests spend most time (>TR) in queues – once de-queued, request processing time is short, e.g. a storage stack • Driver inserts IRP into queue using IoCsqInsertIrp(Ex) if device is busy • Driver removes next IRP from queue using IoCsqRemoveNextIrp when device becomes free • Canceled IRPs are ignored by IoCsqRemoveNextIrp • Requests posted to hardware can take a long time (>TR) to process – hardware allows request to be aborted, e.g. a read request on a serial port • Driver allocates a context with the IRP (IrpContext) • Driver passes context to IoCsqInsertIrp(Ex) – CSQ associates context with IRP • Driver uses context instead of IRP • ISR gets context • IRP should not be touched • When ISR or timer fires, driver calls IoCsqRemoveIrp with context • IRP is returned only if it is not canceled • Queue implementation callbacks could be empty if only this model is used
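To illustrate how the library routines and the driver callbacks divide the work, here is a minimal single-threaded sketch in portable C. It is a simplified model under stated assumptions, not the WDK implementation: `Irp`, `Csq`, the `csq_*` functions, and `xxx_init_queue` are hypothetical stand-ins, the lock is modeled as a depth counter, and cancellation is modeled as a direct function call rather than an I/O Manager callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Irp {
    struct Irp *next;
    bool queued;      /* true while the queue owns the IRP */
    bool canceled;
    bool completed;
} Irp;

typedef struct Csq Csq;
struct Csq {
    Irp *head;
    int  lock_depth;  /* stand-in for the driver's spin lock */
    void (*insert_irp)(Csq *, Irp *);
    void (*remove_irp)(Csq *, Irp *);
    Irp *(*peek_next_irp)(Csq *);
    void (*acquire_lock)(Csq *);
    void (*release_lock)(Csq *);
    void (*complete_canceled_irp)(Csq *, Irp *);
};

/* Driver callbacks: plain list handling, always invoked with the lock held. */
static void xxx_insert_irp(Csq *q, Irp *irp)
{
    Irp **pp = &q->head;
    assert(q->lock_depth == 1);
    while (*pp) pp = &(*pp)->next;       /* append at the tail */
    irp->next = NULL;
    *pp = irp;
}
static void xxx_remove_irp(Csq *q, Irp *irp)
{
    Irp **pp = &q->head;
    assert(q->lock_depth == 1);
    while (*pp && *pp != irp) pp = &(*pp)->next;
    if (*pp) *pp = irp->next;            /* unlink the matching IRP */
}
static Irp *xxx_peek_next_irp(Csq *q) { return q->head; }
static void xxx_acquire_lock(Csq *q)  { q->lock_depth++; }
static void xxx_release_lock(Csq *q)  { q->lock_depth--; }
static void xxx_complete_canceled_irp(Csq *q, Irp *irp)
{
    (void)q;
    irp->completed = true;               /* model of completing as canceled */
}

void xxx_init_queue(Csq *q)
{
    q->head = NULL;
    q->lock_depth = 0;
    q->insert_irp = xxx_insert_irp;
    q->remove_irp = xxx_remove_irp;
    q->peek_next_irp = xxx_peek_next_irp;
    q->acquire_lock = xxx_acquire_lock;
    q->release_lock = xxx_release_lock;
    q->complete_canceled_irp = xxx_complete_canceled_irp;
}

/* Library side: owns the ordering between queue operations and cancellation. */
void csq_insert_irp(Csq *q, Irp *irp)
{
    q->acquire_lock(q);
    q->insert_irp(q, irp);
    irp->queued = true;
    q->release_lock(q);
}

Irp *csq_remove_next_irp(Csq *q)
{
    Irp *irp;
    q->acquire_lock(q);
    irp = q->peek_next_irp(q);
    if (irp) { q->remove_irp(q, irp); irp->queued = false; }
    q->release_lock(q);
    return irp;                          /* NULL when the queue is empty */
}

/* Returns true if the IRP was still queued and was completed as canceled. */
bool csq_cancel_irp(Csq *q, Irp *irp)
{
    bool was_queued;
    q->acquire_lock(q);
    irp->canceled = true;
    was_queued = irp->queued;
    if (was_queued) { q->remove_irp(q, irp); irp->queued = false; }
    q->release_lock(q);
    if (was_queued) q->complete_canceled_irp(q, irp);  /* outside the lock */
    return was_queued;
}
```

Because the library alone decides, under the driver's lock, whether a cancel or a dequeue wins, the driver callbacks stay free of cancellation logic. In the model, an IRP canceled while queued is never handed back by `csq_remove_next_irp`, and an IRP already dequeued is not completed by the cancel path.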
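The second usage model can be sketched the same way. The code below is a hypothetical portable C model, not the WDK API: `IrpContext`, `model_insert_irp`, `model_remove_irp`, `model_cancel_irp`, and `xxx_on_hardware_done` are illustrative names, and locking is omitted for brevity (the real library performs these steps under the driver's queue lock). The point it shows is that the ISR or timer path holds only the context and gets the IRP back only if cancellation has not already claimed it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MODEL_PENDING, MODEL_SUCCESS, MODEL_CANCELLED } ModelStatus;

typedef struct Irp Irp;

/* The context is what the driver hands to its ISR or timer. */
typedef struct IrpContext {
    Irp *irp;                 /* NULL once the IRP has been removed or canceled */
} IrpContext;

struct Irp {
    IrpContext *context;      /* non-NULL while the IRP is tracked by the queue */
    ModelStatus status;
    bool completed;
};

/* Associate the context with the IRP when it is posted to hardware. */
void model_insert_irp(Irp *irp, IrpContext *ctx)
{
    irp->status = MODEL_PENDING;
    irp->completed = false;
    irp->context = ctx;
    ctx->irp = irp;
}

/* Returns the IRP only if it has not been canceled in the meantime. */
Irp *model_remove_irp(IrpContext *ctx)
{
    Irp *irp = ctx->irp;
    if (irp != NULL) {
        ctx->irp = NULL;
        irp->context = NULL;
    }
    return irp;
}

/* Cancellation path: detach the context, then complete as canceled. */
void model_cancel_irp(Irp *irp)
{
    if (irp->context != NULL) {
        irp->context->irp = NULL;
        irp->context = NULL;
        irp->status = MODEL_CANCELLED;
        irp->completed = true;
    }
}

/* ISR or timer path: touches the IRP only after a successful removal. */
bool xxx_on_hardware_done(IrpContext *ctx)
{
    Irp *irp = model_remove_irp(ctx);
    if (irp == NULL)
        return false;         /* canceled earlier; nothing left to complete */
    irp->status = MODEL_SUCCESS;
    irp->completed = true;
    return true;
}
```

If the hardware finishes first, `xxx_on_hardware_done` completes the IRP and a later cancel is a no-op. If the cancel arrives first, the later hardware notification finds an empty context and leaves the already-completed IRP alone.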
Cancel-Safe IRP Queues – Initialization
• Driver allocates queue lock and IO_CSQ structure in its device extension:
    typedef struct _DEVICE_EXTENSION {
        LIST_ENTRY PendingIrpQueue;   // IRPs are queued here
        KSPIN_LOCK QueueLock;         // Spin lock protects access to the queue
        PIRP       CurrentIrp;        // Ptr to current device IRP
                                      // QueueLock provides exclusive access
        // … Some driver-specific fields
        IO_CSQ     CancelSafeQueue;   // Cancel-safe IRP queue
    } DEVICE_EXTENSION, *PDEVICE_EXTENSION;
• Driver calls IoCsqInitialize or IoCsqInitializeEx. In DriverEntry:
    IoCsqInitializeEx( &devExtension->CancelSafeQueue,
                       XxxInsertIrp,
                       XxxRemoveIrp,
                       XxxPeekNextIrp,
                       XxxAcquireLock,
                       XxxReleaseLock,
                       XxxCompleteCanceledIrp );
Kernel Hang Reporting • Provides ongoing OCA-like driver hang failure reports • KHR integration with new live kernel mini-dump feature • New driver quality feature provides kernel data, including stacks and pending I/O requests, for apps that are blocked from terminating • Extended debug API for live kernel minidumps supports both on-demand and KHR automated dump generation • Hang reports (WER) include process timestamps and kernel minidumps if a hung process still exists 10 seconds (default) after TerminateProcess was called • Reporting will start with WinHEC 2004 release
Driver Hang Verifier • New completion/cancellation option for I/O Verifier • Parameters and reports accessed through !verifier (debugger) • Optionally monitors completion times for I/O requests for a targeted set of drivers • Configurable timer to report requests with excessive completion times • Optionally injects random cancellations (IoCancelIrp) for a targeted set of drivers • Probability value ranges from cancel none (0) through cancel all (100) • Configurable timers permit • Setting reasonable completion time for requests before canceling • Reporting excessive cancellation times • Available with WinHEC 2004 release
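The verifier options described above amount to a few threshold and probability checks per request. The sketch below models them in portable C so the semantics of the settings are concrete. It is an assumption-laden illustration, not the Driver Verifier implementation: `HangVerifierModel` and its field names are hypothetical, and the random roll is supplied by the caller so the logic stays deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the options listed on the slide. */
typedef struct {
    unsigned cancel_probability;  /* 0 = cancel none, 100 = cancel all */
    unsigned grace_ms;            /* time a request is given before a cancel is injected */
    unsigned max_completion_ms;   /* report completions slower than this */
    unsigned max_cancel_ms;       /* report cancellations slower than this */
} HangVerifierModel;

/* roll: caller-supplied random number in the range 0..99. */
bool should_inject_cancel(const HangVerifierModel *cfg,
                          unsigned age_ms, unsigned roll)
{
    assert(cfg != NULL && cfg->cancel_probability <= 100 && roll < 100);
    if (age_ms < cfg->grace_ms)
        return false;             /* request still has time to complete normally */
    return roll < cfg->cancel_probability;
}

/* True when a request took longer to complete than the configured limit. */
bool completion_time_excessive(const HangVerifierModel *cfg, unsigned elapsed_ms)
{
    return elapsed_ms > cfg->max_completion_ms;
}

/* True when a canceled request took too long to complete after the cancel. */
bool cancellation_time_excessive(const HangVerifierModel *cfg,
                                 unsigned since_cancel_ms)
{
    return since_cancel_ms > cfg->max_cancel_ms;
}
```

With `cancel_probability` set to 0 no roll ever triggers an injection, and with 100 every roll does once the grace period has passed, which matches the "cancel none" through "cancel all" range on the slide.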
Call To Action • Understand how hangs in your drivers can impact customer satisfaction • Support I/O completion/cancellation guidelines in your drivers • But… understand your hardware! In rare cases, adherence to these guidelines may not be feasible • Getting cancellation right is hard. We strongly recommend the following • Use the Cancel-Safe IRP Queues library to implement cancellation with minimal work • Use Windows Driver Foundation for new drivers • If you have to roll your own cancellation, first read and understand the “Cancel Logic in Windows Drivers” white paper (it’s in the DDK) • Use the new Driver Hang Verifier verification options • Test, test, test!
Resources • Email • dvrhangs@microsoft.com • Books • Programming the Microsoft Windows Driver Model, Second Edition by Walter Oney, Microsoft Press, 2002 • Web Resources • I/O Completion/Cancellation Guidelines http://www.microsoft.com/whdc/hwdev/driver/IOcancel.mspx • Cancel-Safe IRP Queues library http://www.microsoft.com/whdc/ddk • “Cancel Logic in Windows Drivers” white paper http://www.microsoft.com/whdc/hwdev/driver/cancel_logic.mspx • Driver Verifier documentation http://www.microsoft.com/whdc/ddk
Community Resources • Community Sites • http://www.microsoft.com/communities/default.mspx • List of Newsgroups • http://communities2.microsoft.com/communities/newsgroups/en-us/default.aspx • Attend a free chat or webcast • http://www.microsoft.com/communities/chats/default.mspx • http://www.microsoft.com/seminar/events/webcasts/default.mspx • Locate a local user group(s) • http://www.microsoft.com/communities/usergroups/default.mspx • Non-Microsoft Community Sites • http://www.microsoft.com/communities/related/default.mspx