
Driver Hangs – Detection And Prevention




  1. Driver Hangs – Detection And Prevention Gerald Maffeo Program Manager Windows Reliability Team geraldm @ microsoft.com Narayanan Ganapathy Architect Windows Device Experience narg @ microsoft.com

  2. Session Goals • Attendees should leave this session with the following • A better understanding of driver hangs and how to find and prevent them • Knowledge of where to find resources for I/O completion/cancellation guidelines and programming information for driver cancellation • The industry will benefit from this session in the following ways • Improved driver quality will result in fewer failed application terminations • Improved driver cancellation will enable a new application cancellation paradigm

  3. Session Outline • Why Hangs? What Causes Them? • Windows Initiatives • I/O Cancellation Overview and Examples • I/O Completion/Cancellation Guidelines • Techniques for Canceling I/O Requests • Cancel-Safe IRP Queues • Compliance Test Tools • Call To Action • Question and Answer

  4. Why Hangs? • A hang is a loss of responsiveness from an application or system for as little as a few seconds: enough to cause the user to attempt recovery, including rebooting • Hangs are a top cause of loss of productivity and customer pain • Numerous surveys and customer satisfaction studies for consumer and corporate desktop environments confirm this • New instrumentation and Windows Error Reports provide hard data • Driver hangs cause applications or systems to hang • Especially serious, since they usually cause reboots • Impact is roughly the same as Online Crash Analysis (OCA) driver crashes • Driver hangs can cause apps to appear to hang randomly

  5. How Do Drivers Cause Hangs? Operations do not complete in a timely fashion and do not adequately support cancellation. The most common causes include • Drivers block user-mode threads with kernel-mode waits • For programmer convenience or to minimize development effort • Due to lack of familiarity with programming asynchronous I/O • Drivers do not implement cancellation • Faulty assumptions about completion times • Difficulty passing cancellation down the stack • Queuing and synchronization logic is complex

  6. Longhorn Initiatives • New Windows operating system support • Cancelable create requests (IRP_MJ_CREATE) • New Win32 cancellation Application Programming Interfaces (APIs) • Kernel Hang Reporting • Driver Hang Verifier option to test completion/cancellation compliance • New I/O completion/cancellation DFW / logo requirements • New Cancellation Support for Applications (“stop button”) • Enables users to cancel operations on demand • Solid driver support for cancellation now far more important

  7. I/O Cancellation Overview • A mechanism to cancel already-issued I/O requests (IRPs) • IRP cancellation scenarios • Thread / process termination • Application wants to cancel an I/O request • Asynchronous I/O (CancelIo, CancelIoEx) • Synchronous I/O (CancelSynchronousIo)

  8. Process Termination Example (diagram) • Application: the process is terminated while its I/O call(s) are outstanding • I/O Manager: the system cancels all I/O associated with the process • Driver(s): the Cancel routine(s) are invoked • Process cleanup occurs only after all IRPs complete or cancel

  9. Longhorn Cancellation – Synchronous I/O (diagram) • Application: Thread 1 (T1) calls CreateFile() and waits for the I/O to complete • Another thread in the process (T2) requests cancellation by passing T1’s handle to CancelSynchronousIo(), which returns immediately • I/O Manager: tries to cancel T1’s synchronous I/O • Driver: the Cancel routine is invoked and the IRP completes • The driver returns with STATUS_CANCELLED, and the status is passed back to the application

  10. Longhorn Cancellation – Asynchronous I/O (diagram) • Application: I/O is issued asynchronously, e.g. with ReadFileEx() • Some thread in the process requests cancellation of all pending file I/O on a specified handle by passing the file handle to CancelIoEx(), which returns immediately • I/O Manager: tries to cancel all pending I/O on this handle • Driver: the Cancel routine(s) are invoked and the IRP completes • The driver returns with STATUS_CANCELLED, and the status is passed back to the application

  11. Key Takeaways • Kernel-mode waits that block user-mode threads are bad • Cannot be interrupted • Don’t wait inside driver (Return STATUS_PENDING) • Always set a Cancel routine on IRPs that are held in a queue
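The "don't wait inside the driver" takeaway can be illustrated with a small user-mode model. This is only a sketch under stated assumptions: `device`, `request`, `dispatch_read`, and `on_data_arrived` are hypothetical names invented for the illustration, not WDK types or routines, and the `pending` and `has_cancel` flags merely stand in for marking an IRP pending and setting a Cancel routine on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-mode model; none of these are WDK types. */
typedef enum { REQ_SUCCESS, REQ_PENDING, REQ_CANCELLED } req_status;

typedef struct request {
    req_status      status;
    int             pending;     /* models marking the IRP pending */
    int             has_cancel;  /* models a Cancel routine being set */
    struct request *next;
} request;

typedef struct {
    int      data_ready;         /* the device already has data to hand out */
    request *head, *tail;        /* queue of pended requests */
} device;

/* Dispatch path: never waits. It completes the request now, or queues it
 * and returns REQ_PENDING so the calling thread is not blocked. */
req_status dispatch_read(device *dev, request *req)
{
    assert(dev != NULL && req != NULL);
    req->next = NULL;
    if (dev->data_ready) {
        dev->data_ready = 0;
        req->status = REQ_SUCCESS;
        return REQ_SUCCESS;
    }
    req->pending = 1;
    req->has_cancel = 1;         /* queued requests always stay cancelable */
    req->status = REQ_PENDING;
    if (dev->tail != NULL) dev->tail->next = req; else dev->head = req;
    dev->tail = req;
    return REQ_PENDING;
}

/* Called when the device produces data. Completes the oldest pended
 * request, or remembers the data if nothing is waiting. */
request *on_data_arrived(device *dev)
{
    assert(dev != NULL);
    request *req = dev->head;
    if (req == NULL) {
        dev->data_ready = 1;
        return NULL;
    }
    dev->head = req->next;
    if (dev->head == NULL) dev->tail = NULL;
    req->next = NULL;
    req->has_cancel = 0;         /* the request is leaving the queue */
    req->status = REQ_SUCCESS;
    return req;
}
```

In this model the dispatch function returns immediately in both cases; the work that a blocking driver would do after its wait moves into `on_data_arrived`.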

  12. I/O Completion/Cancellation Guidelines • Published at WinHEC 2003 • Apply to drivers intended for use with Longhorn • Apply also to IRP_MJ_CREATE, which now supports cancellation • Existing drivers may not automatically be compliant • Compliance with I/O completion/cancellation guidelines is essential • To not block application termination or system shutdown • To support application’s ability to cancel I/O operations • Proposed requirement for DFW Program, 0.45 draft • Some Definitions • Reasonable Period (TR): <<10 seconds from the initial request • Long-term Request: A request that can take > TR to complete • Pend: a driver marks the IRP pending and returns STATUS_PENDING

  13. I/O Completion/Cancellation Guidelines Walkthrough (1/2) • All driver paths must either ensure timely completion or support cancellation • Any request that takes an indefinite time to complete must be cancelable • Waits that block on user-initiated events, e.g., keyboard reads • Drivers should not block user-mode threads inside dispatch routines for >TR • All long-term requests should be pended • When a driver pends an IRP it must • Support IRP cancellation; or • Complete the operation by TR (possibly using timeouts)

  14. I/O Completion/Cancellation Guidelines Walkthrough (2/2) • Close/cleanup requests should never block >TR • A driver that creates new requests to pass to other drivers must also pass on cancellation or be able to disassociate them from the original IRP issued by the I/O Manager • All requests should complete by TR after being canceled • A driver about to complete an IRP for anything other than the current thread must be suspension-proof • A driver should never pend a canceled request
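Some of the walkthrough rules can be restated as a small checklist function. The sketch below is an illustration only: `request_path`, `path_verdict`, and `check_path` are hypothetical names, the flags are a simplification chosen for this example, and only three of the rules are encoded (indefinite requests must be cancelable, long-term requests should be pended, and a pended request must be cancelable or complete by TR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one request path through a driver. */
typedef struct {
    bool long_term;          /* can take longer than TR to complete */
    bool indefinite;         /* waits on a user-initiated event, e.g. a keyboard read */
    bool pended;             /* driver marks the request pending and returns */
    bool cancelable;         /* driver supports cancellation while holding it */
    bool timeout_within_tr;  /* driver completes it by TR on its own, e.g. by timeout */
} request_path;

typedef enum {
    PATH_OK,
    PATH_INDEFINITE_NOT_CANCELABLE, /* indefinite request without cancellation */
    PATH_BLOCKS_IN_DISPATCH,        /* long-term request that is not pended */
    PATH_PENDED_NOT_BOUNDED         /* pended, neither cancelable nor done by TR */
} path_verdict;

/* Returns the first rule the path violates, or PATH_OK. */
path_verdict check_path(const request_path *p)
{
    assert(p != NULL);
    if (p->indefinite && !p->cancelable)
        return PATH_INDEFINITE_NOT_CANCELABLE;
    if (p->long_term && !p->pended)
        return PATH_BLOCKS_IN_DISPATCH;
    if (p->pended && !p->cancelable && !p->timeout_within_tr)
        return PATH_PENDED_NOT_BOUNDED;
    return PATH_OK;
}
```

A pended request with a timeout inside TR passes the check even without cancellation support, which matches the "support IRP cancellation; or complete the operation by TR" alternative on the slide.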

  15. Techniques For Canceling I/O Requests Cancellation is inherently asynchronous and therefore difficult to implement correctly. Race conditions can lead to obscure and infrequent failures. • Use the system-wide cancel spin lock and include a Cancel routine • Not recommended for drivers that do frequent I/O • Use a driver-supplied locking mechanism and include a Cancel routine • Tricky to write and test due to race conditions • No longer recommended for drivers that do frequent I/O • Use cancel-safe IRP queuing (CSQ) • Cancel routine not required • Strongly recommended for existing drivers (supports all Windows versions) • New drivers should use the Windows Driver Foundation • Strongly recommended for new drivers
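The race mentioned above is between the cancel path and the normal completion path, both of which want to complete the same request. One commonly described way to settle it in hand-written cancel logic is to let both paths atomically clear the request's cancel routine and treat whichever path saw it still set as the owner (in the kernel, IoSetCancelRoutine returns the previous routine, which is what makes this test possible). The following is a user-mode sketch of that idea using C11 atomics; `model_irp` and the `model_*` functions are hypothetical, and a boolean flag stands in for the routine pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an IRP; only the fields the race needs. */
typedef struct {
    atomic_bool cancel_armed;      /* models a non-NULL cancel routine */
    atomic_bool cancel_requested;  /* models the request's cancel flag */
    int         completions;       /* must never exceed 1 */
} model_irp;

void model_init(model_irp *irp)
{
    assert(irp != NULL);
    atomic_init(&irp->cancel_armed, false);
    atomic_init(&irp->cancel_requested, false);
    irp->completions = 0;
}

/* Driver queues the request and arms cancellation for it. */
void model_arm(model_irp *irp)
{
    atomic_store(&irp->cancel_armed, true);
}

/* Cancel path: record the cancel request, then try to take ownership.
 * Returns true if this path completed the request. */
bool model_cancel(model_irp *irp)
{
    atomic_store(&irp->cancel_requested, true);
    if (atomic_exchange(&irp->cancel_armed, false)) {
        irp->completions++;        /* complete as canceled */
        return true;
    }
    return false;                  /* the other path already owns it */
}

/* Normal path: disarm cancellation before completing. If the exchange
 * shows it was already disarmed, the cancel path owns the request. */
bool model_dequeue_and_complete(model_irp *irp)
{
    if (atomic_exchange(&irp->cancel_armed, false)) {
        irp->completions++;        /* complete normally */
        return true;
    }
    return false;
}
```

Because the exchange is atomic, exactly one of the two paths observes `true`, so the request is completed once regardless of the order in which they run. A real driver still has to protect its queue with a lock; this sketch covers only the ownership decision.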

  16. Cancel-Safe IRP Queues – Overview • CSQ provides a framework to implement IRP queuing • CSQ framework handles cancellation to prevent race conditions • CSQ support is available now for Windows 2000 and later • CSQ allows drivers to implement simple queue logic and leave the complex cancellation logic to CSQ library • CSQ library provides the following routines • IoCsqInitialize / IoCsqInitializeEx – Initializes the driver's CSQ dispatch table • IoCsqInsertIrp / IoCsqInsertIrpEx – Inserts an IRP into the driver's queue • IoCsqRemoveIrp – Removes a particular IRP from the queue • IoCsqRemoveNextIrp – Removes the next matching IRP in the queue • Drivers must provide • Callback routines to manage queues and locks – these are protected by the framework from collisions with IRP cancellation code • A lock with which to lock the queue • Storage for the queue of pending IRPs

  17. Cancel-Safe IRP Queues – Driver Callbacks • Driver-provided callback routines • XxxInsertIrp / XxxInsertIrpEx – Inserts the IRP into the queue • XxxRemoveIrp – Removes the matching IRP from the queue • XxxPeekNextIrp – Returns the next IRP in the queue • XxxAcquireLock – Locks the queue • XxxReleaseLock – Unlocks the queue • XxxCompleteCanceledIrp – Cancels and completes the IRP
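The split of responsibilities between the driver callbacks and the queue library can be sketched as a single-threaded user-mode model. Everything here is hypothetical (`model_irp`, `model_csq`, the `xxx_*` callbacks, and the `csq_*` functions are not the WDK API, and the lock is a plain flag); the point is only that the driver callbacks contain simple list and lock handling, while the cancel-versus-dequeue decision lives in the library side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-mode stand-ins; not the WDK IRP or IO_CSQ types. */
typedef struct model_irp {
    struct model_irp *next;
    int queued;               /* 1 while the queue owns the request */
    int canceled_completed;   /* set by the complete-canceled callback */
} model_irp;

typedef struct model_csq model_csq;
struct model_csq {
    void       (*insert_irp)(model_csq *, model_irp *);
    void       (*remove_irp)(model_csq *, model_irp *);
    model_irp *(*peek_next_irp)(model_csq *);
    void       (*acquire_lock)(model_csq *);
    void       (*release_lock)(model_csq *);
    void       (*complete_canceled_irp)(model_csq *, model_irp *);
    model_irp *head;          /* driver-owned storage for pending requests */
    int        locked;        /* driver-owned lock, modelled as a flag */
};

/* ---- Driver callbacks: plain list and lock handling only ---- */
static void xxx_insert(model_csq *q, model_irp *irp)
{
    model_irp **link = &q->head;
    while (*link != NULL) link = &(*link)->next;
    irp->next = NULL;
    *link = irp;
}
static void xxx_remove(model_csq *q, model_irp *irp)
{
    model_irp **link = &q->head;
    while (*link != NULL && *link != irp) link = &(*link)->next;
    if (*link == irp) *link = irp->next;
    irp->next = NULL;
}
static model_irp *xxx_peek_next(model_csq *q) { return q->head; }
static void xxx_acquire(model_csq *q) { assert(!q->locked); q->locked = 1; }
static void xxx_release(model_csq *q) { assert(q->locked); q->locked = 0; }
static void xxx_complete_canceled(model_csq *q, model_irp *irp)
{
    (void)q;
    irp->canceled_completed = 1;
}

void xxx_init(model_csq *q)
{
    q->insert_irp = xxx_insert;
    q->remove_irp = xxx_remove;
    q->peek_next_irp = xxx_peek_next;
    q->acquire_lock = xxx_acquire;
    q->release_lock = xxx_release;
    q->complete_canceled_irp = xxx_complete_canceled;
    q->head = NULL;
    q->locked = 0;
}

/* ---- Library side: owns the cancel-versus-dequeue decision ---- */
void csq_insert(model_csq *q, model_irp *irp)
{
    q->acquire_lock(q);
    q->insert_irp(q, irp);
    irp->queued = 1;
    q->release_lock(q);
}

model_irp *csq_remove_next(model_csq *q)
{
    q->acquire_lock(q);
    model_irp *irp = q->peek_next_irp(q);
    if (irp != NULL) { q->remove_irp(q, irp); irp->queued = 0; }
    q->release_lock(q);
    return irp;               /* never a request the cancel path took */
}

void csq_cancel(model_csq *q, model_irp *irp)
{
    q->acquire_lock(q);
    int owned = irp->queued;
    if (owned) { q->remove_irp(q, irp); irp->queued = 0; }
    q->release_lock(q);
    if (owned) q->complete_canceled_irp(q, irp);  /* outside the lock */
}
```

A canceled request is removed under the queue lock before it is completed, so a later `csq_remove_next` can never hand it back to the driver; canceling a request that has already been dequeued does nothing.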

  18. Two Usage Models • Requests spend most time (>TR) in queues – once de-queued, request processing time is short, e.g. a storage stack • Driver inserts IRP into queue using IoCsqInsertIrp(Ex) if device is busy • Driver removes next IRP from queue using IoCsqRemoveNextIrp when device becomes free • Canceled IRPs are ignored by IoCsqRemoveNextIrp • Requests posted to hardware can take a long time (>TR) to process – hardware allows request to be aborted, e.g. a read request on a serial port • Driver allocates a context with the IRP (IrpContext) • Driver passes context to IoCsqInsertIrp(Ex) – CSQ associates context with IRP • Driver uses context instead of IRP • ISR gets context • IRP should not be touched • When ISR or timer fires, driver calls IoCsqRemoveIrp with context • IRP is returned only if it is not canceled • Queue implementation callbacks could be empty if only this model is used
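The second usage model, where the driver holds a context instead of the IRP, can be sketched in user-mode C as follows. The names (`model_irp`, `irp_context`, `ctx_insert`, `ctx_cancel`, `ctx_remove`) are hypothetical and the sketch is single-threaded; in a real driver the context's link to the request would be updated under the queue lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an IRP. */
typedef struct model_irp {
    int canceled_completed;   /* set when the cancel path completes it */
} model_irp;

/* Models the per-request context the driver hands to the queue. */
typedef struct {
    model_irp *irp;           /* NULL once canceled or removed */
    int        hw_aborted;    /* cancel path asked the hardware to abort */
} irp_context;

/* Insert: associate the context with the request. From here on the
 * driver (and its ISR) should use the context, not the request. */
void ctx_insert(irp_context *ctx, model_irp *irp)
{
    assert(ctx != NULL && irp != NULL);
    ctx->irp = irp;
    ctx->hw_aborted = 0;
}

/* Cancel path: detach the request from the context, abort the hardware
 * operation, and complete the request as canceled. */
void ctx_cancel(irp_context *ctx)
{
    assert(ctx != NULL);
    model_irp *irp = ctx->irp;
    ctx->irp = NULL;
    if (irp != NULL) {
        ctx->hw_aborted = 1;
        irp->canceled_completed = 1;
    }
}

/* ISR or timer path: returns the request only if it was not canceled.
 * A NULL result means the request must not be touched. */
model_irp *ctx_remove(irp_context *ctx)
{
    assert(ctx != NULL);
    model_irp *irp = ctx->irp;
    ctx->irp = NULL;
    return irp;
}
```

When the ISR or timer fires after a cancel, `ctx_remove` returns NULL and the handler simply releases the context; it never dereferences a request that may already have been completed.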

  19. Cancel-Safe IRP Queues – Initialization • Driver allocates queue lock and IO_CSQ structure in its device extension:

      typedef struct _DEVICE_EXTENSION {
          LIST_ENTRY PendingIrpQueue;  // IRPs are queued here
          KSPIN_LOCK QueueLock;        // Spin lock protects access to the queue
          PIRP       CurrentIrp;       // Ptr to current device IRP
                                       // QueueLock provides exclusive access
          // … Some driver-specific fields
          IO_CSQ     CancelSafeQueue;  // Cancel-safe IRP queue
      } DEVICE_EXTENSION, *PDEVICE_EXTENSION;

      • Driver calls IoCsqInitialize or IoCsqInitializeEx. In DriverEntry:

      IoCsqInitializeEx( &devExtension->CancelSafeQueue,
                         XxxInsertIrpEx,          // the Ex initializer takes the Ex insert callback
                         XxxRemoveIrp,
                         XxxPeekNextIrp,
                         XxxAcquireLock,
                         XxxReleaseLock,
                         XxxCompleteCanceledIrp );

  20. Kernel Hang Reporting • Provides ongoing OCA-like driver hang failure reports • KHR integration with new live kernel mini-dump feature • New driver quality feature provides kernel data, including stacks and pending I/O requests, for apps that are blocked from terminating • Extended debug API for live kernel minidumps supports both on-demand and KHR automated dump generation • Hang reports (WER) include process timestamps and kernel minidumps if a hung process still exists 10 seconds (default) after TerminateProcess was called • Reporting will start with WinHEC 2004 release

  21. Driver Hang Verifier • New completion/cancellation option for I/O Verifier • Parameters and reports accessed through !verifier (debugger) • Optionally monitors completion times for I/O requests for a targeted set of drivers • Configurable timer to report requests with excessive completion times • Optionally injects random cancellations (IoCancelIrp) for a targeted set of drivers • Probability value ranges from cancel none (0) through cancel all (100) • Configurable timers permit • Setting reasonable completion time for requests before canceling • Reporting excessive cancellation times • Available with WinHEC 2004 release
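The knobs listed on this slide (a cancel probability from 0 to 100, a time allowed before a cancel is injected, and limits for reporting slow completions and slow cancellations) can be modelled in a few lines for use in a driver's own test harness. This is a hypothetical sketch of the decision logic only: `hang_check_config` and the three functions are invented for the illustration and say nothing about how the Driver Hang Verifier is actually implemented or configured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical test-harness settings mirroring the options on the slide. */
typedef struct {
    int  cancel_probability;  /* 0 = cancel none, 100 = cancel all */
    long cancel_after_ms;     /* time a request may pend before a cancel is injected */
    long max_completion_ms;   /* report completions slower than this */
    long max_cancel_ms;       /* report cancellations slower than this */
} hang_check_config;

/* Decide whether to inject a cancel for a request that has been pending
 * for pending_ms. roll is a caller-supplied random number in [0, 99]. */
bool should_inject_cancel(const hang_check_config *cfg, long pending_ms, int roll)
{
    assert(cfg != NULL && roll >= 0 && roll < 100);
    return pending_ms >= cfg->cancel_after_ms && roll < cfg->cancel_probability;
}

/* True if a request took excessively long to complete. */
bool completion_too_slow(const hang_check_config *cfg, long elapsed_ms)
{
    assert(cfg != NULL);
    return elapsed_ms > cfg->max_completion_ms;
}

/* True if a request took excessively long to complete after being canceled. */
bool cancel_too_slow(const hang_check_config *cfg, long since_cancel_ms)
{
    assert(cfg != NULL);
    return since_cancel_ms > cfg->max_cancel_ms;
}
```

Passing the random roll in as a parameter keeps the decision deterministic and easy to unit test; the limits themselves are left to the caller because the slide does not give default values.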

  22. Call To Action • Understand how hangs in your drivers can impact customer satisfaction • Support I/O completion/cancellation guidelines in your drivers • But… understand your hardware! In rare cases, adherence to these guidelines may not be feasible • Getting cancellation right is hard. We strongly recommend • Using the Cancel-Safe IRP Queues library to implement cancellation with minimal work • Using Windows Driver Foundation for new drivers • If you have to roll your own cancellation, first read and understand the “Cancel Logic in Windows Drivers” white paper (it’s in the DDK) • Use the new Driver Hang Verifier verification options • Test, test, test!

  23. Resources • Email • dvrhangs @ microsoft.com • Books • Programming the Microsoft Windows Driver Model, Second Edition, by Walter Oney, Microsoft Press, 2002 • Web Resources • I/O Completion/Cancellation Guidelines http://www.microsoft.com/whdc/hwdev/driver/IOcancel.mspx • Cancel-Safe IRP Queues library http://www.microsoft.com/whdc/ddk • “Cancel Logic in Windows Drivers” white paper http://www.microsoft.com/whdc/hwdev/driver/cancel_logic.mspx • Driver Verifier documentation http://www.microsoft.com/whdc/ddk

  24. Community Resources • Community Sites • http://www.microsoft.com/communities/default.mspx • List of Newsgroups • http://communities2.microsoft.com/communities/newsgroups/en-us/default.aspx • Attend a free chat or webcast • http://www.microsoft.com/communities/chats/default.mspx • http://www.microsoft.com/seminar/events/webcasts/default.mspx • Locate a local user group(s) • http://www.microsoft.com/communities/usergroups/default.mspx • Non-Microsoft Community Sites • http://www.microsoft.com/communities/related/default.mspx
