
Hangs and deadlocks


Hangs - User Perspective

Users like responsive applications. When they click a menu, they want the application to react instantly, even if it is currently printing their work. When they save a lengthy document in their favorite word processor, they want to continue typing while the disk is still spinning. Users get impatient rather quickly when the application does not react in a timely fashion to their input.


A programmer might recognize many legitimate reasons for an application not to instantly respond to user input. The application might be busy recalculating some data, or simply waiting for its disk I/O to complete. However, from user research, we know that users get annoyed and frustrated after just a couple of seconds of unresponsiveness. After 5 seconds, they will try to terminate a hung application. Next to crashes, application hangs are the most common source of user disruption when working with GUI applications.


There are many different root causes for application hangs, and not all of them manifest themselves in an unresponsive UI. However, an unresponsive UI is one of the most common hang experiences, and this scenario currently receives the most operating system support for both detection and recovery. Windows automatically detects hung applications, collects debug information, and optionally terminates or restarts them. Otherwise, the user might have to restart the machine in order to recover from a hung application.



Hangs - Operating System Perspective

When an application (or more accurately, a thread) creates a window on the desktop, it enters into an implicit contract with the Desktop Window Manager (DWM) to process window messages in a timely fashion. The DWM posts messages (keyboard/mouse input and messages from other windows, as well as itself) into the thread-specific message queue. The thread retrieves and dispatches those messages via its message queue. If the thread does not service the queue by calling GetMessage, messages are not processed, and the window hangs: it can neither redraw nor can it accept input from the user. The operating system detects this state by attaching a timer to pending messages in the message queue. If a message has not been retrieved within 5 seconds, the DWM declares the window to be hung. You can query this particular window state via the IsHungAppWindow API.


Detection is only the first step. At this point, the user still cannot even terminate the application - clicking the X (Close) button would result in a WM_CLOSE message, which would be stuck in the message queue just like any other message. The Desktop Window Manager assists by seamlessly hiding and then replacing the hung window with a 'ghost' copy displaying a bitmap of the original window's previous client area (and adding "Not Responding" to the title bar). As long as the original window's thread does not retrieve messages, the DWM manages both windows simultaneously, but allows the user to interact only with the ghost copy. Using this ghost window, the user can only move, minimize, and - most importantly - close the unresponsive application, but not change its internal state.


The Desktop Window Manager does one last thing; it integrates with Windows Error Reporting, allowing the user to not only close and optionally restart the application, but also send valuable debugging data back to Microsoft. You can get this hang data for your own applications by signing up at the Winqual website.


See also: WER.



Hangs - EurekaLog Perspective

EurekaLog's hang detection works similarly to the operating system's. If you enable hang detection, EurekaLog creates a new thread when your application starts. This "hang detection" thread repeatedly asks the UI thread to process a WM_NULL message, a message that does nothing, which makes it suitable for polling a window. If an application window is hung, it cannot process the WM_NULL message, and EurekaLog detects the hang.


Note: the operating system does not send WM_NULL messages to your threads. It does not need to, because it already has all the necessary information (the last sent message and delay times). EurekaLog has no access to this information, so it must send a WM_NULL message to detect hangs.
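The polling idea can be illustrated with a minimal, language-neutral sketch in Python. This is not EurekaLog's implementation: MessageLoop, PING and is_hung are hypothetical names, a plain queue stands in for the thread's window message queue, and PING plays the role of WM_NULL.

```python
import queue
import threading

PING = object()  # stand-in for WM_NULL: a message that does nothing


class MessageLoop:
    """Toy stand-in for a UI thread's message queue (not a real Windows queue)."""

    def __init__(self):
        self.messages = queue.Queue()

    def pump_once(self, timeout=0.05):
        """Retrieve and dispatch one message, as a UI thread would."""
        try:
            _message, done = self.messages.get(timeout=timeout)
        except queue.Empty:
            return
        # A no-op message needs no work; acknowledging it is enough.
        done.set()


def is_hung(loop, timeout):
    """Post a no-op message; report True if it stays unprocessed for `timeout` seconds."""
    done = threading.Event()
    loop.messages.put((PING, done))
    return not done.wait(timeout)
```

A watchdog thread would call is_hung in a loop with the configured timeout. As long as the UI thread keeps calling pump_once, every ping is acknowledged quickly; once the UI thread stops servicing its queue, the ping times out and the hang is reported.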


This technique works only in GUI applications (as does the technique used by the operating system) and only for the main thread (because the GUI in VCL, CLX and FMX applications is restricted to the main thread).


However, if your particular application has its own way to detect hangs, you can call the RaiseFreezeException function to trigger hang detection. For example, if you spawn a background thread (to offload heavy work and keep the GUI responsive) and do not get a reply from it within a reasonable amount of time, you can consider the background thread hung and call RaiseFreezeException to invoke the freeze detection dialog.
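The deadline check described above might look like the following Python sketch. All names are assumptions made for illustration: raise_freeze_exception is a hypothetical placeholder for RaiseFreezeException (whose real signature is not shown in this article), and MonitoredWorker and FreezeDetected do not exist in EurekaLog.

```python
import threading
import time


class FreezeDetected(Exception):
    """Hypothetical stand-in for the exception raised by hang detection."""


def raise_freeze_exception():
    # Hypothetical placeholder for EurekaLog's RaiseFreezeException;
    # the real function's signature is not shown in this article.
    raise FreezeDetected("background worker did not reply in time")


class MonitoredWorker:
    """Runs `work` on a background thread and remembers a reply deadline."""

    def __init__(self, work, deadline_seconds):
        self.result = None
        self._finished = threading.Event()
        self._deadline = time.monotonic() + deadline_seconds

        def run():
            self.result = work()
            self._finished.set()

        self.thread = threading.Thread(target=run, daemon=True)
        self.thread.start()

    def check(self):
        """Call periodically (for example from a UI timer); never blocks.

        Returns True when the reply has arrived, False while still waiting,
        and reports a freeze once the deadline has passed without a reply.
        """
        if self._finished.is_set():
            return True
        if time.monotonic() > self._deadline:
            raise_freeze_exception()
        return False
```

The check is deliberately non-blocking, so the thread that performs it stays free to process messages. Note that in this sketch a worker that dies with an unhandled error never replies and is therefore also reported as frozen.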


If your application is running on Windows Vista or later (Windows Vista, Windows 7, Windows 8, Windows 8.1, Windows 10, etc.), EurekaLog uses the Wait Chain Traversal (WCT) API to detect deadlocks between threads. Livelocks are not detected.


Once EurekaLog detects a hang or deadlock in the application, it raises a specially constructed exception. This immediately triggers standard exception processing, which invokes EurekaLog, displays an error dialog, sends a report, etc.



Hangs - Developer Perspective

The operating system and EurekaLog define an application hang as a UI thread that has not processed messages for at least 5 seconds (for the OS) or 60 seconds (the default for EurekaLog). Obvious bugs cause some hangs, for example, a thread waiting for an event that is never signaled, or two threads each holding a lock and trying to acquire the other's. You can fix those bugs without too much effort. However, many hangs are not so clear. Yes, the UI thread is not retrieving messages - but it is equally busy doing other 'important' work and will eventually come back to processing messages.


However, the user perceives this as a bug. The design should match the user's expectations. If the application's design leads to an unresponsive application, the design will have to change. Finally, and this is important, unresponsiveness cannot be fixed like a code bug; it requires upfront work during the design phase. Trying to retrofit an application's existing code base to make the UI more responsive is often too expensive. The following design guidelines might help:


Make UI responsiveness a top-level requirement; the user should always feel in control of your application
Ensure that users can cancel operations that take longer than one second to complete and/or that operations can complete in the background; provide appropriate progress UI if necessary
Queue long-running or blocking operations as background tasks (this requires a well-thought-out messaging mechanism to inform the UI thread when work has been completed)
Keep the code for UI threads simple; remove as many blocking API calls as possible
Show windows and dialogs only when they are ready and fully operational. If the dialog needs to display information that is too resource-intensive to calculate, show some generic information first and update it on the fly when more data becomes available. A good example is the folder properties dialog from Windows Explorer. It needs to display the folder's total size, information that is not readily available from the file system. The dialog shows up right away and the "size" field is updated from a worker thread
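As an illustration of the background-task and "update on the fly" guidelines, here is a minimal Python sketch modeled on the folder size example. The function names are hypothetical, and a queue stands in for whatever messaging mechanism the application uses to notify its UI thread.

```python
import queue
import threading


def compute_folder_size(file_sizes, inbox):
    """Worker thread: sum the sizes, posting running totals for the UI."""
    total = 0
    for size in file_sizes:
        total += size
        inbox.put(("progress", total))  # lets the UI refresh the "size" field
    inbox.put(("done", total))


def drain_inbox(inbox, on_message):
    """UI thread: handle whatever has arrived so far, without blocking."""
    while True:
        try:
            kind, value = inbox.get_nowait()
        except queue.Empty:
            return
        on_message(kind, value)


if __name__ == "__main__":
    inbox = queue.Queue()
    worker = threading.Thread(target=compute_folder_size, args=([10, 20, 30], inbox))
    worker.start()
    worker.join()  # a real UI would keep running and drain the inbox periodically
    drain_inbox(inbox, lambda kind, value: print(kind, value))
```

The worker never touches the UI directly; it only posts messages. The UI thread decides when to pick them up, so it is never blocked by the calculation.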


Unfortunately, there is no standard simple way to design and write a responsive application. Windows and Delphi do not provide a simple asynchronous framework that would allow for easy scheduling of blocking or long-running operations. However, some third-party frameworks and solutions can help you develop smooth applications; look for information about AsyncCalls, TasksEx and OTL (OmniThreadLibrary). The following sections introduce some of the best practices in preventing hangs and highlight some of the common pitfalls.



Best Practices

Keep the UI Thread Simple

The UI thread's primary responsibility is to retrieve and dispatch messages. Any other kind of work introduces the risk of hanging the windows owned by this thread.



Move resource-intensive or unbounded algorithms that result in long-running operations to worker threads
Identify as many blocking function calls as possible and try to move them to worker threads; any function calling into another DLL should be suspicious
Make an extra effort to remove all file I/O and networking API calls from your UI thread. These functions can block for many seconds if not minutes. If you need to do any kind of I/O in the UI thread, consider using asynchronous I/O
Be aware that your UI thread is also servicing all single-threaded apartment (STA) COM servers hosted by your process; if you make a blocking call, these COM servers will be unresponsive until you service the message queue again


Do not:

Wait on any kernel object (like Event or Mutex) for more than a very short amount of time; if you have to wait at all, consider using MsgWaitForMultipleObjects, which will unblock when a new message arrives
Share a thread's window message queue with another thread by using the AttachThreadInput function. It is not only extremely difficult to properly synchronize access to the queue, it also can prevent the Windows operating system from properly detecting a hung window
Use TerminateThread on any of your worker threads. Terminating a thread in this way will not allow it to release locks or signal events and can easily result in orphaned synchronization objects
Call into any 'unknown' code from your UI thread. This is especially true if your application has an extensibility model; there is no guarantee that 3rd-party code follows your responsiveness guidelines
Make any kind of blocking broadcast call; SendMessage(HWND_BROADCAST) puts you at the mercy of every ill-written application currently running
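The first item in the list above can be approximated in a portable way. Python has no direct counterpart to MsgWaitForMultipleObjects, so the sketch below polls instead: it waits in short slices and services messages between slices. The function name and its parameters are hypothetical.

```python
import threading
import time


def wait_while_pumping(event, pump_messages, slice_seconds=0.01, timeout=None):
    """Wait for `event` in short slices, servicing messages between slices.

    Returns True if the event was signaled, False if `timeout` elapsed first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not event.wait(slice_seconds):
        pump_messages()  # keep the UI alive while waiting
        if deadline is not None and time.monotonic() >= deadline:
            return False
    return True


if __name__ == "__main__":
    ready = threading.Event()
    threading.Timer(0.2, ready.set).start()  # something signals the event later
    signaled = wait_while_pumping(ready, lambda: None)
    print("signaled:", signaled)
```

On Windows, MsgWaitForMultipleObjects avoids the polling entirely because the wait itself returns as soon as a new message arrives.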


Implement Asynchronous Patterns

Removing long-running or blocking operations from the UI thread requires implementing an asynchronous framework that allows offloading those operations to worker threads.



Use asynchronous window message APIs in your UI thread, especially by replacing SendMessage with one of its non-blocking peers: PostMessage, SendNotifyMessage, or SendMessageCallback
Use background threads to execute long-running or blocking tasks. Use the new thread pool API to implement your worker threads
Provide cancellation support for long-running background tasks. For blocking I/O operations, use I/O cancellation, but only as a last resort; it's not easy to cancel the 'right' operation
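Cooperative cancellation of a background task can be sketched as follows. This is a minimal Python illustration with hypothetical names: the worker checks a shared flag between units of work, and the UI sets that flag when the user asks to cancel.

```python
import threading


def run_cancellable(items, handle, cancel):
    """Process items one at a time, checking the cancel flag between items.

    Returns the number of items that were handled before stopping.
    """
    handled = 0
    for item in items:
        if cancel.is_set():
            break  # the user asked to stop; leave remaining items untouched
        handle(item)
        handled += 1
    return handled


if __name__ == "__main__":
    cancel = threading.Event()
    worker = threading.Thread(
        target=run_cancellable, args=(range(1_000_000), lambda item: None, cancel)
    )
    worker.start()
    cancel.set()   # what a Cancel button handler would do
    worker.join()  # returns promptly because the worker checks the flag
```

The smaller each unit of work is, the faster the task reacts to a cancel request. A task that is stuck inside a single blocking call cannot see the flag, which is why blocking I/O needs separate treatment.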


Use Locks Wisely

Your application or DLL needs locks to synchronize access to its internal data structures. Using multiple locks increases parallelism and makes your application more responsive. However, using multiple locks also increases the chance of acquiring those locks in different orders and causing your threads to deadlock. If two threads each hold a lock and then try to acquire the other thread's lock, their operations will form a circular wait that blocks any forward progress for these threads. You can avoid this deadlock only by ensuring that all threads in the application always acquire all locks in the same order. However, it isn't always easy to acquire locks in the 'right' order. Software components can be composed, but lock acquisitions cannot. If your code calls some other component, that component's locks now become part of your implicit lock order - even if you have no visibility into those locks.
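One way to enforce a single acquisition order is to give every lock a rank and always acquire in ascending rank. The Python sketch below shows the idea; RankedLock and hold are hypothetical names, and the sketch assumes that ranks are unique and that all required locks are requested in one call.

```python
import threading
from contextlib import contextmanager


class RankedLock:
    """A lock with a fixed position in the lock hierarchy."""

    def __init__(self, rank):
        self.rank = rank
        self.mutex = threading.Lock()


@contextmanager
def hold(*locks):
    """Acquire the given locks in ascending rank, release in reverse order."""
    ordered = sorted(locks, key=lambda lock: lock.rank)
    acquired = []
    try:
        for lock in ordered:
            lock.mutex.acquire()
            acquired.append(lock)
        yield
    finally:
        for lock in reversed(acquired):
            lock.mutex.release()


if __name__ == "__main__":
    accounts, audit = RankedLock(1), RankedLock(2)
    # The caller's argument order does not matter; rank decides.
    with hold(audit, accounts):
        pass
```

Because every thread ends up acquiring accounts before audit, no circular wait can form between these two locks. The sketch does not help with locks taken inside components you call, which is exactly the composition problem described above.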


Things get even harder because locking operations include far more than the usual functions for Critical Sections, Mutexes, and other traditional locks. Any blocking call that crosses thread boundaries has synchronization properties that can result in a deadlock. The calling thread performs an operation with 'acquire' semantics and cannot unblock until the target thread 'releases' that call. Quite a few User32 functions (for example SendMessage), as well as many blocking COM calls fall into this category.


Worse yet, the operating system has its own internal process-specific lock that sometimes is held while your code executes. This lock is acquired when DLLs are loaded into the process, and is therefore called the 'loader lock.' The DllMain function always executes under the loader lock; if you acquire any locks in DllMain (and you should not), you need to make the loader lock part of your lock order. Calling certain Win32 APIs might also acquire the loader lock on your behalf - functions like LoadLibraryEx, GetModuleHandle, and especially CoCreateInstance.



Design a lock hierarchy and obey it. Add all the necessary locks. There are many more synchronization primitives than just Mutex and CriticalSections; they all need to be included. Include the loader lock in your hierarchy if you take any locks in DllMain
Agree on locking protocol with your dependencies. Any code your application calls or that might call your application needs to share the same lock hierarchy
Lock data structures not functions. Move lock acquisitions away from function entry points and guard only data access with locks. If less code operates under a lock, there is less of a chance for deadlocks
Analyze lock acquisitions and releases in your error handling code. Often the lock hierarchy is forgotten when trying to recover from an error condition
Replace nested locks with reference counters - they cannot deadlock. Individually locked elements in lists and tables are good candidates
Be careful when waiting on a thread handle from a DLL. Always assume that your code could be called under the loader lock. It's better to reference-count your resources and let the worker thread do its own cleanup (and then use FreeLibraryAndExitThread to terminate cleanly)
Use Wait Chain Traversal (WCT) if you want to diagnose your own deadlocks


Do not:

Do anything other than very simple initialization work in your DllMain function. Especially do not call LoadLibraryEx or CoCreateInstance
Write your own locking primitives. Custom synchronization code can easily introduce subtle bugs into your code base. Use the rich selection of operating system and Delphi's RTL synchronization objects instead
Do any work in the constructors and destructors for global variables; they are executed under the loader lock


Be Careful with Exceptions

Exceptions allow the separation of normal program flow and error handling. Because of this separation, it can be difficult to know the precise state of the program prior to the exception and the exception handler might miss crucial steps in restoring a valid state. This is especially true for lock acquisitions that need to be released in the handler to prevent future deadlocks.



Use the try/finally pattern with locks to ensure that every lock is released when an exception occurs
Be careful with the code executing in an exception handler; the exception might have leaked many locks, so your handler should not acquire any
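The try/finally pattern for locks can be shown in a few lines. The sketch below uses Python, whose try/finally has the same shape as Delphi's; the withdraw function and its data are made up for illustration.

```python
import threading

balance_lock = threading.Lock()
balance = {"amount": 100}


def withdraw(amount):
    """Update shared data under a lock that is released even on errors."""
    balance_lock.acquire()
    try:
        if amount > balance["amount"]:
            raise ValueError("insufficient funds")
        balance["amount"] -= amount
    finally:
        # Runs on both the normal path and the exception path,
        # so the lock is never leaked.
        balance_lock.release()
```

The lock is acquired before the try block, so the finally block only releases a lock that was actually taken. Without the finally block, the ValueError would leave the lock held and the next caller of withdraw would block forever.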


Do not:

Handle native exceptions if not necessary or required. If you use native exception handlers for reporting or data recovery after catastrophic failures, consider using the EurekaLog or default operating system mechanism of Windows Error Reporting instead


This article is based on Preventing Hangs in Windows Applications

Build date: 2018-11-26
Last edited: 2018-06-14
