Yesterday a node in our four-node file cluster crashed with a blue
screen, and the file cluster resource was moved to another node. I wanted
to analyze the crash dump file (C:\Windows\Minidump\070711-36473-01.dmp),
so I copied it to my Windows 7 workstation and tried to open it, but
Visual Studio could not help me out here.
Reading a crash dump file is far from intuitive, and I spent a good part of the morning learning about debugging. Here is what I did to read the dump file.
First, install the debugging tools from here. Choose the version that matches your architecture. The install can take a long time, depending on your network speed. Make sure you include WinDbg.exe, because that is the tool we will be using.
Next, download the symbol files. You can also use the symbol server from Microsoft, but it is faster to have a copy of the symbol files on your hard drive. Download them here; just download them all. This will also take a long time because the symbol files are huge.
Next, open C:\Program Files\Debugging Tools for Windows (x86)\WinDbg.exe.
Choose File -> Symbol File Path (or press CTRL+S).
Type: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
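The symbol path has three parts separated by asterisks: the SRV keyword, a local cache folder, and the symbol server URL. Symbols fetched from the server are stored in the cache folder, so they are only downloaded once. As a minimal sketch in Python (the helper name build_symbol_path is made up for illustration and is not part of any debugging tool), the string can be assembled like this:

```python
def build_symbol_path(cache_dir: str, server_url: str) -> str:
    """Join the three parts of a symbol path: SRV*<local cache>*<server URL>."""
    return "*".join(["SRV", cache_dir, server_url])


if __name__ == "__main__":
    # Values taken from the step above.
    print(build_symbol_path(r"C:\Symbols",
                            "http://msdl.microsoft.com/download/symbols"))
```

If you keep your symbols somewhere else, only the middle part changes.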
Now press CTRL+D to open the DMP file! Very exciting.
Now enter !analyze -v at the kd> prompt, and you'll get more information about the crash.
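The same steps can probably be done in one go from the command line, because WinDbg accepts -y for the symbol path, -z for the dump file and -c for a command to run at startup. The Python sketch below only builds and prints that command line. The install path is an assumption based on the step above, the helper name windbg_command is made up, and I have not run this against the cluster dump, so treat it as a starting point:

```python
import subprocess

# Assumed install location (from the step above); adjust to your machine.
WINDBG = r"C:\Program Files\Debugging Tools for Windows (x86)\WinDbg.exe"


def windbg_command(dump_file, symbol_path, debugger=WINDBG):
    """Return the argument list that opens a dump and runs !analyze -v."""
    return [
        debugger,
        "-y", symbol_path,    # symbol file path
        "-z", dump_file,      # crash dump to open
        "-c", "!analyze -v",  # command executed once the dump is loaded
    ]


if __name__ == "__main__":
    cmd = windbg_command(
        r"C:\Windows\Minidump\070711-36473-01.dmp",
        r"SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols",
    )
    # Print the command line; nothing is started here.
    print(subprocess.list2cmdline(cmd))
    # subprocess.run(cmd)  # uncomment on a machine where WinDbg is installed
```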
Explanation: USER_MODE_HEALTH_MONITOR (9e) is the bug check code I
need to investigate. For a complete list of bugcheck codes look here:
http://msdn.microsoft.com/en-us/library/ff542347%28v=VS.85%29.aspx
Here is the full output of !analyze -v in my case:
8: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

USER_MODE_HEALTH_MONITOR (9e)
One or more critical user mode components failed to satisfy a health check.
Hardware mechanisms such as watchdog timers can detect that basic kernel
services are not executing. However, resource starvation issues, including
memory leaks, lock contention, and scheduling priority misconfiguration,
may block critical user mode components without blocking DPCs or
draining the nonpaged pool.
Kernel components can extend watchdog timer functionality to user mode
by periodically monitoring critical applications. This bugcheck indicates
that a user mode health check failed in a manner such that graceful
shutdown is unlikely to succeed. It restores critical services by
rebooting and/or allowing application failover to other servers.
Arguments:
Arg1: fffffa8038f3ab30, Process that failed to satisfy a health check within the configured timeout
Arg2: 00000000000004b0, Health monitoring timeout (seconds)
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

PROCESS_OBJECT: fffffa8038f3ab30
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: DRIVER_FAULT_SERVER_MINIDUMP
BUGCHECK_STR: 0x9E
PROCESS_NAME: System
CURRENT_IRQL: 2
LAST_CONTROL_TRANSFER: from fffff880030b76a5 to fffff80001a98d00

STACK_TEXT:
fffff880`0253d518 fffff880`030b76a5 : 00000000`0000009e fffffa80`38f3ab30 00000000`000004b0 00000000`00000000 : nt!KeBugCheckEx
fffff880`0253d520 fffff800`01aa4652 : fffff880`0253d600 00000000`00000000 00000000`40800088 00000000`00000001 : netft!NetftWatchdogTimerDpc+0xb9
fffff880`0253d570 fffff800`01aa44f6 : fffff880`030c4100 00000000`03023940 00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x66
fffff880`0253d5e0 fffff800`01aa43de : 00000729`6e09a2ce fffff880`0253dc58 00000000`03023940 fffff880`02517d88 : nt!KiProcessExpiredTimerList+0xc6
fffff880`0253dc30 fffff800`01aa41c7 : 000001c5`99d9f3c1 000001c5`03023940 000001c5`99d9f3fd 00000000`00000040 : nt!KiTimerExpiration+0x1be
fffff880`0253dcd0 fffff800`01a90a2a : fffff880`02515180 fffff880`025202c0 00000000`00000000 fffff880`01368420 : nt!KiRetireDpcList+0x277
fffff880`0253dd80 00000000`00000000 : fffff880`0253e000 fffff880`02538000 fffff880`0253dd40 00000000`00000000 : nt!KiIdleLoop+0x5a

STACK_COMMAND: kb
FOLLOWUP_IP:
netft!NetftWatchdogTimerDpc+b9
fffff880`030b76a5 cc int 3
SYMBOL_STACK_INDEX: 1
SYMBOL_NAME: netft!NetftWatchdogTimerDpc+b9
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: netft
IMAGE_NAME: netft.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 4a5bc48a
FAILURE_BUCKET_ID: X64_0x9E_netft!NetftWatchdogTimerDpc+b9
BUCKET_ID: X64_0x9E_netft!NetftWatchdogTimerDpc+b9
Followup: MachineOwner
---------
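Two things in this output are easy to miss. The Arg2 value is hexadecimal, so the health monitoring timeout of 4b0 is 1200 seconds, or 20 minutes. And the last column of STACK_TEXT shows which module and function each frame belongs to, with the most recent call at the top. The Python sketch below pulls both out of a saved copy of the output. The function names and the regular expressions are my own and only tested against the lines shown here, so they may need adjusting for other dumps:

```python
import re

# Three lines copied from the output above.
SAMPLE = """\
Arg2: 00000000000004b0, Health monitoring timeout (seconds)
fffff880`0253d518 fffff880`030b76a5 : 00000000`0000009e fffffa80`38f3ab30 00000000`000004b0 00000000`00000000 : nt!KeBugCheckEx
fffff880`0253d520 fffff800`01aa4652 : fffff880`0253d600 00000000`00000000 00000000`40800088 00000000`00000001 : netft!NetftWatchdogTimerDpc+0xb9
"""


def timeout_seconds(text):
    """Return Arg2 (a hexadecimal number) as an integer, or None if absent."""
    match = re.search(r"Arg2:\s*([0-9a-fA-F]+)", text)
    return int(match.group(1), 16) if match else None


def stack_frames(text):
    """Return the module!function names of the stack frames, top frame first."""
    # A frame line ends with " : module!Function", optionally followed by +0x<offset>.
    pattern = r" : (\w+!\w+)(?:\+0x[0-9a-fA-F]+)?\s*$"
    return re.findall(pattern, text, flags=re.MULTILINE)


if __name__ == "__main__":
    seconds = timeout_seconds(SAMPLE)
    print(seconds, "seconds =", seconds // 60, "minutes")
    for frame in stack_frames(SAMPLE):
        print(frame)
```

In this dump the frame right below nt!KeBugCheckEx is netft!NetftWatchdogTimerDpc, which matches the MODULE_NAME and IMAGE_NAME (netft.sys) that the debugger reports.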
And now all that is left for me to say is: ‘happy debugging’.
Oh, and here are some helpful links:
http://blogs.technet.com/b/askcore/archive/2009/06/12/why-is-my-2008-failover-clustering-node-blue-screening-with-a-stop-0x0000009e.aspx
http://blogs.msdn.com/b/ntdebugging/archive/tags/hangs/