Sunday, September 14, 2014

Analyzing a Server 2008 R2 DMP crash dump file

Yesterday a node in our four-node file cluster blue-screened and the cluster resource was moved to another node. I wanted to analyze the crash dump file (C:\Windows\Minidump\070711-36473-01.dmp), so I copied it to my Windows 7 workstation and tried to open it, but Visual Studio could not help me out here.
Reading a crash dump file is far from intuitive, and I spent a great deal of the morning learning about debugging. So here is what I did to read the dump file.
First, you need to install the Debugging Tools for Windows from here. Choose the version that corresponds to your architecture. The install can take a long time, depending on your network speed. Make sure you include WinDbg.exe, because that is the tool we will be using.
Next, you need to download the symbol files. Note that you can also use the symbol server from Microsoft, but it is faster to have a copy of the symbol files on your hard drive. Download them here. Just download them all. This will also take a long time because the symbol files are huge.
Next, open C:\Program Files\Debugging Tools for Windows (x86)\WinDbg.exe.
Choose File -> Symbol File Path (CTRL+S)

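Type: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols (this uses C:\Symbols as the local symbol store and fetches anything missing from the Microsoft symbol server).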
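The symbol path follows the pattern SRV*<local cache>*<symbol server>. Here is a minimal Python sketch of how that string is put together. The helper name build_symbol_path is made up for illustration. Exporting the result through the _NT_SYMBOL_PATH environment variable is an alternative to typing it into the dialog, not a required step.

```python
import os


def build_symbol_path(cache_dir, server_url):
    """Return a symbol path of the form SRV*<cache>*<server>."""
    return "SRV*{}*{}".format(cache_dir, server_url)


symbol_path = build_symbol_path(
    r"C:\Symbols",                                  # local symbol cache
    "http://msdl.microsoft.com/download/symbols",   # Microsoft symbol server
)
print(symbol_path)

# Optional: expose the path to debuggers launched from this process.
os.environ["_NT_SYMBOL_PATH"] = symbol_path
```

Keeping the cache directory first means symbols downloaded once are reused on later runs instead of being fetched again.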

Now press CTRL+D to open the DMP file! Very exciting.
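If you would rather skip the GUI, the dump can probably be opened from a script as well. The sketch below is an assumption-laden example, not what I actually ran. It assumes kd.exe (the console kernel debugger that ships with the Debugging Tools) sits in the same folder as WinDbg.exe. It uses the -y (symbol path), -z (dump file) and -c (initial command) switches to run the !analyze -v command from the next step and then quit. The function names are hypothetical.

```python
import subprocess

# Assumed install location; adjust to wherever the Debugging Tools live.
KD_EXE = r"C:\Program Files\Debugging Tools for Windows (x86)\kd.exe"
SYMBOL_PATH = r"SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols"


def build_analyze_command(dump_file, kd_exe=KD_EXE, symbol_path=SYMBOL_PATH):
    """Build the argument list for a one-shot '!analyze -v' run."""
    return [
        kd_exe,
        "-y", symbol_path,        # symbol search path
        "-z", dump_file,          # crash dump to open
        "-c", "!analyze -v; q",   # run the analysis, then quit
    ]


def analyze_dump(dump_file):
    """Run kd.exe on the dump and return its text output (Windows only)."""
    result = subprocess.run(
        build_analyze_command(dump_file),
        capture_output=True,
        text=True,
    )
    return result.stdout


command = build_analyze_command(r"C:\Windows\Minidump\070711-36473-01.dmp")
print(subprocess.list2cmdline(command))
```

Only build_analyze_command runs here. Calling analyze_dump needs a Windows machine with the tools installed at the assumed path.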

Now enter !analyze -v and you'll get more information about the crash. In my case:
8: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
USER_MODE_HEALTH_MONITOR (9e)
One or more critical user mode components failed to satisfy a health check.
Hardware mechanisms such as watchdog timers can detect that basic kernel
services are not executing. However, resource starvation issues, including
memory leaks, lock contention, and scheduling priority misconfiguration,
may block critical user mode components without blocking DPCs or
draining the nonpaged pool.
Kernel components can extend watchdog timer functionality to user mode
by periodically monitoring critical applications. This bugcheck indicates
that a user mode health check failed in a manner such that graceful
shutdown is unlikely to succeed. It restores critical services by
rebooting and/or allowing application failover to other servers.
Arguments:
Arg1: fffffa8038f3ab30, Process that failed to satisfy a health check within the
 configured timeout
Arg2: 00000000000004b0, Health monitoring timeout (seconds)
Arg3: 0000000000000000
Arg4: 0000000000000000
Debugging Details:
------------------
PROCESS_OBJECT: fffffa8038f3ab30
CUSTOMER_CRASH_COUNT:  1
DEFAULT_BUCKET_ID:  DRIVER_FAULT_SERVER_MINIDUMP
BUGCHECK_STR:  0x9E
PROCESS_NAME:  System
CURRENT_IRQL:  2
LAST_CONTROL_TRANSFER:  from fffff880030b76a5 to fffff80001a98d00
STACK_TEXT:
fffff880`0253d518 fffff880`030b76a5 : 00000000`0000009e fffffa80`38f3ab30 00000000`000004b0 00000000`00000000 : nt!KeBugCheckEx
fffff880`0253d520 fffff800`01aa4652 : fffff880`0253d600 00000000`00000000 00000000`40800088 00000000`00000001 : netft!NetftWatchdogTimerDpc+0xb9
fffff880`0253d570 fffff800`01aa44f6 : fffff880`030c4100 00000000`03023940 00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x66
fffff880`0253d5e0 fffff800`01aa43de : 00000729`6e09a2ce fffff880`0253dc58 00000000`03023940 fffff880`02517d88 : nt!KiProcessExpiredTimerList+0xc6
fffff880`0253dc30 fffff800`01aa41c7 : 000001c5`99d9f3c1 000001c5`03023940 000001c5`99d9f3fd 00000000`00000040 : nt!KiTimerExpiration+0x1be
fffff880`0253dcd0 fffff800`01a90a2a : fffff880`02515180 fffff880`025202c0 00000000`00000000 fffff880`01368420 : nt!KiRetireDpcList+0x277
fffff880`0253dd80 00000000`00000000 : fffff880`0253e000 fffff880`02538000 fffff880`0253dd40 00000000`00000000 : nt!KiIdleLoop+0x5a
STACK_COMMAND:  kb
FOLLOWUP_IP:
netft!NetftWatchdogTimerDpc+b9
fffff880`030b76a5 cc              int     3
SYMBOL_STACK_INDEX:  1
SYMBOL_NAME:  netft!NetftWatchdogTimerDpc+b9
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: netft
IMAGE_NAME:  netft.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  4a5bc48a
FAILURE_BUCKET_ID:  X64_0x9E_netft!NetftWatchdogTimerDpc+b9
BUCKET_ID:  X64_0x9E_netft!NetftWatchdogTimerDpc+b9
Followup: MachineOwner
---------
Explanation: USER_MODE_HEALTH_MONITOR (9e) is the bug check code I need to investigate; the stack shows it was raised from netft!NetftWatchdogTimerDpc in netft.sys. For a complete list of bug check codes, look here: http://msdn.microsoft.com/en-us/library/ff542347%28v=VS.85%29.aspx
And now all that is left for me to say is: ‘happy debugging’.
Oh, and here are some helpful links: http://blogs.technet.com/b/askcore/archive/2009/06/12/why-is-my-2008-failover-clustering-node-blue-screening-with-a-stop-0x0000009e.aspx
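The hex values in the output are easier to reason about once decoded. Here is a small Python sketch. It takes Arg2 as a number of seconds, as the output states. It also assumes DEBUG_FLR_IMAGE_TIMESTAMP is the usual image timestamp counted in seconds since 1970-01-01 UTC. The function names are mine, for illustration only.

```python
from datetime import datetime, timedelta, timezone


def decode_timeout(arg2_hex):
    """Convert the hex Arg2 value (seconds) to a timedelta."""
    return timedelta(seconds=int(arg2_hex, 16))


def decode_image_timestamp(stamp_hex):
    """Convert a hex image timestamp (seconds since 1970-01-01 UTC)."""
    return datetime.fromtimestamp(int(stamp_hex, 16), tz=timezone.utc)


timeout = decode_timeout("00000000000004b0")     # Arg2 from the output
built = decode_image_timestamp("4a5bc48a")       # DEBUG_FLR_IMAGE_TIMESTAMP

print("Health monitoring timeout:", timeout)
print("netft.sys image timestamp (UTC):", built.isoformat())
```

0x4b0 works out to 1200 seconds, so the health check had gone unanswered for 20 minutes before the watchdog fired.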

http://blogs.msdn.com/b/ntdebugging/archive/tags/hangs/
