An ESXi PSOD is always a scary sight for VMware administrators. It is the ESXi equivalent of the Blue Screen of Death in Windows. A Purple Screen of Death (PSOD) is a diagnostic screen with a purple background that is displayed when the VMkernel of an ESX/ESXi host experiences a critical error and becomes inoperative. It brings down all the virtual machines running on that host. VMware HA then needs to restart the failed VMs on other ESXi hosts in the cluster to bring them back online. A PSOD therefore causes downtime for your production virtual machines, and you also need to reboot the ESXi host to recover from it.
An ESXi PSOD shows the memory state at the time of the host crash, along with other information such as the ESXi build and version and the exception type. It also shows what was running on each CPU at the time of the crash, a backtrace, error messages, and information about the core dump. The core dump (or memory dump) is a file that contains further diagnostic information from a PSOD and can be given to VMware support to determine the root cause of the failure.
A purple diagnostic screen can also come in the form of an exception. An exception handler is a computer hardware mechanism designed to handle a condition that changes the normal flow of execution (division by zero, page fault, etc.). Handlers leave no trace, so you need logging (or single-step debugging) to determine whether a handler faulted. Below is a list of some of the exceptions.
- Exception Type 0 #DE: Divide Error
- Exception Type 1 #DB: Debug Exception
- Exception Type 2 NMI: Non-Maskable Interrupt
- Exception Type 3 #BP: Breakpoint Exception
- Exception Type 4 #OF: Overflow (INTO instruction)
- Exception Type 5 #BR: Bounds check (BOUND instruction)
- Exception Type 6 #UD: Invalid Opcode
- Exception Type 7 #NM: Coprocessor not available
- Exception Type 8 #DF: Double Fault
- Exception Type 10 #TS: Invalid TSS
- Exception Type 11 #NP: Segment Not Present
- Exception Type 12 #SS: Stack Segment Fault
- Exception Type 13 #GP: General Protection Fault
- Exception Type 14 #PF: Page Fault
- Exception Type 16 #MF: Coprocessor error
- Exception Type 17 #AC: Alignment Check
- Exception Type 18 #MC: Machine Check Exception
- Exception Type 19 #XF: SIMD Floating-Point Exception
- Exception Type 20-31: Reserved
- Exception Type 32-255: User-defined (clock scheduler)
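The list above can be turned into a small lookup helper, which is handy when a PSOD screenshot only shows "Exception Type N". The following is a minimal sketch in POSIX shell that simply mirrors the list; the function name psod_exception_name is hypothetical and is not part of any VMware tool.

```shell
# Hypothetical helper: map the number from "Exception Type N" to the
# name given in the list above. Returns non-zero for anything not listed.
psod_exception_name() {
    case "$1" in
        ''|*[!0-9]*) echo "not a number: $1"; return 1 ;;
    esac
    case "$1" in
        0)  echo "#DE: Divide Error" ;;
        1)  echo "#DB: Debug Exception" ;;
        2)  echo "NMI: Non-Maskable Interrupt" ;;
        3)  echo "#BP: Breakpoint Exception" ;;
        4)  echo "#OF: Overflow (INTO instruction)" ;;
        5)  echo "#BR: Bounds check (BOUND instruction)" ;;
        6)  echo "#UD: Invalid Opcode" ;;
        7)  echo "#NM: Coprocessor not available" ;;
        8)  echo "#DF: Double Fault" ;;
        10) echo "#TS: Invalid TSS" ;;
        11) echo "#NP: Segment Not Present" ;;
        12) echo "#SS: Stack Segment Fault" ;;
        13) echo "#GP: General Protection Fault" ;;
        14) echo "#PF: Page Fault" ;;
        16) echo "#MF: Coprocessor error" ;;
        17) echo "#AC: Alignment Check" ;;
        18) echo "#MC: Machine Check Exception" ;;
        19) echo "#XF: SIMD Floating-Point Exception" ;;
        2[0-9]|3[01]) echo "Reserved" ;;
        *)
            # 32-255 are user-defined; everything else is not in the list.
            if [ "$1" -ge 32 ] && [ "$1" -le 255 ]; then
                echo "User-defined"
            else
                echo "not in the list above: $1"
                return 1
            fi
            ;;
    esac
}
```

For example, `psod_exception_name 2` prints the NMI entry, which is the exception type this article deals with.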
In this article, we look at one particular ESXi PSOD: hosts failing with an intermittent NMI PSOD on HP ProLiant Gen8 servers. ESXi hosts running 5.5 p10, 6.0 p04, 6.0 U3, or 6.5 GA on HPE ProLiant Gen8 servers may fail with a purple diagnostic screen caused by non-maskable interrupts (NMI).
ESXi PSOD – Non-Maskable Interrupts (NMI) on HPE ProLiant Gen8 Servers
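As per VMware KB article 2149043, the root cause has not yet been determined and is still under investigation by VMware and HPE. You can also take a look at HPE advisory c05392947 for the latest update. I would always recommend opening a case with GSS to get your ESXi host analyzed before applying any fix for a PSOD issue. The commands below relate to the iovDisableIR kernel setting: the first displays its current value, the second sets it to FALSE, and the third displays it again to confirm the change.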
esxcli system settings kernel list -o iovDisableIR
esxcli system settings kernel set --setting=iovDisableIR -v FALSE
esxcli system settings kernel list -o iovDisableIR
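To check the result of the list command from a script, its output can be parsed. The following is a minimal sketch in POSIX shell, not an official tool. The function name kernel_setting_state is hypothetical, and the sketch assumes the output is a column-aligned table whose header row contains "Configured" and "Runtime" columns. Because the column order may differ between ESXi releases, it locates the values by header position rather than by field number.

```shell
# Hypothetical helper: read the table printed by
#   esxcli system settings kernel list -o <name>
# on stdin and print the Configured and Runtime values for <name>.
# Assumes a column-aligned table with "Configured" and "Runtime" headers.
kernel_setting_state() {
    awk -v name="$1" '
        /Configured/ && /Runtime/ {
            # Header row: remember where each column starts.
            cfg = index($0, "Configured")
            run = index($0, "Runtime")
            next
        }
        $1 == name && cfg && run {
            # Take the first word found at each remembered column offset.
            split(substr($0, cfg), c, " ")
            split(substr($0, run), r, " ")
            print "configured=" c[1] " runtime=" r[1]
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Intended usage on a host (not executed here):
#   esxcli system settings kernel list -o iovDisableIR | kernel_setting_state iovDisableIR
```

If the configured and runtime values differ, the change has most likely not taken effect yet, which for kernel settings typically means a reboot is still pending.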