ESXi PSOD – Host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers

Mohammed Raffic

8 years ago

An ESXi PSOD is always a scary thing for VMware administrators. It is the ESXi equivalent of the Blue Screen of Death in Windows. A Purple Screen of Death (PSOD) is a diagnostic screen with a purple background that is displayed when the VMkernel of an ESX/ESXi host experiences a critical error and becomes inoperative. It brings down all the virtual machines running on that host. VMware HA then has to restart the failed VMs on other ESXi hosts in the cluster to bring them back online. Either way, it causes downtime for your production virtual machines, and you also need to reboot the ESXi host to recover from the PSOD.

The PSOD shows the memory state at the time of the host crash, along with other information such as the ESXi version and build and the exception type. It also shows what was running on each CPU at the time of the crash, a backtrace, error messages, and information about the core dump. The core dump (or memory dump) is a file that contains further diagnostic information from a PSOD and can be given to VMware support to determine the root cause of the failure.

A purple diagnostic screen can also come in the form of an exception. An exception is a hardware (CPU) mechanism for handling a condition that changes the normal flow of execution (division by zero, page fault, etc.), and the exception handler is the code that runs in response. Handlers leave no trace of their own, so you need logging (or single-step debugging) to determine whether a handler faulted. Below is a list of some of the exceptions.

Exception Type 0 #DE: Divide Error
Exception Type 1 #DB: Debug Exception
Exception Type 2 NMI: Non-Maskable Interrupt
Exception Type 3 #BP: Breakpoint Exception
Exception Type 4 #OF: Overflow (INTO instruction)
Exception Type 5 #BR: Bounds check (BOUND instruction)
Exception Type 6 #UD: Invalid Opcode
Exception Type 7 #NM: Coprocessor not available
Exception Type 8 #DF: Double Fault
Exception Type 10 #TS: Invalid TSS
Exception Type 11 #NP: Segment Not Present
Exception Type 12 #SS: Stack Segment Fault
Exception Type 13 #GP: General Protection Fault
Exception Type 14 #PF: Page Fault
Exception Type 16 #MF: Coprocessor error
Exception Type 17 #AC: Alignment Check
Exception Type 18 #MC: Machine Check Exception
Exception Type 19 #XF: SIMD Floating-Point Exception
Exception Type 20-31: Reserved
Exception Type 32-255: User-defined (clock scheduler)
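
If you want to translate the exception number from a PSOD into its name without scrolling back to this list, the table above can be sketched as a small shell lookup. This is only an illustration: the function name exception_name is made up for this example, and it knows only the entries listed in this article.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): map an exception number to the
# name given in the list above. Numbers the article does not list
# (for example 9 and 15) are reported as "Not listed".
exception_name() {
    case "$1" in
        ''|*[!0-9]*) echo "invalid exception number: $1" >&2; return 1 ;;
        0)  echo "#DE: Divide Error" ;;
        1)  echo "#DB: Debug Exception" ;;
        2)  echo "NMI: Non-Maskable Interrupt" ;;
        3)  echo "#BP: Breakpoint Exception" ;;
        4)  echo "#OF: Overflow (INTO instruction)" ;;
        5)  echo "#BR: Bounds check (BOUND instruction)" ;;
        6)  echo "#UD: Invalid Opcode" ;;
        7)  echo "#NM: Coprocessor not available" ;;
        8)  echo "#DF: Double Fault" ;;
        10) echo "#TS: Invalid TSS" ;;
        11) echo "#NP: Segment Not Present" ;;
        12) echo "#SS: Stack Segment Fault" ;;
        13) echo "#GP: General Protection Fault" ;;
        14) echo "#PF: Page Fault" ;;
        16) echo "#MF: Coprocessor error" ;;
        17) echo "#AC: Alignment Check" ;;
        18) echo "#MC: Machine Check Exception" ;;
        19) echo "#XF: SIMD Floating-Point Exception" ;;
        *)
            # Ranges from the last two entries of the list.
            if [ "$1" -ge 20 ] && [ "$1" -le 31 ]; then
                echo "Reserved"
            elif [ "$1" -ge 32 ] && [ "$1" -le 255 ]; then
                echo "User-defined"
            else
                echo "Not listed"
            fi
            ;;
    esac
}
```

For the issue discussed in this article, the interesting entry is Exception Type 2, which exception_name 2 reports as the NMI.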

In this article, we are going to talk specifically about intermittent NMI PSODs on HP ProLiant Gen8 servers. ESXi hosts running 5.5 p10, 6.0 p04, 6.0 U3, or 6.5 GA on HPE ProLiant Gen8 servers may fail with a purple diagnostic screen caused by non-maskable interrupts (NMI).

ESXi PSOD – non-maskable-interrupts (NMI) on HPE ProLiant Gen8 Servers.

As per VMware KB article 2149043, the root cause has not yet been determined and is still under investigation by VMware and HPE. You can also take a look at HPE advisory c05392947 for the latest update. I would always recommend opening a case with GSS to get your ESXi host analyzed before applying any fix for a PSOD issue.

The issue was triggered by a change in ESXi 5.5 p10, 6.0 p04, 6.0 U3, and 6.5 GA, in which ESXi disables the interrupt remapper functionality of the Intel IOMMU (aka VT-d). On HPE ProLiant Gen8 servers, this change causes PCI errors, which result in the platform generating an NMI and the ESXi host failing with a purple diagnostic screen. VMware KB article 2149043 provides a workaround: re-enable the Intel IOMMU interrupt remapper on the ESXi host. Let's go through the procedure step by step.

1. Connect to your ESXi host using SSH

2. Check the current iovDisableIR setting on the ESXi host using the command below

esxcli system settings kernel list -o iovDisableIR

Here iovDisableIR is currently set to TRUE. We need to set it to FALSE.

3. Run the below command to re-enable the Intel IOMMU interrupt remapper on the ESXi host

esxcli system settings kernel set --setting=iovDisableIR -v FALSE

4. Reboot the ESXi host

5. Verify that the iovDisableIR setting is now FALSE by running this command:

esxcli system settings kernel list -o iovDisableIR
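
If you have several hosts to check, the validation in steps 2 and 5 can be sketched as a small shell helper that reads the output of the list command and tells you whether the workaround is still needed. Treat this as a rough sketch, not a supported tool: the function names are made up for this example, and it assumes that the data row starts with the setting name and that its last three columns are Configured, Runtime and Default. Check the column order on your own build before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ESXi). They only parse text; nothing
# here changes any setting on the host.

# Read the output of
#   esxcli system settings kernel list -o iovDisableIR
# on stdin and print the Configured value.
# ASSUMPTION: the last three whitespace-separated fields of the data row
# are Configured, Runtime and Default, in that order.
configured_value() {
    awk '$1 == "iovDisableIR" { print $(NF-2) }'
}

# Succeed (exit status 0) when iovDisableIR is configured as TRUE,
# i.e. the interrupt remapper is disabled and the workaround applies.
needs_workaround() {
    [ "$(configured_value)" = "TRUE" ]
}

# Example use on a host (shown as a comment, not run by this sketch):
#   if esxcli system settings kernel list -o iovDisableIR | needs_workaround; then
#       echo "iovDisableIR is TRUE: apply step 3 and reboot"
#   fi
```

The helpers deliberately stop at reporting. Changing the setting and rebooting the host are left as manual steps so that nothing is modified on a production host by accident.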

That's it. We are done with the workaround for intermittent NMI PSODs on HP ProLiant Gen8 servers. I hope this is informative for you. If you feel it is worth sharing, please share it on social media.