If the storage is slow or there is another underlying storage issue, the VMkernel logs may also show many SCSI reservation conflict errors:
Jan 19 21:08:33 esx-server-xxx vmkernel: 401:11:11:15.287 cpu0:1043)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Jan 19 21:08:33 esx-server-xxx vmkernel: 401:11:11:15.287 cpu0:1043)WARNING: FS3: 4784: Reservation error: SCSI reservation conflict
Jan 19 21:08:34 esx-server-xxx vmkernel: 401:11:11:16.492 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 64
Jan 19 21:08:35 esx-server-xxx vmkernel: 401:11:11:17.468 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 48
Jan 19 21:08:36 esx-server-xxx vmkernel: 401:11:11:18.423 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 32
Jan 19 21:08:37 esx-server-xxx vmkernel: 401:11:11:19.366 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 16
Jan 19 21:08:38 esx-server-xxx vmkernel: 401:11:11:20.419 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 0
Jan 19 21:08:38 esx-server-xxx vmkernel: 401:11:11:20.419 cpu0:1043)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Jan 19 21:08:38 esx-server-xxx vmkernel: 401:11:11:20.419 cpu0:1043)WARNING: FS3: 4784: Reservation error: SCSI reservation conflict
ESX uses SCSI reservations as a locking mechanism to share LUNs between ESX hosts. A reservation is released when the activity that required it completes. The VMkernel regularly checks for aged reservations and tries to release them; if another ESX host is actively using the LUN, it can try to reclaim the LUN or place another reservation.
These SCSI reservations are needed to prevent data corruption in environments where storage LUNs are shared between multiple ESX hosts: whenever an ESX host updates VMFS metadata, it first places a SCSI reservation on the LUN. When multiple hosts try to reserve the same LUN at the same time, a reservation conflict occurs, and if the number of conflicts grows too large, ESX fails the I/O. SCSI reservation errors can also be a sign of SAN latency problems.
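To gauge how frequent these conflicts are, the VMkernel log can be searched directly. A minimal sketch, using a stand-in sample file in place of the real log (on classic ESX the messages land in /var/log/vmkernel):

```shell
# Create a small stand-in for a vmkernel log (real entries would come
# from /var/log/vmkernel on the ESX host).
cat > /tmp/vmkernel.sample <<'EOF'
Jan 19 21:08:33 esx-server-xxx vmkernel: WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Jan 19 21:08:33 esx-server-xxx vmkernel: WARNING: FS3: 4784: Reservation error: SCSI reservation conflict
Jan 19 21:08:34 esx-server-xxx vmkernel: SCSI: vm 1043: 109: Sync CR at 64
EOF

# Count the lines that mention a reservation conflict (case-insensitive).
grep -ci "reservation conflict" /tmp/vmkernel.sample
```

A sudden jump in this count, correlated across several hosts sharing the same LUN, points at contention on that LUN rather than at a single misbehaving VM.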
Resolution
The steps below may or may not resolve the issue, but they need to be performed in order to understand the root cause better. If you are continuously seeing this behavior in Linux virtual machines, try the following:
1. Verify that VMware Tools is up to date.
2. Migrate the affected virtual machine to another datastore and monitor it. If the issue does not reappear, the problem is likely with the original storage; engage the storage vendor.
3. If the issue reoccurs even after the storage migration, update the Linux kernel to the latest version.
4. Increase the SCSI timeout of each disk presented from VMware, as per the Red Hat Linux article.
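The SCSI timeout increase in the last step can be sketched as below. This is a hedged example for a Linux guest: the 180-second value, the rule file name, and the disk name sda are illustrative, and the exact rule recommended by the Red Hat article may differ. Writing to /etc/udev/rules.d and /sys requires root inside the VM, so the rule is staged in a temp directory here to keep the sketch self-contained.

```shell
# Stage a udev rule that raises the SCSI command timeout on VMware
# virtual disks so the guest tolerates brief storage stalls.
RULES_DIR=$(mktemp -d)   # stand-in for /etc/udev/rules.d
cat > "$RULES_DIR/99-scsi-timeout.rules" <<'EOF'
# Apply a 180s SCSI command timeout to VMware virtual disks
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware*", RUN+="/bin/sh -c 'echo 180 > /sys$DEVPATH/timeout'"
EOF
cat "$RULES_DIR/99-scsi-timeout.rules"

# One-off, non-persistent equivalent (run as root in the guest;
# sda is an illustrative disk name):
#   echo 180 > /sys/block/sda/device/timeout
```

The sysfs write takes effect immediately but is lost on reboot; the udev rule reapplies the value every time a matching disk appears.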
Perform the above steps on a few of the affected Linux virtual machines and monitor them continuously for a reoccurrence of the issue. If the issue reoccurs, follow these steps:
1. Reboot the ESX server.
2. Perform a LUN reset using the command below:
vmkfstools -L lunreset /vmfs/devices/disks/device_ID
3. Reboot the storage processor.
4. Delete the affected datastore from ESX and destroy the LUN on the storage side. Recreate the LUN and present it to the ESX servers, then create a new datastore, place the virtual machines on it, and monitor them.
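The LUN reset in step 2 is usually followed by a storage rescan so the host re-reads the LUN state. A sketch, where the NAA device ID and the adapter name vmhba1 are hypothetical placeholders to be replaced with the actual values from the host:

```shell
# Release a stuck SCSI reservation on the LUN backing the datastore.
# naa.6006016012345678 is a placeholder device ID, not a real one.
vmkfstools -L lunreset /vmfs/devices/disks/naa.6006016012345678

# Rescan the HBA so the host picks up the cleared reservation state
# (adapter name varies per host).
esxcfg-rescan vmhba1
```

A LUN reset clears reservations held by any host, so it should be coordinated with the other ESX servers sharing the LUN.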
The above are initial troubleshooting steps and may or may not fix your issue. The following articles discuss the same behavior:
http://communities.vmware.com/thread/58081
https://access.redhat.com/site/solutions/21374
Thanks for reading!