A couple of days ago I had a moment of panic when my virtual machines suddenly began to shut down. You can imagine how bad it is when a fully redundant system simply fails. As I began looking into the issue, I started receiving errors saying that some of the LUNs on my brand new EVA had reached capacity.
Baffled, I looked at my usage in VMware and found that I still had approximately 180 GB available. When I opened the datastore, I noticed that a massive number of snapshots had piled up in my VMs' directories.
We use Backup Exec from Symantec to back up our entire production environment. One of the Backup Exec plugins lets you back up your VMDKs. When BE backs up these files, it makes a call to VMware and creates a snapshot of the VM at that moment. If the backup job fails for whatever reason, the snapshot is not deleted and it becomes the current working version of the VM. After several failed jobs, these snapshots piled up for the VMs and the LUN ran out of room.
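One way to catch leftovers like these early is to look for snapshot-style disk files in the VM directories. The following is only a minimal sketch, not what Backup Exec or VMware does internally. It assumes the datastore can be browsed as an ordinary directory tree and that snapshot disks follow the usual `<disk>-000001.vmdk` / `<disk>-000001-delta.vmdk` naming. The function name `find_snapshot_files` is made up for illustration.

```python
import os
import re
from collections import defaultdict

# Assumption: snapshot disks are named <disk>-NNNNNN.vmdk (descriptor)
# and <disk>-NNNNNN-delta.vmdk (data), with a six-digit sequence number.
SNAPSHOT_RE = re.compile(r"^.+-\d{6}(?:-delta)?\.vmdk$")


def find_snapshot_files(root):
    """Walk `root` and return {directory: [(filename, size_bytes), ...]}
    for every file whose name looks like a snapshot disk."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if SNAPSHOT_RE.match(name):
                size = os.path.getsize(os.path.join(dirpath, name))
                found[dirpath].append((name, size))
    return dict(found)


def print_report(root):
    """Print one line per VM directory that still holds snapshot disks."""
    for directory, files in sorted(find_snapshot_files(root).items()):
        total = sum(size for _name, size in files)
        print(f"{directory}: {len(files)} snapshot file(s), {total} bytes")
```

Running `print_report` against the datastore path after each backup window would list any VM directory where snapshot files are still sitting around. A directory that shows up night after night is a good hint that a failed job never cleaned up after itself.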
As you'll remember, when I looked at the LUN it showed 180 GB available. Unknown to me, on top of the LUN having run out of space, VMware was no longer reporting my usage correctly. As a result, the alarms that should have told me my LUN utilization was getting high never triggered.
After calling VMware to help me get my environment back online and clear out all my snapshots, I found the issue with the alarms not triggering. The recommendation was to edit the datastore alarm and change it in some way, so that when you click OK, vCenter Server resets the trigger and starts polling the actual datastore size from that moment.
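Since the built-in alarm relied on a stale number, it can help to compare what vCenter reports against a second measurement. Here is a rough sketch of that kind of sanity check. It assumes you can get the free space from an independent source, such as the storage array or the host itself. The function names, the 75% threshold, the 10 GiB drift limit, and the 500 GiB capacity in the usage note are all made-up values for illustration; only the 180 GB figure comes from my case.

```python
GIB = 1024 ** 3


def usage_percent(capacity_bytes, free_bytes):
    """Percentage of the datastore that is in use."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * (capacity_bytes - free_bytes) / capacity_bytes


def check_datastore(capacity_bytes, reported_free_bytes, measured_free_bytes,
                    warn_pct=75.0, max_drift_bytes=10 * GIB):
    """Return a list of warnings; an empty list means nothing looks wrong.

    reported_free_bytes: what the management console shows.
    measured_free_bytes: the same figure from an independent source.
    """
    problems = []
    # Judge utilization on the independent measurement, not the console value.
    used = usage_percent(capacity_bytes, measured_free_bytes)
    if used >= warn_pct:
        problems.append(f"usage is {used:.0f}% (threshold {warn_pct:.0f}%)")
    # A large gap between the two figures means one of them is stale.
    drift = abs(reported_free_bytes - measured_free_bytes)
    if drift > max_drift_bytes:
        problems.append(
            f"reported and measured free space differ by {drift / GIB:.0f} GiB"
        )
    return problems
```

For a hypothetical 500 GiB LUN, `check_datastore(500 * GIB, 180 * GIB, 0)` returns two warnings: one for the usage and one for the 180 GiB gap between the two readings. That gap is exactly the kind of anomaly that would have tipped me off.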
After all is said and done, I have learned a few lessons that I am now passing on to all of you. Always check your backups and make sure that no snapshots are left behind that were not cleaned up. I have also learned to double-check my usage stats and to look more closely at the vSphere Client for anomalies.