We have owned a pair of Kemp 2500 Network Load Balancers for some time now. One thing I noticed after an update was that I was getting alerts from the load balancer telling me that my primary balancer was unresponsive. Since this is a production balancer, you can imagine no one wants to get this kind of message during peak times. The first time I received this message I was very anxious, not knowing what to do. However, there is plenty of information on the internet on how to resolve this issue. Since the Kemp load balancers are built on Linux, the suggestions I found helped a lot. I called support, and they helped me increase the values of gc_thresh1, gc_thresh2 and gc_thresh3, the limits that control how large the ARP table may grow. This was pretty simple and straightforward, but the problem was far from over.
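
For anyone who wants to see how close a Linux box is to those limits, here is a minimal sketch. It assumes you have shell access to a Linux host and that the thresholds live in the usual place, /proc/sys/net/ipv4/neigh/default; I can't promise a Kemp appliance exposes this directly, so treat it as something to run on an ordinary Linux machine. The function names are my own, and the entry count from /proc/net/arp is only an approximation of what the kernel is holding.

```python
from pathlib import Path

# Standard Linux locations (assumption: a regular Linux host with /proc mounted).
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def read_thresholds(neigh_dir=NEIGH_DIR):
    """Return gc_thresh1, gc_thresh2 and gc_thresh3 as a dict of ints."""
    return {
        name: int((Path(neigh_dir) / name).read_text().strip())
        for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    }


def count_arp_entries(arp_text):
    """Count entries in /proc/net/arp style text (the first line is a header)."""
    lines = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def table_usage(entries, thresholds):
    """Fraction of the hard limit (gc_thresh3) currently in use."""
    return entries / thresholds["gc_thresh3"]
```

On a Linux host, `table_usage(count_arp_entries(ARP_TABLE.read_text()), read_thresholds())` gives a rough idea of how full the table is. A value creeping toward 1.0 means you are about to hit the same wall I did.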

I won’t add to the how-tos that are already widely available. Instead, I am writing this to describe a scenario where the same issue came back after these values had been increased.

I could not believe it. I thought for sure that increasing the values had fixed this. According to Kemp, they had tested these balancers on a class A network. So how was it that my class B network was throwing everything off and freezing the balancer again because of an overflowing ARP table? After running a tcpdump capture of only ARP requests on the balancer for 14 hours, we noticed that each ARP request was being tripled, because 3 of the 4 NICs on the balancer had addresses assigned to them and all of them connect back to a single switch.
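
If you want to check a capture for the same tripling, here is a rough sketch. It assumes the capture was saved as text with lines shaped like `12:00:01.000000 ARP, Request who-has 10.0.0.5 tell 10.0.0.9, length 46`, which is what I would expect from `tcpdump -n arp`; adjust the pattern if your output differs. The function names are made up for illustration. It groups requests by second, target and sender, so a sender that retries within the same second, or copies that land on either side of a second boundary, will skew the numbers a little.

```python
import re
from collections import Counter

# Assumed tcpdump text format; the optional "(mac)" covers lines that
# include a target hardware address after the target IP.
REQUEST = re.compile(r"ARP, Request who-has (\S+)(?: \([^)]*\))? tell ([^\s,]+)")


def copies_per_request(lines):
    """Map (second, target, sender) to the number of captured copies."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            second = line[:8]  # "HH:MM:SS" prefix of the default timestamp
            counts[(second, match.group(1), match.group(2))] += 1
    return counts


def typical_multiplier(counts):
    """Most common number of copies per request, or 0 for an empty capture."""
    if not counts:
        return 0
    return Counter(counts.values()).most_common(1)[0][0]
```

A typical multiplier of 1 is what you would hope for. In my case it would have come back as 3, one copy for each NIC plugged into the same flat network.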

Because the network design is flat with no VLANs, every ARP request comes in on every NIC. If the NICs were on separate VLANs, the issue would not have happened, but it is very hard to go back and change a network design after it has been in place for several years. So how could this happen? It looked like a broadcast storm or an ARP flood, but we found out that it was actually a utility being run to find every MAC address on the network and its associated IP address. This program, CC Get MAC Address, floods the network with ARP requests. While every server and PC seems to handle this flood fine, the balancers struggle. I would have thought the balancer would discard packets when the requests do not pertain to it, but in fact it caches them at an alarming rate, causing the table to overflow.
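
A scanning utility like this tends to stand out in a capture because a single sender asks about a large number of different addresses. Here is a small sketch for spotting that, again assuming saved tcpdump text with lines like `ARP, Request who-has 10.0.0.5 tell 10.0.0.9, length 46`. The function name and the default cutoff of 50 distinct targets are arbitrary choices of mine, not anything from Kemp or the utility's vendor, so tune the cutoff for your own network.

```python
import re
from collections import defaultdict

# Assumed tcpdump text format, as produced by something like `tcpdump -n arp`.
REQUEST = re.compile(r"ARP, Request who-has (\S+)(?: \([^)]*\))? tell ([^\s,]+)")


def likely_scanners(lines, min_targets=50):
    """Return (sender, distinct_target_count) pairs, busiest sender first.

    min_targets is a hypothetical cutoff: only senders that asked about
    at least that many different addresses are reported.
    """
    targets = defaultdict(set)
    for line in lines:
        match = REQUEST.search(line)
        if match:
            targets[match.group(2)].add(match.group(1))
    hits = [(sender, len(seen)) for sender, seen in targets.items()
            if len(seen) >= min_targets]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Ordinary hosts usually only ask about their gateway and a handful of peers, so anything that shows up here is worth tracking down to a desk and asking what is running on it.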

So in short, if this is still happening even after your threshold values have been increased, make sure no utilities are being run that flood the network with ARP requests. It will save you some serious time and headaches.