Cleaning out phantom gearman workers
Posted on: 22 July 2014
We've been using the excellent Gearman in our production systems to handle distributed job processing, and it's been working very well for us. However, just occasionally it will throw a weird one our way.
Today we were setting up some new worker processes on Digital Ocean virtual servers - we've found that their tiny instances give the best price/performance ratio for CPU-heavy processes which can be easily parallelised. We got the first one up and running nicely, and everything seemed good. The trouble started when we restarted that worker server, and the workers wouldn't start up again.
Every worker for a function needs to register a unique client ID with the gearman server. In our case we've chosen to use the hostname and a sequential index. When the worker server first started up, it registered workers as "my-hostname-0" and "my-hostname-1" with the gearman server. When we restarted that worker server, for some reason, those registrations were not cleared out of the gearman server. It then tried to register "my-hostname-0" and "my-hostname-1" again, and the gearman server refused the registrations as duplicates.
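As a rough illustration of that naming scheme (a sketch only, not our actual worker code; the function name `worker_ids` is made up for this example), the IDs can be generated like this:

```python
import socket
from typing import List, Optional


def worker_ids(count: int, hostname: Optional[str] = None) -> List[str]:
    """Return one client ID per worker: "<hostname>-<index>", index from 0."""
    # Fall back to this machine's hostname when none is given.
    host = hostname if hostname is not None else socket.gethostname()
    return [f"{host}-{index}" for index in range(count)]


print(worker_ids(2, hostname="my-hostname"))
# prints ['my-hostname-0', 'my-hostname-1']
```

Because the IDs are deterministic, a rebooted box asks for exactly the same IDs it held before the restart, which is why a stale registration on the server is enough to block it.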
Unfortunately, the gearman server only accepts a very limited range of commands while it is running. There is an option to kill all workers running a specific function, but this particular function is already being handled by several other production servers - we just want to add some additional capacity.
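One thing the admin interface can do is list the registered workers: sending the text command `workers` to the server port returns one line per connection in the form `FD IP-ADDRESS CLIENT-ID : FUNCTION ...`, ending with a line containing a single dot. The sketch below shows how that listing could be used to spot which client IDs are still held. The helper names, the sample addresses and the `resize_image` function name are all placeholders invented for this example.

```python
import socket
from typing import List, NamedTuple


class Worker(NamedTuple):
    fd: str
    ip: str
    client_id: str
    functions: List[str]


def fetch_workers(host: str, port: int = 4730, timeout: float = 5.0) -> str:
    """Send the admin command "workers" and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"workers\n")
        data = b""
        # The reply ends with a line holding only ".".
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8", errors="replace")


def parse_workers(response: str) -> List[Worker]:
    """Parse "FD IP CLIENT-ID : FUNC ..." lines into Worker tuples."""
    workers = []
    for line in response.splitlines():
        fields = line.split()
        if fields == ["."]:
            break  # end-of-list marker
        if ":" not in fields:
            continue  # not a worker line
        sep = fields.index(":")
        head, functions = fields[:sep], fields[sep + 1:]
        if len(head) < 3:
            continue
        workers.append(Worker(head[0], head[1], head[2], functions))
    return workers


# Placeholder reply, in the shape the server sends.
SAMPLE = (
    "12 192.0.2.10 my-hostname-0 : resize_image\n"
    "13 192.0.2.10 my-hostname-1 : resize_image\n"
    "14 203.0.113.7 other-host-0 : resize_image\n"
    ".\n"
)

stale = [w.client_id for w in parse_workers(SAMPLE) if w.ip == "192.0.2.10"]
print(stale)
```

Against a live server, `parse_workers(fetch_workers("gearman-host"))` would give the same structure, with `gearman-host` standing in for the real server name. This only tells you the registrations are there; it does not remove them.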
Time to go down a level.
We know that the gearman worker uses TCP to communicate with the gearman server. We also know that we've rebooted the worker server, so there won't be any trace of the connection left there. However, there should be a trace left on the gearman server, and we should be able to see it with our old friend netstat:
# netstat -tonp | grep xxx.xxx.xxx.xxx
tcp 0 0 xxx.xxx.xxx.xxx:4730 xxx.xxx.xxx.xxx:45915 ESTABLISHED 35622/gearmand keepalive (6982.94/0/0)
tcp 0 0 xxx.xxx.xxx.xxx:4730 xxx.xxx.xxx.xxx:45916 ESTABLISHED 35622/gearmand keepalive (6982.94/0/0)
tcp 0 0 xxx.xxx.xxx.xxx:4730 xxx.xxx.xxx.xxx:45917 ESTABLISHED 35622/gearmand keepalive (6982.94/0/0)
and there we are - two worker connections, plus a management connection from the worker box. Now we just need some way to kill off those connections. tcpkill sounds like a good candidate, but unfortunately it will only kill a connection when traffic next passes through it. As our connection is completely dead, this isn't going to work for us.
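If there are more than a handful of connections to sift through, the same filtering can be scripted. This is a minimal sketch that assumes the `netstat -tonp` column layout shown above (protocol, two queue counters, local address, foreign address, state, PID/program, timer). The IP addresses are documentation placeholders, and `connections_from` is a name made up for this example.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    local: str
    remote: str
    state: str
    program: str


def connections_from(netstat_output: str, peer_ip: str) -> List[Conn]:
    """Return the TCP connections whose foreign address is on peer_ip."""
    conns = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a full tcp row.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        local, remote, state, program = fields[3:7]
        # Split the port off the right so the IP is compared exactly.
        remote_ip = remote.rsplit(":", 1)[0]
        if remote_ip == peer_ip:
            conns.append(Conn(local, remote, state, program))
    return conns


# Placeholder output in the same shape as the listing above.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 0 198.51.100.5:4730 192.0.2.10:45915 ESTABLISHED 35622/gearmand keepalive (6982.94/0/0)
tcp 0 0 198.51.100.5:4730 192.0.2.10:45916 ESTABLISHED 35622/gearmand keepalive (6982.94/0/0)
tcp 0 0 198.51.100.5:4730 203.0.113.7:50000 ESTABLISHED 35622/gearmand keepalive (7100.00/0/0)
"""

for conn in connections_from(SAMPLE, "192.0.2.10"):
    print(conn.remote, conn.state)
```

Comparing the IP exactly, rather than grepping for a substring, avoids accidentally matching a different host whose address merely contains the same digits.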
A bit of digging around (with a helpful pointer) reveals killcx - a handy old-school Perl utility which can kill connections even when they are not doing anything. It sends a fake SYN packet to provoke a reply that reveals the connection's current sequence numbers, then uses those to forge a reset. And it does exactly what it says on the tin. The connections are killed, the Gearman server de-registers the workers behind them, the restarted workers are able to register with those unique client IDs, and everybody breathes a sigh of relief.
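With several dead connections to clear, it helps to generate the commands rather than type them. The sketch below assumes killcx is invoked as `killcx remote_ip:port [interface]`, which is how I understand its usage text; check that against your copy before relying on it. The endpoints and the `eth0` interface are placeholders, and `killcx_commands` is a name invented for this example. It only builds and prints the commands.

```python
from typing import List, Optional


def killcx_commands(
    remote_endpoints: List[str], interface: Optional[str] = None
) -> List[List[str]]:
    """Build one killcx argument list per remote "ip:port" endpoint."""
    commands = []
    for endpoint in remote_endpoints:
        # Assumed invocation: killcx <remote_ip:port> [interface]
        argv = ["killcx", endpoint]
        if interface is not None:
            argv.append(interface)
        commands.append(argv)
    return commands


for argv in killcx_commands(
    ["192.0.2.10:45915", "192.0.2.10:45916"], interface="eth0"
):
    print(" ".join(argv))
```

Each argument list could be handed to `subprocess.run` on the gearman server, where killcx needs to run as root because it crafts raw packets.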
Now I need to work out why those connections aren't cleared when the worker server is rebooted, but that's a problem for tomorrow.