17:57 Friday, August 08 2008

hunting the killer

When I pulled up my work email this morning, I knew immediately that something very, very bad had happened during the night. Instead of the normal ~800 or so new messages, I had 10,000+. A quick scan showed that our hang-check daemon (which detects when a test system has hung and auto-reboots it) had decided that all 113 test systems had hung around 8PM yesterday evening, and rebooted them. Every 20 minutes. Until I killed the hang-check daemon just after 6:30AM. The 10k emails were the least of my problems, as all nightly test coverage was lost (it's kinda hard to test much when the box is getting rebooted every 20 minutes and the typical build takes at least 20 minutes).
Twice in the past the hang-check daemon had gone off the reservation. The first time we had no clue wtf it was doing, and after killing & restarting it, it behaved. Then a couple of months ago it did it again, and we determined that the problem was that the server where it runs was exhausting its TCP sockets, so none of the client systems could check in every 60 seconds to announce that, no, I'm really not hung. We thought we'd fixed it by kicking the number of TCP sockets up from the default of 256 to 512. Until last night.
We spent the better part of the day trying to figure out what was going on. Mid-morning, on a whim, I decided to look at the output of "netstat -an | grep -c tcp", and it returned 539. That made it pretty damn clear that our previous 512 limit was grossly insufficient if a mostly idle server already had 539 sockets in use. The question remained: why? Clearly a mostly idle system shouldn't be using that many sockets.
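For anyone poking at a similar mess, a slightly more useful variant is to break that count down by connection state. This is just the general idea, not necessarily the exact incantation we typed:

    # total open TCP sockets (the quick check above)
    netstat -an | grep -c tcp

    # same data, grouped by connection state (ESTABLISHED, FIN_WAIT, etc.)
    netstat -an | grep tcp | awk '{print $6}' | sort | uniq -c | sort -rn

The second command is the one that makes the next discovery jump right out at you.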
Another interesting tidbit was that just about every one of the open sockets was in the FIN_WAIT state, so these weren't even active connections; they were connections that were just about to die & close. We still weren't sure why there were so many connections in that state, but figured that reducing /proc/sys/net/ipv4/tcp_fin_timeout from its 60-second default to 30 seconds might at least help get things cleared out faster, plus allow us to turn the hang-check daemon back on (cause, ya know, Windows boxes hang a lot, and we kinda need to catch them in real time so that we don't lose test coverage). So we kicked it down to 30 seconds, watched the number of open sockets drop under 200, restarted the hang-check daemon, and figured that would keep things relatively stable until we figured out the root cause of the mess.
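The tweak itself is a one-liner; roughly this, run as root (it doesn't survive a reboot, so it'd have to go into /etc/sysctl.conf to stick):

    # see the current value (the default is 60 seconds)
    cat /proc/sys/net/ipv4/tcp_fin_timeout

    # drop it to 30 seconds
    echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout

    # or, equivalently, via sysctl
    sysctl -w net.ipv4.tcp_fin_timeout=30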
Sadly, things went to hell all over again, with the number of open sockets climbing back over 500 and the hang-check daemon turning into a psycho killer again. We noticed that about 50% of the connections were from the hang-check daemon, but the other 50% were connections to the PostgreSQL database. That was rather disturbing, because we knew we weren't running anything that should have been maintaining a persistent connection.
It was at this point that Kurt thought he'd found a bug in the hang-check daemon where it was reusing a pre-existing connection rather than spawning a new one. His line of thinking was that this was what was keeping all those connections open forever, so he made a change, and that did help a lot. But the database connections were still hanging around, and we never got below the mid-200s.
At this point, I theorized that perhaps the relatively new automated testing daemon that I wrote might be the problem, since it polls the database every 30 seconds to look for new tests. I figured I'd back off the polling interval from 30 to 45 seconds, and at the same time drop /proc/sys/net/ipv4/tcp_fin_timeout further, to 20 seconds.
We went ahead and turned the hang-check daemon back on, and it was finally behaving. I also threw together a quick-n-dirty shell script to count the open TCP sockets and email us any time the count climbed above 490, so at least we'd know when doom was looming, and hopefully spot a pattern in the usage and figure out when & what was using up so many TCP sockets.
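The script is nothing fancy; a minimal sketch of it would look something like this (the details of the real one differ, and the address here is obviously made up):

    #!/bin/sh
    # watch the number of open TCP sockets and whine by email when it gets scary
    THRESHOLD=490
    RECIPIENT="ops@example.com"   # placeholder address

    while true; do
        COUNT=$(netstat -an | grep -c tcp)
        if [ "$COUNT" -gt "$THRESHOLD" ]; then
            echo "Open TCP sockets: $COUNT (threshold $THRESHOLD)" \
                | mail -s "TCP socket warning: $COUNT open" "$RECIPIENT"
        fi
        sleep 60    # check once a minute; a cron job would work just as well
    done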
Shortly before the nightly tests kicked off, we started getting the warning emails. First it was 495, then 600-something, and it peaked at just over 800 before dropping back down. We were completely stumped about what was going on. It was the damn database connections, as the hang-check daemon was now being a very good citizen and using around 5 connections at any given time.
It was at this point that I started thinking aloud that maybe the Windows (my favorite punching bag) psql client was just not closing its connections properly, and that's why the sockets were piling up. I figured running "netstat -a" (without the -n) would show me the actual hostnames of the offenders, and I was hoping it would be mostly Windows boxes, so at least I'd have a lead. What I saw was both surprising and disappointing. They were *ALL* Windows boxes, but they weren't the nightly test boxes at all. They were the per-changelist test systems instead, and only the Windows boxes. So yes, it's Windows that's at fault, but it's only a specific class of systems, which means it can't just be the psql client; it has to be something else.
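For the curious, the "who's holding connections to the database" question boils down to something like this, assuming the database is listening on PostgreSQL's default port, 5432 (drop the -n, like I did, if you'd rather see hostnames than IPs):

    # remote hosts with sockets to the PostgreSQL port, busiest first
    netstat -an | grep ':5432' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn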
So that's where the mystery left off. We're hoping the tuning measures we deployed today will keep things relatively stable until I can really figure out wtf is going on. I have to go into work tomorrow, so worst case, I'll continue hunting the TCP socket killer then.