Sure enough, during the maintenance alarms started going off. Web pages, static or dynamic, were taking in excess of 5 seconds to be served. We checked /etc/nsswitch.conf to make sure forward lookups were being resolved by /etc/hosts instead of DNS:
hosts: files dns
We then inspected the access log and we could see that some entries had host names instead of IP addresses. This was odd since reverse DNS lookups were turned off in Apache:
HostnameLookups Off
A colleague suggested it could be our LogFormat configuration:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
The documentation is pretty clear that %h will only do a reverse lookup if HostnameLookups is turned on:
If HostnameLookups is set to On
, then the server will try to determine the hostname and log it in place of the IP address.
Nevertheless, we replaced %h with %a. That didn't solve the issue - requests still took little more than 5 seconds to be fulfilled. The millisecond part of the response time looked right to me, so I went on to read the resolv.conf man page and found this bit very interesting:timeout:nNow I was pretty much convinced that Apache was playing tricks on us. I double-checked all our Allow statements to make sure we weren't using hostnames, since the manual says it will do a reverse lookup in this case:
sets the amount of time the resolver will wait for a response from a remote name server before retrying the query via a different name server. Measured in seconds, the default is RES_TIMEOUT (currently 5, see <resolv.h>).
This configuration will cause the server to perform a double reverse DNS lookup on the client IP address, regardless of the setting of the HostnameLookups directive.We only had Allow statements for the usual restricted URLs (server-status etc) and even then, were only using IP addresses.
I decided to disable the lookups in source code, since this was starting to look like a bug. In src/main/http_core.c, line 728, I commented the reverse lookup call and replaced it with a fake response:
// hptr = gethostbyaddr((char *)iaddr, sizeof(struct in_addr), AF_INET);
hptr = NULL;
I did the same on line 825:
// hptr = gethostbyaddr((char *)iaddr, sizeof(struct in_addr), AF_INET);
hptr = NULL;
The web server went back to serving static and dynamic objects in the normal time, tens of milliseconds. I deployed the newly compiled Apache to all servers and went back to my weekend.
This hack obviously breaks all functionality that depends on reverse IP address lookups so, if I have some time, I will analyse this condition and, if it is a bug, submit a patch to the chaps at the Apache Foundation.