Monday 22 June 2009

Disk IO spikes caused by mod_proxy.c in Apache 1.3

We have been using Apache 1.3 with mod_proxy as a reverse proxy for some of our applications. The configuration is pretty simple:

CacheRoot /var/adm/apache/cache
CacheSize 4096000
CacheGcInterval 4
CacheMaxExpire 48
CacheDefaultExpire 2

The above tells Apache to run the cache garbage collector every 4 hours, keep at most 4GB worth of cached data, revalidate objects after at most 48 hours (no matter what expiry the origin server set) and, for responses without an explicit expiry, expire them after 2 hours.

The typical content of the cache directory for our sites is 4.5GB in about 180,000 directories and 310,000 files.
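For reference, a quick sketch of how those numbers can be measured (the cache path comes from the configuration above; the exact commands are mine, not from the original measurement):

# total size of the cache tree
du -sh /var/adm/apache/cache

# number of directories and files under it
find /var/adm/apache/cache -type d | wc -l
find /var/adm/apache/cache -type f | wc -l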

Shortly after deploying this to production we started seeing the load pattern shown in the graph below on all web servers.

Because this was happening at regular intervals it was easy to inspect one of the servers during a load spike. The sar data had already told us the load was caused by disk IO:

$ sar -b
(...)
                  tps      rtps      wtps   bread/s   bwrtn/s
11:30:01 AM     94.53      1.89     92.63     27.68   1275.66
11:40:01 AM   1289.34     57.63   1231.71    555.35  20566.36
11:50:01 AM     79.16      1.23     77.93     21.88   1084.00
12:00:02 PM     94.21      1.77     92.44     39.52   1274.74
12:10:01 PM     84.18      1.64     82.54     35.54   1116.60
12:20:01 PM    117.43      2.62    114.81     44.68   1550.87
12:30:01 PM    150.65      3.30    147.35     82.21   1986.41
12:40:01 PM   1228.90      1.53   1227.37     33.04  20592.15
12:50:01 PM     70.46      2.65     67.80    100.57    917.25
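For reference, sar reads these block-IO counters from the daily data files that sysstat's cron job collects; a sketch assuming the stock sysstat file locations (not something specific to this setup):

# block IO activity for the current day
sar -b

# the same counters from an archived day, e.g. the file for the 22nd
sar -b -f /var/log/sa/sa22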

When the load spiked I checked which processes were stuck waiting on IO (high "wa" time in top) and listed their open files with lsof.

Sure enough, I could see httpd processes with a constantly changing set of files open under /var/adm/apache/cache. This is expected: the cache garbage collector has to walk the cached objects and decide what to delete in order to stay under the CacheSize limit.
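A rough sketch of that kind of spot check, assuming the usual procps and lsof tools (the exact commands and the placeholder PID are mine, not from the original post):

# processes in uninterruptible sleep ("D" state), i.e. blocked on disk IO
ps -eo pid,stat,comm | awk '$2 ~ /D/'

# open files of a suspect httpd process, filtered to the proxy cache
# (replace <pid> with an actual process id from the listing above)
lsof -p <pid> | grep /var/adm/apache/cache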

What surprised me was how expensive this operation turned out to be on an ext3 filesystem.

To alleviate the problem I changed some filesystem parameters. In /etc/fstab I told the OS not to update access times on files and directories under /var:
/dev/VolGroup00/LogVol04    /var   ext3    defaults,noatime,nodiratime        1 2
Without noatime and nodiratime every file and directory read meant a disk write to update the access time. Since we don't have a use for access times in /var I did away with them.
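The new options can be applied without a reboot by remounting the filesystem; a sketch, assuming /var can be remounted on the spot:

# apply the new mount options on the running system
mount -o remount,noatime,nodiratime /var

# confirm they are active
grep ' /var ' /proc/mounts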

With tune2fs I changed the journaling data option from the default (ordered) to writeback:
# tune2fs -o journal_data_writeback /dev/mapper/VolGroup00-LogVol04
Writeback mode can cause file corruption on unclean shutdowns, but the benefits outweigh the risk for us. Our web content in /var is replicated from a master server, so at worst we would lose non-critical data such as some state files and access log entries.
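Note that tune2fs only records the new default mount option in the superblock; it takes effect the next time the filesystem is mounted. A sketch of how to check both sides of that (commands are mine, not from the original post):

# the superblock should now list journal_data_writeback as a default mount option
tune2fs -l /dev/mapper/VolGroup00-LogVol04 | grep -i 'mount options'

# after the next (re)mount the kernel log should report the data mode, e.g.
# "EXT3-fs: mounted filesystem with writeback data mode."
dmesg | grep -i ext3

Alternatively, the same effect can be had by adding data=writeback to the /var mount options in /etc/fstab.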

After these changes the GC is barely noticeable. If you look at the graph above you will see that there is no load spike from 13:00 onwards.