Tuesday 27 July 2010

Varnish and the Linux IO bottleneck

There are 2 main ways Varnish caches your data:
  1. to memory (with the malloc storage config)
  2. to disk (with the file storage config)
As Kristian explains in a best practices post, you get better performance with #1 if your cache size fits in RAM. With either method the most accessed objects end up in RAM. The malloc method puts everything in RAM and lets the kernel swap it out to disk, while the file storage method puts everything on disk and lets the kernel cache it in memory.
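
As a rough sketch of that rule of thumb, the helper below picks a -s storage argument from the expected cache size and the RAM you are willing to give the cache. The function name, the path /var/lib/varnish/storage.bin and the sizes are made up for illustration; only the malloc,SIZE and file,PATH,SIZE forms are varnishd's own.

```shell
# Hypothetical helper: choose a varnishd -s argument.
# $1 = expected cache size in GB, $2 = RAM available for the cache in GB.
pick_storage() {
    cache_gb=$1
    ram_gb=$2
    if [ "$cache_gb" -le "$ram_gb" ]; then
        # Cache fits in RAM: use malloc storage.
        echo "malloc,${cache_gb}G"
    else
        # Cache is larger than RAM: use file storage (path is an assumption).
        echo "file,/var/lib/varnish/storage.bin,${cache_gb}G"
    fi
}

pick_storage 6 12     # prints: malloc,6G
pick_storage 30 16    # prints: file,/var/lib/varnish/storage.bin,30G
```

The result would be passed to varnishd as -s "$(pick_storage 6 12)" alongside your other options.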

We have been using the file storage for a little more than 3 months on servers with 16GB of RAM and a file storage of 30GB. Recently, because of an application configuration error, the size of the cache grew from its usual maximum of 6GB to 20GB+.

Since the system only has 16GB of RAM, the kernel now needs to choose which pages to keep in memory. When a dirty mmap'ed page needs to be released, the kernel writes the changes to disk before doing so. More than that, the Linux kernel will proactively write dirty mmap'ed pages to disk.

On a Varnish server with this setup, or with any other application that uses mmap() on large files, this translates into constant disk activity.
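
One way to watch this happening is to read the Dirty and Writeback counters from /proc/meminfo. The function below is a minimal sketch; its name and the optional file argument are my own additions, not part of any tool.

```shell
# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
# $1 = file to read; defaults to /proc/meminfo.
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "${1:-/proc/meminfo}"
}

# Sample once; wrap the call in a sleep loop to follow it over time.
if [ -r /proc/meminfo ]; then
    dirty_kb
fi
```

If Dirty stays high while the disks are busy, the kernel is spending its time writing mmap'ed pages back.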

At some point the varnish worker process will be blocked by a kernel IO call. The Varnish manager process, with no way to tell what's happening to the child, believes it has stopped responding and kills it. The log entries below are typical:

00:30:10.395590 varnishd[22919]: Child (22920) not responding to ping, killing it.
00:30:10.395622 varnishd[22919]: Child (22920) not responding to ping, killing it.
00:30:10.417309 varnishd[22919]: Child (22920) died signal=3

At this point you lose your cached data and the system goes back to normal. After some time, the size of the cache will grow to be larger than your server's RAM again, repeating the kill-restart cycle.
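
To get a feel for how often the cycle repeats, you can count the "died" lines in the log. This is only a sketch: the function name is hypothetical, and where varnishd logs to depends on your syslog setup, so the example feeds it the entries shown above instead of a real file.

```shell
# Count how many times the manager logged the death of its child.
# Reads log text on stdin and prints the number of matching lines.
count_child_deaths() {
    # grep -c exits non-zero when nothing matches; keep the exit status clean.
    grep -c 'Child ([0-9]*) died signal=' || true
}

count_child_deaths <<'EOF'
00:30:10.395590 varnishd[22919]: Child (22920) not responding to ping, killing it.
00:30:10.395622 varnishd[22919]: Child (22920) not responding to ping, killing it.
00:30:10.417309 varnishd[22919]: Child (22920) died signal=3
EOF
# prints: 1
```

Counting the "died" line rather than the "not responding" line avoids counting one kill twice.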

You could choose to give the child more time to respond to the manager process' ping requests. This is done by increasing the value of cli_timeout from its default of 10 seconds (e.g. varnishd -p cli_timeout=20), but this merely masks the issue. The real issue is that the Linux kernel is busy writing dirty pages to disk.

There are parameters you can use to control how much time the kernel spends writing dirty mmap pages to disk. I have spent some time fine-tuning the parameters below, with little result:

/proc/sys/vm/dirty_writeback_centisecs
/proc/sys/vm/dirty_ratio
/proc/sys/vm/dirty_background_ratio
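
Before changing any of these it helps to record what they are currently set to. A minimal sketch follows; the function name and the optional directory argument are my own, added so it can be pointed somewhere other than /proc/sys/vm.

```shell
# Print the current value of each writeback tunable listed above.
# $1 = directory holding the files; defaults to /proc/sys/vm.
show_vm_tunables() {
    dir=${1:-/proc/sys/vm}
    for name in dirty_writeback_centisecs dirty_ratio dirty_background_ratio; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s = (not available)\n' "$name"
        fi
    done
}

show_vm_tunables
```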

In RHEL 5.x kernels you can completely disable committing changes to mmap'ed files:

echo 0 > /proc/sys/vm/flush_mmap_pages
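
Since the setting only exists on some kernels, a slightly more defensive version checks for it first. This is a sketch with a hypothetical function name; it needs root to write to /proc.

```shell
# Disable flushing of dirty mmap'ed pages, if the kernel exposes the setting.
# $1 = path to the tunable; defaults to the RHEL 5.x location.
disable_mmap_flush() {
    knob=${1:-/proc/sys/vm/flush_mmap_pages}
    if [ -w "$knob" ]; then
        echo 0 > "$knob"
        echo "flush_mmap_pages set to $(cat "$knob")"
    else
        echo "flush_mmap_pages not available or not writable: $knob" >&2
        return 1
    fi
}
```

Call disable_mmap_flush as root; a non-zero return means the file is missing or not writable and nothing was changed.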

Short of reducing the file storage size to fit in RAM, this is the best solution I have found so far. Be aware that using this with the upcoming persistent storage in Varnish is a really bad idea, as you risk serving corrupt and/or stale data. It is only acceptable here because the mmap'ed data, in this setup, is superfluous: Varnish throws away the cache on restart and can always fetch the object from the backend.

I have asked Red Hat if they plan to keep the flush_mmap_pages setting in RHEL 6 but haven't received a response yet. They did, however, confirm that with the setting disabled msync() calls are still honoured and dirty pages being evicted from RAM are still committed to disk.

1 comment:

Ole Laursen said...

Have you tried the simple cache in nginx? Would be interested in hearing your results.

It's using the file system for storing individual objects, thus side-stepping the whole issue of allocating a big honk of memory and having to use a hoard of threads to scale. I read somewhere that Varnish is using the big honk to avoid the overhead of the file system, but actually nginx is faster at caching than Varnish in my simple tests.

Of course, relying on the disk is only ever going to work if you only need it sporadically. If you have a constant high rate of random writes on a spinning disk, you're toast no matter what you do. I guess an intelligent cache would try to detect this and start throwing away objects rather than saving them. nginx has a simple proxy_cache_min_uses you can set to at least catch a simple each-object-is-different problem.