Saturday 10 March 2012

Using Varnish to increase the cache time of slow pages

If you are having trouble getting your organisation to accept performance as a feature, one solution is to tie it to another feature.

A common discussion between content owners ("the business") and web operators is how long to cache pages for.  The content owners want to see their fresh content in minimal time but web operators want to cache expensive (to generate) content longer.

Usually content owners get their way, as they should, or you wouldn't have a business. So you set the default cache time on all pages to something low, like 10 minutes. If you are not quite ready to take the ESI (Edge Side Includes) leap, an attractive compromise is to penalise only the expensive pages with a longer expiry time.

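If your server backend runs on Ruby on Rails, the job is already half-done, since responses carry an X-Runtime HTTP header indicating, in milliseconds, how long it took Rails to generate the page. Most other web frameworks either provide an equivalent header or make it easy to add one. Check the unit your stack emits: Rack::Runtime, for example, reports seconds as a decimal rather than integer milliseconds, in which case the pattern below needs adjusting.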

We use the VCL code below to override the internal Varnish TTL for slow-to-generate pages.


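# Varnish 2.1.5 VCL: extend the TTL of pages that take 1s or longer to generate.
# Assumes X-Runtime is an integer number of milliseconds, so a value with
# four or more leading digits means at least 1000 ms.
sub vcl_fetch {
    if (beresp.http.X-Runtime ~ "^[0-9]{4,}") {
        # Only ever raise the TTL, never shorten one that is already longer.
        if (beresp.ttl < 1d) {
            set beresp.ttl = 1d;
            # Marker header so slow pages can be identified later.
            set beresp.http.X-Cache-Override = "1d";
        }
    }
}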

We deliberately leave the original caching headers, which are sent to the client, untouched. Browsers still come back to ask for the "new" content on the original schedule, while you remain free to keep the object in Varnish or ban it when the content changes.

The code also sets an "X-Cache-Override" header so you can tell which pages are taking 1 second or more to generate.

This is great because now, when someone complains that their content isn’t refreshing, we can improve performance in the areas of the site that matter to the business, instead of taking the whole “make the site faster” approach.

The results came quickly. Two days after we put this in production, Google showed that the average page load time on our (very large) site had dropped by half a second. As you can see in the graph below, the trend is still going down.
[Graph: site speed improvement as perceived by the Google bot]

A further improvement would be to write an algorithm which sets the TTL based on the page generation time. For example, add 1 day to the TTL for each second the page takes to generate: if a page takes 5 seconds to generate, keep it for 5 days.
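
As a rough sketch of that idea, and not something we run in production: as far as I know, Varnish 2.1 VCL has no way to turn a header string into a duration, so the version below approximates "one day per second" with one regex tier per second. It assumes X-Runtime is an integer number of milliseconds with no leading zeros. The tier boundaries and the 5-day cap are illustrative choices, not measured values.

```vcl
# Sketch for Varnish 2.1.x: roughly one day of TTL per second of generation time.
# Assumes X-Runtime is an integer number of milliseconds (hypothetical tiers).
sub vcl_fetch {
    if (beresp.http.X-Runtime ~ "^1[0-9]{3}$" && beresp.ttl < 1d) {
        # 1000-1999 ms
        set beresp.ttl = 1d;
        set beresp.http.X-Cache-Override = "1d";
    } elsif (beresp.http.X-Runtime ~ "^2[0-9]{3}$" && beresp.ttl < 2d) {
        # 2000-2999 ms
        set beresp.ttl = 2d;
        set beresp.http.X-Cache-Override = "2d";
    } elsif (beresp.http.X-Runtime ~ "^3[0-9]{3}$" && beresp.ttl < 3d) {
        # 3000-3999 ms
        set beresp.ttl = 3d;
        set beresp.http.X-Cache-Override = "3d";
    } elsif (beresp.http.X-Runtime ~ "^4[0-9]{3}$" && beresp.ttl < 4d) {
        # 4000-4999 ms
        set beresp.ttl = 4d;
        set beresp.http.X-Cache-Override = "4d";
    } elsif (beresp.http.X-Runtime ~ "^([5-9][0-9]{3}|[0-9]{5,})$" && beresp.ttl < 5d) {
        # 5000 ms and above, capped at five days
        set beresp.ttl = 5d;
        set beresp.http.X-Cache-Override = "5d";
    }
}
```

As in the original snippet, each tier only ever raises the TTL, and the X-Cache-Override header records which tier a page landed in, so the slowest pages are easy to spot.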