Varnish trick: Serve stale content while refetching

Here is a small trick we recently implemented for a customer:

The main premise was:

No client should have to wait while the backend works. If a request is a miss, give the client a slightly stale page instead, and fetch a fresh one in the background.

Since Varnish and VCL are super configurable, we can do this with a VCL hack and a small helper process.

The flow is that a client requests something that just expired. In vcl_miss we notice this and change the backend to one that is marked sick. We also log the URL that just expired with std.log(), before restarting the request handling. Back in vcl_recv the usual sick-backend behaviour kicks in, and the slightly stale graced object is served to the client.
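A minimal VCL sketch of this flow could look roughly like the following (Varnish 3 syntax; the backend addresses, the X-Refetch header, the "refetch" log prefix, and the TTL/grace values are all assumptions for illustration):

```vcl
import std;

backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# A backend that is always sick: nothing listens on this port and the
# probe keeps failing, so requests sent here fall back to grace.
backend sick_bk {
    .host = "127.0.0.1";
    .port = "9999";
    .probe = { .url = "/"; .interval = 1s; .window = 2; .threshold = 2; }
}

sub vcl_recv {
    if (req.http.X-Refetch) {
        # Request from the helper process: force a fresh fetch.
        set req.hash_always_miss = true;
    }
    if (req.restarts > 0) {
        # Restarted request heading for the sick backend: allow a long
        # grace so the stale object is delivered immediately.
        set req.grace = 2m;
    }
}

sub vcl_fetch {
    # Keep objects around past their TTL so grace can use them.
    set beresp.grace = 2m;
}

sub vcl_miss {
    if (req.restarts == 0 && !req.http.X-Refetch) {
        # Tell the helper process to refetch this URL, then restart and
        # serve the graced copy by pretending the backend is down.
        std.log("refetch " + req.url);
        set req.backend = sick_bk;
        return (restart);
    }
}
```

This is a sketch, not a drop-in configuration: as discussed in the comments below, a genuinely cold miss has no graced copy to fall back on, so the restart path only helps for objects that have merely expired.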

Outside Varnish there is a small Python script that tails varnishlog output for a special VCL_Log entry. When it picks one up, it sends a request for the same URL to the local Varnish. In vcl_recv we detect this client, set req.hash_always_miss to force a refetch, and let the Python script wait while the backend recreates the page.
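A sketch of such a helper process might look like this (the listen address, the X-Refetch header, and the "refetch" log prefix are assumptions and must match whatever your VCL actually uses):

```python
#!/usr/bin/env python3
"""Sketch of the refetch helper: tail varnishlog for our log tag and
refetch each logged URL so Varnish caches a fresh copy."""
import subprocess
import urllib.request

TAG = "refetch "                   # assumed prefix written with std.log() in VCL
VARNISH = "http://127.0.0.1:6081"  # assumed local Varnish listen address

def extract_url(line, tag=TAG):
    """Return the URL from a varnishlog VCL_Log line, or None."""
    idx = line.find(tag)
    if idx == -1:
        return None
    return line[idx + len(tag):].strip()

def main():
    # -i VCL_Log limits varnishlog output to the entries we log from VCL.
    proc = subprocess.Popen(["varnishlog", "-i", "VCL_Log"],
                            stdout=subprocess.PIPE, universal_newlines=True)
    for line in proc.stdout:
        url = extract_url(line)
        if url is None:
            continue
        # The X-Refetch header tells vcl_recv to set hash_always_miss.
        req = urllib.request.Request(VARNISH + url,
                                     headers={"X-Refetch": "true"})
        try:
            urllib.request.urlopen(req, timeout=120).read()
        except Exception:
            pass  # a failed refetch will be retried at the next expiry

if __name__ == "__main__":
    main()
```

The helper is the one client that actually waits on the backend, so its request timeout should comfortably exceed your slowest page generation time.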

All requests that come in while the refetch is underway will be served graced copies at full speed.

Your 95th-percentile response-time graphs will love this feature, and maybe some of your users will as well. Cool, huh?

More information about grace in Varnish:

https://www.varnish-software.com/static/book/Saving_a_request.html#core-grace-mechanisms


9 Responses to Varnish trick: Serve stale content while refetching

  1. citizen kane says:

And how would you know if the requested object is outdated, as opposed to simply not in the cache? The latter case would leave your user with a 503 after the three restarts. If you were to catch that too, say by letting the real user be the one to wait after the second restart, you would end up having the user wait AND having the backend hit twice for the same content.

    • Dear Mr. Kane.

Since the helper process client always gets hash_always_miss set, the backend can in those occurrences get hit twice.
In practice, given sane TTL and grace values, it doesn't matter.

For the customer in question the TTL was around a minute or two, with a matching grace.

I guess in the cold/restarted-cache scenario it would be an optimization to let the helper process wait a TTL or two before starting up, but for that customer it wasn't a big issue.

  2. kalim says:

Hi Lasse,
Can you please provide a sample showing how to implement this?

    • Hi Kalim.

      If you use Varnish 4 you get this kind of behaviour built in.
      Asynchronous background fetch is the default there, no magic needed.

      -Lasse

      • kalim says:

        Hi Lasse,
The project architecture uses Varnish 3.0.5. Do you have any idea what can be done in that case?
        Regards
        Kalim

      • If this is an important factor for you and the project, I’d recommend that you either find a consultant that understands enough VCL to replicate this, or fix your project architecture to allow for 4.0.

  3. kalim says:

Hi Lasse, I upgraded to 4 and I guess it is working fine. I set the TTL to 5 mins and I could not see any misses after 5 mins. I think it is now continuously caching.
I have not made any extra changes as of now. I hope that is fine?

Hi. Varnishlog will tell you what is happening. The parameter default_grace (default 10s) defines how long past its TTL an object is kept around. If your backend needs more than 10s to respond, you should increase that parameter. Otherwise, you're good to go.
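Instead of raising the global default_grace parameter, the grace window can also be set per object in Varnish 4 VCL. A minimal sketch (the TTL and grace values here are just examples, not recommendations):

```vcl
vcl 4.0;

sub vcl_backend_response {
    # Cache for 5 minutes, then keep serving the stale copy for up to
    # 2 more minutes while a background fetch refreshes the object.
    set beresp.ttl = 5m;
    set beresp.grace = 2m;
}
```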

  4. Kalim says:

    Hi Lasse,
Got this thing to work right… Thanks…
Need a small favor: I am getting frequent internal 500 errors, mostly on the homepage. There are many modules on it and the page is also heavy (around 4.5 MB).
How can I check the root cause of it? I checked the logs but nothing related was logged.
To fix it temporarily I have to ban the URL through the Varnish admin, and then it works fine.
Is this related to a timeout? Can I tune some parameters for this?
