Monday, December 11, 2017

We have been doing health checks wrong in Plone

In the previous post of this blog I discussed how to increase the performance of high-traffic Plone sites. As mentioned there, we have different ways to do so: increasing the number of server threads, increasing the number of instances, or doing both.

Increasing the number of threads is easier and consumes less memory but, as mentioned, is less resilient and can be affected by the Python GIL on multicore servers. Increasing the number of instances, on the other hand, will increase the complexity of our stack, as we will need to install a load balancer: a piece of hardware or software that distributes the incoming traffic across all the backend instances.

Over the years we have tried different software load balancers depending on the traffic of a site: we use nginx alone as web server and load balancer on smaller sites, and we add Varnish as web accelerator and load balancer when the load increases; lately we started using HAProxy again to solve extreme problems on sites being continuously attacked (I'll write about that in a different post).

When you use a load balancer you have to do health checking, as the load balancer needs to know when one of the backend instances has become unavailable because it is too busy to answer further requests, has been restarted, is out for maintenance, or is simply dead. And, in my opinion, we have been doing it wrong.

The typical configuration for health checking is to send requests to the same port used by the Zope HTTP server, and this has some fundamental problems. First, the Zope HTTP server is slow to answer requests: it can take hundreds of milliseconds to answer even the simplest HEAD request.

To make things worse, the requests that are answered by the Zope HTTP server are the slowest ones (content not in the ZODB cache, search results, you name it…), as the most common requests are already being served by the intermediate caches. In the tests I made, I found that even a well configured server running a mature code base can take as long as 10 seconds to answer this kind of request. This is a huge problem, as health check requests start queuing and timing out, taking perfectly functional instances out of the pool and making things even worse.

To avoid this problem we normally configure health checks with long intervals (typically 10 seconds) and windows of 3 failures for every 5 checks. And this, of course, creates another problem: when an instance is restarted by things like Supervisor and its memmon plugin, the load balancer takes, in our case, up to 30 seconds to discover it, leading to a lot of 503 Service Unavailable errors in the meantime.
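
In Varnish terms, for instance, that kind of health check looks roughly like the sketch below; the backend address and probe URL are just examples:

# Varnish VCL sketch: probing the same port used by the Zope HTTP server
probe zope_health {
    .url = "/";          # answered by the (slow) Zope HTTP server
    .interval = 10s;     # long interval to keep slow checks from piling up
    .timeout = 5s;
    .window = 5;         # look at the last 5 checks...
    .threshold = 3;      # ...and require 3 of them to pass
}

backend instance1 {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = zope_health;
}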

So, it's a complete mess no matter how you analyze it, and I needed to find a way to solve it: enter five.z2monitor.

five.z2monitor plugs zc.monitor and zc.z3monitor into Zope 2, enabling another thread and port to handle monitoring. To install it you just need to add something like this into your buildout configuration:

[buildout]
eggs =
    …
    five.z2monitor
zcml =
    …
    five.z2monitor

[instance]
zope-conf-additional =
    <product-config five.z2monitor>
        bind 127.0.0.1:8881
    </product-config>


After running buildout and restarting your instance you can communicate with your Zope server over the new port using different commands, called probes:

$ bin/instance monitor help
Supported commands:
  dbinfo -- Get database statistics
  help -- Get help about server commands
  interactive -- Turn on monitor's interactive mode
  monitor -- Get general process info
  ok -- Return the string 'OK'.
  quit -- Quit the monitor
  zeocache -- Get ZEO client cache statistics
  zeostatus -- Get ZEO client status information


Suppose you want to get the ZEO client cache statistics for this instance; all you have to do is use the following command:

$ bin/instance monitor zeocache main
417554 895451465 435095 900622900 35429160


You can also use the Netcat utility to get the same information:

$ echo 'zeocache main' | nc -i 1 127.0.0.1 8881
417753 896710955 435422 901905068 35467686


It's easy to extend the list of supported commands by writing your own probes; in the list above I have added to the default command set one that I created to know if the server is running or not; it's called "ok".
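
A probe is just a callable that takes the monitor connection as its first argument and writes its reply back to it; the docstring becomes the description shown by the help command. A minimal sketch of the "ok" probe looks like the following; the module path and the ZCML registration are illustrative and assume the zc.z3monitor plugin interface:

# probes.py
def ok(connection):
    """Return the string 'OK'."""
    connection.write('OK\n')

<!-- configure.zcml: register the probe as a named utility, like the default ones -->
<utility
    component=".probes.ok"
    provides="zc.z3monitor.interfaces.IZ3MonitorPlugin"
    name="ok"
    />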


We now have a dedicated port and thread that can be used for health checking:

$ echo 'ok' | nc -i 1 127.0.0.1 8881
OK


With this we solved most of the problems I mentioned above: we have faster response times and no queuing; we can decrease the health check interval to a couple of seconds, and we can be almost sure that a failure is a real failure and not just a timeout.

Note that we can't use this with nginx or Varnish, as their health checks are limited and expect to use the same port as the HTTP requests; only HAProxy supports this configuration.

So, in the next post I'll show you how to configure HAProxy health checks to use this probe and how to reduce the latency to a couple of milliseconds.
