Health checks matter to the infrastructure that depends on them. But are they also important to developers and operations teams, or just noise?
Akita automatically discovers and monitors all the API endpoints in use. But because this coverage is driven by observations from the network layer, some users’ Akita experiences were dominated by their health checks, burying their real APIs. This led us to develop algorithms for automatically filtering out health check endpoints.
But the ability to automatically characterize APIs as health checks led to other questions. Do we report on health check endpoints in our alerting and summaries, or omit them as irrelevant? If those APIs are experiencing high error rates or high latencies, should we highlight them as outliers, or focus on the core API? Some of us on the team felt that health checks were so ubiquitous and low-information that there was no way they could be meaningful. Others had experiences where knowing that a health check was performing badly was a key insight.
In this post, I’ll discuss what health checks are, why and when they are meaningful, how to build good ones, and what some of the alternatives to health checks are. I’ll then talk about why you might want to monitor your health checks.
(After reading this post, please email our team and tell them you agree with me. 🙃)
What are health checks and why should I care?
A primary concern for the infrastructure that runs your applications, and the operations team that manages it, is whether a service is “up” and ready to accept requests. Many network services answer this question by supporting APIs that don’t do anything but signal that the service is alive. These might be called “health checks” or “pings” or “liveness probes”; in this post I’ll call them all “health checks.” An API call that doesn’t do anything might not seem important, but it can be critical to the correct operation of your application.
Health checks for internal infrastructure
The primary reason that developers include health check APIs is that they are using infrastructure that requires them to do so! If your service can’t respond to a health check, it won’t get traffic. Most platforms aren’t very smart, and the only way they have to judge your service’s health is to keep asking “are you still there?”
For example, if you’re using a load balancer, a health check API is the mechanism by which that load balancer decides it’s OK to send traffic to a particular instance of your service. If you are supposed to have three servers running, but one of them is offline, you want all the traffic directed to the two remaining servers. The load balancer will send a steady stream of requests to a particular HTTP endpoint, usually something like /healthz or /ping. Servers that don’t respond with a 2xx status code, or don’t respond in a timely fashion, will stop receiving traffic.
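To make the load balancer’s rule concrete, here is a minimal sketch in Python. It is not any particular load balancer’s implementation: the `ProbeResult` type, the function names, and the 5-second default timeout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    """Outcome of one health check request to one instance (hypothetical type)."""
    instance: str
    status_code: Optional[int]  # None means no response arrived at all
    elapsed_seconds: float


def probe_passed(result: ProbeResult, timeout_seconds: float = 5.0) -> bool:
    """A probe passes only if a 2xx status arrived within the timeout."""
    if result.status_code is None:
        return False
    if result.elapsed_seconds > timeout_seconds:
        return False
    return 200 <= result.status_code < 300


def routable_instances(results: List[ProbeResult],
                       timeout_seconds: float = 5.0) -> List[str]:
    """Return the instances that should keep receiving traffic."""
    return [r.instance for r in results if probe_passed(r, timeout_seconds)]


results = [
    ProbeResult("server-a", 200, 0.02),
    ProbeResult("server-b", None, 5.0),  # offline: no response
    ProbeResult("server-c", 204, 0.05),
]
print(routable_instances(results))  # server-b is left out
```

With one of three servers offline, only the two responsive ones remain in the routable set, which is the behavior described above.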
Similarly, in Kubernetes, one of the available ways that the system can check that your pod is able to receive traffic is to send an HTTP request. The Kubernetes “kubelet” on each node can check that the pods on that node are all still responding to calls to a /healthz path. If a pod isn’t responding, that pod is marked as unhealthy and restarted.
(Why the “z” suffix? It’s a Google convention that escaped into the rest of the world, to make collisions less likely. If you already had a /health endpoint in your API for unrelated reasons, you don’t need to rename it to start using health checks.)
Health checks for external monitoring
Health checks aren’t just for internal infrastructure. You’ll sometimes also see a health check endpoint that permits an external provider to monitor the availability of your service. A /healthz endpoint for Kubernetes need not be exposed to the rest of the world – there’s no point – but a health check for end-to-end monitoring should be.
For example, you could set up measurement points in different geographic locations and take measurements about how often they can connect to the health-check endpoint, and with what latency. This alerts you both to problems with your service – if it can’t respond – and to problems with naming and routing that make your service unavailable – if the requests can’t even get to your service at all. This “active monitoring” can help with fault isolation by telling you if a problem is specific to a particular geographic region, or tell you about DNS problems that are invisible from within your data centers.
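As a rough sketch of how those measurements might be summarized, the Python below groups probe results by region and reports availability and median latency. The `Measurement` type, the region names, and the numbers are made up for illustration; a real monitoring provider will have its own data model.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, NamedTuple, Optional


class Measurement(NamedTuple):
    """One probe from one measurement point (hypothetical type)."""
    region: str
    latency_ms: Optional[float]  # None: the probe never reached the endpoint


def summarize_by_region(measurements: List[Measurement]) -> Dict[str, dict]:
    """Compute availability and median latency for each region."""
    grouped = defaultdict(list)
    for m in measurements:
        grouped[m.region].append(m.latency_ms)

    summary = {}
    for region, latencies in grouped.items():
        succeeded = [x for x in latencies if x is not None]
        summary[region] = {
            "availability": len(succeeded) / len(latencies),
            "median_latency_ms": median(succeeded) if succeeded else None,
        }
    return summary


measurements = [
    Measurement("us-east", 40.0),
    Measurement("us-east", 60.0),
    Measurement("eu-west", None),  # e.g. a DNS or routing failure
    Measurement("eu-west", None),
]
for region, stats in summarize_by_region(measurements).items():
    print(region, stats)
```

A region whose availability drops to zero while the others stay healthy points at naming or routing rather than at the service itself.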
An external health check endpoint is generally aimed at alerting your ops team, rather than causing automatic changes to your infrastructure. But there is a lot of overlap; the same health check can be used to identify sources of failure and to automatically attempt to resolve them.
Building a good health check
A developer writing a health check faces a paradox. A health check must simultaneously be robust and cheap, while also accurately reflecting whether the service is available. If the health check endpoint engages in a complex behavior, it may become less reliable and provoke unnecessary restarts or cause hot-spots as a load balancer struggles to adjust. But, if the health check does the simplest thing and always says a server is OK, end-user performance will suffer as their requests are routed to an otherwise-dead process.
If your service depends on a database, should your health check endpoint attempt to ping the database in turn? If your server requires free disk space to operate, should the health check monitor that requirement? If your API has hundreds of endpoints, should the functionality behind all of them be exercised? Or will that make the health check horribly unreliable and untrustworthy, or have it become a bottleneck to the development team?
My take: the best approach is for each health check to do a single thing, and do it well.
For purposes of active probing, it’s generally useful to simulate real activities by calling the real API endpoints! This requires setting up test data or a test user, which may not represent the whole behavior of the system, but it exercises many of the same code paths. Active probing can be configured to use many different API endpoints, rather than just one.
Liveness checks from a load balancer or from Kubernetes are a different matter. It’s difficult to configure them with the necessary authentication or payloads to make real API calls. Typically all you can configure is a single fixed URL – not authentication headers, cookies, a request body, or any sort of changing fields or identifiers. Because of this, the operation must necessarily be idempotent, and probably read-only, so that it can be successfully called over and over again. Instead of attempting to list all of the service’s dependencies, it is sufficient to check that the main event loop of the process is still functioning. Other failure modes can be handled in a more direct way – by crashing! Bringing down the whole process will ensure that the health check stops responding too.
But, it’s possible to get a few basic things correct that expand coverage. The health check should probably use the same TCP port as the rest of the service, for fate-sharing reasons, but need not go through the same logging or authentication middleware. If instead the health check is handled by a dedicated thread on a different socket, then its operation is disconnected from whether other requests are still being handled.
There’s not a hard-and-fast rule here, but health checks should not be over-engineered to the point of becoming a headache to maintain, nor so disconnected from the rest of the program as to lose their utility. It’s fine to have one that returns 200 as long as the process is still handling any HTTP requests.
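Here is one way that advice could look in practice, as a minimal sketch using only Python’s standard library. The handler name, the `/healthz` path, and the placeholder 404 for everything else are assumptions for illustration; your framework will have its own routing. The point is that the health check shares a port and a server with the real API, does no work of its own, and skips request logging.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Read-only and idempotent: no auth, no body, no dependency calls.
            self._reply(200, b"ok")
        else:
            # Placeholder for the real API routes, served on the same port.
            self._reply(404, b"not found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Skip request logging for health checks only.
        if getattr(self, "path", None) != "/healthz":
            super().log_message(format, *args)


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_server()
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=2)
conn.request("GET", "/healthz")
status = conn.getresponse().status
conn.close()
server.shutdown()
server.server_close()
print(status)
```

Because the probe goes through the same listening socket and request loop as everything else, it stops answering when the server stops handling requests, which is the fate-sharing property discussed above.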
How health checks became the gold standard
A load balancer or orchestration platform doesn’t necessarily need to use health checks, but there are good reasons why this has become the standard approach in most situations.
One no-overhead alternative approach to health checks is to forward a real inbound request and see if it works, but the results can be ambiguous or even disastrous. In this architectural design, if the server doesn’t respond, then the request can be directed to another of the configured servers, and the server marked as unavailable for some period of time.
But consider what happens when a particular request may trigger a 500 response from the server either because of a software bug or a dependency problem. Should the load balancer cease sending traffic to that particular server just because it encountered a bug? The other servers likely have the same problem. Worse, if a request crashes a server, should the load balancer really replay it to every other configured server? Bringing down every node in your cluster with one weird trick seems like a poor architectural choice.
The API that’s being retried may be non-idempotent and could be costly to retry. A user might get a confusing response when the first call was partially successful. Leaving a retry decision to the client is usually a better option. A dedicated health check keeps the decision to take a node out of service separate from how that node responded to any particular request.
Another alternative approach, more commonly seen in continuous integration systems, is for the servers to actively tell the load balancer their state. They “push” notifications instead of waiting for the “pull” of a status check. For an API server, this is a little less desirable because a health check API exercises at least part of the API stack. But it has about equivalent cost and complexity.
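A push model can be sketched as a registry that remembers when each server last reported in and treats silence as failure. This is an illustrative sketch only: the `HeartbeatRegistry` class, its method names, and the 30-second TTL are assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List


class HeartbeatRegistry:
    """Tracks the last heartbeat pushed by each server (hypothetical class)."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, server: str, now: float) -> None:
        """Called when a server pushes an 'I am alive' notification."""
        self._last_seen[server] = now

    def live_servers(self, now: float) -> List[str]:
        """Servers whose most recent heartbeat is no older than the TTL."""
        return sorted(
            server
            for server, seen in self._last_seen.items()
            if now - seen <= self.ttl_seconds
        )


registry = HeartbeatRegistry(ttl_seconds=30)
registry.heartbeat("server-a", now=0)
registry.heartbeat("server-b", now=20)
print(registry.live_servers(now=25))  # both reported recently
print(registry.live_servers(now=40))  # server-a has gone quiet
```

Note that nothing here touches the server’s HTTP stack, which is the drawback mentioned above: a server can keep sending heartbeats while its API is unable to answer requests.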
Why you should monitor your health checks
Particularly when you’re just starting a new service, the health checks may dominate your reported load and just look like noise. After all, the liveness probes that Kubernetes performs default to one every 10 seconds – so you get 8,640 calls a day just for spinning up a single pod!
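The arithmetic is simple enough to check, and it scales linearly with the number of pods. In this small sketch, the function name and the three-pod example are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def probes_per_day(period_seconds: int, pods: int = 1) -> int:
    """Probe requests per day for a given probe period and pod count."""
    return (SECONDS_PER_DAY // period_seconds) * pods


print(probes_per_day(10))          # one pod, one probe every 10 seconds: 8640
print(probes_per_day(10, pods=3))  # hypothetical three-pod deployment: 25920
```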
In the Akita App, we’ve started identifying health check endpoints and putting them in their own category. But that doesn’t mean we think you should completely ignore them. Here are a few reasons why.
High rates of health check failures make your infrastructure less reliable. The AWS load balancer “fails open” in cases where all its targets report failure. If all the servers are not ready, then traffic is sent to all of them rather than none of them, in case the problem is the health check itself. But this means you might not notice that the system is not working as intended! A server that marks itself unhealthy is treated no differently than the healthy servers, so users may see more 500 and 502 errors than if the load balancer were able to distinguish the two.
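The fail-open rule fits in a few lines. This is a sketch of the behavior described above, not AWS’s code; the function name and the dictionary of health states are assumptions.

```python
from typing import Dict, List


def routing_targets(health: Dict[str, bool]) -> List[str]:
    """Pick targets for traffic, failing open when nothing is healthy."""
    healthy = [name for name, ok in health.items() if ok]
    if healthy:
        return healthy
    # Fail open: every target reports failure, so route to all of them
    # in case the health check itself is what is broken.
    return list(health)


print(routing_targets({"server-a": True, "server-b": False}))
print(routing_targets({"server-a": False, "server-b": False}))
```

In the second call both servers are unhealthy, yet both still receive traffic, which is why a fully failing fleet can look deceptively normal from the outside.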
Health check failures are leading indicators for other potential issues. Your health check endpoint may show occasional high latencies, either due to queuing or because it performs some nontrivial operation. A health check request might be delayed because the server is busy with many other requests. Or, it might make a database call or some other system check. In either case its latency might affect reliability. For example, Amazon ELB’s timeout on a health check is five seconds. Fail two of them (even if you respond with a 200 eventually) and the entire server will be marked unhealthy. This can cause increased load on other servers within your cluster, or cause them to slow down and also be marked unhealthy.
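To see why a slow health check is as bad as a failing one, here is a sketch of a consecutive-failure counter. The five-second timeout and the threshold of two come from the example above; the `TargetHealth` class is hypothetical, and recovering after a single success is a simplification (real load balancers typically want several successes in a row).

```python
from typing import Optional


class TargetHealth:
    """Tracks one target's health from a stream of probe results."""

    def __init__(self, timeout_seconds: float = 5.0, unhealthy_threshold: int = 2):
        self.timeout_seconds = timeout_seconds
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0
        self.healthy = True

    def record(self, status_code: Optional[int], elapsed_seconds: float) -> bool:
        """Record one probe and return whether the target is still healthy."""
        passed = (
            status_code is not None
            and 200 <= status_code < 300
            and elapsed_seconds <= self.timeout_seconds
        )
        if passed:
            # Simplification: one success restores the target immediately.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


target = TargetHealth()
print(target.record(200, 0.1))  # fast success
print(target.record(200, 6.0))  # success, but past the timeout: first failure
print(target.record(200, 7.5))  # second late response: marked unhealthy
```

The server never returned an error in this run. Two late 2xx responses were enough to take it out of rotation.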
So, like any other part of your system, you should care that your health check endpoints are stable and reliable. Monitoring them for excess errors or high latencies can alert you to potential instability, or point at problems elsewhere in your service.
Using Akita to monitor health checks
While it’s certainly possible to use any monitoring service to keep an eye on your health check endpoints, what we’ve built at Akita makes it especially easy. If you’re using Docker, container platforms, or Kubernetes, you can simply give Akita access to watch your API traffic and automatically see what endpoints exist, what’s slow, and what’s throwing errors. In addition to plug-and-play monitoring, you get plug-and-play alerts. Especially if you have many health check endpoints, Akita quickly gives you what you need to keep an eye on them. (The latest: not only does Akita now automatically classify health check endpoints to make it easy to hide/unhide them in your dashboards, but you can now correct Akita if it misclassified an endpoint.)
We’re in open beta and you can sign up to use us here.
Photo by Karsten Winegeart on Unsplash.