The Case for API-Centric Observability

by

Mark Gritter

Many dogs on many leashes, an analogy for per-endpoint API monitoring and observability.

We hear your story a lot.

You’re in charge of a handful of backend services at your company, maybe all of them. You don’t have the luxury of having modern DevOps practices in place. You tell yourself “yet” while knowing that things aren’t going to change soon.

Chances are, you’ve fallen into the API monitoring gap: you’ve set up monitoring at the level of entire services, but your day-to-day problems are with individual endpoints. You’ve likely set up a monitoring solution like AWS CloudWatch that reports only service-level error count, volume, and latency. You’ve looked into setting up Datadog, Grafana, Honeycomb, or LightStep, and maybe you’ve even started to do so, but you’ve discovered that it is a lot of work just to answer the simple question you care about: which endpoints are doing poorly, and how poorly? As a result, you’re getting more alerts than you actually need, and you’re spending a lot of time root-causing them.

I’m here to tell you that the API monitoring gap is not your fault. Your existing tools monitor at the wrong level of abstraction, and your aspirational tools will take a large effort to set up and maintain. To help bridge the gap, I make the case for API-centric observability: monitoring and observability that focuses on the unit of value for many web apps, the API endpoint.

In this post, I’ll identify what people are missing with purely service-level alerting, explain why many are stuck with it, and show how API observability can make things better.

Why per-endpoint monitoring makes life that much easier

If you have only one important endpoint, then maybe it’s good enough to monitor error count, volume, and latency at the level of the entire service. But in most cases, it’s likely that the view you’re getting from an aggregate monitoring solution like AWS CloudWatch is a lie.

Service-level averages miss endpoints that need attention

A typical API will have some endpoints that are frequently called, and others that see only occasional use. Setting a latency target or error rate threshold for the service as a whole will completely obscure what’s going on with those smaller, but still important, endpoints.

For example, suppose your application calls one endpoint 1000 times per minute, while it only sends 100 calls per hour to another endpoint. The less-commonly-used endpoint then accounts for about 0.17% of traffic, so the 99th percentile (p99) latency for the service is essentially unaffected by its latency. It doesn’t matter whether it’s 100ms or 30 seconds: that additional latency will not show up in the p90 or the p99 of the service-level measurement, and only begins to surface around the p99.9. Similarly, even if 25% of the calls to the endpoint fail, the error rate for the service will only be about 0.04%, perhaps leading you to think that there is no cause for worry.
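To make the arithmetic concrete, here is a minimal sketch that plugs in the figures from the example. The latency values (every busy call at 100 ms, every rare call at 30 seconds) and the assumption that the busy endpoint never fails are illustrative simplifications, not measurements.

```python
import math

# Traffic over one hour, using the figures from the example above.
busy_calls = 1000 * 60   # 1000 calls per minute
rare_calls = 100         # 100 calls per hour
total_calls = busy_calls + rare_calls

# Share of all traffic that goes to the rare endpoint.
rare_share = rare_calls / total_calls

# Service-level error rate if 25% of rare-endpoint calls fail
# and (by assumption) the busy endpoint never fails.
service_error_rate = (0.25 * rare_calls) / total_calls


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Assumed latencies: every busy call takes 100 ms, every rare call 30 s.
latencies_ms = sorted([100] * busy_calls + [30_000] * rare_calls)
service_p99 = percentile(latencies_ms, 99)

print(f"rare endpoint share of traffic: {rare_share:.3%}")
print(f"service-level error rate:       {service_error_rate:.3%}")
print(f"service-level p99 latency:      {service_p99} ms")
```

Even with every call to the rare endpoint taking 30 seconds, the service-level p99 in this sketch stays at the busy endpoint's 100 ms.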

But that relatively low-frequency API could be a critical part of your business getting paid! Or it could be annoying one of your early-adopter users exploring a new feature. Treating all calls as if they’re the same masks important signals.

Service-level monitoring can have false positives, too

One way to mitigate these problems is to set an absolute threshold for error count rather than a relative one. An alert can go out whenever the service emits a few 5xx responses, even if it’s a relatively small fraction of all traffic.

That’s even the situation at Akita, on our current Datadog-based monitoring. We get statistics from our Amazon Elastic Load Balancer, and trigger an alert whenever the absolute number of HTTP error responses exceeds a small threshold. One of our services doesn’t shut down gracefully, so we will sometimes get an alert on a new deploy, when the service exits without responding to the load balancer. It’s a very small proportion of calls that fail – well within our error budget for this service – but the ELB-based monitoring doesn’t have a way to tell the difference between that and one of the customer-facing services failing due to a logic error in the new code.

We’ve heard similar stories from early users. They have an endpoint they know is bad, but just don’t have time to fix right now. A service-level threshold is either too high to catch real problems, or low enough that a known problem triggers it.
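The dilemma can be sketched in a few lines. The endpoint names, error counts, and thresholds below are all hypothetical, chosen only to show how a single service-level count cannot separate a known problem from a new one.

```python
# Hypothetical 5xx counts per endpoint over one alert window.
# "/export" is a known-bad endpoint; "/checkout" is normally clean.
quiet_window = {"/export": 12, "/checkout": 0, "/search": 1}
regression_window = {"/export": 12, "/checkout": 6, "/search": 1}


def service_alert(errors_by_endpoint, threshold):
    """Service-level rule: fire when the total 5xx count exceeds the threshold."""
    return sum(errors_by_endpoint.values()) > threshold


def endpoint_alerts(errors_by_endpoint, expected):
    """Per-endpoint rule: list endpoints whose count exceeds their own allowance."""
    return sorted(
        endpoint
        for endpoint, count in errors_by_endpoint.items()
        if count > expected.get(endpoint, 0)
    )


low = 10    # low enough that the known problem trips it on its own
high = 20   # high enough that a real regression can hide under it

# Assumed per-endpoint allowances that tolerate the known-bad endpoint.
expected = {"/export": 15, "/checkout": 0, "/search": 2}

print(service_alert(quiet_window, low))         # True: false positive
print(service_alert(regression_window, high))   # False: missed regression
print(endpoint_alerts(quiet_window, expected))       # []
print(endpoint_alerts(regression_window, expected))  # ['/checkout']
```

With per-endpoint allowances, the known-bad endpoint stays quiet and the new failure on the normally clean endpoint is the only thing reported.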

Real-world services show heavy tails and wide variance

I wanted some real-world data that backs up our intuition that service-level aggregates are poor representations of API behavior. I took a look at the distribution of calls to services that are using Akita to monitor their API endpoints. Two of the services have a very large number of endpoints, while another has just a handful. I found the behavior of individual endpoints to be quite different from each other, meaning setting thresholds across endpoints is unlikely to work well.

The two large APIs both show a heavy-tailed distribution, where a large fraction of calls come from less-common endpoints. The graph below shows what percentage of observed API calls were sent to the top 10 endpoints, compared with the API calls sent to all the remaining endpoints. If your API is like this, then focusing on the top 10 endpoints, or even the top 20, is unlikely to result in an accurate picture of how your API behaves. Because each endpoint is a relatively small component of overall API usage, system-level averages will show only the most egregious errors, such as widespread outage or misconfiguration.

The smaller API (the yellow bar in the graph) is dominated by a single endpoint, but the person responsible for it tells us that it is not the one he pays attention to at all! Three of the endpoints in his service are critical to monitor, and a measurement dominated by the most common one says little about them.

API endpoints vs proportion of calls for a selection of our users.
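If you want to check whether your own API has this heavy-tailed shape, the top-10 share discussed above is easy to compute from per-endpoint call counts. This is a rough sketch; the function name `top_k_share` and the sample counts are made up for illustration.

```python
from collections import Counter


def top_k_share(call_counts, k=10):
    """Fraction of all calls that went to the k most-called endpoints."""
    total = sum(call_counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in Counter(call_counts).most_common(k))
    return top / total


# Hypothetical counts: 10 busy endpoints and a long tail of 190 quieter ones.
call_counts = {f"/busy/{i}": 500 for i in range(10)}
call_counts.update({f"/tail/{i}": 50 for i in range(190)})

share = top_k_share(call_counts, k=10)
print(f"top 10 endpoints: {share:.1%} of calls")
print(f"remaining {len(call_counts) - 10} endpoints: {1 - share:.1%} of calls")
```

In this made-up data set the long tail carries roughly two thirds of the traffic, even though no single tail endpoint is individually busy.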

For monitoring latency, the situation is even worse. Latency is an important metric that both affects user experience and can indicate overload or latent failures. But “normal” latency for an API can vary by orders of magnitude between endpoints.

Service-level metrics vs latency for a selection of our users.

The slowest endpoint in each sample had a 99th percentile latency measurement more than 15x larger than that of the median endpoint. The measurements for the service as a whole are driven by the slowest APIs and the most frequent APIs, rather than the ones which are most important or valuable. Valuable information about the quality of your service is being lost if you don’t measure performance per endpoint.

Per-endpoint monitoring treats each API endpoint as an independent source of data. It does not try to apply the same alert threshold to APIs with 8-second latency and those with 80-millisecond latency. It does not smooth out a high error rate in a key API by mixing it with the other 97% of traffic. If your web application is dominated by a single API endpoint, then service level averages are a good approximation of that single endpoint. But odds are your application’s API mix looks more like the examples above. In that case, service-level averages or percentiles are meaningless indicators, mixing together healthy and unhealthy endpoints into an impossible-to-manage average.
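As a rough illustration of the difference, the sketch below groups latency samples by endpoint and checks each endpoint's p99 against its own threshold. The endpoint names, sample values, and thresholds are hypothetical, and the nearest-rank percentile is just one simple choice.

```python
import math
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(0.99 * len(ordered))) - 1]


def endpoints_over_threshold(samples, thresholds_ms):
    """samples: iterable of (endpoint, latency_ms) pairs.

    Returns {endpoint: observed p99} for endpoints above their own threshold.
    """
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms in samples:
        by_endpoint[endpoint].append(latency_ms)

    breaches = {}
    for endpoint, latencies in by_endpoint.items():
        limit = thresholds_ms.get(endpoint)
        observed = p99(latencies)
        if limit is not None and observed > limit:
            breaches[endpoint] = observed
    return breaches


# Hypothetical window: a normally fast endpoint that has slowed down,
# and a slow reporting endpoint that is behaving as usual.
samples = [("/login", 250)] * 200 + [("/report", 7000)] * 50
thresholds_ms = {"/login": 100, "/report": 10_000}

service_p99 = p99([latency for _, latency in samples])
print(f"service-level p99: {service_p99} ms")
print(endpoints_over_threshold(samples, thresholds_ms))
```

Here the service-level p99 simply reports the slow endpoint's normal latency, while the per-endpoint check singles out the fast endpoint that has regressed.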

The API monitoring gap

If per-endpoint monitoring is the right way to treat services, why aren’t all software teams doing it?

The first problem with monitoring API endpoints is getting the right data. You could instrument every endpoint with metrics. More likely, you’d try a drop-in “magic” package that automatically instruments the router in your web application. Sometimes this works great – but often, the obstacles start to mount.

Upgrading an application, even with a simple one-line change, can be fraught. Often, services span multiple processes and even code bases, so getting a complete picture can mean many changes, not just one, or coordinating multiple teams.

Some applications don’t use a standard web framework or use a less-common language. Or, they use an older unsupported version of the framework and can’t easily upgrade. In some cases even “what are the endpoints” is not clear from looking at the code! These are not best practices, but they are more common in the real world than we might like – and they are the reason so many attempts to roll out API-centric observability and distributed tracing will bog down.

Even at Akita, where we had relatively little code and a relatively modern infrastructure, it was not easy to install Datadog’s “drop-in” APM. I tried out Datadog’s APM tooling on our own Go and Python-based backend; one of the obstacles to getting it working was updating a bunch of Go dependencies that I wouldn’t otherwise have bothered with. Any dependency in common between your application and an APM library is a potential source of conflict, if you are using one version but the APM library needs another. Another downside to this solution was the per-host cost to collect application traces.

But it gets worse. Once you’ve gotten the right data, your problems have just begun. Your API may have hundreds of endpoints, whose behavior you probably don’t really understand. Writing thresholds for each of them by hand is simply not feasible for a team with limited resources and other demands on their time.

There’s a reason people are still using service-level tools like CloudWatch instead of API-centric tools. Sure, these tools only do service-level monitoring, but they are truly drop-in—and service-level metrics are better than no metrics at all. Tools that work as soon as you turn on a load balancer or deploy to your chosen PaaS probably don’t understand your endpoints, but they’re immediately available. An API observability solution has to meet developers where they are, today, not where they’d like to be.

What it takes to bridge the API monitoring gap

API-centric observability must bridge the API monitoring gap in two main ways:

Focus on API endpoints. For an API-centric business, the unit of value is calls to endpoints. It makes sense that developers, especially on small, under-resourced teams, prioritize their development and debugging time in an endpoint-centric way.
Build for developers first. There are companies building API-centric tools with a business analytics or customer tracking focus, for instance Segment and Moesif. But because their main goal is to help sales and marketing teams, developers tend to find themselves caught in the gap between an aggregate tool and a power tool. A developer tool should communicate with developers on their own terms.

Paradoxically, building for developers means asking them to do less in software. Over the course of working on Akita, I’ve identified the two factors that I believe unblock a developer-friendly API-centric solution:

The solution has to be drop-in. One of the main obstacles I’ve seen for people (including myself!) adopting LightStep, Honeycomb, Grafana, or even Datadog is needing to make changes to code they did not write. I’ve talked to potential customers and potential hires who have participated in year-long efforts to instrument all their code, and are still not done. Our requirement for the Akita solution is that it should be usable without any code changes at all.
The solution has to work across frameworks. There are nice tools out there that give you all of the information you want on your endpoints, provided you buy into a single framework, for instance Apollo for GraphQL. From everything I’ve seen across my career and with our Akita users, this is not realistic. We’ve met many teams who are “in the middle” of a GraphQL migration, or a Rust rewrite, or deprecating a PHP service. Often, the migration has been on hold for multiple quarters. Because of this, it is important to my team and me that we build a solution that works across API protocols, web frameworks, and languages: the ones the team is using now, not the ones they hope to be using next year.

While I am working on a solution that meets these goals (and would love for you to try our beta), I hope that you share this vision of what is necessary to meet the needs of current developers.

Drop-in API observability with Akita

At Akita, my team and I have been working on an API observability service that bridges the API monitoring gap. Our solution is a drop-in agent that passively observes network traffic – all traffic, to all API endpoints, no matter what their implementation technology. The Akita agent requires no code changes and no proxies for setup. Because it’s passive, it does not add latency or insert itself into the fast path; the only overhead comes from the CPU and memory that the Akita agent needs to run.

By reducing the implementation and maintenance burden, our approach lets you quickly and easily focus on the API endpoints that are most important to you. Whether they are high-volume or low-volume, high-latency or low-latency, Akita makes it possible to set appropriate API-specific thresholds in a way that’s easy to understand. We guide you through setting a baseline of what normal behavior looks like, show you outliers, and help you group similar APIs together.

In our beta, we’re actively working on better ways to automatically recommend and set thresholds for the hundreds of endpoints you have to manage. This gives you the confidence to deploy knowing that even less-frequently-used endpoints are being monitored, and deviations will be highlighted. Once you’re confident that we’re monitoring the right things, you can start receiving alerts from Akita that tell you specifically which APIs are out of their normal ranges, and by how much, rather than searching through logs and dashboards to understand the reason that a service-level alert was triggered.
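To give a feel for what baseline-driven thresholds can look like, here is a deliberately simple sketch. It is not Akita's algorithm: the rule (baseline p99 times a fixed headroom factor), the function names, and the sample data are all assumptions made for illustration.

```python
import statistics


def recommend_thresholds(baseline, headroom=1.5):
    """Map each endpoint to headroom x its baseline p99 latency (ms).

    The headroom factor is an assumed tuning knob, not a recommended value.
    """
    return {
        endpoint: headroom * statistics.quantiles(latencies, n=100)[98]
        for endpoint, latencies in baseline.items()
        if len(latencies) >= 2  # quantiles() needs at least two samples
    }


def deviations(current_p99, thresholds):
    """Return {endpoint: how many times over its threshold} for breaches."""
    return {
        endpoint: current_p99[endpoint] / limit
        for endpoint, limit in thresholds.items()
        if current_p99.get(endpoint, 0) > limit
    }


# Hypothetical baseline: one 80-millisecond endpoint, one 8-second endpoint.
baseline = {
    "/login": [80] * 100,
    "/report": [8000] * 100,
}
thresholds = recommend_thresholds(baseline)

# Hypothetical current measurements.
current_p99 = {"/login": 300, "/report": 9000}
print(thresholds)
print(deviations(current_p99, thresholds))
```

Each endpoint is judged against its own normal range, so the report names the endpoint that deviated and by how much instead of a single service-wide number.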

If this sounds like a tool that can help you understand your APIs better, we would love to have you join our beta.

Photo by Rebekah Howell on Unsplash.