April 7, 2021
API Observability: The Way Out of “Testing” in Production
In early 2018, I was a professor at Carnegie Mellon when I noticed something curious.
“We try to be good about integration testing, but a lot of our problems don’t come up until production,” developers would tell me.
Even for the best-tested applications, developers are not sure if a small change will take down their site until they run in production. And once code hits production, tracking down what caused an error often involves playing an indefinite game of log detective.
With the rise of APIs and service-oriented architectures, tools that help developers understand code no longer help them understand system behavior, and tools that help them understand system behavior don’t help them understand code. Formerly monolithic applications are now broken up into services that talk to each other across network APIs, including SaaS APIs like Slack and Stripe. Code-level tools (for instance static analysis, testing tools, and IDE integrations) have diminished in scope. Observability and monitoring tools help developers navigate network logs and traces, but these are only the clues to a mystery: it’s still up to developers to piece together how services are talking to each other.
I left CMU to start Akita because I saw the need for a new kind of developer tool, one that puts the clues together for you about how APIs are talking to each other. This blog post is about our vision for API observability, why inferring models of API behavior is the way to achieve this vision, and the work that we need to do to get from here to there. If this sounds interesting to you, we’d also love to have you join our private beta.
A better way to program accidental distributed systems
Before everything became distributed, heterogeneous, and cross-organization, you could rely on IDEs and debuggers to help you explore function signatures, stack traces, and call graphs. What if we could also have this level of visibility and control when working across services?
Imagine this: your application goes down. You look at the API call stack to see what API calls happened leading up to the failure. You cross-reference with the fine-grained API call graph to see which calls are relevant. You then use the automatically maintained registry to see which new API endpoints, fields, and data types were recently introduced. This helps you quickly reproduce the bug. Once you figure out which API change was responsible, you can look up the pull request that introduced it and find out who is responsible for addressing the problem.
This is what we call API observability. Being able to do all this also means we can tell developers about potential problems with their APIs even before running in production.
What API observability means to us
To support what I just described, we’ll need a way to understand endpoints, fields, data types, and properties like expected data formats and how fields relate to each other. We’ll need a way to understand how services call each other across APIs. We’ll also need a way to capture and convey this information in a structured way, as early in the development cycle as possible. Here is what is necessary for achieving these goals.
We need a way to connect code changes with system behavior changes
If I change this phone number from US to international format, will it cause a cascading failure? Today, the simplicity of pre-production environments makes it hard to predict this—and the complexity of production deployments makes these failures hard to root cause. Connecting code changes with observable behavioral changes will help developers get assurance about their code and fix bugs more quickly. The earlier in the software development life cycle it’s possible to do this, the better.
We need to automate understanding of cross-service behavior
Where does your checkout service send a credit card number after it receives it from your payments service? What’s the rate you can send credit cards without causing a shared data store to fall over? Pre-production techniques like static analyses and tests don’t answer these questions. Monitoring and observability provide the logs and traces to answer these questions, but the developer is again left solving some mysteries. (Does this call to the checkout service depend on that call to the payment service, or were they unrelated calls that happened at around the same time?) Automatically up-to-date service interaction maps will help the developer answer these questions far more quickly than they can today.
We need to be agnostic to language or toolchain
Both a software organization’s needs and the available tools change over time. As a result, any software system of sufficient maturity is a conglomerate of different programming runtimes, data stores, and infrastructure components—many of them blackbox, with source code not available. This was true for programming languages and will continue to be true for API protocols (for instance gRPC and GraphQL), as well as infrastructure (for instance Envoy and Kubernetes). A good end-to-end API modeling solution needs to work across tech stacks.
The missing piece: automatically inferring API models
After considering many different approaches to this problem, my team and I found what we believe to be the best way to automatically understand service behavior while being agnostic to language or toolchain.
The approach? Automatically inferring API models by watching API traffic.
Akita’s API models contain structured information about endpoints, as well as the communication graph across endpoints. The key to the API models is that they abstract over values in raw network traces, for instance inferring path arguments and data formats (such as timestamp, unique ID, and date/time). This makes it easier to find the behavioral changes that matter, both in individual models and in diffs across models. Our flexible approach makes it possible to watch in test, staging, production, or all three.
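To make the idea of abstracting over values concrete, here is a minimal sketch in Python. It is not Akita’s implementation: the format labels, the detection rules, and the function names (infer_format, infer_path_template, infer_model) are all assumptions made for illustration. It collapses concrete path segments into typed placeholders and records a coarse format for each body field.

```python
import re
from datetime import datetime

# Hypothetical detection rule; a real system would use many more formats.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def infer_format(value: str) -> str:
    """Map a concrete value to a coarse data-format label."""
    if UUID_RE.match(value):
        return "uuid"
    if value.isdigit():
        return "integer"
    try:
        datetime.fromisoformat(value)
        return "datetime"
    except ValueError:
        return "string"

def infer_path_template(path: str) -> str:
    """Replace path segments that look like arguments with typed placeholders."""
    segments = []
    for segment in path.strip("/").split("/"):
        fmt = infer_format(segment)
        segments.append(segment if fmt == "string" else "{" + fmt + "}")
    return "/" + "/".join(segments)

def infer_model(requests):
    """Group observed (method, path, body) triples by inferred endpoint.

    Returns {endpoint: {field name: set of observed format labels}}.
    """
    model = {}
    for method, path, body in requests:
        endpoint = f"{method} {infer_path_template(path)}"
        fields = model.setdefault(endpoint, {})
        for name, value in body.items():
            fields.setdefault(name, set()).add(infer_format(str(value)))
    return model
```

With these assumed rules, requests to /users/42 and /users/7 collapse into the single endpoint GET /users/{integer}, which is what lets a model describe behavior rather than individual values.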
API models help connect code changes to service behavior changes
Generating API models makes it possible to tie code changes to behavior changes. Developers can, for instance, compare API models generated on each pull request to identify what API behavior changed. Incomplete tests? No problem! Generating API models from limited test environments provides a good foundation for understanding production API models. And the more tests you have, the more insight you’ll get back about potential breaking API changes.
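As a rough illustration of comparing models across pull requests, the sketch below diffs two models. The model shape (a dictionary from endpoint to field formats), the format labels such as us_phone, and the function name diff_models are assumptions for this example, not Akita’s actual data model.

```python
def diff_models(old, new):
    """Return changes between two models shaped as {endpoint: {field: format}}."""
    changes = []
    for endpoint in sorted(old.keys() - new.keys()):
        changes.append(("endpoint removed", endpoint))
    for endpoint in sorted(new.keys() - old.keys()):
        changes.append(("endpoint added", endpoint))
    # For endpoints present in both models, compare field by field.
    for endpoint in sorted(old.keys() & new.keys()):
        old_fields, new_fields = old[endpoint], new[endpoint]
        for field in sorted(old_fields.keys() | new_fields.keys()):
            before, after = old_fields.get(field), new_fields.get(field)
            if before != after:
                changes.append(
                    ("field changed", f"{endpoint} {field}: {before} -> {after}")
                )
    return changes

# Hypothetical models generated from the base branch and from a pull request.
base = {"POST /users": {"phone": "us_phone", "name": "string"}}
pull_request = {
    "POST /users": {"phone": "international_phone", "name": "string"},
    "GET /users/{integer}": {},
}
for kind, detail in diff_models(base, pull_request):
    print(kind, "|", detail)
```

Because the diff operates on inferred formats rather than raw values, a change such as the phone number moving from US to international format shows up as one line instead of being buried in traffic logs.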
API models help automate understanding of cross-service behavior
The structure of API models, along with the fact that they’re based on watching traffic, makes it possible to fill out the cross-service graph one service at a time. This is pay-as-you-go: the more services that get added to the graph, the more easily a developer can predict the impact of code changes before running in production, and the more quickly developers can root-cause production incidents.
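One way to picture the pay-as-you-go graph is the sketch below, which assumes that watching a service’s traffic yields (caller, callee) pairs. The ServiceGraph class and the service names are hypothetical and only illustrate how each newly observed service extends what the graph can answer.

```python
from collections import defaultdict, deque

class ServiceGraph:
    """Call graph that grows as traffic from more services is observed."""

    def __init__(self):
        self.calls = defaultdict(set)  # caller -> set of callees

    def add_observed_calls(self, observed):
        """Record (caller, callee) pairs seen in one service's traffic."""
        for caller, callee in observed:
            self.calls[caller].add(callee)

    def callers_affected_by(self, service):
        """Return services that directly or transitively call `service`."""
        reverse = defaultdict(set)  # callee -> set of callers
        for caller, callees in self.calls.items():
            for callee in callees:
                reverse[callee].add(caller)
        affected, queue = set(), deque([service])
        while queue:
            for caller in reverse[queue.popleft()]:
                if caller not in affected:
                    affected.add(caller)
                    queue.append(caller)
        return affected

graph = ServiceGraph()
graph.add_observed_calls([("checkout", "payments")])
# Later, traffic from two more services is added to the same graph.
graph.add_observed_calls([("web", "checkout"), ("payments", "ledger")])
print(sorted(graph.callers_affected_by("ledger")))
```

Each call to add_observed_calls can only widen the set of services a change is known to affect, so the graph is useful with a single service and becomes more informative as coverage grows.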
Inferred API models should be agnostic to language and toolchain
Being able to infer these models by watching traffic means that developers don’t have to buy into a specific language, framework, or interface description language in order to get models.
We’re just at the beginning of helping developers put together the pieces towards the fine-grained API graph! If you’re a developer who has felt the pain of programming in modern web environments, we’d love to have you join our private beta and build the future of API observability with us.
With thanks to Will Crichton, Nelson Elhage, Kayvon Fatahalian, Cole Schlesinger, Mike Vernal, Hillel Wayne, and Quinn Wilton for feedback, as well as all participants in my Twitter poll for input on the title of this post.