September 22, 2021
June 16, 2021
Automatically Generating the Graphs that Live In Developers’ Heads
The rise of SaaS and APIs has made it easier than ever to add functionality to your system, but it comes at a cost. Developers are spending more time than ever before reconstructing the context of their code, in order to reason about whether it’s doing what it’s supposed to do.
Here are four important questions that become harder and harder to answer as systems comprise more services and APIs:
Who is using my API?
What APIs am I using?
When something is slow, is it me or one of the services I depend on?
Where did my data come from?
To answer these questions when building features, reviewing code, and debugging, developers need to reconstruct the context of how services interact with each other.
In this post, I’ll first talk about the diagrams that developers are constructing in their heads to answer these questions, explain why it’s hard to automate the construction of these diagrams using most tools, show how passively modeling API traffic makes it possible to answer these questions in a low-friction, low-risk way, and talk about what I’ve been working on at Akita.
For our examples, we used the train-ticket app developed as a microservice benchmark by the Fudan University Software Engineering Lab. We set it up in a three-node Kubernetes cluster. Each node had a single Akita agent running, controlled by a Kubernetes DaemonSet, to monitor all the services on that node. With this setup we were able to capture the traffic into and out of each of the many microservices. The examples we show below reflect inferences made from the captured API traffic.
The API Graphs that Live in Developers’ Heads
The rise of SaaS and APIs has meant that it’s easier for developers to access functionality, but harder for developers to keep track of how everything is related. In this section, we’ll show the diagrams that developers are constructing for each of the key questions we introduced.
1. Who is using my API?
When adding, changing, or removing functionality, you can be a lot more aggressive if you know exactly who is, or isn’t, using your API. Understanding your API consumers gives insight into which teams need to be alerted to upcoming changes, and can inform a developer when their service becomes a dependency of a new service. In some cases, a developer might use this to encourage their clients to use the API more efficiently or to stop using a deprecated API.
For example, in the Train Ticket app, `ts_travel_service` is responsible for answering queries about train routes and trips: which cities will be visited, which trains will be used, and the available tickets. Perhaps we would like to refactor the query syntax in the `travelservice/trips` endpoint to better reflect a downstream change, or abandon it altogether in favor of a new GraphQL query. Many services interact with `ts_travel_service`, but which other services depend on this specific endpoint?
Below, we show an API-centric dependency diagram that would help answer this sort of question.
Understanding the relationship endpoint by endpoint allows a developer to see not just an edge in the “service graph”, but what the actual dependencies are, and their place in the application. Unfortunately, developers often end up constructing this in an ad-hoc fashion, based on a partial understanding of the system.
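As a rough illustration of the endpoint-level view described above, the sketch below builds a "who calls this endpoint" index from a list of observed calls. It is a minimal sketch, not how any particular tool does it: the `ObservedCall` record, the HTTP methods, the `/travelservice/routes` path, and the `ts_ui_dashboard` caller are all assumptions made up for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class ObservedCall(NamedTuple):
    caller: str  # service that sent the request
    callee: str  # service that received it
    method: str
    path: str


# Hypothetical observations; callers, methods, and the second path are made up.
calls = [
    ObservedCall("ts_admin_travel_service", "ts_travel_service", "POST", "/travelservice/trips"),
    ObservedCall("ts_ui_dashboard", "ts_travel_service", "POST", "/travelservice/trips"),
    ObservedCall("ts_ui_dashboard", "ts_travel_service", "GET", "/travelservice/routes"),
    ObservedCall("ts_admin_travel_service", "ts_travel_service", "POST", "/travelservice/trips"),
]


def consumers_by_endpoint(observed):
    """Map (callee, method, path) -> {caller: number of observed calls}."""
    index = defaultdict(lambda: defaultdict(int))
    for c in observed:
        index[(c.callee, c.method, c.path)][c.caller] += 1
    return {endpoint: dict(callers) for endpoint, callers in index.items()}


consumers = consumers_by_endpoint(calls)
trips_consumers = consumers[("ts_travel_service", "POST", "/travelservice/trips")]
print(trips_consumers)
```

Keying the index on the endpoint rather than on the service is what separates this from a plain service graph: a caller that only uses a different endpoint of `ts_travel_service` does not show up as a consumer of `travelservice/trips`.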
2. What APIs am I using?
Developers need help understanding their own dependencies too! Even if you’re responsible for maintaining a service, you might not be the original author, or it may have been a couple of years since you last looked at it.
For example, we might find that unit test coverage in `ts_travel_service` is not very good, but we’re unsure what APIs need to be mocked, and what the type of requests and responses are. Or, the security team is locking down access to a sensitive service, such as the `ts_order_service` that contains payment information; we need to understand and explain what information we’re collecting.
The reverse of the above picture tells developers what APIs they are actually using, and it is something developers end up constructing manually today. Again, an API-centric approach expresses this dependency as particular endpoints, rather than the entire service.
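To make the mocking scenario concrete, here is a minimal sketch that lists the endpoints a service calls and summarizes the structure of each observed response body. The downstream service names, endpoint strings, and JSON payloads are hypothetical, and the `shape` helper is a deliberately naive summary (for arrays it only looks at the first element).

```python
import json


def shape(value):
    """Summarize a JSON value by its structure rather than its contents."""
    if isinstance(value, dict):
        return {k: shape(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return [shape(value[0])] if value else []
    if isinstance(value, bool):  # checked before numbers: bool is a subclass of int
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    return "string"


# Hypothetical observations: (caller, callee, endpoint, response body).
observed = [
    ("ts_travel_service", "ts_route_service", "GET /routes/{id}",
     '{"id": "r1", "stations": ["a", "b"], "distance": 120}'),
    ("ts_travel_service", "ts_train_service", "GET /trains/{id}",
     '{"id": "t1", "economyClass": 50, "active": true}'),
    ("ts_admin_travel_service", "ts_travel_service", "POST /travelservice/trips", "[]"),
]


def outgoing_shapes(service, observations):
    """Map (callee, endpoint) -> response shape for calls made by `service`."""
    result = {}
    for caller, callee, endpoint, body in observations:
        if caller == service:
            result[(callee, endpoint)] = shape(json.loads(body))
    return result


mocks_needed = outgoing_shapes("ts_travel_service", observed)
for endpoint, response_shape in mocks_needed.items():
    print(endpoint, response_shape)
```

The output is a starting list of what a unit test for the service would need to mock, along with the rough type of each response.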
3. When something is slow, is it me or one of the services I depend on?
To debug application slowness, developers often need a picture of the interaction between multiple services. Below I show a sequence diagram I extracted from the train ticket application’s trace, showing all the API calls in a single user transaction, ordered by time.
From the timestamps in this diagram, we can see that the bulk of the time was spent inside `ts_admin_travel_service`. There was a 50 millisecond delay before it made its outgoing call to `ts_travel_service`, and almost 100 ms after that before responding to `train_ticket`.
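The "is it me or my dependency" reasoning can be sketched as a self-time calculation: for each call, subtract the time spent waiting on the calls it triggered. The timestamps below are invented to roughly match the description above (a 50 ms delay before the outgoing call and about 95 ms after it returns); the `Call` record and its fields are assumptions, and the sketch assumes child calls do not overlap in time.

```python
from typing import NamedTuple, Optional


class Call(NamedTuple):
    call_id: str
    service: str           # service handling the call
    start_ms: float        # request received
    end_ms: float          # response sent
    parent: Optional[str]  # call_id of the call that triggered this one


def self_time_by_service(calls):
    """Time each service spent on its own work, excluding downstream calls.

    Assumes the children of a call run one after another (no overlap).
    """
    child_time = {}
    for c in calls:
        if c.parent is not None:
            child_time[c.parent] = child_time.get(c.parent, 0.0) + (c.end_ms - c.start_ms)
    totals = {}
    for c in calls:
        own = (c.end_ms - c.start_ms) - child_time.get(c.call_id, 0.0)
        totals[c.service] = totals.get(c.service, 0.0) + own
    return totals


# Hypothetical timestamps for a single user transaction.
transaction = [
    Call("a", "ts_admin_travel_service", 0.0, 165.0, None),
    Call("b", "ts_travel_service", 50.0, 70.0, "a"),
]

self_times = self_time_by_service(transaction)
print(self_times)
```

With these made-up numbers, almost all of the 165 ms is attributed to `ts_admin_travel_service` itself rather than to the service it depends on.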
4. Where did my data come from?
If a response is buggy, a developer needs to know which service was the original source of the bad data. Developers may also need to know where data came from in order to figure out how to adequately protect it. Unfortunately, a multi-service environment makes the question of “where did this data come from” much harder to answer.
For example, suppose that manual testing of the Train Ticket app in the staging environment shows that the administrator’s list of purchased tickets is sometimes incorrect. It may contain a duplicate ticket, or a ticket that should not exist. To show how a developer would track this down, I made the following diagram of the service relationships and data flows.
In this visualization, we see that the final response to the user contains an array of objects. Most of the array comes from `ts_order_service`, but the final object came from `ts_order_other_service`. Perhaps we were only looking at one of the services, but the real problem lies in the literal “other” service, or in the merging of the two responses.
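One simple way to approximate this kind of data-flow tracking is to compare each object in the final response against the objects returned by upstream services. The sketch below does exact matching on canonicalized JSON; the order payloads are invented, and real traffic would likely need fuzzier matching (renamed fields, partial objects) than this assumes.

```python
import json


def canonical(obj):
    """Serialize with sorted keys so equal objects compare equal as strings."""
    return json.dumps(obj, sort_keys=True)


def attribute_items(final_items, upstream_responses):
    """Pair each item in the final response with the services that returned it.

    upstream_responses maps a service name to the list of objects it returned.
    """
    origin = {}
    for service, items in upstream_responses.items():
        for item in items:
            origin.setdefault(canonical(item), []).append(service)
    return [(item, origin.get(canonical(item), [])) for item in final_items]


# Hypothetical payloads; field names and values are made up.
upstream = {
    "ts_order_service": [
        {"orderId": "o-1", "trainNumber": "G1234"},
        {"orderId": "o-2", "trainNumber": "G1235"},
    ],
    "ts_order_other_service": [
        {"orderId": "o-9", "trainNumber": "K1345"},
    ],
}
final_response = [
    {"orderId": "o-1", "trainNumber": "G1234"},
    {"orderId": "o-2", "trainNumber": "G1235"},
    {"orderId": "o-9", "trainNumber": "K1345"},
]

attributed = attribute_items(final_response, upstream)
sources = [services for _, services in attributed]
print(sources)
```

An item with an empty source list would point at the merging step, while an item attributed to both services would be a candidate for the duplicate-ticket bug.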
Where Developers Aren’t Getting Help Today
As useful as these visualizations and models are to a developer, today they mainly live in developers’ heads, scribbled on whiteboards, or in a quickly outdated wiki page, rather than as the output of an observability tool. In this section, I’ll talk about the obstacles to automatically generating these graphs today.
The limitations of instrumentation
Application Performance Monitoring (APM) and observability tools make it possible to generate graphs that look like this, but they require instrumentation beyond what people normally do. For instance, getting these graphs in Datadog requires adding language-specific configuration or annotations to each service in your application. Any tool that requires instrumenting your code faces three fundamental limitations:
Need to instrument one service at a time. It takes work to instrument code for a new tool. Even if the work is only a few lines in each service, this starts introducing significant friction as the number of services grows, especially when service ownership crosses team boundaries.
Hard to do with unfamiliar code. Many of the services that developers want to understand (and sometimes quarantine) are legacy services. The instrumentation burden increases when the developer has to figure out how to work with an unfamiliar toolchain, or with code that everybody would rather you not touch.
Need to control the code. It’s virtually impossible to instrument (and maintain the instrumentation) for code you don’t control. Does your application depend on a third-party service or open source project that hasn’t implemented OpenTelemetry yet? Then your spans will be incomplete.
Logs, metrics, and traces don’t tell the whole story
Even after instrumentation is in place, developers need to have more structure in their heads than what appears in logs, metrics, and traces. For example, in order to fix bugs or resolve outages that have escaped local testing, developers must know the semantics associated with error codes; the relationships between different services and APIs; the data formats associated with similar APIs; the consistency semantics of the services they use; or the cost of performing certain calls. This higher-level structure can be difficult or impossible to make manifest in a local unit test. Most of it can only be learned through experience with the system in staging and production environments.
It’s possible to get raw event information from OpenTelemetry, grouped together in a span associated with an individual user request. This or similar mechanisms provide a base level of visibility into “what was involved in servicing this API call?” But, to generate the kinds of graphs I showed, a developer would need to further instrument the code and/or write queries to process the mass of log, metric, and trace data that is produced.
Manual developer understanding doesn’t scale
Keeping everything in a developer’s head (kind of) works for a handful of services. But what if you have dozens of services? Hundreds of services? When the nice, simple sequence diagrams above start looking like this? And things start falling apart a lot more quickly when your service calls out to third-party services.
Because of the work and coordination needed to generate these graphs using existing instrumentation-based observability tools, developers are keeping track of these relationships by hand, often in spreadsheets that they fill out by talking to each of the other teams involved.
Towards Automatically Generating API Graphs
Developers need tools that help them understand and externalize the structure of cross-service communication, in order to continue scaling the number of services and APIs.
To help developers scale their code better, I’ve been working on an alternative, API-centric approach to instrumentation-based observability. The approach involves passively monitoring API traffic in order to infer the structure and properties of the API graph automatically, without needing to instrument the code. On our team, we call our approach API modeling.
In addition to being able to automatically generate the graphs I showed in this post, inferring API models from traffic has the following advantages.
Inference beats instrumentation
The starting point for building an API model is to capture traces in a non-invasive fashion, so that all your services can be monitored from day one. This is important because even in the case where all your services are theoretically under your control, the reality is that organizations have many different teams, each with their own priorities. Any new feature—including improved observability—needs an incremental deployment plan. Getting some “B+” insights from a model today is enormously better than a planned “A+” upgrade nine months from now.
While network APIs don’t show everything, they do reveal key information that isn’t available to single-process tools like static analysis. And with API modeling, you get quick visibility where visibility is effectively impossible to get otherwise, for instance in legacy and third-party services.
An API-centric view lowers the barrier to entry
Today’s APM and observability tools are largely expert tools when it comes to understanding and debugging your systems. With OpenTelemetry, for instance, the developer defines “spans” corresponding to a timed operation. If you know what you want to track, spans give you a lot of precision and control. If you don’t already know what you’re looking for, it’s hard to find it.
API models, on the other hand, impose the structure of API endpoints to help make sense of the mass of information in an API trace. The most basic level of information is what endpoints, data formats, and response codes are in use. Inference makes it possible to additionally organize raw API calls into workflows, dependencies, and data flows. An OpenTelemetry-based tool combines events from related spans into a call tree and timeline view; analogously, an inference tool builds a structured API graph from the collected API traces. API models make it possible to quickly answer API-centric questions, inferring context that a developer would otherwise have to bring to the job themselves.
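As a sketch of the most basic level mentioned above (which endpoints and response codes are in use), the example below collapses raw request paths into endpoint templates and collects the status codes seen for each. The rule for spotting ID-like path segments (all digits or a UUID) is a simplifying assumption, and the `/orders/...` paths are invented for the example.

```python
import re
from collections import defaultdict

# Assumption: a path segment that is all digits or a UUID is a path parameter.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def template(path):
    """Replace ID-like segments with a placeholder, e.g. /orders/42 -> /orders/{id}."""
    parts = path.strip("/").split("/")
    return "/" + "/".join("{id}" if _ID_SEGMENT.match(p) else p for p in parts)


def summarize(raw_calls):
    """Map (method, templated path) -> sorted list of observed status codes."""
    summary = defaultdict(set)
    for method, path, status in raw_calls:
        summary[(method, template(path))].add(status)
    return {endpoint: sorted(codes) for endpoint, codes in summary.items()}


# Hypothetical raw traffic: (method, path, status code).
raw_calls = [
    ("GET", "/orders/1001", 200),
    ("GET", "/orders/1002", 200),
    ("GET", "/orders/9999", 404),
    ("POST", "/travelservice/trips", 200),
]

endpoint_summary = summarize(raw_calls)
print(endpoint_summary)
```

Grouping by template is what turns a stream of individual requests into a small set of endpoints a developer can reason about.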
Automatic inference helps with scaling
As the number of services and endpoints increases, teams are forced to scale in two dimensions. First, they face a potentially quadratic increase in the number of cross-service interactions, stretching the limits of what each individual engineer can track. Second, they must handle the challenge of communication among a growing team, with knowledge split across engineers with different specializations. API models address both, helping human processes scale by making it easier for developers to work with large numbers of APIs.
API models make it possible to capture this complexity by automatically generating graphs of complex systems of APIs. But they also help with understanding that complexity by breaking it into manageable chunks. An explicit API model that transparently displays endpoints and data types makes it possible to highlight what’s worth paying attention to by focusing on what is novel. Event queries and opaque machine learning outputs can obscure these differences, bury them in a sea of data, or fail to connect with the concepts a developer is already familiar with. With API models, it’s possible to highlight new edges in the API graphs, or new behavior (for example, sensitive data, more data, or higher latency) along those edges. Passively modeling API traffic makes it possible to use these models to compare behavior across test, staging, and production. Having explicit, transparent models makes it possible for users to add their own annotations to models, making them a repository of shared knowledge.
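To illustrate what "focusing on what is novel" could look like, the sketch below compares two models, each represented as a mapping from an edge (caller, callee, endpoint) to a typical latency in milliseconds. The representation, the 1.5x slowdown threshold, and all of the edges and numbers are assumptions chosen for the example.

```python
def diff_models(baseline, candidate, slowdown_factor=1.5):
    """Return (new edges, edges that got noticeably slower) in candidate vs baseline.

    Each model maps (caller, callee, endpoint) -> typical latency in ms.
    """
    added = sorted(set(candidate) - set(baseline))
    slower = sorted(
        edge
        for edge in set(candidate) & set(baseline)
        if candidate[edge] > slowdown_factor * baseline[edge]
    )
    return added, slower


# Hypothetical models inferred from two environments.
staging = {
    ("ts_admin_travel_service", "ts_travel_service", "POST /travelservice/trips"): 20.0,
    ("ts_admin_order_service", "ts_order_service", "GET /orders"): 15.0,
}
production = {
    ("ts_admin_travel_service", "ts_travel_service", "POST /travelservice/trips"): 45.0,
    ("ts_admin_order_service", "ts_order_service", "GET /orders"): 16.0,
    ("ts_admin_order_service", "ts_order_other_service", "GET /orders"): 12.0,
}

new_edges, slower_edges = diff_models(staging, production)
print("new:", new_edges)
print("slower:", slower_edges)
```

Because the models are explicit, the diff is expressed in terms of services and endpoints a developer already knows, rather than as an anomaly score.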
Generating API Graphs with Akita
As I alluded to, on my team at Akita Software, we’ve been working on a new kind of API-centric observability tool. Akita builds API models from traffic based on observations from test, staging, and production environments.
Deployment. Akita works by deploying passive, network capture-based collection agents to simultaneously monitor all the deployed services. For instance, you can deploy in CI/CD or in your K8s clusters. The Akita CLI collects request/response information in order to infer relationships between services. This approach scales naturally to a large number of services, improves with time, and helps your team monitor applications at the API level.
Inference. The process of inference works by identifying API calls with matching characteristics, such as network addresses, timestamps, and payload. The Akita inference algorithm examines possible models against the available API traces to see if they are consistent. The resulting model is the “simplest” one that explains the observations. This allows Akita to handle concurrent and overlapping traffic.
The resulting models serve both as a reference for developers, who may lack any formal API documentation, and as a way for Akita to alert on changes in the inferred model. Models are concrete, usable artifacts that are available on day one of a deployment.
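To give a feel for the matching step described under Inference, here is a deliberately simplified sketch. It is not Akita’s algorithm: it only uses two of the characteristics mentioned (network addresses and timestamps), and it uses "tightest enclosing time window" as a stand-in for picking the simplest explanation. The `Exchange` record, the IP addresses, and the timings are all hypothetical.

```python
from typing import NamedTuple


class Exchange(NamedTuple):
    src: str         # address that sent the request
    dst: str         # address that received the request
    endpoint: str
    start_ms: float  # request observed
    end_ms: float    # response observed


def infer_parents(exchanges):
    """For each exchange, list the indices of exchanges that could have caused it.

    A candidate parent was received by the child's sender, and its time window
    fully contains the child's. Candidates are ordered tightest window first.
    """
    result = {}
    for i, child in enumerate(exchanges):
        candidates = [
            j
            for j, parent in enumerate(exchanges)
            if j != i
            and parent.dst == child.src
            and parent.start_ms <= child.start_ms
            and child.end_ms <= parent.end_ms
        ]
        candidates.sort(key=lambda j: exchanges[j].end_ms - exchanges[j].start_ms)
        result[i] = candidates
    return result


# Hypothetical capture with two overlapping user requests.
capture = [
    Exchange("10.0.0.1", "10.0.0.2", "GET /adminorder", 0.0, 165.0),
    Exchange("10.0.0.2", "10.0.0.3", "GET /orders", 50.0, 70.0),
    Exchange("10.0.0.1", "10.0.0.2", "GET /adminorder", 60.0, 300.0),
    Exchange("10.0.0.2", "10.0.0.3", "GET /orders", 220.0, 240.0),
]

parents = infer_parents(capture)
print(parents)
```

Even though the two incoming requests overlap in time, each outgoing call fits inside only one of them, so the sketch can still attribute it. When several windows fit, the ordered candidate list leaves the final choice to a later step, such as comparing payloads.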