July 13, 2021 · 10 Min Read

Monitoring Akita’s Services With Akita

by Mark Gritter
Dog watching another dog on the beach

At Akita, we’re working towards the goal of one-click observability: being able to drop in across your systems and start telling you what’s going on, without requiring developer effort to instrument service by service or to build the big picture by querying logs and traces.

A big part of one-click observability is the ability to watch API traffic across test, staging, and production environments. Toward that end, we’ve been improving our capabilities for continuously monitoring a service in staging and production, and of course we use those capabilities ourselves! We run Akita in Kubernetes (using Amazon EKS) and run the Akita CLI on each host to capture traffic to and from all of our internal services. We’re hoping to use Akita to help us keep track of our endpoints and data structures, monitor which operations are heavily used or slow, and catch breaking changes across our services and our public-facing API.

In this post, I talk about our setup and experience monitoring Akita’s own services using Akita. I’ll show how we set up Akita, how we automatically generate API models from our staging and production environments, and the questions I’m able to answer from our traffic.

Setting up continuous monitoring

The Akita agent works by passively watching API traffic in any environment where it can get access to traffic. I’m going to show how to drop Akita into a Kubernetes cluster to see all the API endpoints without instrumenting every service.

We use a Kubernetes DaemonSet, a controller that ensures that every node in our cluster is running the Akita CLI. The documentation in Capturing Packet Traces in Kubernetes shows how we set this up, and I’ve posted a snapshot of the DaemonSet configuration as this Gist. Each pod running the Akita CLI runs in host networking mode so that it can see all network traffic to other pods on the same node. The CLI captures a trace for an hour and then exits; Kubernetes automatically restarts the pod to begin a new trace. Each trace contains obfuscated versions of every HTTP request and response.

Akita has two clusters, “staging” and “production.” The pod template in the DaemonSet sets the AKITA_DEPLOYMENT environment variable, which causes the CLI to tag the resulting trace for special handling and mark it with the environment it came from.

We also add other tags, including one that says which version of our mono-repo was used to build the deployed software. That will enable us, in the future, to associate changes in the API behavior with changes in the code.
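For reference, here’s a heavily simplified sketch of what such a DaemonSet can look like. The Gist linked above is the authoritative version; the image name, labels, secret names, and exact `apidump` arguments below are illustrative assumptions rather than our real configuration.

```bash
# Sketch only: field values here are placeholders, not Akita's real config.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: akita-capture
spec:
  selector:
    matchLabels:
      app: akita-capture
  template:
    metadata:
      labels:
        app: akita-capture
    spec:
      hostNetwork: true                    # see traffic to every pod on the node
      containers:
        - name: akita-cli
          image: akitasoftware/cli:latest  # assumed image name
          # Capture for one hour, then exit; Kubernetes restarts the pod,
          # which begins the next hourly trace.
          args: ["apidump", "--service", "my-service", "-c", "sleep 3600"]
          securityContext:
            capabilities:
              add: ["NET_RAW"]             # packet capture needs raw sockets
          env:
            - name: AKITA_DEPLOYMENT
              value: "staging"             # "production" in the other cluster
            # Credentials would normally come from a Secret (names assumed):
            - name: AKITA_API_KEY_ID
              valueFrom:
                secretKeyRef: {name: akita-secrets, key: api-key-id}
            - name: AKITA_API_KEY_SECRET
              valueFrom:
                secretKeyRef: {name: akita-secrets, key: api-key-secret}
          # Extra tags (e.g. the mono-repo version) can also be attached to the
          # trace; a --tags key=value argument is assumed for that.
EOF
```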

After seeing the first few traces, I added `--path-exclusions` to filter out some very common but uninteresting endpoints: the liveness checks made by Kubernetes, which occur every minute. Filtering happens before rate limiting is applied, so removing these ensures that our per-minute request “budget” is spent only on meaningful traffic.
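Concretely, the exclusions are regular expressions over the request path. The paths below are placeholders (your probes may hit different endpoints), and the rest of the invocation carries over the assumptions from the sketch above:

```bash
# Drop Kubernetes health-check paths before rate limiting is applied.
# The path regexes are placeholders; the flag can be repeated.
akita apidump --service my-service -c "sleep 3600" \
  --path-exclusions "^/healthz$" \
  --path-exclusions "^/readyz$"
```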

Checking the traces

Once I’ve set up the Kubernetes DaemonSet to run Akita, I can use `kubectl` to see that the pods have been created, and check their logs. I could visit the Akita dashboard associated with my service and see that traces have been created. But I might want to check in more detail that things are working as intended.
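Those first checks are a couple of `kubectl` one-liners; the label selector assumes the naming from the DaemonSet sketch earlier:

```bash
# Verify the DaemonSet scheduled a capture pod on each node,
# then spot-check the logs of one of them.
kubectl get pods -l app=akita-capture -o wide
kubectl logs akita-capture-abcde --tail=20   # pod name is a placeholder
```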

The recently introduced `akita get` command lets me check that traces are being generated and see all the tags that have been attached. By default, the command lists the ten most recent traces, though I could filter by tag or ask for a longer list:

Listing recent traces with `akita get`.
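Roughly, the invocations look like this; treat the subcommand and flag spellings as a sketch and consult `akita get --help` for the exact syntax:

```bash
# List the most recent traces for a service, then narrow to one deployment by tag.
# Flag and subcommand spellings are approximate -- see `akita get --help`.
akita get traces --service my-service
akita get traces --service my-service --tags x-akita-deployment=production
```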


Automatically Creating API Behavior Models

One of the ways Akita helps with one-click observability is by building API behavior models from API traffic. You can think of an API behavior model as how a compiler or debugger might represent a program: it includes structured information about API endpoints and types, as well as per-endpoint performance information.

Every six hours, Akita automatically creates an API model from all the traces with a matching deployment name (the one specified in the AKITA_DEPLOYMENT environment variable, and stored using the x-akita-deployment tag). This model covers the past 24 hours, so it’s a rolling window into the API’s behavior. Under the “Staging/Production” tab, the Akita web console lists all of the models associated with a service.

In our configuration, we collect traces every hour. Keeping the traces smaller (just one hour at a time) helps us if we want to download one for debugging purposes. An alternative is to create a single trace for the lifetime of the Kubernetes pod that is running the agent (by omitting the `-c "sleep 3600"` option) and extract time periods to look at using the `--from-time` and `--to-time` flags of the `apispec` command. We are incrementally building the flexibility to build a model of any desired time period, regardless of how it was initially collected.
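If you take the single-long-trace route, extracting a window looks roughly like this. The trace and spec identifiers are placeholders, and the timestamp format is an assumption:

```bash
# Build a model from just one slice of a long-running trace.
# Trace/spec names and the timestamp format are placeholders.
akita apispec --from-time "2021-07-12 09:00" --to-time "2021-07-12 17:00" \
  --traces akita://my-service:trace:long-running-capture \
  --out akita://my-service:spec:business-hours
```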

In the screenshot below, I’ve filtered the list to only include the models created on the 6-hour schedule, using the tag `x-akita-created-by=schedule`. This excludes, for instance, manually created models.

Managing staging/production models on the Akita console.
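The same filtered listing should also be reachable from the command line, assuming `akita get` has a `specs` subcommand that accepts the same style of tag filter as the trace listing earlier:

```bash
# List only the models created on the six-hour schedule (flag names approximate).
akita get specs --service my-service --tags x-akita-created-by=schedule
```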

The table under “staging/production” shows the time range covered by each model, the deployment name, the number of API endpoints discovered, and the number of events used to create the model. When the “Events” column shows two values, it’s because some of the captured events were not usable for the model: for example, they were filtered out, or contained only one half of the HTTP transaction. Internally at Akita, we send production traces to the staging environment and staging traces to the production environment, so that we can use features like this in staging before releasing them.

In addition to these rolling 24-hour models, we also create one aggregate model for the entire lifetime of the service. This can be useful for finding endpoints that may not have been used in the past day. In the future, we’ll have more sophisticated ways of selecting the time range you want to see, or resetting the aggregate model. (Today, you can delete it, and Akita will start over with the next automatically created model.) The aggregate model always appears on top of the list of staging and production models:

List of models including the aggregate model.

You can see in our example that the aggregate model contains a lot more endpoints than the most recent one!  Some API calls are uncommon; the aggregate also contains a bunch of endpoints that were filtered from later traces.

Exploring Akita’s APIs

Now I’m going to show you a couple of things you can do when exploring your API models, including:

  • Looking at all of your endpoints.
  • Drilling down into an endpoint with specific characteristics, in this case a curious response code.
  • Searching for unauthenticated endpoints.

First, let’s look at the “one big model” of all traffic over time. Because we set up the Akita agent to capture from all services, we see endpoints from each of Akita’s services.  The API front-end, the “integration gateway”, and our “King’s cross” service endpoints are all visible in this aggregate model:

The Akita web console's list of Akita's endpoints over all time.

The “delete service” API is visible here in the aggregate, even though nobody has used it in the past 24 hours as I’m writing this post.

Now let’s look at the most recent 24-hour model. In this model, a user can see not just the list of endpoints, but also the number of observed calls to each of them, and a 99th percentile latency measurement. The filters at the top of the page show which data types, authentication methods, and HTTP response codes have been observed. Akita infers all of the information contained in these summaries by looking at your API traffic, without needing additional annotations.

A closer look at one of our API models.

For example, 409 is an unusual error code. When I select it as a filter, the Akita console lists the endpoints which had returned that error, and brings the relevant fields in the request or response to the top of the endpoint’s description.

Filter on endpoints with 409 response code.

In the picture above, you can see that the fields in a 409 response are quite different from those in the normal 201 responses. The 201 responses have a service id, a name, and so on, but the 409 response has only a request id, a message, and an error code. In this case, the 409 means that a learn session with the given name already exists; the response includes a plain-text message saying so, but no details of the existing trace session. This is an example of seeing exactly which details are available for each specific response code, which is important for understanding what a user of the API will have to work with.

I am also curious about what authentication methods are used, and for which endpoints. The summary view tells me that both basic authentication and bearer authentication are used to access this API, which I hadn’t known before! My coworker Cole tells me that the CLI uses basic auth, while the UI uses bearer auth.

Authorization fields of one of our endpoints.

How about unauthenticated endpoints?

Looking at which endpoints have no authentication.

This tells me that the internal services are unauthenticated (which I expected), but that the webhook from our GitHub integration also lacks an authentication header! Is that something we need to fix?

We’re covered in this case; a closer look at the headers of the request shows that it’s using X-Hub-Signature-256, as recommended in the GitHub documentation:

Digging in to see that the endpoint is using X-Hub-Signature-256 for authentication.
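For context, that header carries an HMAC-SHA256 of the raw request body, keyed with the webhook’s shared secret and prefixed with `sha256=`. Here’s a minimal sketch of recomputing it, with placeholder variable and file names:

```bash
# Recompute GitHub's X-Hub-Signature-256 value for a captured request body.
# $WEBHOOK_SECRET and payload.json are placeholders for the shared secret
# and the raw request body.
expected="sha256=$(openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" < payload.json | awk '{print $NF}')"
echo "$expected"   # must match the X-Hub-Signature-256 header on the request
```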

This sparked an internal to-do: begin listing that header as one of the recognized authentication methods. If the GitHub webhook were not authenticated, or if I found other external endpoints being used without authentication headers, that would be something to fix immediately.

Something else you can do with API models is take a diff between two of the generated models, to see if there are any changes over time.  For example, we can see that the GitHub webhook has started including a lot more information than previously. Some Akita-specific headers are now absent; I suspect they may have been a result of traffic from an internal testing tool rather than being sent from GitHub.


Diffing API models to identify breaking changes.

And this is just the beginning of what you can do! My team and I are working hard every day to add capabilities to better support your API-centric observability needs.

Conclusion

I’ve shown how we use Akita on Akita to model our own API endpoints. We set up continuous monitoring in our Kubernetes environment. Marking those traces as coming from a deployment environment kicks off automatic model creation every six hours, as well as an aggregate model covering the entire history of the service. Additional tags help us identify the source of traces and look at models covering particular versions of the software.

Exploring the API is easy: we can find a particular endpoint and look at its request and response fields.  We can filter by response code to find unusual cases, or by authentication to find unauthenticated endpoints. We can diff between two continuously-built models to see what’s changed over time.

We’re actively building more features that build upon these capabilities. In the future, we’ll be able to show which services are calling which endpoints within the cluster, and show changes in that behavior as well.  You will be able to search across all your services for a particular header, field, data type, or endpoint name. But the starting point is just being able to see the truth about what endpoints actually look like in your production and staging environment!

If this sounds useful to you, I’d love to have you sign up for our private beta and give feedback on how we can better help you answer your questions.

With thanks to Jean Yang for comments. Photo by @Nadezhda.
