When Developer Experience Depends on the Tech

by

Jean Yang

“What hard problems are you solving at Akita?” is the question I get asked the most frequently.

Many people, maybe because I was previously a professor of Computer Science, expect that we’re mainly solving technical problems. People are often surprised that this is not the case.

As we’ve been building out Akita’s first product, our goal has been “one-click” observability: helping developers understand the behavior of their software systems with as little effort as possible. This involves answering the following question: what do developers want to know and how should we show this to them?

Here’s the hard part: how we build our technology depends on what our users want, but what our users want depends on what we’re able to deliver technically. And there's no playbook for iterating on product at the same time as deep tech. At Akita, we made “no code changes” a goal after learning about obstacles in adopting existing observability tools—and this gave us technical challenges in spades. But when we’re watching API traffic to automatically infer properties about system behavior, there are things the inference can and can’t do well—and that is at the core of our developer experience challenges. And the only way to test things out is on real, live user data with real, live user feedback.

I initially wrote this post because many people who are thinking about working with us ask this question. I believe that the challenges we face are at the heart of why it’s often hard to build good developer experience for “deep tech” developer tools, so this post may be useful to anyone interested in this question.

Framing the one-click observability problem

The hardest part of working on Akita’s product, similar to most hard problems, is defining the problem in the first place.

Since the time I was a professor, I’d been closely watching the tooling gaps caused by the rise of microservices and APIs. The fact that so much computation went over the network meant that the kinds of techniques I’d been working on (programming language design, static code analysis, and dynamic code analysis) were effective in a smaller and smaller scope. The fact that so much computation went over APIs meant, to me, that APIs were where we should be focusing the tooling.

When I first started thinking about starting a company, I thought the focus was going to be on enforcing system properties across API boundaries. But to validate this hypothesis, I went through my network and called everybody who would talk to me. What I discovered, across engineers, engineering leaders, and security/compliance teams was that there was something higher priority than enforcement: understanding what they were running in the first place. I had user-interviewed my way into the observability space, where there were clear gaps.

By the time Akita became a company, our mission was clear: help developers better understand the behavior of their systems by making observability more accessible. How we were to do this remained to be determined, and has depended on a combination of user research and technical experimentation. Read on for what this looked like.

Watching systems without instrumenting code

After framing the general problem of improving observability, the next question to answer was how we were going to observe systems.

Observability tools need to get logs, metrics, and traces out of systems somehow—and there are many options. For instance, you can integrate via an instrumentation scheme like OpenTelemetry, through eBPF, with containers, with service meshes, and with API gateways. Eventually, it makes sense to integrate across many, if not all of these, but small teams have to pick one or two places to start. For context, many other observability tools require developers to do quite a bit of instrumentation work, but are “power tools” in that they give you fine-grained control over visibility in exchange for lots of up-front work.

From our conversations with users, we learned that people wanted more observability at less cost. That is, the users showing up to our product preferred a “B+ on Day One” solution to an “A+ eventually” solution. Either they had too much legacy code to adopt existing observability tools, or they were not able to convince their organization to get onto the bandwagon of instrumenting all of their code. What they were asking us for was a solution that was as noninvasive as possible.

So our challenge to ourselves became this: how do we watch traffic without requiring developers to significantly change their systems? Initially, we considered integrating with service meshes or API gateways, since those are at the infrastructure layer and would not require changing code. But there has not been as much standardization as we would like with either service meshes or API gateways, meaning that on the Akita side we would have to build many, many different integrations to be able to integrate seamlessly with users—meaning it would take a very, very long time to make observability accessible this way. And because there are still many users (including our own!) who don’t use any service meshes or gateways, we’d be leaving out many users.

This is the point where technical experimentation came in. There were not many solutions left that did not require instrumenting code and did not integrate with a service mesh or API gateway. We considered two: pcap and eBPF, ultimately deciding on pcap because of greater accessibility if and when we got it to work. With pcap, our agent gets access to watch all API traffic as long as the user gives it access to libpcap. As evidenced by our many blog posts about how we got this to work, getting this kind of passive monitoring to work, to work transparently, and to work at scale requires some clever engineering, as well as some back-and-forth with users to make usage ergonomic. Today, a developer can simply install the Akita agent, give the appropriate access, and start watching traffic going across the ports of interest, or all traffic on a given Kubernetes host.
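To give a flavor of what passive monitoring involves once bytes have been captured, here is a minimal sketch in Python. It is an illustration only, not Akita's agent: it assumes the packet capture and TCP reassembly have already happened elsewhere, that the traffic is plaintext HTTP/1.x, and the function name `parse_http_request` is made up for this example.

```python
def parse_http_request(payload: bytes):
    """Return (method, path, headers) for a plaintext HTTP/1.x request.

    Returns None when the payload is not a complete HTTP/1.x request head,
    for example when it is encrypted or the header block is truncated.
    """
    head, separator, _body = payload.partition(b"\r\n\r\n")
    if not separator:
        return None  # the blank line ending the headers was never seen

    lines = head.decode("iso-8859-1").split("\r\n")
    request_line = lines[0].split(" ")
    if len(request_line) != 3 or not request_line[2].startswith("HTTP/1."):
        return None
    method, path, _version = request_line

    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:
            # Header names are case-insensitive, so normalize to lowercase.
            headers[name.strip().lower()] = value.strip()
    return method, path, headers
```

A real agent has to do much more than this (reassembly, pairing requests with responses, handling load), which is where most of the engineering effort goes.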

Making sense of network traffic

Once we made it easy for developers to start running Akita, we had a new problem: what were developers supposed to do with all of this traffic information? This, again, involved a combination of collecting user requirements based on technical limitations discovered based on user requirements.

Our first order of business was to give users something more actionable than raw network traffic to look at. This led us to another place user constraints intersected with technological constraints: modeling the API traffic. There were two questions to answer: what to model and what environment(s) to model it in? What we want to model depends on what users want to look at, and our users were the first to tell us that what they wanted to look at depended on what we could do well. Whether to model the traffic in test, staging, or production depended on what we wanted to model, as well as where users actually had traffic.

Users were asking us to automatically infer structural information about APIs, so we started out by automatically inferring API specifications containing information about endpoints, path arguments, data types and more. It had initially been unclear how good our inference could be on different kinds of properties, as we had to build out new inference algorithms to support this. Our strategy was to build it out and iterate with users. We got the feedback that structural inference was helpful, since people could now understand at least what endpoints they had and export automatically generated API specifications for use with other tools, or home-grown tools.
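As a rough illustration of what structural inference can mean, the sketch below groups observed requests into endpoints, replaces path segments that look like IDs with a placeholder, and records the JSON types seen for each body field. This is a deliberately naive stand-in, not Akita's inference algorithm; the heuristics and the names `template_path`, `type_name`, and `infer_spec` are assumptions made for this example.

```python
import re
from collections import defaultdict

UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def template_path(path):
    """Replace path segments that look like IDs (digits or UUIDs) with {id}."""
    segments = []
    for segment in path.strip("/").split("/"):
        if segment.isdigit() or UUID_RE.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/" + "/".join(segments)


def type_name(value):
    """Map a decoded JSON value to a JSON-Schema-style type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "null"


def infer_spec(observations):
    """observations: iterable of (method, path, decoded JSON object or None)."""
    spec = defaultdict(lambda: defaultdict(set))
    for method, path, body in observations:
        endpoint = f"{method.upper()} {template_path(path)}"
        fields = spec[endpoint]  # registers the endpoint even with no body
        for field, value in (body or {}).items():
            fields[field].add(type_name(value))
    return {
        endpoint: {field: sorted(types) for field, types in fields.items()}
        for endpoint, fields in spec.items()
    }
```

Even this toy version shows why the quality of inference depends on the traffic: a field whose type varies across requests, or an ID that does not look like an ID, can only be modeled well after enough examples have been seen.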

Through building out the solution, we also learned about where to model these properties. In the beginning, we were all about “shift left” and moving system understanding into CI. As the rubber hit the road on integrations, however, we ran into a lot of CI snags. Maybe the traffic didn’t go across the network. Maybe there just wasn’t enough traffic. As more and more users started running Akita across their test, staging, and production environments and giving feedback, we’ve gotten shifted right, hard.

And once we started getting lots of traffic to run our product on, that highlighted another problem: how do we help users digest all of this information? It turns out that, as we had suspected, what developers currently understand and document about their systems is a tiny fraction of all of the traffic that’s actually flying around. While developers definitely preferred getting automatically generated specs to looking at the raw traffic, the feedback we got was they realized they didn’t want to be looking at a sea of endpoints at all. Developers don’t want to know everything: they want to know what matters.

Learning that our users wanted to run us in staging and production led us to build out our product to better support the sheer amount of traffic we get from staging and production, to manage that traffic ergonomically, and to explore how this traffic changes over time in a way that doesn’t make a user’s head explode. We’re now in the process of working on the parts of the product that collect information about API usage in addition to structural information, using that information to help users prioritize what endpoints they see, and showing how everything changes over time.

Helping developers understand their systems

At this point, we’re feeling pretty good about our no-instrumentation way of helping users observe their systems and our methods for modeling API traffic to answer different questions. And while we’re very interested in shifting left again after some time, we’ve decided to focus on staging and production traffic for now.

A lot of the hard questions we’re dealing with today have to do with how we should present users with insights about their system in a way that’s useful, so they’ll keep trusting us to give them reports. Some of the questions that have come up that users want to answer include:

Where are all the places I’ve seen the ‘email’ data type and how is this changing over time?
Do I have any anomalously slow endpoints?
What are the services that are causing my ‘get_user’ endpoint to be slow?
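To make a question like “do I have any anomalously slow endpoints?” concrete, here is one simple way it could be framed: compare each endpoint's median latency with the median across all endpoints. This is a sketch under assumptions, not how Akita answers the question; the function name `slow_endpoints` and the default `factor` of 3 are arbitrary choices for this example.

```python
from statistics import median


def slow_endpoints(latencies_ms, factor=3.0):
    """Flag endpoints whose median latency is far above the typical endpoint.

    latencies_ms maps an endpoint name to a list of observed latencies in
    milliseconds. An endpoint is flagged when its median exceeds `factor`
    times the median of all per-endpoint medians.
    """
    medians = {
        endpoint: median(samples)
        for endpoint, samples in latencies_ms.items()
        if samples  # skip endpoints with no observations
    }
    if not medians:
        return []
    baseline = median(medians.values())
    return sorted(
        endpoint for endpoint, value in medians.items() if value > factor * baseline
    )
```

For example, with three endpoints whose medians are around 3 ms, 22 ms, and 25 ms and a fourth at 450 ms, only the fourth is flagged. Whether that is the right notion of “anomalous” is exactly the kind of thing that has to be settled with users.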

Again, we are iterating across a circular dependency: how we want to show users data depends on what we want to show them, what we want to show them depends on what they want to see, and what they want to see depends on how good of a job we can do showing things to them! For a product like ours, there are additional considerations to take into account when presenting data, for instance:

We’re automatically inferring properties about an API based on traffic we observed. How do we communicate confidence in the properties, based on traffic observed? What algorithms can we develop to infer properties of interest better?
We’re not always watching all traffic. Rather, we’re often subsampling. How do we help the user understand how representative the traffic is? Can we build better subsampling algorithms based on what users want to see?
Sometimes we want to automate tasks for the user and sometimes we want to give them the tools to explore the data in order to come up with the insights themselves. When should we do one versus the other?
When we see an issue, how do we want to present the “evidence” that it might be something a user should pay attention to? How good is the evidence that the technology allows us to present?
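One standard way to express confidence in a property inferred from a sample is a confidence interval on the fraction of observations that support it. The sketch below uses the Wilson score interval; it illustrates the general idea and is not a description of what Akita computes. The function name `wilson_interval` and the scenario in the usage note are assumptions for this example.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%.

    successes: sampled requests in which the property held
    total: sampled requests examined
    Returns (lower, upper) bounds on the true proportion.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denominator = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denominator
    margin = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denominator
    return center - margin, center + margin
```

Suppose a field appeared in every sampled request to an endpoint. With 5 sampled requests, the lower bound on how often the field is really present is only about 0.57; with 500 it is above 0.99. The same “always present” observation deserves very different wording in the product depending on how much traffic backs it up.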

In some sense, every system based on artificial intelligence has to contend with questions like these. But we’re building the Akita product to help users explore a lot of data—and so we need to be careful to expose a good amount of control in exploring that data.

This aspect of our product R&D is the one with the most open questions. We’re actively working on this—and actively looking for people to join us in working on this!

Iterating in the open

There are two ways to build a product. There’s the Steve Jobs way, where you climb into a cave with your team and emerge two years later with the perfect product. Then there’s the way where you put out the smallest product possible and just keep iterating until it’s the right product.

The kind of product we’re building at Akita is especially tricky because we’re simultaneously co-iterating on the product and the underlying technology, while testing on real user traffic the whole time. (Shoutout to our community! We’re extremely grateful for the users who are putting in the work and using Akita before the product is fully articulated.) Had we framed the problem solely as a technical one in the beginning, we would almost certainly have taken an approach based on instrumenting code and not settled on “one-click” observability as the goal. Had we framed the problem solely in terms of what we were able to do with design mocks, we would either have been much more conservative in the underlying technology and not developed a way of automatically processing API traffic—or designed a product that was unbuildable. Iterating on the technology and product in tandem hasn’t always been easy, but we believe it’s cut years off getting to where we want to go.

We hope this blog post is helpful not just to anyone who is thinking about working with us, but also to other teams who are building deep tech that they want people to use. We haven’t seen a lot of other people talk about codesigning deep tech with developer experience, and we would love to hear more from the teams that do this. And if the problems we’re solving sound like fun to you, we’re hiring!

Photo by konderminator on Flickr.