Building Observability for 99% Developers

by Jean Yang

In the last few years, I’ve heard the same statement over and over again from software teams: “We try to be good about testing, but…”

Teams are often embarrassed to tell me. They think they’re alone.

Over the last few years working on Akita, I’ve talked with hundreds of software teams. It turns out that most of them don’t have a good handle on what their code does until production. And even then, their understanding often isn’t as good as they would like.

These are often competent, responsible teams who care about their customers and are making reasonable decisions. How did things come to this?

In this post, I’ll talk about how ̶o̶b̶s̶e̶r̶v̶a̶b̶i̶l̶i̶t̶y̶ system understanding has become a necessary part of developer toolboxes, where existing solutions are leaving behind most developers, and the approach we’re taking at Akita. I’ll also talk about what I’ve learned that led me to stop using the word “observability” and instead use the term “system understanding.”

This is the blog post adaptation of two talks I’ve given this year, one at DockerCon and one at Strange Loop, originally titled “Building Observability for 99% Developers” (and featured in these DockerCon highlights), eventually titled “Building ̶O̶b̶s̶e̶r̶v̶a̶b̶i̶l̶i̶t̶y̶ Software Understanding for 99% Developers.”

What the breakdown of the SDLC means for you

To give you an idea of where I’m going, let me first tell you where I’m coming from.

Before starting Akita, I was a professor of Computer Science at Carnegie Mellon University specializing in programming languages research. The goal of my research was to help software developers build more reliable and trustworthy software—by ensuring that code did what it was intended to do. My field focused on techniques for proving code correct against specifications and designing programming languages with correctness-by-construction guarantees. (Think strongly statically typed languages like Haskell or even Idris.) For years, I dreamed of getting mathematical guarantees on as many lines of code as possible.

Then came the cracks in my reality, the obstacles to code correctness:

The sheer volume of code in the world. There’s a graphic from Information is Beautiful that I’ve used in many of my talks, showing how Google Chrome, the code needed to repair HealthCare.gov, and Microsoft Office 2001 each run to well over a million lines of code, more than the graphic’s entries for “War and Peace x 14” and for bacteria. Correct-by-construction guarantees on all newly written code do not help us with all of the legacy code!
The rise of microservices. Language-level guarantees get subverted by network calls. This is because a static analysis, program verification, or type system needs to assume that network code can do anything, making it much harder to make strong guarantees. The rise of microservices means that there are now network calls all over the place.

The rise of microservices in the last decade, as shown by Google Trends.

The rise of SaaS and APIs. The rise of SaaS and APIs means that not only do most software systems have many software components calling out to each other across the network, but also that they’re heterogeneous and blackbox. It is becoming increasingly hard to expect to be able to reason mechanistically in a principled way about web apps.

Even for people who didn’t have the goal of proving all code in the world correct, it’s undeniable that software development has permanently changed. Today, software systems (and tech stacks) are much more like rainforests than they are like planned gardens: components organically evolve and interact without top-down intervention. The rise of APIs and SaaS makes it easier than ever to quickly add functionality to a system. Because implementations depend on how the components fit together, it’s harder than ever before to predict software behavior by looking at individual services. As a result, “testing” now often happens in production, and not out of regard for the end user. Similarly, “intended behavior” is based on observed behavior rather than a specification that software teams write before implementation.

The software development life cycle, which used to have “planning,” “analysis,” and “design” as three up-front stages before implementation, is now often condensed into a cycle consisting primarily of development and testing. Similarly, many of the behaviors we, as an industry, consider “best practice” have gone out the window. We developed our idea of what software engineering should look like in a very different time.

In thinking about how to update our software practices, I came to believe in what the observability folks were saying about embracing “testing in production.” Here’s a great quote from Charity Majors, the CTO of the observability company Honeycomb:

“Once you deploy, you aren’t testing code anymore, you’re testing systems—complex systems made up of users, code, environment, infrastructure, and a point in time. These systems have unpredictable interactions, lack any sane ordering, and develop emergent properties which perpetually and eternally defy your ability to deterministically test.”

Charity’s main point is that because so much happens outside of the code, we need to invest in tooling for getting a better understanding of our production systems.

The observability gap for 99% developers

Mechanisms for understanding production systems today include using monitoring solutions, which help with identifying when issues might be occurring, and observability solutions, which help developers understand what their systems are doing in order to more easily catch issues early and debug faster. I won’t go over what observability is here because there are plenty of other, better articles describing the core philosophies of that. Here’s what I will say: today’s monitoring and observability tools are still leaving developers behind.

When I first started figuring out the best tool to build for developers, I heard over and over again that these teams often don’t know what their endpoints are or when something bad happens with one of them. Initially, this surprised me, since my understanding was that these teams were using different solutions for logging and monitoring. The more I dug in, however, the more I discovered that managing logs and monitors can often feel like the days of punch card programming, when everything was manual and you needed to really know what you were doing to not get lost in the details. Even for expert teams using state-of-the-art tools, good visibility into where issues may be arising can be elusive.

What I also discovered is that there are a lot of developers outside of the main voices we hear from on monitoring and observability. I first discovered these developers, the “99% developers,” because they showed up to the Akita beta: they were outside of FAANG (Facebook, Amazon, Apple, Netflix, and Google), often outside of Silicon Valley, working for companies that often valued making money over building tech. The environments and needs of these developers are often very different from those of teams that have fully loaded developer productivity groups and can afford to decree a uniform tech stack. (If you want to read more, I’ve written another article about this.)

Stepping back, today’s observability solutions often provide great “power tools” for expert teams. If you know what you’re doing, you can use these tools for a really nice experience understanding what’s going on across your services. This is great for the most senior engineers on a team who need to debug 99th percentile tail latency across a microservices architecture. But if you don’t know the ins and outs of your system, you’re not an expert in the tool, or you want information out of your system you didn’t predict, things get harder. As more and more developers face evolving tech stacks and complex architectures they didn’t build themselves, today’s observability solutions serve an increasingly small fraction of software teams. Accessible observability solutions need to work as easily and as widely as possible across heterogeneous systems, even for developers who inherited a mess of a system.

Building observability for 99% developers at Akita

Based on what we learned about developer needs, we set ourselves the following challenge at Akita. How much can we tell a developer while asking for as little as possible? We wanted to see how far we could get by requiring no instrumentation of the code and no custom dashboards.

To address the “no instrumentation” part of our goals, we focused on the API layer, since API behavior is easily observable without needing access to code or code changes. With that in mind, we built a solution that passively watches traffic using network packet capture via Berkeley Packet Filters. (We have a blog post on how we use GoPacket, if you’re curious.) Users simply needed to give the Akita agent access to watch traffic (through libpcap, for the technically inclined). The first iteration of our tool watched all API traffic across networks and showed the endpoints that it saw. You can think of it as Wireshark, but with automatic API endpoint inference.
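To make “automatic API endpoint inference” a little more concrete, here is a minimal sketch in Python. It is not Akita’s implementation: it assumes the request method and path have already been extracted from captured traffic, and it uses a simple, hypothetical heuristic (numeric, UUID-like, or long hex path segments are treated as parameters) to collapse concrete requests into endpoint templates. The function names and the `{id}` placeholder are illustrative choices.

```python
import re
from collections import Counter

# Heuristic (an assumption for this sketch): a path segment that is all digits,
# a UUID, or a long hex string is treated as a variable, not a fixed name.
_VARIABLE_SEGMENT = re.compile(
    r"\d+"
    r"|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r"|[0-9a-fA-F]{16,}"
)


def infer_endpoint(method: str, path: str) -> str:
    """Collapse one concrete request into an endpoint template."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = [s for s in path.split("/") if s]
    templated = [
        "{id}" if _VARIABLE_SEGMENT.fullmatch(s) else s for s in segments
    ]
    return f"{method.upper()} /" + "/".join(templated)


def summarize(requests):
    """Count observed requests per inferred endpoint."""
    return Counter(infer_endpoint(m, p) for m, p in requests)


# Hypothetical observed traffic: (method, path) pairs.
observed = [
    ("GET", "/users/42"),
    ("GET", "/users/1007?verbose=true"),
    ("get", "/users/42/orders/9"),
    ("POST", "/users"),
    ("GET", "/health"),
]
endpoints = summarize(observed)
for endpoint, count in sorted(endpoints.items()):
    print(count, endpoint)
```

A real system would need far more than this heuristic (for example, learning from many observations which segments vary), but even the toy version shows how five raw requests reduce to four endpoints a developer can reason about.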

What we quickly learned, however, was that watching traffic alone doesn’t solve the problem. When we first showed our users all of their traffic, it became clear that dumping everything on them was not helpful. They were coming to us to become less overwhelmed, not more. Especially since we were showing them information they weren’t logging, they needed us to provide more guidance about what to pay attention to.

This, dear reader, is where my story wraps back around to what I was doing before Akita. In my past life, I had been focused on showing that code will behave according to a specification that a human wrote. Invert the problem and it’s possible to reuse some of that intellectual machinery for inferring specification from behavior. Today, our endpoint inference builds a model of an API and analyzes API traffic the way a compiler might analyze code, in order to build models of API behavior. In the future, I’d love to infer all kinds of properties about system behavior, from performance invariants to API behavior invariants. My new dream: be able to drop into any system and be able to tell a developer useful properties about the system and how it changes.
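As a rough illustration of inferring a specification from behavior, the sketch below derives a tiny schema from observed JSON response bodies: for each top-level field, which types were seen and whether the field was always present. This is a hedged toy example, not a description of how Akita builds its models; it assumes every body is a JSON object, and the function names and output shape are made up for illustration.

```python
import json


def type_name(value) -> str:
    """Map a decoded JSON value to a coarse type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_schema(bodies):
    """Infer observed types and presence for each top-level field.

    Assumes each body decodes to a JSON object (a dict).
    """
    decoded = [json.loads(b) for b in bodies]
    seen_types = {}
    presence = {}
    for obj in decoded:
        for key, value in obj.items():
            seen_types.setdefault(key, set()).add(type_name(value))
            presence[key] = presence.get(key, 0) + 1
    return {
        key: {
            "types": sorted(seen_types[key]),
            "required": presence[key] == len(decoded),
        }
        for key in seen_types
    }


# Hypothetical response bodies observed for one endpoint.
bodies = [
    '{"id": 1, "name": "Ada", "email": "ada@example.com"}',
    '{"id": 2, "name": "Grace"}',
    '{"id": 3, "name": null, "admin": true}',
]
schema = infer_schema(bodies)
for field_name, info in sorted(schema.items()):
    print(field_name, info)
```

The inferred result is only as good as the traffic observed: a field marked required here is simply one that has not yet been seen missing, which is exactly the trade-off of treating observed behavior as the specification.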

Towards this dream, we’re starting by building the most dead simple, per-endpoint API monitoring. It turns out that endpoint inference unlocks all kinds of drop-in system understanding, including per-endpoint performance and error monitoring. We’re just at the beginning of summarizing what we’re able to learn from this—and excited to help users understand their service behavior better. (Check out our docs here and join our beta here.)
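For a sense of what per-endpoint performance and error monitoring can look like once endpoints are known, here is a small Python sketch. It assumes each observed call has already been reduced to a record of (endpoint, HTTP status, latency in milliseconds), counts status codes of 500 and above as errors, and uses a nearest-rank percentile. All names and the sample data are hypothetical, and this is not a description of Akita’s product internals.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EndpointStats:
    latencies_ms: list = field(default_factory=list)
    errors: int = 0

    @property
    def count(self) -> int:
        return len(self.latencies_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.count if self.count else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile of observed latencies, for 0 < p <= 100.

        Raises IndexError if no latencies have been recorded.
        """
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


def aggregate(records):
    """Group (endpoint, status, latency_ms) records by endpoint."""
    stats = defaultdict(EndpointStats)
    for endpoint, status, latency_ms in records:
        entry = stats[endpoint]
        entry.latencies_ms.append(latency_ms)
        if status >= 500:  # assumption: only 5xx responses count as errors
            entry.errors += 1
    return stats


# Hypothetical observations.
records = [
    ("GET /users/{id}", 200, 12.0),
    ("GET /users/{id}", 200, 15.0),
    ("GET /users/{id}", 500, 240.0),
    ("GET /users/{id}", 200, 11.0),
    ("POST /users", 201, 30.0),
]
stats = aggregate(records)
for endpoint, entry in sorted(stats.items()):
    print(endpoint, entry.count, entry.error_rate, entry.percentile(99))
```

Keying everything by endpoint is the design choice that matters here: the same latency and error numbers are much easier to act on when they are attached to a specific operation rather than to a service as a whole.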

What I’ve learned since the first time I gave this talk

Almost all developers are “99% developers.” When I first started talking about the problems of “99% developers,” I thought of them more similarly to how Scott Hanselman describes them in his blog post about “Dark Matter Developers,” as quietly working for companies that you haven’t heard of, using technologies like ASP.NET and VB6. As I started talking more about the problems of the “99% developers,” more and more developers told me their stories. Legacy subsystems, high growth, and work by third-party contractors are among the myriad of reasons why even hot Silicon Valley unicorns have software processes that look quite different from what developer-influencers are talking about.
“Observability” is often perceived to be a luxury good. When I first gave this talk, I thought of what we were doing as democratizing observability. This framing caused us to attract many more observability enthusiasts, rather than the underserved developers we were hoping to reach! Over the course of the last few months, we realized we were facing the following problem: few software development teams are happy with their monitoring solutions, let alone their observability setup. Teams associate even setting up monitoring with a certain amount of work that they need to be ready for, and they’re often not ready to start on their observability journey until after that. We realized that, even though we were providing system understanding at lower cost than many monitoring solutions, and even though that understanding was what improved the monitoring, framing it as “observability” was not helping us.
Most software teams need dead simple monitoring and observability solutions. In my blog post “From Balenciaga to Basics,” I talked about how we pivoted from building a fuzzer to building an API observability tool after we realized that developers needed a much simpler solution than what we set out to build. Building out our observability solution has been a process of making our product even simpler and even more “basics.” Over the course of the last year, we’ve learned that the state of many teams’ monitoring is far from what they’d like it to be—and definitely far from what it’s “supposed” to be. Understanding just how much teams need more accessible monitoring and observability is a big part of what motivates our team.

As you can see, my understanding of ̶o̶b̶s̶e̶r̶v̶a̶b̶i̶l̶i̶t̶y̶ system understanding is, as they call it in the news business, an “evolving story.” I’d love to hear from more software developers (and others!) about the patterns you’ve seen around software understanding. I’d also love for you to try out the Akita beta.

Photo by Fotis Fotopoulos on Unsplash.