Why It’s Hard to Catch Bugs You Don’t Know About

by Jean Yang

So you’re still thinking about your last incident, wondering how you could have done better. Next time, you tell yourself, you’re going to know about the issue before a customer reports it.

But, as you probably know, anticipating customer issues is very hard. It’s far easier to catch an issue that you’ve seen before. Unfortunately, the majority of bugs lurk in the poorly understood corners of your system.

In this post, I’ll talk about why it’s so hard to catch bugs you don’t know about, what a solution needs to look like, and how we came to build per-endpoint alerting at Akita to solve this problem. This blog post is deeply influenced by the fact that I spent a lot of the recent long weekend battling a different kind of bug in my yard.

Why it’s hard to catch bugs you don’t know about

Before I talk about software bugs, let’s first talk about my ongoing struggle against… ants.

About a month ago, I moved somewhere new, where for the first time I have a yard. Last Saturday, I got a new dwarf olive tree that I was planning to pot. After procrastinating for a day and a half, I go to pot the tree and… ants. I follow a trail leading up to the tree and… more ants. Were there existing ant colonies that attacked the tree? Did the tree come with a pot of bonus ants? I didn’t know much, but I knew that this many ants cannot be good for plants.

This wasn’t the first encounter with ants. My entire last month has involved a series of battles against these same ants. We had inherited the yard without knowing a whole lot about yards or ants, so we tried all of the options, starting with the easiest:

Do nothing. Hey, ants weren’t in the house, just the yard. Ants eating exposed DoorDash deliveries is just part of the cycle of life. And who knows—maybe the ants will move on their own.
Call in external backup. After a week, it was clear the problem was getting worse, not better. So we called in a professional. The exterminator seemed knowledgeable, sprayed all around the house, and addressed an ant colony in a nearby tree. But there was a catch. “We don’t necessarily remove the root cause of the problem,” he said. “You might have to call us back often.”
Set many traps. At this point, we accepted that we had to get to work ourselves. We acquired ant traps made of sugar and boric acid, which work by having ants carry the boric acid back to their colonies to kill the queen. The challenge: we needed to set ant traps near the colonies. We did our best to follow the ant trails, but there was only so much we could do given our limited understanding of ants and the yard itself.

This brings us to last weekend, when we realized we were not done. There was strong evidence of ant colonies that nobody knew about before. But how were we supposed to know?

Now you might see: your pain is the same as my pain. As with yard pests, software bugs can be extremely confusing to eliminate if you don’t know where to look for the problem. And, a lot of the time, doing nothing is probably not a viable solution. Here’s a quick rundown of existing software monitoring solutions and where they might leave you hanging:

Aggregate monitoring. If you’re using a cloud provider, it likely comes with a monitoring tool for overall system health that tells you when overall performance or error counts exceed certain thresholds. But if you feel the way I do about bugs, including ants, you’re probably happy to allow them to exist peacefully in some places, but not too many in one place and not too close to home. Setting a single threshold on total bugs in this kind of situation will likely lead to over-alerting or under-alerting.
Log-based monitoring. There are plenty of great tools out there that excel at telling you when alerts that you’ve set go off. This is obviously very useful and needed. But, just as I got stuck not knowing where to set ant traps, these solutions don’t help you figure out where to set the alerts. You either need to understand your system well enough or fall back on the ol’ trial and error.
Tracing-based observability. Tracing is one tool that can help you understand where problems are coming from. Traces can help provide the provenance of the issue across services. Tracing is the software equivalent of me having ant sensors all over my yard. The catch? Somebody has to instrument all code, including all new code, for the traces to work. Any piece of code that isn’t instrumented is a black hole in tracing—and could hide bugs. Side note: if I had the time and skills to set up ant monitors all over my yard, I might be able to fix the sprinkler issue that’s likely causing this problem in the first place. 🙃
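To make the over-alerting and under-alerting point concrete, here is a minimal sketch of how a single aggregate threshold can hide a badly broken endpoint. The endpoint names, request counts, and the 5% threshold are all made up for illustration; they are not taken from any real system.

```python
from collections import defaultdict

# Hypothetical request log: (endpoint, HTTP status code).
# A busy, healthy endpoint drowns out a quiet, failing one.
requests = (
    [("GET /products", 200)] * 970
    + [("GET /products", 500)] * 10
    + [("POST /checkout", 200)] * 10
    + [("POST /checkout", 500)] * 10
)

THRESHOLD = 0.05  # assumed alert threshold: more than 5% server errors


def error_rate(statuses):
    """Fraction of responses with a 5xx status code."""
    statuses = list(statuses)
    return sum(1 for s in statuses if s >= 500) / len(statuses)


def per_endpoint_error_rates(log):
    """Group status codes by endpoint and compute each endpoint's error rate."""
    by_endpoint = defaultdict(list)
    for endpoint, status in log:
        by_endpoint[endpoint].append(status)
    return {ep: error_rate(codes) for ep, codes in by_endpoint.items()}


aggregate = error_rate(status for _, status in requests)
per_endpoint = per_endpoint_error_rates(requests)

print(f"aggregate: {aggregate:.1%} alert={aggregate > THRESHOLD}")
for endpoint, rate in per_endpoint.items():
    print(f"{endpoint}: {rate:.1%} alert={rate > THRESHOLD}")
```

In this toy data the aggregate error rate sits below the threshold while half of the checkout requests are failing. Lowering the aggregate threshold enough to catch that would also start paging on noise from the busy endpoint.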

Many application monitoring solutions out there help you identify issues you know to look for, but they don’t make it easy to find issues you don’t already expect to see, or to remove the root causes of issues you are seeing. Too often, eradicating the root causes of issues involves either getting lucky or spending the long stretch of time needed to deeply understand everything that’s going on.

Wanted: a monitoring solution in the “tricky middle”

I don’t like relying on luck and most of us don’t have time to understand everything all of the time, so let’s talk about how we can get ourselves to a better place.

To summarize our shared pain:

We have bugs in places we don’t want them.
We know that it’s okay for bugs to be in some places, but not others.
We don’t have a great understanding of where the bugs are coming from.
This is partly because we don’t have a full understanding of the system we’re now in charge of, or where it’s okay to have bugs.
We don’t have unlimited time/resources.

To summarize what we want from a solution:

Helps get rid of bugs.
Doesn’t assume we want to get rid of all bugs.
Doesn’t require us to understand where the bugs might be.
Doesn’t require us to fully understand the system we’re in charge of.
Doesn’t require a huge setup effort.

In terms of my ant situation: I don’t understand where it’s normal for ants to live or what an acceptable number of ants should be. I just know I don’t want ants to attack during my al fresco meals or to eat my plants. Because I don’t understand yards or ants, a solution that simply gave me all information about ant activity across my yard wouldn’t mean much to me. Because I don’t understand yards or ants, I’m also in no position to figure out what parts of my yard to monitor for ant activity. I do have a fair amount of confidence in my ability to identify a problematic ant situation. What I want is a solution that can identify problem spots, let me take a look to figure out if they’re actually problematic, and potentially give suggested fixes.

Similarly, in software, many developers (my own team included) don’t have a complete understanding of their systems. We know our users generally need to have a working product. A drop-in solution that monitors all errors and slow endpoints in aggregate doesn’t help us focus on the parts of the system that we know we care about. And a solution that requires us to understand where issues might be, or, even more demanding, how our system components fit together, is too often out of reach.

For both my ant problem and my bug problem, the ideal monitoring solution is somewhere in between “the user specifies nothing” and “the user specifies everything.” As the user, you know you don’t want to monitor the whole system the same way, or to offload all insight to an AI, but you don’t know enough to specify exactly what you want. The holy grail is a drop-in solution that can tell a non-expert what they need to know in order to apply their own insights to solve problems at their root. I call this zone the “tricky middle” because it’s a lot easier to build a solution that requires the user to specify either everything or nothing.

Our solution: drop-in, per-endpoint monitoring

At Akita, the “tricky middle” is where we’ve decided to live. Our solution: drop-in, near real-time monitoring that allows users to specify what they care about on a per-endpoint level. Our solution is somewhere in between what people call observability and what people call monitoring, so let me explain what it is by explaining how we got here.

At first, we had just one requirement: our solution needs to be able to drop into a system and tell users what’s going on with their APIs. To that end, we first built an eBPF-based solution that passively watches API traffic on the network and automatically analyzes that traffic to generate API specs. The analysis includes automatically grouping traffic into endpoints (including path parameters) and automatically inferring information about data types, authentication, and more. Because the solution is based on passive traffic watching, no code changes are required: no need to include a library, call an SDK, or add logging statements.
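To give a feel for what grouping traffic into endpoints with path parameters involves, here is a toy sketch. It is not Akita’s actual inference algorithm, which this post doesn’t describe; it assumes a deliberately simple heuristic in which any path segment that is all digits or looks like a UUID is treated as a parameter. The function names and the `{id}` placeholder are hypothetical.

```python
import re
from collections import Counter

# Assumed heuristic: a segment is a path parameter if it is all digits
# or has the canonical 8-4-4-4-12 hexadecimal UUID shape.
UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def looks_like_parameter(segment: str) -> bool:
    return segment.isdigit() or bool(UUID_RE.match(segment))


def to_endpoint(method: str, path: str) -> str:
    """Collapse a concrete request path into an endpoint template."""
    path = path.split("?", 1)[0]  # ignore any query string
    segments = [s for s in path.split("/") if s]
    templated = ["{id}" if looks_like_parameter(s) else s for s in segments]
    return f"{method} /" + "/".join(templated)


# Hypothetical observed traffic: (method, path) pairs.
observed = [
    ("GET", "/users/42"),
    ("GET", "/users/1337?verbose=true"),
    ("GET", "/users/42/orders/7"),
    ("GET", "/orders/123e4567-e89b-12d3-a456-426614174000"),
    ("POST", "/users"),
]

endpoint_counts = Counter(to_endpoint(method, path) for method, path in observed)
for endpoint, count in endpoint_counts.items():
    print(f"{count:>3}  {endpoint}")
```

A real system would need to handle much messier cases, such as slugs, usernames, or versioned paths, which is part of why doing this automatically from traffic is hard.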

After we shipped traffic-based API specs, we learned an important lesson: people get overwhelmed seeing all of their API activity. People wanted to know which endpoints to pay attention to, based on usage and latency. People also wanted to know what changed about their systems, across deployments and over time. We had been doing the equivalent of compiling lists of where ants could be, based on where we saw any activity. What people wanted was to be able to explore that list prioritized by amount of activity and amount of new activity.

After this feedback, we went back to the drawing board and built a near real-time API observability solution that tracks usage, performance, and errors across API endpoints, all inferred by passively watching network traffic. The Akita agent collects API traffic and obfuscates payload data before sending it to the Akita cloud, which infers API structure in order to support exploration of API endpoints, per-endpoint metrics and errors, and more.
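As an illustration of what obfuscating payload data while keeping API structure could look like, here is a small sketch. It assumes one possible approach, replacing every leaf value in a JSON-like payload with the name of its type, and is not a description of what the Akita agent actually does. The `obfuscate` function and the sample payload are hypothetical.

```python
def obfuscate(value):
    """Replace leaf values with type names, keeping keys and nesting intact."""
    if isinstance(value, dict):
        return {key: obfuscate(inner) for key, inner in value.items()}
    if isinstance(value, list):
        return [obfuscate(item) for item in value]
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "string"


# Hypothetical request payload containing sensitive values.
payload = {
    "email": "someone@example.com",
    "age": 30,
    "active": True,
    "address": {"zip": "94110", "lat": 37.75},
    "tags": ["vip", "beta"],
}

shape = obfuscate(payload)
print(shape)
```

The output keeps enough structure to infer field names and data types, while the original values never leave the machine where the traffic was observed.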

We knew we had finally built something accessible when users started using our dashboards regularly—and asking for alerts. We’re excited to finally start rolling out our per-endpoint alerts in alpha. Docs here.

Akita's per-endpoint monitors, now in alpha.

An example of tracking a monitor over time.
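For readers who want to picture what a per-endpoint monitor boils down to, here is a minimal sketch. It assumes a monitor is just an endpoint plus a p90 latency threshold, which is a simplification for illustration and not Akita’s data model or API. The `Monitor` class, the `evaluate` function, and the sample latencies are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    endpoint: str
    max_p90_latency_ms: float  # alert when p90 latency exceeds this


def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def evaluate(monitors, latencies_by_endpoint):
    """Return (endpoint, observed p90) for every monitor over its threshold."""
    triggered = []
    for monitor in monitors:
        samples = latencies_by_endpoint.get(monitor.endpoint, [])
        if not samples:
            continue  # no traffic observed for this endpoint in the window
        p90 = percentile(samples, 90)
        if p90 > monitor.max_p90_latency_ms:
            triggered.append((monitor.endpoint, p90))
    return triggered


# Hypothetical latencies (ms) observed in one time window.
latencies = {
    "POST /checkout": [120, 130, 125, 140, 900, 950, 135, 128, 132, 138],
    "GET /products": [20, 22, 19, 25, 21, 23, 20, 24, 22, 21],
    "GET /internal/debug": [3000, 3200, 2900],  # slow, but nobody set a monitor
}

# The user only specifies the endpoints they care about.
monitors = [
    Monitor("POST /checkout", max_p90_latency_ms=500),
    Monitor("GET /products", max_p90_latency_ms=100),
]

alerts = evaluate(monitors, latencies)
print(alerts)
```

The slow debug endpoint stays quiet because nobody asked to be alerted on it, which is the “okay for bugs to be in some places, but not others” idea from earlier in the post.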

For those who are curious about why I say our solution lives somewhere between monitoring and observability: when people talk about monitoring they typically mean health and when they talk about observability they typically mean understanding why systems behave the way they do. While what we’re building doesn’t do the tracing that people typically associate with observability (and the “why”), automatically inferring API structure gets developers a whole lot closer to root cause and makes it much easier to understand how systems are changing—which is one of the main goals of observability.