Here’s a scenario I watched play out a lot at a previous job: the often-reported, never-prioritized bug.
You know something’s broken, your users know something’s broken, and it’s annoying enough to get reported but never so annoying that it seems worth the effort of tracking down what’s causing it. Until it’s very suddenly really, really worth the effort.
In this blog post, I’m going to tell you the story of a long-ignored bug that finally got fixed while I was a Product Manager at a now-public cloud infrastructure startup. This bug, like many bugs, initially seemed innocuous, but it ended up being so customer-impacting that customers were escalating and so team-impacting that it took us two weeks to root-cause. I’m going to talk about why fixing it was so hard, how a tool like Akita could have spared us a lot of pain, and what I’m excited to build at Akita, as Head of Product, to help fix that pain for others.
A bug we could no longer ignore
I spent a good portion of my time at the cloud infrastructure startup as the Product Manager of a team building a very popular product for millions of users. The product involved working with lots of contributors, from big-name tech companies to hobbyists, to share their work with all our users. We had thousands of projects being shared with users all over the world, 24/7. It was my job to identify ways to help build even better means of sharing and using those projects, and I loved doing it.
The bug I want to tell you about today was by no means our only bug. Like any tech project older than a week, we had a backlog of things we knew we needed and wanted to fix. The bug in question finally became unignorable when several high-profile users got confused and publicly vocal about it, prompting several high-profile contributors to get very upset. These contributors very rightly escalated to their internal contacts, and, subsequently, several very senior, very VP-and-Director-y people at the company began asking in company Slack channels when it would get fixed, in addition to sending daily DMs directly to me with the same question. It was clear to all of us that our partnerships were potentially on the line. Very quickly, not fixing the bug was costing us more than fixing it would.
How did we get into a situation where we were prioritizing bugs through users escalating to VPs? Let me tell you a story that I’m betting sounds a lot like something you’ve seen too. We had a service that hosted files for people to use: from a product perspective, a straightforward spec. But from a technical perspective it was, like so many applications, a complex, interconnected web of endpoints talking to various other services through several other tools. And the bug I mentioned? It was about how we represented how many times a given file had been used. Yep, this single number that should’ve been straightforward was anything but, because at least three places in our API were recording similar but separate types of usage and storing them in separate parts of our database.
I don’t recall who first reported the bug, whether an internal person or a user, but someone who watched this number regularly noticed that it had changed in an unexpected way, and they didn’t know why. So they asked. We knew there was more than one way this number was being recorded, and at first, that’s where our investigation stopped. We always had a long backlog of issue reports from users, contributors, and internal folks, on top of the feature work, the day-to-day maintenance of our product, and routine interrupts from all around the company (I once counted 70+ interrupts in a single month). Because this wasn’t a breaking issue and it seemed to fix itself if given some time, we put it in our list and kept on with our planned work. This continued to happen every couple of months until my DMs filled up with concerned Directors.
Eventually, we would come to find out that some sort of scripty magic was creating the number: a scheduled job combined the various numbers we collected and then heavily processed them to come up with the one true usage number. As you can probably guess, that could easily run into problems, and we had no way of knowing when it did. We had to dedicate one person’s time for several weeks to working out exactly what was happening and proposing an update, so we were down a senior person for our scheduled work. And then, once they found a solution, we had to schedule time for the whole team to implement the fix. Which, of course, meant putting off the other important work we had lined up: a mix of researching the beginnings of a brand-new feature and paying down some scheduled technical debt. That in turn derailed our product designer’s schedule and required me to do a tour of the other teams whose work with us was going to be pushed back as well.
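To make the failure mode concrete, here’s a minimal sketch of how merging separately recorded counters can silently go wrong. All of the names and data here are hypothetical (this is not our actual job or schema); the point is just that when the same underlying event gets logged in more than one place, a rollup that adds the counters together will drift from the truth:

```python
# Hypothetical stand-ins for the three places an API might record "usage".
# Each entry is (file_id, event_id). Some events get logged by more than
# one endpoint -- exactly the kind of overlap that makes totals drift.
download_counts = [("file-1", "e1"), ("file-1", "e2")]
cdn_hits        = [("file-1", "e2"), ("file-1", "e3")]  # e2 logged twice
api_fetches     = [("file-1", "e3")]                     # e3 logged twice

def naive_nightly_rollup(*sources):
    """What a 'scripty magic' job tends to do: just add the counters up."""
    return sum(len(src) for src in sources)

def deduped_rollup(*sources):
    """Count each underlying event exactly once, regardless of where it
    was recorded."""
    return len({event for src in sources for event in src})

naive = naive_nightly_rollup(download_counts, cdn_hits, api_fetches)
fixed = deduped_rollup(download_counts, cdn_hits, api_fetches)
print(naive, fixed)  # prints "5 3": the naive total overcounts
```

Five recorded rows, but only three real usage events. With no visibility into which endpoints were writing which numbers, a discrepancy like this is nearly impossible to spot from the final total alone.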
If we had been able to quickly see what endpoints we had and how they were being used (did I mention the API docs hadn’t been updated in at least a year?), then this task that was so overwhelming and time-consuming (it took about two weeks to track down the exact details of what was happening) would’ve taken a matter of hours, possibly minutes. But at the time, we didn’t know of an easy way to do this. We really only considered our API in the middle of, or in the aftermath of, an incident. When we were adding some new feature or functionality, we’d just add a new endpoint without a complete picture of what was already there, which is how we got ourselves into that bug in the first place.
The problem of “we’ll instrument later”
Our bug didn’t come about because we were doing anything wrong. We were a busy, high-performing team with millions of users and thousands of stakeholders to support and answer to. We did what gave us the fastest returns on our efforts to meet the momentum of the needs of the company. Sounding familiar?
Even on such a high-functioning team with a high-impact product, most of our bug-related tools were after-the-fact. Bugs take time to track down and solve, and most of the tools that help us catch them are either incident-reporting tools or tools for the post-mortem. We had the buggier, more mission-critical parts of our app connected to Datadog and PagerDuty alerts to let us know when something broke, and we worked with our Ops team to make sure our app stayed up. Every tool we knew about was great at what it did, but none of them did what we would’ve needed to solve our bug: point us to where our API was hurting, in real time, without us having to write anything to tell it where to look. At the time, the team thought of tools that interacted with our API as heavy-hitters that required a lot of setup, tweaking, and interpretation to be valuable. And they weren’t wrong: most solutions out there, like Honeycomb or Datadog, would have required both time and expertise to set up to the degree that we needed. We mostly stuck with building on the instrumentation that had already been set up by the folks who sank the time into it before us, and we were loath to move to something else because we were sure it would take weeks to implement and then months to get to the point where it was correct and helpful.
And I fully supported the team’s choice on this. From a product perspective, our tools were “good enough” to help us mid-incident or after the fact for really critical things, and we trusted our testing mechanisms to catch major problems before we released anything. If I had known there were tools to catch issues before a bug report or incident, or, even more critically, tools to help us in planning new features, I would’ve prioritized them immediately. The hours upon hours spent finding and fixing bugs, rolling back something that slipped past testing, or updating API endpoints that confused the team because they were too similarly named are worth avoiding if possible. Until Akita, it just never felt possible.
If you have recurring bugs, alerts, or interrupts that sound very similar to what I’m describing, you’re in really, really good company. There are a lot of us dealing with these decisions and tradeoffs, dreading what we’ll have to put off doing in order to instrument our systems today.
API awareness: the Akita way
But what if there was a tool that just showed you what you needed to see, without you having to know what information you need or spend any time instrumenting anything? In comes Akita, with its drop-in API modeling and high-level insights, all based on watching your API’s traffic.
The short reason why I work at Akita: drop-in API awareness removes the pain of having to pinpoint where problems are occurring, while also removing the burden of having to instrument anything. The time commitment is dramatically smaller because, once installed, Akita surfaces problems before they become real problems. And being aware of your APIs makes planning new features, writing updates, and paying down tech debt easier than it has ever been, no matter how sprawling or complicated your API might be.
When I initially talked to Akita’s founder, Jean Yang, about joining, my first thought was “I wish we had this right now on my team.” The ability to simply drop in an agent that then maps out what it sees in and from our API would have been the kind of magic that solved problems we had all written off as too hard and tedious to fix. Because keeping track of an evolving API that’s in active daily use is really hard! So many things can go wrong, and you often don’t even know until there’s an incident or several very angry stakeholders in your Slack DMs. (If you’re very lucky, both at the same time!) The ability to know when there were quirks and slowdowns before they evolved into problems would have been life-changing, not to mention on-call changing.
If this sounds like you, I encourage you to sign up for our beta and give Akita a try. The more you know about your API, the better your app, the happier your team, and the better your on-call weekends go.
What I hope to build at Akita
I joined Akita in no small part because I wanted my team and teams like us to know about and have access to a tool as game-changing as Akita, and that drive has only grown as I’ve learned more about what Akita is capable of. There’s so much information to be had from listening in on internal APIs: which endpoints are most used, how fast or slow they are, what requests and responses they see, and which ones have started getting slower or sending more frequent 400s. This is the kind of information I didn’t know I needed until I had it.
If we’d had Akita back when our annoying number bug was around, we would’ve been able to see exactly which requests and responses were failing during those recurring magic-script sessions, saving us literal weeks of effort. Even if we’d chosen not to fix it immediately, we’d have known exactly where to start when we did. And had our predecessors had Akita, they might have been able to use a single endpoint to gather the data in the first place.
While Akita is still an early product in beta, we have big dreams. As the new Head of Product, it’s my job to mold the product into a state where it’s accessible for any software team, especially teams that don’t have time to instrument. My dream is to build a product where any developer on any team in any company can, with a single command, drop Akita’s agent into their staging or production environments and, in a few minutes, know everything they need to know about their API forever. And while that might take some time to realize, in the next few months, I’m excited to build ways to showcase and highlight more and more of the interesting API data Akita already gathers right now.
If this dream sounds awesome to you, sign up for our beta and help us as we make API awareness a straightforward reality for all developers, no extra instrumenting involved. And if you’ve encountered these kinds of bugs but aren’t in a position to try out the Akita beta, I’d still love to hear from you about the kinds of tooling that you would like to see.
Photo by Matthew Henry on Unsplash.