There are a lot of people out there telling you how to write better software.
Use types to catch bugs before testing. Write unit tests. Write end-to-end tests to catch more bugs. Fuzz! Oh, and instrument your code so you can test in prod to find even more bugs.
Read the Internet and you’d think that most companies have a seventeen-step bug prevention routine that starts with types and ends with re-instrumenting all of your services every time you make any code changes so you can precisely trace every request to the ultimate response fifty services away. (This is not all that different from skincare routines, which often end up being many more steps than a person can count on one hand. See my Twitter thread for a fuller analogy.)
But what if you don’t need to have the glass skin equivalent of software quality? You’re working on a web app with maybe millions of users, but you don’t need to scale to every human on the planet. And because you’re not in banking or healthcare, buggy interactions in many user flows can get corrected by refreshing, eventual consistency, or customer service representatives. What if you’re looking for… a one-step routine?
In this post, I posit that if you want just one tool to tell you about the bugs that matter, an API monitoring tool is a good candidate. But not the API monitoring tools of today: we need to see some innovation in this space!
Here’s a dirty secret that most of us already know: bugs don’t matter unless they show up in prod. And only if they show up in the critical prod flows.
Back in the day, when code was small and self-contained, it may have been possible to identify and fix all possible bugs. With even the average phone app running to thousands of lines of code, the number of potential bugs is long past the point where we can reasonably expect to fix them all.
Sure, many bugs are bad because they could happen in prod. And there’s no denying that types, static analysis, and other code-focused techniques help with whole classes of bugs.
But there are a whole lot of possible bugs that are unlikely to show up in prod, because it would take a pretty improbable sequence of events for them to occur. And yes, some of them are pretty bad if they happen. For those, we should take the necessary precautions. But the majority of software teams are not going to prioritize fixing bugs that are recoverable and infrequent.
Because of this, understanding how, when, and how often bugs occur in production is the best way to prioritize bugs in a modern system.
When people think of prod, they often think of performance optimization and error tracking. They think of lots of logs, traces, and dashboards. They think of technical power users running around wielding their technical power tools.
When people think of APIs, they think of automation and good developer experience: Stripe, Twilio, Okta. Or they might think of design, governance, and management.
But more and more, prod is APIs. The rise of SaaS and microservices means that, increasingly, software is made up of services calling other services. Understanding production behavior means understanding the web of who is calling what, when, and how.
Trying to root cause an incident? Debug high latency? The picture is a lot clearer if you know whether an upstream service you depend on, one you're calling through an API, is contributing to your issues.
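To make that concrete, here's a minimal sketch (not any particular vendor's approach, just an illustration) of what per-dependency visibility can look like: a thin wrapper around outbound HTTP calls that records how long each upstream service took and whether it errored. The `call_upstream` helper and the service names are hypothetical.

```python
# Sketch: time every outbound API call per upstream service,
# so you can see which dependency is contributing to your latency.
import time
from collections import defaultdict

import requests  # assumes the requests library is available

# upstream service name -> list of (seconds, status) samples
upstream_timings = defaultdict(list)

def call_upstream(service_name, url, **kwargs):
    """Hypothetical wrapper: call an upstream API and record latency + status."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=5, **kwargs)
        status = response.status_code
        return response
    except requests.RequestException:
        status = "error"
        raise
    finally:
        elapsed = time.monotonic() - start
        upstream_timings[service_name].append((elapsed, status))

# Usage: every call is now attributable to a specific dependency.
# call_upstream("billing", "https://billing.internal/api/v1/invoices")
# call_upstream("auth", "https://auth.internal/api/v1/tokens")
```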
Okay, you might now be saying. But we have tools to understand prod. What about Prometheus, Grafana, and all of those technical power tools that we see technical power users running around using?
The problem isn’t that you can’t monitor prod. It’s that it is not so easy to monitor all of prod. With application performance monitoring and observability tools, it’s easy to end up with either too little or too much information about prod. Monitor only a small set of known issues and you don’t get the coverage you’d ideally like. Turn on monitoring for everything and you get a fire hose of information, requiring you to sift through lots of logs and/or look at lots of dashboards to see what’s going wrong. This requires some understanding of the system under monitoring, some understanding of how to read low-level dashboards, and time.
Looking at prod through the lens of APIs is like looking at biology through cell theory. You could try to understand how animals and plants function by looking at pH levels, or at how well they’re doing overall. But the minute you realize they’re made up of cells and start looking at how things flow in and out of cells, you get an abstraction that unlocks all kinds of understanding and predictive power about the system. It’s the same with APIs: you can look at your entire system in terms of aggregate, low-level metrics like “how many errors am I getting overall,” or you can associate high latency, errors, and more with specific APIs. The latter, especially once you start understanding how the APIs are interconnected, is incredibly powerful.
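As a concrete (and heavily simplified) illustration of the difference, here's what hand-rolled per-API metrics might look like with the Prometheus client library: instead of one global error count, latency and errors are labeled by endpoint, so a dashboard can answer “which API is slow?” rather than “is something slow?” The endpoint names and label scheme are made up for the example.

```python
# Sketch: per-endpoint latency and error metrics instead of one global counter.
# Uses the prometheus_client library; endpoint names are illustrative.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "api_request_seconds", "Request latency per endpoint", ["endpoint"]
)
REQUEST_ERRORS = Counter(
    "api_request_errors_total", "Errors per endpoint and status", ["endpoint", "status"]
)

def handle(endpoint, handler):
    """Wrap a request handler so its latency and errors are tagged by endpoint."""
    start = time.monotonic()
    try:
        return handler()
    except Exception:
        REQUEST_ERRORS.labels(endpoint=endpoint, status="500").inc()
        raise
    finally:
        REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    handle("/v1/checkout", lambda: time.sleep(0.05))
```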
In the last ten years, production behavior has surpassed code to become the source of truth about software. In the next ten years, there will be major innovations around production tooling for developers. Not just production tooling for ops and infra teams: production tooling for app development teams.
Since finding production bugs increasingly means finding API bugs, I predict major innovations in API tools for production in the next decade. This means it’s time for a higher-level framework for understanding what’s going on in the complex organisms that are our modern software systems. Like cell theory, API-based tooling provides a much-needed framework, on top of the logs, metrics, and traces, for understanding production today.
For those of us looking for a one-step bug-finding routine, API monitoring tools are a good bet for the future. I’m looking forward to seeing how the industry builds more structured tools for prod.
And if you’re curious about what we’re building at Akita, check out our beta.