August 4, 2021
The Software Heterogeneity Problem, or Why We Didn't Build on GraphQL
One of the questions we get asked most at Akita is why we didn’t build our one-click observability solution on top of GraphQL.
It’s a good question. GraphQL is hip, it’s clean, and it’s rising in popularity. GraphQL makes your life easier with fancier schemas and flexible database queries. GraphQL would make our lives easier by essentially handing us a dossier about each of your services. Other cool companies like Apollo and Hasura are all building on top of GraphQL!
But at Akita, we’re very much not building on top of GraphQL and have instead decided to take the more labor-intensive route of sitting on your network and watching your API traffic. It’s not as hip, not as easy to explain, and doesn’t give us nearly as much clout when we talk about it on the internet. The reasons for our choice have very much to do with our views on the inevitability of the Software Heterogeneity Problem, so this blog post is about that.
In this post, I’ll talk about the dream of one-click observability that we’re building toward, why a GraphQL-only world would certainly make that dream easier, and why the Software Heterogeneity Problem means that tackling GraphQL alone is not going to be enough. And if drop-in, one-click observability across your services is interesting to you, we'd love to have you join our private beta!
The dream of one-click observability
The rise of SaaS and APIs has made it easier than ever to imbue your software with new functionality. But it has also made your systems harder than ever to understand.
Which endpoints of my payments service actually get called, and by whom? Where is email data getting sent across my APIs? As software systems have transformed from planned gardens into organically evolving rainforests, these questions have gotten harder and harder to answer.
Devops observability tools, from companies like Datadog and New Relic and Honeycomb, have stepped up to fill the gap. By exposing logs, metrics, and traces and giving developers a way to explore them, these tools give developers new sources of visibility into their systems. But logs, metrics, and traces are something like an assembly language for system understanding—and, today, getting this data requires having a basic understanding of your systems. What if understanding your systems could be as easy as writing Python?
At Akita, we set out to build a solution that any developer could drop into any system to start answering questions about how the APIs are talking to each other and what they’re sending. You don’t have to know what your services are ahead of time. You don’t have to instrument your code. You don’t have to run a proxy or integrate with an API gateway. Just drop us in to watch your API traffic and we’ll start working our magic to learn the API graph.
Why GraphQL would make one-click observability easy
If you’re a GraphQL fan, you might be thinking that we passed up an easy route.
GraphQL literally gives you everything you want. A query language for your API! That has types! That literally thinks of itself already as a graph! It makes sense that companies like Apollo are focusing on building on top of GraphQL to help companies tame the API graph.
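To see why GraphQL would hand an observability tool so much for free, consider introspection: any GraphQL service will describe its own types and fields to whoever asks. Here’s a minimal sketch that pulls that “dossier” out of a hypothetical, truncated introspection response (the service and field names are made up for illustration):

```python
# A truncated, hypothetical response to GraphQL's standard introspection
# query -- the self-description a GraphQL service hands any client that asks.
introspection_response = {
    "data": {
        "__schema": {
            "types": [
                {"kind": "OBJECT", "name": "User",
                 "fields": [{"name": "id"}, {"name": "email"}]},
                {"kind": "OBJECT", "name": "Invoice",
                 "fields": [{"name": "amount"}, {"name": "user"}]},
                {"kind": "SCALAR", "name": "String", "fields": None},
            ]
        }
    }
}

def service_dossier(resp):
    """Summarize the object types a service exposes and their fields."""
    types = resp["data"]["__schema"]["types"]
    return {
        t["name"]: [f["name"] for f in t["fields"]]
        for t in types
        if t["kind"] == "OBJECT"
    }

print(service_dossier(introspection_response))
# {'User': ['id', 'email'], 'Invoice': ['amount', 'user']}
```

With REST, by contrast, there is no standard way to ask a running service for this information, which is exactly why the traffic-watching approach described below exists.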
Yes, GraphQL is great, but it has a requirement that makes it untenable for one-click observability: you have to buy into the GraphQL paradigm for all of your APIs.
But GraphQL doesn’t solve the Software Heterogeneity Problem
How we got into this mess is what I call the Software Heterogeneity Problem: the fact that modern software is spread across more and more different languages and platforms.
I first came across the Software Heterogeneity Problem during my PhD, when I was working on programming language design and program analysis. I had prototyped a new programming language, started building web apps with it, and realized that all of my guarantees went out the window the minute I called out to the database—which happened often. I then realized that all systems of nontrivial complexity were actually made up of lots of different languages. And this heterogeneity meant that no single language design or software analysis could work for The Whole Thing.
My first response was to figure out how to make the Software Heterogeneity Problem go away. Was there some way to make a framework so unified and so grand that everything else could fall into it? After spending years trying to reason away software heterogeneity, I came to conclude that heterogeneity is inevitable for two reasons:
Constraints on development speed and software quality are likely to evolve as systems evolve, meaning that tool requirements evolve.
The tools themselves evolve! The language or framework that made the most sense at the beginning of a project may not make the most sense for a host of tool and tool community reasons ten years in.
And, of course, there’s the important anthropological effect among developers that I will leave to xkcd to explain.
The Software Heterogeneity Problem applies not just to application level languages, but all parts of the tech stack. Here’s a chart from Postman’s State of the API report showing the fragmentation around API architectures. GraphQL is currently at less than 25% prevalence among companies; REST is at almost 100% prevalence. My take: it’s not out of the question that REST will get replaced by GraphQL in many places, but REST is here to stay. And by the time GraphQL has taken over, there’s going to be a new, hip interface description language in the picture.
How we’re going after blackbox observability
Let’s go back and take a look at our goal: to build a solution that any developer could drop into any system to start answering questions about how the APIs are talking to each other and what they’re sending. This means the solution should ideally:
Work with any system, regardless of the programming language or API architecture.
Work with any system quickly and noninvasively, meaning code changes or proxies are not desirable.
This is what led us to pursue solutions that were entirely blackbox and could understand API behavior without requiring any hooks into services themselves.
My team and I have written many other blog posts about our solution (see here, here, and here), so I’ll skip the dramatic tension and just tell you how we do it. After exploring many ways of integrating with systems, we settled on the method we felt was least invasive: using pcap filters to passively sniff API traffic from the network, turning that traffic into HTTP Archive (HAR) files, and then inferring API models from it. We’re building toward a solution that you can drop into a Kubernetes cluster, or point at test traffic, and then explore your API endpoints, how they’re talking to each other, and how they’re changing across code changes, all without instrumenting your code or massaging data.
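The inference step can be sketched roughly like this: given HAR-style entries, group observed requests into endpoints, generalizing variable path segments into parameters. This is a minimal sketch of the idea under made-up data, not Akita’s actual implementation:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical HAR-style entries, as if captured from the network.
entries = [
    {"request": {"method": "GET", "url": "https://api.example.com/users/42"},
     "response": {"status": 200}},
    {"request": {"method": "GET", "url": "https://api.example.com/users/7"},
     "response": {"status": 404}},
    {"request": {"method": "POST", "url": "https://api.example.com/invoices"},
     "response": {"status": 201}},
]

def infer_endpoints(har_entries):
    """Group observed traffic into (method, path-template) endpoints,
    generalizing numeric path segments into an {id} parameter."""
    model = defaultdict(set)
    for e in har_entries:
        path = urlparse(e["request"]["url"]).path
        template = "/".join(
            "{id}" if seg.isdigit() else seg for seg in path.split("/")
        )
        model[(e["request"]["method"], template)].add(e["response"]["status"])
    return dict(model)

for (method, path), statuses in sorted(infer_endpoints(entries).items()):
    print(method, path, sorted(statuses))
# GET /users/{id} [200, 404]
# POST /invoices [201]
```

Note that the two `/users/...` requests collapse into a single `/users/{id}` endpoint: the model is learned purely from watching traffic, with no hooks into the service.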
We’re excited about the generality of our traffic-watching approach because it generalizes across tech stacks. Right now, we support HTTP/REST traffic and generate OpenAPI 3 models. Our methods are general enough to expand to other API architectures and protocols; it’s a matter of our team getting the bandwidth to put in that work.
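To make the OpenAPI 3 output concrete, here’s a hedged sketch of rendering inferred endpoints into a minimal OpenAPI skeleton. The function name and input shape are illustrative; a real generator would also infer parameter and body schemas from observed payloads:

```python
def to_openapi(endpoints):
    """Render inferred (method, path) -> observed-status-codes pairs
    into a minimal OpenAPI 3 document. A sketch only."""
    paths = {}
    for (method, path), statuses in endpoints.items():
        paths.setdefault(path, {})[method.lower()] = {
            "responses": {str(s): {"description": ""} for s in sorted(statuses)}
        }
    return {
        "openapi": "3.0.0",
        "info": {"title": "Inferred API", "version": "0.1"},
        "paths": paths,
    }

spec = to_openapi({("GET", "/users/{id}"): {200, 404},
                   ("POST", "/invoices"): {201}})
print(sorted(spec["paths"]))  # ['/invoices', '/users/{id}']
```

Because the spec is derived from traffic rather than hand-written, it reflects what the API actually does, not what someone once documented.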
On accepting heterogeneity
It’s not a complete blog post without a good dose of evangelism, so I’ll end with a little checklist of dos and don’ts towards dismantling the false religion of the “silver bullet” that has overtaken programming tools:
More planning as if the legacy parts of your tech stack are sticking around
More looking for tools that embrace system heterogeneity
More making sure a tool still has value, even if it’s not running across your entire system
More collecting data on what tools are actually in use across your system, versus what you wish were in use
More skepticism of any “silver bullet” 🙃
Less expecting the hot new tool to subsume the rest of your tech stack
Less expecting high adoption of tools that require the entire rest of your system to adapt
Less sweeping the non-ideal parts of your tech stack under the rug when thinking about your system 🧹