December 15, 2021 · 6 Min Read

Joining Akita to Build the Observability Solution I Wish I Had

by Brennan Moore

For the last four years, I helped build out the tech for Cityblock, a healthcare company, as its first engineer. One week ago, I interviewed at Akita Software. This past Monday, I joined as the Founding Product Engineer.

A year ago, I would not have expected to transition into developer tools. But, given my background and what I’ve seen, joining Akita made a lot of sense.

For a lot of my career, I’ve led fast-moving, resource-constrained engineering teams where reliability was crucial. So when I heard about Akita’s approach to observability, I was intrigued. I liked the “no code changes, no proxying” approach they took. I liked that they were going after making observability more accessible by building for “99% developers” instead of expert SRE teams. The team believed that providing the right developer experience was going to be a key advantage — and that is exactly where I fit in.

In this blog post, I’ll talk about my experiences moving fast with resource-constrained engineering teams—and how that led me to work on building the future of API observability full-time. 

Keeping the site up with under-resourced teams

One of the main reasons I’m excited to join Akita is to build for my past teams. 

My first major experience keeping a site up was at Artsy, a platform for learning about and buying fine art, where my team was responsible for Artsy’s global web presence and live-and-online auctions. We had roughly two teams of three people split across a few services, handling millions of daily artwork views and millions of dollars in global transaction volume for live auctions. As the Director of Web Engineering, I led our transition to microservices to improve test suite speed, deploy time, and site performance. Pre-microservices, we were a Rails monolith using New Relic for API- and route-specific insights. But after we split our app into microservices, our debugging time increased dramatically, and we did not return to the monolith’s level of visibility into API call behavior for years.

In my last role as the Head of Engineering at Cityblock, the reliability challenge was mainly due to the high complexity of the app. By the time I moved on, we had grown to a 20-person engineering team responsible for delivering high-quality clinical and social-service care for people with critical medical needs, many of whom live on less than $17k/year and have upwards of 10-15 emergency room visits per month. To serve the business needs at our team size, we leveraged an enormous number of third-party tools. (At ten engineers, we had roughly 100 third-party tools managed by one IT person!) With a few microservices thrown into the mix, each engineer was the single point of contact for multiple vendor relationships and the sole owner of multiple internal services and/or feature areas.

While my teams certainly could have done things more “correctly”, we were not alone in our practices. The teams I had helped build were part of a trend of teams becoming smaller and more junior while managing more infrastructure and third-party integrations than ever before. As more bootcamp grads and people from nontraditional programming backgrounds come into the workforce, I expect the need for simple observability tools for distributed systems will only accelerate.

Because reliability was crucial in my past roles, I’ve spent a lot of time thinking about observability. In fact, in the last six months of my time at Cityblock I switched to working on observability full-time. As someone who thinks a lot about my tools, I spent a lot of time wondering what a better observability tool might look like.

Where observability fell down

Through my experience with observability tools, I came to believe that there is a big opportunity to build easy-to-install tools that help separate signal from noise, especially as infrastructure complexity grows.

Observability is hard to set up

In my experience, observability has required a significant time investment. For anything beyond basic server metrics, we needed to instrument our code by adding libraries to systems in Go, Python, NodeJS, Scala, Elixir… the list goes on. Then engineers needed to be onboarded into new tracking practices. There are many implementation “foot guns”: log-based metrics only look forward, and distribution metrics require a pre-existing understanding of the event’s distribution. In Datadog, for instance, an engineer needs to think about whether the event should be a COUNT, RATE, GAUGE, SET, HISTOGRAM, or DISTRIBUTION. This creates friction not just during setup, but as you scale. As we scaled, for instance, we hit limitations on a COUNT metric and needed to alter working code to summarize events as a RATE. That meant adding event strings across many services and restructuring working code to track events properly.
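To make that friction concrete, here is a minimal sketch of what this kind of instrumentation looks like with Datadog’s DogStatsD Python client. The metric names and tags are hypothetical; the point is that every call forces an up-front choice of metric type, and changing that choice later means editing and redeploying working code.

```python
# A minimal sketch of DogStatsD instrumentation (hypothetical metric names and tags).
# Assumes the `datadog` package and a locally running Datadog agent.
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def handle_export(records):
    # COUNT: fine at first, but hot paths or aggregation limits can force a
    # rethink later, e.g. summarizing the same events as a RATE instead.
    statsd.increment("nightly_export.records", value=len(records),
                     tags=["service:exports", "env:prod"])

    # The same event could instead be a GAUGE, HISTOGRAM, or DISTRIBUTION;
    # each type is aggregated differently, and switching types later means
    # touching code in every service that emits the metric.
    statsd.histogram("nightly_export.batch_size", len(records),
                     tags=["service:exports", "env:prod"])
```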

Observability is hard to scale

The teams I’ve been on have grown quite a bit: Artsy went from four engineers to 20; Cityblock went from no patients to serving hundreds of thousands of patients across five states. While many of our dev tools scaled nicely with our team, our observability tools did not. As we 2xed our infra, our observability tools got 4x more complex to use. Each new system brought server metrics, interaction points, background processes, and third-party integrations. For our user-facing application, we tracked nightly data exports and data-processing jobs, data imports from various services, third-party services, and individual features. Each new thing we added to our dashboard made the dashboard harder to interpret, alerts noisier, and on-call rotations more burdensome. As a result, only our most senior engineers could understand them.

Observability tools are expert tools

As our systems grew, what started out as a simple dashboard quickly became an expert tool. For example, one of our third-party tools would sometimes stop sending us data or send us invalid data. (Side note: when we asked for a status page, they said ‘our systems are far too complex for monitoring.’ 🤔) This meant we needed instrumentation from the point of ingest (did they stop sending us data?), to data processing (did the data change shape?), and finally to display (did we break it?). It required both familiarity with our systems and a sufficient understanding of which issues we might care about to spin up new tools (such as Great Expectations) and create dashboards showing the appropriate metrics across many programming languages and pieces of infrastructure.
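To give a flavor of that middle layer, here is a minimal sketch of the kind of data-shape check a tool like Great Expectations can express. The column names and the daily vendor feed are hypothetical, and the pandas-style calls reflect the classic Great Expectations API of that era; newer releases have reorganized the interface.

```python
# A minimal sketch of a data-shape check (hypothetical columns and vendor feed),
# using Great Expectations' classic pandas-style API.
import great_expectations as ge
import pandas as pd

def validate_daily_feed(df: pd.DataFrame) -> bool:
    """Return True if today's vendor feed looks structurally sane."""
    batch = ge.from_pandas(df)

    # Did the vendor silently stop sending us data?
    batch.expect_table_row_count_to_be_between(min_value=1)

    # Did the data change shape?
    batch.expect_column_to_exist("patient_id")
    batch.expect_column_values_to_not_be_null("patient_id")
    batch.expect_column_values_to_be_between("er_visits", min_value=0)

    # Run the accumulated expectations and report overall success.
    results = batch.validate()
    return results.success
```

Checks like this answer the ingest and processing questions above, but someone still has to know which columns matter, wire the results into a dashboard, and keep it all current as the feed evolves.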

What I’m excited to build

Me, working on this post at the Akita team off-site!

There are a variety of ways to try to tackle the problems I’ve outlined. I believe Akita has the right approach.

At Akita, the team has built an innovative, easy-to-deploy solution that monitors your system by passively watching API traffic, with no need to instrument code. I particularly like Akita’s goals around developer experience. I have seen first-hand the destabilizing frustration of fire-fighting broken deployments, using tools that give me back only the logs, metrics, and traces I painstakingly put in myself. I am excited to apply my experience building fun and beautiful consumer-facing and enterprise software to this domain.

At Akita, I expect to:

  • Build for '99% engineers' over SREs
  • Develop for '99% teams' over FAANG (Facebook, Amazon, Apple, Netflix, and Google)
  • Aim for simplicity and actionable insights

In the next few years, I look forward to closing the gap between best practices and actual practice when it comes to observability.

And if joining an early-stage team working on this problem sounds like fun to you, we’re hiring!

Photo by Pavel Herceg on Unsplash.
