Software, thoughts, and stuff

So you want to be an engineering manager

June 23, 2025 · 17 min read

TL;DR

This is a summary of hundreds of conversations I've had with prospective engineering managers about what life looks like as a manager- the purpose and responsibilities of the role, but more importantly- what you actually do every day.

I chat with a lot of younger engineers about their future, career goals, and what kind of work makes them happy. By far, the most frequent conversation I have is about engineering management- what it involves, how to get into it, and whether it's a good fit for them.

I've also worked with people who've been thrown into management without any real guidance or preparation. This can be a really hard situation, because you absolutely need to be able to ask for help, but it can be terrifying as a new manager to admit that you're struggling.

I thought I'd spend some time summarizing the conversations I often have with prospective managers of what I think life as an effective engineering manager looks like. I'll start with the high level purpose, and then we'll get into the details of what day-to-day activities look like.

Using AI to create a Kubernetes controller in a hurry

May 27, 2025 · 16 min read

TL;DR

I've been looking for an excuse to do something more deliberate and ambitious with generative AI developer tools, so I created a Kubernetes controller which discovers Kubernetes-managed AWS load balancers, scrapes their CloudWatch metrics, and exposes them as Prometheus metrics (which have an infinitely better developer experience).

I'll share what I learned: the magnitude of the productivity boost, how effective it was at teaching me, some strategies I landed on, limits to the agentic tooling I used, and some surprising gaps in the models' abilities.

Since I've been primarily working in platform and infrastructure recently, a lot the "code" I've been writing is domain-specific configuration languages (e.g. terraform, helm charts, OTel collector config, Kyverno policies, etc). Despite so many glowing testimonies of massive wins with generative AI dev tools, the results from my first few attempts to use them were a bit underwhelming. My guess is that config code usually doesn't get open sourced, so the models just don't get a lot of good examples to train on.

This has left me hankering to do something more ambitious, and probably in a general purpose language, where I can get a better sense of what the tools can do (beyond fairly mundane auto-completion).

Then recently, I had a few days where a lot of my colleagues were off on vacation (school vacation in Massachusetts), and it occurred to me I'd been quietly stewing on a problem that might be a great candidate for my own little hackathon.

Observability Signals: choose wisely

February 25, 2025 · 16 min read

TL;DR

A quick review of the "three pillars" of observability: logs, metrics, and traces- their strengths and weaknesses, how to match the right signals to particular use cases.

The fun part: we'll explore this via a real-world app, adding observability as it grows from a small monolith into a large, microservices architecture.

Three pillars, inscribed recordum, mensurae, and vestigium

I spend quite a bit of time helping developers figure out how to make sense of telemetry data they've collected. Often this is showing them to navigate our observability tooling, how to think about visualizing data, or just tossing some saucy query tricks into the mix.

Sometimes though, I find that their telemetry is... ill-conceived. I occasionally find metrics being written like they were a point-in-time event, missing context in logs because someone was worried about adding too much cardinality, or log lines with variable and static context needlessly mashed together. Alas.

This kind of thing makes me wish I'd had a chance catch them earlier, when they were just beginning to consider instrumentation choices.

Given how often this happens, I've built up some insights on some of the most common misconceptions about the different observability signals, what makes them different from one another, and why you'd choose a specific one for a given situation.

OpenTelemetry's secret weapon

February 12, 2025 · 10 min read

TL;DR

As OpenTelemetry's adoption has surged, it's drawn increasing criticism: it's complex, isn't fully matured, and its user experience can feel... unpolished. While these are valid gripes, I think we've hit an inflection point where OTel's benefits outweigh its pain points, especially when compared to the alternative of proprietary telemetry pipelines and lock-in with the dominant (and outrageously expensive) vendors.

In the next year or so, I think its benefits are going to increase dramatically due to its secret weapon: semantic conventions. These conventions allow any observability vendor to create the same rich, powerful, out-of-the-box user experiences that the dominant players had locked-down via their ownership of the entire telemetry pipeline.

Now that OpenTelemetry has gained such significant traction, it's starting to attract a lot of attention beyond the hardcore observability community. While most of what I read about OTel is pretty positive, it also draws its fair share of shade.

Honestly, I get it. As a big user of OTel, I've spent plenty of hours rage debugging OTTL filters in OTel collectors, desperately searching for SDK examples that actually work, or pulling my hair out trying to figure out which version of a protobuf schema changed and broke telemetry from my Swift clients. And like a lot of the haters, I also get frustrated that the overall level of complexity in the specifications leaks into the implementation details of every part of the project.

All that said, I'm incredibly thankful that OTel exists. While it's still in its awkward teenage years, it's already changing the industry dramatically. Despite the challenges that currently exist, I think OpenTelemetry is the only reasonable path forward for observability.

Are we ready for Observability 2.0?

January 30, 2025 · 19 min read

TL;DR

Observability 2.0 is a vision of observability that seeks to replace the traditional "three pillars" of observability (metrics, logs, and traces) with a single source of truth: wide events.

This vision is compelling, but there are a number of obstacles that make it difficult to adopt in practice. We're now thinking about Observability 2.0 as a philosophy we can work towards gradually.

At SimpliSafe, we manage a pretty large system of microservices. Because we're entrusted by our customers to protect their homes and families, we take reliability of our systems pretty seriously, so observability is pretty important to us.

Like most companies today, our observability strategy is built around the "three pillars" approach: metrics, logs, and traces. Of the three, we're currently the most dissatisfied with our logging tooling, and have been working on finding a better product.

We're already a Honeycomb customer, and in our conversation with them, they ended up making a pretty interesting case that we should consider a new approach: ditch the three pillars and make traces the center of our strategy. We had up to this point been thinking very incrementally, and here was Honeycomb, coming in hot with a bold and revolutionary vision: Observability 2.0.

Multi-platform container builds with BuildKit

January 2, 2025 · 13 min read

TL;DR

At SimpliSafe, we wanted to take advantage of the cost savings and performance improvements of AWS's Graviton processors (ARM64), but wanted to do it incrementally to manage risk.

We built an autoscaling, docker-compatible build service using BuildKit, which could build multi-platform container images, and then used Karpenter to auto-provision Graviton-based nodes.

This is paying off- we're seeing the expected 30% better performance per EC2 dollar spent, as well as some surprising benefits to developer productivity.

AWS's Graviton is a 64 bit ARM-based CPU available on EC2. Why, you ask, would one want to use Graviton-based instances when trusty old x86 instances have served us so well in the past?

Well, for one, you'ds see a roughly 30% improvement in price/performance versus instances with x86 chips. This performance difference is even more pronounced at higher utilization, because unlike x86 chips, a Graviton vCPU is an actual CPU core, NOT a hyperthread you're sharing on a core with some rando. This should allow you to scale down, increase CPU utilization more than would be safe with hyperthreads, and still handle the same load.

Incremental IPv6 with Kubernetes

October 13, 2024 · 12 min read

TL;DR

Due to looming IP address exhaustion, we've been migrating my company's Kubernetes workloads to IPv6. While IPv6 has its sharp edges, AWS EKS's new IPv6-only mode and better OSS ecosystem support has made it possible to adopt incrementally.

Here's a bunch of tricks I've picked up in the process.

At my work, we've been struggling a bit over the past few years with decisions made (almost 10 years ago now) about our AWS network design. While we have a full class A private network (16,777,216 IPv4 addresses), we've managed to paint ourselves into the very sad corner of looming IP address exhaustion.

There's a few reasons:

Our integration with cell network carriers (to support our home security systems) requires a huge chunk of our IP space
Our decision to use a multi-account architecture in AWS, and that we chose to use a flat IP space across our accounts. This means our IP space is fragmented across accounts, regions, and availability zones, making a lot of that address space effectively unusable.

Even with all of this, we might have been fine... until we went big on Kubernetes.

What would an OSS developer platform even look like?

September 23, 2024 · 15 min read

TL;DR

My team has built a developer platform that our developers really like, and is providing a ton of value for my company. But I'm struggling to figure out if and how we might open-source it. I'm looking for advice from you.

As a platform engineer, I enjoy the benefits of working in a field with a vibrant ecosystem of open source infrastructure and developer tools. I've spent much of the last decade building developer platforms by curating and assembling these tools, and after a number of iterations, I seem to have hit on something that's working really well for my current company (SimpliSafe).

As our platform's adoption has grown, we've gotten more and more frequent, really positive, heartwarming feedback from our developers who really like it. This is absolutely freaking delightful, and honestly never stops surprising me.

I often get asked by our developers if we should consider open-sourcing the platform. I've spent some cycles entertaining the idea, but I usually don't get very far before it seems unworkable.

This post is an experiment in thinking in public; I'd like to brain dump my thoughts on the challenges of building an open-source developer PaaS, in the hopes that the platform engineering community might provide some insight to get me past this block.

Building culture is hard, sustaining it is harder

August 6, 2024 · 17 min read

TL;DR

I experienced first-hand what it was like to work in a company with a really strong culture of knowledge management, and watched what it took to build and sustain it. I also witnessed the factors that caused it to eventually crumble.

My current company is struggling with some challenges that are pretty typical for a wildly successful startup that's rapidly grown into a medium-sized company. We've got the expected technical debt, organizational design challenges, and a seemingly infinite number of small systems that work great... until they catch fire as we hit new scaling thresholds.

That one time I did something important

June 2, 2024 · 17 min read

TL;DR

This is the story of the most impactful accomplishment of my career (building Vistaprint's Studio), which happened to be as an individual contributor.

For those of us who've actively chosen to remain active technologists, and have resisted the pressure to join management, it's important to remember that innovation is ultimately driven by individuals.

A commonly accepted notion in software engineering leadership is that managers have a much bigger potential for impact on a business than an individual contributor. This is certainly a credible argument, given that a great manager can have a huge impact through building a great team. They're responsible for recruiting the right people, steering the culture, and making the biggest decisions about what risks to take, what opportunities to pursue, etc. Ultimately, they're accountable for what the team delivers.