Software, thoughts, and stuff Blog

Incremental IPv6 with Kubernetes

Sun, 13 Oct 2024 00:00:00 GMT

TL;DR

Due to looming IP address exhaustion, we've been migrating my company's Kubernetes workloads to IPv6. While IPv6 has its sharp edges, AWS EKS's new IPv6-only mode and better OSS ecosystem support have made it possible to adopt incrementally.

Here's a bunch of tricks I've picked up in the process.

At my work, we've been struggling a bit over the past few years with decisions made (almost 10 years ago now) about our AWS network design. While we have a full class A private network (16,777,216 IPv4 addresses), we've managed to paint ourselves into the very sad corner of looming IP address exhaustion.

There's a few reasons:

Our integration with cell network carriers (to support our home security systems) requires a huge chunk of our IP space
Our decision to use a multi-account architecture in AWS, combined with our choice of a flat IP space across those accounts. This means our IP space is fragmented across accounts, regions, and availability zones, making a lot of that address space effectively unusable.

Even with all of this, we might have been fine... until we went big on Kubernetes.

Kubernetes eats IPs for breakfast

Kubernetes eating all my IPs

Kubernetes has been a huge win for us. But it gobbles up IP addresses like Pac-Man with a tapeworm.

It's fairly straightforward math: in AWS EKS, with the VPC CNI integration (i.e. a network plugin for Kubernetes that allows it to integrate with AWS's networking APIs), here's what happens to all your IPs:

The EKS control plane requires at least 16 addresses (at least 6 per subnet)
Every node (EC2 instance) requires at least one address, but depending on your CNI settings, the CNI plugin can eagerly allocate additional addresses to keep "warm" (to speed up pod creation)
Every pod on a node gets its own IP address. This includes not only user workloads, but also every daemonset pod. In our cluster, we have about 8-10 daemonset pods per node.

This means, as we've migrated workloads to Kubernetes, we've increased the number of IPs we're using by roughly 10x.
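To make that math concrete, here's a rough back-of-the-envelope sketch in Python. The node count, workload pods per node, and warm-IP figure are made-up numbers for illustration (not our real cluster's), and the function name is hypothetical; only the "8-10 daemonset pods per node" comes from the list above.

```python
def ipv4_addresses_used(nodes: int,
                        workload_pods_per_node: int,
                        daemonset_pods_per_node: int,
                        warm_ips_per_node: int = 0) -> int:
    """Estimate the IPv4 addresses consumed by a cluster's nodes and pods."""
    # One address for the node itself, one per pod, plus any addresses
    # the CNI plugin keeps "warm" for fast pod startup.
    per_node = (1 + workload_pods_per_node
                + daemonset_pods_per_node + warm_ips_per_node)
    return nodes * per_node


# Hypothetical cluster: 50 nodes, 15 workload pods and 9 daemonset pods
# per node, with 2 warm addresses held in reserve on each node.
total = ipv4_addresses_used(nodes=50, workload_pods_per_node=15,
                            daemonset_pods_per_node=9, warm_ips_per_node=2)
print(total)  # 1350
```

The control plane's addresses are left out here, since they're a small fixed cost compared to the per-node and per-pod terms.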

This adds up quickly. We've had a few close calls with IPv4 exhaustion during high traffic events where we had to scramble to temporarily kill non-critical workloads to free up IPs, rebalance across availability zones, or provision new subnets to make sure customers weren't affected.

Actually, IPv6 is a thing

Unlike IPv4, IPv6 address space is so incomprehensibly large that it's effectively unlimited. For example, a typical IPv6 private subnet would have a /64 IPv6 CIDR, which is 18,446,744,073,709,551,616 addresses.
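If you want to check those numbers yourself, Python's standard ipaddress module will do the counting. This is just a quick sketch: 10.0.0.0/8 is the private class A range, and 2001:db8::/64 is a documentation prefix I'm using as a stand-in for a real subnet.

```python
import ipaddress

# The entire private class A network
class_a = ipaddress.ip_network("10.0.0.0/8")
# A single /64 IPv6 subnet (documentation prefix, used as a stand-in)
subnet_v6 = ipaddress.ip_network("2001:db8::/64")

print(f"{class_a.num_addresses:,}")    # 16,777,216
print(f"{subnet_v6.num_addresses:,}")  # 18,446,744,073,709,551,616

# How many whole class A networks fit inside one /64?
ratio = subnet_v6.num_addresses // class_a.num_addresses
print(f"{ratio:,}")                    # 1,099,511,627,776
```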

fun fact

Apparently a number that size is about 18 "quintillion". Numbers this large can only be discussed using your best Carl Sagan voice.

Trillions and Trillions of IPs

IPv6 has been a standard for like 25 years, but is still not widely adopted (for a lot of reasons, including backward-incompatibility, lack of ecosystem support, and ISPs squabbling and dragging their feet).

It's legitimately really difficult to migrate a large distributed architecture like ours to IPv6, because, historically, it would require simultaneous changes across many different systems, along with some scary big-bang moments. It also requires reconsidering a lot of assumptions built into your network design and security strategy.

It's been hard to figure out how to untangle that knot.

Enter EKS IPv6 mode

Given the scarcity (and price) of public IPv4 addresses, and to support the increasing scale of its customers, AWS has been under a lot of pressure to provide more viable paths to adopting IPv6. In one of the smartest moves I've seen from them in a while, they've used Kubernetes's built-in IPv6 support to build a new IPv6 mode for EKS.

Here's the core of the hack: While each node continues to get an IPv4 address, pods get only IPv6 addresses.

Inside the cluster, all traffic is via IPv6, but traffic to and from the cluster gets NATed through the nodes' IPv4 addresses. From the perspective of anything outside the cluster, connections appear to be coming from the nodes' IPv4 addresses. This means only the software inside the cluster has to be modified to use IPv6.

info

Note that if a host outside the cluster is IPv6-enabled, pods may just communicate directly with it over IPv6, and bypass the IPv4 NAT.

This translation of IP version between inside and outside the cluster has allowed us to migrate our workloads incrementally, which has made the whole process much more tractable.

Migrating only EKS workloads, alone, looks like it's going to allow us to reduce IPv4 address usage significantly, perhaps enough to solve our IPv4 exhaustion without any further network changes. Even if not, it should buy us years of additional runway before we hit that point.

What does migrating an EKS cluster to IPv6 require?

Unfortunately, you can't enable IPv6 mode on an existing EKS cluster; you have to create a new cluster and migrate your workloads over. There have been a bunch of specific challenges around this (mostly just minutiae around Terraform wrangling and executing DNS cutovers), but now that we've found most of the corner cases, the process is pretty mechanical.

The bulk of the remaining work is around making any code or configuration changes necessary in the individual workloads to get them to bind to IPv6 addresses.

A few basics

I've been a programmer for like 30 years, and I had never done anything with IPv6 before this migration. There were a few embarrassingly basic things I had to learn about IPv6:

The IPv6 "all interfaces" address is ::, which is equivalent to 0.0.0.0 in IPv4.
The IPv6 loopback address is ::1, equivalent to 127.0.0.1 in IPv4.
URLs that use an IPv6 address as the hostname need the address enclosed in square brackets, e.g. http://[2001:db8::1]:8080 so the colons in the address don't get confused with the port delimiter.
Happy Eyeballs is an algorithm (implemented by most network clients) that allows apps (including browsers) to efficiently decide whether to use an IPv6 or IPv4 address when both are advertised via DNS.
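Here's a small Python sketch that pokes at the first three of these using only the standard library. The host_port helper is a hypothetical name of my own, not something from a library; it just adds the square brackets when the host is an IPv6 literal.

```python
import ipaddress
from urllib.parse import urlsplit

# "::" and "0.0.0.0" are both the "unspecified" (all interfaces) address
print(ipaddress.ip_address("::").is_unspecified)       # True
print(ipaddress.ip_address("0.0.0.0").is_unspecified)  # True

# "::1" and "127.0.0.1" are both loopback addresses
print(ipaddress.ip_address("::1").is_loopback)         # True
print(ipaddress.ip_address("127.0.0.1").is_loopback)   # True

# URL parsers strip the brackets back off the hostname
url = urlsplit("http://[2001:db8::1]:8080/health")
print(url.hostname, url.port)                          # 2001:db8::1 8080


def host_port(host: str, port: int) -> str:
    """Format host:port, bracketing the host if it's an IPv6 literal."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # a DNS name, not an IP literal
    return f"[{host}]:{port}" if is_v6 else f"{host}:{port}"


print(host_port("::", 8080))         # [::]:8080
print(host_port("127.0.0.1", 8080))  # 127.0.0.1:8080
```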

Your OS and language probably support IPv6

One cool thing is that almost all modern OSes (Linux, Mac, Windows) support "dual-stack": they can listen on a port on both IPv6 and IPv4 from a single socket.

On top of this, most high-level programming languages (and their standard libraries) utilize this feature, so if you bind to the :: (all interfaces) address, you'll be able to listen on both IPv4 and IPv6 at the same time.

For example, in node.js:

const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello World!\n');
});

// binds to port 8080 on all IPv6 and IPv4 interfaces by default!
server.listen(8080);

Or you can do it explicitly:

// '::' is the IPv6 "all interfaces" address; on a dual-stack host this accepts IPv4 too
server.listen(8080, '::');

Here's the same basic thing in Go:

package main

import (
    "log"
    "net"
)

func main() {
    // Binds to all IPv6 and IPv4 interfaces.
    // Note the square brackets around the address: they keep its
    // colons from being confused with the port delimiter.
    listener, err := net.Listen("tcp", "[::]:8080")
    if err != nil {
        log.Fatal(err)
    }
    defer listener.Close()

    // ...
}

The same thing is true in .NET, Python, Rust, Java and probably most other languages that aren't doing something weird in their networking implementation.
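For completeness, here's roughly what the same thing looks like in Python, as a minimal sketch. The function name is my own; socket.create_server and socket.has_dualstack_ipv6 are from the standard library (Python 3.8+).

```python
import socket


def open_dual_stack_listener(port: int) -> socket.socket:
    """Listen on all IPv6 and IPv4 interfaces with a single socket."""
    if not socket.has_dualstack_ipv6():
        raise OSError("this platform can't serve IPv4 and IPv6 from one socket")
    # dualstack_ipv6=True clears the IPV6_V6ONLY option, so the IPv6
    # socket bound to "::" accepts IPv4 connections as well.
    return socket.create_server(("::", port),
                                family=socket.AF_INET6,
                                dualstack_ipv6=True)
```

You'd call open_dual_stack_listener(8080) and then accept() connections as usual. One thing to watch for: IPv4 clients show up with IPv4-mapped IPv6 peer addresses (like ::ffff:10.1.2.3), which can surprise any code that logs or parses client IPs.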

Of course, most languages also have lower level networking APIs that are IP version specific. If you're doing more complicated things with sockets, you may have a little more work to do.

Unfortunately, not all apps use dual-stack by default

Even though IPv6 support is readily available in most OSes and languages, it's not always enabled by default in every application. This was particularly annoying for us, because we use a lot of OSS and 3rd party container images as mock dependencies for integration tests, and supporting IPv6 meant we had to add explicit configuration in a lot of places where we previously just used the defaults.

In most cases, the trick is finding the magic CLI arg, environment variable, or config file setting that controls the host to bind to, and setting it to ::.

warning

Some software (MongoDB, Redis) goes out of its way to make :: not be a dual-stack binding. In those cases, you have to configure both the IPv6 and IPv4 listeners separately.
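When you're hunting for the right setting, it helps to have a quick way to see which IP versions an app is actually answering on. Here's a minimal Python sketch (the function names are mine) that tries a TCP connection over each loopback address. It assumes you run it from inside the same container or pod as the app.

```python
import socket


def accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: nothing usable is listening.
        return False


def check_dual_stack(port: int) -> dict:
    """Probe a local port over both the IPv4 and IPv6 loopback addresses."""
    return {
        "IPv4": accepts_connections("127.0.0.1", port),
        "IPv6": accepts_connections("::1", port),
    }
```

For example, check_dual_stack(6379) against a Redis that only bound to IPv4 would report IPv4 as True and IPv6 as False. Loopback only tells you about the local listeners, so it's a first check rather than proof that pod-to-pod traffic works.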

IPv6 cheat sheet

Here's a bunch of examples of various apps I've had to learn how to get working with IPv6:

aws-load-balancer-controller

You don't need to configure the aws-load-balancer-controller itself any differently for IPv6, but when creating Ingresses that use it, they need to have the following annotations to support IPv6:

# Tells the controller to create target groups of pod IP(v6) addresses. 
# The "instance" target type won't work on IPv6.
alb.ingress.kubernetes.io/target-type: ip
# Tells the controller to create a load balancer with IPv6 enabled
alb.ingress.kubernetes.io/ip-address-type: dualstack

The nice thing about this is that the load balancer itself will listen on IPv4 addresses (in addition to IPv6 addresses), which means IPv4 clients won't even know the app has been migrated.

warning

If you're using external-dns to create Route53 entries for your load balancer Ingresses, keep in mind that it will create both A records (for the load balancer's IPv4 addresses) and AAAA records (for its IPv6 addresses). This will change the behavior of any IPv6-enabled clients making connections to that load balancer, such that they may prefer the load balancer's IPv6 addresses over its IPv4 addresses.

This may be fine, but it is one way in which the "only IPv6 inside the cluster" model leaks. For example, if you have security groups on the load balancer, you'll need to make sure you're adding IPv6 versions of any rules.
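If you want to see what clients will see after the cutover, you can ask the resolver which address families a hostname comes back with. This is a small sketch using Python's socket.getaddrinfo; the function name is my own, and app.example.com below is a placeholder, not a real hostname of ours.

```python
import socket


def addresses_by_family(host: str, port: int = 443) -> dict:
    """Group the addresses a hostname resolves to by IP version."""
    results = {"IPv4": [], "IPv6": []}
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        if family == socket.AF_INET:
            results["IPv4"].append(sockaddr[0])   # from A records
        elif family == socket.AF_INET6:
            results["IPv6"].append(sockaddr[0])   # from AAAA records
    return results
```

Calling addresses_by_family("app.example.com") before and after enabling dualstack should show the IPv6 list go from empty to populated, which is your cue to double check the IPv6 security group rules.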

ingress-nginx

In the helm values for ingress-nginx, you need to set the ipFamilies value to include IPv6:

controller:
  service:
    ipFamilies:
      - IPv6

MongoDB

Mongo binds to IPv4 only by default. You can get it listening to IPv6/IPv4 (dual-stack) interfaces with the following command override:

mongod --ipv6 --bind_ip ::,0.0.0.0

Here's an example of a Kubernetes pod:

apiVersion: v1
kind: Pod
metadata:
  name: mongo
spec:
  containers:
  - name: mongo
    image: mongo:8
    command:
    - mongod 
    - --ipv6
    - --bind_ip 
    - "::,0.0.0.0"
    ports:
    - containerPort: 27017
      protocol: TCP

More info: https://www.mongodb.com/docs/manual/core/security-mongodb-configuration/

Redis

Redis binds to IPv4 only by default. You can change it to bind to all interfaces with the following command override:

redis-server --bind "0.0.0.0 ::"

Here's an example of a Kubernetes pod:

apiVersion: v1
kind: Pod
metadata:
  name: redis
spec:
  containers:
  - name: redis
    image: redis:5
    command:
    - redis-server 
    - --bind
    - "0.0.0.0 ::"
    ports:
    - containerPort: 6379
      protocol: TCP

MariaDB

MariaDB 5.5+ already listens on :: by default, so no additional configuration is needed.

LocalStack

LocalStack currently doesn't support IPv6. However, I've opened a PR to add IPv6 support. If that PR gets merged, then you'll be able to use an IPv6 address in the GATEWAY_LISTEN env variable:

GATEWAY_LISTEN=[::]:4566

RabbitMQ

RabbitMQ listens on :: by default, so no additional configuration is needed.

warning

Note that while the rabbitmq:management image binds automatically to the main amqp port (5672) on IPv6, the management API (port 15672) does NOT bind to IPv6.

nginx

The nginx image's default config listens on both IPv4 and IPv6.

If you're authoring your own nginx.conf, you need to add listeners for IPv6 and IPv4 separately. Here's an example of binding port 3001 on both IPv6 and IPv4:

nginx.conf

listen       3001; # IPv4
listen  [::]:3001; # IPv6

OTEL collector

The OpenTelemetry Collector config accepts the :: (all interfaces) address any place you could specify an IP address. For example:

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "[::]:4317"

Other OTEL collector components will automatically use IPv6. For example, the Prometheus receiver correctly uses IPv6 pod addresses when scraping metrics.

Jaeger

Jaeger already listens on :: by default, so no additional configuration is needed.

WireMock

WireMock already listens on :: by default, so no additional configuration is needed.

Gradio

Gradio binds to 127.0.0.1 by default. You can use the server_name property to set up an IPv6 binding in the launch() method in the Blocks object:

blocks.launch(inline=False, server_port=5112, share=False, server_name="[::]")

Uvicorn

Uvicorn will bind to all IPv4/6 interfaces if you set host='::' in the Config object:

ip_config = Config(app=_fastapi_server, host="::", port=8080)
return Server(ip_config)

More info

More IPv6 cheat sheet examples, please!

I'll be adding more IPv6/dual-stack configuration examples as I encounter them.

Do you have more? Leave them in the comments and I'll add them to the list!

What would an OSS developer platform even look like?

Mon, 23 Sep 2024 00:00:00 GMT

TL;DR

My team has built a developer platform that our developers really like, and is providing a ton of value for my company. But I'm struggling to figure out if and how we might open-source it. I'm looking for advice from you.

As a platform engineer, I enjoy the benefits of working in a field with a vibrant ecosystem of open source infrastructure and developer tools. I've spent much of the last decade building developer platforms by curating and assembling these tools, and after a number of iterations, I seem to have hit on something that's working really well for my current company (SimpliSafe).

As our platform's adoption has grown, we've gotten more and more frequent, really positive, heartwarming feedback from our developers who really like it. This is absolutely freaking delightful, and honestly never stops surprising me.

I often get asked by our developers if we should consider open-sourcing the platform. I've spent some cycles entertaining the idea, but I usually don't get very far before it seems unworkable.

This post is an experiment in thinking in public; I'd like to brain dump my thoughts on the challenges of building an open-source developer PaaS, in the hopes that the platform engineering community might provide some insight to get me past this block.

So, tell me more about this platform

Our platform is named "dex/EKS", which is (an admittedly awkward) combination of the name of the client tool, "dex", with the AWS service the server-side is built on (EKS: AWS's managed Kubernetes service). Unsurprisingly, developers tend to just call the whole thing "dex".

In the spirit of the "Platform Engineering" buzzword, dex/EKS encapsulates our company's collective opinions, policies, and best-practices for building, deploying, and operating apps. I like to think of it as a PaaS that we've curated and glued together out of a bunch of open-source and vendor tools.

info

dex (the client tool itself) is a CLI tool for interacting with the platform. Picture the flyctl, vercel, or heroku CLI.

dex is intentionally lowercase, or as I like to call it: "hipster-case". Or maybe camelCase without the humps? I dunno. It's a thing.

Our Kubernetes distribution

In addition to the client tooling, we also have a fairly sophisticated Kubernetes "distribution", which consists of a bunch of curated cluster-side components, combined and configured to work well together. Our cluster configuration is defined with Terraform, and we use it with Github Actions to manage many dozens of EKS clusters. Beyond that, there's integrations with a bunch of third-party SaaS providers, including AWS services and other vendors.

Just to give you a sense of the ingredients that comprise the platform, here's a partial list:

Kubernetes (AWS EKS)
Docker/BuildKit
Github Enterprise (with Github Actions)
Okta
Artifactory
Grafana Cloud
Honeycomb
OpenSearch (for logs)
A bunch of AWS services (ECR, S3, SSM, SecretManager, Route53, ACM, WAF, etc)

Here's a few of the tools (from the Kubernetes ecosystem) we use in our EKS configuration:

dex: the CLI tool

The CLI tool abstracts and integrates the APIs of these infrastructure components and exposes them through a simplified, declarative set of configuration and CLI commands. Some of the things it handles:

Configuration management (what settings your app gets in different environments)
Secrets management
Cross-platform and multi-platform container image builds
User authentication (i.e. SAML auth via Okta, to Kubernetes, AWS, and Artifactory)
AWS IAM integration (allows you to assign AWS permissions to your app)
Kubernetes manifest management (imagine a simplified version of helm)
Ingress management (load balancers, certs, and DNS)
Vulnerability scanning
CI/CD integration (Github Actions)
Telemetry pipeline integration
Docker/Kubernetes-based integration testing framework

One of the key metrics we hold ourselves to is that developers have to be able to get a new "hello world" service up and running in less than 10 minutes, at which point they can turn their focus to business problems. As they get closer to production, they have a few more decisions to make about autoscaling, observability, etc, but for the most part, the platform narrows down the choices to just a few fully-baked, meticulously documented, well-trodden paths.

dex works off-road too

dex has some other extensibility mechanisms for more advanced use cases, such as the ability to author custom commands with arbitrary TypeScript, which can re-compose existing dex commands and any of its constituent APIs.

Teams sometimes use this extensibility to explore the frontier of what's possible. If they find a new pattern to be useful, we will often incorporate it into the platform.

For example, dex's multi-region DNS configuration support was originally built by another team, who then contributed it upstream, so everyone else in the company could use it.

The impact of dex at SimpliSafe

Teams at SimpliSafe have migrated the majority of our services to dex/EKS, and most teams are planning on moving their remaining services over in the next year or so. This has happened with close to zero pressure from management; teams are moving their services to the platform because they're much happier with it than without it.

Suffice to say, I'm very proud of this outcome, and dex seems to be providing a lot of value.

A platform design reflects a company's culture

While it may appear that dex is just a set of choices about tools and services that have been manifested in glue code, it also reflects SimpliSafe's values and organizational culture. While most of these values and cultural properties are fairly well subscribed, they're by no means universal:

Teams should have a great deal of autonomy to choose tools, languages, frameworks, and processes, and they should be accountable for operating the systems they build
Continuous delivery is better than big-bang releases
Microservices are a good way for a large team to build a big system
A central team should own cross-cutting concerns like telemetry pipelines and observability backend tools, auth, infrastructure provisioning, etc
Infrastructure should be represented as code and managed through automation

There's a bunch of others, but you get the picture.

These values deeply inform the design of dex, and it's interesting to look back on the iterations of platforms that I've built at different companies with different values, and how they were reflected in the design of the platforms.

Example: ephemeral environments

Ephemeral developer environments are a feature I 100% know is a huge win for developers, regardless of your company culture. But there have been big differences in design, features, and implementation details when I've built this feature at different companies.

What are ephemeral environments?

Here's the gist: A developer should be able to deploy their app into an isolated, temporary environment from their local machine into Kubernetes, with a single command, so they can iterate, tweak, and test ideas in a production-like environment. They'll need a URL to access the app, so they can play with usability, do manual testing, attach a debugger, troubleshoot config, etc. When they're done, they use another command to tear the whole thing down.

Additionally, for every feature branch they push to Github (or Gitlab, Bitbucket, etc), a CI job will deploy an ephemeral environment with the same properties, run automated tests, and tear it all down when complete (or after a specified period of time).

Ephemeral environments in a financial institution

At my last job, in a commercial bank, we had a very small number of tightly controlled, multi-tenant Kubernetes (OpenShift) clusters. As you might imagine in a highly-regulated industry, the creation of Kubernetes namespaces (and the associated access controls) was governed by security controls, required approvals, and needed to leave an audit trail. The process was managed by a central team.

To allow creation of dynamic, isolated environments, we worked within the static structure of centrally managed namespaces by designing our tooling to generate Kubernetes objects using strict naming conventions (e.g. prefixing all resources with the name of the developer or feature branch). This allowed the tooling to manage the objects as a unit, avoid collisions, and ensure that the objects were cleaned up when the developer was done.

This design decision trickled into many other aspects of the system. For instance, we designed our client tooling to maintain pretty tight control over rendering the objects, and the relationship of objects via names, labels, and selectors.
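To give a flavor of what a naming convention like that involves, here's a hypothetical Python sketch. It is not the bank's actual tooling; the function name and the sanitizing rules are my own assumptions, chosen to fit the 63-character DNS label limit that applies to many Kubernetes names (Services and label values, for example).

```python
import re

MAX_NAME_LENGTH = 63  # DNS label limit used by many Kubernetes names


def scoped_name(owner: str, resource: str) -> str:
    """Prefix a resource name with a sanitized developer or branch name."""
    raw = f"{owner}-{resource}".lower()
    # Collapse anything that isn't a lowercase letter, digit, or hyphen
    # into a single hyphen, then trim hyphens from both ends.
    cleaned = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    # Truncate to the limit, making sure we don't end on a hyphen.
    return cleaned[:MAX_NAME_LENGTH].rstrip("-")


print(scoped_name("feature/JIRA-123_login", "api"))
# feature-jira-123-login-api
```

A real implementation would also need to handle collisions caused by truncation (two long branch names that share their first 63 characters), which this sketch ignores.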

Ephemeral environments at an IoT security company

At SimpliSafe, the company's culture and preexisting architecture enabled a very different approach: ephemeral environments are implemented via Kubernetes namespaces, and the client tooling can create (and destroy) namespaces dynamically.

Because we have per-team AWS accounts, and our Kubernetes clusters already provide strong isolation, we're comfortable giving developers the power to manage namespaces in our non-production environments. This removes a lot of the need for strict control over object relationships in Kubernetes, and gives developers more flexibility to mess with the underlying objects more directly.

This additional power is a reflection of SimpliSafe's culture of autonomy and trust in developers.

Different tradeoffs, different design

So even with the same feature, providing pretty similar benefits to developers, we had to make very different tradeoff decisions, and ended up with design differences which significantly impact the architecture, features, and feel of the rest of the platform.

What would this look like open-sourced?

Given the two examples of companies with different infrastructure opinions, let's think through the possible flavors of open-sourcing a developer platform like dex:

Option 1: A hyper-opinionated "PaaS in a box"

This option assumes that the infrastructure decisions we've made at SimpliSafe would be a good fit for at least a bunch of other companies, with minimal modification. We'd provide the whole thing, end-to-end, including the EKS cluster configuration and terraform, all the cluster-side system components, and the dex client-side tool.

I find this option hard to imagine for a few reasons:

While I'm very confident we've got a great solution for SimpliSafe, I think it's virtually impossible that any other company would be happy with all our opinions (the bank certainly wouldn't have been). Our platform glues together scores of specific OSS products (and a number of SaaS vendor tools), and the odds that every one of them lines up with another company's preferences are close to zero.
A platform engineering team using this version of the platform would be signing up to build expertise in, and support, every OSS product we've chosen.
While an out-of-the-box, opinionated platform might be good for a startup, our platform is certainly NOT the right choice for one. It's designed around supporting many teams, and to allow a central platform engineering team to manage infrastructure underneath teams' apps... which is not the problem engineers at a startup should be worrying about.
Among the opinions encapsulated in our platform are some we're not happy about. We have a few compromises based on legacy infrastructure choices that are hard to change, and some choices which are an intermediate phase between where we are and where we want to go. For example, we're currently using the telegraf-operator to collect metrics for lots of our services, but we'd prefer to be using OTEL SDKs and/or Prometheus libraries.

I actually can't imagine myself choosing to use someone else's OSS platform if it were built on this philosophy.

Option 2: A whole platform, but pluggable

In this variant, we'd also provide the whole platform, but allow users to bring their own infrastructure opinions via a plugin API.

I also see some big disadvantages here:

Abstraction layers add complexity. Part of the value of dex is that the code is relatively simple, straightforward, and hackable. We often get a PR or feature request, and end up cutting a new release within hours. This would not remain the case if we started adding abstraction layers everywhere.
Testing and maintaining compatibility with all possible plugins would be a huge burden. Right now, dex's integration tests are both comprehensive and fast, and it would be virtually impossible to maintain this level of coverage if we had to test against an ecosystem of plugins.
It's really hard to build good abstraction layers, even for simple things. And these infrastructure components are definitely not simple. We'd be constantly expanding and modifying the APIs to support additional opinions, and the abstractions would inevitably leak.
Many of the infrastructure choices we've made allow us to simplify the design of the platform, and these simplifying assumptions wouldn't be valid if we allowed arbitrary plugins. Tight coupling, in this case, is part of the special sauce for creating a really streamlined and cohesive developer experience.
Comprehensive documentation would be much more complicated and far less useful, since docs would have to simultaneously support the perspective both of the platform developer as well as the end-user developer. dex has lots of docs based on developer use cases, and it wouldn't be possible to provide these if the whole experience were built on plugins.
Kubernetes APIs already provide so much power and extensibility, and especially when you throw in Operators, CRDs, and custom controllers, it's hard to imagine how I could provide APIs that would support all the flexibility Kubernetes offers.

I think realistically, this solution would start as option 1 and then abstractions would be gradually added by contributors to support their particular infrastructure choices, so it's probably best to think of this option as a spectrum with more or less pluggability.

Option 3: A toolkit for building your own platform

Another option is to factor out individual components of the platform as standalone libraries, and let people build their own platform. I could imagine some of dex's components being useful for someone who wants to build a different opinionated platform.

One example of a generally useful component is our config system:

Our config schema is defined as a tree of TypeScript classes, which can be used to generate a JSON schema, which can be used by other tooling to provide instant validation (e.g. via VSCode's JSON schema integration), to validate at runtime, and also to generate documentation.
The config system supports defining arbitrary target environments, which can use inheritance (and other mechanisms) to share common settings, and override them as needed.
It has a mechanism for declaring dynamic config values (e.g. a value from a Parameter Store secret, or based on the current git branch name, etc).
The config loader returns a config tree object which is built out of JavaScript Proxy objects, which allows us to do very smart validation, with user friendly error messages, and play to TypeScript's strengths.
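To illustrate just the environment inheritance idea from that list, here's a toy sketch. It's in Python for brevity (dex itself is TypeScript), and it is not dex's implementation: the "extends" and "settings" keys, the function names, and the example environments are all made up for illustration.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override layered on top, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_environment(environments: dict, name: str) -> dict:
    """Resolve an environment's settings, applying its parents' first."""
    env = environments[name]
    parent = env.get("extends")
    inherited = resolve_environment(environments, parent) if parent else {}
    return deep_merge(inherited, env.get("settings", {}))


# Hypothetical environments: each one only states what differs from its parent.
ENVIRONMENTS = {
    "base": {"settings": {"replicas": 1,
                          "log": {"level": "info", "format": "json"}}},
    "staging": {"extends": "base", "settings": {"replicas": 2}},
    "prod": {"extends": "staging",
             "settings": {"replicas": 6, "log": {"level": "warn"}}},
}

print(resolve_environment(ENVIRONMENTS, "prod"))
# {'replicas': 6, 'log': {'level': 'warn', 'format': 'json'}}
```

Even this toy version hints at the coupling problem: the interesting parts (validation, dynamic values, good error messages) all depend on knowing what the rest of the platform expects from the resolved config.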

That said, turning this config module into a separate npm package would have some tradeoffs:

The inherent packaging tax: working with multiple npm packages is more complex to develop, debug, and test locally.
Its abstractions would feel leaky to a user. For example, JSON schema generation requires some special build configuration, and this would appear a bit finicky if it were intended to be used off the shelf.
There's some other aspects of our config system that are currently tightly coupled with other parts of the platform. This is all just code, so of course we could figure out how to decouple it, but there would be a decent amount of net-new complexity as a result.

More generally, I think the challenge with this approach is that most of the value of the platform stems not from the individual components, but from their integration. For example:

dex's packaging/distribution mechanism (a stable CLI + a fast-moving, versioned library) has many moving parts
dex's own build system is very sophisticated, and has a lot of features around building canary releases, and enabling debugging in a sample host project
dex also has a suite of integration tests that are fairly involved and comprehensive
The documentation of dex's UI (both its command line interface and config interface) is a huge factor in dex's success at SimpliSafe, and would have to be built from scratch for a new platform.

A plea for help

So I'm sitting on this great set of tooling, which is providing a ton of value for my company. It's built on OSS, public cloud, and SaaS services, and there's no proprietary magic or novel intellectual property we're trying to protect. It solves a problem that a huge number of medium-large technology companies would have to tackle.

Why can't I see a way to share this with the world? Maybe I'm just not being imaginative enough. I'm 100% certain this isn't a novel situation.

What do you think?

Building culture is hard, sustaining it is harder

Tue, 06 Aug 2024 00:00:00 GMT

TL;DR

I experienced first-hand what it was like to work in a company with a really strong culture of knowledge management, and watched what it took to build and sustain it. I also witnessed the factors that caused it to eventually crumble.

My current company is struggling with some challenges that are pretty typical for a wildly successful startup that's rapidly grown into a medium-sized company. We've got the expected technical debt, organizational design challenges, and a seemingly infinite number of small systems that work great... until they catch fire as we hit new scaling thresholds.

I've been through this a few times before, and none of this is worrisome to me. It's exactly where I'd expect a company to be after a period of frenetic growth. In a lot of ways, this is a really fun stage; if you're someone who can tolerate a bit of chaos and ambiguity, there's tons of opportunities to shape the direction of a company.

On one hand, at a startup, your priorities are dominated by survival and existential risk. But once you grow beyond a certain size, you begin to lose leverage as cultural inertia takes over. A scrappy, determined visionary can exert a lot of leverage during the awkward teenage years of a mid-sized company.

I want to tell a story about how I watched such a visionary person, at this moment in a company's lifespan, solve a particular problem around knowledge management, and how he drove a really compelling, sustained, positive cultural change.

Yeah, and also how it eventually collapsed, and why.

Someone should go ask Rob why this one button is purple

When I joined Vistaprint, it was already very successful and growing rapidly. It'd been a few years since I'd hopped jobs, so I was a bit nervous and eager to dive in and start getting shit done. This turned out to be a bit of a longer process than I had hoped.

While Vistaprint had a number of cultural virtues, one very annoying feature was how much it relied on oral tradition, and a few linchpin people who functioned as repositories of all knowledge. Seriously, there were like 5 or 6 people in the company that collectively held about 80% of the total knowledge, and it was getting increasingly hard for them to get any work done as the number of people whose questions they needed to answer was growing.

Here's some of the kinds of things I'm talking about:

how to get your developer environment set up
the reasoning behind a particular software system's design
what UI ideas had been tested, and if they had succeeded or failed
why a particular compiler flag was used
how you'd get a new feature toggle enabled for analytics
who the hell owns any particular piece of code

It was also more than just technical knowledge. HR policies, business strategy docs, organizational charts, holiday schedules; if these things existed, they were stored as a Word doc on someone's share drive, and were impossible to discover unless you already knew what you were looking for.

Dan builds a wiki

My colleague Daniel Barrett (who you may know as a prolific author of some fantastic books on Linux and other topics) was already a veteran at Vistaprint when I joined, and was notably one of the "grownups" amidst a bunch of young upstarts. As part of his more general management duties, he had volunteered to figure out how we were going to train the raging torrent of new hires that seemed to be showing up every day.

Dan recognized that training was actually a subset of the larger problem of knowledge management, and convinced the technology leadership team that we needed a more fundamental solution. He got to work on some ideas and started building a small prototype.

Not long afterwards, Dan presented at a technology team all-hands meeting, and introduced us all to a new wiki system (which at the time was called "TechWiki") that he'd built on top of MediaWiki (the software that powers Wikipedia). He had already seeded it with a straw-man categorization (based on his personal knowledge of the systems), and a number of stub pages. He explained his intention that we should all just start using it to write stuff down, and not worry so much about organization. He was going to work with teams to watch and learn how it was being used, and help coordinate efforts as it evolved.

There was a lot of skepticism across a number of fronts. Here are a few of the major objections that I remember:

Everyone had full access to edit any page, at any time. How would we prevent people from screwing up each other's content?
Without a strict, hierarchical taxonomy, wouldn't everything just spiral into an unmaintainable mess?
Developers aren't particularly known to enjoy writing documentation. How are we going to get people to take time away from engineering to write stuff down?

Dan had the wisdom to respond to these questions with the only correct answer: he didn't know. We were just going to try some stuff and see what happened.

The wiki takes off

It took a few months, but we started to notice that adoption of the wiki seemed to be growing pretty quickly. Daniel would pop by to check on us, let us know he had been reading what we'd been writing, and had some tips on how to use the wiki more effectively. He'd give us suggestions on organization, tone, and some general style tips.

He also encouraged us to be less "precious" with the wiki. He said it was a great place to take meeting notes, add team-specific content, and that we should feel free to create stub pages for topics we wished existed (but didn't know much about ourselves). What's more, he said we shouldn't hesitate at all to edit articles when we found missing or out-of-date information, regardless of whether we were the original author, or even if we didn't have any specific expertise or claim on the topic.

In the beginning, I'd often get email notifications that Dan had made minor edits to pages I'd created, usually normalizing titles, fixing typos, or adding category tags. It wasn't very long until I started noticing other people editing my pages. At first, I'd look at every change with some suspicion, but as it turns out... everyone editing my pages was actually doing a great job. They were genuinely improving content I'd written and adding details I'd missed. Even if the content they added wasn't 100% right, it was at least useful for me to know what answers they actually wanted, so I could correct any errors.

Meanwhile, while Dan continued his evangelization, he was also building a team that was driving improvements to the wiki software itself. We got more integrations with our other systems (e.g. JIRA integration, queries into our analytics database, links to shared drives, etc), and better search indexing. They were also working furiously in the background to keep the content categorized in a sane way, fixing typos and structural inconsistencies, adding searchable summaries, and making it possible to relax the stress on authors, while also maintaining some semblance of consistent organization.

Fun fact

From this experience, Dan later wrote the literal book on MediaWiki.

Imagine a world where all the shit is written down

This momentum built on itself, and accelerated exponentially. It wasn't that much longer before the wiki became the go-to place for everything. The most notable effect of this change was that when you found yourself with a question, instead of looking for the expert, the first thing you'd do is just look it up on the wiki. It was shocking how often you'd find the answer, and if you didn't, you'd go find the answer offline, and then go and create a damn wiki page about it.

In stand-ups, team members would remind each other to update the wiki page for that system they just changed, process they added, or question they couldn't find the answer to yesterday. Managers would look at the volume (and quality) of wiki contributions when doing quarterly reviews (I remember one time I was the 2nd or 3rd biggest contributor, and was very proud).

This success was so significant that the rest of the company (outside of the technology team) took notice. There were originally concerns that MediaWiki's content editing interface, which required learning Wikitext (a content markup language similar to Markdown), was going to be too much of a barrier for our non-technical colleagues. This turned out not to be a big deal, as it was pretty easy to learn just the small subset of features you needed to create most content. It wasn't much longer before "TechWiki" was rebranded to "VistaWiki", and was adopted across the whole company.

For someone who hasn't spent time in a company with this level of knowledge management practice, it's hard to describe how positive it was. I have little doubt that the wiki provided us a real, tangible competitive advantage. Experienced new hires inevitably noted how useful the wiki was, and how much it contributed to their effectiveness.

The era of VistaWiki lasted for about a decade, across several transitions of leadership, and through a period of really huge growth. All the while, Dan was working in the background to keep this culture alive and healthy.

Entropy affects culture too

A few years before I left, I noticed that something had changed. The wiki was still there, and Dan and his team were still plugging away, but it became increasingly obvious that employees, more and more, had started taking the wiki for granted. As employees turned over, the wiki culture was emphasized less and less to new hires, who became gradually more likely to try to find an experienced colleague instead of searching the wiki first.

Unfortunately, our tech leadership at this time, who had inherited this culture but didn't fully grasp its significance, wasn't spending much energy on preserving and promoting it. To be fair, they had a lot of other fires to fight, and it's understandable that, like an increasing share of the population, they took it for granted too.

I was part of (middle) upper management at this time, and though we were spending a lot of energy on nurturing culture, knowledge management wasn't included in the features we were supposed to be promoting.

Some teams across the organization started experimenting with different documentation systems; ones that had features that MediaWiki lacked, or supported documentation that was generated from code, or were just more familiar to them from previous jobs. Most of these were implemented in a half-hearted way, contained siloed information, weren't well maintained, and didn't uphold the core value the wiki was built on: that knowledge should be free, transparent, and discoverable across the whole company.

About a year before I left, Dan popped over to let me know he was leaving, and was off to do something new. I was sad to see him go, but not particularly surprised. He'd been working tirelessly on this stuff for years, and was clearly exhausted trying to keep this thing alive in spite of a leadership that wasn't fighting very hard to keep it from disintegrating.

Leadership is hard too

I want to give the Vistaprint leadership of this era some grace, since I've come to know how incredibly hard it is to optimize investment decisions across so many different dimensions. I don't mean to disparage them, but I think it's important to acknowledge this as a major factor in the loss of something that was really special and important.

After Dan left, the degradation accelerated. All the same factors that contributed to the exponential adoption seemed to be working in reverse. At one point I remember someone sending me a Word doc on Slack for something that would have 100% been a wiki article a few years before. It turns out, while this person had used the wiki, they weren't sure how to create an article, and didn't know if anyone would ever look for it there.

This moment reminded me of stories of medieval British peasants, walking past the ruins of Roman aqueducts, wondering where these strange, huge structures had come from, and what kind of people could have built such grand, otherworldly things. Maybe they had been built by giants?

Tips for building your own culture of knowledge management

I think about this experience as just one data point in my broader understanding about what it takes to build (or change) a company's culture. To avoid overstating my point, I'll try to stick specifically to knowledge management.

Here's the lowdown:

Minimize friction for contributors

The experience for contributors needs to be as frictionless as possible. Every barrier you put in front of a potential contributor (having to create a PR, having to follow specific procedures, feeling like they have to ask permission) gets multiplied across every person in the organization.

Culture change is always an uphill battle; don't add unnecessary weight to your pack.

Access control is counter-productive

Don't fool yourself into thinking that locking your shit down is going to improve quality. Access controls are some of the worst kinds of friction; they incentivize information silos, and discourage potential contributors who approach content with an outsider's perspective.

Diversity of perspective is a thing

Don't underestimate the power of an outsider perspective: people who aren't marinating in your specific team's minutiae will spot your implicit assumptions, opaque jargon, missing details, and errors that accumulate over time... especially when you're documenting an evolving system like software.

Empower everyone to edit everything. In any reasonably healthy organization, the number of people creating good, quality content will vastly outnumber the people messing stuff up.

Either way, if you have employees deliberately producing bad edits, you've got bigger problems than knowledge management.

OK, fine.

OK, so some content legitimately requires access controls (e.g. HR/legal policies). Sure, lock that down. Just be really careful not to slide down that slippery slope and start locking down content because you have abstract notions of ownership or quality control.

Don't be precious with content

Avoid imposing implicit, social friction. Encourage contributors not to worry about taxonomy and organization; it creates unnecessary stress which discourages contribution. A pristine taxonomy is worthless without good content.

Also, don't be too pedantic about style, structure, or tone. If you're successful with adoption, you'll have too much content to ever have manually reviewed, so quality is something that inevitably will have to be addressed via culture (and not by creating speed bumps). If you get it right, people will want to write good content by virtue of social incentives, and because they get positive feedback from their peers and managers... or because they want the content for their own team or future selves.

In this situation, coaching and continuous feedback are much more effective than gatekeeping.

warning

Dan pointed out that he thinks a big factor in the success of Vistaprint's wiki was the consistent effort his team put in as editors: keeping things tidy and organized behind the scenes. This effort was obviously a significant cost, but it enabled them to promote the culture of "don't worry, just write it down" which was a huge part of the magic that made it all work.

Prefer a single source of truth

From the perspective of information consumers, you really want the knowledge repository to feel like a single system. You don't want users to have to ponder which of 7 different intranet portals to visit, depending on the type of document. If you don't have an actual unified knowledge management system, a good solution might be a unified search portal, or even just a norm of linking any external content from your primary system.

While a single source of truth isn't necessarily as important for contributors, I haven't yet seen a system with multiple sources of documentation where the fragmented experience doesn't end up discouraging contributors. That said, I'd be curious if there's something that could be done with a federated system (e.g. a central system that scrapes individual sources, but generates content that contains links to edit the source).

You need a jump-start to get critical mass

Knowledge management culture has a chicken/egg paradox component: consumers won't use it if you don't have sufficient content, and it's hard to incentivize contributors to create content unless they believe it will be used by consumers. You need to find a way to prime the system and get it to become self-sustaining.

This probably requires a few different strategies deployed in parallel:

Recruit key, influential people (probably the people who are already information bottlenecks) to start creating content
Practice sustained engagement with teams. Use this to help encourage content creation, remove implicit barriers, and identify sources of friction or stress.
Consider ways to seed the new system with content pulled (probably via automation) from existing sources. Even if this isn't perfect, there's something about seeing incomplete content that gets people to want to fill in the details.

And probably a bunch of others you'll need to discover incrementally along the way. Much like developer experience is a product, so is content contributor experience.

You need a team to sustain and nurture it

Probably most importantly, you'll need someone (and probably a team) championing and driving the culture change. Some companies refer to this role as a "Librarian", but I think it's a lot more than that title evokes. While it is important that this person/team has good instincts around information architecture, it's much more about being an effective evangelist and relationship builder. It's fundamentally about changing the behavior of a large group of people, and that's legitimately hard.

This ongoing battle to sustain culture has two fronts:

Grassroots: contributors have to fully buy in, and understand that the effort they put into creating content benefits them in short order.
Managing up: If upper management doesn't understand the value of a knowledge culture, it's going to be very hard to sustain it. The management hierarchy, at every level, has to be reminded how much they're getting out of the culture, and that they're responsible for keeping it alive.

Let's extrapolate

It might be worth extrapolating a more general principle here about culture change: a great company culture can be a major driver for business success, but culture is a delicate, fickle thing. It can wither and die just as quickly as it was grown.

As leaders, we make decisions all the time about where to direct our limited resources. If there are aspects of your company culture that you believe really matter, take stock of what investment you're actually putting into it. I'm talking real, tangible resources; if you're not making real tradeoffs to sustain it, then you're choosing to let entropy take over. You may one day wake up and find that your teams are behaving in ways that are very counter to the culture you thought you valued.

That one time I did something important

Sun, 02 Jun 2024 00:00:00 GMT

TL;DR

This is the story of the most impactful accomplishment of my career (building Vistaprint's Studio), which happened to be as an individual contributor.

For those of us who've deliberately chosen to remain active technologists, and have resisted the pressure to join management, it's important to remember that innovation is ultimately driven by individuals.

A commonly accepted notion in software engineering leadership is that managers have a much bigger potential for impact on a business than an individual contributor. This is certainly a credible argument, given that a great manager can have a huge impact through building a great team. They're responsible for recruiting the right people, steering the culture, and making the biggest decisions about what risks to take, what opportunities to pursue, etc. Ultimately, they're accountable for what the team delivers.

And of course, a big team can deliver bigger outcomes, with bigger impact, than any individual contributor could on their own. Given this leverage, it's no wonder that it's very rare for companies to have pay scales for individual contributors that match those of managers, especially in senior management.

For those of us who find happiness and fulfillment in working directly with technology, our decision to avoid management can come with a significant economic penalty.

The penalty for optimizing your career for joy

I made an explicit decision a few years ago that I would leave management in order to get back to the things that originally drew me to technology. While I still think of what I do as leadership, I've come to terms with the fact that for the remainder of my career, I'm going to watch my former peers surpass me in titles, power, and especially in compensation.

At the beginning, this was a little hard on my ego, but over the past few years, I've come to a place of contentment. The amount I look forward to any given day of work is directly proportionate to the amount of uninterrupted time I have to work on engineering problems. I've decided my goal should be optimizing my career for happiness, so this tradeoff works for me.

But I want to push back a little on the idea that management is categorically more impactful than individual contribution. The concept is a bit of a tautology; managers take credit for the innovation and impact of the individual contributors they hire. But because of the nature of software, a single person with the right idea, at the right time, can manifest that idea into the world to great effect, sometimes without any organization supporting them at all.

I'd like to tell you the story of the most impactful thing I've ever done, which was as an individual contributor. Even though this was 18 years ago, I honestly don't know if I'll have another chance to do something quite like it.

Do you remember Web 2.0?

Back in 2006, I joined Vistaprint to work on the team that owned its "Studio" application. Studio is Vistaprint's "Photoshop in the browser", where customers can customize and edit designs that will then be used on custom-printed products (most famously business cards, but they also have hundreds of other products). This was a client-side web app, written in JavaScript/CSS, with a backend built (at the time I joined) in VB.NET.

Let me just set the stage, and remind my readers what the web was like in 2006: Microsoft had absolutely dominated the browser space since winning the so-called "browser wars" back in the late nineties. Chrome didn't exist. Firefox had been around for 4 years, but it held a fraction of the market share of IE (version 7 at the time), and had virtually no appreciable advantages in user experience over IE. JavaScript was widely regarded as a toy language, and the browsers' engines were all equally, painfully slow. Honestly, none of us ever imagined JavaScript could be faster.

There was nothing like modern web frameworks such as React or Vue. Even frameworks now considered legacy, such as jQuery, YUI, and MooTools, wouldn't have their first releases until later that year. The leading JavaScript frameworks at the time were Prototype and Dojo. Flash was still considered the technology for interactive web applications.

Comparison

There are now lots of apps that allow users to do sophisticated design in the browser (e.g. Canva, Figma). We didn't have any of the browser technologies back then that make this possible today: the canvas tag, SVG, WebAssembly, WebGL, or even half-decent JavaScript engines.

We were building web apps with mud, sticks, and gumption.

One thing that had happened the year before, however, was the initial release of Google Maps. While scrappy browser hackers had done some really cool and innovative stuff before, the effect of the launch of Maps was like a bomb going off in the web development world. It felt legitimately groundbreaking, and it was obvious at the time that this app was going to change the world.

I'm just putting that picture in your mind so you'll have a sense of what we even thought was possible at the time.

The origin of Vistaprint's Studio

Vistaprint's first Studio had been built back around 2001, by a brilliant hacker who embodied the type of contrarian scrappiness that was required to do anything on the web at the time. Despite the primitive browsers of the era, he was able to use a number of techniques, moderately well-known at the time (but not widely used), to do the kinds of things that web developers would later do with AJAX.

The most important of these hacks was to use multiply-nested iframes (and a makeshift protocol that looked a little like JSONP) to communicate with the server without requiring a navigation on the main page. This allowed him to effectively simulate AJAX requests before browsers even had the capability.

What's even crazier about this is that the client-side code for Studio was used in an insane (but also brilliant) hack, where they loaded instances of IE server-side in order to measure text bounding boxes, so that they could render documents for manufacturing. It turns out text rendering engines are really freaking complex, and given the fact that we relied on IE for text layout on the client, the most reliable way to ensure that customers' printed products were faithful to their browser renderings was to use the browser to render them on the server.

Hat tip

Despite all this hackery, I'm never going to criticize the folks that built what became a multi-billion dollar company. You do what you have to do to get a business off the ground, and Vistaprint's early team did incredibly creative and audacious stuff.

Studio in 2006

By the time I arrived in 2006, Vistaprint had built a very successful business on Studio, and had recently had their IPO, after delivering revenues of about $90MM in 2005. Studio was considered one of their major strategic "pillars", along with their novel manufacturing capabilities.

I spent the first couple of months working on Studio, trying to get my bearings, and wrap my head around what had become a pretty nasty mess of spaghetti over the past few years. Around that time, the team lead had decided to move away to follow his girlfriend out of state, and I was left in charge of an app I barely understood. It wasn't just me, though: my colleagues admitted that none of them had any confidence they could make any significant changes to the application without breaking things.

This is, as a matter of fact, exactly what happened... repeatedly. Vistaprint was growing its product portfolio, as well as trying to iterate to improve usability, and there had been a series of disastrous attempts to add new features to Studio to support this, each followed by an emergency rollback.

Beyond the maintainability challenges, you have to understand what the user experience for Studio was actually like. Because of all the complexity required to make the multiple-iframe communication mechanism work, along with years of features being layered on top, the user interface took a minimum of 60 seconds to load, even on a fast internet connection. About 20% of the time, something would fail (usually due to race conditions) and require a reload.

The app was deliberately styled to look like a Windows 95 desktop app, with CSS that had been carefully crafted to match the beveled edges, corporate grays, button styles, and fonts.

Studio only worked in IE. If you were unlucky enough to be using Firefox, Opera, or Konqueror, or you were on a Mac, you'd get redirected to a very limited, server-side, form-based page where you couldn't do any customization to your document other than edit the text.

The user interface was rife with bugs. We didn't have any real observability, but anecdotally, users would experience a blocking bug in at least 15% of sessions.

What's more, it was becoming less and less reliable to depend on IE to do server-side rendering, since over time, IE's text engine became more and more influenced by settings on Windows, graphics drivers, etc. We had a certain number of documents that just got printed completely wrong, and had to be manually modified by humans in our manufacturing plants.

Hitting rock bottom

It wasn't long before I caused my first major production incident by attempting a bug fix in Studio, despite having been through what felt like a very meticulous QA cycle. After the rollback, we calculated the losses from the incident at about $20K, and I felt pretty deflated. My boss helped to put things in perspective, noting that these kinds of losses were common in Studio, and my predecessor had caused many such incidents.

I spent a couple days feeling sorry for myself, and then resolve set in. I was having none of this. This was not OK.

Kindling and a spark

Around this time, a couple engineers had been working on a new set of server-side text rendering services that we could use for simpler products that didn't require Studio (this was especially appealing at the time, because the conversion rate in Studio was so terrible). I saw a demo that they'd built, and found myself unable to stop thinking about it for several days. One evening, while trying to get to sleep, I had a crazy idea.

What if we could build a brand-new Studio from scratch, where the document's elements would be composed of a set of server-rendered images? The client-side code would just be an interface for moving, resizing, and opening an editor for these elements, which, from the perspective of the client, would just be rectangles with a set of properties. These elements would be the same types of elements users could edit in the legacy Studio (e.g. text boxes, images, vector shapes, etc), but we'd build server-side rendering services for each one, which would output transparent PNG images so they could be composited together on the client.
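
To make that shape concrete, here's a minimal sketch of the document model. It's written in Python purely for illustration (the real thing was JavaScript on the client), and every name in it, including the `Element` class and the `/render/<kind>` endpoint, is a hypothetical stand-in, not Vistaprint's actual API. The point is only that the client needs to know very little: a rectangle, some properties, and a URL where the server will hand back a transparent PNG.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class Element:
    """One document element: to the client, just a rectangle with properties."""
    kind: str            # e.g. "text", "image", "shape"
    x: int
    y: int
    width: int
    height: int
    z: int = 0           # stacking order used when compositing
    props: dict = field(default_factory=dict)  # element-specific settings

    def render_url(self, base: str = "/render") -> str:
        # Hypothetical endpoint: the server would return a transparent PNG
        # of exactly this element at the requested size.
        query = urlencode({"w": self.width, "h": self.height, **self.props})
        return f"{base}/{self.kind}?{query}"


def composite_plan(elements):
    """Return (x, y, url) tuples in back-to-front order for the client to stack."""
    ordered = sorted(elements, key=lambda e: e.z)
    return [(e.x, e.y, e.render_url()) for e in ordered]


doc = [
    Element("text", 40, 30, 200, 24, z=2, props={"text": "Jane Doe", "font": "Arial"}),
    Element("image", 0, 0, 336, 192, z=1, props={"src": "bg-17"}),
]
for x, y, url in composite_plan(doc):
    print(x, y, url)
```

Moving or resizing an element only changes its rectangle; the client never has to understand how text is laid out.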

The user could then just double click on any of the rectangles to open an element-specific editor. So for text boxes, this would open a simple text-box editor, which would allow the user to type, and then we'd debounce the keypress events to trigger a refresh of the server-rendered text.
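
If you haven't run into debouncing before, the idea is roughly this: every keypress restarts a short timer, and only when the user pauses does a single render request go out. Here's a small sketch in Python with a simulated clock (the original lived in browser JavaScript and used real timers; the `Debouncer` class and the 0.3 second wait are my own illustrative assumptions, not the values we shipped).

```python
class Debouncer:
    """Fire an action only after `wait` seconds pass with no new events."""

    def __init__(self, wait, action):
        self.wait = wait
        self.action = action
        self._pending = None   # latest value not yet sent
        self._deadline = None  # time at which it should be sent

    def event(self, value, now):
        # Every keypress replaces the pending value and restarts the clock.
        self._pending = value
        self._deadline = now + self.wait

    def tick(self, now):
        # Called periodically; fires once the quiet period has elapsed.
        if self._deadline is not None and now >= self._deadline:
            value, self._pending, self._deadline = self._pending, None, None
            self.action(value)


requests = []  # stands in for "ask the server to re-render this text"
d = Debouncer(wait=0.3, action=requests.append)

# Simulated typing: (timestamp in seconds, text box contents).
for t, text in [(0.00, "J"), (0.10, "Ja"), (0.20, "Jan"), (0.25, "Jane")]:
    d.tick(t)
    d.event(text, t)

d.tick(0.40)   # only 0.15s since the last key: nothing sent yet
d.tick(0.60)   # quiet for 0.35s: one render request goes out
print(requests)
```

Four keypresses produce one request, which is what keeps a server-rendered editor from hammering the rendering service while someone types.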

This way, the documents we produced client-side could be faithfully rendered server-side using the same text-layout engine, and we could remove a huge amount of complexity on the client.

The pitch

The next day, I spent some time with the engineers who had done the text rendering work, and we started working through the details of the idea. Once we felt like we had something viable, we brought my boss, Satish, into the conversation. Having a shared experience of pain with Studio, he immediately arranged a meeting with Wendy, Vistaprint's head of "Capabilities Development" (I think she was technically the CIO at the time, but was still directly leading the Engineering team).

I explained the idea to her over the next half hour, and left with permission to suspend feature development on Studio, and to work with one of the rendering guys for a few weeks to build a prototype.

Why not Flash?

In the following few years, I had to address a lot of questions from folks about why I'd decided to use pure HTML/JavaScript instead of Flash. This was a year before the iPhone launched, and several years before Steve Jobs's famous public refusal to allow Flash to run on it. Flash was still considered by many to be the best choice for rich, interactive experiences.

The real reason we didn't want to use Flash was that it would have made us dependent on Flash's proprietary text rendering engine (like we were on IE before it) for server-side rendering. It also wasn't clear that it was possible to use Flash server-side, or that Adobe wouldn't change something at any point that would break our whole system.

This turned out to be another very fortunate decision.

Building the new Studio

Within a week, the two of us had a working version of Studio that could create a new document, and had some basic editing features, including text editing and drag/drop positioning for all document elements. The results were fairly stunning in contrast to the legacy Studio:

It loaded in just a couple seconds
It worked in IE, but also Firefox and Opera. It also worked on a Mac.
It was smooth, snappy and responsive
The feel of the pop-up text editor, which we were afraid might be weird, was totally fine.

As soon as she saw the prototype, Wendy gave us the green light to go all in. I spent the next few months turning the prototype into a real replacement for the legacy Studio, adding support for each of the element types needed to support our most important products, including business cards and postcards. My colleagues on the rendering side worked on building a new version of the program that would transform Studio documents into press-ready PDFs, using the new server-side text rendering engine (and NOT using IE).

Vistaprint's Studio, today. The code has been rewritten, but it still uses the same architecture I helped create in 2006.

We launched via an A/B test shortly thereafter. Most A/B tests run for changes to Studio over the previous few years had produced either negative or statistically insignificant results, and despite how much better our new version felt, we thought the odds of hitting a home run on the first pitch were pretty low. We would have been happy with breaking even, at which point we would have been able to take advantage of the more maintainable codebase, and focus on optimizing.

When the first A/B test came back, we were floored. Conversion rate was up by about 5 points. Out of the gate, this hack was immediately worth tens of millions of dollars a year for Vistaprint, just for one product! And we had done zero optimization.

The results

Over the next year, the effects of this success rippled throughout the company. A large number of engineers were redirected to execute changes throughout the system needed to replace the old IE-based document rendering with the new server-side rendering engine. A team was built around me to keep developing the new Studio, and we gradually added the features needed to support an increasing share of Vistaprint's product portfolio.

All this time, Vistaprint was growing like crazy. Each time we'd move a product over to the new Studio, we'd see a huge jump in conversion rate. New core capabilities were being built on top of the new Studio architecture, and a ton of new design content, products, and features were enabled. The process of rendering documents for manufacturing was far more efficient now that we had a reliable way to render documents that were faithful to the users' intentions, and we no longer needed a small army of humans to fix broken documents.

Every year we were growing revenue by hundreds of millions of dollars. It was an incredible ride.

Reflecting on this success

I want to be clear that Vistaprint's success was due to many critical innovations and an enormous amount of work by many, many people, in areas like manufacturing, content design tooling, marketing, ecommerce, etc.

Also: even though I had come up with the core idea for the new Studio, it was based on many of the ideas from the old Studio, which itself required a lot of independent innovations. Beyond that, there's no way I could have come up with this idea, or had any hope of making it work, without the insight and skill of my teammates who had figured out the server-side rendering.

What's more, none of the engineering work I did was groundbreaking or mindblowing. I just synthesized some disparate ideas, from both inside and outside Vistaprint, and glued it all together with some (fairly decent) JavaScript and C#. I was just the right person, at the right place, at the right moment, with the right idea.

Even so, I still occasionally catch myself dwelling pridefully on this achievement. I imagine an alternate universe where I never joined Vistaprint, where they tried to incrementally improve the old Studio architecture. I don't see how they could have had the success that they did in this universe; the difference might be measured in hundreds of millions (maybe billions?) of dollars at this point.

I've done a lot of things I'm proud of since then, but I don't know how likely it is that I'll ever play such a pivotal role in building a multi-billion dollar company again.

Innovation comes from individuals

Thinking back on this episode in my career has been useful to remind myself that impact, especially via innovation, is ultimately driven by individual contributors. This is really important to remember for those of us who've chosen to optimize our careers around the joy of being a technologist, especially when the social and financial pressures to advance our careers through management are so potent.

My contrarian thesis aside, I have to acknowledge the complex interrelationship between ICs and managers when it comes to innovation. Technology leaders play their part to drive innovation by actively building a culture of empowerment and risk-taking, and being willing to make big bets on individuals with vision.

Perhaps it was only implied in my story, but this is exactly what Wendy had done for Vistaprint, long before I had arrived. She built an amazing team, and fostered a culture where engineers felt supported, trusted, and safe enough to invest their time where they thought opportunities existed.

I feel a great deal of gratitude to Wendy for having taken a risk in empowering me. It was a truly formative experience, and I still can't believe how lucky I was that the work I did had such a lasting effect on a great company.

Developer experience is a product

Sun, 26 May 2024 00:00:00 GMT

TL;DR

The most important feature of an internal developer platform is that the team that builds it has to compete to win over their users.

Figure out your initial value proposition, build a minimum viable product, get it in front of customers, listen, learn, and iterate.

Platforms imposed by a top-down mandate tend to fail.

Over the past 15 years, I've been working on one form or another of internal developer platform. Even long before, while working at small startups, I inevitably ended up building (or curating) some little web framework, a build system, and slapping together scripts to package and deploy our stuff reliably. No one ever told me to do this, it was just obviously necessary.

In these cases, I was building a product for myself and my immediate team members, so it was a pretty tight feedback loop with the customer. I'd put a little extra effort to make things nice for other developers on my team, and also out of a bit of pride in making something that felt elegant.

Developer experience in the monolith

At the first larger company I worked for, I worked on improving the developer (and user) experience on top of a giant, pre-existing monolithic app, with a lot of custom tooling. One needed custom tooling when dealing with a monolith of several million lines of code being concurrently modified by 200 developers, especially since there wasn't really any off-the-shelf tooling available that could handle this scale.

Since there was already an established build and deployment system, I was mostly focused on improving the experience of web developers. At that time, the challenge was mostly around providing a sprawling army of mostly backend developers with a decent library of web UI components, and achieving some semblance of brand consistency.

This whole thing required some culture change, and a lot of outreach. I had no power to enforce usage of our web framework, nor any power to force web designers to work within the constraints we'd defined together. To get the designers on board, we needed to build some trust, listen to their concerns, and help them see we were trying to help them realize their vision with greater fidelity.

For developers, it just required that our framework was easier and better at helping teams make their pages look like what the designers wanted. Ultimately, no one ever had to force anyone to use our framework, it just made things easier for everyone, so they did it.

The next time we had to do a brand refresh of the site, it only took a couple people a week or so, whereas the last rebrand had been a major project across the whole company that took months. This was a small win against the entropy which was slowly devouring our monolith.

Microservice babies

A few years later, it was becoming apparent that we had been gradually losing productivity in our monolith, and there were some factions interested in pursuing a service-oriented architecture. A new platform team started working on a set of tooling to enable teams to stand up independent services outside of the monolith.

In a marked contrast to the proprietary infrastructure we'd been using for the monolith, they were toying around with a bunch of different open source and vendor tools. After some prototyping and getting an initial MVP built, some teams started using their stuff.

Unfortunately, the fate of this particular platform was to fizzle. In retrospect, there was a lot going against it:

We weren't yet using a public cloud (they were targeting on-prem infrastructure)
Kubernetes and containers were in their infancy
We were legitimately deluded about what it would take to make microservices actually work. Seriously, we were like little babies.

These were pretty strong headwinds, but there was another factor I can see in retrospect, which was even more critical:

The platform team didn't spend enough time learning from their customers, or trying to understand the actual problems they were facing.

I remember several specific tales of teams working on building services outside our monolith using their tooling, running into some friction, and experiencing something other than empathy from the platform team. At least once I remember a VP getting involved to put pressure on a team that was expressing reservations and looking for alternative approaches.

Caution: Fuzzy recollection

My recollection may not be 100% accurate, since it was a while ago, and I wasn't privy to all the goings-on. I have the impression the team gained the backing of leadership, who provided some degree of pressure on teams to adopt their tooling.

I'm not sure to what degree this pressure was applied in practice, but I do remember that the teams believed usage of the platform was expected of us.

Building a great platform for the wrong customer

What was wrong with the product? Let me explain:

The toolkit was built as a set of Ruby gems, referenced via a root gem that composed them to enable some higher-level operations. Each gem was responsible for interacting with some of the platform's parts, such as Artifactory, Jenkins, or whatever deployment tool we were using (I think at one point it was Octopus Deploy?). The tool would scaffold out a rakefile, with predefined tasks (e.g. build, deploy) the user could execute with the rake CLI. The user could then customize their rakefile, combining these gems to implement all the custom processes their service needed, but a lot of the low-level details would be taken care of within the gems.

Here's where the problems started: the gems were fairly coarse-grained, strongly opinionated, and had pretty limited extensibility. They were also composed and packaged in a way that made it hard to replace any single gem with an alternative implementation. The options for a user that had an unsupported use case were pretty limited:

Get the tooling owners to implement the feature
Implement the feature and try to convince the owners to take the patch
Implement the feature from scratch outside of the pre-existing gems

Hindsight

I think that monkey patching may also have been an option. I don't think any of us had enough Ruby experience to know that was a thing.

All of these options were particularly unappealing, partially because there was virtually no Ruby experience to be found among our developer population. But more importantly, after a number of teams' feature requests were met with apathy (and a bit of paternalistic "you're doing it wrong"), the "brand" of the tooling began to suffer.

Despite strong top-down pressure to use the tooling, teams openly rebelled and began piecing together their own bespoke solutions. In the long run, management gave up the fight, because ultimately, they just cared that business problems were getting solved.

In retrospect, I think leadership's decision to push the tooling was its death knell. The platform team lost its incentive to win the trust of the developers, and got caught up in their own vision. They built a beautiful product, it just didn't happen to be the product we needed.

Reality check

I should note that despite how things turned out with this particular platform, I can't deny that this team's work had a huge influence on the way I've thought about platform engineering ever since.

This is one of the rare occasions on which I've had the wherewithal to learn from others' failures instead of my usual approach of repeating all the same mistakes myself.

Kubernetes emerges from the chaos

A few years and a regime change later, we had a whole lot of teams individually managing their own build/deployment tooling. This was, in no small part, a reaction to the bad experiences many of us had with the aforementioned platform team. It seemed obvious to me at the time that there was a lot of waste in having every team have to rediscover their own solution, but I also acknowledged that the alternative of a central platform team managing this for everyone hadn't worked so well last time.

Meanwhile, management had blessed adoption of AWS. At the time, we had a vague and naive idea that AWS was a ready-to-use platform, and hadn't yet come to terms with its true character: an extremely powerful, but low-level set of primitives. They had a few offerings at the time that looked a little like a PaaS if you squinted, but we seriously underestimated the amount of boilerplate glue scripts we had to write to, for example, get a service built and deployed on ECS or Elastic Beanstalk.

One team in particular had been toying around with Kubernetes and was having some success. While I'd used docker a bit, and had been following the orchestrator wars (mostly as a lurker on Hacker News), I didn't yet see what the big deal was. But smart people I respected were saying good things about it, including words I liked, like "rolling deployments", "autoscaling", and "self-healing".

I had just spent the previous 2 months trying to help another team, who had been struggling to execute a basic blue-green deployment with CloudFormation. Then I saw a demo in which a kubectl apply of a single deployment.yaml file executed a seamless rolling update of a service within a minute, and I was sold.

As I learned more about the abstractions Kubernetes was built around, my thoughts returned to the idea of creating a developer platform. It seemed possible that containers and the Kubernetes API might be the membrane we needed to give developers autonomy over all things they cared about, while enabling central management of the stuff they didn't. The ingredients of the devops stew were finally all out on the counter.

It took some convincing, but I managed to get some of the influential developers on board with the idea that we'd create a new platform team, and attempt to build a scaled-up, multi-tenant version of the Kubernetes solution they had pioneered. We started the team, and spent most of the first month learning how to build and operate a cluster with kOps (EKS either didn't exist yet, or was too new to consider seriously).

We got a couple of the teams to try it out, and found that it was, indeed, a Kubernetes cluster; it allowed us to define workloads and roll them out reliably. This was a huge improvement. But it didn't take long until the teams using it had accumulated a bunch of shell scripts and additional tooling to manage a few other things:

Authentication to Kubernetes, Artifactory, and other services
Running docker builds (passing in build args, ssh-agent sockets, managing cache volumes, etc)
Defining per-environment configurations
Syncing secrets between our secret store and Kubernetes
Managing load balancers, DNS, and certs
Orchestrating integration tests with a bunch of docker containers

Once again, each individual team was re-solving the same problems, each with their own flavor of tradeoffs and bugs. Clearly, we had a lot more opportunity to provide value here.

Coincidentally, I was fighting a little burnout around this time, and ended up deciding that 13 years was long enough in one company. I never got to take this particular platform further, but the shape of the problem space had become a lot clearer in my head.

Three developer platforms later, lessons learned

Over the 7 years since, I've iterated on this idea three more times at two different companies, each time building on Kubernetes. The results have been increasingly compelling with each iteration, and I've added a lot of key elements to the approach. The central idea has become:

The platform encapsulates the operational, cultural, and security opinions of the organization, gluing together the company's chosen infrastructure and tooling.

There are a lot of principles and patterns underneath this high-level idea, but there are a few, universal key dimensions along which you have to strike a balance:

Finding the right line between things that have to be standardized, and things where there's value in flexibility and autonomy for teams.
Adding enough power so that the platform can support all the use cases in your company, while also having a small number of simple, default paths that work for the vast majority of cases.
Creating the right extensibility points, allowing teams to solve their own problems without the platform team being a bottleneck, but still maintaining enough coherence in the core aspects of the platform so it can evolve and improve over time.

The right balance in these dimensions is highly dependent on the culture and values of your organization, but there's one thing I'm pretty sure is universal, which is how you find that balance:

Treat your developer experience like a product.

Finding product-market fit

I don't think this is any different from how a startup would approach things:

Observe your developers, listen to them, learn about their pain
Formulate hypotheses about how you can alleviate that pain
Build a minimum viable product
Get it in front of developers
Listen, learn, and iterate
Pivot if what you're building doesn't resonate

Don't fall in love with your own vision. When developers ask for a feature, don't dismiss them, even if you don't see where it fits on your roadmap. Regardless of the implementation details they may be stuck on, they're giving you critical information about their pain points. Don't squander that opportunity.

If you're exceptionally visionary, you may have innovative, paradigm-shifting ideas for solutions that developers don't even know they need. That's great, but you should slow your roll. Use the scientific method: test and learn. Maybe you have the wisdom of Solomon Hykes, but the odds are against you. In reality, 99% of ideas you think are novel aren't actually new, they just got quietly discarded because they didn't work in practice.

For an internal developer platform, you don't have to be particularly innovative, and you certainly don't have to be original. In fact, it's usually a lot better to shamelessly steal ideas from successful platforms outside your company. Bias towards open-source tools, vendor or cloud-provider products. Rip off the CLI interface of a popular PaaS product your developers are already using for their side hustle.

Congratulations, you're a brand manager

And like a startup, you're also the steward of your product's brand. You have to earn trust with your customers, show them that you're listening to their feedback, and that you're committed to making their lives better.

Your brand is also relevant to stakeholders besides your direct customers, including leadership, security teams, product owners, etc. If they don't understand your value proposition, there's a good chance they'll be asking uncomfortable questions at a moment when it's least helpful.

Remember, you don't have the monopoly you think you do

You don't want to be this guy.

In a few cases, especially with the latest developer platform I've worked on, I've had to fend off requests from leadership who'd like to accelerate adoption by cranking up pressure on teams to use our stuff. Certainly, there are benefits to the organization in standardizing (especially for security and cost management). But each time I push back.

For one thing, we haven't needed to do anything to drive demand; teams are migrating services whenever they can spare a sprint... because they like what we've built and they know they have a say in the direction we take it. We're going to get to 100% adoption at some point soon, and we won't have ever forced anyone's hand.

I think this principle is pretty universal for teams working on internal tooling. When you're tempted to use management to force people to use your product, step back and consider the big picture. You don't have the monopoly you think you do. Companies evolve and change; new executives and managers come into power, technologies evolve, and the business climate changes.

If you want to stay on top, you have to acknowledge that you're always competing for your customers' business. If they're happy with the platform, and trust you to keep improving it, they'll defend you against shifting tides. If they're not, they'll abandon ship as soon as another option presents itself.

Prometheus vendor death match

Sun, 12 May 2024 00:00:00 GMT

TL;DR

We evaluated a number of observability vendors, with a focus on metrics, and did detailed PoCs with both Chronosphere and Grafana Cloud. Both are excellent products, and have slightly different strengths.

At work, we're in the process of rebuilding our metrics pipeline, as we've outgrown our old self-managed TIG (Telegraf, InfluxDB, Grafana) solution. We've had this solution in place for many years, and it's served us well. Especially given the increasingly predatory pricing models of observability vendors, it's been extraordinarily cost-effective.

But over the last couple years, as we've grown, we've started to hit the limits of what we can handle with a single, vertically scaled instance of InfluxDB (especially using InfluxDB v1). It was increasingly stressful to keep it running smoothly, and we had to be very vigilant about cardinality, as it's very easy to accidentally introduce a cardinality explosion that can bring down the entire database.

Just upgrading InfluxDB would have been similar in scope to moving to a new vendor, since InfluxDB v2 has a new query language, and we would have had to rewrite all our queries and dashboards anyways. So we decided to take the opportunity to do a proper RFP, and see if we could find a vendor to take this entire problem off our plate.

The landscape

We decided to focus specifically on metrics, since we needed to limit the scope of our evaluation by some criteria. Observability products vary drastically across many dimensions, and metrics were the area where we were in the most pain.

The vendors we considered fell into a few categories:

The dominant players

These include both Datadog and New Relic, which are both well established and very feature-rich. They're also known for being extremely expensive, and having pricing models that are difficult to predict or control. I've talked to some friends who've worked with them, and they said although they were great products, it was pretty typical that the cost would be 30% over an already bloated budget every year. But because they were so locked in, every year they'd have to renew, and the sales people would show up looking for another pound of flesh.

Another thing we noticed about the dominant players was their transparently conflicted relationship with OpenTelemetry. While OTEL support features prominently in their marketing materials, their documentation tells a different story. Customers who choose to instrument their systems with OTEL SDKs will find that they're missing a whole lot of the best features of these products. The sales folks were not exactly subtle about recommending we run the evaluation using their proprietary agents instead.

Yuck on both fronts.

The up-and-comers

There's a few interesting, smaller players in the space. We looked at SigNoz, Logit, and a few others. They all appeared to be offering basically the same thing: a hosted, Prometheus-compatible backend along with a Grafana-based front end. They all had very competitive pricing, but we felt a bit concerned at how immature they were, and decided against doing a full evaluation.

The cloud-provider option

Since we're an AWS shop, we also considered using AWS's Managed Prometheus offering, which would have simplified some of the operational complexity of running a Prometheus backend (e.g. Thanos or Cortex) ourselves. Doing some back-of-the-napkin math, we realized that, if we didn't do anything differently, we'd end up spending about 3x what we were currently spending on InfluxDB. Plus, we'd still have to manage our own Grafana instance.

The goldilocks zone

We also looked at a few vendors that were in the middle of price distribution, such as Chronosphere and Grafana Cloud. These were also Prometheus-compatible, with Grafana front-ends, but both companies were reasonably established, and had similar looking feature sets.

Chronosphere was the first vendor we decided to evaluate, because their sales pitch included something we hadn't heard from any other vendor: they'd provide a way for us to manage costs with powerful, centralized ingestion controls, and as a result, could offer us predictable pricing.

This piqued our interest, not just for managing costs, but because we'd long had problems with cardinality. At any moment, cardinality from any given service could unexpectedly explode, based on the decisions of a single programmer. For example, we've occasionally had instances of programmers inserting a metric label where the value is a unique ID (such as a customer ID), which could have hundreds of thousands of possible values. Previously, we had to be extremely vigilant about this, and pounce on any team that introduced cardinality explosions before they could bring down our InfluxDB backend.

So having a way to manage cardinality, centrally, was very enticing.
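To make the customer ID example concrete, here's a rough back-of-the-napkin sketch. The label names and counts are made up for illustration (they're not our real numbers); the point is only that the worst-case number of series for a metric is the product of the distinct values of each of its labels, so one unbounded label multiplies everything else.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on active series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_value_counts.values())


# Hypothetical label counts, for illustration only.
baseline = {"service": 50, "status_code": 5, "region": 3}
with_customer_id = {**baseline, "customer_id": 200_000}

print(worst_case_series(baseline))          # 750
print(worst_case_series(with_customer_id))  # 150000000
```

In practice not every combination of label values actually occurs, so the real number is lower than this bound, but the multiplication is why a single careless label can be so dangerous.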

Evaluating Chronosphere

We decided to proceed with a PoC of Chronosphere. We started with some changes to our metrics pipeline infrastructure, adding OpenTelemetry Collectors to help capture and redirect our current metrics data (which was coming mostly from telegraf), so that we could send metrics to both our in-house InfluxDB and Chronosphere concurrently.

We had, as part of a previous set of experiments, already set up some common Prometheus metrics infrastructure in our Kubernetes clusters, including kube-state-metrics, node-exporter, and cadvisor. We were able to easily point these at Chronosphere as well.

The sheer volume of metrics

The first thing we realized was that we were sitting on an enormous amount of cardinality. Chronosphere reported that we were generating over 8 million active series, and our sales engineers were flabbergasted about how we were even able to handle all of it with a single InfluxDB server.

Fun fact

Actually, every vendor we talked to was certain we were mistaken when we reported to them we were handling 8-9 million active series in a single InfluxDB; they assured us that this wasn't possible.

And yet, somehow we were doing it.

The Chronosphere control plane

Our sales engineers immediately got to work helping us learn how to use their "control plane" feature, which allows you to write fairly arbitrary rules which can select metrics by virtually any criteria (names, label values, and/or combinations based on boolean expressions), and perform complex transformations on them, including:

Drop them entirely
Aggregate away high-cardinality labels
More complex transformations, such as changing the temporality of the counters (e.g. change a "delta" counter to a "cumulative" counter)
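As an illustration of what "aggregating away" a high-cardinality label means, here's a minimal sketch in plain Python. This is not Chronosphere's rule syntax, and the metric and label names are hypothetical; it just shows the idea of removing a label and summing the series that become identical. Summing is the right choice for counters, and I'm assuming that here; gauges would need a different aggregation.

```python
from collections import defaultdict


def aggregate_away(samples, drop_labels):
    """Sum sample values after removing `drop_labels`, so that series which
    differed only by those labels collapse into a single series."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[(name, kept)] += value
    return dict(totals)


# Hypothetical samples: (metric name, labels, value).
samples = [
    ("http_requests_total", {"service": "checkout", "customer_id": "c1"}, 3.0),
    ("http_requests_total", {"service": "checkout", "customer_id": "c2"}, 4.0),
    ("http_requests_total", {"service": "search", "customer_id": "c1"}, 5.0),
]

result = aggregate_away(samples, {"customer_id"})
for (name, labels), value in result.items():
    print(name, dict(labels), value)
```

Three input series become two: one per service, with the per-customer breakdown gone.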

It was immediately clear that their control plane was extremely powerful. We did a bit of analysis on the highest cardinality metrics coming in, and by cross-referencing them with a JSON export of our existing Grafana dashboards, were able to create a relatively small number of rules that reduced our cardinality by about 60%. It took a bit more work to go farther, but we eventually got our cardinality down to about 1.7 million active series.
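The cross-referencing step can be sketched roughly like this. It's a simplified reconstruction rather than the script we actually ran, and the metric names and counts are invented. Instead of assuming any particular dashboard schema, it walks every string in the exported JSON and treats a metric as "used" if its name appears anywhere.

```python
def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def unused_metrics(cardinality_by_metric, dashboards):
    """Return (metric, series_count) pairs that no dashboard mentions,
    largest first. These are the first candidates for drop rules."""
    haystack = "\n".join(s for d in dashboards for s in iter_strings(d))
    unused = [(m, n) for m, n in cardinality_by_metric.items() if m not in haystack]
    return sorted(unused, key=lambda pair: pair[1], reverse=True)


# Hypothetical inputs: series counts per metric, and dashboards as they
# would look after json.load() on the exported files.
cardinality = {
    "procstat_cpu_usage": 900_000,
    "diskio_reads": 400_000,
    "http_requests_total": 50_000,
}
dashboards = [
    {"panels": [{"targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]}]},
]

print(unused_metrics(cardinality, dashboards))
```

Substring matching errs on the safe side: a short metric name that happens to appear inside a longer one is counted as used, so nothing that's on a dashboard gets flagged for dropping by mistake.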

At this point, since we had a handle on our active series volume, Chronosphere's sales folks gave us an initial price, which turned out to be very reasonable.

Holy cow, it looked like this just might work!

Using Chronosphere

Once we had addressed the initial concerns around affordability, we got to work evaluating the product's overall fit. We had a bunch of teams convert some of their InfluxDB-backed dashboards and alerts over to Chronosphere, and started to get a feel for how it would be to use it day-to-day.

Since Chronosphere's UI was based on Grafana (v7), it turned out to be very similar to our self-managed Grafana/InfluxDB from a developer perspective, with the main differences being:

The PromQL language
Much better query performance

After a few weeks of playing with the product, we were satisfied it would do the job. We gave it the thumbs up.

Evaluating Grafana Cloud

Initially, we had sort of written off Grafana Cloud, since the price they gave us originally, based on our active series in InfluxDB, was in the same range as New Relic and Datadog. However, this was before we realized that they had a feature that was similar to Chronosphere's control plane, called Adaptive Metrics.

We told the Grafana sales team that, using Chronosphere, we'd been able to reduce our metrics to under 2 million active series, and asked for a new quote based on the assumption we could use Adaptive Metrics to get similar results in Grafana Cloud.

They came back with a price that was almost exactly the same as Chronosphere. The race was back on!

Using Adaptive Metrics

Once we updated our metrics pipeline to export to Grafana Cloud, and had a chance to start playing with Adaptive Metrics, we were disappointed to find that it wasn't nearly as powerful as Chronosphere's control plane. The biggest difference was that you could only target metrics based on their names, and not their labels or values. This was a big limitation, as we had a lot of rules we had written in Chronosphere that did things like:

Drop all metrics from a specific service, except for a few key ones
Drop a particular metric generated by a telegraf plugin (e.g. procstat or diskio), but not for services in an "allowlist"
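To show why name-only matching falls short for rules like these, here's a small sketch of the two rule shapes as a plain Python predicate. The service names, metric names, and the "service" label are all hypothetical, and this isn't the syntax of either vendor's product. Notice that both rules need the service label to make a decision, so the metric name alone isn't enough.

```python
# Hypothetical rule data, for illustration only.
# Rule 1: for these services, keep only the listed metrics.
KEEP_ONLY = {"legacy-batch": {"job_duration_seconds", "job_failures_total"}}
# Rule 2: drop these metrics unless the service is on the allowlist.
DROP_UNLESS_ALLOWLISTED = {"procstat_cpu_usage": {"payments", "checkout"}}


def should_drop(name: str, labels: dict[str, str]) -> bool:
    """Decide whether a metric data point should be dropped at ingestion."""
    service = labels.get("service", "")
    # Rule 1: drop everything from a listed service except its key metrics.
    if service in KEEP_ONLY and name not in KEEP_ONLY[service]:
        return True
    # Rule 2: drop a listed metric for every service not on its allowlist.
    if name in DROP_UNLESS_ALLOWLISTED and service not in DROP_UNLESS_ALLOWLISTED[name]:
        return True
    return False


print(should_drop("gc_pause_seconds", {"service": "legacy-batch"}))  # True
print(should_drop("procstat_cpu_usage", {"service": "payments"}))    # False
```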

But we really, really liked Grafana Cloud

Aside from cardinality management, where Chronosphere clearly had the lead, we found a lot of areas where we preferred Grafana Cloud:

They had a more modern, polished user experience (both used Grafana as a front-end, but Grafana Cloud has the latest version, while Chronosphere's was pinned to v7, which is very old)
Their documentation was significantly better
They had support for multiple data sources, including CloudWatch, ElasticSearch, and Athena (which were important to us)
They were strong leaders in the open source observability community
Grafana Labs was a larger and more established company, with a more robust and mature product portfolio
It seemed credible that we may eventually be able to migrate our traces and logs to them as well, giving us a unified observability platform

It was clear that, besides the discrepancy in cardinality management, we'd prefer to go with Grafana Cloud. However, if we wanted to make this work, we'd need to find a way to handle the use cases that Adaptive Metrics wouldn't cover.

Taking another look at the OTEL collector

It turns out that the OTEL Collector (which I mentioned we were already using) is an insanely useful swiss-army knife for building observability pipelines. It can collect metrics, traces, and logs in virtually any format, run a pipeline of transformations, and output them in virtually any other format.

I knew that the OTEL collector had a number of processors available, though we hadn't used them much previously. I wondered if we could use these to replicate some of the more advanced metrics selector functionality that Chronosphere offered.

It took me a bit of time to figure it all out, mostly because the OTEL collector documentation isn't amazing, but I was eventually able to replicate pretty much all of the advanced "drop" rules we needed using the OTEL collector's processors.

Check it out

Check out some of the tricks I used to replicate some of Chronosphere's drop rule features in the OTEL collector

In the end, with the combination of Grafana Cloud's Adaptive Metrics and the OTEL collector processors, we were able to get our total cardinality down to a similar level as we had with Chronosphere. While the resulting solution was a bit more complicated, it was an acceptable tradeoff given the other advantages of Grafana Cloud.

Conclusion

The experience of running a head-to-head evaluation of two vendors, especially given the penetration of OpenTelemetry and Prometheus in the market, was a real eye-opener. I'm more bullish than ever on OTEL (and cloud-native standardization initiatives in general), and I think it's going to continue to reshape the observability landscape in the coming years.

I should point out that, even though we selected Grafana Cloud, I think Chronosphere would have also been an excellent choice. I think it might even be a better choice for a company that meets a few criteria:

Your biggest pain point is cardinality and/or cost management
You have a large number of metrics producers that would be hard to corral into a uniform schema
You don't have a lot of third party metrics sources (e.g. CloudWatch, ElasticSearch) that you want to query directly (Chronosphere integrates with those data sources by eagerly scraping them and converting them to Prometheus metrics... which can increase costs for sources, like CloudWatch, that charge by the API call)
You're OK with a slightly less polished user experience (or you're willing to wait for Chronosphere to catch up)

Confession

The sales engineering team at Chronosphere was absolutely amazing. They put in a ton of work helping me adapt our existing Influx-centric pipeline to work with Prometheus and OTEL. Plus they had to put up with me, who required a remedial education in Prometheus concepts before we could do anything.

They were so patient, knowledgeable, and great to work with that I feel legitimately terrible (on a personal level) that we eventually decided to go with a competitor.

That said, Grafana Cloud has been a great fit for us. Their support and customer success teams, in particular, have been really effective in helping get our team ramped up and successful. Given this experience, we're interested in expanding our use into their logging (Loki) and tracing (Tempo) products. I'll let you know how that goes.

Addendum

I reached out to both Grafana Labs and Chronosphere with a draft of this post. I'm glad I did, because Chronosphere let me know that due to feedback like ours, they've been investing in some of the areas in which they were weakest relative to Grafana Cloud, namely UI quality:

They're the primary force behind Perses, which is a competitor for OSS Grafana (which Chronosphere was previously using for visualizations). They weren't specific about the details, but my guess is the monolithic design of Grafana, combined with its AGPL license, limited their ability to integrate it effectively into their product without having their proprietary UI be infected with the AGPL redistribution terms. Perses is permissively licensed (Apache 2) and backed by the Linux Foundation.

It looks like it's designed to be modular and embeddable, as well as more IaC/GitOps-friendly than Grafana. The project is very young, but I'm excited to see some more open-source visualization options available.

Fun with OTEL collectors and metrics

Sat, 11 May 2024 00:00:00 GMT

As part of an evaluation of Prometheus-compatible monitoring solutions, I needed to push our use of the OTEL Collector further to handle use cases like creating metrics allowlists, renaming metrics, and adding or modifying labels.

Here's some examples, based on what I learned, of the crazy and powerful things you can do with OTEL collector processors to manipulate metrics.

Inserting static labels

As part of a multi-account AWS strategy, we have many Kubernetes clusters, spread across AWS accounts for each of our teams. We wanted to make sure that all metrics coming from Kubernetes clusters contain labels with metadata about which cluster and account they came from (beyond what comes with the k8sattributes processor).

We use the OTEL collector as a daemonset (so it runs on all nodes in our clusters), and all telemetry from our pods goes through them.

NOTE

Since these OTEL collectors are deployed in Kubernetes, we can inject environment variables into the pods with this static information (in our case these environment variables are set via Terraform).

attributes/cluster-metadata:
  actions:
  - action: upsert
    value: "${CLUSTER_ENV}"
    key: env
  - action: upsert
    value: "${CLUSTER_LABEL}"
    key: cluster_label
  - action: upsert
    value: "${CLUSTER_NAME}"
    key: cluster
  - action: upsert
    value: "${CLUSTER_TEAM}"
    key: team
  - action: upsert
    value: "${CLUSTER_REGION}"
    key: region
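
For reference, here's a minimal sketch of how those environment variables could be injected through the collector's DaemonSet manifest. Only the variable names come from the config above; the resource name, image, and all values are hypothetical placeholders (ours are generated by Terraform, so the real manifest looks different).

```yaml
# Illustrative DaemonSet; names, image, and values are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector                # hypothetical name
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib   # assumed image
        env:
        # Each of these is referenced as ${NAME} in the collector config.
        - name: CLUSTER_ENV
          value: "dev"                # placeholder value
        - name: CLUSTER_LABEL
          value: "example-label"      # placeholder value
        - name: CLUSTER_NAME
          value: "example-cluster"    # placeholder value
        - name: CLUSTER_TEAM
          value: "example-team"       # placeholder value
        - name: CLUSTER_REGION
          value: "us-east-1"          # placeholder value
```

Because the values are resolved when the collector loads its config, the same collector config can be shipped unchanged to every cluster.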

Inserting dynamic labels

We have a bunch of legacy services, deployed outside Kubernetes, that don't have an instance label (which is idiomatic in Prometheus). These metrics (generated by telegraf) do have a host label, however, so we used that to create an instance label, also using the attributes processor:

attributes/instance-label:
  actions:
  - action: insert
    from_attribute: host
    key: instance

Replacing useless labels with useful ones

When we scrape kube-state-metrics, the pod and namespace labels on the metrics are the pod and namespace of the kube-state-metrics pod itself. This isn't so useful; we don't care about the kube-state-metrics pod names, we only care about the pods that are the subject of the metrics.

Here's a trick where we use the attributes processor to remove the kube-state-metrics pod/namespace labels, and then rename the exported pod/namespace labels to replace them.

This way, when users are querying for metrics on their pod, they can just use the pod label, and don't have to worry about the implementation details of how kube-state-metrics is scraped:

# Delete the pod and namespace labels which refer to the kube-state-metrics
# pod itself, not the pods the metrics refer to.
attributes/kube-state-metrics:
  include:
    match_type: regexp
    metric_names: "^kube_.+$"
  actions:
  - action: delete
    key: pod
  - action: delete
    key: namespace

# Rename the exported_pod and exported_namespace labels to pod and namespace
metricstransform/kube-state-metrics:
  transforms:
  - include: "^kube_.*$"
    match_type: regexp
    action: update
    operations:
    - action: update_label
      label: exported_namespace
      new_label: namespace
    - action: update_label
      label: exported_pod
      new_label: pod

You'll need to make sure that the attributes/kube-state-metrics processor runs before the metricstransform/kube-state-metrics processor in your pipeline, so that the old labels are deleted before the new ones are renamed.
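
Processors run in the order they're listed in the pipeline, so the ordering is expressed in the service section. Here's a minimal sketch; the receiver and exporter names are assumptions for illustration, so substitute whatever your pipeline actually uses.

```yaml
service:
  pipelines:
    metrics:
      receivers: [prometheus]               # assumed receiver
      processors:
      # Order matters: delete the old pod/namespace labels first...
      - attributes/kube-state-metrics
      # ...then rename exported_pod/exported_namespace into their place.
      - metricstransform/kube-state-metrics
      exporters: [prometheusremotewrite]    # assumed exporter
```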

Renaming metrics

Sometimes, we'd find older services had metrics that had been named in various problematic ways, so we wanted a way to rename metrics (e.g. to adhere to a naming convention). Here's a use of the metricstransform processor that renames all metrics with a badsuffix to have a goodsuffix instead:

NOTE

The double dollar sign ($$) is intentional; the OTEL collector would interpret ${1} as an environment variable. The second $ escapes the first, so that it's interpreted as a literal $, and used as part of the regular expression capture group.

metricstransform/fix-suffix:
  transforms:
  - include: ^(.*)_badsuffix$
    match_type: regexp
    action: update
    new_name: "$${1}_goodsuffix"

Truncating long label values

Grafana Cloud has a maximum label length of 1024 characters. Any metrics with labels exceeding this length will be dropped before they're ingested. Here's a nifty transform that truncates all label values to this length:

warning

Why would anyone have a label value that long? Well, there's no good reason. But sometimes, just sometimes, a distracted programmer may accidentally include an entire stack trace as a label value.

Not naming names.

transform/truncate-labels:
  metric_statements:
  - context: datapoint
    statements:
    - truncate_all(attributes, 1024)

Filtering metrics with arbitrary queries

Here's where we get to the use case that we really needed: a way to drop large swaths of metrics entirely, based on arbitrary queries. The most common pattern we were trying to replicate was an "allowlist", where we drop most metrics, except for those that meet some criteria.

The OTEL collector has a filter processor to do this, and it supports the OpenTelemetry Transformation Language (OTTL), which allows you to write complex expressions to represent your filter criteria:

filter/drop-rules:
  error_mode: ignore
  metrics:
    datapoint:

    # An "allowlist" that drops all metrics from 'some-service' except for
    # the two specified.
    - >
      resource.attributes["service.name"] == "some-service" and
      metric.name != "some_service_important_metric" and
      metric.name != "up"

    # A similar allowlist, but using a regex to match the service name
    # for various sized fluent-bit daemonset pods (we have small, medium, 
    # and large variants of fluent-bit).
    - >
      IsMatch(resource.attributes["service.name"], "fluent-bit-.*") and
      metric.name != "fluentbit_output_dropped_records" and
      metric.name != "up"

    # A drop rule that drops a specific cadvisor metric for all services
    # except for those in a specific namespace.
    - >
      resource.attributes["service.name"] == "cadvisor" and
      metric.name == "container_file_descriptors" and
      (not IsMatch(attributes["namespace"], "^someservice.*$"))

    # A drop rule that shows you can use more complex boolean expressions
    # with parentheses to group conditions.
    - >
      attributes["telegraf"] == "1" and (
        IsMatch(metric.name, "^internal_(agent|gather|memstats|serializer|statsd|write)_.*") or
        IsMatch(metric.name, ".+[-_]request[-_]metrics[-_](median|sum)$") or
        IsMatch(metric.name, ".+_stddev$")
      )

    # Another complex drop rule, where we're dropping metrics matching a
    # regex for all services, but with a list of exceptions.
    - >
      IsMatch(metric.name, ".+[-_]request[-_]metrics[-_]upper$") and not (
        attributes["service"] == "service1" or
        attributes["service"] == "service2" or 
        metric.name == "inconsistently_named_service_request_metrics_upper"
      )
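
It's worth spelling out how these rules combine: each entry in the list is evaluated independently, and a datapoint is dropped if any one of them matches. That's why an "allowlist" is written as a drop rule with negated exceptions. As a sketch, the first rule above could also be written with a single regex for the exceptions (the processor name here is hypothetical):

```yaml
filter/allowlist-sketch:
  error_mode: ignore
  metrics:
    datapoint:
    # Drop everything from 'some-service' whose name is NOT in the allowlist.
    - >
      resource.attributes["service.name"] == "some-service" and
      not IsMatch(metric.name, "^(up|some_service_important_metric)$")
```

The regex form keeps the list of exceptions in one place, which can be easier to maintain as the allowlist grows.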

Limitations of the OTEL collector

It can't do aggregations

While the OTEL collector has many powerful processors, it doesn't currently have the ability to do aggregations (i.e. drop a particular label from a metric and create a new metric by aggregating the metrics that had that label). This is a much harder problem to solve than just dropping metrics, since all the OTEL collector instances that could process any metric you'd want to aggregate would need to coordinate with each other, creating some scaling challenges.

Both Grafana Cloud and Chronosphere offer powerful features around metrics aggregation.

It can't convert delta to cumulative counters

When I evaluated Chronosphere, I was delighted to find that it had a feature to change the temporality of metrics (e.g. change a "delta" counter to a "cumulative" counter), and I was hoping to replicate it with the OTEL collector. While the OTEL collector does have a converter processor in the works, it's still early in development.

In our case, we were able to work around this with some hackery in telegraf (setting the delete_counters setting of the statsd plugin to false).

The impertinent programmer

Sun, 05 May 2024 00:00:00 GMT

It was 2004, and I had a huge chip on my shoulder.

Wait, you need some background first

Let's back up for a minute... It was 2000, and I had been hired for my first actual job as a programmer at a company called Corex (makers of CardScan, a business card scanner). At this point I had a few years of employment under my belt, but this was as a graphic designer who did a little programming on the side. I was pretty clear about this in the interview for Corex, but I guess I did well enough on some programming problems (or there was some misunderstanding?) that my new boss was reasonably confident I could grow into a programmer who did a little graphic design.

They dropped me onto a team comprised entirely of people who had gone to engineering schools, wrote C++, and used words like "orthogonal". They gave me a web-oriented project, put me under the watchful eye of a cranky Russian PhD who would write COM objects that I could script against, and left me to decide how to glue this all together with a true gem of 2000-era web tech: ASP and VBScript.

I crammed books on programming most nights in bed, trying desperately to incorporate some high-level theory to make sense of the trial-and-error hacking I was doing during the day. The feeling of barely keeping myself from drowning eventually gave way to the sense of floating; I was making real, tangible progress, and I was having a ton of fun using a computer to put pixels on a screen.

Three years into this job, I got the opportunity to join a startup that had recently spun out of Corex called ZoomInfo (they're still around). I'm not 100% certain exactly what number employee I was, but I was definitely in the first ten. Most folks on the team were roughly the same age as me, but like the previous job, they had all studied computer science, and knew things like whatever a "heapsort" is for. They were all C++ slingers too, and once again, I was the rookie who was hanging with the pros.

OK, so back to 2004 and that giant chip

At this point, despite my questionable background, I'd earned enough trust to be put in charge of a small team working on our B2B products. The main product was also built with ASP (this is still classic ASP; .NET existed, but was still the bleeding edge) with an era-appropriate amount of CSS and JavaScript. The app got its data from a massive SQL Server that had been populated by web crawlers, using some natural language processing. This was some legitimately groundbreaking stuff, written by a French Canadian genius whose brilliance probably made the rest of the team feel like, well, how I felt around all of them.

The core UI of the product was pretty simple: you'd search for people by attributes (what industry they were in, what titles they may have had, what their field of expertise was), get a list of results, and click on one of them to get to the detail view.

I had inherited the implementation of this from some more senior members of the team a year previously, and had been slowly dolling it up, adding features and generally just making it look spiffier.

One thing had been bothering me since the first time I used the product, though: this detail view page took at least six or seven seconds to load. I mean, the page had a lot of great stuff on it, but it got annoying waiting for it on every click. It didn't seem to be a huge deal to anyone on the dev team, but the sales people occasionally grumbled about it.

Do you remember the 2000s?

Keep in mind, in 2004, a lot of us had just gotten our first cable modems, so we were used to going to get coffee while a big page loaded.

The page was built (as lots of ASP applications were at the time) out of C++ COM components, glued together into a UI with VBScript. With some very crude debugging, I could see that all the time was being spent in the C++ code, which I still had very little ability to wrangle.

I went to my boss (who may have been one of the original authors of said C++ component.. I'm not 100% sure), and asked about the performance bottleneck. He told me that C++ was the fastest, most efficient, and powerful language we had available, and that there was really not much that could be done to make this component any faster. The real engineers had already taken their pass at this, so I should let it go and get back to delivering that feature the sales team wanted.

This felt like a dam breaking; at this moment, I felt like I had endured the subtle condescension of these "real" engineers for 4 years now, and despite how much I had grown, I realized they were always going to view me as 2nd rate.

Programming out of spite

I knocked out the feature the sales team wanted within an hour of this conversation, and instead of going back to JIRA for more work, I made the deliberate decision to carve out some time to prove my boss wrong.

First, I had to decipher this god forsaken C++ code. I was able to hack in some printf statements to log each SQL query as it ran, compiled it, and within another hour, was able to replicate the data access pattern in an interactive SQL Server query window. As expected, these queries did indeed take about 6-7 seconds to run.

This component had a fancy, object-oriented design, which elegantly encapsulated each row as an object which was responsible for fetching its own data. The effect was it was hitting the database like a machine gun- running a separate query for every row in the grid it was rendering. In most cases it was running like 60 queries to render this page.

Fun fact

I didn't realize this yet, since C++ was inscrutable to me, but there was also no connection pooling configured for this C++ driver, so each query was establishing a new connection to the DB.

So I concocted 4 very, very dumb queries to retrieve all the same information that the C++ component did, and ran them in the SQL Explorer. After a little bit of tuning indices, the whole thing ran in like 300 milliseconds.

At this point, I was pretty confident I was on to something, so I snuck a little time over the next couple days to wrap this all up into a VBScript function (with proper connection pooling) and replace the C++ component altogether. I fired up my localhost server, started clicking around from the search screen into the details pages... and it felt insanely fast!

Err, so what do I do with this now?

At this point I was in a bind. In the sober light of day, I realized that, in a fit of rage, I had engaged in an unauthorized product improvement. I really wanted to show my boss, but I was honestly a bit afraid.

So I tested the water by showing a couple of my colleagues, who initially thought it was some kind of trick. Once they realized this was for real, there was no keeping the horse in the barn. Within a few minutes, my boss caught wind and came over to see what was going on.

Reality check

Hold up a second, though. Did I give you the impression that my boss was some caricature of a feckless micromanager? Remember, this story is from the perspective of a twenty-something who's been deranged by insecurity.

Truthfully, my boss was a lovely person, and had been a really important mentor to me. Seriously, he had attended my wedding, and we kept in touch for years after I left.

The tone here is mostly for effect, though does reflect my emotional truth at the time.

So once I explained what I did, he was actually fairly impressed, and called the CEO over to see it too. The CEO was a legitimately intimidating Israeli guy (who had worked in the Mossad), but even he had a hard time moderating his delight over this unexpected gift.

Looking back, this was fairly trivial

In retrospect, this was a really elementary little exercise of basic engineering. I've done many more difficult and interesting things since, but this one still sticks out in my mind, because it was my youthful impertinence that pushed me to do something no one else thought I could do... or even try to do.

20 years later...

Fast forward to just a few days ago.

I had a meeting scheduled with a guy at work who I hadn't met before. The invite said he had some questions about Helm and Kubernetes, so I thought, perfect, I can help.

Right out of the gate, the meeting went a little sideways, when I realized he was asking me for help finding a way to deploy his app to our Kubernetes cluster without using the client tooling that my team had built for this purpose. He was generally skeptical of any internal tools, and assumed they must obviously be inferior junk, and would just get in his way.

It took me a second, since I was self-aware enough to recognize I was feeling defensive, so I took a deep breath and asked some questions about his app. After a few minutes of questions, we had a pretty good understanding of what he wanted. Before I did any advocacy, I clarified that I didn't believe in forcing our tools on anyone, and that he was free to make any decision that made sense to him and his team. But I asked him to take a few moments to listen while I listed some of the problems that were solved in our platform tooling that he would have to replicate if he decided not to use it.

This included stuff like IAM integration, Service/Ingress integration with AWS load balancers, cross-platform docker builds, configuration management, ephemeral environments, and test orchestration. I gave him a chance to ask some questions, to help him understand what he was getting himself into. The tide turned a little bit when, while arguing that IAM integration shouldn't be so hard, he said he could just inject some (long lived) AWS credentials into his pods. At this point, one of his colleagues realized he was advocating doing something that was totally bonkers (and a violation of our security policy).

After this, he opened up a little and we were able to figure out that a lot of what he thought about our platform tooling was based on misconceptions, and it actually did all the things he wanted. He agreed he'd start out with our stuff, and let us know how it went.

The nerve of that guy!

It took me about an hour to unwind from that meeting. I was so miffed! Our team's platform tooling has been wildly successful, and it's been a couple years since we needed to do any proactive advocacy for it. Demand has been spreading mostly by word-of-mouth, as our developer teams have been really happy with it.

I paced my kitchen while obsessing over the interaction. How impertinent! Doesn't he know that I've already solved these problems? He just casually dismissed all the work my team has done over the last 2 years! He thinks it's trivial; he'll just whip out some shell scripts to solve everything. It can't be that hard.

Recalling the virtues of impertinence

At this moment I remembered my own experience as an impertinent young programmer, and my mind began to settle. I realized he was offering me a gift: the perspective of someone who, however naive, might have ideas or insight that I was missing. It had been a while since I'd faced this kind of skepticism, and I realized this was a good thing- it's important to have someone keep you on your toes.

This was a reminder that impertinence can be a virtue: a fuel to do cool stuff. Hopefully the next time I meet an impertinent upstart programmer, I'll smile and keep my thoughts to myself.

Or wait... maybe I should be really condescending to get them fired up? I'm gonna have to think about that.

Finding my outside voice

Sun, 28 Apr 2024 00:00:00 GMT

For most of my career, I've found that I tend to quickly develop a reputation in whatever company I'm working at. I've never been the best programmer, but I've got some breadth, creativity, and critical thinking skills, and I'm good at synthesis and communication. This has helped me see the big picture in moments where a novel idea was needed, and I was able to connect some talented people to come up with some cool stuff together.

The small pond

Because of this, people have mostly taken me seriously within my companies. In my work as both a platform engineer as well as in engineering leadership, this has come in pretty handy. I do a lot of internal communications, mostly with developers, but also across organizations, including occasionally the C-suite, where the objective is often to influence behavior in some way. For example:

I have to convince team managers to commit some of their team's time to test out a new observability vendor and get me feedback before a purchasing deadline
I need to convince a product owner that if their team spends some time migrating their app to our new platform, it will pay off by making their team faster and their product more reliable
I need to convince developers they need to adopt some new front-end standards, because mobile web is actually a thing now, and we need to make sure our site works on phones (lol, this one is a bit dated now)
I need to convince the CTO that an assumption they made isn't correct, and we need to pivot ASAP

In all these kinds of communications, I usually start with writing. Not only does it help me clarify my own thoughts, but I feel like it's a conscientious way to engage with someone when you're obviously trying to influence them. It gives them a chance to read, absorb, and process, without being put on the spot. Then it can be a lot easier to have subsequent conversations.

Writing is an even more essential tool for communicating to a larger audience, and interestingly, the same dynamic applies. Sending an email to a whole division of a company has a slightly different flavor than to an individual, but I've found that the more you treat it like an invitation for further conversation, the more effective it is at actually influencing behavior.

My inside voice

So I ended up writing a lot of internal documentation, emails (and increasingly more multi-paragraph slacks), Confluence articles, and occasionally some slides for a presentation. In all of these, I've found a particular voice that feels appropriately authoritative, but also approachable and informal, with a bit of self-deprecating humor, and the occasional pop culture reference thrown in.

These communications have always been relatively well received, and usually effective. I've also found that developers tend to read my emails at a much higher rate than those from others (sometimes more than those from senior executives).

But at some level, I've always known that a big part of this is my reputation and pre-existing standing in the company, and I'm honestly not sure how big. It's easy to be confident about my communication when I know a big chunk of the people in my audience personally, and understand their context, concerns, and environment.

Finding my outside voice is a little scary

I've been procrastinating on starting this blog for over 15 years. I wrote my first entry a few months before my youngest child was born... and she's now in high school. I have to admit it's partially because I'm a bit nervous about leaving my comfortable little corporate bubble where everyone knows me.

Part of me thinks all these "great insights" I've been wanting to share for years are going to get absolutely shredded in the daylight, given the tech world is filled with brilliant people, and on any single topic I think I know anything about, there's someone who knows a whole lot more. Perhaps I'm absolutely full of crap, or perhaps my insights are just boring and obvious.

Something changed in the past few months, and the urge to start sharing some ideas has finally overcome my fear of getting pilloried. I'm finally getting to the point where, however masochistic it may be, I'm really starting to crave external feedback. I want a way to test out some of these crazy ideas in the open, where I can see how a more diverse and knowledgeable population reacts to them.

Mustering the energy

The other obstacle was finding the time and energy to do the technical work to get this thing up and running. My site used to be on Vistaprint's website platform (which I helped build back in the day, so it was free for me), and had since been (forcibly) migrated to Wix. Frankly, working with Wix... did not make me happy.

At my current job, I finally found a static site generator that I like a lot for documentation, so that gave me the push I needed to export my stuff out of a languishing WordPress, and get it published to GitHub Pages.

At some point soon I have to figure out SEO, and probably get over my disdain for social media and start putting myself out there.

Conclusion

I remember as a kid, I was sometimes loud, and my teachers and parents would occasionally admonish me to use my inside voice. Perhaps it was good advice at the time. For me, though, I think it's finally time to find a voice for speaking outside.

Hopefully those mean kids across the street won't throw rocks at me again.

Let a thousand flowers bloom

Sat, 27 Apr 2024 00:00:00 GMT

I've been reflecting recently on a really formative period in my career, when I had a chance to be part of a massive experiment in progressive engineering management.

About 3 and a half years before I left Vistaprint, I was asked to join the Engineering leadership team by our (relatively) new VP of Engineering, Erin DeCesare (who is now the CTO of EZCater). She was a particularly bold leader in terms of her progressive management ideas, and was rapidly reshaping the organization with a strong set of values around empowerment and servant leadership.

We were responsible for about 200 developers, who were organized into squads (a.k.a. two-pizza teams), who were then loosely grouped into "tribes" (an idea borrowed from Spotify). The real difference from the previous regime was a pretty extreme amount of autonomy given to teams; they could choose their own technologies, work processes, architectures.

On top of this, Erin was pushing farther into some even more progressive empowerment concepts. For example:

Managers were given a lot of coaching on servant leadership, and the ones who weren't able to evolve were managed out
Teams would be supported by embedded agile coaches, who helped them optimize team health properties, such as psychological safety and culture of feedback
Teams were given the freedom to decide what work to pull from our enterprise backlog
Teams would decide how to distribute bonuses between them (this one went a bit sideways)

All this stuff was a bit mindblowing to me, but I was doing my best to commit to making it all work. It was an extraordinary amount of cat-herding, managing through influence, and general chaos, but it felt crazy enough that it just might work.

We set up some initial organizational mechanisms to attempt to manage all this chaos, since the whole point was to build a better team that could deliver features and products that made our customers happy. We set up feedback loops, elevated thoughtful and visionary people, and set up structures to help make sane architectural decisions.

Then, things got very interesting.

Then we watched the experiments unfold

We had something like 30 squads, each running themselves in almost any way they saw fit. There was a lot of diversity between teams; folks with different backgrounds, different technology preferences, stronger or weaker opinions, different seniority levels, etc.

What I witnessed was 30 different teams running 30 different experiments into what makes a team successful... or not.

Some examples:

Some teams stuck conservatively to working in our monolith, and did a release every 3 weeks, others spun up Kubernetes clusters in AWS with KOPS and deployed their apps via helm charts several times a day.
Some were absolutely religious about automated testing, and obsessive about code coverage, others had a more deliberate risk-management strategy.
Some teams spent a long time getting observability working, and others relied more on signals from external SRE and QA.
Some worked via consensus and mostly made decisions together, others had one or two very senior folks who set the direction for the team.
Some teams did all their work in pairs, or via "mobbing", while others only interacted with their teammates when they were stuck.

There was also a lot of variation in the adoption of the agile philosophies we were coaching the teams into adopting.

Things I noticed

After running in this mode for about 9 months, there were a few themes I noticed which have really informed my perspective today:

Engineers will fill all available space with engineering work

One of the "tribes" (4 squads, around 25 people) was given an objective, some constraints, and 6-12 months to deliver a new platform for managing our product catalog. We were replacing our "legacy" product system because it had grown too creaky and complex over the years, and we wanted a bunch of new features, especially for marketers to be able to manage content without needing engineering time.

Yeah, I know, this is a classic case of second-system syndrome, but everything just seemed so obvious to us at the time; we were smart people, and it seemed achievable to us. So we did some design, broke the work up and gave pieces to each of the squads.

As it turns out, our teams invented engineering problems to fill all available space. They created an elaborate network of microservices, built SPAs out of hip new JavaScript frameworks, integrated some truly abominable "enterprise" vendor products, and designed massively complex processes and workflows to solve every use case that we failed to account for in the legacy system.

It started to hit me that we had jumped the shark when I noticed that a microservice one of the teams built could have been implemented as a single text file. Once that clicked, I started realizing, to my horror, that everywhere I looked, the entire system was like this. And then, through the haze of the groupthink, it occurred to me:

We'd have been better off giving this entire objective to a team of 5 people for a month.

A small, constrained team would have had no choice but to build something small and simple that worked. Then they would have had to evolve it incrementally, which as we now all know, is the only way to build a working system.

Agile can be so powerful, or so horrible

Some teams got really religious about their agile methodology (usually Scrum), and used agile "rules" as a weapon against their PO, manager, and occasionally each other. They'd spend a lot more time playing games with ticket management, ceremonies, and storypoints than they did thinking about customers, products, or using their common sense.

One of the best defenses against teams going this direction was having great agile coaches (NOT the high priests from big consultancies who spend their time promoting seminars on LinkedIn). A good agile coach can provide a sort of continuous intervention for a team; they hold up a mirror, allow them to see their own dysfunction, and help re-center them on what matters.

From observing these coaches in action, I formed two separate beliefs:

I've noticed that great agile coaches tend to also be very product-centered, and are genuinely interested in developing technical and domain knowledge from the team they're working with. They ask a lot of questions, and develop insights that are specific to the practical limitations the team is managing. They don't just waltz in and start unloading a bunch of dogma.
It's interesting that when you see a healthy, high-performing team, agile methodologies are sort of invisible; they're there, but they sort of melt away into the background. The team is just focusing on the actual work: how to meet a customer's need, what hypotheses they should be testing, or thinking critically about business requirements in order to find an 80/20 solution.

The cult of test

We had a few teams that drank the TDD kool-aid hard. One in particular had agreed as a team to take testing really seriously. They set about implementing extensive test suites, held book groups about BDD, and spent hours trying to compile Gherkin files so their business partners could define test cases (which they never actually did). They believed in the singular truth of the test pyramid, and went big on unit tests, shunning higher-level approaches as lacking in virtue. They set up CI/CD to fail builds when code coverage dipped below 95%.

Things started to go sideways fast. Because their test suites were targeting low-level implementation details, every feature they implemented broke a gazillion tests. They were terrified to refactor anything (including the tests) because it would ensure weeks of work and an avalanche of merge conflicts. They'd get nothing done, sprint after sprint. The result wasn't amazing quality, it was utter paralysis.

Deep Thoughts

There's definitely some lessons to learn here specifically about testing practices, but I think this is really a more general case of a failure to apply continuous improvement. The intervention wasn't to parachute in and tell them how to do testing better, it was to have them pause and reflect, re-focus on what they were trying to accomplish, and give them permission to try something new.

With some perspective and coaching, they started to re-think their testing strategy. They began to see that some risks were more important to mitigate than others, and some testing techniques gave you a lot more bang for your buck. They came to the conclusion that 95% on a code coverage report wasn't a business objective, since we were building a freaking e-commerce site, not pacemakers or cruise missiles. They gradually woke up from their fog, and came up with some much more clever testing strategies, such as testing the high-level interfaces and APIs, which were far more stable, and gave them the freedom to refactor implementation details.

Moderately strong teams outperform superstars surrounded by a meh team

We had a few squads where 4-5 (out of 6) team members were solid "A-" or "B+" players. In contrast, there were other teams held together by one very strong "superstar" lead surrounded by "B"s and "C"s. It was strikingly obvious to everyone that the teams with the more homogeneous, moderately strong members significantly outperformed the teams with the superstar.

My take on the underlying cause was that the superstar would get randomized trying to support the rest of the team, and didn't have enough time to do anything innovative. If they ever did manage to spend some time on something interesting, it was generally too advanced for the team to run with, and the momentum fizzled. The rest of the team became helpless and dependent on the superstar to make decisions.

Of course, there was occasionally the more enlightened superstar, who would spend their energy trying to elevate their lower-performing team members. The effectiveness of this was highly dependent on the latent potential of the rest of the team, and usually didn't work without management intervening to shore up the team with additional strong engineers to support the superstar.

In general, the ratio of strong to weak members really couldn't dip below 2:1 before the team veered into unhealthy territory. The members of strong teams will support and help each other, but there has to be balance between them; if it gets asymmetrical, it starts to drain everyone.

Blinding flash of the obvious

Who knew, you need good people to have a good team.

Strong POs are critical

We had Product Owners deployed to each squad; usually a PO would cover 2-3 squads. The difference between a great PO and a bad one was very stark. POs ultimately decide what the teams work on, so it seems fairly obvious that they'd have an outsized influence on the team.

At the core of the role is having great instincts about customers, the product, and the problem space. But there are other key factors which are underappreciated in POs:

Empathy and rapport with developers
A willingness and interest in understanding technical limitations and tradeoffs
An understanding that they're not only responsible for the customer experience, but the long-term health of the technology it's built on

Looking back

I imagine for most people working there at this time, being the guinea pigs in a giant experiment might not have been the ideal work experience. Actually a lot of folks thrived, and did some amazing work. But some reeled from the unrelenting waves of change, and others understandably just threw in the towel.

For me, though, it was very different. I got to see all this from a bird's-eye view, but also on the ground, since I spent time with nearly every team. It was a bit like watching a whirlwind from inside- it gave me a sense of how seemingly small, well-meaning and thoughtful inputs can have huge, unintended effects that ripple across the whole system.

Like a lot of younger engineers, I used to occasionally express casual and flippant disregard for out-of-touch upper management. OK, admittedly I still might feel this way from time to time, but this experience left me with a very different understanding of what it takes to manage a large engineering organization, and gave me a dose of humility and appreciation for how challenging it is.

I'm incredibly thankful to Erin for taking me with her on this journey- I really couldn't have had more learning jam-packed into a few short years of my life.

DevOps is a stew

Sat, 20 Apr 2024 00:00:00 GMT

When learning a new recipe, especially when dabbling in cuisine from different cultures, I find it really important to make sure one is really precise in their understanding of the words used in the recipe. I've had a few unfortunate misunderstandings that resulted in... gastronomic disaster.

Similarly, I find that I can't responsibly use the word "DevOps" without testing that the person I'm talking to knows which meaning I'm using. Here's some examples of what someone may think I mean when I say "DevOps":

The whole category of stuff that happens after you do a git commit, that magically makes your code turn into running software
A type of engineer that does the cloudy, opsy stuff
A style of operations where there's lots of automation and infrastructure as code
A culture where developers and operations people collaborate more tightly (vs the bad old days of "throw it over the wall")
A style of software development where the software engineers figure out how to deploy and operate everything themselves

The thing I usually mean is similar to those last two, but here's a crisper version:

An engineering management philosophy in which teams are responsible for operating the software they build, in order to create a virtuous feedback loop which incentivizes the team to make their software highly reliable and operable.

I think this is the most useful meaning of the word, mostly because a big percentage of the other meanings are covered by preexisting words, like "operations", "CI", or "infrastructure as code". I mean, yeah- there's definitely a specific modern flavor of operations, but I find it more useful to use more specific words about those practices.

Quick Rant

Some companies have a team called "DevOps". When I hear this, my eyebrows become raised, and I wonder to myself if somebody just decided to rename the "Operations" team.

You know, the team the developer teams can "throw it over the wall" to.

The DevOps-flavored stew

The thing I find really interesting about the engineering-management philosophy definition of "DevOps" is how interdependent it is on a whole bunch of other ingredients that coincided historically with it:

Microservices
Cloud
Automation and Infrastructure as Code
Containers and orchestrators
Continuous Deployment
Shift to automated testing and observability vs manual QA
Platform Engineering

Microservices

About 10 years ago, the company I was working for had outgrown our monolith, and we reluctantly started on the journey to microservices, and the journey was still ongoing when I left (about 6 years ago).

The thing that surprised us first was the sheer magnitude of the overhead of managing all the operational stuff that had been solved problems in the monolith. The first teams that started extracting their own services spent many weeks just trying to replicate a small fraction of what we had previously taken for granted: reliable builds, rolling deployments, centralized logs, metrics, alerting, feature flags and associated tooling, etc.

Flashback

Sometimes looking back on this time, I think about just how adorable it was that we thought we were going to move to microservices but everything else was just going to continue working as-is. We were so cute.

Cloud

At that point, we started toying with some public cloud (AWS), and found that there was a lot of excitement from teams using it. The fact that their infrastructure could be fully automated through reasonable APIs alleviated a whole bunch of the pain we were feeling trying to automate deployments to on-prem servers.

Infrastructure as code

After building out some of this cloud automation with shell scripts, we quickly discovered that we needed some more powerful ways to manage infrastructure as code. I think at that point we played with CloudFormation and some early Terraform. We were still struggling though, caught between the low-level (infrastructure-as-a-service) nature of EC2, and the relatively immature platform-as-a-service offerings AWS had at the time. We made a little headway with tools like Spinnaker and Octopus, but deployments were still relatively slow and risky.

Containers and orchestrators

Around this time, Docker was making waves, and we started experimenting with it and early versions of (pre-EKS) Kubernetes and ECS. The speed and ease of deployments, relative to what we had been doing with hand-rolled automation of EC2 and autoscale groups, was game changing. Suddenly, treating infrastructure as immutable felt natural, and deployments were cheap and fast.

Continuous deployment

The teams that had adopted containers, Kubernetes, and ECS quickly discovered the power of continuous deployment. While it was technically possible previously, deployments were slow enough that teams were still batching up changes and doing big-ish releases (maybe a couple times a week). Now, the opportunity presented itself to deploy any given feature the second it was ready.

Shift to automated testing and observability vs manual QA

As the braver teams started to actually practice continuous deployment, they found that there was an increase in the number of bugs that would remain undiscovered, sometimes for days. In retrospect, our culture had been too reliant on having a QA team, who was organized around doing manual regression tests on big batch releases. Teams began to re-discover the need for some essential ingredients of continuous deployment:

Robust observability and alerting
Running automated regression testing in CI

Platform Engineering

At this point there was a huge and increasing gap in maturity between teams who had invested significantly in their operational capabilities, and teams that hadn't. It was also clear that to get to even a baseline level of continuous deployment required months of investment from every team... and I don't think we'd even come to grips with the reality of maintenance on all that stuff.

It became obvious we needed to find a way to share the capabilities between teams, so we started experimenting with ways to reclaim some of the abilities we used to have with our monolith- but in a way that worked in a world of autonomous teams and distributed systems.

We quickly realized that some things were a no-brainer to centralize: running Kubernetes clusters, CI/CD, and observability infrastructure, in particular. We also started playing with integrating other opinions and best practices into tooling, and trying to find the balance between operational uniformity and developer freedom. At some point in the past few years, we started calling this "Platform Engineering".

Flavoring is key

Looking back, it's actually hard for me to imagine, in any practical way, how any of these practices could exist on their own. I mean, you can gnaw on a potato, but it's hard to call that a meal without the rest of the ingredients.

Historically speaking, there's a few more tasty bits sprinkled into this stew:

Servant leadership and Intent-based leadership
The Lean Startup varietal of Agile

These are really key to building a modern engineering culture where developers can flourish; you can't have a high-performing team without enlightened leadership.

I should note that I've tasted a version of this stew, but with these flavorings replaced with "command and control" and "waterfall". That stew tasted like garbage.

Karpenter, you complete me

Mon, 15 Apr 2024 00:00:00 GMT

Every once in a while, some new product comes along that solves a problem you didn't know you had, and does it so well that after you've had it, you can't imagine how you ever lived without it.

This is how I've come to feel about Karpenter. I guess you could say that the category it lives in already existed, given it's designed to replace the Kubernetes Cluster Autoscaler, but the effect it's had on my life as an EKS cluster operator and platform engineer makes me feel like the comparison cheapens it.

Back in my day, we had to use cluster autoscaler

Let's take a stroll together into the recent past, when, to provide nodes for an EKS cluster, we had to provision EC2 autoscaling groups. In most cases, we didn't have a lot of information about what types of workloads were going to be run on the cluster, so this involved a lot of guessing about the best fixed set of instance types/sizes we should choose.

Then, when load on one of our apps started to pick up, its HorizontalPodAutoscaler would kick in and increase the replica count of a deployment, and some pods would get created. When they first started up, they'd be in a “Pending” state because there weren't enough nodes to satisfy their resource requests. At this point, the cluster autoscaler would notice the pending pods, and make some API calls to increase the node count of the AutoScaling group by 1. Then we'd wait about 6 minutes for the instance to come up, join the cluster, and be ready for pods to be scheduled.

But perhaps there were more pods than could fit on the new node... at which point the process would start over again until all pods were scheduled. Sometimes, with larger deployments, it would take 30 minutes or so for all the required nodes to come up.
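To put rough numbers on that loop, here's a minimal sketch of the arithmetic. It models only the one-node-at-a-time behavior described above; the function name and the pods-per-node figure are made up for illustration, and the 6 minute default is just the number from this post.

```python
import math


def sequential_scale_up_minutes(pending_pods, pods_per_node, minutes_per_node=6.0):
    """Total wait when nodes are added one at a time and each takes minutes_per_node."""
    nodes_needed = math.ceil(pending_pods / pods_per_node)
    return nodes_needed * minutes_per_node


# Hypothetical burst: 20 pending pods, 4 of which fit on each new node.
# That needs 5 nodes, added one after another.
print(sequential_scale_up_minutes(pending_pods=20, pods_per_node=4))
```

With those assumed numbers the wait comes out to 30 minutes, which lines up with what we'd see on larger deployments.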

And then when it came time to scale back down, depending on which particular poison you wanted to pick within the scale down options (cost savings vs potential node flapping), you'd get faster or slower scaling down. If you chose a managed NodeGroup to control your AutoScaling group, you'd get graceful draining of nodes before they shut down, so your customers wouldn't be disrupted.

All the while, we were spending a lot of time working with teams who had specific instance type requirements to configure different NodeGroups with different instance types, and set up the right taints on each so that teams could target them. Then we'd have to make sure their workloads' tolerations were set right, and we'd have to live with the fact that each NodeGroup might be underutilized, depending on what workloads actually ended up on them.

This was all fine, and I didn't really know that life could get better.

Then Karpenter walked into my life

I heard about Karpenter from the Containers from the Couch YouTube channel. They said a lot of words that sounded good to me; cost optimization, faster autoscaling, more flexible node types, support for multiple instance types. So a few weeks later I started installing Karpenter and playing with it.

Karpenter's whole schtick is that it bypasses the traditional cluster-autoscaler model, and directly interfaces between the Kubernetes scheduler and EC2 APIs, incorporating a bunch of intelligence about how to optimally provision nodes. It sounds like a simple difference, but it has huge implications.

Jumping ahead a few months, all our EKS clusters use Karpenter instead of the cluster autoscaler. Here's some of what we've got now that we didn't before:

Cost savings: EC2 costs are roughly 1/2 of what they were (relative to our workload sizes). A lot of this is because we can now safely use spot instances for a bunch of workloads, but also because it's easier to use ARM nodes, and because of Karpenter's intelligent cost-optimized instance type choices, bin-packing, and consolidation (smart scale-down).
No need to guess about instance types: As a platform engineer, I don't need to use my crystal ball to try to guess what instance types developers will need for their workloads. They just use Kubernetes scheduling primitives in their pod specs (e.g. resources, node selector and/or affinity expressions, tolerations, topology constraints, etc) and Karpenter gives them the cheapest nodes that fit their requirements.
Faster Autoscaling: Autoscaling is way faster. It usually takes about 3 minutes for Karpenter to have nodes up and running for all unscheduled pods.
Easy migration to ARM nodes: Developers can now flip a switch in their service's configuration (which adds a node selector and toleration under the hood) and have their services running on ARM servers (Graviton on AWS). This is about 30% cheaper for similar performance.
GPU instance support: On our AI workloads, we can flip a different configuration switch, and have their pods run on instance types with GPUs (and the right Nvidia plugins).
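If it helps to picture the "cheapest nodes that fit their requirements" idea, here's a toy sketch. This is not Karpenter's actual algorithm (which also handles bin-packing across multiple nodes, spot pricing, daemonset overhead, and a lot more); the instance names and prices are invented, and it only ever picks a single node for the whole batch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str            # hypothetical name, not a real EC2 instance type
    arch: str            # "amd64" or "arm64"
    cpu: int
    memory_gib: int
    hourly_price: float  # made-up price, for illustration only


@dataclass(frozen=True)
class PodRequest:
    cpu: float
    memory_gib: float


CATALOG = [
    InstanceType("x86-4cpu", "amd64", 4, 16, 0.20),
    InstanceType("x86-8cpu", "amd64", 8, 32, 0.40),
    InstanceType("arm-4cpu", "arm64", 4, 16, 0.14),
    InstanceType("arm-8cpu", "arm64", 8, 32, 0.28),
]

# Three pending pods, each requesting 1.5 CPUs and 4 GiB of memory.
PODS = [PodRequest(cpu=1.5, memory_gib=4) for _ in range(3)]


def cheapest_fit(pods: List[PodRequest], arch: str,
                 catalog: List[InstanceType]) -> Optional[InstanceType]:
    """Return the cheapest single instance type that fits every pending pod."""
    need_cpu = sum(p.cpu for p in pods)
    need_mem = sum(p.memory_gib for p in pods)
    candidates = [
        t for t in catalog
        if t.arch == arch and t.cpu >= need_cpu and t.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda t: t.hourly_price, default=None)


print(cheapest_fit(PODS, "amd64", CATALOG))
print(cheapest_fit(PODS, "arm64", CATALOG))
```

The point of the sketch is the shape of the decision: developers only state requirements (here, CPU, memory, and architecture), and the instance type falls out of the search rather than being chosen up front.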

What I didn't fully grasp until I had all this running was that we now had a developer experience that felt… serverless! Developers mostly don't think about node types anymore, they just express their requirements in basic Kubernetes manifest files.

From an operator perspective, beyond the cost savings (which are still mind-boggling), Karpenter is at least as easy to use as managed NodeGroups. Cluster maintenance is astoundingly easy; we can push a new AMI to all our nodes (e.g. for cluster upgrades, etc) with a one line config change to a custom resource (EC2NodeClass), and Karpenter handles all of the node rotation, including graceful draining of workloads. It respects PodDisruptionBudgets, and the graceful termination properties of pods, so this is pretty much seamless from the perspective of both developers and customers.

OK, what's the catch?

For real, this works pretty much as well as it sounds. There's a few things that you'll notice, which are general best practices, but are more essential when using Karpenter:

Specify and tune resources.requests for all your pods

This is absolutely core to how Karpenter knows what types and quantities of instances you need. And really, if you're doing anything real with Kubernetes, you should be doing this regardless.

I rely on our cadvisor prometheus metrics (e.g. container_memory_usage_bytes and container_cpu_usage_seconds_total in particular) to track the maximum pod memory and CPU utilization over a window of a few weeks, and set requests accordingly.
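Here's a rough sketch of that calculation, assuming you've already pulled the samples out of Prometheus somehow. The function names and the 15% headroom are my own illustration, not a recommendation; the only thing taken from the metrics themselves is that container_cpu_usage_seconds_total is a cumulative counter, so CPU usage has to be derived from the difference between samples.

```python
import math


def peak_cpu_cores(counter_samples):
    """Peak CPU usage in cores from (unix_seconds, cumulative_cpu_seconds) samples.

    Samples must be sorted by time. Intervals where the counter went
    backwards (e.g. after a container restart) are skipped.
    """
    rates = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in zip(counter_samples, counter_samples[1:])
        if t2 > t1 and v2 >= v1
    ]
    return max(rates, default=0.0)


def cpu_request_millicores(counter_samples, headroom_pct=15):
    """Peak CPU usage plus headroom, rounded up to whole millicores."""
    peak = peak_cpu_cores(counter_samples)
    return math.ceil(peak * 1000 * (100 + headroom_pct) / 100)


def memory_request_bytes(usage_samples, headroom_pct=15):
    """Peak of memory usage samples (in bytes) plus headroom."""
    return max(usage_samples) * (100 + headroom_pct) // 100


MIB = 1024 * 1024
cpu_samples = [(0, 0.0), (60, 6.0), (120, 36.0), (180, 42.0)]
memory_samples = [300 * MIB, 400 * MIB, 350 * MIB]

print(cpu_request_millicores(cpu_samples))
print(memory_request_bytes(memory_samples) // MIB)
```

In practice you'd want the window to cover a few weeks of real traffic, as described above, so that the peak actually reflects your busiest periods.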

tip

Here's some great advice for setting requests and limits: https://home.robusta.dev/blog/stop-using-cpu-limits

Use PodDisruptionBudgets for all deployments

When Karpenter deprovisions nodes (due to consolidation, scale-down, or node rotation), it needs to know how slowly it should go to keep your service at a satisfactory capacity.

But remember: don't be too stingy with the maxUnavailable setting, especially if you're using spot instances. In the case of a spot interruption, Karpenter needs to be able to drain a node within the 2 minute spot interruption window, and if your maxUnavailable setting is too low, it won't be able to drain the node fast enough to avoid a decidedly ungraceful shutdown of your pods.

Even if you aren't using spot instances, choose a good maxUnavailable setting so that your deployments and node maintenance will run in a reasonable amount of time.
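A back-of-the-envelope sketch of why this matters, under some loud assumptions: every pod on the node belongs to one deployment covered by a single PodDisruptionBudget, evictions happen in batches of maxUnavailable, and each batch takes a fixed number of seconds before replacement pods are ready elsewhere. Real drains are messier, and all names here are hypothetical.

```python
import math

# The 2 minute spot interruption window mentioned above.
SPOT_INTERRUPTION_WINDOW_SECONDS = 120


def estimated_drain_seconds(pods_on_node, max_unavailable, seconds_per_batch):
    """Rough drain time when at most max_unavailable pods can be evicted at once."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1 or the node never drains")
    batches = math.ceil(pods_on_node / max_unavailable)
    return batches * seconds_per_batch


def fits_spot_window(pods_on_node, max_unavailable, seconds_per_batch):
    """True if the estimated drain finishes inside the spot interruption window."""
    drain = estimated_drain_seconds(pods_on_node, max_unavailable, seconds_per_batch)
    return drain <= SPOT_INTERRUPTION_WINDOW_SECONDS


# Hypothetical node with 6 replicas of one service, 45 seconds per eviction batch.
print(estimated_drain_seconds(6, max_unavailable=1, seconds_per_batch=45))
print(fits_spot_window(6, max_unavailable=1, seconds_per_batch=45))
print(fits_spot_window(6, max_unavailable=3, seconds_per_batch=45))
```

With maxUnavailable at 1 the estimate is 270 seconds, well past the window; raising it to 3 brings the estimate down to 90 seconds.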

Implement graceful termination correctly

A few times we found a developer with minimal Kubernetes experience had neglected to implement graceful termination. The most common mistake was to ignore SIGTERM, which would cause pods to just keep going until they exceeded their terminationGracePeriodSeconds and got SIGKILLed. Really long pod termination times make lots of stuff hard (including deployments), but they also make node deprovisioning/rotation with Karpenter really slow.

On the other hand, pods that shut down immediately when receiving a SIGTERM, especially if they're in a load balancer target group, will result in HTTP clients receiving 5XX errors, because the pod will have terminated before the load balancer stopped sending requests to it.

Fun Fact

If you use the aws-load-balancer-controller, and have an ALB target group pointing to pod IPs, you need a minimum 15 second delay from when Kubernetes first sends a SIGTERM to your pod until you begin actually shutting down. It takes about that long for the controller to update the ALB target group to deregister the pod IPs. Even if your app's HTTP server has a way to stop accepting new connections and drain existing ones, you should wait the 15 seconds to even start the process, since the ALB may keep sending new connections regardless of whether your HTTP server will accept them.
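Here's a minimal sketch of that shutdown sequence using only the Python standard library. It's an illustration of the pattern (catch SIGTERM, keep serving for the delay, then stop accepting and drain), not how any particular framework does it. The class and function names are mine, and the port is arbitrary.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Delay taken from the text above; tune it for your own load balancer setup.
DEREGISTRATION_DELAY_SECONDS = 15


class DrainingHTTPServer(ThreadingHTTPServer):
    # Non-daemon request threads, so server_close() waits for in-flight requests.
    daemon_threads = False


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def graceful_shutdown(server, delay_seconds, sleep=time.sleep):
    # Keep serving while the load balancer is still deregistering this pod.
    sleep(delay_seconds)
    # Stop the accept loop, then wait for in-flight request threads to finish.
    server.shutdown()
    server.server_close()


def main(port=8080):
    server = DrainingHTTPServer(("0.0.0.0", port), HelloHandler)
    sigterm_received = threading.Event()
    # On SIGTERM, only set a flag; the main thread does the actual shutdown.
    signal.signal(signal.SIGTERM, lambda signum, frame: sigterm_received.set())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    sigterm_received.wait()
    graceful_shutdown(server, DEREGISTRATION_DELAY_SECONDS)
```

You'd call main() as the container's entrypoint. Keep in mind that the pod's terminationGracePeriodSeconds has to be longer than the delay plus however long your in-flight requests take to finish, or the pod gets SIGKILLed partway through the drain.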

And remember folks, always monitor your load balancer CloudWatch metrics. If you see spikes of random 5XX errors coming from your load balancer on deployments or node draining, you don't have graceful shutdown working properly.

Set a max instance CPU size

Fun story, I once got an alert about a crashlooping FluentBit daemonset pod. I went to check it out, and found the poor agent was getting absolutely crushed with logs. It took me a minute to realize that this was because Karpenter had provisioned an instance with 72 CPUs, and had scheduled a gazillion pods onto it. Of course a single FluentBit pod, with resources tuned for nodes that were quite a bit smaller, wouldn't be able to keep up.

It turns out that that instance was chosen by Karpenter because it was a spot instance, and was in fact the most cost effective way to schedule all the pods that had been created during a particular burst of activity! Who knew!

After that day, I set our Karpenter configuration to add a maximum instance size (16 CPUs), just to simplify how we manage daemonset resources.

Don't use instance types with bursting

Karpenter's purpose in life is to efficiently match requests with capacity, and that doesn't work well with instance types that can suddenly run out of bursting credits and have a totally different capacity (e.g. EC2's t instance types).
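For these last two tips, here's a sketch of what the constraints look like, written as Python data plus a tiny matcher so you can see how they filter candidates. The requirement shape (key/operator/values) mirrors Kubernetes node selector requirements. The label keys karpenter.k8s.aws/instance-cpu and karpenter.k8s.aws/instance-category are my understanding of the labels Karpenter's AWS provider exposes, so check them against the docs for your Karpenter version. The candidate instances are invented.

```python
# Assumed label keys; verify against your Karpenter version's documentation.
NODE_REQUIREMENTS = [
    # Cap instance size at 16 CPUs.
    {"key": "karpenter.k8s.aws/instance-cpu", "operator": "Lt", "values": ["17"]},
    # Exclude burstable ("t") instance families.
    {"key": "karpenter.k8s.aws/instance-category", "operator": "NotIn", "values": ["t"]},
]


def satisfies(labels, requirement):
    """Check one requirement against a node's (or instance type's) labels."""
    value = labels.get(requirement["key"])
    operator = requirement["operator"]
    values = requirement["values"]
    if operator == "In":
        return value in values
    if operator == "NotIn":
        return value not in values
    if value is None:
        return False
    if operator == "Lt":
        return int(value) < int(values[0])
    if operator == "Gt":
        return int(value) > int(values[0])
    raise ValueError(f"unsupported operator: {operator}")


def allowed(labels, requirements=NODE_REQUIREMENTS):
    """True if the labels satisfy every requirement."""
    return all(satisfies(labels, r) for r in requirements)


# Hypothetical candidates, described only by the two labels we care about.
huge = {"karpenter.k8s.aws/instance-cpu": "72", "karpenter.k8s.aws/instance-category": "c"}
burstable = {"karpenter.k8s.aws/instance-cpu": "2", "karpenter.k8s.aws/instance-category": "t"}
general = {"karpenter.k8s.aws/instance-cpu": "8", "karpenter.k8s.aws/instance-category": "m"}

print(allowed(huge), allowed(burstable), allowed(general))
```

Only the 8 CPU general-purpose candidate passes both checks; the 72 CPU monster and the burstable instance are filtered out.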

I often listen to the podcast KubeFM, and the host, Bart Farrell, usually starts by asking each guest which 3 tools they would install first on a new Kubernetes cluster.

For me, #1 is going to be Karpenter, every time.

Kubernetes might not be for you

Sun, 14 Apr 2024 00:00:00 GMT

Most mornings, after pulling myself out of bed, I put some semblance of a breakfast together. While eating, I usually take in the news (via the Android app of a traditional newspaper). I timebox this to about 10 minutes, which fits my breakfast-eating pace, and balances my desire to be an educated, responsible citizen with my tolerance for the existential dread I'm going to feel after reading about US politics.

Pepper and coffee

Once I'm sufficiently fed/educated/terrified, I head over to the couch, where the dog joins me for a cuddle while I sip my coffee. At this point, I usually switch over to Hacker News. I've found that Hacker News is a pretty reliable purveyor of articles on topics that overlap my interests. I also appreciate that it gives me a nudge to get outside my go-to subjects, into pretty niche topics in tech, science, math, culture, philosophy (and interesting people who recently died) - all with a taint of delightful nerdiness.

And unlike the comments in my newspaper's app (which are really best avoided), the culture of discussion in the Hacker News community is explicitly civil, and values intellectual honesty and earnest curiosity. Comments are often written by surprisingly authoritative people, and points of debate are usually argued in a thoughtful way. Especially for topics I may never have put much thought into before, this leaves me with a sense of the breadth of possibilities and an appreciation for different points of view.

Of course, some articles are on topics in which I'm professionally knowledgeable, and these can land on me... differently. One of these topics is Kubernetes (and platform/cloud engineering in general), and unlike a lot of topics on Hacker News, I've found that this topic provokes a certain, reliably repetitive debate.

The Prototypal Hacker News Kubernetes Article

Here's a synopsis of this type of article:

My app is deployed with a simple shell script over ssh, scales great, and handles X requests a second. This is simple and effective; people who choose Kubernetes are accepting a huge amount of complexity and overhead to do what I can do with much simpler tools.

This can also get started from the opposite direction:

Our company uses Kubernetes, and here's some of the lessons we've learned, tools we use with it, and problems we've solved. There have been challenges, but overall we're happy with Kubernetes.

The themes of comments tend to fall into these categories:

Kubernetes won a marketing war, its popularity is driven by resume-driven DevOps people or IT executives obsessed with buzzwords
Here's a list of my Kubernetes horror stories (with some legitimate, truly horrifying stories)
Kubernetes is unnecessarily complex, here's the way I do it instead (simpler bespoke solutions with Ansible/Docker/Terraform/ssh, public cloud-provided services like ECS/Lambda, monolithic deployments, simpler alternatives such as Nomad, etc)
Kubernetes is designed for FAANG scale, and isn't appropriate for mere mortals. You're never going to need to scale to the point where the overhead is justified.

The counterpoints are usually something like this:

Here's all the things you get from Kubernetes that you'd have to solve yourself if you don't use it (rolling deployments, self-healing, load balancing, autoscaling, etc)
The overhead of Kubernetes is worth it when you're faced with the need to scale your app significantly
Kubernetes' complexity is an expression of the essential complexity required to run distributed systems, it provides the right abstractions for the problem space
Kubernetes came out of the experience of many iterations of operational platforms at Google, and solves the problem of large deployments more effectively than anything else

The TL;DR on if Kubernetes is right for you

Cards on the table: I'm a fan of Kubernetes. I've been working in the platform engineering space for about 15 years, and for the last 7 years I've been building platforms on top of Kubernetes. There's a lot of reasons for this, but I want to be clear about a few things:

The complexity cost of Kubernetes is very real
There are many legitimate horror stories (especially before well-managed managed Kubernetes solutions were available)
It requires a (perhaps small) team of experts to run effectively at scale
Kubernetes doesn't really provide everything you need without building a developer platform on top of it

Here's the most concise heuristic I can come up with to determine if Kubernetes is an appropriate solution for you:

Kubernetes is almost certainly the wrong choice for a small team, or startup who's rapidly iterating on their business model, and/or anyone who's running a small number of distinct workloads (e.g. a single monolith)
Kubernetes is almost certainly a great choice for a larger team (~30+ developers), companies with microservices architectures (or more than a few independent high-availability workloads), where the overhead of building a platform can be amortized across all your developers and workloads

If you're in the area in between these two points, there's probably a bunch of context, trade-offs, and other factors to consider.

The actual whole point of Kubernetes

Here's what I think is the most relevant property of Kubernetes to feed into the decision:

Kubernetes provides an abstraction layer between developer concerns and operational concerns.

It provides a rich set of primitives that describe workloads in terms that developers care about (e.g. memory/CPU requirements, auto-scaling factors, health checks, etc) that make them utterly consistent from an operational perspective. This seemingly simple separation of concerns provides very significant operational capabilities that don't exist in a heterogeneous environment.

Once you've gotten to a certain level of maturity with Kubernetes, a platform team can do pretty major changes and upgrades of the platform and its components across the company's entire portfolio of applications - safely - with confidence that it won't affect customers. Even more, this can be done without much (if any) coordination with developer teams!

For example, a platform team could run OS upgrades, deploy new security/compliance tools, swap out observability infrastructure, or make significant changes to the network design across your entire fleet of services, safely and without disruptions. If you have a lot of different workloads (regardless of the scale of each workload), doing this kind of thing without Kubernetes can be extremely costly, time-consuming, and scary.

Kubernetes is a platform for building platforms

The platform my team has built on Kubernetes was inspired by SaaS platforms like Heroku or Fly.io, and is designed to feel serverless. Developers don't have to know a lot about the underlying capabilities of the cloud provider, how ingress is implemented, TLS certificates, load balancing, reserved vs spot instances, how logs, metrics, or traces are captured, etc. They only have to develop against a specific contract, the core of which is a small subset of the capabilities exposed by Kubernetes, which we simplify a great deal through intelligent defaults, validation, and scaffolding.
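To make the "contract plus intelligent defaults" idea a bit more concrete, here's a minimal sketch in Python. To be clear, this is not our actual platform code: `ServiceSpec`, `to_deployment`, and all of the default values are hypothetical names and numbers made up for illustration. The only real thing here is the shape of a standard Kubernetes `apps/v1` Deployment manifest.

```python
from dataclasses import dataclass


@dataclass
class ServiceSpec:
    """Hypothetical developer-facing contract: only the things a developer cares about."""
    name: str
    image: str
    port: int = 8080              # assumed default, purely illustrative
    cpu: str = "250m"             # assumed default CPU request
    memory: str = "256Mi"         # assumed default memory request/limit
    replicas: int = 2             # assumed default replica count
    health_path: str = "/healthz" # assumed default health check path


def to_deployment(spec: ServiceSpec) -> dict:
    """Validate the small contract and expand it into a full Deployment manifest."""
    if spec.replicas < 1:
        raise ValueError("replicas must be at least 1")

    labels = {"app": spec.name}
    probe = {"httpGet": {"path": spec.health_path, "port": spec.port}}
    container = {
        "name": spec.name,
        "image": spec.image,
        "ports": [{"containerPort": spec.port}],
        "resources": {
            "requests": {"cpu": spec.cpu, "memory": spec.memory},
            "limits": {"memory": spec.memory},
        },
        # The same health check drives both probes in this sketch.
        "readinessProbe": probe,
        "livenessProbe": probe,
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec.name, "labels": labels},
        "spec": {
            "replicas": spec.replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    import json

    manifest = to_deployment(ServiceSpec(name="api", image="example/api:1.0"))
    print(json.dumps(manifest, indent=2))
```

The point of the sketch is the asymmetry: a developer fills in two fields, and the platform owns everything else, which is what lets a platform team change those defaults for every service at once.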

I don't want to understate the effort we've had to put in to get to this point, but the result is that our developers really like this platform- and are measurably happier and more productive across a whole bunch of objective dimensions. In internal engagement surveys, it's one of the most universally praised aspects of life in our engineering team.

This said, my company has about 150 developers working on cloud-based services. We manage a huge fleet of microservices that communicate with IoT devices across the world. In our case, the cost of building a great developer platform on Kubernetes is absolutely dwarfed by the cost of the alternatives (a number of which we've used in the past). It's been an enormous win for us in terms of customer experience, reliability, developer happiness, security, cost management, and ease of operations.

If your company looks like this (or larger), an internal platform built on Kubernetes is likely a great choice.

No Kubernetes for you

All this said, here's the other side:

I'm not sure if I've got another startup in my future, but if I did, the odds that I would be using Kubernetes are pretty close to zero. I'd be focusing on finding a business model that made customers happy, and every second I spent on operational concerns would be directly in competition with that goal. The abstraction layer between developer and operational concerns wouldn't provide any value in this situation, since a small number of people would be managing both. I'd probably choose the simplest deployment/operational tools available, and I'd spend as little time worrying about scaling as possible.

But later on… if we were successful, had found our market, and we were scaling up (e.g. hiring beyond our ~30th developer), and we had more than a few distinct workloads… I'd start thinking about Kubernetes again.

The moral of the story

If Kubernetes seems like an over-engineered, complicated, and inscrutable solution to a problem you don't think you have… you probably don't. Kubernetes isn't for you. But there's a reason it's used so broadly, and it's not because everyone is dumb and distracted by buzzwords.

Hopefully we can raise the level of discussion on Hacker News around Kubernetes to match that of a random (but admittedly delightful) article about mechanical watches.

How I quit painting and became a computer geek

Sat, 05 Dec 2009 00:00:00 GMT

For those of you who knew me as a painter (up until about 1999), you may be confused to hear that I quit painting completely and I'm now working as a software engineer.

So here's the deal: My junior year in college at Cooper Union (an art school), after having already spent close to 7 years painting seriously, I went online for the first time, did some email, and played around on the web. I decided that, in order to prevent myself from becoming a total Luddite, I should learn some stuff about using computers to make art. So I started with Photoshop, a little Illustrator, and Flash. After a semester, it occurred to me that this might lead to a reasonably comfortable day job to help support my painting habit.

So I took some classes in graphic design my senior year, and promptly got myself a job to support myself when I started grad school. The first thing I did was build a website for Samsonite, which went well. I started working with a variety of engineers and graphic designers at a consulting firm which built websites for practically every major casino in Vegas. This, of course, wouldn't have been my first choice, but beggars (or painters) can't be choosers.

Meanwhile, I began to play with 3D animation and enjoyed it quite a bit. I started using a language called MEL (Maya Embedded Language), which was really my first programming language (not counting HTML, which is really a markup language, but that's neither here nor there). I began to slowly realize that I was more interested in this and my job than what I was doing officially in grad school (which was still painting).

Painting became less and less pleasurable. The kind of paintings I wanted to make became increasingly difficult to execute… to the point that I started dreading going to the studio. One day, while working on a painting, I sat down for a little break and decided I was done. Then I decided that I was really done. No, REALLY done. No more painting. I reached the end of the road, I had painted everything I wanted to paint.

I took a year off of grad school and focused on my job and making a short animated “film”. In both areas, I increasingly toyed with programming, finding that once you got used to it, it was quite interesting, fun, and powerful. I had become a project manager by that time, and I stopped hiring programmers to do simple tasks and instead did them myself. This went on for a couple more years while I finished grad school, having used my animated short as a thesis project.

Then I moved to Boston to be with my girlfriend (now wife), Kate. I took a job at a company called CardScan with the intention of doing what I had before, project management, a mix of design, usability, web development and a splattering of programming. They had a different idea, and I soon found myself part of an engineering team, doing full-on, soup-to-nuts web development.

So 8 years, half a dozen programming languages, and a couple jobs later, I find that I can code circles around lots of kids that come out of top-notch engineering schools. Though my real “expertise” in engineering is UI (user-interface, which is really kinda art related), a significant amount of my day-to-day work is real software engineering. I find that it actually satisfies the same parts of my brain that painting used to (problem solving, critical thinking, extracting beauty from chaos).

In retrospect: While making art was something I was good at, I didn't ever really like it the way that some people like it. I never found it cathartic, or liberating, fulfilling, or any of the other things my artist friends claimed it was to them. When I was a kid, drawing was simply a way for me to visualize the things I wanted to have and/or build (often robots and spaceships). Engineering allows me to build things I'm interested in, so drawing no longer serves a purpose as an end unto itself.

Art, however, is not unlike a chronic disease (e.g. malaria) in that it forever colors your perspective, and you can't ever get rid of it. I spent so long learning how to manipulate, pervert, glorify, and distort perceptions that now, for me, bullshit glows in the dark. This is surprisingly useful as an engineer.
