What would an OSS developer platform even look like?
My team has built a developer platform that our developers really like, and is providing a ton of value for my company. But I'm struggling to figure out if and how we might open-source it. I'm looking for advice from you.
As a platform engineer, I enjoy the benefits of working in a field with a vibrant ecosystem of open source infrastructure and developer tools. I've spent much of the last decade building developer platforms by curating and assembling these tools, and after a number of iterations, I seem to have hit on something that's working really well for my current company (SimpliSafe).
As our platform's adoption has grown, we've gotten more and more frequent, really positive, heartwarming feedback from our developers who really like it. This is absolutely freaking delightful, and honestly never stops surprising me.
I often get asked by our developers if we should consider open-sourcing the platform. I've spent some cycles entertaining the idea, but I usually don't get very far before it seems unworkable.
This post is an experiment in thinking in public; I'd like to brain dump my thoughts on the challenges of building an open-source developer PaaS, in the hopes that the platform engineering community might provide some insight to get me past this block.
So, tell me more about this platform
Our platform is named "dex/EKS", which is (an admittedly awkward) combination of the name of the client tool, "dex", with the AWS service the server-side is built on (EKS: AWS's managed Kubernetes service). Unsurprisingly, developers tend to just call the whole thing "dex".
In the spirit of the "Platform Engineering" buzzword, dex/EKS encapsulates our company's collective opinions, policies, and best-practices for building, deploying, and operating apps. I like to think of it as a PaaS that we've curated and glued together out of a bunch of open-source and vendor tools.
dex
(the client tool itself) is a CLI tool for interacting with the platform. Picture the flyctl
, vercel
, or heroku
CLI.
dex
is intentionally lowercase, or as I like to call it: "hipster-case". Or maybe camelCase without the humps? I dunno. It's a thing.
Our Kubernetes distribution
In addition to the client tooling, we also have a fairly sophisticated Kubernetes "distribution", which consists of a bunch of curated cluster-side components, combined and configured to work well together. Our cluster configuration is defined with Terraform, and we use it with Github Actions to manage many dozens of EKS clusters. Beyond that, there's integrations with a bunch of third-party SaaS providers, including AWS services and other vendors.
Just to give you a sense of the ingredients that comprise the platform, here's a partial list:
- Kubernetes (AWS EKS)
- Docker/BuildKit
- Github Enterprise (with Github Actions)
- Okta
- Artifactory
- Grafana Cloud
- Honeycomb
- OpenSearch (for logs)
- A bunch of AWS services (ECR, S3, SSM, SecretManager, Route53, ACM, WAF, etc)
Here's a few of the tools (from the Kubernetes ecosystem) we use in our EKS configuration:
- aws-load-balancer-controller
- ingress-nginx
- Karpenter
- external-dns
- external-secrets
- Fluent Bit
- OTEL (OpenTelemetry) collectors
- KEDA
- Kyverno
- Velero
dex: the CLI tool
The CLI tool abstracts and integrates the APIs of these infrastructure components and exposes them through a simplified, declarative set of configuration and CLI commands. Some of the things it handles:
- Configuration management (what settings your app gets in different environments)
- Secrets management
- Cross-platform and multi-platform container image builds
- User authentication (i.e. SAML auth via Okta, to Kubernetes, AWS, and Artifactory)
- AWS IAM integration (allows you to assign AWS permissions to your app)
- Kubernetes manifest management (imagine a simplified version of helm)
- Ingress management (load balancers, certs, and DNS)
- Vulnerability scanning
- CI/CD integration (Github Actions)
- Telemetry pipeline integration
- Docker/Kubernetes-based integration testing framework
One of the key metrics we hold ourselves to is that developers have to be able to get a new "hello world" service up and running in less than 10 minutes, at which point they can turn their focus to business problems. As they get closer to production, they have a few more decisions to make about autoscaling, observability, etc, but for the most part, the platform narrows down the choices to just a few fully-baked, meticulously documented, well-trodden paths.
dex has some other extensibility mechanisms for more advanced use cases, such as the ability to author custom commands with arbitrary TypeScript, which can re-compose existing dex commands and any of it's constituent APIs.
Teams sometimes use this extensibility to explore the frontier of what's possible. If they find a new pattern to be useful, we will often incorporate it into the platform.
For example, dex's multi-region DNS configuration support was originally built by another team, who then contributed it upstream, so everyone else in the company could use it.
The impact of dex at SimpliSafe
Teams at SimpliSafe have migrated the majority of our services to dex/EKS, and most teams are planning on moving their remaining services over in the next year or so. This has happened with close to zero pressure from management; teams are moving their services to the platform because they're much happier with it than without it.
Suffice to say, I'm very proud of this outcome, and dex seems to be providing a lot of value.
A platform design reflects a company's culture
While it may appear that dex is just a set of choices about tools and services that have been manifested in glue code, it also reflects SimpliSafe's values and organizational culture. While most of these values and cultural properties fairly well subscribed, they're by no means universal:
- Teams should have a great deal of autonomy to choose tools, languages, frameworks, and processes, and they should be accountable for operating the systems they build
- Continuous delivery is better than big-bang releases
- Microservices are a good way for a large team to build a big system
- A central team should own cross-cutting concerns like telemetry pipelines and observability backend tools, auth, infrastructure provisioning, etc
- Infrastructure should be represented as code and managed through automation
There's a bunch of others, but you get the picture.
These values deeply inform the design of dex, and it's interesting to look back on the iterations of platforms that I've built at different companies with different values, and how they were reflected in the design of the platforms.
Example: ephemeral environments
Ephemeral developer environments are a feature I 100% know is a huge win for developers, regardless of your company culture. But there have been big differences in design, features, and implementation details when I've built this feature at different companies.
What are ephemeral environments?
Here's the gist: A developer should be able to deploy their app into an isolated, temporary environment from their local machine into Kubernetes, with a single command, so they can iterate, tweak, and test ideas in a production-like environment. They'll need a URL to access the app, so they can play with usability, do manual testing, attach a debugger, troubleshoot config, etc. When they're done, they use another command to tear the whole thing down.
Additionally, for every feature branch they push to Github (or Gitlab, Bitbucket, etc), a CI job will deploy an ephemeral environment with the same properties, run automated tests, and tear it all down when complete (or after a specified period of time).
Ephemeral environments in a financial institution
At my last job, in a commercial bank, we had a very small number of tightly controlled, multi-tenant Kubernetes (OpenShift) clusters. As you might imagine in a highly-regulated industry, the creation of Kubernetes namespaces (and the associated access controls) was governed by security controls, required approvals, and needed to leave an audit trail. The process was managed by a central team.
To allow creation of dynamic, isolated environments, we worked within the static structure of centrally managed namespaces by designing our tooling to generate Kubernetes objects using strict naming conventions (e.g. prefixing all resources with the name of the developer or feature branch). This allowed the tooling to manage the objects as a unit, avoid collisions, and ensure that the objects were cleaned up when the developer was done.
This design decision trickled into many other aspects of the system. For instance, we designed our client tooling to maintain pretty tight control over rendering the objects, and the relationship of objects via names, labels, and selectors.
Ephemeral environments at an IoT security company
At SimpliSafe, the company's culture and preexisting architecture enabled a very different approach: ephemeral environments are implemented via Kubernetes namespaces, and the client tooling can create (and destroy) namespaces dynamically.
Because we have per-team AWS accounts, and our Kubernetes clusters already provide strong isolation, we're comfortable giving developers the power to manage namespaces in our non-production environments. This removes a lot of the need for strict control over object relationships in Kubernetes, and gives developers more flexibility to mess with the underlying objects more directly.
This additional power is a reflection of SimpliSafe's culture of autonomy and trust in developers.
Different tradeoffs, different design
So even with the same feature, providing pretty similar benefits to developers, we had to make very different tradeoff decisions, and ended up with design differences which significantly impact the architecture, features, and feel of the rest of the platform.
What would this look like open-sourced?
Given the two examples of companies with different infrastructure opinions, let's think through the possible flavors of open-sourcing a developer platform like dex:
Option 1: A hyper-opinionated "PaaS in a box"
This option assumes that the infrastructure decisions we've made at SimpliSafe would be a good fit for at least a bunch of other companies, with minimal modification. We'd provide the whole thing, end-to-end, including the EKS cluster configuration and terraform, all the cluster-side system components, and the dex client-side tool.
I find this option hard to imagine for a few reasons:
- While I'm very confident we've got a great solution for SimpliSafe, I think it's virtually impossible that any other company would be happy with all our opinions (the bank certainly wouldn't have been). Our platform glues together scores of specific OSS products (and a number of SaaS vendor tools), and the odds that every one of them lines up with another company's preferences is close to zero.
- A platform engineering team using this version of the platform would be signing up to build expertise and support every OSS production we've chosen.
- While out-of-the-box, opinionated platform might be good for a startup, our platform is certainly NOT the the right choice for a startup. It's designed around supporting many teams, and to allow a central platform engineering team to manage infrastructure underneath teams' apps... which is not the problem engineers at a startup should be worrying about.
- Among the opinions encapsulated in our platform are some we're not happy about. We have a few compromises based on legacy infrastructure choices that are hard to change, and some choices which are an intermediate phase between where we are and where we want to go. For example, we're currently using the telegraf-operator to collect metrics for lots of our services, but we'd prefer to be using OTEL SDKs and/or Prometheus libraries.
I actually can't imagine myself choosing to use someone else's OSS platform if it were built on this philosophy.
Option 2: A whole platform, but pluggable
In this variant, we'd provide also provide the whole platform, but allow users to bring their own infrastructure opinions via a plugin API.
I also see some big disadvantages here:
- Abstraction layers add complexity. Part of the value of dex is that the code is relatively simple, straightforward, and hackable. We often get a PR or feature request, and end up cutting a new release within hours. This would not remain the case if we started adding abstraction layers everywhere.
- Testing and maintaining compatibility with all possible plugins would be a huge burden. Right now, dex's integration tests are both comprehensive and fast, and it would be virtually impossible to maintain this level of coverage if we had to test against an ecosystem of plugins.
- It's really hard to build good abstraction layers, even for simple things. And these infrastructure components are definitely not simple. We'd be constantly expanding and modifying the APIs to support additional opinions, and the abstractions would inevitably leak.
- Many of the infrastructure choices we've made allow us to simplify the design of the platform, and these simplifying assumptions wouldn't be valid if we allowed arbitrary plugins. Tight coupling, in this case, is part of the special sauce for creating a really streamlined and cohesive developer experience.
- Comprehensive documentation would be much more complicated and far less useful, since docs would have to simultaneously support the perspective both of the platform developer as well as the end-user developer. dex has lots of docs based on developer use cases, and it wouldn't be possible to provide these if the whole experience were built on plugins.
- Kubernetes APIs already provide so much power and extensibility, and especially when you throw in Operators, CRDs, and custom controllers, it's hard to imagine how I could provide APIs that would support all the flexibility Kubernetes offers.
I think realistically, this solution would start as option 1 and then abstractions would be gradually added by contributors to support their particular infrastructure choices, so it's probably best to think of this option as a spectrum with more or less pluggability.
Option 3. A toolkit for building your own platform
Another option is to factor out individual components of the platform as standalone libraries, and let people build their own platform. I could imagine some of dex's components being useful for someone who wants to build a different opinionated platform.
One example of a generally useful component is our config system:
- Our config schema is defined as a tree of TypeScript classes, which can be used to generate a JSON schema, which can be used by other tooling to provide instant validation (e.g. via VSCode's JSON schema integration), to validate at runtime, an also to generate documentation.
- The config system supports defining arbitrary target environments, which can use inheritance (and other mechanisms) to share common settings, and override them as needed.
- It has a mechanism for declaring dynamic config values (e.g. a value from a Parameter Store secret, or based on the current git branch name, etc).
- The config loader returns a config tree object which is built out of JavaScript Proxy objects, which allows us to do very smart validation, with user friendly error messages, and play to TypeScript's strengths.
That said, turning this config module into a separate npm package would have some tradeoffs:
- The inherent packaging tax: working with multiple npm packages is more complex to develop, debug, and test locally.
- It's abstractions would feel leaky to a user. For example, JSON schema generation requires some special build configuration, and this would appear a bit finicky if it was intended to be used off the shelf.
- There's some other aspects of our config system that are currently tightly coupled with other parts of the platform. This is all just code, so of course we could figure out how to decouple it, but there would be a decent amount of net-new complexity as a result.
More generally, I think the challenge with this approach is that most of the value of the platform stems not from the individual components, but from their integration. For example:
- dex's packaging/distribution mechanism (a stable CLI + a fast-moving, versioned library) has many moving parts
- dex's own build system is very sophisticated, and has a lot of features around building canary releases, and enabling debugging in a sample host project
- dex also has a suite of integration tests are fairly involved and comprehensive
- The documentation of dex's UI (both its command line interface and config interface) is a huge factor in dex's success at SimpliSafe, and would have to be built from scratch for a new platform.
A plea for help
So I'm sitting on this great set of tooling, which is providing a ton of value for my company. It's built on OSS, public cloud, and SaaS services, and there's no proprietary magic or novel intellectual property we're trying to protect. It solves a problem that a huge number of medium-large technology companies would have to tackle.
Why can't I see a way to share this with the world? Maybe I'm just not being imaginative enough. I'm 100% certain this isn't a novel situation.
What do you think?