<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="rss.xsl"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Software, thoughts, and stuff Blog</title>
        <link>https://labaneilers.com</link>
        <description>Software, thoughts, and stuff Blog</description>
        <lastBuildDate>Sun, 15 Mar 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright © 2026 Laban Eilers</copyright>
        <item>
            <title><![CDATA[Thinking never goes out of style]]></title>
            <link>https://labaneilers.com/thinking-never-goes-out-of-style</link>
            <guid>https://labaneilers.com/thinking-never-goes-out-of-style</guid>
            <pubDate>Sun, 15 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[I've found myself still occasionally hand-writing some code, even though I've gone almost entirely all-in on AI-assisted engineering. I'm considering the value of programming by hand as a cognitive tool, much like writing, that can help facilitate deep thinking, combat biases, develop instincts, and lean into one's strengths as a human being in an age of token-chomping bots.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>I've found myself still occasionally hand-writing some code, even though I've gone almost entirely all-in on AI-assisted engineering. I'm considering the value of programming by hand as a cognitive tool, much like writing, that can help facilitate deep thinking, combat biases, develop instincts, and lean into one's strengths as a human being in an age of token-chomping bots.</p></div></div>
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/writer-b93ff6bd4193fdef01bc1952d6f79bcd.jpg" alt="A writer"></figure>
<p>I haven't written much for the past few months- for a couple reasons.</p>
<p>First, I've become utterly addicted to AI coding tools, and I haven't been able to pull myself away from watching my AI agents <strong>absolutely tear through my feature backlog</strong>. Suddenly nothing can escape my reach, from long-postponed major refactoring, to minor annoyances, to features that were tricky enough I wasn't sure they'd ever make sense to tackle.</p>
<p>Even my old nemesis, CSS, is no match for me on a dopamine-fueled, manic, AI coding rampage.</p>
<p>Secondly, while I'm learning an insane amount with this stuff, it feels like all my mind-blowing insights have a shelf-life of about 5 minutes. Everyone else in the world seems to be coming to all the same conclusions. The next morning, everything I write just seems painfully <em>obvious</em>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="please-dont-tell-anyone-but-i-wrote-this-gasp-by-hand">Please don't tell anyone, but I wrote this (gasp) by hand<a href="https://labaneilers.com/thinking-never-goes-out-of-style#please-dont-tell-anyone-but-i-wrote-this-gasp-by-hand" class="hash-link" aria-label="Direct link to Please don't tell anyone, but I wrote this (gasp) by hand" title="Direct link to Please don't tell anyone, but I wrote this (gasp) by hand" translate="no">​</a></h2>
<p>Despite being neck deep in a frantic, explosive bout of creativity, I've occasionally taken some time to go a bit beyond just reviewing the code Claude's writing for me, and to think about it a bit more deeply. A few times I've even caught myself typing- realizing that I'd been at it for a half-hour or so- without noticing how objectively <em>weird</em> it is to be doing that when I have an army of tireless, magical code-writing gremlins at my disposal.</p>
<p>I realize it might appear that I'm wasting time in an anachronistic attempt to hand-craft some artisanal TypeScript- to infuse it with some humanity and love... or some shit like that. Or perhaps that I'm just having trouble letting go of the joy I used to experience from the puzzle-solving aspect. I mean, it might be a little of those things- at least it was at first. But it isn't anymore.</p>
<p>Within only a few months after committing myself fully to getting good at applying AI to engineering, my brain has already been completely rewired. A few months ago, AI-assistance was getting me at most a 20% productivity boost. Now, with better tools, a new mindset, and lots of practice- it's conservatively like 5-10x.</p>
<p>The other day, when I set out the ingredients in preparation for cooking dinner, I was <em>viscerally disappointed</em> when I realized I couldn't spawn subagents to chop the onions, peppers, and carrots.</p>
<p>Don't lie, you know this has happened to you too.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="old-habits">Old habits<a href="https://labaneilers.com/thinking-never-goes-out-of-style#old-habits" class="hash-link" aria-label="Direct link to Old habits" title="Direct link to Old habits" translate="no">​</a></h2>
<p>I've been ruminating on why I still occasionally revert to my old habits. I know the code I write by hand isn't any better than what Claude would do, and I'm absolutely sure I could have achieved the same outcomes (certainly faster) by prompting it. But at the speed the agents were moving, something felt off about how much time I was spending thinking <em>deeply</em> about anything.</p>
<p>And I think, until now, I hadn't appreciated how much programming could be like writing: <em>a tool to facilitate thinking</em>.</p>
<p>Paul Graham once wrote, in a <a href="https://paulgraham.com/writing44.html" target="_blank" rel="noopener noreferrer" class="">tiny essay on writing</a>:</p>
<blockquote>
<p>I think it's far more important to write well than most people realize. Writing doesn't just communicate ideas; it generates them. If you're bad at writing and don't like to do it, you'll miss out on most of the ideas writing would have generated.</p>
</blockquote>
<p>This hits home. I abort roughly 2/3 of my attempts to write blog articles, mostly because the process of writing helps me see how fucking stupid some of my own ideas are. Sometimes I start writing out a thought, and find that it's become completely unrecognizable by the time I'm done- because writing forced me to think it through. Seeing my own thoughts in concrete form allows me to feel how they might land on another person, which helps weed out all the bullshit and expose what seems true.</p>
<p>This is <em>valuable</em>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="programming-is-also-thinking-made-concrete">Programming is also thinking made concrete<a href="https://labaneilers.com/thinking-never-goes-out-of-style#programming-is-also-thinking-made-concrete" class="hash-link" aria-label="Direct link to Programming is also thinking made concrete" title="Direct link to Programming is also thinking made concrete" translate="no">​</a></h2>
<p>When I'm really in the zone while programming, I float between a meditative, dissociative state, and a more analytical, critical one. It forces me to engage with a problem in abstract terms, and then pop back up and consider the broader effects of the changes I'm making.</p>
<p>This mindset gets to the core of what humans add to the AI-assisted engineering equation. I'm still way better than Claude at knowing when to step back and ask questions like:</p>
<ul>
<li class="">what tradeoffs come along for the ride with this change (especially tradeoffs beyond the knowledge and context window of the LLM)?</li>
<li class="">what are the intrinsic relationships/coupling between entities, and how well does this model the corresponding concepts in the real world?</li>
<li class="">what will the 2nd order effects of this change be?</li>
<li class="">what social/emotional/behavior outcomes am I actually looking for?</li>
<li class="">are there assumptions built into my idea that I could test with less risk?</li>
</ul>
<p>But without any time spent with my mind immersed in types, data structures, or algorithms, I found it was really easy to get swept up in what felt like a creative frenzy, only to later realize that I was too disconnected from the details of the problem, and my instincts ended up being <em>all wrong</em>.</p>
<p>Code relates to the real world in all kinds of ways- some obvious, and some much more subtle. Human intuitions around something this complex require some effort to nurture.</p>
<p>Sometimes, the <em>inherent difficulty</em> of programming, like writing, is epistemologically valuable. When tools remove all friction, they can also remove <em>the struggle that creates the opportunity for insight</em>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-authored-patches-can-anchor-your-conceptions">AI authored patches can anchor your conceptions<a href="https://labaneilers.com/thinking-never-goes-out-of-style#ai-authored-patches-can-anchor-your-conceptions" class="hash-link" aria-label="Direct link to AI authored patches can anchor your conceptions" title="Direct link to AI authored patches can anchor your conceptions" translate="no">​</a></h2>
<p>A related effect of relying primarily on agents is that the code diffs they generate can reinforce and calcify your preconceptions about a problem space.</p>
<p>Another relevant quote, from George Orwell's essay, <a href="https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/politics-and-the-english-language" target="_blank" rel="noopener noreferrer" class="">Politics and the English Language</a>:</p>
<blockquote>
<p>But if thought corrupts language, language can also corrupt thought.</p>
</blockquote>
<p>Orwell refers here to "ready-made phrases" (e.g. cliches or idioms) whose power, by virtue of tradition or sheer catchiness, can infect patterns of thought and leave their victims vulnerable to sloppy thinking- and prone to drawing illogical conclusions.</p>
<p>I feel the echoes of this phenomenon when I read LLM-generated code. While LLMs can certainly generate code that's elegant and idiomatic, that code also has the side-effect of anchoring us to a particular conception of <em>how a problem is shaped</em>.</p>
<p>Since LLM output is largely a product of the language we use to prompt it (and the context we give it explicitly), it can be easy to accidentally produce output that's a reflection of our own (perhaps flawed) mental model. Plus, because LLMs are trained on all the code in the universe, we're likely to get middle-of-the-bell-curve ideas as outputs, unless we work deliberately to push them towards something more interesting.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cognitive-tools-old-and-new">Cognitive tools: old and new<a href="https://labaneilers.com/thinking-never-goes-out-of-style#cognitive-tools-old-and-new" class="hash-link" aria-label="Direct link to Cognitive tools: old and new" title="Direct link to Cognitive tools: old and new" translate="no">​</a></h2>
<p>Recognizing this, I've also developed plenty of <a class="" href="https://labaneilers.com/using-ai-to-promote-divergent-thinking">tricks to use LLMs as a critical, antagonistic thinking partner</a>: one that helps me poke holes in my own ideas and helps me explore divergent paths. This is, for me, a really amazing new tool to amp up my own creativity- and to combat my own cognitive biases.</p>
<p>But I don't think it's the only tool available. We've got tens of thousands of years of human history of creativity and critical thought behind us. Long before AI started spewing out probabilistic responses, we developed all kinds of tools and processes to facilitate innovation- and weed out bad ideas.</p>
<p>Written language has been a huge one- for several thousand years. Programming, which has been around for (depending on your definition) something like 80-180 years, has a lot of similar properties, certainly in communicative power, but also as a method to visualize concepts, force logical reasoning, and root out ideas that don't hold water.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="human-thought-is-still-in-style">Human thought is still in style<a href="https://labaneilers.com/thinking-never-goes-out-of-style#human-thought-is-still-in-style" class="hash-link" aria-label="Direct link to Human thought is still in style" title="Direct link to Human thought is still in style" translate="no">​</a></h2>
<p>It can be easy to think of code only in terms of its ostensible purpose- a way to describe processes to be executed by a machine- and forget that it's also an interface designed for humans to be able to <em>express</em> these processes and <em>reason</em> about them. Programs exist in a much broader context than LLMs can be aware of, and humans are still far better at working at that level of abstraction.</p>
<p>I feel like there's some value to me in retaining a connection to code. For several decades, marinating in data structures and algorithms gave my brain time to absorb and synthesize concepts, explore spaces- imaginary and real- and serendipitously stumble upon new perspectives.</p>
<p>It's possible I'll begin to see spec-writing (i.e. "programming in English") as the successor to programming in this way: an activity that forces deep reflection and forges clarity of thought. I'm open to this, and I intend to give it a real go.</p>
<p>But there's something about the <em>precision</em> of programming languages, versus natural languages, in particular, that has a tendency to activate different parts of my brain- which has been genuinely invaluable in helping me solve problems that other cognitive modes didn't- on their own. This isn't something I'm quite ready to give up.</p>
<p>Even if it sometimes feels like an anachronism.</p>]]></content:encoded>
            <category>AI</category>
            <category>programming</category>
        </item>
        <item>
            <title><![CDATA[Using AI to promote divergent thinking]]></title>
            <link>https://labaneilers.com/using-ai-to-promote-divergent-thinking</link>
            <guid>https://labaneilers.com/using-ai-to-promote-divergent-thinking</guid>
            <pubDate>Thu, 11 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[A possible explanation for why we observe such a discrepancy between the productivity gains from AI tools of seasoned senior software engineers versus their less-experienced counterparts:]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>A possible explanation for why we observe such a discrepancy between the productivity gains from AI tools of seasoned senior software engineers versus their less-experienced counterparts:</p><ul>
<li class="">Inexperienced problem solvers have a tendency to get stuck when using AI tools to solve complex problems because they bias towards <em>convergent thinking</em>: narrowing possibilities and looking for a specific solution which fits within their preconceptions.</li>
<li class="">Experienced problem solvers who have adopted AI tools tend to use it to augment a scientific process, which includes building mental models, formulating hypotheses, and performing experiments. In this process, they use AI to accelerate <em>divergent thinking</em>: exploration and discovery of possibilities.</li>
</ul></div></div>
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/heads-up-a4f16618be96999f5ff7d8582ecdfb9c.jpg" alt="A heads-up display"></figure>
<p>I've been thinking a lot lately about the significant discrepancy I've observed in how effective different people are at using AI for problem solving. I'm certainly not the only one; there seems to be a strong consensus building in my field (software engineering) that AI tools disproportionally augment the effectiveness of experienced people (e.g. those who developed their craft before AI tools became omnipresent), and that providing AI to people early in their career can actually slow them down- and impede their growth.</p>
<p>This certainly fits the observations I've made personally, but raises a lot of questions about what experienced people are doing so differently with AI. I've spent some time intermittently reflecting on my own use of AI, and also paying attention to how some of my younger colleagues are using it, and I think I have a working theory.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="behold-science">Behold: Science!<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#behold-science" class="hash-link" aria-label="Direct link to Behold: Science!" title="Direct link to Behold: Science!" translate="no">​</a></h2>
<p>Like many people who spend their careers solving technical problems, I employ a version of the scientific method:</p>
<ol>
<li class="">I start with my existing mental model of how a system works (based on previous observations). Like any scientific model, this is always varying degrees of incomplete/incorrect.</li>
<li class="">I use my model to formulate hypotheses about what changes to the system will result in a solution to the problem I'm working on.</li>
<li class="">I decide which of the hypotheses I should test first. This incorporates several factors:<!-- -->
<ul>
<li class="">the hypothesis's likelihood of being correct</li>
<li class="">how difficult and time consuming it will be to validate</li>
<li class="">what side-effects or tradeoffs come along for the ride</li>
</ul>
</li>
<li class="">I test the best hypothesis and use the findings to improve my mental model.</li>
<li class="">Problem solved... or... rinse and repeat with my refined model.</li>
</ol>
<p>This isn't just for problems of a certain granularity; it's fractal, and works at any level. I approach both multi-month and 5 minute problems this way (with varying degrees of rigor, depending on how well caffeinated I am). As I break down big problems into smaller ones, recursively, these tools (mental model, hypotheses, running tests) maintain their value.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-forking-road">The forking road<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#the-forking-road" class="hash-link" aria-label="Direct link to The forking road" title="Direct link to The forking road" translate="no">​</a></h2>
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/forking-road-09d57e4cb1a77d0924c21c5ebfcc2c52.jpg" alt="A forking road"></figure>
<p>I think of problem solving like traveling a road that branches into infinite forks. Each fork has different probabilities of success, different risks, and different tradeoffs. Some have hidden dangers, or they might be really exceptionally long and difficult to travel. On top of that, any of them could be dead ends.</p>
<p>Following any fork- even the dead ends- could reveal valuable information, which you can use to refine your mental model (which in this metaphor would be something like an aerial map). But it costs time and energy to take any one, so you need a methodology to make each decision as efficiently as possible.</p>
<p>If there's a better way than the scientific method, I'm not aware of it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-experienced-problem-solvers-do-differently">What experienced problem solvers do differently<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#what-experienced-problem-solvers-do-differently" class="hash-link" aria-label="Direct link to What experienced problem solvers do differently" title="Direct link to What experienced problem solvers do differently" translate="no">​</a></h2>
<p>Putting aside AI completely, the biggest advantages experienced problem solvers have in any given domain are:</p>
<ol>
<li class="">They have a more accurate, well-developed mental model of the problem space, and they invest time in continuously refining that model.</li>
<li class="">They use this model to formulate better hypotheses, prioritize based on chances of success, and estimate the effort required to test them.</li>
<li class="">They design and execute these tests efficiently- they focus on building understanding and reducing uncertainty with the minimum necessary time/effort.</li>
<li class="">They have a general awareness of when it would be more productive to double back instead of plowing forward. After invalidating a hypothesis, they're willing to retrace their steps past multiple previous forks in the road and use their newly refined model to reevaluate a previous decision.</li>
</ol>
<p>Novice engineers are pretty good at noticing how AI impacts #3 (executing tests on a specific hypothesis). In fact, I don't think they're usually thinking of what they're doing in terms of a scientific method with discrete parts; they're just instinctively taking guesses and charging ahead. The test result is often binary: "problem solved" or "problem not solved".</p>
<p>AI is so freaking powerful that this approach can be enough to solve a lot of kinds of problems. This is especially true if you've got an adequate mental model out of the gate, and you can craft a prompt with sufficient context. If you're dealing with a problem in a well documented space, there's a good chance that an AI can land you a hole-in-one on your first try.</p>
<p>But an inexperienced problem solver faced with a problem that falls slightly outside this ideal zone can easily become lost in our branching road of possibilities. Or they get stuck on a particular branch and keep trying different variations of a flawed approach.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="avoiding-forks-entirely">Avoiding forks entirely<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#avoiding-forks-entirely" class="hash-link" aria-label="Direct link to Avoiding forks entirely" title="Direct link to Avoiding forks entirely" translate="no">​</a></h3>
<p>Seasoned engineers often get a gut feel that they can't be the first person facing a particular problem they've encountered. And it's often true- depending on the question you're asking, the answer (or relevant data that can help you answer it) might be readily available on the internet.</p>
<p>As it turns out, AIs are exceptionally good at rapidly scouring the internet, ingesting huge swaths of text, analyzing and summarizing it. These types of questions can require orders of magnitude less time and effort to answer with AI than by a human doing the research.</p>
<p>Experienced engineers get so much more out of AI in this realm because they have the instinct to ask the question before starting any work- and AI reduces the cost of answering many of these questions dramatically.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="scouting-the-fork-ahead-more-quickly">Scouting the fork ahead more quickly<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#scouting-the-fork-ahead-more-quickly" class="hash-link" aria-label="Direct link to Scouting the fork ahead more quickly" title="Direct link to Scouting the fork ahead more quickly" translate="no">​</a></h3>
<p>When initial research leaves you with open questions, you're left with investment decisions: which experiments do you want to spend time on? For a software engineer, this often means writing enough working code to try out an idea- which can require hours/days of your scarce and valuable time. This can be a risky bet on a hypothesis- one that may bear no fruit.</p>
<p>With AI, you gain the remarkable ability to execute tests like this with superhuman speed. If you can articulate your experiment in a prompt (and usually guide it through some iterations), a decent AI agent can take care of the mechanics and gruntwork to generate the minimum viable code to answer a whole lot of questions- lickety-split. The fact that AIs tend to write code that's less than production quality is of little consequence when your goal isn't writing production code- it's answering questions.</p>
<p>The AI speed difference in this phase of the process isn't just incremental, it's transformative. An AI-assisted human can evaluate <em>exponentially</em> more hypotheses than a human alone. Experienced engineers are much more likely to utilize AIs in this way, and better equipped to interpret the results of each experiment.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="discovering-which-forks-exist">Discovering which forks exist<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#discovering-which-forks-exist" class="hash-link" aria-label="Direct link to Discovering which forks exist" title="Direct link to Discovering which forks exist" translate="no">​</a></h3>
<p>The two most common failure modes I observe when inexperienced software engineers get their hands on AI tools:</p>
<ul>
<li class="">they very rapidly execute fully formed solutions that carry very significant (and often unacceptable) tradeoffs they didn't recognize or understand.</li>
<li class="">they fail to solve the problem, and get stuck trying to find ways to solve it within the constraints they assume apply, but don't fully understand.</li>
</ul>
<p>Using our forking road analogy, they've gotten stuck after choosing a path without finding out <em>what other paths exist</em>.</p>
<p>In both cases, I would generally guide my younger colleagues through some <a href="https://en.wikipedia.org/wiki/Divergent_thinking" target="_blank" rel="noopener noreferrer" class="">divergent thinking</a>. Experienced problem solvers deliberately invest time in idea generation, exploration, research, and brainstorming- to expand the realm of possible solutions they can consider- before they begin converging on a specific solution.</p>
<p>Divergence is a core element of human creativity- and applies especially to difficult and novel problems which often require diversity of perspective and fresh approaches to solve.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="using-ai-to-promote-divergence">Using AI to promote divergence<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#using-ai-to-promote-divergence" class="hash-link" aria-label="Direct link to Using AI to promote divergence" title="Direct link to Using AI to promote divergence" translate="no">​</a></h2>
<p>The first step in divergent thinking usually involves expanding the scope of your existing mental model by surveying preexisting knowledge. For software engineers, it might mean reading up on some relevant computer science concepts, product documentation, or open-source code- and letting this open up possibilities beyond the constraints you previously perceived.</p>
<p>This is where I think AI is drastically underutilized by inexperienced users- for exploration, learning, and idea generation in a broad problem space. AI is <em>incredibly useful</em> when used in this way. In our forking road analogy, it's like having the ability to fly a drone to scout ahead so you can see all the forks available within a few hours' walk, eliminate the ones that obviously result in dead ends, and discover new ones that look very promising.</p>
<p>AI's current flaws, like hallucination and sycophancy, are a lot less problematic when you're using it to augment divergent thinking- since any idea generated in this phase of problem solving (including by humans) can be flawed. It's part of the game- you take early ideas with a grain of salt.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>We all get stuck in our own ideas sometimes</div><div class="admonitionContent_BuS1"><p>Persistence, I'm told, is one of my more admirable qualities... meaning I'm stubborn AF. I occasionally get stuck trying to make a particular idea work, only to fail repeatedly. Even with AI assistance, I'll sometimes find myself trying variations on the same approach until I get maximally frustrated.</p><p>When I'm my more enlightened self, I remember this is a great time to step back, open a new prompt with an empty context window, and treat the AI like a colleague who can give me a fresh perspective. I start asking questions that broaden the range of possible solutions.</p><p>I focus on prompts that specifically elicit divergence:</p><ul>
<li class="">I ask for multiple options and/or crazy ideas</li>
<li class="">I ask it to challenge my assumptions</li>
<li class="">I ask it questions that help build my general knowledge of the system or problem space</li>
<li class="">I ask it to look for examples of similar problems</li>
</ul><p>Even untenable solutions, or ones which may not apply in my current case, might spark an idea that leads to something promising.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="build-your-mental-model-then-go-nuts">Build your mental model, then go nuts<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#build-your-mental-model-then-go-nuts" class="hash-link" aria-label="Direct link to Build your mental model, then go nuts" title="Direct link to Build your mental model, then go nuts" translate="no">​</a></h3>
<p>Lately I've been putting deliberate effort into using AI in this way: to build my own understanding of problem spaces. I've been using AI agents as research assistants that I send off on missions to educate me on some topic, at some level of abstraction that I'm interested in. This is usually fairly broad and shallow to start, and from there I can explore the space interactively by peppering them with follow-up questions.</p>
<p>This is a process I used to do with Google searches, O'Reilly books, and a whole lot of reading in bed. It's a process of synthesis- sifting through large amounts of information to build a mental model, so there's a large amount of waste involved, by definition.</p>
<p>If it had been an option, I would have <em>much</em> rather spent an hour talking with an expert on a topic and grilling them with questions. Having to grind through hundreds of pages of source materials on my own, when 95% of the content wasn't relevant, or wasn't at the level of abstraction I was aiming for, wasn't an efficient way to learn.</p>
<p>Current LLMs are <em>incredibly</em> good at this mode of exploration, especially if you keep your skeptic hat on all the while. Applying a little Socratic method to any aspect of their findings that triggers my spidey-sense often helps them self-correct. It helps that I have years of experience engaging in this same process with humans- who, I have noted, are also sometimes wrong.</p>
<p>Having an "expert" give me a broad overview of a problem space also makes me a whole lot more efficient at doing my own manual research. Armed with concepts, vocabulary, and starter sources, I can get right in the weeds and begin reading primary sources on a topic I'm a complete beginner in.</p>
<p>Once my mental model is sufficiently filled out, I find I'm <em>way</em> more effective at the subsequent phases of problem solving. I've developed some vocabulary around key concepts and their relationships, and my intuitions allow me to formulate much better prompts to guide AIs through code generation. I'm also much better at recognizing subsequent hallucinations or other bad ideas.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ais-dont-grant-problem-solving-skills">AIs don't grant problem solving skills<a href="https://labaneilers.com/using-ai-to-promote-divergent-thinking#ais-dont-grant-problem-solving-skills" class="hash-link" aria-label="Direct link to AIs don't grant problem solving skills" title="Direct link to AIs don't grant problem solving skills" translate="no">​</a></h2>
<p>Many non-engineers hold the misconception that software engineers spend most of their time and effort writing code: the kind of code that LLMs are getting increasingly good at writing every day. It is true that we spend a good amount of time staring at code on a screen, but that obscures the truth that our time is <em>absolutely dominated</em> by more general problem solving: designing, debugging, troubleshooting, and analyzing systems.</p>
<p>Engineers who understand how to problem solve using the scientific method, who value mental models, and who practice deliberate divergent thinking, have always outperformed those who don't. Augmenting them with powerful AI tools is going to drastically accentuate this discrepancy.</p>]]></content:encoded>
            <category>AI</category>
            <category>programming</category>
        </item>
        <item>
            <title><![CDATA[The benefits of befriending a birder]]></title>
            <link>https://labaneilers.com/the-benefits-of-befriending-a-birder</link>
            <guid>https://labaneilers.com/the-benefits-of-befriending-a-birder</guid>
            <pubDate>Fri, 22 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[My daughter became a birder, and it's had some surprisingly positive effects on my life.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>My daughter became a birder, and it's had some surprisingly positive effects on my life.</p></div></div>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/woodpecker-8d112aae8aed58250d395219a7d1a426.jpg" alt="Woodpecker"><figcaption>Hairy Woodpecker</figcaption><figcaption class="attribution">© Sadie Despres, 2025</figcaption></figure>
<p>A couple years ago, my daughter, Sadie, began a gradual transformation that eventually led to her being able to honestly describe herself as "bird obsessed". Everywhere we go, I notice her staring up at trees, rooftops, and sometimes vaguely up and into the distance. When she was younger, she used to get lost in daydreams, and I thought this was just a remnant of that, but I've come to understand this is actually a product of intense focus and fascination- with our feathered neighbors.</p>
<p>I'm not sure exactly where this all started, but it was most likely rooted in her interest in photography. I'm told this is common: photographers become birders, and birders become photographers- it's a pretty natural progression.</p>
<!-- -->
<p>We took a trip to New York about a year ago, and my wife and I were a bit surprised (and a little disturbed) that at least three quarters of Sadie's photos were of pigeons. Like most city dwellers, I tend to think of pigeons as dirty sky vermin, but to her, they were unendingly curious and delightful. They apparently have intricate social lives, with backstories, drama, conflict, family ties, and complex rituals. She told us about how these feral birds are actually the descendants of escaped domestic pigeons, who themselves were originally bred from wild "rock doves", who are cliff-dwelling foragers.</p>
<p>Last Thanksgiving, she got to know one of our family friends: an octogenarian, former biology professor and amateur wildlife photographer who lives in South Florida. He invited her to join a text chain with himself and some of his friends, where he shares a recent photo every day. As it turns out, she finds this absolutely splendid, and despite being at least sixty years younger than any of the other participants, enjoys both the photos (which are admittedly quite beautiful) and the banter (observing the way a group of eighty-something former professors struggle with SMS can be entertaining in itself).</p>
<p>While her trajectory was probably already predetermined, seeing these pictures definitely accelerated things. It also helped that we gave her some money for her birthday towards a new camera, which she supplemented from her own earnings as a camp counselor to buy a pretty nice entry-level Canon and lens... which she now carries <em>everywhere</em>.</p>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/pelican-e76e92314b46480ffca9675febc4c24b.jpg" alt="Pelican"><figcaption>Brown Pelican</figcaption><figcaption class="attribution">© Sadie Despres, 2025</figcaption></figure>
<p>Fast forward to today: Sadie has become <em>shockingly</em> knowledgeable about birds. She recognizes virtually every bird in our local area by their calls, and can rattle off facts about Cedar Waxwings, Chimney Swifts, Carolina Wrens, and a hundred other birds I would have never known about, absent our dinner conversations.</p>
<p>As Sadie's dad, I'm certainly thrilled that she's found something she's so excited about, especially given that it gets her outside and away from her phone and social media. There are <em>certainly</em> worse things a teenager could be into.</p>
<p>But what's been most surprising is how much her interest in birds has impacted my life as well.</p>
<p>As a person who types for a living, I have a tendency to forget to exercise. Having a high-energy dog has helped a bit, since it forces me to go for a walk at least a couple times a day. I do enjoy walks, and occasionally go on a longer hike. It wasn't until birding that Sadie started to get interested in going on more of them with me.</p>
<p>Of course, I love just having an excuse to spend some time with her. Teenage girls don't exactly have a reputation for enjoying time with their dads, so I'm savoring time I know is going to slip away fast. But beyond just that, I've noticed some changes in the way I think about my relationship with the outdoors.</p>
<p>When I hike, I tend to get lost in my own head- often working through whatever programming or management challenges I'm facing at work. This has generally been useful to me- an afternoon dog walk usually helps me to step back and rethink whatever I'm working on, and I usually end up finding better approaches than I would have if I'd just kept plowing through.</p>
<p>But since I've been doing birding-oriented walks with Sadie, I've found my attention focused in new ways. I'm seeing more of what's happening all around me: absorbing the movement of the trees, the wind, or little flutters in the brush. My full brainpower is engaged processing data from my peripheral vision. When we hear a bird call, we fall silent, trying to pinpoint the source of the sound, and stand quietly for a few moments.</p>
<p>All this use of my attention feels a lot like what I've tried (and failed) to achieve in the past through meditation. I always had trouble maintaining the patience and persistence that it takes to interrupt my thought patterns- where I get stuck repeatedly reliving the past and anticipating the future.</p>
<p>It's been much easier to achieve this kind of mindful focus when we go birding together- especially when she's radiating excitement about a new bird. I can't help but absorb some of her enthusiasm, and it helps to sustain me and keep me present.</p>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/grouse-172e1c16801988feb37671bf09d595e4.jpg" alt="Grouse"><figcaption>Sooty Grouse</figcaption><figcaption class="attribution">© Sadie Despres, 2025</figcaption></figure>
<p>What's more: I find that this attention to my surroundings is sticking- even when she's not with me. Previously, odds are I'd be visualizing data structures, replaying the latest difficult conversation I've had with a coworker, or being angry about a government causing needless suffering.</p>
<p>Instead, I'm directing more of my attention to ferns, wildflowers, rock formations, or the flow of water in the creek that borders one of my favorite trails. I'm noticing the sensation of the wind on my skin, or the feeling of endorphins from the exercise.</p>
<p>I recently started having these little moments of spontaneous gratitude- being momentarily overwhelmed by how much beauty there is to be found in my little pocket of Eastern Massachusetts.</p>
<p>I have friends and loved ones who experience real, chronic anxiety and depression, and I recognize how lucky I am that my head isn't generally an unpleasant place to be. I do often enjoy getting lost in an engineering problem while I walk or drive, and thinking through recent social challenges often allows me to temper my initial emotional reactions- which I think makes me a better colleague, friend, dad, and husband. Even some of the time I spend worrying about politics is useful as motivation to get me off my ass and do something.</p>
<p>But I have had trouble, especially over the last few years, worrying a lot about things I can't control- in ways that aren't helpful to me, or anyone else- ways that are more paralyzing than motivating. It's given me a bit more understanding of what debilitating anxiety might be like.</p>
<p>Birding with Sadie has, surprisingly, given me a new perspective on how I can choose to use my attention. I still spend some of my walks in my imaginary idea world, but I'm finding more and more often, I can take a few deep breaths, reset, and listen for bird calls.</p>]]></content:encoded>
            <category>storytime</category>
        </item>
        <item>
            <title><![CDATA[So you want to be an engineering manager]]></title>
            <link>https://labaneilers.com/so-you-want-to-be-an-engineering-manager</link>
            <guid>https://labaneilers.com/so-you-want-to-be-an-engineering-manager</guid>
            <pubDate>Mon, 23 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[This is a summary of hundreds of conversations I've had with prospective engineering managers about what life looks like as a manager- the purpose and responsibilities of the role, but more importantly- what you actually do every day.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>This is a summary of hundreds of conversations I've had with prospective engineering managers about what life looks like as a manager- the purpose and responsibilities of the role, but more importantly- what you actually <em>do</em> every day.</p></div></div>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/climber-14e7c6a193ac0bc52e169bb063e7f2c2.jpg" alt="Climbing that thing"></figure>
<p>I chat with a lot of younger engineers about their future, career goals, and what kind of work makes them happy. By far, the most frequent conversation I have is about engineering management- what it involves, how to get into it, and whether it's a good fit for them.</p>
<p>I've also worked with people who've been thrown into management without any real guidance or preparation. This can be a really hard situation, because you absolutely need to be able to ask for help, but it can be terrifying as a new manager to admit that you're struggling.</p>
<p>I thought I'd spend some time summarizing the conversations I often have with prospective managers of what I think life as an effective engineering manager looks like. I'll start with the high level purpose, and then we'll get into the details of what day-to-day activities look like.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-whole-point-is-business-outcomes">The whole point is business outcomes<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#the-whole-point-is-business-outcomes" class="hash-link" aria-label="Direct link to The whole point is business outcomes" title="Direct link to The whole point is business outcomes" translate="no">​</a></h2>
<p>Let's strip away a bunch of complexity and make this really simple. There's one fundamental purpose of an engineering manager: to <strong>deliver business outcomes</strong>.</p>
<p>Wait, just that <em>one</em> thing? Yeah, though I understand your confusion. It's easy to get distracted by all the things that managers have to do to make this happen:</p>
<ul>
<li class="">recruiting and building teams</li>
<li class="">coordinating priorities and roadmaps</li>
<li class="">nurturing culture</li>
<li class="">owning technology</li>
<li class="">managing stakeholders</li>
<li class="">growing and mentoring team members</li>
<li class="">...and a million other things</li>
</ul>
<p>Yes, these are examples of activities that a manager generally has to do to drive business outcomes, but don't conflate the <em>mechanism</em> with the <em>underlying purpose</em> of their role.</p>
<p>Managers are given a lot of power over people, processes, budgets, and other precious resources, and in exchange, they're <em>accountable for what their teams achieve</em>. The management hierarchy depends on this accountability to carry out its strategy across the organization as a whole, and thus managers are essential as points of leverage for senior leadership.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="some-cautionary-tales">Some cautionary tales<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#some-cautionary-tales" class="hash-link" aria-label="Direct link to Some cautionary tales" title="Direct link to Some cautionary tales" translate="no">​</a></h3>
<p>The Machiavellian vibe of this point sometimes elicits doubt, so let me put this in even more stark terms:</p>
<p><em>An engineering manager whose team consistently delivers successful business outcomes <strong>can get away with a lot</strong></em>.</p>
<p>Here are some examples, from my personal experience, that illustrate the extremes of this dynamic:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="that-guys-a-jerk-but-he-builds-a-mean-product">That guy's a jerk, but he builds a mean product<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#that-guys-a-jerk-but-he-builds-a-mean-product" class="hash-link" aria-label="Direct link to That guy's a jerk, but he builds a mean product" title="Direct link to That guy's a jerk, but he builds a mean product" translate="no">​</a></h4>
<p>I once worked (adjacently) with a manager who I personally (quietly) believed to be an asshole, but who had been successfully running a team that was doing some admittedly great product innovation. He was a darling of the executive team, and was given extraordinary leeway to run his team in some quirky and unconventional ways.</p>
<p>I later found out one of the "quirky and unconventional" things he was doing was emotionally abusing the shit out of his team. This had apparently been going on for a while, and while he did eventually get fired, it was only after multiple, <em>excellent</em> engineers left the company (after having the courage to lodge complaints with HR, even in the face of explicit threats of retaliation). Even after all this, he was only actually held accountable once all this started to <em>impact his team's delivery</em>.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-healthy-team-isnt-the-only-goal">A "healthy" team isn't the only goal<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#a-healthy-team-isnt-the-only-goal" class="hash-link" aria-label="Direct link to A &quot;healthy&quot; team isn't the only goal" title="Direct link to A &quot;healthy&quot; team isn't the only goal" translate="no">​</a></h4>
<p>Around the same time, I worked with another manager who was very bright and competent, a deeply caring, empathetic leader, and all-around lovely person. She was very experienced in building teams with a great, positive, supportive, collaborative culture of engineering excellence. They had many of the hallmarks of a high-performing team: motivated, engaged engineers who felt empowered to do their best work, and who really enjoyed working together.</p>
<p>But despite her talent for nurturing teams' health and culture, she wasn't particularly connected to our customers' needs, or the technical details of the systems her team was building. This misalignment of focus built up over time and led to some underwhelming results. She eventually lost the trust of her stakeholders, and was managed out as her team got re-orged.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="business-isnt-about-harmony">Business isn't about harmony<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#business-isnt-about-harmony" class="hash-link" aria-label="Direct link to Business isn't about harmony" title="Direct link to Business isn't about harmony" translate="no">​</a></h4>
<p>If you're thinking both these examples are really fucked up- yeah. While I wish businesses were motivated primarily by creating a great working environment, nurturing a culture of respect, creativity, and collaboration- they're not. They're motivated by economic success- and <em>everything else they do is in service to that</em>.</p>
<p>Luckily for us, thought leaders in our industry have discovered that, at least in knowledge work, value delivery strongly correlates with great cultures and empowered, motivated teams. Businesses that build great cultures do so with the belief that it makes them more competitive- and they're often right.</p>
<p>It's really critical as a manager to recognize the nature of this dynamic. Successful managers use the framing of business outcomes to make decisions, articulate their strategy, and execute against their plans.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ok-so-what-does-a-manager-actually-do">OK, so what does a manager actually <em>do</em>?<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#ok-so-what-does-a-manager-actually-do" class="hash-link" aria-label="Direct link to ok-so-what-does-a-manager-actually-do" title="Direct link to ok-so-what-does-a-manager-actually-do" translate="no">​</a></h2>
<p>So now that we understand the primacy of business outcomes, let's do a quick, 100-level overview of the mechanisms and activities that an engineering manager generally has to do to deliver them.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="strategy-and-alignment">Strategy and alignment<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#strategy-and-alignment" class="hash-link" aria-label="Direct link to Strategy and alignment" title="Direct link to Strategy and alignment" translate="no">​</a></h3>
<p>Successful engineering managers act as a bridge between their teams and senior leadership, translating the organization's strategy into tactical, incremental goals for their team. This happens through a few different channels:</p>
<ul>
<li class="">Cascading high-level strategy through the management hierarchy</li>
<li class="">Helping the team formulate a roadmap that aligns with the strategy</li>
<li class="">Facilitating alignment with product management and other business stakeholders</li>
<li class="">Facilitating alignment with other engineering and supporting teams (UX, InfoSec, etc)</li>
</ul>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="managing-up">Managing up<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#managing-up" class="hash-link" aria-label="Direct link to Managing up" title="Direct link to Managing up" translate="no">​</a></h4>
<p>This isn't a one-way flow of information. Effective managers also influence the decisions of their senior leadership, usually because they have visibility into details of the work on the ground- constraints, opportunities, trade-offs- that senior leadership doesn't.</p>
<p>Being able to effectively identify and articulate these constraints in terms that the business understands is often what distinguishes successful managers from the rest. And <em>really</em> effective managers go beyond surfacing impediments- they show up with <em>solutions</em>, and they're able to sell them to senior leadership and other stakeholders.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="facilitating-alignment-with-external-teams">Facilitating alignment with external teams<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#facilitating-alignment-with-external-teams" class="hash-link" aria-label="Direct link to Facilitating alignment with external teams" title="Direct link to Facilitating alignment with external teams" translate="no">​</a></h4>
<p>External teams are likely to have different reporting lines, priorities, processes, and cultures than the team(s) you manage. You don't have any direct power over them- but you depend on each other to get anything done. How do you manage this?</p>
<p>Successful managers build and nurture trusting, positive relationships outside of the team: with senior management, stakeholders, other managers within Engineering and across other functions. There's no trick to this: it requires a lot of empathy, integrity, follow-through, and attention to detail- consistently- over a long period of time. The reputation you build is a <em>huge</em> factor in your ability to succeed.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>alignment is your job</div><div class="admonitionContent_BuS1"><p>I can tell you one thing: your CEO does not want to hear that you couldn't deliver an outcome because another team isn't cooperating with you. <em>It's your job to figure that shit out</em>.</p><ul>
<li class="">Do you disagree with a product owner about the value of a feature? <em>Work that shit out</em>.</li>
<li class="">Does your team's tech lead disagree with another team's on an API integration? <em>Work that shit out</em>.</li>
<li class="">Does the VP of marketing want you to take on unsustainable tech debt to hit a PR deadline? <em>Work that shit out</em>.</li>
</ul><p>Horse trading, politics, favors, and even escalation- these are all tools available to you. Making sure the alignment happens is your job.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="team-building">Team building<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#team-building" class="hash-link" aria-label="Direct link to Team building" title="Direct link to Team building" translate="no">​</a></h3>
<p>Counter to the instincts of many new managers, building a high-performing team isn't about exerting control over people; rather, it's an <strong>exercise in system design</strong>. As a manager, your job isn't making all decisions and handing them off for execution (a.k.a. "micromanagement" or "command and control"). Your job is to <em>make sure good decisions get made</em>.</p>
<p>You do this by defining processes, culture, policies, and other organizational structures that come together, along with the people you hire, to form a self-sustaining system. You need to continuously optimize this so that team members feel connected to the mission, understand their freedoms and constraints, and are empowered to make decisions.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>A good team can survive a vacation</div><div class="admonitionContent_BuS1"><p>A good trick for considering how self-sustaining a team should be: how long could the manager of a team be gone on vacation before the team would start to struggle?</p><p>A great manager will have established a solid, self-sustaining culture, mature practices, and healthy relationships with stakeholders- that don't depend directly on the manager. A team like this should be able to function for weeks or even months without their manager. You'd expect that beyond that, without someone explicitly nurturing the system, entropy might begin to set in and delivery would begin to deteriorate.</p></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="driving-continuous-improvement-through-visibility-and-accountability">Driving continuous improvement through visibility and accountability<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#driving-continuous-improvement-through-visibility-and-accountability" class="hash-link" aria-label="Direct link to Driving continuous improvement through visibility and accountability" title="Direct link to Driving continuous improvement through visibility and accountability" translate="no">​</a></h4>
<p>Because there's no "one size fits all" formula for building a system as complex as an engineering team, managers need to guide the team through a process of <strong>continuous improvement</strong>. As a manager, you'd help to set the wheels in motion- establishing the processes and culture to enable the team to repeatedly self-reflect on their work, identify areas for improvement, and implement them.</p>
<p>As the steward of this process, you need to establish two important controls:</p>
<ul>
<li class=""><em>visibility</em> into your team's operations</li>
<li class=""><em>accountability</em> for the team's outcomes, and individuals' contributions to these outcomes</li>
</ul>
<p>Some examples of visibility mechanisms:</p>
<ul>
<li class="">attending team ceremonies (e.g. stand-ups, planning, retrospectives, operations reviews, etc)</li>
<li class="">holding regular 1:1s with team members</li>
<li class="">meeting with stakeholders to get feedback on the team</li>
<li class="">soliciting direct feedback or through formal performance reviews</li>
<li class="">tracking work through management systems (tickets, kanban boards, CI/CD systems, etc), including metrics derived from them (e.g. cycle time, deployment frequency, change failure rate, etc)</li>
</ul>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>psychological safety</div><div class="admonitionContent_BuS1"><p>In order for these visibility mechanisms to be effective, you'll need to work to establish <em>psychological safety</em>; team members need to feel safe (and encouraged) to speak up, share their ideas, and give honest critical feedback to others (including their manager).</p><p>This is <em>much</em> harder than you think. Giving critical feedback isn't easy to do in most cultures, and by default, people will tend to let their disappointment fester into grudges rather than take the social risk of speaking up.</p><p>Luckily, there's a lot of great research on how to build psychological safety for teams. A great resource is <a href="https://www.hbs.edu/faculty/Pages/item.aspx?num=54851" target="_blank" rel="noopener noreferrer" class="">The Fearless Organization</a> by Amy Edmondson, or this shorter <a href="https://hbr.org/2025/05/what-people-get-wrong-about-psychological-safety" target="_blank" rel="noopener noreferrer" class="">Harvard Business Review article</a> that discusses misconceptions.</p></div></div>
<p>While accountability mechanisms are all ultimately derived from the manager's ability to promote/fire, in practice, you utilize this power very, very rarely. 99.9% of the time, accountability is driven through:</p>
<ul>
<li class="">Providing feedback to team members (e.g. praise, encouragement, guidance, or critical feedback)</li>
<li class="">Clarifying expectations by updating processes, policies, etc.</li>
<li class="">Mediating when misalignment/conflict arises (within the team or with external stakeholders)</li>
<li class="">Direct involvement/interventions in various team decisions</li>
</ul>
<p>In cases of persistent underperformance, you have an unfortunate, but critical, responsibility: ramp up the directness of feedback, engage HR, formulate performance improvement plans, and occasionally manage someone out.</p>
<p>On the other hand, if you're lucky, you'll have people on the team who are growing quickly, having a big impact, and just overall killing it. In most companies, you'll need to do some work to compile documentation recommending a promotion. The process can be a pain, but man, it is super rewarding to be able to tell someone you manage that they've earned a promotion.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="advocacy-and-removing-blockers">Advocacy and removing blockers<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#advocacy-and-removing-blockers" class="hash-link" aria-label="Direct link to Advocacy and removing blockers" title="Direct link to Advocacy and removing blockers" translate="no">​</a></h4>
<p>As a manager, you're in a unique position to help your teams identify impediments and obstacles- things that are slowing them down or preventing them from finding the best solutions to problems- and help knock them down. Does your team need some new SaaS tool? Is UX constantly late delivering mockups? Are you blocked waiting on InfoSec to complete a security review? It's your job to make sure these obstacles get removed.</p>
<p>Note that I said that it's your job to <em>make sure</em> these obstacles get removed, not that you have to do it yourself. A strong team will feel empowered to reach outside their normal boundaries and work to address external blockers themselves. A really successful team will build their own cross-organizational relationships, reputations, and influence- expanding their ability to solve problems autonomously.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>the galaxy brain manager</div><div class="admonitionContent_BuS1"><p>Additionally, really great managers will also recognize that certain patterns of blockers are <em>systemic</em>, and use their influence to address these through broader change, perhaps across multiple teams, or even at the organizational level.</p></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="recruiting">Recruiting<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#recruiting" class="hash-link" aria-label="Direct link to Recruiting" title="Direct link to Recruiting" translate="no">​</a></h4>
<p>Recruiting is one of the most impactful things a manager does, because the choice of people to add to the team will have huge and lasting effects, often well beyond the manager's tenure.</p>
<p>I don't think I'm the only person who <em>absolutely hates the process of recruiting</em>. It's often grinding, tedious, and stressful. Making the decision to hire someone after one interview can feel a lot like getting married to someone you just met at speed dating. And if your instincts tell you to pass on someone who you're on the fence about, you've just committed yourself to many additional hours of reviewing resumes, interviewing, and debrief sessions.</p>
<p>Hiring the wrong person can cause enormous, long-term problems for you and your team. It can be very, very hard to manage someone out- not just in terms of time, but in disruption to you and your team, not to mention your own emotional energy. Having to fire someone <em>absolutely sucks</em>.</p>
<p>But damn, when you hire the right people, it can be utterly game changing for you and your team.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="career-management">Career management<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#career-management" class="hash-link" aria-label="Direct link to Career management" title="Direct link to Career management" translate="no">​</a></h4>
<p>Finally, as a manager, you're generally expected to help coach your team members, helping them grow their skills and progress their careers. This is a little different than the type of mentoring that happens in the context of the day-to-day work; it's usually framed in terms of longer term goals. While this depends on their personal goals, it's most often framed in terms of helping them grow towards a promotion.</p>
<p>This is usually an exercise in assisted self-reflection, goal setting, and articulating the skills and behaviors that objectively define the next level of seniority.</p>
<p>It also might entail advocating for someone externally, helping to open doors for them, encouraging others to take chances on them, and putting your own reputation up as collateral.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Managers aren't therapists, except sometimes</div><div class="admonitionContent_BuS1"><p>It's surprisingly common for managers to describe 1/1 meetings as feeling like they're a therapist for their team members. There is, admittedly, some overlap. Sometimes employees need to vent, to have you talk them through a difficult situation, or just to know that you give a shit about their well-being.</p><p>Figuring out the 1/1 relationships with your team members is a big part of the job- setting appropriate boundaries, figuring out how much emotional energy you can afford with different people, and trying to reconcile your personal empathy and compassion with your responsibilities as a leader.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="no-seriously-what-does-a-manager-do-all-day">No seriously, what does a manager do all day?<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#no-seriously-what-does-a-manager-do-all-day" class="hash-link" aria-label="Direct link to No seriously, what does a manager do all day?" title="Direct link to No seriously, what does a manager do all day?" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="fine-its-meetings-always-meetings-so-many-meetings">Fine, it's meetings. Always meetings. So many meetings.<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#fine-its-meetings-always-meetings-so-many-meetings" class="hash-link" aria-label="Direct link to Fine, it's meetings. Always meetings. So many meetings." title="Direct link to Fine, it's meetings. Always meetings. So many meetings." translate="no">​</a></h3>
<p>What does this mean in terms of day-to-day work? I'll be honest- it means a <strong>metric shit ton of meetings</strong>:</p>
<ul>
<li class="">ceremonies: stand-ups, planning, retrospectives, scrum-of-scrums</li>
<li class="">1/1s: career coaching, status check-ins</li>
<li class="">brainstorming/design meetings, working sessions</li>
<li class="">performance management meetings</li>
<li class="">recruiting: interviews, debriefs</li>
<li class="">Ad hoc conversations</li>
<li class="">...and a head-exploding amount of other meetings</li>
</ul>
<p>To a new manager, especially one who was recently a high-performing individual contributor, this can seem wasteful- since meetings don't solve actual customer problems. But as organizations grow, managers are increasingly necessary as conduits of information; they maintain alignment and combat entropy. They make sure that teams aren't just working effectively, but that they're working on <em>the right things</em>.</p>
<p>You absolutely need to be careful about how you use meetings- they're a very expensive way to use people's time. But if you're using them carefully, preparing for them in advance, running them thoughtfully and efficiently, and you're actually achieving alignment and clarity, then it's not a waste of time.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="also-writing-too">Also writing, too<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#also-writing-too" class="hash-link" aria-label="Direct link to Also writing, too" title="Direct link to Also writing, too" translate="no">​</a></h3>
<p>You'll likely also spend a lot of time creating and consuming written communication and documentation. Some organizations have stronger cultures of knowledge management than others, but at the very least you'll be authoring/editing work management artifacts (e.g. tickets, designs), writing up meeting agendas/notes, sending out status updates, and juggling a whole lot of ad hoc conversations (e.g. in Slack, etc).</p>
<p>Writing is certainly important for all of these cases, but it's also really important as a leader to put additional effort into using writing as a tool to refine and clarify your thinking, especially when your ideas will carry disproportionate weight in the organization.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="staying-connected-to-the-practice">Staying connected to the practice<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#staying-connected-to-the-practice" class="hash-link" aria-label="Direct link to Staying connected to the practice" title="Direct link to Staying connected to the practice" translate="no">​</a></h3>
<p>Despite the pressures to constantly keep everyone aligned and make sure outcomes are being achieved, you need to carve out time to keep yourself connected to the actual practice of engineering: customer needs, your teams' backlogs, your tech portfolio, industry trends, best practices, etc. This may take the form of doing some side projects, reviewing code, or even just listening to a ton of podcasts.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>be careful about being both player and coach</div><div class="admonitionContent_BuS1"><p>You may be tempted to jump in with your team and write code alongside them, but you need to be careful about the dynamics of this. Between the power imbalance making reviews weird, and scheduling conflicts between your primary role and the requirements of taking operational responsibility for one's own code, this can be a bit fraught.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="i-chose-to-optimize-my-career-for-happiness">I chose to optimize my career for happiness<a href="https://labaneilers.com/so-you-want-to-be-an-engineering-manager#i-chose-to-optimize-my-career-for-happiness" class="hash-link" aria-label="Direct link to I chose to optimize my career for happiness" title="Direct link to I chose to optimize my career for happiness" translate="no">​</a></h2>
<p>If you've read other stuff I've written, you may know that I left management a few years back, and went back to being an individual contributor. This was actually a pretty hard decision for me, but one I'm really glad I made in retrospect.</p>
<p>Life as a manager is characterized by several things I actively sought to minimize as an IC: meetings, interruptions, and multitasking. As an IC, you generally seek to keep your work-in-progress to a minimum, and focus on getting a series of small, incremental things completely done. I never found a way to do this as a manager, since you have to be fairly reactive and attentive to the needs of everyone else around you. You're also responsible for handling lots of requests generated by external teams that don't track (or care) how much WIP they're generating for you.</p>
<p>I found managing much more draining from a social/emotional perspective, which was difficult given my tendency towards introversion. After a few years, I found myself having a much harder time getting out of bed in the morning- in a way I never did when I was writing software every day.</p>
<p>But that's me- some people absolutely thrive in this kind of dynamic, unpredictable environment, and find life in front of a text editor to be much more tedious.</p>
<p>If you're an IC faced with the opportunity to move into management, and you're not sure what to do, I'd encourage you to not only try it, but really dedicate yourself to some deliberate study of the craft for a while. Even if you decide it's not for you, the experience will make you a much more effective IC.</p>]]></content:encoded>
            <category>leadership</category>
            <category>engineering-management</category>
        </item>
        <item>
            <title><![CDATA[Using AI to create a Kubernetes controller in a hurry]]></title>
            <link>https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry</link>
            <guid>https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry</guid>
            <pubDate>Tue, 27 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[I've been looking for an excuse to do something more deliberate and ambitious with generative AI developer tools, so I created a Kubernetes controller which discovers Kubernetes-managed AWS load balancers, scrapes their CloudWatch metrics, and exposes them as Prometheus metrics (which have an infinitely better developer experience).]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>I've been looking for an excuse to do something more deliberate and ambitious with generative AI developer tools, so I created a Kubernetes controller which discovers Kubernetes-managed AWS load balancers, scrapes their CloudWatch metrics, and exposes them as Prometheus metrics (which have an infinitely better developer experience).</p><p>I'll share what I learned: the magnitude of the productivity boost, how effective it was at teaching me, some strategies I landed on, limits to the agentic tooling I used, and some surprising gaps in the models' abilities.</p></div></div>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/babys-first-k8s-controller-7fedf3a26e9ce9e4b310610bf0b8058e.jpg" alt="Baby's first Kubernetes controller"><figcaption>Baby's first Kubernetes controller</figcaption></figure>
<p>Since I've been primarily working in platform and infrastructure recently, a lot of the "code" I've been writing is domain-specific configuration languages (e.g. Terraform, Helm charts, OTel collector config, Kyverno policies, etc). Despite so many glowing testimonies of massive wins with generative AI dev tools, the results from my first few attempts to use them were a bit underwhelming. My guess is that config code usually doesn't get open sourced, so the models just don't get a lot of good examples to train on.</p>
<p>This has left me hankering to do something more ambitious, and probably in a general purpose language, where I can get a better sense of what the tools can do (beyond fairly mundane auto-completion).</p>
<p>Then recently, I had a few days where a lot of my colleagues were off on vacation (school vacation in Massachusetts), and it occurred to me I'd been quietly stewing on a problem that might be a great candidate for my own little hackathon.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="aggravation-is-the-best-inspiration">Aggravation is the best inspiration<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#aggravation-is-the-best-inspiration" class="hash-link" aria-label="Direct link to Aggravation is the best inspiration" title="Direct link to Aggravation is the best inspiration" translate="no">​</a></h2>
<p>The problem I'd been pondering:</p>
<p>Load balancer metrics are incredibly valuable, and CloudWatch- the only way to get them in AWS- is a spectacular pain in the ass to use. As a result, load balancer metrics have been seriously underutilized. I wanted to find a way to improve the developer experience for load balancer metrics, and make it possible to scaffold out great Grafana dashboard panels and alert rules for them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-are-load-balancer-metrics-so-important">Why are load balancer metrics so important?<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#why-are-load-balancer-metrics-so-important" class="hash-link" aria-label="Direct link to Why are load balancer metrics so important?" title="Direct link to Why are load balancer metrics so important?" translate="no">​</a></h2>
<p>Our applications are already instrumented with OpenTelemetry for traces and metrics, and are emitting structured logs. We've already got visibility into request counts, latency distributions, error rates, as well as all kinds of other telemetry for their application-specific operations. With all this rich data, why would we also need metrics from our load balancers?</p>
<p>Regardless of what your own application self-reports, the thing that ultimately matters most is <em>how your service is perceived by your users</em> (e.g. actual customers, or other services calling your API as a client). Load balancers are the actual interface users interact with, and there are a number of cases where, if you were to depend only on self-reported telemetry from pods, you'd be profoundly misled about your service's reliability.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="you-dont-mind-a-few-thousand-502s-on-every-deployment-do-you">You don't mind a few thousand 502s on every deployment, do you?<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#you-dont-mind-a-few-thousand-502s-on-every-deployment-do-you" class="hash-link" aria-label="Direct link to You don't mind a few thousand 502s on every deployment, do you?" title="Direct link to You don't mind a few thousand 502s on every deployment, do you?" translate="no">​</a></h3>
<p>For example: We use the <a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/" target="_blank" rel="noopener noreferrer" class="">aws-load-balancer-controller</a> to create AWS load balancers from Kubernetes Ingresses and Services. There's a not-so-well-documented footgun with this controller, in which a pod will continue to receive requests from the load balancer for up to 20 seconds after it receives a <code>SIGTERM</code> signal (which Kubernetes uses to tell it to shut down). This is because, after the aws-load-balancer-controller notifies the target group that a pod is terminating, the load balancer's target groups take a few seconds to update, and during this time they continue sending traffic to the pod's IP- which no longer exists. This results in 502 or 504 responses to the client.</p>
<p>Whomp whomp.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>I snoozed, lost</div><div class="admonitionContent_BuS1"><p>I had been meaning to write a post about this, but got beaten to the punch by the folks at <a href="https://glasskube.dev/" target="_blank" rel="noopener noreferrer" class="">Glasskube</a>, who wrote an <a href="https://glasskube.dev/blog/kubernetes-zero-downtime-deployments-aws-eks/" target="_blank" rel="noopener noreferrer" class="">excellent article about this very problem</a>, and ended up with a solution virtually identical to ours.</p></div></div>
<p>Note that in this case, <em>there's no pod to report an error</em>- the only evidence of the problem is the load balancer's own metrics- which is how we discovered this problem shortly after we started using the aws-load-balancer-controller a few years back.</p>
<p>There's plenty of other edge cases with Kubernetes and load balancers where requests will fail at the load balancer, and never make it to a pod- and it's really important to have visibility into these.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cloudwatch-metrics-are-so-freaking-hard-to-use">CloudWatch metrics are so freaking hard to use<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#cloudwatch-metrics-are-so-freaking-hard-to-use" class="hash-link" aria-label="Direct link to CloudWatch metrics are so freaking hard to use" title="Direct link to CloudWatch metrics are so freaking hard to use" translate="no">​</a></h2>
<p>AWS load balancer metrics are available through CloudWatch (either through AWS's console or APIs), and we use Grafana for our dashboards and alerts, so we've been using Grafana's CloudWatch datasource to pull them in. This has been a generally frustrating developer experience.</p>
<p>Probably the biggest single problem is that CloudWatch queries for load balancer metrics require you to supply a "Dimension" parameter called <code>LoadBalancer</code>, which is a portion of the load balancer's ARN (e.g. <code>app/scrumulator/2ab15bf8abef2f4c</code>). This contains an ID that's randomly generated by AWS- it's not something you can set. So when you create an ingress using the aws-load-balancer-controller, you have to wait for the controller to create the load balancer, and only then can you use the AWS console or APIs to get the ARN.</p>
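<p>To make the relationship between the ARN and the dimension concrete, here's a minimal Go sketch that pulls the <code>LoadBalancer</code> dimension value out of a full ARN. The function name, account ID, and region are made up for illustration; this isn't code from the controller.</p>
<div class="language-go codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-go codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">package main

import (
	"fmt"
	"strings"
)

// loadBalancerDimension returns the value CloudWatch expects for the
// "LoadBalancer" dimension: everything after the "loadbalancer/"
// resource prefix of the load balancer's ARN.
// (Hypothetical helper, named for this example only.)
func loadBalancerDimension(arn string) (string, error) {
	_, dim, ok := strings.Cut(arn, ":loadbalancer/")
	if !ok || dim == "" {
		return "", fmt.Errorf("not a load balancer ARN: %q", arn)
	}
	return dim, nil
}

func main() {
	// Placeholder ARN: the account ID and region are invented.
	arn := "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/scrumulator/2ab15bf8abef2f4c"
	dim, err := loadBalancerDimension(arn)
	if err != nil {
		panic(err)
	}
	fmt.Println(dim) // app/scrumulator/2ab15bf8abef2f4c
}</code></pre></div></div>
<p>The trailing hex ID is the part AWS generates, which is why the dimension can't be known before the load balancer exists.</p>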
<p>Now, if you want to build a Grafana dashboard with this, you need to hard-code this <code>LoadBalancer</code> ID into your dashboard's configuration. And since we don't know the ID until after the load balancer is created, we can't scaffold out a dashboard or alert rules until the service is live in production. Yuck.</p>
<p>And since you probably want your dashboards parameterized so they can display metrics for multiple environments, you'd need to create a specially-formatted Grafana variable to map the environment to load balancer ID. This is <a href="https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/#add-a-custom-variable" target="_blank" rel="noopener noreferrer" class="">possible</a>, but isn't particularly straightforward.</p>
<p>Don't worry- it gets worse. The CloudWatch datasource for Grafana has 2 completely different types of queries: "Metrics query" and "Metrics Insights", as well as 2 different editing modes "Builder" and "Code". This creates 4 different modes for writing queries, each with their own quirks and sharp edges- including how they work with Grafana variables. If you're lucky enough to find documentation or examples for one of these modes, it likely won't work at all with the others.</p>
<p>Oh, and did you want to be able to use a Grafana variable to switch the AWS region? Well that requires a <em>totally different</em> mechanism to parameterize, since the datasource itself is configured with a static region.</p>
<p>While all these problems can be worked around, the overall complexity, fragility, and inflexibility of this mess was causing teams to avoid using their load balancer metrics. Not good.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-elevator-pitch">The elevator pitch<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#the-elevator-pitch" class="hash-link" aria-label="Direct link to The elevator pitch" title="Direct link to The elevator pitch" translate="no">​</a></h2>
<p>Here was my idea:</p>
<p>I'd build a Kubernetes controller that would discover all the load balancers owned by Ingresses/Services in our clusters, scrape their CloudWatch metrics, and convert them into Prometheus metrics.</p>
<p>The big win here would be that I could control what labels to put on the Prometheus metrics- so I could create a <code>load_balancer_name</code> label with the load balancer name tag, which is deterministic, and specified in the Ingress. This would make building panels/alerts just as easy for load balancer metrics as it is for application metrics.</p>
<p>We could also add labels like <code>job</code>, <code>namespace</code>, <code>env</code>, <code>region</code>, and <code>cluster</code>, that matched the labels of the service that owned the load balancer. This would allow our developers to select metrics the same way they do for their application's own OTel metrics.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Prior art</div><div class="admonitionContent_BuS1"><p>I've used some other great tools to scrape CloudWatch metrics (e.g. <a href="https://github.com/prometheus-community/yet-another-cloudwatch-exporter" target="_blank" rel="noopener noreferrer" class="">YACE</a>) in the past, but none of them track the relationship between Kubernetes objects and the load balancers they manage, so they don't have the ability to add labels based on the owning Kubernetes object- which is key here.</p></div></div>
<p>For example, let's say I created a Kubernetes Ingress and used an annotation from the aws-load-balancer-controller to set the name:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">annotations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">alb.ingress.kubernetes.io/load-balancer-name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> scrumulator</span><br></span></code></pre></div></div>
<p>Then, to get 5xx errors per minute, a fully parameterized PromQL query would look like:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">alb_http_code_elb_5xx_count_sum{</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    load_balancer_name="scrumulator",</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    env="$env",</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    region="$region"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span></code></pre></div></div>
<p>This would be <em>ridiculously</em> easy to use in comparison to doing it with CloudWatch, plus I'd be able to create simple templates to scaffold out Grafana panels and alert rules for these metrics.</p>
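<p>As a rough illustration of the labeling idea, here's a Go sketch that derives a label set from an Ingress's annotations. The <code>ingressInfo</code> and <code>clusterInfo</code> types and the <code>metricLabels</code> function are hypothetical stand-ins (a real controller would read these from the Kubernetes API objects and its own configuration), and the <code>job</code> label is left out because how it's derived depends on your conventions.</p>
<div class="language-go codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-go codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">package main

import "fmt"

// nameAnnotation is the aws-load-balancer-controller annotation that
// sets a deterministic load balancer name.
const nameAnnotation = "alb.ingress.kubernetes.io/load-balancer-name"

// ingressInfo is a hypothetical, trimmed-down view of an Ingress.
type ingressInfo struct {
	Namespace   string
	Annotations map[string]string
}

// clusterInfo holds values a controller would know from its own
// configuration (hypothetical).
type clusterInfo struct {
	Env, Region, Cluster string
}

// metricLabels builds the Prometheus labels for a load balancer owned
// by the given Ingress. It reports false when the Ingress doesn't set
// a load balancer name, since there is nothing stable to label with.
func metricLabels(ing ingressInfo, c clusterInfo) (map[string]string, bool) {
	name, ok := ing.Annotations[nameAnnotation]
	if !ok || name == "" {
		return nil, false
	}
	return map[string]string{
		"load_balancer_name": name,
		"namespace":          ing.Namespace,
		"env":                c.Env,
		"region":             c.Region,
		"cluster":            c.Cluster,
	}, true
}

func main() {
	ing := ingressInfo{
		Namespace:   "scrum",
		Annotations: map[string]string{nameAnnotation: "scrumulator"},
	}
	labels, ok := metricLabels(ing, clusterInfo{Env: "prod", Region: "us-east-1", Cluster: "main"})
	fmt.Println(ok, labels["load_balancer_name"]) // true scrumulator
}</code></pre></div></div>
<p>Because every label comes from something specified up front (the annotation, the namespace, the cluster's config), dashboards and alerts can be templated before the load balancer is ever created.</p>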
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="babys-first-kubernetes-controller">Baby's first Kubernetes controller<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#babys-first-kubernetes-controller" class="hash-link" aria-label="Direct link to Baby's first Kubernetes controller" title="Direct link to Baby's first Kubernetes controller" translate="no">​</a></h2>
<p>I'd never built a Kubernetes controller before, so I spent a few minutes digesting the <a href="https://sdk.operatorframework.io/docs/" target="_blank" rel="noopener noreferrer" class="">Operator SDK documentation</a> to get an idea of how it could work. The Operator SDK has a <a href="https://sdk.operatorframework.io/docs/cli/operator-sdk_create_api/" target="_blank" rel="noopener noreferrer" class="">scaffolding tool</a>, so I used that to create the shell of a basic controller that watches for changes to Ingresses in a cluster, added some dumb logging, and deployed it to a test cluster. After a little fiddling with RBAC and IAM integration, I had it logging each modification to any Ingress in the cluster.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="scraping-from-cloudwatch">Scraping from CloudWatch<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#scraping-from-cloudwatch" class="hash-link" aria-label="Direct link to Scraping from CloudWatch" title="Direct link to Scraping from CloudWatch" translate="no">​</a></h3>
<p>One of the biggest sources of uncertainty was around cost implications of scraping CloudWatch metrics, so I spent some extra time up-front making sure I understood the pricing model, and how to use the APIs in the optimal way (e.g. batching requests at the right size, keeping total returned datapoints under the right limits).</p>
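<p>The batching part is simple to sketch. This assumes the documented cap of 500 metric queries per <code>GetMetricData</code> request (worth re-checking against the current AWS docs), and uses a generic <code>batch</code> helper that I've named for this example; it needs Go 1.21+ for the built-in <code>min</code> and <code>max</code>.</p>
<div class="language-go codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-go codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">package main

import "fmt"

// maxQueriesPerRequest is the assumed GetMetricData limit on metric
// queries per request. Verify against current AWS documentation.
const maxQueriesPerRequest = 500

// batch splits items into consecutive slices of at most size elements.
// A non-positive size is treated as 1 so the loop always makes progress.
func batch[T any](items []T, size int) [][]T {
	size = max(size, 1)
	var out [][]T
	for len(items) != 0 {
		n := min(size, len(items))
		out = append(out, items[:n])
		items = items[n:]
	}
	return out
}

func main() {
	// Stand-in for 1201 metric queries (e.g. several metrics for each
	// of many load balancers); the IDs are placeholders.
	queries := make([]string, 1201)
	for i := range queries {
		queries[i] = fmt.Sprintf("q%d", i)
	}
	batches := batch(queries, maxQueriesPerRequest)
	fmt.Println(len(batches)) // 3
}</code></pre></div></div>
<p>Each batch would then become one API request, so the request count (and therefore cost) scales with the number of batches rather than the number of individual metrics.</p>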
<p>I spent the next few hours jumping between CloudWatch documentation and cajoling GitHub Copilot (alternating between a few different models) to write me some Go to grab metrics via the AWS CloudWatch SDK. It actually did a pretty great job on this. It needed a little feedback along the way, but for the most part, I did this almost entirely via prompts, and managed to avoid the temptation to just tweak the code by hand.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="exposing-prometheus-metrics">Exposing Prometheus metrics<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#exposing-prometheus-metrics" class="hash-link" aria-label="Direct link to Exposing Prometheus metrics" title="Direct link to Exposing Prometheus metrics" translate="no">​</a></h3>
<p>After I got it logging CloudWatch data points for a sample app's load balancer, I started looking into how to expose these as Prometheus metrics. I quickly discovered that the <a href="https://pkg.go.dev/github.com/prometheus/client_golang/prometheus" target="_blank" rel="noopener noreferrer" class="">Prometheus Go SDK</a>'s basic interfaces aren't designed for one of my fundamental requirements: setting the timestamp of the metrics to match the time of the CloudWatch datapoint.</p>
<p>I spent a bit of time figuring out how to implement the Prometheus SDK's <a href="https://github.com/prometheus/client_golang/blob/main/prometheus/collector.go" target="_blank" rel="noopener noreferrer" class="">Collector interface</a>, which allows specifying the timestamp of a datapoint explicitly.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>LLM fails</div><div class="admonitionContent_BuS1"><p>The models I tried sucked at generating a Prometheus Collector implementation, probably because there are far fewer examples of using the Collector interface for this type of edge case.</p><p>I also tried explicitly pointing it at some <a href="https://github.com/google/cadvisor/blob/f6e31a3cff918285fd74cb1f20d0af32c3554a68/collector/prometheus_collector.go#L56" target="_blank" rel="noopener noreferrer" class="">example code</a> and <a href="https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#hdr-Custom_Collectors_and_constant_Metrics" target="_blank" rel="noopener noreferrer" class="">documentation</a> for context, but it didn't help much.</p><p>I also noticed that the botched attempts it made all resulted in compilation errors, and even in Copilot's agent mode, it didn't use feedback from the language server to iterate and fix the problems. I imagine once GitHub gets this working, the tooling will get a whole lot better.</p></div></div>
<p>I got it working roughly before I realized that this had some fundamental limitations:</p>
<ul>
<li class="">With a Prometheus client, the pull model introduces an additional time interval to data that's already delayed by a couple minutes by the CloudWatch API. It also adds unpredictability around the actual delay due to the alignment of the Prometheus scrape interval with my CloudWatch scrape interval.</li>
<li class="">The Prometheus text exposition format doesn't have a way to specify multiple datapoints (with different timestamps) for the same metric/labels combination. This means I couldn't "backfill" metrics with older timestamps once I had a newer one.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="switching-to-opentelemetry-metrics">Switching to OpenTelemetry metrics<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#switching-to-opentelemetry-metrics" class="hash-link" aria-label="Direct link to Switching to OpenTelemetry metrics" title="Direct link to Switching to OpenTelemetry metrics" translate="no">​</a></h3>
<p>I'm pretty familiar with the OpenTelemetry metrics model, so I figured I'd see if pushing metric datapoints via OTLP would overcome the limitations of the Prometheus client's pull model.</p>
<p>Again, I found I couldn't use the OTel Go SDK's primary interface, since it also doesn't allow setting the timestamp of a metric explicitly. Once again, I dropped down to the lower-level OTLP APIs, and it didn't take long before I was shipping metrics to our Grafana-hosted Prometheus backend.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>LLMs aren't going to take your job yet</div><div class="admonitionContent_BuS1"><p>Again, the LLM wasn't able to help a lot with this part, because there are very few examples of writing metrics using the OTLP APIs, other than in the OTel SDK itself. It made a few botched attempts, which didn't work at all.</p><p>This said: even though the code it wrote was completely wrong, it gave me hints about where to look in the OTel SDK source code, which ultimately got me to working, hand-written code.</p></div></div>
<p>I built a dashboard that allowed me to compare the underlying CloudWatch metrics with the scraped Prometheus metrics, which helped me work through a bunch of edge cases and quirks of CloudWatch APIs.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>CloudWatch metrics lie!</div><div class="admonitionContent_BuS1"><p>Unlike Prometheus, in which datapoints are immutable once written, CloudWatch metrics' datapoints can actually change for up to a few minutes after they're written! I ended up realizing I simply couldn't trust a CloudWatch datapoint that was younger than 2 minutes old, since they almost always report a value that is much too low. Once I updated the controller with a rule to only scrape 2-minute old data, the series began to look virtually identical between the two sources.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="rubber-meet-road">Rubber, meet road<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#rubber-meet-road" class="hash-link" aria-label="Direct link to Rubber, meet road" title="Direct link to Rubber, meet road" translate="no">​</a></h3>
<p>At this point, I was fairly certain this approach was going to work, so I spent another hour or two cleaning it up before I started showing it to some folks to see if they thought it was useful... which they did, because CloudWatch is objectively terrible.</p>
<p>Skipping ahead a few more days, I'd polished it up, added support for Services and NLBs, done a bunch of refactoring and testing, added some self-telemetry, and deployed it to a couple pre-production clusters. After some more refinement, it went out to all our clusters in production.</p>
<p>Within another couple days, we had dozens of Grafana dashboards and alert rules using these new metrics. Within the first day of using the metrics, I found an error with the readiness gates on one of our internal services that was causing 5XXs on pod terminations.</p>
<p>Great success!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reflections-on-using-gen-ai-for-something-non-trivial">Reflections on using gen AI for something non-trivial<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#reflections-on-using-gen-ai-for-something-non-trivial" class="hash-link" aria-label="Direct link to Reflections on using gen AI for something non-trivial" title="Direct link to Reflections on using gen AI for something non-trivial" translate="no">​</a></h2>
<p>While the LLMs I used weren't much help in writing code against lower-level Prometheus or OTel APIs, they were tremendously helpful in speeding me up in a bunch of other ways.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="writing-code-against-widely-used-apis">Writing code against widely used APIs<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#writing-code-against-widely-used-apis" class="hash-link" aria-label="Direct link to Writing code against widely used APIs" title="Direct link to Writing code against widely used APIs" translate="no">​</a></h4>
<p>Writing basic data manipulation code for things like the CloudWatch API response schema isn't exactly hard, but it can be time consuming. With Copilot, I barely had to look at the schema documentation at all; the LLM got this almost entirely right.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="implementing-algorithms-and-data-structures">Implementing algorithms and data structures<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#implementing-algorithms-and-data-structures" class="hash-link" aria-label="Direct link to Implementing algorithms and data structures" title="Direct link to Implementing algorithms and data structures" translate="no">​</a></h4>
<p>Once I had a well-written prompt, the LLMs did really great at writing the boring plumbing code to manage metadata (e.g. mapping CloudWatch metric names and labels to Prometheus versions). They initially wrote some mildly inefficient code, and needed a little high-level feedback to optimize some data structures, but ultimately did the bulk of this work for me, with most of the prompts being closer to business requirements than implementation details.</p>
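<p>As an example of the kind of plumbing I mean, here's a small sketch of turning CloudWatch-style metric names into Prometheus-style ones. The <code>aws_alb_</code> prefix and both function names are assumptions for illustration, not the controller's real naming scheme.</p>

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// snakeCase lowercases name and inserts underscores at word boundaries:
// before an uppercase letter that follows a lowercase one, and before the
// last letter of an acronym that is followed by a lowercase letter.
func snakeCase(name string) string {
	runes := []rune(name)
	var b strings.Builder
	for i, r := range runes {
		if unicode.IsUpper(r) && i > 0 {
			prev := runes[i-1]
			nextLower := i+1 < len(runes) && unicode.IsLower(runes[i+1])
			if unicode.IsLower(prev) || (unicode.IsUpper(prev) && nextLower) {
				b.WriteByte('_')
			}
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// prometheusName adds a hypothetical namespace prefix to the converted name.
func prometheusName(cloudWatchName string) string {
	return "aws_alb_" + snakeCase(cloudWatchName)
}

func main() {
	for _, n := range []string{"RequestCount", "TargetResponseTime", "HTTPCode_Target_5XX_Count"} {
		fmt.Println(n, "->", prometheusName(n))
	}
}
```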
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="suggesting-high-level-architectural-approaches">Suggesting high-level architectural approaches<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#suggesting-high-level-architectural-approaches" class="hash-link" aria-label="Direct link to Suggesting high-level architectural approaches" title="Direct link to Suggesting high-level architectural approaches" translate="no">​</a></h4>
<p>I don't have a ton of Go experience, so it took me a while to land on a concurrency model that worked well for this use case (something like an actor model). Once I realized I didn't like my first approach, and came up with a good high-level description of what I wanted, the LLM was able to produce the bones of an actor model implementation that ended up being exactly what I wanted.</p>
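<p>For anyone who hasn't seen the pattern in Go, this is roughly what "the bones of an actor model" look like: a goroutine that owns its state and a channel that serves as its mailbox. It's a generic sketch, with hypothetical <code>message</code> and <code>actor</code> types, rather than the code the LLM produced for me.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// message is a hypothetical mailbox item: one datapoint for one load balancer.
type message struct {
	lb    string
	value float64
}

// actor owns totals; only its own goroutine reads or writes the map while running.
type actor struct {
	mailbox chan message
	totals  map[string]float64
	done    chan struct{}
}

func newActor() *actor {
	a := &actor{
		mailbox: make(chan message, 16),
		totals:  make(map[string]float64),
		done:    make(chan struct{}),
	}
	go a.run()
	return a
}

// run processes messages one at a time until the mailbox is closed.
func (a *actor) run() {
	defer close(a.done)
	for m := range a.mailbox {
		a.totals[m.lb] += m.value
	}
}

// stop closes the mailbox, waits for it to drain, and returns the final state.
func (a *actor) stop() map[string]float64 {
	close(a.mailbox)
	<-a.done
	return a.totals
}

func main() {
	a := newActor()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			a.mailbox <- message{lb: "demo", value: 1}
		}()
	}
	wg.Wait()
	fmt.Println("total for demo:", a.stop()["demo"])
}
```

<p>Because all mutation happens inside <code>run</code>, senders never need a mutex; they only need the channel.</p>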
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="refactoring">Refactoring<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#refactoring" class="hash-link" aria-label="Direct link to Refactoring" title="Direct link to Refactoring" translate="no">​</a></h4>
<p>While iterating through various concurrency models, and also with the switch to the OTel SDK, the tooling was incredibly useful for refactoring. This is where, even without integration with the Go language server, Copilot's agent mode saved a huge amount of time on tasks that would otherwise have been pretty mechanical and tedious.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="building-features--with-the-right-prompt-granularity">Building features- with the right prompt granularity<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#building-features--with-the-right-prompt-granularity" class="hash-link" aria-label="Direct link to Building features- with the right prompt granularity" title="Direct link to Building features- with the right prompt granularity" translate="no">​</a></h4>
<p>The optimal granularity I've found for prompts is to describe relatively small, but complete features, like a single, focused user story. When I asked the models to do more than that at a time, I ended up wasting a lot of cycles trying to understand what they did, and they were more likely to diverge from my intent.</p>
<p>When I requested relatively bite-sized features, the changes were much more obvious, easier to verify, and more often than not, correct.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="courage-boost">Courage boost<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#courage-boost" class="hash-link" aria-label="Direct link to Courage boost" title="Direct link to Courage boost" translate="no">​</a></h4>
<p>I think the biggest advantage gen AI gave me here was the courage to try something that might have been a bit too ambitious otherwise. I certainly could have built this controller without it, but honestly it probably would have doubled or tripled the effort. I would have been much less likely to attempt this within a time-box of a few days.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="next-up">Next up<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#next-up" class="hash-link" aria-label="Direct link to Next up" title="Direct link to Next up" translate="no">​</a></h2>
<p>This experience has got me pumped to double down and get more ambitious. I've got a number of things I want to try next:</p>
<ul>
<li class="">I'd like to see if it's possible to use LLMs to generate integration tests for a legacy system that has so far resisted efforts to be refactored. I'm wondering if an LLM would be capable of finding all the subtle, implicit behaviors that aren't actually visible in the API, but still function like bubblegum holding the larger system together. If we could use it to generate really comprehensive, working tests, it might then give us the confidence to be a lot more aggressive about rearchitecting it.</li>
<li class="">There are a number of OSS projects I've wanted to contribute to, but the effort required to understand their architecture has deterred me. I'd like to try using LLMs as a tool to accelerate my understanding of their structure; not so much as a tool to write code, but just to comprehend it.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="anyone-interested-in-a-slightly-used-k8s-controller">Anyone interested in a slightly used k8s controller?<a href="https://labaneilers.com/using-ai-to-create-a-k8s-controller-in-a-hurry#anyone-interested-in-a-slightly-used-k8s-controller" class="hash-link" aria-label="Direct link to Anyone interested in a slightly used k8s controller?" title="Direct link to Anyone interested in a slightly used k8s controller?" translate="no">​</a></h2>
<p>If anyone is interested in using the controller I built, I'm considering open sourcing it, and I'd love to hear from you. Please reach out!</p>]]></content:encoded>
            <category>platform-engineering</category>
            <category>AI</category>
            <category>opentelemetry</category>
            <category>kubernetes</category>
            <category>prometheus</category>
            <category>cloudwatch</category>
        </item>
        <item>
            <title><![CDATA[Observability Signals: choose wisely]]></title>
            <link>https://labaneilers.com/observability-signals-choose-wisely</link>
            <guid>https://labaneilers.com/observability-signals-choose-wisely</guid>
            <pubDate>Tue, 25 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[A quick review of the "three pillars" of observability: logs, metrics, and traces- their strengths and weaknesses, how to match the right signals to particular use cases.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>A quick review of the "three pillars" of observability: logs, metrics, and traces- their strengths and weaknesses, how to match the right signals to particular use cases.</p><p>The fun part: we'll explore this via a real-world app, adding observability as it grows from a small monolith into a large, microservices architecture.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/three-pillars-6a33e6061ae34624f08f7278a1ac124e.jpg" class="blog-image" alt="Three pillars, inscribed recordum, mensurae, and vestigium">
<p>I spend quite a bit of time helping developers figure out how to make sense of telemetry data they've collected. Often this is showing them how to navigate our observability tooling, how to think about visualizing data, or just tossing some saucy query tricks into the mix.</p>
<p>Sometimes though, I find that their telemetry is... ill-conceived. I occasionally find metrics being written like they were a point-in-time event, missing context in logs because someone was worried about adding too much cardinality, or log lines with variable and static context needlessly mashed together. Alas.</p>
<p>This kind of thing makes me wish I'd had a chance to catch them earlier, when they were just beginning to consider instrumentation choices.</p>
<p>Given how often this happens, I've built up some insights on some of the most common misconceptions about the different observability signals, what makes them different from one another, and why you'd choose a specific one for a given situation.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-basics-observability-signals-aka-the-three-pillars">The basics: observability signals (a.k.a. the "three pillars")<a href="https://labaneilers.com/observability-signals-choose-wisely#the-basics-observability-signals-aka-the-three-pillars" class="hash-link" aria-label="Direct link to The basics: observability signals (a.k.a. the &quot;three pillars&quot;)" title="Direct link to The basics: observability signals (a.k.a. the &quot;three pillars&quot;)" translate="no">​</a></h2>
<p>Let's review: what do we mean by "observability signals"? We're talking about the venerable "three pillars" of observability: logs, metrics, and traces.</p>
<ul>
<li class=""><strong>Logs</strong> are streams of text that usually represent individual events. Logs are most powerful when they're structured (e.g. JSON or otherwise parsable); they can be arbitrarily "wide" and can have many, high-cardinality fields (no limit on the number of possible values per field).</li>
<li class=""><strong>Metrics</strong> are series of numeric measurements over time, labeled with key/value pairs. Metrics represent <em>aggregations</em> of data (e.g. sums, counts, percentiles) and <em>not</em> individual events. Time-series databases store metrics in an extremely compact and cost-effective way, with the tradeoff that they require relatively low-cardinality labels (a low number of unique combinations of label values) to operate efficiently.</li>
<li class=""><strong>Traces</strong> are a special kind of logs that are rigidly structured to provide context about a set of related operations that occur across a (usually distributed) system. Individual events in traces are called "spans", which represent operations with a start/end time, and are linked together in a nested structure via correlating IDs. Spans can also contain arbitrarily wide, high-cardinality fields.</li>
</ul>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>fancy, schmancy</div><div class="admonitionContent_BuS1"><p>There's other, more exotic signals (e.g. profiles) but we're going to stick to meat and potatoes here.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="enough-with-the-theory-already">Enough with the theory already<a href="https://labaneilers.com/observability-signals-choose-wisely#enough-with-the-theory-already" class="hash-link" aria-label="Direct link to Enough with the theory already" title="Direct link to Enough with the theory already" translate="no">​</a></h2>
<p>Blah, blah, blah. I'm dying for a real-world example. How about you?</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-petface">Introducing PetFace<a href="https://labaneilers.com/observability-signals-choose-wisely#introducing-petface" class="hash-link" aria-label="Direct link to Introducing PetFace" title="Direct link to Introducing PetFace" translate="no">​</a></h3>
<p>Let's say we've got a startup idea for a mobile app: it allows users to select pictures of their pets, upload a video, and then some fancy AI processes them and substitutes the faces of anyone in the video with the pets' faces. Hilarity ensues.</p>
<p>Let's call this app "PetFace".</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Patent pending</div><div class="admonitionContent_BuS1"><p>Don't steal my app idea. This is going to make me rich.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-of-petface">The architecture of PetFace<a href="https://labaneilers.com/observability-signals-choose-wisely#the-architecture-of-petface" class="hash-link" aria-label="Direct link to The architecture of PetFace" title="Direct link to The architecture of PetFace" translate="no">​</a></h3>
<p>For the supporting backend of this app, let's build a little API server which does the following:</p>
<ul>
<li class="">receives pet pictures and a video from the mobile app via an API</li>
<li class="">runs the fancy generative AI to replace human faces with pet faces</li>
<li class="">writes the processed video to S3</li>
<li class="">updates a database record to indicate the video processing is complete</li>
</ul>
<p>The app would poll the API server to check the status of the video processing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="making-petface-observable">Making PetFace observable<a href="https://labaneilers.com/observability-signals-choose-wisely#making-petface-observable" class="hash-link" aria-label="Direct link to Making PetFace observable" title="Direct link to Making PetFace observable" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="lets-start-with-some-nice-structured-logs">Let's start with some nice, structured logs<a href="https://labaneilers.com/observability-signals-choose-wisely#lets-start-with-some-nice-structured-logs" class="hash-link" aria-label="Direct link to Let's start with some nice, structured logs" title="Direct link to Let's start with some nice, structured logs" translate="no">​</a></h3>
<p>OK, let's say we've got the MVP of the app done, and now we want to set up some observability.</p>
<p>99% of the time, the first thing to do is just sprinkle in some logging. Here's an example of some nice, structured log events:</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>These are multi-line and indented for readability. Don't actually write your logs with line breaks.</p></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">&lt;ts&gt; </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  level=info</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  event=HttpRequest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  message="POST /v1/video"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  status=201</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  request_id=abcdefg </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  pet_pics_count=3 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  duration_ms=7830 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  user_id=12345 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  instance=api-3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">&lt;ts&gt; </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  level=info</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  event=VideoReceived</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  message="video uploaded successfully" </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  request_id=abcdefg </span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">  pet_pics_count=3 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  duration_ms=7830 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  user_id=12345 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  instance=api-3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">&lt;ts&gt; </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  level=info</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  event=VideoProcessed</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  message="video processed successfully" </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  request_id=abcdefg</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  duration_ms=13204</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  user_id=12345 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  instance=api-3</span><br></span></code></pre></div></div>
<p>We'll also need some additional event types (e.g. "VideoSaved", "DatabaseUpdated") and probably also a message type for errors:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">&lt;ts&gt; </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  level=error</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  event=VideoProcessingError</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  message="video processing failed: too many pet photos" </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  request_id=abcdefg </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  user_id=12345 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  instance=api-3 </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  error=ETOOMANYPETS</span><br></span></code></pre></div></div>
<p>As we build, test, and troubleshoot the PetFace backend, these log messages will be invaluable for understanding what's going on for any individual request.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="deriving-metrics-from-logs">Deriving metrics from logs<a href="https://labaneilers.com/observability-signals-choose-wisely#deriving-metrics-from-logs" class="hash-link" aria-label="Direct link to Deriving metrics from logs" title="Direct link to Deriving metrics from logs" translate="no">​</a></h3>
<p>As PetFace's usage begins to grow, scanning raw log lines with our eyeballs to try to determine the health of the backend isn't going to cut it. To reason about large amounts of data, we're going to want to use the magic of math to extract aggregate, numeric time-series we can visualize.</p>
<p>In this case, we may want to start drawing graphs to represent:</p>
<ul>
<li class=""><em>How many requests are we getting per second?</em></li>
<li class=""><em>What are the average, 95th percentile, and 99th percentile of video processing time?</em></li>
<li class=""><em>What is the percentage of requests that result in an error?</em></li>
</ul>
<p>Luckily, our logs already contain all the data we need in structured fields. If you've got a log query language that supports grouping and some aggregation functions, you can generate time-series from your logs.</p>
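<p>Before reaching for a query language, it can help to see what those aggregation functions actually compute. Here's a rough, query-language-agnostic sketch over already-parsed events; the <code>logEvent</code> type and the nearest-rank percentile method are choices made for this example, and real backends may compute percentiles differently.</p>

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// logEvent is a hypothetical, already-parsed log line.
type logEvent struct {
	Level      string
	DurationMS float64
}

// percentile returns the nearest-rank percentile (0 < p <= 100) of values.
func percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// errorPercent returns the percentage of events logged at level "error".
func errorPercent(events []logEvent) float64 {
	if len(events) == 0 {
		return 0
	}
	errs := 0
	for _, e := range events {
		if e.Level == "error" {
			errs++
		}
	}
	return 100 * float64(errs) / float64(len(events))
}

func main() {
	events := []logEvent{{"info", 7830}, {"info", 13204}, {"error", 950}, {"info", 8100}}
	durations := make([]float64, 0, len(events))
	for _, e := range events {
		durations = append(durations, e.DurationMS)
	}
	fmt.Printf("p95=%.0fms errors=%.0f%%\n", percentile(durations, 95), errorPercent(events))
}
```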
<p>Here's an example using LogQL (the query language for Loki) to get <code>requests per second</code>. First, here's a basic log selector expression that gets the request logs for PetFace in production:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># get production logs for PetFace</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">service_name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"PetFace"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">env</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"production"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># parses the key/value pairs</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> logfmt </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># select only HTTP request logs</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> 
event</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"HttpRequest"</span><span class="token plain"> </span><br></span></code></pre></div></div>
<figure><img src="https://labaneilers.com/assets/images/logs-d82da092b54363533e873e4478725df3.png" alt="Logs"><figcaption>Raw request logs</figcaption></figure>
<p>Now we can convert it into a single metric series by wrapping it in <code>rate()</code> and <code>sum()</code> functions to calculate requests per second over time:</p>
<div class="language-javascript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-javascript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">sum</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token function" style="color:#d73a49">rate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">service_name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"PetFace"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">env</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"production"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> logfmt </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token operator" style="color:#393A34">|</span><span class="token plain"> event</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"HttpRequest"</span><span class="token plain"> 
</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">$__auto</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre></div></div>
<figure><img src="https://labaneilers.com/assets/images/series-e61a8ca7b065ff6c185771d38f3b4a2c.jpg" alt="A time-series graph"><figcaption>Requests per second</figcaption></figure>
<p>Voila! Now we can use our visual cortex (a.k.a. the GPU inside our brain) to comprehend <em>tons</em> more data than we could possibly hope to by scanning through log lines! We can spot trends, anomalous blips, and build an intuitive sense of what healthy vs unhealthy looks like for our app.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>metrics are not events</div><div class="admonitionContent_BuS1"><p>This helps illustrate one of the most fundamental misconceptions about metrics, and how they differ from logs:</p><ul>
<li class="">A log line is an event: something happened at one point in time.</li>
<li class="">A metric timeseries is a long-running series of measurements <em>over time</em>, where each data point represents <em>all the events that happened</em> since the last data point.</li>
</ul></div></div>
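<p>To make the event-vs-series distinction concrete, here's a minimal sketch of the same idea in plain Python: individual event timestamps go in, and one data point per time window comes out. This is not how Loki implements <code>rate()</code>; the function name, the fixed 60-second windows, and the sample timestamps are all assumptions for illustration.</p>
<pre><code class="language-python">from collections import Counter


def events_to_rate_series(event_timestamps, window_seconds=60):
    """Turn individual event timestamps (in seconds) into a per-window rate series.

    Returns a sorted list of (window_start, events_per_second) tuples.
    Windows that contain no events are simply absent from the result.
    """
    # Each event lands in exactly one fixed-size window.
    counts = Counter(
        int(ts // window_seconds) * window_seconds for ts in event_timestamps
    )
    # One data point per window: the count summarizes every event inside it.
    return [(start, counts[start] / window_seconds) for start in sorted(counts)]


# Hypothetical request timestamps, in seconds since some starting point.
timestamps = [1, 5, 12, 59, 61, 75, 130]
series = events_to_rate_series(timestamps, window_seconds=60)
</code></pre>
<p>Seven events become three data points. Each point tells you how busy its window was, and nothing about any single request.</p>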
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="logs-get-expensive-at-scale">Logs get expensive at scale<a href="https://labaneilers.com/observability-signals-choose-wisely#logs-get-expensive-at-scale" class="hash-link" aria-label="Direct link to Logs get expensive at scale" title="Direct link to Logs get expensive at scale" translate="no">​</a></h3>
<p>PetFace is blowing up! We're scaling our backend dramatically, and our bill for log storage is killing us. Logs are super powerful and flexible, but they cost a lot: not just to store, but also for our code to generate, serialize, buffer, and write to stdout, and ultimately to harvest and transmit to our observability backend.</p>
<p>But we're using the logs to generate metrics that have become <em>really, really</em> valuable. We're using them to drive alerts to tell us when something has gone off the rails, to project our growth, and to tune our system. We can't live without them.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="do-metrics-aggregation-directly-in-the-app">Do metrics aggregation directly in the app<a href="https://labaneilers.com/observability-signals-choose-wisely#do-metrics-aggregation-directly-in-the-app" class="hash-link" aria-label="Direct link to Do metrics aggregation directly in the app" title="Direct link to Do metrics aggregation directly in the app" translate="no">​</a></h3>
<p>What if, instead of generating the metrics from the logs, we just do the metrics aggregation directly in our app? This would not only save the cost of log storage, but also remove all the overhead of generating the logs from the app.</p>
<p>For example: to track request rate per second, we could just have an integer counter that we increment on each request, and then send the sum total to a timeseries database at a regular interval (e.g. 1 minute).</p>
<p>Timeseries databases are <em>incredibly cheap</em> to run compared to log aggregators. If we are collecting metrics directly, we can decrease the log level (i.e. change a setting to only emit log lines with <code>level=warning</code> or above), and save a ton of money on log storage, without losing these critical metrics.</p>
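<p>Here's a rough sketch of what that in-app counter could look like. Everything here is hypothetical: <code>RequestCounter</code>, <code>report</code>, and the <code>send</code> callback are stand-ins, not the API of any real metrics library. In practice you'd reach for an existing client library rather than rolling your own.</p>
<pre><code class="language-python">import threading


class RequestCounter:
    """Counts events in memory; flush() reports the count and resets it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def increment(self, amount=1):
        # Called once per request; the lock keeps concurrent handlers safe.
        with self._lock:
            self._count += amount

    def flush(self):
        """Return the count since the last flush, then reset to zero."""
        with self._lock:
            count, self._count = self._count, 0
        return count


def report(counter, interval_seconds, send):
    """Flush the counter and hand a single data point to `send`.

    `send` stands in for whatever client writes to the time-series database.
    """
    count = counter.flush()
    send({
        "name": "http_requests",
        "count": count,
        "per_second": count / interval_seconds,
    })


counter = RequestCounter()
for _ in range(120):  # pretend 120 requests arrived during the last minute
    counter.increment()

sent = []  # stand-in for the time-series database
report(counter, interval_seconds=60, send=sent.append)
</code></pre>
<p>No matter how many requests arrive, the app ships one small data point per interval. This sketch resets the counter on every flush to keep things simple; many real metrics libraries instead report a cumulative total and let the database compute the rate.</p>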
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>The tradeoff with metrics</div><div class="admonitionContent_BuS1"><p>While this is a dramatic cost optimization, we should note the tradeoff: replacing logs with metrics is a <em>lossy</em> conversion. Metrics don't have the rich, high-cardinality context that you get with logs.</p><p>For example, you can't look at a blip in an "error rate" time-series and use it to diagnose a specific failed request- that information was lost when the metric was aggregated.</p><p>Metrics are a key lever available to you to manage observability costs: as your system becomes more mature, you can use metrics to distill down the most important indicators of your system's health. This can allow you to tune down the verbosity of your logs, saving lots of money.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-to-use-metrics-completely-wrong">How to use metrics completely wrong<a href="https://labaneilers.com/observability-signals-choose-wisely#how-to-use-metrics-completely-wrong" class="hash-link" aria-label="Direct link to How to use metrics completely wrong" title="Direct link to How to use metrics completely wrong" translate="no">​</a></h3>
<p>Let's say we've hired a bright-eyed young college intern at PetFace, and have asked them to implement a new feature: a button which allows a user to share their pet-faced videos with their friends by email.</p>
<p>This intern is super talented, and builds the feature in no time. The final requirement is to add some observability to track adoption of the new feature, so they add a new line of code which gets run once a user clicks the "send" button.</p>
<p>The intern figures that this metric will be most useful if it contains the full context of the email sending function:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">send_video_to_friend</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> video_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> friend_email</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># build content for an email to the user's friend</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    email_body </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> build_email_body</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> video_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token comment" style="color:#999988;font-style:italic"># Send the email</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> send_email</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">email_body</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> friend_email</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    metrics</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">increment_counter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"petface_friend_email_sent"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      labels</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"user_id"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> user_id</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"video_id"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"friend_email"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> friend_email</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"success"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> result</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">success</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre></div></div>
<div class="theme-admonition theme-admonition-danger admonition_xJq3 alert alert--danger"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M5.05.31c.81 2.17.41 3.38-.52 4.31C3.55 5.67 1.98 6.45.9 7.98c-1.45 2.05-1.7 6.53 3.53 7.7-2.2-1.16-2.67-4.52-.3-6.61-.61 2.03.53 3.33 1.94 2.86 1.39-.47 2.3.53 2.27 1.67-.02.78-.31 1.44-1.13 1.81 3.42-.59 4.78-3.42 4.78-5.56 0-2.84-2.53-3.22-1.25-5.61-1.52.13-2.03 1.13-1.89 2.75.09 1.08-1.02 1.8-1.86 1.33-.67-.41-.66-1.19-.06-1.78C8.18 5.31 8.68 2.45 5.05.32L5.03.3l.02.01z"></path></svg></span>This is a cautionary example</div><div class="admonitionContent_BuS1"><p>Can you spot the problem?</p></div></div>
<p>I've mentioned previously that metric labels should be <em>low cardinality</em>. In this example, we've introduced <em>very high</em> cardinality: there's going to be a new, unique metric series for <em>every single combination</em> of <code>user_id</code>, <code>video_id</code>, and <code>friend_email</code>, which all have unbounded possible values.</p>
<p>Egads! That's effectively <strong>infinite cardinality</strong>, which will rapidly bring any time-series database (e.g. Prometheus, InfluxDB) to its knees- or cause your observability vendor bill to explode.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Time-series should be... series</div><div class="admonitionContent_BuS1"><p>The metrics code above isn't even really creating "series", since it's emitting only <em>one</em> data point per unique combination of labels. The whole point of using time-series is to track the <em>change in a measurement over time</em>.</p><p>When you have "series" with one data point each, that's a strong smell you're doing something wrong.</p></div></div>
<p>Metrics defined in code are for answering questions about <strong>aggregate quantities</strong>, <strong>not individual events</strong>. Here are some questions we could answer effectively with metrics:</p>
<ul>
<li class=""><em>How many video emails are being sent per minute?</em></li>
<li class=""><em>What percentage of email sending attempts fail?</em></li>
</ul>
<p>To answer these, we don't need <code>user_id</code>, <code>video_id</code>, or <code>friend_email</code>; we only need to count the number of sent emails, including the label <code>success</code>, which has only 2 values: true or false. This results in only 2 series. That's wicked cheap!</p>
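<p>The arithmetic behind that is worth a quick sketch. The worst-case number of series for a metric is the product of the number of distinct values each label can take. The counts below for <code>user_id</code>, <code>video_id</code>, and <code>friend_email</code> are invented for illustration, and the result is an upper bound: the real number of series is however many distinct combinations actually show up.</p>
<pre><code class="language-python">from math import prod


def max_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# The reviewed version: a single label with two possible values.
reviewed_labels = {"success": 2}

# The intern's version, with made-up counts of distinct values per label.
intern_labels = {
    "user_id": 10_000,
    "video_id": 50_000,
    "friend_email": 30_000,
    "success": 2,
}

reviewed_bound = max_series(reviewed_labels)  # 2
intern_bound = max_series(intern_labels)      # 30 trillion
</code></pre>
<p>Even with these modest guesses, the bound jumps from 2 series to 30 trillion, and it keeps growing as new users, videos, and email addresses appear.</p>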
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Important!</div><div class="admonitionContent_BuS1"><p>When you define metrics in code, you're <strong>deciding in advance what questions you want to answer</strong>. They <strong>are not</strong> useful for answering arbitrary, novel questions.</p></div></div>
<p>Since video sharing emails are a brand new feature, we may not have any idea what questions we want to answer up front. We may want to ask a series of more ad-hoc, open-ended questions like:</p>
<ul>
<li class=""><em>Which users send the most emails? Does it correlate to their subscription tier? Country of origin? Length of the video?</em></li>
<li class=""><em>Which users are associated with the highest send failure rates? What could be causing that?</em></li>
</ul>
<p>This would have been an <em>excellent</em> usage of a log line. You can include all kinds of context in the log line, and then use your observability backend tools to slice and dice the data to answer a much broader range of questions compared to the metric.</p>
<p>Here's the same code, after a review from a seasoned mentor:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">send_video_to_friend</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> video_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> friend_email</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># build content for an email to the user's friend</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    email_body </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> build_email_body</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">user_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> video_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    
</span><span class="token comment" style="color:#999988;font-style:italic"># Send the email</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> send_email</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">email_body</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> friend_email</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    metrics</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">increment_counter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"petface_friend_email_sent"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      labels</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"success"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> result</span><span class="token punctuation" 
style="color:#393A34">.</span><span class="token plain">success</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    logger</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">info</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      event</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"FriendEmailSent"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      message</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"friend email sent"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      labels</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"user_id"</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> user_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"video_id"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> video_id</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"friend_email"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> friend_email</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"success"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> result</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">success</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre></div></div>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Metrics vs logs: cost scaling</div><div class="admonitionContent_BuS1"><p>Tracking the metric <code>petface_friend_email_sent</code> would cost pennies regardless of how wildly successful this feature becomes, while the cost of the log line would scale linearly with number of requests- and could get expensive at high volumes.</p></div></div>
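<p>What do you get for that extra cost? The ability to ask the open-ended questions from earlier. Here's a hedged sketch of answering "which users have the highest send failure rate?" from log records. In real life you'd do this in your observability backend's query language; plain Python is just a stand-in here, and the records are made up to match the fields in the <code>logger.info</code> call above.</p>
<pre><code class="language-python">from collections import defaultdict


def failure_rate_by_user(log_records):
    """Group FriendEmailSent log records by user and compute each failure rate."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in log_records:
        if record["event"] != "FriendEmailSent":
            continue  # ignore unrelated log lines
        totals[record["user_id"]] += 1
        if not record["success"]:
            failures[record["user_id"]] += 1
    return {user: failures[user] / totals[user] for user in totals}


# Made-up log records; a real system would read these from the log store.
records = [
    {"event": "FriendEmailSent", "user_id": "u1", "video_id": "v1", "success": True},
    {"event": "FriendEmailSent", "user_id": "u1", "video_id": "v2", "success": False},
    {"event": "FriendEmailSent", "user_id": "u2", "video_id": "v3", "success": True},
    {"event": "HttpRequest", "user_id": "u2", "success": True},
]
rates = failure_rate_by_user(records)
</code></pre>
<p>Nobody had to plan this question in advance: the <code>user_id</code> was sitting in each log line, so the grouping can be decided at query time. The two-series metric can't do that, because the per-user detail was never recorded.</p>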
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="did-you-forget-about-traces">Did you forget about traces?<a href="https://labaneilers.com/observability-signals-choose-wisely#did-you-forget-about-traces" class="hash-link" aria-label="Direct link to Did you forget about traces?" title="Direct link to Did you forget about traces?" translate="no">​</a></h3>
<p>Up to this point, PetFace has essentially been a single, horizontally scaled backend API server. While traces can be used in a monolithic app (e.g. to visualize key functions in the call stack), their utility there is more limited.</p>
<p>Let's imagine PetFace has gone absolutely gangbusters, had a successful IPO, and now has 15 different dev teams working on dozens of different supporting microservices: advertising placements, data collection for model training, social networking features, etc. We've even got yoga classes and catered lunches <em>two days a week</em>!</p>
<p>At this scale, system failures are a lot harder to diagnose with logs or metrics. There's all kinds of complex, second-order effects, backpressure, and subtle failure modes across multiple services in the system.</p>
<p>For example, when the video processing microservice's p95 performance suddenly takes a nosedive, everyone starts scrambling around trying to figure out the root cause, but it's hard to get a broad view of the whole system when each service's logs and metrics are separate.</p>
<p>This is where tracing absolutely shines.</p>
<p>Let's take the example above, but imagine we have our system instrumented with OpenTelemetry tracing, pointing to a backend like Honeycomb or Grafana Tempo. These tools let you quickly identify a few example traces representing the slow video processing requests, then drill down into the spans in each trace, <em>across all microservices transitively involved in the operation</em>. Spans are visualized as a tree, where the width of each span represents its duration, and child operations are nested recursively underneath.</p>
<figure><img src="https://labaneilers.com/assets/images/traces-0025ff7d07d490a0b377bdf5246434b3.png" alt="Visualization of traces"><figcaption>Visualizing traces</figcaption></figure>
<p>Behold, the power of the correlation ID!</p>
<p>Now you can easily see that, 3 services deep, there's a slow PostgreSQL request from the ad placement service. Oh crap- it looks like there's lock contention in the database during high load!</p>
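<p>If you're curious what "drilling down" amounts to, here's a toy sketch of a trace as a tree of spans, plus a walk that finds the span with the most time spent in itself rather than in its children. The <code>Span</code> class, the service names, and the durations are all invented for illustration; this is not the OpenTelemetry data model, and it assumes child spans run one after another without overlapping.</p>
<pre><code class="language-python">from dataclasses import dataclass, field


@dataclass
class Span:
    service: str
    operation: str
    duration_ms: float
    children: list = field(default_factory=list)


def self_time(span):
    """Time spent in the span itself, excluding its (sequential) child spans."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def slowest_span(root):
    """Walk the span tree and return the span with the largest self time."""
    slowest = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time(span) > self_time(slowest):
            slowest = span
        stack.extend(span.children)
    return slowest


# A made-up trace: the request crosses three services before hitting the database.
query = Span("ad-placement", "postgres SELECT", 900)
ads = Span("ad-placement", "get_ads", 950, [query])
feed = Span("feed", "build_feed", 1000, [ads])
root = Span("video-processing", "process_video", 1100, [feed])

culprit = slowest_span(root)
</code></pre>
<p>Every parent span looks slow because it's waiting on its child. Subtracting child durations is what points at the database query as the place where the time actually goes.</p>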
<p>Wow- traces are amazing. I mentioned before that the spans that make up traces are basically just a special, rich type of log, and that metrics can be derived from logs. So if we have tracing... why would we want to use any other signal?</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-3-signals-are-complementary">The 3 signals are complementary<a href="https://labaneilers.com/observability-signals-choose-wisely#the-3-signals-are-complementary" class="hash-link" aria-label="Direct link to The 3 signals are complementary" title="Direct link to The 3 signals are complementary" translate="no">​</a></h3>
<p>There are a few reasons why the 3 different signals are complementary:</p>
<ul>
<li class="">Traces have roughly the same unit economics that logs do- they can get <em>very</em> expensive at scale. To make traces more affordable, many companies will do random sampling of traces (e.g. collect 1 out of every 5 traces), which will reduce the cost significantly, but also make the traces a lot less useful for forensic use cases (e.g. diagnosing a particular bad request). Sampling is also complex, and can be error prone.</li>
<li class="">There's a bunch of use cases that traces don't do a good job of handling today (e.g. long running operations, async operations, capturing state from a crash, representing background processes)</li>
<li class="">Traces are a lot newer and less mature than logs and metrics; the tooling ecosystems are more limited.</li>
</ul>
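<p>To give a feel for the sampling tradeoff, here's a minimal sketch of keeping roughly 1 out of every 5 traces. Note that it swaps pure random sampling for a deterministic hash of the trace ID, so that every service handling the same trace would reach the same keep/drop decision. The function name and the trace ID format are assumptions, and this is not the API of OpenTelemetry or any other tracing library.</p>
<pre><code class="language-python">import hashlib


def should_sample(trace_id, one_in=5):
    """Keep roughly 1 out of every `one_in` traces, decided from the trace ID alone."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and bucket it.
    bucket = int.from_bytes(digest[:8], "big") % one_in
    return bucket == 0


# Hypothetical trace IDs; real ones would come from the tracing library.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [trace_id for trace_id in trace_ids if should_sample(trace_id)]
</code></pre>
<p>The catch is exactly the one described above: the one specific bad request you care about has about a 4 in 5 chance of being among the traces that were dropped.</p>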
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Want to deep dive on this topic?</div><div class="admonitionContent_BuS1"><p>If you're interested in a deeper dive into my opinion that we still need logs and metrics (and why some very smart people disagree with me), check out my previous post: <a class="" href="https://labaneilers.com/are-we-ready-for-observability-2.0">Are we ready for Observability 2.0?</a>.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cheat-sheet">Cheat sheet<a href="https://labaneilers.com/observability-signals-choose-wisely#cheat-sheet" class="hash-link" aria-label="Direct link to Cheat sheet" title="Direct link to Cheat sheet" translate="no">​</a></h2>
<p>I'll leave you with some quick heuristics to help you match signals to use cases:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-use-logs">When to use logs<a href="https://labaneilers.com/observability-signals-choose-wisely#when-to-use-logs" class="hash-link" aria-label="Direct link to When to use logs" title="Direct link to When to use logs" translate="no">​</a></h3>
<ul>
<li class="">When you're building a new application. Start with structured logs, and log liberally.</li>
<li class="">Forever after. Logs are always useful to help ask arbitrary, novel questions of your system.</li>
<li class="">When you need the ability to retroactively troubleshoot specific, individual events.</li>
<li class="">When you have strict requirements for auditing, security, or regulatory compliance.</li>
<li class="">For edge cases that traces don't handle well: capturing crash data, initialization and background processes, correlation of related async tasks, etc</li>
</ul>
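<p>If "structured logs" is new to you, here's a rough sketch using only Python's standard library. The <code>JsonFormatter</code> class, the <code>fields</code> key, and the example attributes are my own hypothetical names, not part of any logging standard.</p>

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via extra={"fields": {...}}.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Hypothetical event: the key/value pairs are what make it queryable later.
logger.info(
    "payment declined",
    extra={"fields": {"order_id": "o-1234", "reason": "insufficient_funds"}},
)
```

<p>Because every attribute is its own field rather than text buried in a message string, you can filter and group on it later without writing regexes.</p>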
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-use-metrics">When to use metrics<a href="https://labaneilers.com/observability-signals-choose-wisely#when-to-use-metrics" class="hash-link" aria-label="Direct link to When to use metrics" title="Direct link to When to use metrics" translate="no">​</a></h3>
<ul>
<li class="">To track important numeric indicators of system or business health over time (e.g. request counts, error counts, CPU temperature, WiFi signal strength, request duration distributions) that you probably want to alert on (e.g. request rate drops, error rate jumps)</li>
<li class="">When your log volume starts to get expensive, and your system is mature/stable enough to reduce the verbosity/frequency of your logs, and you can fill in the resulting gaps with some key time-series</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-use-traces">When to use traces<a href="https://labaneilers.com/observability-signals-choose-wisely#when-to-use-traces" class="hash-link" aria-label="Direct link to When to use traces" title="Direct link to When to use traces" translate="no">​</a></h3>
<ul>
<li class="">When your system grows more complex, especially when you break it up into multiple services (or APIs, databases, etc)</li>
<li class="">To provide a deep understanding of the behavior and topology of your (usually distributed) system, including the ability to ask ad-hoc, arbitrary questions</li>
<li class="">To help understand the performance characteristics of your system, and especially to find the root cause of performance problems</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="general-tips">General tips<a href="https://labaneilers.com/observability-signals-choose-wisely#general-tips" class="hash-link" aria-label="Direct link to General tips" title="Direct link to General tips" translate="no">​</a></h3>
<ul>
<li class="">If you've already got sufficient visibility into a specific operation from one signal, and another signal would be duplicative, you may want to drop the one that's less rich (e.g. maybe drop a log line if you're already capturing an un-sampled span).</li>
<li class="">Don't prematurely optimize for cost; lean on richer signals (traces, logs) and avoid sampling until it becomes an actual problem. The ability to ask ad-hoc questions of your system is very important, especially early in a product's lifecycle.</li>
</ul>]]></content:encoded>
            <category>observability</category>
            <category>opentelemetry</category>
            <category>devops</category>
            <category>platform-engineering</category>
        </item>
        <item>
            <title><![CDATA[OpenTelemetry's secret weapon]]></title>
            <link>https://labaneilers.com/opentelemetry-secret-weapon</link>
            <guid>https://labaneilers.com/opentelemetry-secret-weapon</guid>
            <pubDate>Wed, 12 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[As OpenTelemetry's adoption has surged, it's drawn increasing criticism: it's complex, isn't fully matured, and its user experience can feel... unpolished. While these are valid gripes, I think we've hit an inflection point where OTel's benefits outweigh its pain points, especially when compared to the alternative of proprietary telemetry pipelines and lock-in with the dominant (and outrageously expensive) vendors.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>As OpenTelemetry's adoption has surged, it's drawn increasing criticism: it's complex, isn't fully matured, and its user experience can feel... unpolished. While these are valid gripes, I think we've hit an inflection point where OTel's benefits outweigh its pain points, especially when compared to the alternative of proprietary telemetry pipelines and lock-in with the dominant (and outrageously expensive) vendors.</p><p>In the next year or so, I think its benefits are going to increase dramatically due to its secret weapon: <strong>semantic conventions</strong>. These conventions allow any observability vendor to create the same rich, powerful, out-of-the-box user experiences that the dominant players had locked-down via their ownership of the entire telemetry pipeline.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/otel-secret-weapon-1552bd579b516cf0abc63bcc4530db86.jpg" class="blog-image" alt="A superhero holding a secret weapon">
<p>Now that OpenTelemetry has gained such significant traction, it's starting to attract a lot of attention beyond the hardcore observability community. While most of what I read about OTel is pretty positive, it also draws its fair share of shade.</p>
<p>Honestly, I get it. As a big user of OTel, I've spent plenty of hours rage debugging OTTL filters in OTel collectors, desperately searching for SDK examples that actually work, or pulling my hair out trying to figure out which version of a protobuf schema changed and broke telemetry from my Swift clients. And like a lot of the haters, I also get frustrated that the overall level of complexity in the specifications leaks into the implementation details of every part of the project.</p>
<p>All that said, I'm incredibly thankful that OTel exists. While it's still in its awkward teenage years, it's already changing the industry dramatically. Despite the challenges that currently exist, I think <strong>OpenTelemetry is the only reasonable path forward for observability</strong>.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="so-like-i-totally-already-know-what-opentelemetry-is-but-just-give-me-a-little-refresher">So, like, I totally already know what OpenTelemetry is, but just give me a little refresher<a href="https://labaneilers.com/opentelemetry-secret-weapon#so-like-i-totally-already-know-what-opentelemetry-is-but-just-give-me-a-little-refresher" class="hash-link" aria-label="Direct link to So, like, I totally already know what OpenTelemetry is, but just give me a little refresher" title="Direct link to So, like, I totally already know what OpenTelemetry is, but just give me a little refresher" translate="no">​</a></h2>
<p>OpenTelemetry is an umbrella project with the broad and audacious mission to provide a set of open tools and standards for building telemetry pipelines.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>WTF is a telemetry pipeline?</div><div class="admonitionContent_BuS1"><p>A telemetry pipeline is a set of tools that create, collect, and process telemetry data before sending it to a backend (e.g. Prometheus, DynaTrace, Honeycomb, etc) for storage and analysis. This includes:</p><ul>
<li class="">The libraries/agents that instrument your application, and the code that uses them to record events, take measurements, collect performance data, and emit it as metrics, logs, traces, etc.</li>
<li class="">Any intermediate tools that collect the data, process/filter it, and forward it to a backend (e.g. DogStatsD, LogStash, Fluent Bit, Telegraf, etc)</li>
</ul></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-before-times">The before times<a href="https://labaneilers.com/opentelemetry-secret-weapon#the-before-times" class="hash-link" aria-label="Direct link to The before times" title="Direct link to The before times" translate="no">​</a></h3>
<p>Before OTel, the tools available were mostly proprietary and vendor-specific. This has some big downsides:</p>
<ul>
<li class=""><strong>Black boxes</strong>: Telemetry agents/libraries/SDKs (the thing you install into your app to collect metrics, logs, etc) were closed source, not modifiable or verifiable by users. If they didn't work <em>exactly</em> as you needed, you could really only hope to persuade your vendor to add a feature for you.</li>
<li class=""><strong>Vendor lock-in</strong>: Since most telemetry agents were vendor-specific, switching between them could be <em>very</em> labor intensive, which gave vendors enormous leverage to develop predatory, rent-seeking pricing strategies.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="ok-so-what-actually-is-opentelemetry">OK, so what <em>actually</em> is OpenTelemetry?<a href="https://labaneilers.com/opentelemetry-secret-weapon#ok-so-what-actually-is-opentelemetry" class="hash-link" aria-label="Direct link to ok-so-what-actually-is-opentelemetry" title="Direct link to ok-so-what-actually-is-opentelemetry" translate="no">​</a></h3>
<p>OpenTelemetry provides an alternative ecosystem, starting with some <strong>open specifications</strong>:</p>
<ul>
<li class="">A standard telemetry protocol (OTLP) that defines the low-level shape of telemetry data (including traces, metrics, and logs) and how it gets transmitted</li>
<li class="">A lightweight, standardized API that can be used to add instrumentation to your code</li>
<li class="">SDKs that implement the API and configure your applications to export telemetry data</li>
<li class="">Semantic conventions that define the meaning of fields in telemetry data</li>
</ul>
<p>And then, more practically for users, there's some specific software:</p>
<ul>
<li class=""><strong>Language-specific SDKs</strong>: A set of (mostly language-specific) SDKs that implement the SDK and API specifications. This is what you'd install into your apps (as an alternative to a vendor-specific agent). You'd use the APIs to sprinkle manual instrumentation throughout your first-party code, and framework/library authors would use the API to add it to theirs.</li>
<li class=""><strong>Auto-instrumentation</strong>: A vast ecosystem of auto-instrumentation libraries and tools that automatically instrument all the most popular frameworks (e.g. Django, Express, Spring etc) and libraries (http, Postgres, Redis, Mongo, etc), using the API under the hood.</li>
<li class=""><strong>The OpenTelemetry Collector</strong>: A server application that's a sort of "universal translator" for telemetry data. It can accept telemetry in virtually any format you can imagine (e.g. statsd, Prometheus, Zipkin, InfluxDB, Loki, etc), apply arbitrary transformations, filters, and enrichments, and then export the data to an equally astonishing number of destination formats (e.g. Jaeger, Prometheus, Splunk, Honeycomb, DataDog, Kafka, ClickHouse, etc).</li>
<li class=""><strong>The OTel ecosystem</strong>: A huge ecosystem of tooling (e.g. Kubernetes operators, helm charts, eBPF agents) that compose the SDKs, collector, protocols, etc, to create new capabilities.</li>
</ul>
<p>And the part of OTel that astonishes me the most:</p>
<ul>
<li class=""><strong>Vendor support</strong>: Basically <strong>every observability vendor you can think of</strong> now supports OTel.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wait-why-are-all-the-vendors-supporting-opentelemetry">Wait, why are all the vendors supporting OpenTelemetry?<a href="https://labaneilers.com/opentelemetry-secret-weapon#wait-why-are-all-the-vendors-supporting-opentelemetry" class="hash-link" aria-label="Direct link to Wait, why are all the vendors supporting OpenTelemetry?" title="Direct link to Wait, why are all the vendors supporting OpenTelemetry?" translate="no">​</a></h2>
<p>The most incredible thing about OpenTelemetry is that it even exists at all, much less that it's supported by virtually all observability vendors. Why would they voluntarily back a project that promotes vendor-neutrality, and <em>undermines their ability to lock-in customers</em>?</p>
<p>Early on, there was certainly an advantage to smaller vendors adopting OTel as a competitive differentiator. What's surprising is how much pressure this put on the industry, which slowly moved up-market, like a growing tsunami, until even the most dominant vendors felt pressure to offer at least some support (or lip-service for support).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-actual-threat-of-opentelemetry">The actual threat of OpenTelemetry<a href="https://labaneilers.com/opentelemetry-secret-weapon#the-actual-threat-of-opentelemetry" class="hash-link" aria-label="Direct link to The actual threat of OpenTelemetry" title="Direct link to The actual threat of OpenTelemetry" translate="no">​</a></h2>
<p>Before OTel, the dominant vendors had for years enjoyed an advantage that wasn't particularly obvious: <em>their end-to-end ownership of the telemetry pipeline gave them full understanding of the meaning of the data they collected</em>.</p>
<p>If you've ever used one of the big tools, like DataDog or New Relic, you've probably experienced the customer experience benefits of this capability first hand:</p>
<ol>
<li class="">Install the vendor's proprietary agent into your apps (directly, or through some platform-wide integration)</li>
<li class="">Open up the vendor's dashboard, and instantly see gorgeous dashboards with all kinds of pre-built visualizations and alerts, all out of the box. Revel in the insights and ease of use.</li>
</ol>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>A few more steps</div><div class="admonitionContent_BuS1"><p>I forgot to mention a couple other steps:</p><ol start="3">
<li class="">Get your first bill, freak out at the number of digits, and instruct your team to agonize over every instrumentation choice in a futile effort to manage costs</li>
<li class="">As the contract renewal negotiation approaches, realize you're completely fucked now that you've fully integrated their proprietary telemetry pipeline</li>
</ol></div></div>
<p>The obvious part is that OTel threatens this advantage by giving us the ability to create our own telemetry pipelines. But the real kicker, hidden in the list of stuff above that OTel does, is what I think is the most powerful and underappreciated part of the project:</p>
<ul>
<li class=""><strong>Semantic conventions that define the meaning of fields in telemetry data</strong></li>
</ul>
<p>Why is this item so important?</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="semantic-conventions-are-what-made-the-dominant-observability-tools-so-damn-useful">Semantic conventions are what made the dominant observability tools so damn useful<a href="https://labaneilers.com/opentelemetry-secret-weapon#semantic-conventions-are-what-made-the-dominant-observability-tools-so-damn-useful" class="hash-link" aria-label="Direct link to Semantic conventions are what made the dominant observability tools so damn useful" title="Direct link to Semantic conventions are what made the dominant observability tools so damn useful" translate="no">​</a></h2>
<p>When you open that first dashboard in DataDog or New Relic, you feel like you're standing in the Pentagon war room or something. There are so many graphs with data you've been dying to visualize, especially about your infrastructure, that were clearly very carefully curated. Your Kubernetes clusters are suddenly transparent and stripped of all their mystery. Within just a few seconds you can get a broad sense of the state of your system, and then drill down with a few clicks to find individual logs representing anomalies. The tooling gives you the ability to correlate data from different signals, allowing you to jump back and forth between aggregate performance graphs and individual exemplar pod data.</p>
<p>The dominant, legacy vendors have been able to create this user experience because they've utilized their end-to-end ownership of the telemetry pipeline to establish (proprietary) conventions about the <em>meaning of the data they're collecting</em>.</p>
<ul>
<li class="">They scoop up all that rich Kubernetes and cloud provider metadata (pods, nodes, namespaces, EC2 instance type, region, AMI ID, etc) and use it to enrich metrics and log lines with labels</li>
<li class="">They collect performance data using specific conventions and units</li>
<li class="">They auto-instrument your applications using language-specific instrumentation agents that know how to instrument all the most popular libraries and frameworks, using consistent metadata across all of them</li>
<li class="">They track the lifecycle of requests through your system, and correlate signals using identifiers sent through as (e.g.) custom HTTP headers</li>
</ul>
<p>By the time the telemetry data gets to their system, they already know an enormous amount about what it all actually means. They can then use this to create higher-level visualizations which understand all that Kubernetes and cloud provider metadata, supply canned alerts that handle a bunch of common Kubernetes failure modes, or build maps of services that show all the runtime relationships by using the correlation IDs they injected into your HTTP requests.</p>
<p>This is why I think of OpenTelemetry's semantic conventions as a secret weapon. Now <em>all observability vendors</em> have the ability to go beyond just being able to ingest data from an OTel-based pipeline; they can now add this kind of rich, powerful, out-of-the-box user experience powered by a <em>deep understanding of the meaning of the data they're collecting</em>.</p>
<p>And increasingly, they are!</p>
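<p>Here's a toy illustration of why shared attribute names matter. The keys <code>service.name</code>, <code>k8s.pod.name</code>, and <code>http.response.status_code</code> come from OTel's semantic conventions; the spans themselves, the flattened dict shape, and the <code>error_rate_by_service</code> function are hypothetical stand-ins for what a backend might do.</p>

```python
from collections import defaultdict

# Hypothetical spans, flattened to dicts for brevity.
spans = [
    {"service.name": "checkout", "k8s.pod.name": "checkout-7d9f", "http.response.status_code": 200},
    {"service.name": "checkout", "k8s.pod.name": "checkout-7d9f", "http.response.status_code": 503},
    {"service.name": "cart", "k8s.pod.name": "cart-5c2a", "http.response.status_code": 200},
    {"service.name": "cart", "k8s.pod.name": "cart-5c2a", "http.response.status_code": 200},
]


def error_rate_by_service(spans):
    """Compute the share of 5xx responses per service.

    This works for any application, with no per-customer configuration,
    because the attribute names are agreed on in advance.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for span in spans:
        service = span["service.name"]
        totals[service] += 1
        if span["http.response.status_code"] >= 500:
            errors[service] += 1
    return {service: errors[service] / totals[service] for service in totals}


print(error_rate_by_service(spans))
```

<p>A vendor can ship that kind of query as a canned dashboard or alert precisely because it never has to ask what your team happened to name the status code field.</p>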
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="its-not-perfect-but-its-already-better-than-the-alternative">It's not perfect, but it's already better than the alternative<a href="https://labaneilers.com/opentelemetry-secret-weapon#its-not-perfect-but-its-already-better-than-the-alternative" class="hash-link" aria-label="Direct link to It's not perfect, but it's already better than the alternative" title="Direct link to It's not perfect, but it's already better than the alternative" translate="no">​</a></h2>
<p>So when I read posts criticizing OpenTelemetry for its complexity, or its lack of maturity, or claiming that it's designed by committee, or it's hindered by having to satisfy the lowest common denominator, I can simultaneously agree with all that, while still believing that it's the best way forward.</p>
<p>Like most technical choices, this is fundamentally about tradeoffs, and I think OpenTelemetry has hit the inflection point where its benefits have begun to outweigh its pain points for many, many situations. I'm acutely aware that there's still a lot of friction in getting OTel set up, but you've got to compare it to the alternatives.</p>
<p>OpenTelemetry has finally become a workable enough solution that allows an organization, with enough effort, to develop a top-shelf observability capability - without having to trade in their firstborn child in exchange for, say, a few months of custom metrics.</p>
<p>And OTel is only going to get better. So many brilliant people are contributing, vendors are investing heavily, the ecosystem is growing, and adoption is surging.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="an-otel-anecdote-that-brings-me-joy">An OTel anecdote that brings me joy<a href="https://labaneilers.com/opentelemetry-secret-weapon#an-otel-anecdote-that-brings-me-joy" class="hash-link" aria-label="Direct link to An OTel anecdote that brings me joy" title="Direct link to An OTel anecdote that brings me joy" translate="no">​</a></h2>
<p>I'd like to leave you with one of my favorite moments in OpenTelemetry from the last year:</p>
<p>The OpenTelemetry community implemented a <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/datadogreceiver" target="_blank" rel="noopener noreferrer" class="">DataDog receiver for the OTel collector</a>. When I say <em>the community</em>, I'm saying it specifically did NOT come from DataDog; it was driven by community members and <em>competing vendors</em> (e.g. Grafana Labs, Splunk). This receiver, along with some <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/deltatocumulativeprocessor" target="_blank" rel="noopener noreferrer" class="">supporting processors</a>, allows organizations that are stuck using DataDog's proprietary agents to redirect their telemetry data to any vendor that supports OTel.</p>
<p>Imagine how DataDog sales felt the first time this came up in a contract negotiation!</p>]]></content:encoded>
            <category>observability</category>
            <category>opentelemetry</category>
            <category>devops</category>
            <category>platform-engineering</category>
        </item>
        <item>
            <title><![CDATA[Are we ready for Observability 2.0?]]></title>
            <link>https://labaneilers.com/are-we-ready-for-observability-2.0</link>
            <guid>https://labaneilers.com/are-we-ready-for-observability-2.0</guid>
            <pubDate>Thu, 30 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Observability 2.0 is a vision of observability that seeks to replace the traditional "three pillars" of observability (metrics, logs, and traces) with a single source of truth: wide events.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>Observability 2.0 is a vision of observability that seeks to replace the traditional "three pillars" of observability (metrics, logs, and traces) with a single source of truth: wide events.</p><p>This vision is compelling, but there are a number of obstacles that make it difficult to adopt in practice. We're now thinking about Observability 2.0 as a philosophy we can work towards gradually.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/bee-69ae3c0ffe756b7d6ce5fac5f5a5e0e8.jpg" class="blog-image" alt="A bee looking through a telescope">
<p>At SimpliSafe, we manage a pretty large system of microservices. Because we're entrusted by our customers to protect their homes and families, we take reliability of our systems pretty seriously, so observability is pretty important to us.</p>
<p>Like most companies today, our observability strategy is built around the "three pillars" approach: metrics, logs, and traces. Of the three, we're currently the most dissatisfied with our logging tooling, and have been working on finding a better product.</p>
<p>We're already a Honeycomb customer, and in our conversation with them, they ended up making a pretty interesting case that we should consider a new approach: ditch the three pillars and <em>make traces the center of our strategy</em>. We had up to this point been thinking very incrementally, and here was Honeycomb, coming in hot with a bold and revolutionary vision: Observability 2.0.</p>
<!-- -->
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="wtf-is-observability-20">WTF is Observability 2.0?<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#wtf-is-observability-20" class="hash-link" aria-label="Direct link to WTF is Observability 2.0?" title="Direct link to WTF is Observability 2.0?" translate="no">​</a></h3>
<p>The term "<a href="https://www.honeycomb.io/blog/one-key-difference-observability1dot0-2dot0" target="_blank" rel="noopener noreferrer" class="">Observability 2.0</a>" was coined by Charity Majors (the CTO and co-founder of Honeycomb) as a shorthand for a new vision of observability, defined in opposition to the "three pillars" model (metrics, logs, and traces).</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>charity.wtf</div><div class="admonitionContent_BuS1"><p>If you're somehow one of the 5 technology people on the internet that isn't already aware of <a href="https://charity.wtf/" target="_blank" rel="noopener noreferrer" class="">Charity's blog</a>, you should really stop reading this drivel and head there immediately. She's a fountain of insight on technology, engineering management, observability, and many other topics.</p><p>Also, I love her writing style, and her use of profanity as a persuasive device is fucking delightful.</p></div></div>
<p>Here's the gist of Observability 2.0:</p>
<ul>
<li class="">Collect telemetry in the form of "wide events" (usually traces, but also maybe structured logs) which can contain an arbitrary number of key/value pairs with high cardinality values</li>
<li class="">Your observability tooling should allow you to query these events to answer arbitrarily complex questions about your system</li>
</ul>
<p>In this vision, traditional metrics (i.e. timeseries emitted directly from your apps) are considerably less valuable, because they, by definition, are stripped of context when they're aggregated into timeseries. Since cardinality is the main driver of cost in a timeseries database, you have to choose which attributes you care about (ensuring they're low/moderate cardinality), and drop the rest. This means you also have to know what questions you'll want to ask of your system <em>in advance of deploying the code</em>.</p>
<p>With wide events, on the other hand, you keep all that rich, high-cardinality data, and then you get to slice and dice it in ways you couldn't have predicted you'd need in advance. You can use wide events to derive new metrics on the fly, and use the rich attributes to dig into specific anomalies in ways that are impractical with pre-aggregated timeseries data.</p>
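<p>As a rough sketch of what "slice and dice" means here: the events, field names, and <code>mean_duration_by</code> helper below are all invented for illustration, and a real backend would run this kind of query over billions of rows rather than a Python list.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical wide events: one per request, with high-cardinality fields kept.
events = [
    {"route": "/checkout", "customer_id": "c-17", "app_version": "2.4.1", "duration_ms": 120},
    {"route": "/checkout", "customer_id": "c-17", "app_version": "2.4.1", "duration_ms": 980},
    {"route": "/checkout", "customer_id": "c-42", "app_version": "2.4.0", "duration_ms": 110},
    {"route": "/cart", "customer_id": "c-42", "app_version": "2.4.0", "duration_ms": 45},
]


def mean_duration_by(events, key):
    """Derive a metric on the fly, grouped by whichever attribute you pick."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event["duration_ms"])
    return {value: mean(durations) for value, durations in groups.items()}


# The grouping key is chosen at query time, not at instrumentation time.
print(mean_duration_by(events, "route"))
print(mean_duration_by(events, "customer_id"))
```

<p>With a pre-aggregated timeseries, the second query is only possible if someone decided to keep <code>customer_id</code> as a label before the code shipped.</p>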
<p>Furthermore, in Observability 2.0, logs become somewhat redundant with traces, too. Traces are an ideal superset of logs: structured events, but with additional conventions around how they represent units of work and their relationships (i.e. spans with durations, parents, siblings, and children).</p>
<p>This is a pretty compelling vision: one universal source of truth for all your observability data. You don't have to suffer the cost of 3 pillars when you can derive all the same value from wide events, and you get to simplify your whole practice around a smaller, more powerful set of tools.</p>
<p>Rock and roll! This sounds amazing! Sign me up!</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-gap-between-vision-and-reality">The gap between vision and reality<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#the-gap-between-vision-and-reality" class="hash-link" aria-label="Direct link to The gap between vision and reality" title="Direct link to The gap between vision and reality" translate="no">​</a></h3>
<p>To be fair, the idea of Observability 2.0 is something my colleagues at SimpliSafe and I have been thinking about for a while now. Charity has written and spoken a whole lot about it, and because of the elegance of the idea (and her fantastic writing), we already understood the core ideas, and had experience with Honeycomb, so we were primed to consider something a bit more forward thinking.</p>
<p>When Honeycomb came back to us with a concrete plan, we realized it was finally time to take off our incrementalist hats and start considering what it would really mean to go all in on traces as the source of truth.</p>
<p>It didn't take long for us to start enumerating concerns.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="in-defense-of-traditional-metrics">In defense of traditional metrics<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#in-defense-of-traditional-metrics" class="hash-link" aria-label="Direct link to In defense of traditional metrics" title="Direct link to In defense of traditional metrics" translate="no">​</a></h3>
<p>I may have oversimplified the argument against metrics; <a href="https://charity.wtf/2022/04/13/the-truth-about-meh-trics/#bmetrics_and_observability_have_different_use_casesb" target="_blank" rel="noopener noreferrer" class="">Charity's actual words</a> are a bit nuanced. For instance, she still sees value in using metrics to monitor infrastructure (especially third party infra). It's just that for the code you're writing yourself (the code at the core of your business, which is changing constantly), metrics alone aren't sufficient to understand what's going on.</p>
<p>This is absolutely true. Metrics <em>alone</em> are insufficient to really understand the behavior of your system. You absolutely need high-cardinality data such as logs and traces to answer many types of questions.</p>
<p>But I would argue that traditional metrics are still necessary as a <em>complement</em> to wide events due to the <em>huge and fundamental cost discrepancy</em> between the two.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="lets-talk-about-cost">Let's talk about cost<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#lets-talk-about-cost" class="hash-link" aria-label="Direct link to Let's talk about cost" title="Direct link to Let's talk about cost" translate="no">​</a></h3>
<p>The reason we still need metrics comes down to the <em>enormous</em> discrepancy in cost between metrics and wide events. Metrics are <em>extraordinarily cheap</em>. Like a <em>fraction</em> of the cost of logs or traces, and the comparison gets more and more one-sided as you scale up.</p>
<ul>
<li class="">Timeseries metrics are incredibly compact and lightweight. Cost is driven by the number of timeseries you're producing: unique combinations of label/value pairs (e.g. <code>pod</code>, <code>request_path</code>, <code>response_code</code>). These pairs get stored once, and then the rest of the cost is basically just a number stored every interval (30 seconds or so).
The tradeoff is that you lose context of the individual events from which the aggregate measurement was derived.</li>
<li class="">Wide events are fundamentally a more heavyweight signal than metrics, since each event is basically its own bag of string attributes. Vendors generally charge for each span/event (or by number of bytes) ingested. You also pay a cost at runtime: there's more data to generate, encode, buffer, and transmit.</li>
</ul>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>wide events, caviar, and private islands</div><div class="admonitionContent_BuS1"><p>And let's just be clear, if money were no object, <strong>of course</strong> I'd rather have full-fidelity wide events than metrics for everything!</p><p>I'd also prefer to dine exclusively in restaurants with Michelin stars, but I don't have the means to do that either.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="lets-compare-costs-as-you-scale">Let's compare costs as you scale<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#lets-compare-costs-as-you-scale" class="hash-link" aria-label="Direct link to Let's compare costs as you scale" title="Direct link to Let's compare costs as you scale" translate="no">​</a></h3>
<p>As the frequency/volume of events increases, the fundamental overhead of wide events grows increasingly impractical in terms of computation, network, and storage... and ultimately dollars. This is for three fundamental reasons:</p>
<ul>
<li class="">metrics are <em>aggregated</em> into a single value over an interval, whereas wide events are not</li>
<li class="">metric labels are stored only once per long-running series, unlike wide events, in which all attribute values are stored in full for every event</li>
<li class="">wide events, by design, carry a lot more data (all the high-cardinality goodness you don't want on your metrics)</li>
</ul>
<p>For example: imagine you have 10 pods which each handle 100 requests per second. You can instrument this with a single histogram metric, at the cost of a few active series (let's say 10 for the sake of argument) per pod. With traces, you're producing <em>100 spans per second, per pod</em> (one per request).</p>
<ul>
<li class="">With metrics, you have 100 active timeseries. Using <a href="https://grafana.com/pricing/" target="_blank" rel="noopener noreferrer" class="">list pricing for Grafana Cloud</a> of $8 for 1K series, this will cost <strong>80 cents per month</strong>.</li>
<li class="">With traces, you're producing <em>1000 spans per second</em>. Using <a href="https://www.honeycomb.io/pricing" target="_blank" rel="noopener noreferrer" class="">list pricing for Honeycomb</a> of 100M events for $130, this will cost <strong>~$3300 per month</strong>.</li>
</ul>
<p>Now let's imagine you find a bottleneck, and optimize this code to be 10x faster. Now each of your 10 pods can handle 1000 requests per second. Awesome! Let's scale up our marketing spend and 10x traffic:</p>
<ul>
<li class="">With metrics, you'll still have 100 active timeseries, at <strong>80 cents per month</strong>.</li>
<li class="">With traces, you're now producing <em>10,000 spans per second</em>, which will cost <strong>~$33K per month</strong>.</li>
</ul>
<p>I wasn't kidding, right? And this is only the cost from Honeycomb; you're also paying the cost in terms of runtime overhead for creating all those spans.</p>
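<p>If you want to check the arithmetic yourself, here's a rough sketch of the comparison. It uses the list prices quoted above and assumes a 30-day month, so it lands at roughly $3.4K and $34K for traces, in the same ballpark as the rounded figures above. The function names are just illustrative, and real bills will vary with plan tiers and discounts.</p>

```python
# Back-of-the-envelope monthly cost, using the list prices quoted above.
# Assumption: a 30-day month; real bills depend on plan tiers and discounts.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

METRIC_PRICE_PER_1K_SERIES = 8.00      # dollars per 1,000 active series per month
TRACE_PRICE_PER_100M_EVENTS = 130.00   # dollars per 100 million events


def metrics_cost(active_series):
    """Cost depends only on how many series exist, not on request volume."""
    return active_series / 1_000 * METRIC_PRICE_PER_1K_SERIES


def traces_cost(spans_per_second):
    """Cost grows linearly with the number of spans emitted."""
    events_per_month = spans_per_second * SECONDS_PER_MONTH
    return events_per_month / 100_000_000 * TRACE_PRICE_PER_100M_EVENTS


pods, series_per_pod = 10, 10
for requests_per_pod in (100, 1_000):
    spans_per_second = pods * requests_per_pod
    print(
        f"{spans_per_second} spans/s: "
        f"metrics ${metrics_cost(pods * series_per_pod):.2f}/month, "
        f"traces ${traces_cost(spans_per_second):,.0f}/month"
    )
```

<p>The thing to notice is that <code>metrics_cost</code> never looks at request volume at all, while <code>traces_cost</code> is a straight line through it.</p>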
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Ingest-time aggregation is a game changer</div><div class="admonitionContent_BuS1"><p>There are now a number of new, ingest-time aggregation tools (e.g. <a href="https://grafana.com/docs/grafana-cloud/cost-management-and-billing/reduce-costs/metrics-costs/control-metrics-usage-via-adaptive-metrics/" target="_blank" rel="noopener noreferrer" class="">Grafana Cloud's Adaptive Metrics</a> or <a href="https://docs.chronosphere.io/control/shaping/rules" target="_blank" rel="noopener noreferrer" class="">Chronosphere's aggregation rules</a>) which can drastically reduce active series by aggregating away unneeded high-cardinality labels.</p><p>In the example above, if you only cared about the request duration percentiles across all pods (and not <em>per pod</em>), you could aggregate away the <code>pod</code> label, resulting in a single histogram for all 10 pods- at a cost of <strong>8 cents per month regardless of how many pods you scale to</strong>.</p><p>We're using Grafana's Adaptive Metrics at SimpliSafe, and it's reduced our metrics bill by about 80%.</p></div></div>
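<p>Here's a toy model of what "aggregating away" a label does to the series count. It's only a sketch: the pod names and bucket boundaries are made up, it counts bucket series only (to match the "10 series per pod" simplification above), and it says nothing about how Adaptive Metrics or Chronosphere actually implement this.</p>

```python
from itertools import product

# Hypothetical label values for one histogram: 10 pods, 10 buckets each.
pods = [f"pod-{i}" for i in range(10)]
buckets = [f"le={b}" for b in (5, 10, 25, 50, 100, 250, 500, 1000, 2500, "+Inf")]

# Each unique (pod, bucket) pair is one active series.
series = set(product(pods, buckets))


def aggregate_away(series, label_index):
    """Drop one label position and merge series that become identical."""
    return {tuple(v for i, v in enumerate(s) if i != label_index) for s in series}


without_pod = aggregate_away(series, label_index=0)
price_per_series = 8.00 / 1_000  # list price quoted above

print(len(series), "series before:", f"${len(series) * price_per_series:.2f}/month")
print(len(without_pod), "series after:", f"${len(without_pod) * price_per_series:.2f}/month")
```

<p>Adding more pods grows <code>series</code> but leaves <code>without_pod</code> at 10, which is why the aggregated cost stays flat as you scale out.</p>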
<p>Having the ability to choose metrics vs wide events for a given use-case can give you a lot more flexibility to find a good balance between the richness of telemetry data and its cost.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="wait-so-how-is-anybody-using-wide-events">Wait, so how is <em>anybody</em> using wide events?<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#wait-so-how-is-anybody-using-wide-events" class="hash-link" aria-label="Direct link to wait-so-how-is-anybody-using-wide-events" title="Direct link to wait-so-how-is-anybody-using-wide-events" translate="no">​</a></h3>
<p>You may think the cost scaling example above is contrived (which is true)- you'd probably use a lot more labels on your metrics in practice, which increases the number of series. But calculating the total cardinality you'd generate is complicated, because the interaction between metrics labels produces a sparse matrix; you don't pay for all <em>possible</em> combinations of label values, only the combinations that <em>actually occur</em>.</p>
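<p>A tiny illustration of that sparse matrix, using made-up request data: the number of series you're billed for is the count of label combinations that were actually observed, not the product of each label's cardinality.</p>

```python
# Hypothetical request log: (request_path, response_code) pairs seen in an interval.
observed_requests = [
    ("/login", "200"), ("/login", "401"),
    ("/devices", "200"), ("/devices", "200"),
    ("/health", "200"), ("/login", "200"),
]

paths = {path for path, _ in observed_requests}
codes = {code for _, code in observed_requests}

possible_series = len(paths) * len(codes)     # every cell in the full matrix
active_series = len(set(observed_requests))   # only combinations that occurred

print(f"possible: {possible_series}, active (billed): {active_series}")
```

<p>Here <code>/health</code> never returns a 401, so that cell of the matrix simply never becomes a series.</p>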
<p>So let's talk about the real world, specifically telemetry at SimpliSafe. If we were to send <strong>all</strong> of our traces to Honeycomb, it would be an <strong>order of magnitude more expensive</strong> than what we're currently paying for all our timeseries metrics.</p>
<p>But wait, you ask: there are plenty of other companies happily using tracing- how are they affording it?</p>
<p>Yep- they're probably doing what we're doing with our traces: <em>sampling</em>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="parents-have-you-talked-to-your-kids-about-sampling-traces">Parents, have you talked to your kids about sampling traces?<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#parents-have-you-talked-to-your-kids-about-sampling-traces" class="hash-link" aria-label="Direct link to Parents, have you talked to your kids about sampling traces?" title="Direct link to Parents, have you talked to your kids about sampling traces?" translate="no">​</a></h3>
<p>Sampling is a whole topic in itself; the gist is that you randomly select one out of every N traces, drop the rest, and rely on statistical extrapolation when you run queries. The attributes necessary to do sampling are actually built into the OpenTelemetry spec.</p>
<p>There's some tooling available that can use these attributes to do <a href="https://opentelemetry.io/docs/concepts/sampling/#tail-sampling" target="_blank" rel="noopener noreferrer" class="">tail sampling</a> (we use Honeycomb's <a href="https://github.com/honeycombio/refinery" target="_blank" rel="noopener noreferrer" class="">Refinery</a> for this). Honeycomb also does a nice job of extrapolating sampled values at query time based on the sample ratio, so it feels pretty transparent to a user- sometimes you forget you're working with sampled spans.</p>
<p>We're currently sampling at a 30/1 ratio, which, at our volume, is <em>usually</em> sufficient to get accurate enough aggregate measurements.</p>
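<p>To make the extrapolation idea concrete, here's a minimal sketch of uniform random sampling with query-time weighting. It is not how Refinery or Honeycomb work internally (tail sampling makes smarter decisions than a coin flip); the traffic numbers and the <code>sample_rate</code> field name are assumptions for illustration.</p>

```python
import random

SAMPLE_RATE = 30          # keep roughly 1 in 30 events
rng = random.Random(42)   # fixed seed so the sketch is repeatable

# Hypothetical traffic: 300,000 requests, 1% of which are errors.
events = [{"error": i % 100 == 0} for i in range(300_000)]

# Keep each event with probability 1/SAMPLE_RATE and record the rate on it.
kept = [
    {**event, "sample_rate": SAMPLE_RATE}
    for event in events
    if rng.randrange(SAMPLE_RATE) == 0
]

# Query-time extrapolation: each kept event stands in for sample_rate events.
estimated_total = sum(e["sample_rate"] for e in kept)
estimated_errors = sum(e["sample_rate"] for e in kept if e["error"])

true_errors = sum(1 for e in events if e["error"])
print(f"total:  true {len(events)}, estimated {estimated_total}")
print(f"errors: true {true_errors}, estimated {estimated_errors}")
```

<p>The estimate for the total tends to be close, while the estimate for the rarer error events is noisier, which is why "usually sufficient" depends a lot on volume.</p>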
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>What's your sampling ratio?</div><div class="admonitionContent_BuS1"><p>I had dinner recently at an Observability community event, and met a nice guy in a similar position to mine. When I found out he was using Honeycomb too, I asked him what sampling ratio they were using.</p><p>After I asked, I suddenly felt awkward, like maybe that was too personal a question to ask someone I'd just met.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/awkward-a45f932c11803ca56abc04354098626d.png" alt="Asking someone what sampling ratio they use with Honeycomb">
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="sampling-is-complicated-and-error-prone">Sampling is complicated and error prone<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#sampling-is-complicated-and-error-prone" class="hash-link" aria-label="Direct link to Sampling is complicated and error prone" title="Direct link to Sampling is complicated and error prone" translate="no">​</a></h4>
<p>There are a <em>lot</em> of moving parts required to make sampling work at scale. All it takes is one programmer in one service to make a mistake with an attribute that controls sampling behavior, and you can end up with whole classes of missing traces, or conversely, accidentally disabling sampling and blowing up your bill. We've had both of these happen, and in a couple of cases, we've had queries return <em>very</em> misleading results.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Mea culpa</div><div class="admonitionContent_BuS1"><p>Just to be clear, these misleading results weren't Honeycomb's fault- they were errors on our side related to the subtleties of OTEL sampling configuration.</p></div></div>
<p>Luckily, in these cases, we had traditional metrics available that directly contradicted the wrong conclusions we were drawing from the traces, and saved us from making some pretty bad decisions.</p>
<p>So if you're doing sampling on your wide events, which you <em>almost certainly are to manage costs</em>, traditional timeseries metrics are <em>extremely valuable</em> for corroborating results from your queries against wide events.</p>
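<p>One way to picture that corroboration is a sanity check that compares a trace-derived estimate with a plain request counter. This is a hypothetical sketch, not our actual tooling: the <code>corroborate</code> helper, the 10% tolerance, and the mis-tagged sample rate are all invented to illustrate one possible failure mode.</p>

```python
def corroborate(trace_estimate, metric_count, tolerance=0.10):
    """Return True when the sampled-trace estimate is within tolerance of the metric."""
    if metric_count == 0:
        return trace_estimate == 0
    relative_error = abs(trace_estimate - metric_count) / metric_count
    return tolerance >= relative_error


# Hypothetical numbers: a request counter says 300,000 requests in the window.
metric_count = 300_000

# A healthy service: 10,000 kept spans, each tagged with sample rate 30.
healthy_estimate = 10_000 * 30

# A misconfigured service: the same 10,000 kept spans, but tagged with sample
# rate 1, so query-time extrapolation undercounts by 30x.
broken_estimate = 10_000 * 1

print("healthy agrees with metrics:", corroborate(healthy_estimate, metric_count))
print("broken agrees with metrics:", corroborate(broken_estimate, metric_count))
```

<p>The counter is cheap and unsampled, so when the two disagree this badly, the traces are the more likely suspect.</p>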
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="sampling-breaks-forensic-use-cases">Sampling breaks forensic use cases<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#sampling-breaks-forensic-use-cases" class="hash-link" aria-label="Direct link to Sampling breaks forensic use cases" title="Direct link to Sampling breaks forensic use cases" translate="no">​</a></h4>
<p>The other big problem with sampling is that it makes it virtually impossible to rely on traces for forensic/diagnostic use cases. If you're trying to diagnose an issue for a specific customer, client device, or other specific request, the chances of having a trace available are slim.</p>
<p>Gut check: imagine trying to investigate a potential security breach with anything other than 100% of logs/traces. Who would put themselves in that position?</p>
<p>So for these forensic use cases, we end up needing our unsampled logs, and not traces.</p>
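<p>To put a rough number on "slim": assuming uniform random sampling at 1 in 30 (a simplification, since tail sampling rules can deliberately keep errors or slow requests), a single request has about a 3% chance of leaving a trace behind. The helper name below is just for illustration.</p>

```python
SAMPLE_RATE = 30  # keep 1 in 30, the ratio mentioned earlier


def chance_of_any_trace(requests, sample_rate=SAMPLE_RATE):
    """Probability that at least one of `requests` independent requests was kept."""
    drop_probability = 1 - 1 / sample_rate
    return 1 - drop_probability ** requests


for n in (1, 5, 20):
    print(f"{n} request(s): {chance_of_any_trace(n):.1%} chance a trace exists")
```

<p>Even if a customer hit the problem several times, the odds of having the one trace you actually need are not something you'd want to bet an investigation on.</p>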
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-historical-inertia-of-metrics">The historical inertia of metrics<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#the-historical-inertia-of-metrics" class="hash-link" aria-label="Direct link to The historical inertia of metrics" title="Direct link to The historical inertia of metrics" translate="no">​</a></h3>
<p>I'll finish up my defense of metrics with a couple other <em>eminently practical</em> reasons which make it hard to imagine giving up traditional metrics entirely:</p>
<ul>
<li class="">Metrics are boring, simple, tried and true. Virtually every mature industry uses metrics to drive their business and operations.</li>
<li class="">Metrics are ubiquitous. Everything supports metrics. Most every tool in the cloud-native ecosystem exposes Prometheus metrics, and very few emit traces (though this is slowly changing). Cloud providers also expose telemetry primarily as metrics (e.g. CloudWatch, Azure Monitor, etc).</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="logs-vs-traces">Logs vs traces<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#logs-vs-traces" class="hash-link" aria-label="Direct link to Logs vs traces" title="Direct link to Logs vs traces" translate="no">​</a></h3>
<p>While Charity's Observability 2.0 vision of wide events technically includes structured logs, they generally play second fiddle to traces. Either way, it's worth a quick tangent to enumerate reasons why logs are still valuable:</p>
<ul>
<li class="">Logs are simple and ubiquitous. Everything writes logs. Also, all OSS/third-party infra produces logs, and very little of it produces traces.</li>
<li class="">The developer experience for logs is dead simple, and great by default. Use <code>printf()</code>, start your app from a terminal, and watch the logs flow through stdout. It's a beautiful thing.</li>
<li class="">Compared to traces, logs are a much more natural way to model events which aren't tied to requests made across services (e.g. lifecycle events, background jobs, etc).</li>
<li class="">Logs are mature and battle-tested. They've been used since the dawn of computing.</li>
</ul>
<p>Here's a fun example: If you're using traces <em>only</em>, and are not saving logs, where would you look to diagnose an app crashing? OTEL instrumentation isn't going to politely wrap up the current span and flush its buffer when your app panics. You're just going to lose any record of the cause of the panic.</p>
<p>Here are a few more examples we've experienced where the immaturity of the OTEL tracing ecosystem has bitten us:</p>
<ul>
<li class="">With certain node.js Promise libraries, trace context can get lost across async function chains, leading to orphaned spans</li>
<li class="">Long-running spans (more than a few seconds) are <a href="https://thenewstack.io/opentelemetry-challenges-handling-long-running-spans/" target="_blank" rel="noopener noreferrer" class="">generally an unsolved problem</a>- the behavior isn't well defined, and spans can get dropped or orphaned.</li>
<li class="">It's not currently possible to configure sampling to capture related traces: single logical operations that involve multiple requests, async events, etc.</li>
<li class="">Context propagation can get broken across a whole system by one service that's using an older OTEL SDK, listens via an esoteric protocol, or is just a little buggy</li>
<li class="">Graceful shutdowns are a lot trickier with OTEL tracing, because you have to first drain your actual connections/queues, then ensure you're flushing all in-progress spans before exiting. Any mistakes: you guessed it, dropped spans.</li>
</ul>
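<p>The shutdown-ordering problem in that last bullet is easier to see with a toy model. This is deliberately <em>not</em> the OTEL SDK: <code>ToySpanBuffer</code> is a made-up stand-in for a batching exporter, there only to show why flushing before draining loses spans.</p>

```python
class ToySpanBuffer:
    """Stand-in for a batching span exporter; not the real OTEL SDK."""

    def __init__(self):
        self.pending = []   # spans finished but not yet exported
        self.exported = []  # spans that reached the backend

    def end_span(self, name):
        self.pending.append(name)

    def flush(self):
        self.exported.extend(self.pending)
        self.pending.clear()


def graceful_shutdown(in_flight_requests, buffer):
    # 1. Drain real work first, so every request gets to end its span.
    for request in in_flight_requests:
        buffer.end_span(f"handle {request}")
    # 2. Only then flush, so spans ended during the drain are not left behind.
    buffer.flush()


def hasty_shutdown(in_flight_requests, buffer):
    # Flushing before draining: spans ended afterwards stay in the buffer
    # and vanish when the process exits.
    buffer.flush()
    for request in in_flight_requests:
        buffer.end_span(f"handle {request}")


good, bad = ToySpanBuffer(), ToySpanBuffer()
graceful_shutdown(["req-1", "req-2"], good)
hasty_shutdown(["req-1", "req-2"], bad)
print("graceful exported:", good.exported, "lost:", good.pending)
print("hasty exported:", bad.exported, "lost:", bad.pending)
```

<p>In a real service the drain step is asynchronous and has a deadline, which is exactly where the mistakes creep in.</p>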
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-path-forward-for-observability-20">The path forward for Observability 2.0<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#the-path-forward-for-observability-20" class="hash-link" aria-label="Direct link to The path forward for Observability 2.0" title="Direct link to The path forward for Observability 2.0" translate="no">​</a></h3>
<p>I'm not writing this to dump on either OpenTelemetry, tracing, or the overall vision of Observability 2.0. I'm still a huge fan of the idea of simplifying signals down to minimize telemetry sprawl, and I've experienced the benefits of OTEL, tracing, and wide events firsthand.</p>
<p>Honeycomb is indeed a fucking <em>pleasure</em> to use- it's super fast, very powerful, and intuitive. Honeycomb's mere existence has prompted a virtual tidal wave of innovation across the observability space as competitors struggle to react to its power and elegance.</p>
<p>But despite how seductive the vision of Observability 2.0 is, when it came time to make a bold decision to go all in... we equivocated, put our incrementalist hats back on, and decided to continue with the three pillars for the time being. Observability 2.0 is just a <em>bit too cutting edge</em> at the moment.</p>
<p>The obstacles that prevented us from saying "yes" today may be solvable in time. I hear there are companies that have already gone all in on Observability 2.0, and I've got to believe it's possible.</p>
<p>To close the gap for a company like SimpliSafe, here are the problems we'd need solved:</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="affordability">Affordability<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#affordability" class="hash-link" aria-label="Direct link to Affordability" title="Direct link to Affordability" translate="no">​</a></h4>
<p>Unless we can get the cost of wide events down, sampling will continue to be necessary, and sampling undermines the idea that we can discard the three-pillars model:</p>
<ul>
<li class="">Sampling traces makes them useless for diagnostic and forensic use cases, requiring you to retain logs</li>
<li class="">The complexity of sampling traces makes it more important to retain traditional metrics to corroborate results</li>
</ul>
<p>This is a fundamental catch-22: <strong>sampling and Observability 2.0 are fundamentally incompatible</strong>.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Just to be clear</div><div class="admonitionContent_BuS1"><p>I think that in our three-pillars world, sampling traces is actually a <em>great tradeoff</em>. You get most of the value, and you pay a fraction of the cost.</p><p>But I'm only OK with this tradeoff because I have <em>metrics and logs to cover the gaps</em>.</p></div></div>
<p>I'm seeing some pretty exciting things happening to make logs/traces more cost-efficient, such as products based on OSS columnar databases like <a href="https://github.com/ClickHouse/ClickHouse" target="_blank" rel="noopener noreferrer" class="">Clickhouse</a>, as well as other tracing projects like <a href="https://grafana.com/oss/tempo/" target="_blank" rel="noopener noreferrer" class="">Grafana Tempo</a> and <a href="https://quickwit.io/" target="_blank" rel="noopener noreferrer" class="">QuickWit</a>.</p>
<p>And I'm certainly not going to count out Honeycomb- knowing them, they'll continue to chip away at inefficiencies and find more ways to make their product more and more affordable, especially as they gain greater economies of scale.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="developer-experience">Developer experience<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#developer-experience" class="hash-link" aria-label="Direct link to Developer experience" title="Direct link to Developer experience" translate="no">​</a></h4>
<p>The developer tools available for tracing are legitimately a lot more complex than those available for logs. We'll need ubiquitous tracing developer tools that approach the ease of use of <code>printf()</code> debugging, and that make it just as easy to validate that your tracing instrumentation is working the way you intended.</p>
<p>The developer experience for OTEL SDK configuration is still... a work in progress. I hope to see more projects that package up the OTEL SDKs as a "distribution" to provide an installation experience that's as simple as that of vendor agents like NewRelic's or DataDog's. Let's just admit that 200 lines of boilerplate before you can send a single span is a bit much.</p>
<p>Documentation is also a big opportunity for improvement. I found it impossible to get an SDK properly configured without digging through the source code. We need <em>a lot</em> more examples, and more focus on use-cases.</p>
<p>And please, for the love of god, let's not add <a href="https://github.com/open-telemetry/opentelemetry-configuration/" target="_blank" rel="noopener noreferrer" class="">another layer of YAML</a> to "simplify" configuration.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>My New Year's resolution</div><div class="admonitionContent_BuS1"><p>Instead of just complaining about OTEL SDK configuration and docs, I'm going to shut up and start contributing.</p><p>OTEL is amazing, thank you to everyone who's worked on the project. We appreciate you!</p></div></div>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="reliability-and-maturity">Reliability and maturity<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#reliability-and-maturity" class="hash-link" aria-label="Direct link to Reliability and maturity" title="Direct link to Reliability and maturity" translate="no">​</a></h4>
<p>Beyond installation and configuration, the OTEL tracing ecosystem needs some time to ripen, making it easier to get correct behavior (e.g. graceful shutdowns, context propagation, etc), handling edge cases (crashes/panics), and defining some standards for handling things like long-running spans, related traces, and unfinished spans.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="ecosystem-and-adoption">Ecosystem and adoption<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#ecosystem-and-adoption" class="hash-link" aria-label="Direct link to Ecosystem and adoption" title="Direct link to Ecosystem and adoption" translate="no">​</a></h4>
<p>As long as most infrastructure and third-party tools are emitting only logs and metrics, it's going to be hard to go all in on tracing for first-party telemetry. It's possible that OpenTelemetry's trajectory will continue, and more and more OSS tools will start emitting traces, but there's a lot of ground to cover to hit critical mass.</p>
<p>As a litmus test: an Observability 2.0 product would need some sort of drop-in Kubernetes infrastructure monitoring solution similar to what you can get out-of-the-box with DataDog/NewRelic, or at least something comparable to the <a href="https://github.com/prometheus-operator/prometheus-operator" target="_blank" rel="noopener noreferrer" class="">Prometheus Operator</a>, which are currently almost entirely driven by metrics.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="being-comfortable-with-incrementalism">Being comfortable with incrementalism<a href="https://labaneilers.com/are-we-ready-for-observability-2.0#being-comfortable-with-incrementalism" class="hash-link" aria-label="Direct link to Being comfortable with incrementalism" title="Direct link to Being comfortable with incrementalism" translate="no">​</a></h2>
<p>Throughout my career, I've generally been biased towards incrementalism over revolutionary changes. I'm pretty stingy about my <a href="https://mcfunley.com/choose-boring-technology" target="_blank" rel="noopener noreferrer" class="">innovation tokens</a>, and I tend to want to save them for things that will drive our core business strategy. Observability is something I want to keep boring and predictable.</p>
<p>Perhaps Observability 2.0 isn't a purist standard we should aim to achieve in black and white terms. Rather, it's a philosophy we can apply incrementally, in which we gradually get more and more value from our observability spend.</p>
<p>Right now, we're thinking about how to better use the three pillars we have:</p>
<ul>
<li class="">Seek out tools that integrate and visualize data from across multiple pillars (e.g. using exemplars to link metrics to traces)</li>
<li class="">Look for ways to reduce the waste of overlapping, duplicate pillars</li>
<li class="">Try to utilize each pillar for its strengths, with tooling that uses them to complement each other</li>
<li class="">Look for more sustainable ways to manage costs</li>
</ul>
<p>Hopefully in a few years, the ecosystem will have evolved around the Observability 2.0 vision, and we'll be in a position to be a bit braver about our next steps.</p>]]></content:encoded>
            <category>observability</category>
            <category>opentelemetry</category>
            <category>devops</category>
            <category>platform-engineering</category>
        </item>
        <item>
            <title><![CDATA[Multi-platform container builds with BuildKit]]></title>
            <link>https://labaneilers.com/multi-platform-builds-buildkit</link>
            <guid>https://labaneilers.com/multi-platform-builds-buildkit</guid>
            <pubDate>Thu, 02 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[At SimpliSafe, we wanted to take advantage of the cost savings and performance improvements of AWS's Graviton processors (ARM64), but wanted to do it incrementally to manage risk.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>At SimpliSafe, we wanted to take advantage of the cost savings and performance improvements of AWS's Graviton processors (ARM64), but wanted to do it incrementally to manage risk.</p><p>We built an autoscaling, docker-compatible build service using BuildKit, which could build multi-platform container images, and then used Karpenter to auto-provision Graviton-based nodes.</p><p>This is paying off- we're seeing the expected 30% better performance per EC2 dollar spent, as well as some surprising benefits to developer productivity.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/polar-bear-fb406011807a77fe81057296532a3b27.jpg" class="blog-image" alt="A polar bear using an ARM64 laptop">
<p>AWS's Graviton is a 64 bit ARM-based CPU available on EC2. Why, you ask, would one want to use Graviton-based instances when trusty old x86 instances have served us so well in the past?</p>
<p>Well, for one, you'd see a roughly <strong>30% improvement in price/performance</strong> versus instances with x86 chips. This performance difference is even more pronounced at higher utilization, because unlike x86 chips, a Graviton vCPU is an <em>actual</em> CPU core, NOT a hyperthread you're sharing on a core with some rando. This should allow you to scale down, increase CPU utilization more than would be safe with hyperthreads, and still handle the same load.</p>
<!-- -->
<p>While you won't, as an AWS customer, see the corresponding reduction in power consumption and carbon emissions <em>directly</em>, you can be assured it correlates directly with cost savings. You can save money whilst congratulating yourself for being a dedicated protector of the environment. Bonus!</p>
<p>I had a conversation with my team, and we decided that yes, we do enjoy saving money, and that we should probably look at how hard it would be to migrate our workloads to Graviton instances.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-hard-could-recompiling-everything-be">How hard could recompiling everything be?<a href="https://labaneilers.com/multi-platform-builds-buildkit#how-hard-could-recompiling-everything-be" class="hash-link" aria-label="Direct link to How hard could recompiling everything be?" title="Direct link to How hard could recompiling everything be?" translate="no">​</a></h2>
<p>At SimpliSafe, we've got a heterogeneous fleet of microservices built with a bunch of different languages. While we were eager to take advantage of the cost savings of Graviton instances, we weren't about to plunge in headlong before we'd incrementally built some confidence with it. We needed an approach that would support both ARM64 and AMD64 for some period, so that we could run a reasonable bake-off.</p>
<p>Another wrinkle: we'd been gradually replacing our developer laptops (Macbooks) with newer models with Apple Silicon (ARM) processors for about a year at this point, so we had a mix of developers using ARM and Intel chips. We knew that within about 2 years, all our developers would be on ARM, as their laptops were refreshed.</p>
<p>We believe strongly that developers need be able to build, test, and debug locally on their laptops, in an environment that's as close as possible to production. Whatever solution we came up with had to work for both x86 and ARM, both in AWS and on developer laptops.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="options-for-multi-platform-builds">Options for multi-platform builds<a href="https://labaneilers.com/multi-platform-builds-buildkit#options-for-multi-platform-builds" class="hash-link" aria-label="Direct link to Options for multi-platform builds" title="Direct link to Options for multi-platform builds" translate="no">​</a></h2>
<p>The first problem we had to solve was building cross-platform (or better yet, multi-platform) container images. Here are some of the options we considered:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="docker-with-qemu-emulation">Docker with QEMU emulation<a href="https://labaneilers.com/multi-platform-builds-buildkit#docker-with-qemu-emulation" class="hash-link" aria-label="Direct link to Docker with QEMU emulation" title="Direct link to Docker with QEMU emulation" translate="no">​</a></h3>
<p>Docker has features to enable CPU emulation of ARM chips on x86 hosts, or vice versa. On Linux, it uses QEMU, while on a Mac (using a Linux VM managed by Docker Desktop) it can additionally use Rosetta.</p>
<p>It <em>usually</em> works pretty well, but it's far from perfect. We ran into a number of cases, especially with Rust and .NET builds, where the compiler would choke hard (segfault) when run under emulation.</p>
<p>Even if emulation had been reliable, the performance hit when running under emulation varied from slightly annoying (e.g. node.js, Python) to mildly aggravating (Go, .NET), to unbearably slow (Rust, C++).</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cross-compilation">Cross compilation<a href="https://labaneilers.com/multi-platform-builds-buildkit#cross-compilation" class="hash-link" aria-label="Direct link to Cross compilation" title="Direct link to Cross compilation" translate="no">​</a></h3>
<p>Many modern language compilers support cross-compilation (i.e. you can compile a binary for a different architecture than the one the compiler is running on). We toyed around with this option, but quickly realized what a mess it would be, given the breadth of languages/toolchains we use.</p>
<p>To make this work, you have to build in a container running on the host's architecture, and then copy the artifacts into another container with the target architecture. This would be a huge step down from the elegance of multi-stage Dockerfiles, in which you can encapsulate an arbitrarily complex build toolchain and produce a final image from a single Dockerfile and single build command. From the perspective of CI/CD and our developer tools, cross-compilation would require different strategies for different languages, where it had previously been opaque to the tooling.</p>
<p>On top of all this, cross-compilation <em>still</em> requires emulation to build the final output image, even if compilation isn't actually occurring via emulation. We'd have to make sure our CI/CD runners were available on both ARM and AMD, and that each app's image was built on the right type of runner.</p>
<p>We also had to support developers building locally from <em>either</em> ARM or Intel laptops, which means we'd need to invert the build steps' target platforms depending on the architecture of the laptop.</p>
<p>Yuck.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="multi-platform-container-images">Multi-platform container images<a href="https://labaneilers.com/multi-platform-builds-buildkit#multi-platform-container-images" class="hash-link" aria-label="Direct link to Multi-platform container images" title="Direct link to Multi-platform container images" translate="no">​</a></h2>
<p>With either of these options, we <em>still</em> haven't gained the capability to build multi-platform images. With traditional Docker builds, you can only build single-architecture images. We really wanted to run per-service canary deployments to manage the risk of migration, and having to manage two separate per-architecture tagging schemes would have leaked a lot of complexity from the build into the rest of our deployment system.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>What are multi-platform container images?</div><div class="admonitionContent_BuS1"><p>Multi-platform container images are a pretty neat hack in the <a href="https://github.com/opencontainers/distribution-spec/blob/v1.1.0/spec.md#listing-referrers" target="_blank" rel="noopener noreferrer" class="">OCI spec</a> in which a specific image tag/digest points to a multi-platform "image index" instead of the manifest (list of layers) you'd get with a single-architecture image.</p><p>This index is an additional layer of indirection: a list of digests pointing to manifests for different architectures. It allows a container runtime (e.g. Docker, containerd, cri-o, etc) to automatically choose the right image for the architecture it's running on.</p></div></div>
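<p>To make the index concrete, here is a minimal sketch of listing what a tag points to. It assumes <code>jq</code> is installed and uses <code>alpine:3.20</code> purely as an illustrative multi-platform image; neither comes from our setup.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Fetch the raw image index and print each platform with its manifest digest.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">docker buildx imagetools inspect --raw alpine:3.20 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    | jq -r '.manifests[] | "\(.platform.os)/\(.platform.architecture)  \(.digest)"'</span><br></span></code></pre></div></div>
<p>Each output line is one entry in the index. For a single-architecture image, the raw output is a plain manifest with a list of layers rather than a list of manifests, so the <code>jq</code> filter has nothing to iterate over.</p>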
<p>Luckily, there are a number of container build tools available (e.g. <a href="https://github.com/containers" target="_blank" rel="noopener noreferrer" class="">Buildah/Podman</a>, <a href="https://github.com/GoogleContainerTools/kaniko" target="_blank" rel="noopener noreferrer" class="">Kaniko</a>) that do support multi-platform builds.</p>
<p>Given that we were already pretty invested in Docker as our build tool (both locally and in CI/CD), we took a look at Docker's <a href="https://github.com/moby/buildkit" target="_blank" rel="noopener noreferrer" class="">BuildKit</a>, which, as it turned out, solved all these problems in one fell swoop.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="buildkit-builders-what-is-this-sorcery">BuildKit builders? What is this sorcery?<a href="https://labaneilers.com/multi-platform-builds-buildkit#buildkit-builders-what-is-this-sorcery" class="hash-link" aria-label="Direct link to BuildKit builders? What is this sorcery?" title="Direct link to BuildKit builders? What is this sorcery?" translate="no">​</a></h3>
<p>Docker has been rearchitecting its build engine for a few years now, replacing its legacy build engine with <a href="https://github.com/moby/buildkit" target="_blank" rel="noopener noreferrer" class="">BuildKit</a>, and extending the client interface with the <a href="https://docs.docker.com/reference/cli/docker/buildx/" target="_blank" rel="noopener noreferrer" class="">buildx</a> subcommand. Docker now has the ability to orchestrate builds using different drivers, allowing you to use one or more <a href="https://docs.docker.com/build/builders/" target="_blank" rel="noopener noreferrer" class="">builders</a> running on a potentially separate host. You define a builder with some docker commands, giving the docker client the info it needs to utilize the (possibly remote) builder running BuildKit.</p>
<p>The <code>docker buildx build</code> command establishes a network connection with the BuildKit instance, passes container registry auth info, copies the build context (source files, etc), executes builds for any number of platforms in parallel, pushes the output image to a registry, and interleaves the output streams from the builders back to the <code>docker</code> client.</p>
<p>Remote builders support the <strong>full feature set of Docker</strong>; there's no loss of functionality. We were a little worried we wouldn't be able to use some of the more advanced features, like secret mounts or ssh-agent sockets... but it turns out it worked 100% seamlessly.</p>
<p>After some experimentation, we found we could define a pair of builders (for ARM64 and AMD64), and construct a <code>docker buildx build</code> command that would reliably produce multi-platform images, without relying on either emulation or cross-compilation; all with only slightly longer build times than building a single, native-architecture image locally. All this could be done with a single Dockerfile!</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Multi-platform Dockerfiles</div><div class="admonitionContent_BuS1"><p>It turns out it required very few changes to our Dockerfiles to support multi-platform builds. If your <code>FROM</code> declarations use tags pointing to multi-platform base images, BuildKit automatically detects and pulls the image for the correct architecture.</p><p>If like us, you pin your base images in <code>FROM</code> declarations to a specific digest, you just need to make sure the digest points to the multi-platform index (not the digest for a platform-specific manifest). Here's a quick docker CLI command to get that:</p><div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">docker buildx imagetools inspect &lt;image-name&gt;:&lt;tag&gt;</span><br></span></code></pre></div></div><p>The very first digest in the output is for the multi-platform index.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="trying-out-the-kubernetes-driver">Trying out the kubernetes driver<a href="https://labaneilers.com/multi-platform-builds-buildkit#trying-out-the-kubernetes-driver" class="hash-link" aria-label="Direct link to Trying out the kubernetes driver" title="Direct link to Trying out the kubernetes driver" translate="no">​</a></h2>
<p>Our first iteration used Docker's <a href="https://docs.docker.com/build/builders/drivers/kubernetes/" target="_blank" rel="noopener noreferrer" class="">kubernetes driver</a>, which given a kubeconfig context and a few other settings (e.g. resource requests/limits), will spin up BuildKit pods, and use the Kubernetes API to exec into them and kick off processes with buildkit CLI commands.</p>
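<p>For reference, defining such a builder looks roughly like the sketch below. The builder name, node names, namespace, and replica count are illustrative assumptions, not the values we used.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Create a builder backed by BuildKit pods scheduled on ARM64 nodes.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">docker buildx create \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --name k8s-builder \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver kubernetes \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --platform linux/arm64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --node k8s-arm64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt namespace=buildkit \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt replicas=2 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt nodeselector=kubernetes.io/arch=arm64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt loadbalance=sticky</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Append AMD64 pods to the same builder as a second node.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">docker buildx create \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --append \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --name k8s-builder \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver kubernetes \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --platform linux/amd64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --node k8s-amd64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt namespace=buildkit \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt replicas=2 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt nodeselector=kubernetes.io/arch=amd64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver-opt loadbalance=sticky</span><br></span></code></pre></div></div>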
<p>We discovered some limitations to this approach; most notably:</p>
<ul>
<li class="">developers (and CI/CD) needed to authenticate to a Kubernetes cluster to execute builds, which added some complexity and new surface area from a security perspective.</li>
<li class="">the driver doesn't do any intelligent load balancing between the builders; the only available load balancing algorithms are <code>random</code> and <code>sticky</code>, neither of which does much to prevent a single builder from being overwhelmed.</li>
<li class="">there's no autoscaling capability. We would have had to choose between wasting money on over-provisioning builders, or risk builds failing due to lack of capacity.</li>
</ul>
<p>Not ideal.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="creating-a-central-builder-service">Creating a central builder service<a href="https://labaneilers.com/multi-platform-builds-buildkit#creating-a-central-builder-service" class="hash-link" aria-label="Direct link to Creating a central builder service" title="Direct link to Creating a central builder service" translate="no">​</a></h2>
<p>We quickly pivoted and tried a different approach: using Docker's <a href="https://docs.docker.com/build/builders/drivers/remote/" target="_blank" rel="noopener noreferrer" class="">remote driver</a> with a centralized deployment of BuildKit pods behind a network load balancer.</p>
<p>This worked amazingly well; it solved all the scaling and auth problems we'd had with the kubernetes driver, drastically simplifying the client build tooling in the process.</p>
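<p>With the remote driver, the client side boils down to pointing a builder at one endpoint per architecture. Our tooling writes the builder definition file directly, so treat the following as a rough CLI equivalent rather than what we actually run; the names and endpoints are the placeholder values from the appendix.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Create the builder with an ARM64 node behind the ARM load balancer.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">docker buildx create \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --name remote \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver remote \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --platform linux/arm64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --node remote-arm64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    tcp://arm.docker-builders.mycompany.com:3569</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Append an AMD64 node behind the AMD load balancer.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">docker buildx create \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --append \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --name remote \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --driver remote \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --platform linux/amd64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --node remote-amd64 \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    tcp://amd.docker-builders.mycompany.com:3569</span><br></span></code></pre></div></div>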
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>BuildKit autoscaling is a little tricky</div><div class="admonitionContent_BuS1"><p>To get autoscaling working reliably, we did have to customize the <code>buildkitd</code> image to handle graceful shutdowns (i.e. we added a <code>preStop</code> hook to delay termination until all builds on a particular pod were complete).</p><p>The key buildkit command to use to determine the status of all builds is:</p><div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">buildctl debug histories</span><br></span></code></pre></div></div></div></div>
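<p>The shutdown delay itself can be sketched as a small polling loop. This is not our actual hook: <code>has_active_builds</code> is a hypothetical helper that you would implement by inspecting the output of <code>buildctl debug histories</code>, and the timeout and interval values are illustrative.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">#!/usr/bin/env bash</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># wait_for_idle CHECK_CMD TIMEOUT_SECONDS INTERVAL_SECONDS</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Polls CHECK_CMD (exit status 0 means "builds still running") until it</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># reports idle, or gives up once TIMEOUT_SECONDS have been spent waiting.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">wait_for_idle() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    local check_cmd="$1" timeout="$2" interval="$3" waited=0</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    while "$check_cmd"; do</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        if [ "$waited" -ge "$timeout" ]; then</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            echo "timed out after ${waited}s with builds still active" &gt;&amp;2</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            return 1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        fi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        sleep "$interval"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        waited=$((waited + interval))</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    done</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    echo "no active builds after ${waited}s"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># In a preStop hook (has_active_builds is hypothetical, see above):</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># wait_for_idle has_active_builds 1800 5</span><br></span></code></pre></div></div>
<p>Kubernetes only waits for a <code>preStop</code> hook up to the pod's <code>terminationGracePeriodSeconds</code>, so that value needs to be at least as long as the timeout you pick here.</p>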
<p>When this was all done, we could use a single command in our developer tooling to build multi-platform images, regardless of the platform of the host laptop or CI/CD runner.</p>
<p>Great success!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="so-you-decided-you-didnt-need-layer-caching">So you decided you didn't need layer caching?<a href="https://labaneilers.com/multi-platform-builds-buildkit#so-you-decided-you-didnt-need-layer-caching" class="hash-link" aria-label="Direct link to So you decided you didn't need layer caching?" title="Direct link to So you decided you didn't need layer caching?" translate="no">​</a></h2>
<p>The astute reader will have guessed that Docker's <em>extremely valuable</em> layer caching feature probably wouldn't work when running a build on a different host, especially one that's randomly selected by a load balancer. That was very clever of you to notice!</p>
<p>Luckily, BuildKit has a feature called <a href="https://docs.docker.com/build/cache/backends/registry/" target="_blank" rel="noopener noreferrer" class="">registry-based layer caching</a> that allows you to cache layers in an external container registry. This works just like traditional layer caching, but the cache layers are stored in a specially formatted OCI image in the registry (we're using Amazon ECR).</p>
<p>This added some complexity to our build tooling to make sure we were tagging and pushing cache images with the right settings (e.g. using the <code>max</code> mode to support multi-stage builds), but once that was done it was completely invisible to developers.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="putting-this-all-together">Putting this all together<a href="https://labaneilers.com/multi-platform-builds-buildkit#putting-this-all-together" class="hash-link" aria-label="Direct link to Putting this all together" title="Direct link to Putting this all together" translate="no">​</a></h2>
<p>We now had a reliable way to build multi-platform images built into our existing developer tooling, but we still needed a way to configure our pods to run on Graviton instances.</p>
<p>It turns out this was pretty trivial with <a class="" href="https://labaneilers.com/karpenter-you-complete-me">Karpenter</a>. We created a new <code>NodePool</code> and <code>EC2NodeClass</code> for Graviton instance types, and added a well-known taint and label (<code>kubernetes.io/arch</code>) our pod specs could target with a corresponding toleration and node selector.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>A taint <em>and</em> a label?</div><div class="admonitionContent_BuS1"><p>Yep, you need both. The taint/toleration prevents any AMD64 workloads from being scheduled on an ARM64 node, and the label/nodeSelector ensures that ARM64 workloads can <em>only</em> be scheduled on an ARM64 node.</p></div></div>
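<p>On the workload side, the pairing might look like the following sketch. The pod name, the taint's <code>NoSchedule</code> effect, and the use of <code>kubernetes.io/arch</code> as the taint key are assumptions for illustration; match them to whatever your <code>NodePool</code> actually applies.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Apply a pod that both tolerates the ARM64 taint and selects ARM64 nodes.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">kubectl apply -f - &lt;&lt;'EOF'</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">apiVersion: v1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">kind: Pod</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">metadata:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  name: my-app</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">spec:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  nodeSelector:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    kubernetes.io/arch: arm64      # only schedule onto ARM64 nodes</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  tolerations:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - key: kubernetes.io/arch      # allow scheduling despite the ARM64 taint</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      operator: Equal</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      value: arm64</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      effect: NoSchedule</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  containers:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - name: my-app</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      image: 111111111.dkr.ecr.us-east-1.amazonaws.com/my-app:some-tag</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">EOF</span><br></span></code></pre></div></div>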
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="results">Results<a href="https://labaneilers.com/multi-platform-builds-buildkit#results" class="hash-link" aria-label="Direct link to Results" title="Direct link to Results" translate="no">​</a></h2>
<p>In the end, we are generally seeing the savings/performance improvements we expected, though it varies slightly by workload. We definitely haven't encountered a workload that performs <em>worse</em> on Graviton than on Intel/AMD.</p>
<p>Don't get me wrong, the savings have been really nice, but the improvement to developer productivity has actually been a lot bigger than I expected.</p>
<p>As more of our developers were getting refreshed to new Apple Silicon laptops, building with emulation was becoming an increasingly huge pain (since we were previously always building AMD64 images). Ironically, having the new remote builders allowed us to safely and incrementally convert our target architecture in production, which in turn allowed our developers to switch to running native ARM64 builds locally... and not use the remote builders.</p>
<p>And damn, these new MacBooks run native builds <em>really fast</em>.</p>
<p>We still use the build service for CI/CD builds, since our CI/CD runners are still all AMD64, but our developer build tooling can decide on remote vs local builds automatically by detecting the host and target architecture. We haven't seen any differences in the resulting artifacts between local and remote builds, thanks to the fact that BuildKit works exactly the same locally vs remotely.</p>
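<p>The local-vs-remote decision can be sketched in a few lines of shell. This is an illustration rather than our actual tooling: the function names are hypothetical, and it assumes a build stays local only when the single target platform matches the host's architecture.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">#!/usr/bin/env bash</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Map `uname -m` output to Docker's architecture names.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">normalize_arch() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    case "$1" in</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        x86_64|amd64) echo amd64 ;;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        aarch64|arm64) echo arm64 ;;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        *) echo "$1" ;;</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    esac</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># choose_build_mode HOST_MACHINE TARGET_PLATFORMS</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Prints "local" when the only target platform matches the host, else "remote".</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">choose_build_mode() {</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    local host_arch</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    host_arch="$(normalize_arch "$1")"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    if [ "$2" = "linux/${host_arch}" ]; then</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        echo local</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    else</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        echo remote</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    fi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">}</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"># Example: decide how to build an ARM64-only image on this machine.</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">choose_build_mode "$(uname -m)" "linux/arm64"</span><br></span></code></pre></div></div>
<p>A multi-platform target such as <code>linux/amd64,linux/arm64</code> never matches a single host architecture, so it always falls through to the remote builders.</p>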
<p>So go out there, save some money, some carbon, and hopefully some penguins.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="special-thanks">Special thanks<a href="https://labaneilers.com/multi-platform-builds-buildkit#special-thanks" class="hash-link" aria-label="Direct link to Special thanks" title="Direct link to Special thanks" translate="no">​</a></h2>
<p>Special thanks to <a href="https://www.linkedin.com/in/efong/?originalSubdomain=au" target="_blank" rel="noopener noreferrer" class="">Liz Fong-Jones</a>, whose <a href="https://www.youtube.com/watch?v=vSdScyCFsFI&amp;ab_channel=AWSEvents" target="_blank" rel="noopener noreferrer" class="">awesome talk at AWS re:Invent (2024)</a> about migrating Honeycomb's Lambda-based architecture to Graviton inspired me to finally write this up.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="appendix">Appendix<a href="https://labaneilers.com/multi-platform-builds-buildkit#appendix" class="hash-link" aria-label="Direct link to Appendix" title="Direct link to Appendix" translate="no">​</a></h2>
<p>An example <code>docker buildx</code> command for building a multi-platform image:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">docker buildx build \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --builder remote \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --push \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --pull \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --platform 'linux/amd64,linux/arm64' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --tag '111111111.dkr.ecr.us-east-1.amazonaws.com/my-app:some-tag' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --cache-from 'type=registry,ref=111111111.dkr.ecr.us-east-1.amazonaws.com/my-app:some-tag-cache' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --cache-from 'type=registry,ref=111111111.dkr.ecr.us-east-1.amazonaws.com/my-app:some-tag-cache' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    --cache-to 'type=registry,mode=max,ref=111111111.dkr.ecr.us-east-1.amazonaws.com/my-app:some-tag-cache,image-manifest=true' \</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    .</span><br></span></code></pre></div></div>
<p>A few notes:</p>
<ul>
<li class="">The <code>--push</code> flag tells Docker to push the resulting image to the registry, since it can't necessarily store the multi-platform image in your local Docker daemon's cache (unless you have containerd enabled... long story).</li>
<li class="">The multiple <code>--cache-from</code> flags allow us to use the cache from a previous build from either the main branch or a feature branch. BuildKit will try to import both and reuse whichever layers match, and will also fail gracefully if one doesn't exist. (<code>--pull</code> is separate: it makes sure the base images referenced in the Dockerfile are always re-pulled.)</li>
<li class=""><code>--cache-to</code> tells Docker to store the resulting cache of this build. The <code>mode=max</code> attribute is very important here, since otherwise the cache would only contain the final stage's layers.</li>
<li class="">The <code>--builder</code> flag tells Docker to use the remote builder named <code>remote</code>, which is defined in a file created by our tooling at <code>~/.docker/buildx/instances/remote</code>, and looks like this:</li>
</ul>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"Name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"remote"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"Driver"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"remote"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"Nodes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" 
style="color:#36acaa">"Name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"remote-arm64"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Platforms"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token property" style="color:#36acaa">"architecture"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"arm64"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token property" style="color:#36acaa">"os"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"linux"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" 
style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Flags"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token null keyword" style="color:#00009f">null</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"DriverOpts"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token null keyword" style="color:#00009f">null</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Endpoint"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"tcp://arm.docker-builders.mycompany.com:3569"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Files"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token null keyword" style="color:#00009f">null</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"remote-amd64"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Platforms"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token property" style="color:#36acaa">"architecture"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"amd64"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token property" style="color:#36acaa">"os"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"linux"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Flags"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token null keyword" style="color:#00009f">null</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"DriverOpts"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token null keyword" style="color:#00009f">null</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"Endpoint"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"tcp://amd.docker-builders.mycompany.com:3569"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" 
style="color:#36acaa">"Files"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token null keyword" style="color:#00009f">null</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"Dynamic"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">false</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>]]></content:encoded>
            <category>devops</category>
            <category>platform-engineering</category>
            <category>kubernetes</category>
        </item>
        <item>
            <title><![CDATA[Incremental IPv6 with Kubernetes]]></title>
            <link>https://labaneilers.com/incremental-ipv6-with-kubernetes</link>
            <guid>https://labaneilers.com/incremental-ipv6-with-kubernetes</guid>
            <pubDate>Sun, 13 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Due to looming IP address exhaustion, we've been migrating my company's Kubernetes workloads to IPv6. While IPv6 has its sharp edges, AWS EKS's new IPv6-only mode and better OSS ecosystem support have made it possible to adopt incrementally.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>Due to looming IP address exhaustion, we've been migrating my company's Kubernetes workloads to IPv6. While IPv6 has its sharp edges, AWS EKS's new IPv6-only mode and better OSS ecosystem support have made it possible to adopt incrementally.</p><p>Here's a bunch of tricks I've picked up in the process.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/cars-8bb004502b17e969d919cf27ce607878.jpg" class="blog-image" alt="A full parking lot">
<p>At my work, we've been struggling a bit over the past few years with decisions made (almost 10 years ago now) about our AWS network design. While we have a full class A private network (16,777,216 IPv4 addresses), we've managed to paint ourselves into the very sad corner of looming IP address exhaustion.</p>
<p>There are a few reasons:</p>
<ul>
<li class="">Our integration with cell network carriers (to support our <a href="https://simplisafe.com/build-my-system" target="_blank" rel="noopener noreferrer" class="">home security systems</a>) requires a huge chunk of our IP space</li>
<li class="">Our decision to use a multi-account architecture in AWS, and that we chose to use a flat IP space across our accounts. This means our IP space is fragmented across accounts, regions, and availability zones, making a lot of that address space effectively unusable.</li>
</ul>
<p>Even with all of this, we might have been fine... until we went big on Kubernetes.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubernetes-eats-ips-for-breakfast">Kubernetes eats IPs for breakfast<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#kubernetes-eats-ips-for-breakfast" class="hash-link" aria-label="Direct link to Kubernetes eats IPs for breakfast" title="Direct link to Kubernetes eats IPs for breakfast" translate="no">​</a></h2>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/chipmunk-0579dd05797ec2db4c4f3d1dba822110.jpg" alt="Kubernetes eating all my IPs"><figcaption>Kubernetes eating all my IPs</figcaption></figure>
<p>Kubernetes has been a huge win for us. But it gobbles up IP addresses like Pac-Man with a tapeworm.</p>
<p>It's fairly straightforward math: in AWS EKS, with the VPC CNI integration (i.e. a network plugin for Kubernetes that allows it to integrate with AWS's networking APIs), here's what happens to all your IPs:</p>
<ul>
<li class="">The EKS control plane requires at least 16 addresses (at least 6 per subnet)</li>
<li class="">Every node (EC2 instance) requires at least one address, but depending on your CNI settings, the CNI plugin can eagerly allocate additional addresses to keep "warm" (to speed up pod creation)</li>
<li class="">Every pod on a node gets its own IP address. This includes not only user workloads, but also <strong>every daemonset pod</strong>. In our cluster, we have at about 8-10 daemonset pods per node.</li>
</ul>
<p>This means, as we've migrated workloads to Kubernetes, we've increased the number of IPs we're using by roughly <strong>10x</strong>.</p>
<p>This adds up quickly. We've had a few close calls with IPv4 exhaustion during high traffic events where we had to scramble to temporarily kill non-critical workloads to free up IPs, rebalance across availability zones, or provision new subnets to make sure customers weren't affected.</p>
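<p>To make that math concrete, here's a rough back-of-the-envelope sketch. The node count, pods per node, and warm IP count are made-up numbers for illustration, not figures from our clusters, and <code>estimateClusterIps</code> is just a hypothetical helper:</p>

```javascript
// Rough IPv4 budget for an EKS cluster using the VPC CNI.
// All inputs below are illustrative assumptions, not real measurements.
function estimateClusterIps({
  nodes,
  workloadPodsPerNode,
  daemonsetPodsPerNode,
  warmIpsPerNode = 0,   // addresses the CNI keeps "warm" per node
  controlPlaneIps = 16, // control plane minimum mentioned above
}) {
  // One address for the node itself, plus one per pod, plus any warm addresses.
  const perNode = 1 + workloadPodsPerNode + daemonsetPodsPerNode + warmIpsPerNode;
  return controlPlaneIps + nodes * perNode;
}

const estimate = estimateClusterIps({
  nodes: 100,
  workloadPodsPerNode: 20,
  daemonsetPodsPerNode: 9,
  warmIpsPerNode: 5,
});

console.log(estimate); // 3516
```

<p>With those assumed numbers, 100 nodes that would have needed about 100 addresses as plain EC2 instances end up consuming over 3,500.</p>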
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="actually-ipv6-is-a-thing">Actually, IPv6 is a thing<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#actually-ipv6-is-a-thing" class="hash-link" aria-label="Direct link to Actually, IPv6 is a thing" title="Direct link to Actually, IPv6 is a thing" translate="no">​</a></h2>
<p>Unlike IPv4, IPv6 address space is so incomprehensibly large that it's effectively unlimited. For example, a typical IPv6 <em>private subnet</em> would have a <code>/64</code> IPv6 CIDR, which is <strong>18,446,744,073,709,551,616</strong> addresses.</p>
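<p>If you want to check the arithmetic yourself, the address count for a prefix is just 2 raised to the number of host bits. A quick sketch (the function names are mine, and <code>BigInt</code> is needed because the result is too big for a regular JavaScript number):</p>

```javascript
// Number of addresses in an IPv6 prefix: 2^(128 - prefixLength).
function ipv6AddressCount(prefixLength) {
  return 2n ** BigInt(128 - prefixLength);
}

// Same idea for IPv4, where addresses are 32 bits wide.
function ipv4AddressCount(prefixLength) {
  return 2n ** BigInt(32 - prefixLength);
}

console.log(String(ipv6AddressCount(64))); // 18446744073709551616
console.log(String(ipv4AddressCount(8)));  // 16777216
```

<p>So a single <code>/64</code> subnet holds 2<sup>32</sup> times as many addresses as the entire IPv4 address space.</p>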
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>fun fact</div><div class="admonitionContent_BuS1"><p>Apparently a number with that many digits is called a "vigintillion". Numbers this large can only be discussed using your best Carl Sagan voice.</p></div></div>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/carl-1283583d6ba689ceae8005aff6e617f8.webp" alt="Trillions and Trillions of IPs"><figcaption>Trillions and Trillions of IPs</figcaption></figure>
<p>IPv6 has been a standard for like 25 years, but is still not widely adopted (for a lot of reasons, including backward-incompatibility, lack of ecosystem support, and ISPs squabbling and dragging their feet).</p>
<p>It's legitimately really difficult to migrate a large distributed architecture like ours to IPv6, because, historically, it would require simultaneous changes across many different systems, along with some scary big-bang moments. It also requires reconsidering a lot of assumptions built into your network design and security strategy.</p>
<p>It's been hard to figure out how to untangle that knot.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="enter-eks-ipv6-mode">Enter EKS IPv6 mode<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#enter-eks-ipv6-mode" class="hash-link" aria-label="Direct link to Enter EKS IPv6 mode" title="Direct link to Enter EKS IPv6 mode" translate="no">​</a></h2>
<p>Given the scarcity (and price) of public IPv4 addresses, and to support the increasing scale of its customers, AWS has been under a lot of pressure to provide more viable paths to adopting IPv6. In one of the smartest moves I've seen from them in a while, they've used Kubernetes's built-in IPv6 support to build a new <a href="https://aws.github.io/aws-eks-best-practices/networking/ipv6/" target="_blank" rel="noopener noreferrer" class="">IPv6 mode for EKS</a>.</p>
<p>Here's the core of the hack: <em>While each node continues to get an IPv4 address, pods get <strong>only</strong> IPv6 addresses.</em></p>
<p>Inside the cluster, all traffic is via IPv6, but traffic to and from the cluster gets NATed through the nodes' IPv4 addresses. From the perspective of anything outside the cluster, connections appear to be coming from the nodes' IPv4 addresses. This means only the software <em>inside</em> the cluster has to be modified to use IPv6.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>Note that if a host outside the cluster is IPv6-enabled, pods may just communicate directly with it over IPv6, and bypass the IPv4 NAT.</p></div></div>
<p>This translation of IP version between inside and outside the cluster has allowed us to migrate our workloads incrementally, which has made the whole process much more tractable.</p>
<p>Migrating only EKS workloads, alone, looks like it's going to allow us to reduce IPv4 address usage significantly, perhaps enough to solve our IPv4 exhaustion without any further network changes. Even if not, it should buy us years of additional runway before we hit that point.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-does-migrating-an-eks-cluster-to-ipv6-require">What does migrating an EKS cluster to IPv6 require?<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#what-does-migrating-an-eks-cluster-to-ipv6-require" class="hash-link" aria-label="Direct link to What does migrating an EKS cluster to IPv6 require?" title="Direct link to What does migrating an EKS cluster to IPv6 require?" translate="no">​</a></h2>
<p>Unfortunately, you can't enable IPv6 mode on an existing EKS cluster; you have to create a new cluster and migrate your workloads over. There have been a bunch of specific challenges around this (mostly just minutiae around Terraform wrangling and executing DNS cutovers), but now that we've found most of the corner cases, the process is pretty mechanical.</p>
<p>The bulk of the remaining work is around making any code or configuration changes necessary in the individual workloads to get them to bind to IPv6 addresses.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-few-basics">A few basics<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#a-few-basics" class="hash-link" aria-label="Direct link to A few basics" title="Direct link to A few basics" translate="no">​</a></h2>
<p>I've been a programmer for like 30 years, and I had never done anything with IPv6 before this migration. There were a few embarrassingly basic things I had to learn about IPv6:</p>
<ul>
<li class="">The IPv6 "all interfaces" address is <code>::</code>, which is equivalent to <code>0.0.0.0</code> in IPv4.</li>
<li class="">The IPv6 loopback address is <code>::1</code>, equivalent to <code>127.0.0.1</code> in IPv4.</li>
<li class="">URLs that use an IPv6 address as the hostname need the address enclosed in square brackets, e.g. <code>http://[2001:db8::1]:8080</code> so the colons in the address don't get confused with the port delimiter.</li>
<li class=""><a href="https://en.wikipedia.org/wiki/Happy_Eyeballs" target="_blank" rel="noopener noreferrer" class="">Happy Eyeballs</a> is an algorithm (implemented by most network clients) that allows apps (including browsers) to efficiently decide whether to use an IPv6 or IPv4 address when both are advertised via DNS.</li>
</ul>
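<p>The bracket rule is the one that tends to bite in code that builds URLs by string concatenation. Here's a small sketch of handling it in node.js. <code>formatHostPort</code> and the <code>/health</code> path are hypothetical names for illustration; <code>net.isIPv6</code> and <code>URL</code> are standard node.js APIs:</p>

```javascript
const net = require('net');

// Formats a host and port for a URL or "host:port" string,
// adding square brackets only when the host is an IPv6 literal.
function formatHostPort(host, port) {
  return net.isIPv6(host) ? `[${host}]:${port}` : `${host}:${port}`;
}

console.log(formatHostPort('::1', 8080));       // [::1]:8080
console.log(formatHostPort('127.0.0.1', 8080)); // 127.0.0.1:8080

// The URL parser keeps the brackets on the hostname, and the port stays separate.
const url = new URL(`http://${formatHostPort('2001:db8::1', 8080)}/health`);
console.log(url.hostname); // [2001:db8::1]
console.log(url.port);     // 8080
```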
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="your-os-and-language-probably-supports-ipv6">Your OS and language probably supports IPv6<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#your-os-and-language-probably-supports-ipv6" class="hash-link" aria-label="Direct link to Your OS and language probably supports IPv6" title="Direct link to Your OS and language probably supports IPv6" translate="no">​</a></h2>
<p>One cool thing is that almost all modern OSes (Linux, Mac, Windows) support "dual-stack": they can listen on a port on both IPv6 and IPv4 from a single socket.</p>
<p>On top of this, most high-level programming languages (and their standard libraries) utilize this feature, so if you bind to the <code>::</code> (all interfaces) address, you'll be able to listen on both IPv4 and IPv6 at the same time.</p>
<p>For example, in node.js:</p>
<div class="language-javascript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-javascript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> http </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">require</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'http'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">const</span><span class="token plain"> server </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> http</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">createServer</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">(</span><span class="token parameter">req</span><span class="token parameter punctuation" style="color:#393A34">,</span><span class="token parameter"> res</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token arrow operator" style="color:#393A34">=&gt;</span><span class="token plain"> </span><span class="token 
punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  res</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">writeHead</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">200</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> </span><span class="token string-property property" style="color:#36acaa">'Content-Type'</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'text/plain'</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  res</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">end</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">'Hello World!\n'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">// binds to port 8080 on all IPv6 and IPv4 interfaces by default!</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">server</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">listen</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">8080</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></span></code></pre></div></div>
<p>Or you can do it explicitly:</p>
<div class="language-javascript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-javascript codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// Or you can do the same thing explicitly:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">server</span><span class="token punctuation" style="color:#393A34">.</span><span class="token method function property-access" style="color:#d73a49">listen</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">8080</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'::'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">;</span><br></span></code></pre></div></div>
<p>Here's the same basic thing in Go:</p>
<div class="language-go codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-go codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">package</span><span class="token plain"> main</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"net"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">func</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">main</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" 
style="color:#999988;font-style:italic">// Binds to all IPv6 and IPv4 interfaces.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">// Note the square brackets around the address, since the </span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">// interface is a subset of a URL.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    listener</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> err </span><span class="token operator" style="color:#393A34">:=</span><span class="token plain"> net</span><span class="token punctuation" style="color:#393A34">.</span><span class="token function" style="color:#d73a49">Listen</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"tcp"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[::]:8080"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic">// ...</span><br></span></code></pre></div></div>
<p>The same thing is true in .NET, Python, Rust, Java and probably most other languages that aren't doing something weird in their networking implementation.</p>
<p>Of course, most languages also have lower level networking APIs that are IP version specific. If you're doing more complicated things with sockets, you may have a little more work to do.</p>
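<p>As one example of where the version-specific behavior peeks through: in node.js, <code>listen()</code> accepts an <code>ipv6Only</code> option that turns off dual-stack on a <code>::</code> binding. A minimal sketch, where <code>listenOptions</code> is a hypothetical helper and not part of any library:</p>

```javascript
// Builds an options object for net.Server#listen / http.Server#listen.
// ipv6Only is a documented node.js listen option: when true, binding to '::'
// no longer accepts IPv4 connections on the same socket.
function listenOptions(port, { dualStack = true } = {}) {
  return { port, host: '::', ipv6Only: !dualStack };
}

console.log(listenOptions(8080));
// { port: 8080, host: '::', ipv6Only: false }
console.log(listenOptions(8080, { dualStack: false }));
// { port: 8080, host: '::', ipv6Only: true }
```

<p>You'd pass the result straight to the server, e.g. <code>server.listen(listenOptions(8080, { dualStack: false }))</code>, if you ever need an IPv6-only listener.</p>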
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="unfortunately-not-all-apps-use-dual-stack-by-default">Unfortunately, not all apps use dual-stack by default<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#unfortunately-not-all-apps-use-dual-stack-by-default" class="hash-link" aria-label="Direct link to Unfortunately, not all apps use dual-stack by default" title="Direct link to Unfortunately, not all apps use dual-stack by default" translate="no">​</a></h2>
<p>Even though IPv6 support is readily available in most OSes and languages, it's not always enabled by default in every application. This was particularly annoying for us, because we use a lot of OSS and 3rd party container images as mock dependencies for integration tests, and supporting IPv6 meant we had to add explicit configuration in a lot of places where we previously just used the defaults.</p>
<p>In most cases, the trick is finding the magic CLI arg, environment variable, or config file setting that controls the host to bind to, and setting it to <code>::</code>.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>Some software (MongoDDB, Redis) goes out of their way to make <code>::</code> <em>not</em> be a dual-stack binding. In those cases, you have to configure both the IPv6 and IPv4 listeners separately.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ipv6-cheat-sheet">IPv6 cheat sheet<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#ipv6-cheat-sheet" class="hash-link" aria-label="Direct link to IPv6 cheat sheet" title="Direct link to IPv6 cheat sheet" translate="no">​</a></h2>
<p>Here's a bunch of examples of various apps I've had to learn how to get working with IPv6:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="aws-load-balancer-controller">aws-load-balancer-controller<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#aws-load-balancer-controller" class="hash-link" aria-label="Direct link to aws-load-balancer-controller" title="Direct link to aws-load-balancer-controller" translate="no">​</a></h3>
<p>You don't need to configure the <a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/" target="_blank" rel="noopener noreferrer" class="">aws-load-balancer-controller</a> itself any differently for IPv6, but when creating Ingresses that use it, they need to have the following annotations to support IPv6:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Tells the controller to create target groups of pod IP(v6) addresses. </span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># The "instance" target type won't work on IPv6.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">alb.ingress.kubernetes.io/target-type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ip</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Tells the controller to create a load balancer with IPv6 enabled</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">alb.ingress.kubernetes.io/ip-address-type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> dualstack</span><br></span></code></pre></div></div>
<p>The nice thing about this is that the load balancer itself will listen on IPv4 addresses (in addition to IPv6 addresses), which means IPv4 clients won't even know the app has been migrated.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>If you're using <a href="https://github.com/kubernetes-sigs/external-dns" target="_blank" rel="noopener noreferrer" class="">external-dns</a> to create Route53 entries for your load balancer Ingresses, keep in mind that it will create both A records (for the load balancer's IPv4 addresses) and AAAA records (for its IPv6 addresses). This will change the behavior of any IPv6-enabled clients making connections to that load balancer, such that they may prefer the load balancer's IPv6 addresses over its IPv4 addresses.</p><p>This may be fine, but it is one way in which the "only IPv6 inside the cluster" model leaks. For example, if you have security groups on the load balancer, you'll need to make sure you're adding IPv6 versions of any rules.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="ingress-nginx">ingress-nginx<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#ingress-nginx" class="hash-link" aria-label="Direct link to ingress-nginx" title="Direct link to ingress-nginx" translate="no">​</a></h3>
<p>In the helm values for <a href="https://github.com/kubernetes/ingress-nginx" target="_blank" rel="noopener noreferrer" class="">ingress-nginx</a>, you need to set the <code>ipFamilies</code> value to include <code>IPv6</code>:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">controller</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">service</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">ipFamilies</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> IPv6</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mongodb">MongoDB<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#mongodb" class="hash-link" aria-label="Direct link to MongoDB" title="Direct link to MongoDB" translate="no">​</a></h3>
<p>Mongo binds to IPv4 only by default. You can get it listening on both IPv6 and IPv4 (dual-stack) with the following command override:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">mongod --ipv6 --bind_ip ::,0.0.0.0</span><br></span></code></pre></div></div>
<p>Here's an example of a Kubernetes pod:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Pod</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> mongo</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> 
</span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> mongo</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> mongo</span><span class="token punctuation" style="color:#393A34">:</span><span class="token number" style="color:#36acaa">8</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">command</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> mongod </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">ipv6</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">bind_ip </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"::,0.0.0.0"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">ports</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">containerPort</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">27017</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">protocol</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> TCP</span><br></span></code></pre></div></div>
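<p>If you'd rather not override the container command, the same two settings also exist as config file options (<code>net.ipv6</code> and <code>net.bindIp</code>). Here's a minimal sketch, assuming you mount the file into the pod yourself and start <code>mongod</code> with <code>--config</code> pointing at it. The file name and mount location are assumptions, not part of the pod example above:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">mongod.conf</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain"># Sketch only: how this file gets mounted and where it lives is up to you</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">net:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  port: 27017</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  ipv6: true            # config file equivalent of --ipv6</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  bindIp: "::,0.0.0.0"  # config file equivalent of --bind_ip ::,0.0.0.0</span><br></span></code></pre></div></div>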
<p>More info: <a href="https://www.mongodb.com/docs/manual/core/security-mongodb-configuration/" target="_blank" rel="noopener noreferrer" class="">https://www.mongodb.com/docs/manual/core/security-mongodb-configuration/</a></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="redis">Redis<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#redis" class="hash-link" aria-label="Direct link to Redis" title="Direct link to Redis" translate="no">​</a></h3>
<p>Redis binds to IPv4 only by default. You can change it to bind to all interfaces with the following command override:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">redis-server --bind "0.0.0.0 ::"</span><br></span></code></pre></div></div>
<p>Here's an example of a Kubernetes pod:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> v1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Pod</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> redis</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> 
</span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> redis</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> redis</span><span class="token punctuation" style="color:#393A34">:</span><span class="token number" style="color:#36acaa">5</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">command</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> redis</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">server </span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">bind</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"0.0.0.0 ::"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">ports</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">containerPort</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">6379</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">protocol</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> TCP</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mariadb">MariaDB<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#mariadb" class="hash-link" aria-label="Direct link to MariaDB" title="Direct link to MariaDB" translate="no">​</a></h3>
<p>MariaDB 5.5+ already listens on <code>::</code> by default, so no additional configuration is needed.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="localstack">LocalStack<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#localstack" class="hash-link" aria-label="Direct link to LocalStack" title="Direct link to LocalStack" translate="no">​</a></h3>
<p><a href="https://localstack.cloud/" target="_blank" rel="noopener noreferrer" class="">LocalStack</a> currently <a href="https://docs.localstack.cloud/references/network-troubleshooting/" target="_blank" rel="noopener noreferrer" class="">doesn't support IPv6</a>. However, I've opened a <a href="https://github.com/localstack/localstack/pull/11601" target="_blank" rel="noopener noreferrer" class="">PR to add IPv6 support</a>. If that PR gets merged, then you'll be able to use an IPv6 address in the <code>GATEWAY_LISTEN</code> env variable:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">GATEWAY_LISTEN=[::]:4566</span><br></span></code></pre></div></div>
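<p>In a pod spec, that would look something like the sketch below. The container name and image are placeholders, and the setting only helps once LocalStack actually supports IPv6 listeners:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">spec:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  containers:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - name: localstack              # placeholder name</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image: localstack/localstack  # placeholder; pin a version that includes IPv6 support</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    env:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - name: GATEWAY_LISTEN</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      value: "[::]:4566"</span><br></span></code></pre></div></div>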
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="rabbitmq">RabbitMQ<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#rabbitmq" class="hash-link" aria-label="Direct link to RabbitMQ" title="Direct link to RabbitMQ" translate="no">​</a></h3>
<p>RabbitMQ listens on <code>::</code> by default, so no additional configuration is needed.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>Note that while the <code>rabbitmq:management</code> image automatically binds the main AMQP port (5672) on IPv6, the management API (port 15672) does NOT bind to IPv6.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="nginx">nginx<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#nginx" class="hash-link" aria-label="Direct link to nginx" title="Direct link to nginx" translate="no">​</a></h3>
<p>The <code>nginx</code> image's default config listens on both IPv4 and IPv6.</p>
<p>If you're authoring your own <code>nginx.conf</code>, you need to add listeners for IPv6 and IPv4 separately. Here's an example of binding port 3001 on both IPv6 and IPv4:</p>
<div class="language-nginx codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">nginx.conf</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-nginx codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">listen       3001; # IPv4</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">listen  [::]:3001; # IPv6</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="otel-collector">OTEL collector<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#otel-collector" class="hash-link" aria-label="Direct link to OTEL collector" title="Direct link to OTEL collector" translate="no">​</a></h3>
<p>The <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry Collector</a> config accepts the <code>::</code> (all interfaces) address any place you could specify an IP address. For example:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">config</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">receivers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">otlp</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">protocols</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">grpc</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">endpoint</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[::]:4317"</span><br></span></code></pre></div></div>
<p>Other OTEL collector components will automatically use IPv6. For example, the Prometheus receiver correctly uses IPv6 pod addresses when scraping metrics.</p>
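<p>The same pattern should carry over to the OTLP receiver's HTTP protocol. This is a sketch that extends the example above, assuming you want OTLP over HTTP on its conventional port (4318) alongside gRPC:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">config:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  receivers:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    otlp:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      protocols:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        grpc:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          endpoint: "[::]:4317"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        http:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          endpoint: "[::]:4318"  # assumed: OTLP/HTTP on its conventional port</span><br></span></code></pre></div></div>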
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="jaeger">Jaeger<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#jaeger" class="hash-link" aria-label="Direct link to Jaeger" title="Direct link to Jaeger" translate="no">​</a></h3>
<p>Jaeger already listens on <code>::</code> by default, so no additional configuration is needed.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="wiremock">WireMock<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#wiremock" class="hash-link" aria-label="Direct link to WireMock" title="Direct link to WireMock" translate="no">​</a></h3>
<p><a href="https://github.com/wiremock/wiremock" target="_blank" rel="noopener noreferrer" class="">WireMock</a> already listens on <code>::</code> by default, so no additional configuration is needed.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="gradio">Gradio<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#gradio" class="hash-link" aria-label="Direct link to Gradio" title="Direct link to Gradio" translate="no">​</a></h3>
<p><a href="https://www.gradio.app/" target="_blank" rel="noopener noreferrer" class="">Gradio</a> binds to <code>127.0.0.1</code> by default. To bind to IPv6, pass the <code>server_name</code> parameter to the <code>launch()</code> method of the <a href="https://www.gradio.app/docs/gradio/blocks" target="_blank" rel="noopener noreferrer" class="">Blocks</a> object:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">blocks</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">launch</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">inline</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> server_port</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">5112</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> share</span><span class="token operator" style="color:#393A34">=</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> server_name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"[::]"</span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="uvicorn">Uvicorn<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#uvicorn" class="hash-link" aria-label="Direct link to Uvicorn" title="Direct link to Uvicorn" translate="no">​</a></h3>
<p><a href="https://www.uvicorn.org/" target="_blank" rel="noopener noreferrer" class="">Uvicorn</a> will bind to all IPv4 and IPv6 interfaces if you set <code>host='::'</code> in the <code>Config</code> object:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">ip_config </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">app</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">_fastapi_server</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> host</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"::"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> port</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">8080</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> Server</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">ip_config</span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre></div></div>
<p><a href="https://github.com/encode/uvicorn/discussions/1529#discussioncomment-3061823" target="_blank" rel="noopener noreferrer" class="">More info</a></p>
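<p>If you start Uvicorn from the command line instead of from Python, the <code>--host</code> flag takes the same value. Here's a sketch of a Kubernetes pod that does that. The pod name, the image, and the <code>main:app</code> import string are all placeholders, not taken from a real project:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">apiVersion: v1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">kind: Pod</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">metadata:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  name: api                      # placeholder name</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">spec:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  containers:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  - name: api</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image: example.com/my-api:1  # placeholder image with uvicorn installed</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    command:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - uvicorn</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - "main:app"                 # placeholder module:attribute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - --host</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - "::"</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - --port</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - "8080"                     # quoted, since command entries must be strings</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    ports:</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    - containerPort: 8080</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      protocol: TCP</span><br></span></code></pre></div></div>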
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="more-ipv6-cheat-sheet-examples-please">More IPv6 cheat sheet examples, please!<a href="https://labaneilers.com/incremental-ipv6-with-kubernetes#more-ipv6-cheat-sheet-examples-please" class="hash-link" aria-label="Direct link to More IPv6 cheat sheet examples, please!" title="Direct link to More IPv6 cheat sheet examples, please!" translate="no">​</a></h2>
<p>I'll be adding more IPv6/dual-stack configuration examples as I encounter them.</p>
<p>Do you have more? Leave them in the comments and I'll add them to the list!</p>]]></content:encoded>
            <category>devops</category>
            <category>platform-engineering</category>
            <category>kubernetes</category>
        </item>
        <item>
            <title><![CDATA[What would an OSS developer platform even look like?]]></title>
            <link>https://labaneilers.com/what-would-an-oss-developer-platform-look-like</link>
            <guid>https://labaneilers.com/what-would-an-oss-developer-platform-look-like</guid>
            <pubDate>Mon, 23 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[My team has built a developer platform that our developers really like, and it's providing a ton of value for my company. But I'm struggling to figure out if and how we might open-source it. I'm looking for advice from you.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>My team has built a developer platform that our developers really like, and it's providing a ton of value for my company. But I'm struggling to figure out if and how we might open-source it. I'm looking for advice from you.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/toolbox-446cdd1b72afbdc5a44918e9f2a6eb3d.jpg" class="blog-image" alt="A toolbox">
<p>As a platform engineer, I enjoy the benefits of working in a field with a vibrant ecosystem of open source infrastructure and developer tools. I've spent much of the last decade building developer platforms by curating and assembling these tools, and after a number of iterations, I seem to have hit on something that's working really well for my current company (SimpliSafe).</p>
<p>As our platform's adoption has grown, we've gotten more and more positive, heartwarming feedback from developers who really like it. This is <em>absolutely freaking delightful</em>, and honestly never stops surprising me.</p>
<p>I often get asked by our developers if we should consider open-sourcing the platform. I've spent some cycles entertaining the idea, but I usually don't get very far before it seems unworkable.</p>
<p>This post is an experiment in thinking in public; I'd like to brain dump my thoughts on the challenges of building an open-source developer PaaS, in the hopes that the platform engineering community might provide some insight to get me past this block.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="so-tell-me-more-about-this-platform">So, tell me more about this platform<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#so-tell-me-more-about-this-platform" class="hash-link" aria-label="Direct link to So, tell me more about this platform" title="Direct link to So, tell me more about this platform" translate="no">​</a></h2>
<p>Our platform is named "dex/EKS", which is an (admittedly awkward) combination of the name of the client tool, "dex", with the AWS service the server-side is built on (EKS: AWS's managed Kubernetes service). Unsurprisingly, developers tend to just call the whole thing "dex".</p>
<p>In the spirit of the "Platform Engineering" buzzword, dex/EKS encapsulates our company's collective opinions, policies, and best practices for building, deploying, and operating apps. I like to think of it as a PaaS that we've curated and glued together out of a bunch of open-source and vendor tools.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p><code>dex</code> (the client tool itself) is a CLI tool for interacting with the platform. Picture the <code>flyctl</code>, <code>vercel</code>, or <code>heroku</code> CLI.</p><p><code>dex</code> is intentionally lowercase, or as I like to call it: "hipster-case". Or maybe camelCase without the humps? I dunno. It's a thing.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-kubernetes-distribution">Our Kubernetes distribution<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#our-kubernetes-distribution" class="hash-link" aria-label="Direct link to Our Kubernetes distribution" title="Direct link to Our Kubernetes distribution" translate="no">​</a></h3>
<p>In addition to the client tooling, we also have a fairly sophisticated Kubernetes "distribution", which consists of a bunch of curated cluster-side components, combined and configured to work well together. Our cluster configuration is defined with Terraform, and we use it with GitHub Actions to manage many dozens of EKS clusters. Beyond that, there are integrations with a bunch of third-party SaaS providers, including AWS services and other vendors.</p>
<p>Just to give you a sense of the ingredients that comprise the platform, here's a partial list:</p>
<ul>
<li class=""><a href="https://kubernetes.io/" target="_blank" rel="noopener noreferrer" class="">Kubernetes (AWS EKS)</a></li>
<li class=""><a href="https://www.docker.com/" target="_blank" rel="noopener noreferrer" class="">Docker</a>/<a href="https://github.com/moby/buildkit" target="_blank" rel="noopener noreferrer" class="">BuildKit</a></li>
<li class=""><a href="https://github.com/enterprise" target="_blank" rel="noopener noreferrer" class="">GitHub Enterprise</a> (with <a href="https://github.com/features/actions" target="_blank" rel="noopener noreferrer" class="">GitHub Actions</a>)</li>
<li class=""><a href="https://www.okta.com/" target="_blank" rel="noopener noreferrer" class="">Okta</a></li>
<li class=""><a href="https://jfrog.com/artifactory/" target="_blank" rel="noopener noreferrer" class="">Artifactory</a></li>
<li class=""><a href="https://grafana.com/products/cloud/" target="_blank" rel="noopener noreferrer" class="">Grafana Cloud</a></li>
<li class=""><a href="https://www.honeycomb.io/" target="_blank" rel="noopener noreferrer" class="">Honeycomb</a></li>
<li class=""><a href="https://opensearch.org/" target="_blank" rel="noopener noreferrer" class="">OpenSearch</a> (for logs)</li>
<li class="">A bunch of AWS services (ECR, S3, SSM, Secrets Manager, Route53, ACM, WAF, etc.)</li>
</ul>
<p>Here are a few of the tools (from the Kubernetes ecosystem) we use in our EKS configuration:</p>
<ul>
<li class=""><a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/" target="_blank" rel="noopener noreferrer" class="">aws-load-balancer-controller</a></li>
<li class=""><a href="https://github.com/kubernetes/ingress-nginx" target="_blank" rel="noopener noreferrer" class="">ingress-nginx</a></li>
<li class=""><a href="https://karpenter.sh/" target="_blank" rel="noopener noreferrer" class="">Karpenter</a></li>
<li class=""><a href="https://kubernetes-sigs.github.io/external-dns/v0.14.0/" target="_blank" rel="noopener noreferrer" class="">external-dns</a></li>
<li class=""><a href="https://external-secrets.io/latest/" target="_blank" rel="noopener noreferrer" class="">external-secrets</a></li>
<li class=""><a href="https://fluentbit.io/" target="_blank" rel="noopener noreferrer" class="">Fluent Bit</a></li>
<li class=""><a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OTEL (OpenTelemetry) collectors</a></li>
<li class=""><a href="https://keda.sh/" target="_blank" rel="noopener noreferrer" class="">KEDA</a></li>
<li class=""><a href="https://kyverno.io/" target="_blank" rel="noopener noreferrer" class="">Kyverno</a></li>
<li class=""><a href="https://velero.io/" target="_blank" rel="noopener noreferrer" class="">Velero</a></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="dex-the-cli-tool">dex: the CLI tool<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#dex-the-cli-tool" class="hash-link" aria-label="Direct link to dex: the CLI tool" title="Direct link to dex: the CLI tool" translate="no">​</a></h3>
<p>The CLI tool abstracts and integrates the APIs of these infrastructure components and exposes them through a simplified, declarative set of configuration and CLI commands. Some of the things it handles:</p>
<ul>
<li class="">Configuration management (what settings your app gets in different environments)</li>
<li class="">Secrets management</li>
<li class="">Cross-platform and multi-platform container image builds</li>
<li class="">User authentication (i.e. SAML auth via Okta, to Kubernetes, AWS, and Artifactory)</li>
<li class="">AWS IAM integration (allows you to assign AWS permissions to your app)</li>
<li class="">Kubernetes manifest management (imagine a simplified version of helm)</li>
<li class="">Ingress management (load balancers, certs, and DNS)</li>
<li class="">Vulnerability scanning</li>
<li class="">CI/CD integration (Github Actions)</li>
<li class="">Telemetry pipeline integration</li>
<li class="">Docker/Kubernetes-based integration testing framework</li>
</ul>
<p>One of the key metrics we hold ourselves to is that developers have to be able to get a new "hello world" service up and running in less than 10 minutes, at which point they can turn their focus to business problems. As they get closer to production, they have a few more decisions to make about autoscaling, observability, etc, but for the most part, the platform narrows down the choices to just a few fully-baked, meticulously documented, well-trodden paths.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>dex works off-road too</div><div class="admonitionContent_BuS1"><p>dex has some other extensibility mechanisms for more advanced use cases, such as the ability to author custom commands with arbitrary TypeScript, which can re-compose existing dex commands and any of its constituent APIs.</p><p>Teams sometimes use this extensibility to explore the frontier of what's possible. If they find a new pattern to be useful, we will often incorporate it into the platform.</p><p>For example, dex's multi-region DNS configuration support was originally built by another team, who then contributed it upstream, so everyone else in the company could use it.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-impact-of-dex-at-simplisafe">The impact of dex at SimpliSafe<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#the-impact-of-dex-at-simplisafe" class="hash-link" aria-label="Direct link to The impact of dex at SimpliSafe" title="Direct link to The impact of dex at SimpliSafe" translate="no">​</a></h3>
<p>Teams at SimpliSafe have migrated the majority of our services to dex/EKS, and most teams are planning on moving their remaining services over in the next year or so. This has happened with close to zero pressure from management; teams are moving their services to the platform because they're much happier with it than without it.</p>
<p>Suffice it to say, I'm very proud of this outcome, and dex seems to be providing a lot of value.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-platform-design-reflects-a-companys-culture">A platform design reflects a company's culture<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#a-platform-design-reflects-a-companys-culture" class="hash-link" aria-label="Direct link to A platform design reflects a company's culture" title="Direct link to A platform design reflects a company's culture" translate="no">​</a></h2>
<p>While it may appear that dex is just a set of choices about tools and services that have been manifested in glue code, it also reflects SimpliSafe's values and organizational culture. While most of these values and cultural properties are fairly well subscribed, they're by no means universal:</p>
<ul>
<li class="">Teams should have a great deal of autonomy to choose tools, languages, frameworks, and processes, and they should be accountable for operating the systems they build</li>
<li class="">Continuous delivery is better than big-bang releases</li>
<li class="">Microservices are a good way for a large team to build a big system</li>
<li class="">A central team should own cross-cutting concerns like telemetry pipelines and observability backend tools, auth, infrastructure provisioning, etc</li>
<li class="">Infrastructure should be represented as code and managed through automation</li>
</ul>
<p>There's a bunch of others, but you get the picture.</p>
<p>These values deeply inform the design of dex, and it's interesting to look back on the iterations of platforms that I've built at different companies with different values, and how they were reflected in the design of the platforms.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="example-ephemeral-environments">Example: ephemeral environments<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#example-ephemeral-environments" class="hash-link" aria-label="Direct link to Example: ephemeral environments" title="Direct link to Example: ephemeral environments" translate="no">​</a></h2>
<p>Ephemeral developer environments are a feature I 100% know is a huge win for developers, regardless of your company culture. But there have been big differences in design, features, and implementation details when I've built this feature at different companies.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-are-ephemeral-environments">What are ephemeral environments?<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#what-are-ephemeral-environments" class="hash-link" aria-label="Direct link to What are ephemeral environments?" title="Direct link to What are ephemeral environments?" translate="no">​</a></h3>
<p>Here's the gist: A developer should be able to deploy their app into an isolated, temporary environment from their local machine into Kubernetes, with a single command, so they can iterate, tweak, and test ideas in a production-like environment. They'll need a URL to access the app, so they can play with usability, do manual testing, attach a debugger, troubleshoot config, etc. When they're done, they use another command to tear the whole thing down.</p>
<p>Additionally, for every feature branch they push to Github (or Gitlab, Bitbucket, etc), a CI job will deploy an ephemeral environment with the same properties, run automated tests, and tear it all down when complete (or after a specified period of time).</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="ephemeral-environments-in-a-financial-institution">Ephemeral environments in a financial institution<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#ephemeral-environments-in-a-financial-institution" class="hash-link" aria-label="Direct link to Ephemeral environments in a financial institution" title="Direct link to Ephemeral environments in a financial institution" translate="no">​</a></h3>
<p>At my last job, in a commercial bank, we had a very small number of tightly controlled, multi-tenant Kubernetes (OpenShift) clusters. As you might imagine in a highly-regulated industry, the creation of Kubernetes namespaces (and the associated access controls) was governed by security controls, required approvals, and needed to leave an audit trail. The process was managed by a central team.</p>
<p>To allow creation of dynamic, isolated environments, we worked within the static structure of centrally managed namespaces by designing our tooling to generate Kubernetes objects using strict naming conventions (e.g. prefixing all resources with the name of the developer or feature branch). This allowed the tooling to manage the objects as a unit, avoid collisions, and ensure that the objects were cleaned up when the developer was done.</p>
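<p>To make the convention concrete, here's a minimal TypeScript sketch of prefix-based scoping. It is not the bank's actual tooling: the helper names (<code>toDnsLabel</code>, <code>scopeToEnvironment</code>, <code>cleanupSelector</code>), the 20-character prefix cap, and the <code>example.com/ephemeral-env</code> label key are all assumptions made for illustration.</p>

```typescript
// Hypothetical sketch: none of these names come from the real tooling.
interface K8sObject {
  kind: string;
  metadata: { name: string; labels: { [key: string]: string } };
}

// Many Kubernetes names (Services, for example) must be valid DNS labels:
// lowercase alphanumerics and "-", at most 63 characters, and they must
// start and end with an alphanumeric character.
function toDnsLabel(raw: string, maxLength = 63): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything else into a single "-"
    .replace(/^-+|-+$/g, "") // trim leading and trailing dashes
    .slice(0, maxLength)
    .replace(/-+$/, ""); // truncation may leave a trailing dash
}

// Prefix every object with the environment name and tag it with a shared
// label, so the whole environment can be found (and deleted) as a unit.
// A real tool would also have to rewrite selectors and cross-references.
function scopeToEnvironment(envName: string, objects: K8sObject[]): K8sObject[] {
  const prefix = toDnsLabel(envName, 20);
  return objects.map((obj) => ({
    ...obj,
    metadata: {
      ...obj.metadata,
      name: toDnsLabel(`${prefix}-${obj.metadata.name}`),
      labels: { ...obj.metadata.labels, "example.com/ephemeral-env": prefix },
    },
  }));
}

// The label selector a cleanup job could use to find everything it created.
function cleanupSelector(envName: string): string {
  return `example.com/ephemeral-env=${toDnsLabel(envName, 20)}`;
}

const scoped = scopeToEnvironment("feature/JIRA-123_new-login", [
  { kind: "Deployment", metadata: { name: "web", labels: { app: "web" } } },
  { kind: "Service", metadata: { name: "web", labels: { app: "web" } } },
]);
console.log(scoped.map((o) => `${o.kind}/${o.metadata.name}`));
console.log(cleanupSelector("feature/JIRA-123_new-login"));
```

<p>Because every object carries the same prefix and label, cleanup can target the whole environment with a single label selector instead of tracking objects one by one.</p>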
<p>This design decision trickled into many other aspects of the system. For instance, we designed our client tooling to maintain pretty tight control over rendering the objects, and the relationship of objects via names, labels, and selectors.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="ephemeral-environments-at-an-iot-security-company">Ephemeral environments at an IoT security company<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#ephemeral-environments-at-an-iot-security-company" class="hash-link" aria-label="Direct link to Ephemeral environments at an IoT security company" title="Direct link to Ephemeral environments at an IoT security company" translate="no">​</a></h3>
<p>At SimpliSafe, the company's culture and preexisting architecture enabled a very different approach: ephemeral environments are implemented via Kubernetes namespaces, and the client tooling can create (and destroy) namespaces dynamically.</p>
<p>Because we have per-team AWS accounts, and our Kubernetes clusters already provide strong isolation, we're comfortable giving developers the power to manage namespaces in our non-production environments. This removes a lot of the need for strict control over object relationships in Kubernetes, and gives developers more flexibility to mess with the underlying objects more directly.</p>
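<p>For contrast, here's a rough sketch of the namespace-per-environment approach. The manifest shapes follow the Kubernetes API, but the helper names (<code>ephemeralNamespace</code>, <code>placeInNamespace</code>) and the <code>example.com/ephemeral</code> label are hypothetical, and this isn't how dex is actually implemented.</p>

```typescript
// Hypothetical sketch: helper names and the label key are made up.
interface Manifest {
  apiVersion: string;
  kind: string;
  metadata: { name: string; namespace?: string; labels?: { [key: string]: string } };
}

// Build the Namespace object that represents one ephemeral environment.
function ephemeralNamespace(app: string, envName: string): Manifest {
  const name = `${app}-${envName}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // namespace names must be DNS labels
    .replace(/^-+|-+$/g, "")
    .slice(0, 63)
    .replace(/-+$/, "");
  return {
    apiVersion: "v1",
    kind: "Namespace",
    metadata: { name, labels: { "example.com/ephemeral": "true" } },
  };
}

// Objects keep their ordinary names; only metadata.namespace changes.
function placeInNamespace(ns: Manifest, objects: Manifest[]): Manifest[] {
  return objects.map((obj) => ({
    ...obj,
    metadata: { ...obj.metadata, namespace: ns.metadata.name },
  }));
}

const ns = ephemeralNamespace("checkout", "alice/try-new-cache");
const placed = placeInNamespace(ns, [
  { apiVersion: "apps/v1", kind: "Deployment", metadata: { name: "web" } },
]);
console.log(ns.metadata.name, placed[0].metadata.name, placed[0].metadata.namespace);
```

<p>Since the namespace is the unit of isolation, object names don't need prefixes, and teardown can be as simple as deleting the namespace, which removes the namespaced objects inside it.</p>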
<p>This additional power is a reflection of SimpliSafe's culture of autonomy and trust in developers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="different-tradeoffs-different-design">Different tradeoffs, different design<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#different-tradeoffs-different-design" class="hash-link" aria-label="Direct link to Different tradeoffs, different design" title="Direct link to Different tradeoffs, different design" translate="no">​</a></h3>
<p>So even with the same feature, providing pretty similar benefits to developers, we had to make very different tradeoff decisions, and ended up with design differences which significantly impact the architecture, features, and feel of the rest of the platform.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-would-this-look-like-open-sourced">What would this look like open-sourced?<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#what-would-this-look-like-open-sourced" class="hash-link" aria-label="Direct link to What would this look like open-sourced?" title="Direct link to What would this look like open-sourced?" translate="no">​</a></h2>
<p>Given the two examples of companies with different infrastructure opinions, let's think through the possible flavors of open-sourcing a developer platform like dex:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="option-1-a-hyper-opinionated-paas-in-a-box">Option 1: A hyper-opinionated "PaaS in a box"<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#option-1-a-hyper-opinionated-paas-in-a-box" class="hash-link" aria-label="Direct link to Option 1: A hyper-opinionated &quot;PaaS in a box&quot;" title="Direct link to Option 1: A hyper-opinionated &quot;PaaS in a box&quot;" translate="no">​</a></h3>
<p>This option assumes that the infrastructure decisions we've made at SimpliSafe would be a good fit for at least a bunch of other companies, with minimal modification. We'd provide the whole thing, end-to-end, including the EKS cluster configuration and terraform, all the cluster-side system components, and the dex client-side tool.</p>
<p>I find this option hard to imagine for a few reasons:</p>
<ul>
<li class="">While I'm very confident we've got a great solution for SimpliSafe, I think it's virtually impossible that any other company would be happy with <em>all</em> our opinions (the bank certainly wouldn't have been). Our platform glues together <em>scores</em> of specific OSS products (and a number of SaaS vendor tools), and the odds that <em>every one of them</em> lines up with another company's preferences is close to zero.</li>
<li class="">A platform engineering team using this version of the platform would be signing up to build expertise and support every OSS product we've chosen.</li>
<li class="">While an out-of-the-box, opinionated platform might be good for a startup, our platform is certainly NOT the right choice for a startup. It's designed around supporting many teams, and to allow a central platform engineering team to manage infrastructure underneath teams' apps... which is not the problem engineers at a startup should be worrying about.</li>
<li class="">Among the opinions encapsulated in our platform are some we're not happy about. We have a few compromises based on legacy infrastructure choices that are hard to change, and some choices which are an intermediate phase between where we are and where we want to go. For example, we're currently using the telegraf-operator to collect metrics for lots of our services, but we'd prefer to be using OTEL SDKs and/or Prometheus libraries.</li>
</ul>
<p>I actually can't imagine myself choosing to use someone else's OSS platform if it were built on this philosophy.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="option-2-a-whole-platform-but-pluggable">Option 2: A whole platform, but pluggable<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#option-2-a-whole-platform-but-pluggable" class="hash-link" aria-label="Direct link to Option 2: A whole platform, but pluggable" title="Direct link to Option 2: A whole platform, but pluggable" translate="no">​</a></h3>
<p>In this variant, we'd also provide the whole platform, but allow users to bring their own infrastructure opinions via a plugin API.</p>
<p>I also see some big disadvantages here:</p>
<ul>
<li class="">Abstraction layers add complexity. Part of the value of dex is that the code is relatively simple, straightforward, and hackable. We often get a PR or feature request, and end up cutting a new release within hours. This would not remain the case if we started adding abstraction layers everywhere.</li>
<li class="">Testing and maintaining compatibility with all possible plugins would be a huge burden. Right now, dex's integration tests are both comprehensive and fast, and it would be virtually impossible to maintain this level of coverage if we had to test against an ecosystem of plugins.</li>
<li class="">It's <em>really hard</em> to build good abstraction layers, even for simple things. And these infrastructure components are <em>definitely not simple</em>. We'd be constantly expanding and modifying the APIs to support additional opinions, and the abstractions would inevitably leak.</li>
<li class="">Many of the infrastructure choices we've made allow us to simplify the design of the platform, and these simplifying assumptions wouldn't be valid if we allowed arbitrary plugins. Tight coupling, in this case, is part of the special sauce for creating a really streamlined and cohesive developer experience.</li>
<li class="">Comprehensive documentation would be much more complicated and far less useful, since docs would have to simultaneously support the perspective both of the platform developer as well as the end-user developer. dex has lots of docs based on developer use cases, and it wouldn't be possible to provide these if the whole experience were built on plugins.</li>
<li class="">Kubernetes APIs already provide so much power and extensibility, and especially when you throw in Operators, CRDs, and custom controllers, it's hard to imagine how I could provide APIs that would support all the flexibility Kubernetes offers.</li>
</ul>
<p>I think realistically, this solution would start as option 1 and then abstractions would be gradually added by contributors to support their particular infrastructure choices, so it's probably best to think of this option as a spectrum with more or less pluggability.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="option-3-a-toolkit-for-building-your-own-platform">Option 3: A toolkit for building your own platform<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#option-3-a-toolkit-for-building-your-own-platform" class="hash-link" aria-label="Direct link to Option 3: A toolkit for building your own platform" title="Direct link to Option 3: A toolkit for building your own platform" translate="no">​</a></h3>
<p>Another option is to factor out individual components of the platform as standalone libraries, and let people build their own platform. I could imagine some of dex's components being useful for someone who wants to build a different opinionated platform.</p>
<p>One example of a generally useful component is our config system:</p>
<ul>
<li class="">Our config schema is defined as a tree of TypeScript classes, which can be used to generate a JSON schema, which can be used by other tooling to provide instant validation (e.g. via VSCode's JSON schema integration), to validate at runtime, and also to generate documentation.</li>
<li class="">The config system supports defining arbitrary target environments, which can use inheritance (and other mechanisms) to share common settings, and override them as needed.</li>
<li class="">It has a mechanism for declaring dynamic config values (e.g. a value from a Parameter Store secret, or based on the current git branch name, etc).</li>
<li class="">The config loader returns a config tree object which is built out of JavaScript Proxy objects, which allows us to do very smart validation, with user friendly error messages, and play to TypeScript's strengths.</li>
</ul>
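<p>Two of these ideas, environment inheritance and Proxy-based validation, can be sketched in a few lines of TypeScript. To be clear, this is not dex's actual config API: the types, the <code>inherits</code> field, and the function names (<code>deepMerge</code>, <code>resolveEnvironment</code>, <code>strict</code>) are all invented for illustration, and the schema generation and dynamic values are left out entirely.</p>

```typescript
// Hypothetical sketch: not dex's real config system.
type ConfigValue = string | number | boolean | ConfigTree;
interface ConfigTree {
  [key: string]: ConfigValue;
}

interface EnvironmentDef {
  inherits?: string; // name of a parent environment (assumed field name)
  settings: ConfigTree;
}

function isTree(v: ConfigValue | undefined): v is ConfigTree {
  return typeof v === "object" && v !== null;
}

// Recursively merge override into base; scalar values in override win.
function deepMerge(base: ConfigTree, override: ConfigTree): ConfigTree {
  const out: ConfigTree = { ...base };
  for (const key of Object.keys(override)) {
    const b = out[key];
    const o = override[key];
    out[key] = isTree(b) && isTree(o) ? deepMerge(b, o) : o;
  }
  return out;
}

// Resolve an environment by walking up its inheritance chain.
// (No cycle detection here; a real implementation would need it.)
function resolveEnvironment(envs: { [name: string]: EnvironmentDef }, name: string): ConfigTree {
  const env = envs[name];
  if (!env) throw new Error(`Unknown environment "${name}"`);
  const parent = env.inherits ? resolveEnvironment(envs, env.inherits) : {};
  return deepMerge(parent, env.settings);
}

// Wrap a config tree in Proxy objects so that reading a missing key fails
// immediately with a readable message instead of returning undefined.
function strict(tree: ConfigTree, path = "config"): ConfigTree {
  return new Proxy(tree, {
    get(target, prop) {
      if (typeof prop !== "string") return Reflect.get(target, prop);
      if (!(prop in target)) {
        const known = Object.keys(target).join(", ");
        throw new Error(`${path}.${prop} is not defined (known keys: ${known})`);
      }
      const value = target[prop];
      return isTree(value) ? strict(value, `${path}.${prop}`) : value;
    },
  });
}

const environments: { [name: string]: EnvironmentDef } = {
  base: { settings: { replicas: 1, ingress: { host: "app.example.com", tls: true } } },
  staging: { inherits: "base", settings: { replicas: 2, ingress: { host: "staging.example.com" } } },
};

const config = strict(resolveEnvironment(environments, "staging"));
const ingress = config.ingress as ConfigTree;
console.log(config.replicas, ingress.host, ingress.tls);
```

<p>Here <code>staging</code> overrides <code>replicas</code> and <code>ingress.host</code> but inherits <code>ingress.tls</code> from <code>base</code>, and a typo like <code>ingress.hots</code> throws an error that names the full path and lists the keys that do exist.</p>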
<p>That said, turning this config module into a separate npm package would have some tradeoffs:</p>
<ul>
<li class="">The inherent packaging tax: working with multiple npm packages is more complex to develop, debug, and test locally.</li>
<li class="">Its abstractions would feel leaky to a user. For example, JSON schema generation requires some special build configuration, and this would appear a bit finicky if it was intended to be used off the shelf.</li>
<li class="">There's some other aspects of our config system that are currently tightly coupled with other parts of the platform. This is all just code, so of course we could figure out how to decouple it, but there would be a decent amount of net-new complexity as a result.</li>
</ul>
<p>More generally, I think the challenge with this approach is that most of the value of the platform stems not from the individual components, but from <em>their integration</em>. For example:</p>
<ul>
<li class="">dex's packaging/distribution mechanism (a stable CLI + a fast-moving, versioned library) has many moving parts</li>
<li class="">dex's own build system is very sophisticated, and has a lot of features around building canary releases, and enabling debugging in a sample host project</li>
<li class="">dex also has a suite of integration tests that are fairly involved and comprehensive</li>
<li class="">The documentation of dex's UI (both its command line interface and config interface) is a <em>huge</em> factor in dex's success at SimpliSafe, and would have to be built from scratch for a new platform.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-plea-for-help">A plea for help<a href="https://labaneilers.com/what-would-an-oss-developer-platform-look-like#a-plea-for-help" class="hash-link" aria-label="Direct link to A plea for help" title="Direct link to A plea for help" translate="no">​</a></h2>
<p>So I'm sitting on this great set of tooling, which is providing a ton of value for my company. It's built on OSS, public cloud, and SaaS services, and there's no proprietary magic or novel intellectual property we're trying to protect. It solves a problem that a huge number of medium-large technology companies would have to tackle.</p>
<p>Why can't I see a way to share this with the world? Maybe I'm just not being imaginative enough. I'm 100% certain this isn't a novel situation.</p>
<p>What do you think?</p>]]></content:encoded>
            <category>devops</category>
            <category>platform-engineering</category>
            <category>kubernetes</category>
        </item>
        <item>
            <title><![CDATA[Building culture is hard, sustaining it is harder]]></title>
            <link>https://labaneilers.com/knowledge-management-culture</link>
            <guid>https://labaneilers.com/knowledge-management-culture</guid>
            <pubDate>Tue, 06 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[I experienced first-hand what it was like to work in a company with a really strong culture of knowledge management, and watched what it took to build and sustain it. I also witnessed the factors that caused it to eventually crumble.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>I experienced first-hand what it was like to work in a company with a really strong culture of knowledge management, and watched what it took to build and sustain it. I also witnessed the factors that caused it to eventually crumble.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/aqueduct-947fe484eb2dc45e50c357a55f92a86c.jpg" class="blog-image" alt="A Roman aqueduct">
<p>My current company is struggling with some challenges that are pretty typical for a wildly successful startup that's rapidly grown into a medium-sized company. We've got the expected technical debt, organizational design challenges, and a seemingly infinite number of small systems that work great... until they catch fire as we hit new scaling thresholds.</p>
<!-- -->
<p>I've been through this a few times before, and none of this is worrisome to me. It's exactly where I'd expect a company to be after a period of frenetic growth. In a lot of ways, this is a really fun stage; if you're someone who can tolerate a bit of chaos and ambiguity, there's tons of opportunities to shape the direction of a company.</p>
<p>On one hand, at a startup, your priorities are dominated by survival and existential risk. But once you grow beyond a certain size, you begin to lose leverage as cultural inertia takes over. A scrappy, determined visionary can exert a lot of leverage during the awkward teenage years of a mid-sized company.</p>
<p>I want to tell a story about how I watched such a visionary person, at this moment in a company's lifespan, solve a particular problem around knowledge management, and how he drove a really compelling, sustained, positive cultural change.</p>
<p>Yeah, and also how it eventually collapsed, and why.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="someone-should-go-ask-rob-why-this-one-button-is-purple">Someone should go ask Rob why this one button is purple<a href="https://labaneilers.com/knowledge-management-culture#someone-should-go-ask-rob-why-this-one-button-is-purple" class="hash-link" aria-label="Direct link to Someone should go ask Rob why this one button is purple" title="Direct link to Someone should go ask Rob why this one button is purple" translate="no">​</a></h2>
<p>When I joined Vistaprint, it was already very successful and growing rapidly. It'd been a few years since I'd hopped jobs, so I was a bit nervous and eager to dive in and start getting shit done. This turned out to be a bit of a longer process than I had hoped.</p>
<p>While Vistaprint had a number of cultural virtues, one very annoying feature was how much it relied on oral tradition, and a few linchpin people who functioned as repositories of all knowledge. Seriously, there were like 5 or 6 people in the company that collectively held about 80% of the total knowledge, and it was getting increasingly hard for them to get any work done as the number of people whose questions they needed to answer was growing.</p>
<p>Here's some of the kinds of things I'm talking about:</p>
<ul>
<li class="">how to get your developer environment set up</li>
<li class="">the reasoning behind a particular software system's design</li>
<li class="">what UI ideas had been tested, and if they had succeeded or failed</li>
<li class="">why a particular compiler flag was used</li>
<li class="">how you'd get a new feature toggle enabled for analytics</li>
<li class="">who the hell owns any particular piece of code</li>
</ul>
<p>It was also more than just technical knowledge. HR policies, business strategy docs, organizational charts, holiday schedules; if these things existed, they were stored as a Word doc on someone's share drive, and were impossible to discover unless you already knew what you were looking for.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="dan-builds-a-wiki">Dan builds a wiki<a href="https://labaneilers.com/knowledge-management-culture#dan-builds-a-wiki" class="hash-link" aria-label="Direct link to Dan builds a wiki" title="Direct link to Dan builds a wiki" translate="no">​</a></h2>
<p>My colleague <a href="https://danieljbarrett.com/" target="_blank" rel="noopener noreferrer" class="">Daniel Barrett</a> (who you may know as a <a href="https://www.oreilly.com/pub/au/426" target="_blank" rel="noopener noreferrer" class="">prolific author</a> of some fantastic books on Linux and other topics) was already a veteran at Vistaprint when I joined, and was notably one of the "grownups" amidst a bunch of young upstarts. As part of his more general management duties, he had volunteered to figure out how we were going to train the raging torrent of new hires that seemed to be showing up every day.</p>
<p>Dan recognized that training was actually a subset of the larger problem of knowledge management, and convinced the technology leadership team that we needed a more fundamental solution. He got to work on some ideas and started building a small prototype.</p>
<p>Not long afterwards, Dan presented at a technology team all-hands meeting, and introduced us all to a new wiki system (which at the time was called "TechWiki") that he'd built on top of <a href="https://github.com/wikimedia/mediawiki" target="_blank" rel="noopener noreferrer" class="">MediaWiki</a> (the software that powers Wikipedia). He had already seeded it with a straw-man categorization (based on his personal knowledge of the systems), and a number of stub pages. He explained his intention that we should all just start using it to write stuff down, and not worry so much about organization. He was going to work with teams to watch and learn how it was being used, and help coordinate efforts as it evolved.</p>
<p>There was a lot of skepticism across a number of fronts. Here are a few of the major objections that I remember:</p>
<ul>
<li class="">Everyone had full access to edit any page, at any time. How would we prevent people from screwing up each other's content?</li>
<li class="">Without a strict, hierarchical taxonomy, wouldn't everything just spiral into an unmaintainable mess?</li>
<li class="">Developers aren't particularly known to enjoy writing documentation. How are we going to get people to take time away from engineering to write stuff down?</li>
</ul>
<p>Dan had the wisdom to respond to these questions with the only correct answer: he didn't know. We were just going to try some stuff and see what happened.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-wiki-takes-off">The wiki takes off<a href="https://labaneilers.com/knowledge-management-culture#the-wiki-takes-off" class="hash-link" aria-label="Direct link to The wiki takes off" title="Direct link to The wiki takes off" translate="no">​</a></h2>
<p>It took a few months, but we started to notice that adoption of the wiki seemed to be growing pretty quickly. Daniel would pop by to check on us, let us know he had been reading what we'd been writing, and had some tips on how to use the wiki more effectively. He'd give us suggestions on organization, tone, and some general style tips.</p>
<p>He also encouraged us to be less "precious" with the wiki. He said it was a great place to take meeting notes, add team-specific content, and that we should feel free to create stub pages for topics we wished existed (but didn't know much about ourselves). What's more, he said we shouldn't hesitate at all to edit articles when we found missing or out-of-date information, regardless of whether we were the original author, or even if we didn't have any specific expertise or claim on the topic.</p>
<p>In the beginning, I'd often get email notifications that Dan had made minor edits to pages I'd created, usually normalizing titles, fixing typos, or adding category tags. It wasn't very long until I started noticing other people editing my pages. At first, I'd look at every change with some suspicion, but as it turns out... everyone editing my pages was actually doing a great job. They were genuinely improving content I'd written and adding details I'd missed. Even if the content they added wasn't 100% right, it was at least useful for me to know what answers they actually wanted, so I could correct any errors.</p>
<p>Meanwhile, while Dan continued his evangelization, he was also building a team that was driving improvements to the wiki software itself. We got more integrations with our other systems (e.g. JIRA integration, queries into our analytics database, links to shared drives, etc), and better search indexing. They were also working furiously in the background to keep the content categorized in a sane way, fixing typos and structural inconsistencies, adding searchable summaries, and making it possible to relax the stress on authors, while also maintaining some semblance of consistent organization.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Fun fact</div><div class="admonitionContent_BuS1"><p>From this experience, Dan later wrote the <a href="https://danieljbarrett.com/books/mediawiki/" target="_blank" rel="noopener noreferrer" class="">literal book on MediaWiki</a>.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="imagine-a-world-where-all-the-shit-is-written-down">Imagine a world where all the shit is written down<a href="https://labaneilers.com/knowledge-management-culture#imagine-a-world-where-all-the-shit-is-written-down" class="hash-link" aria-label="Direct link to Imagine a world where all the shit is written down" title="Direct link to Imagine a world where all the shit is written down" translate="no">​</a></h2>
<p>This momentum built on itself, and accelerated exponentially. It wasn't that much longer before the wiki became the go-to place for <em>everything</em>. The most notable effect of this change was that when you found yourself with a question, instead of looking for the expert, the first thing you'd do is just look it up on the wiki. It was shocking how often you'd find the answer, and if you didn't, you'd go find the answer offline, and then go and <em>create a damn wiki page about it</em>.</p>
<p>In stand-ups, team members would remind each other to update the wiki page for that system they just changed, process they added, or question they couldn't find the answer to yesterday. Managers would look at the volume (and quality) of wiki contributions when doing quarterly reviews (I remember one time I was the 2nd or 3rd biggest contributor, and was very proud).</p>
<p>This success was so significant that the rest of the company (outside of the technology team) took notice. There were originally concerns that MediaWiki's content editing interface, which required learning <a href="https://www.mediawiki.org/wiki/Wikitext" target="_blank" rel="noopener noreferrer" class="">Wikitext</a> (a content markup language similar to Markdown), was going to be too much of a barrier for our non-technical colleagues. This turned out not to be a big deal, as it was pretty easy to learn just the small subset of features you needed to create most content. It wasn't much longer before "TechWiki" was rebranded to "VistaWiki", and was adopted across the whole company.</p>
<p>For someone who hasn't spent time in a company with this level of knowledge management practice, it's hard to describe how positive it was. I have little doubt that the wiki provided us a real, tangible competitive advantage. Experienced new hires inevitably noted how useful the wiki was, and how much it contributed to their effectiveness.</p>
<p>The era of VistaWiki lasted for about a decade, across several transitions of leadership, and through a period of really huge growth. All the while, Dan was working in the background to keep this culture alive and healthy.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="entropy-affects-culture-too">Entropy affects culture too<a href="https://labaneilers.com/knowledge-management-culture#entropy-affects-culture-too" class="hash-link" aria-label="Direct link to Entropy affects culture too" title="Direct link to Entropy affects culture too" translate="no">​</a></h2>
<p>A few years before I left, I noticed that something had changed. The wiki was still there, and Dan and his team were still plugging away, but it became increasingly obvious that employees, more and more, had started taking the wiki for granted. As employees turned over, the wiki culture was emphasized less and less to new hires, who became gradually more likely to try to find an experienced colleague instead of searching the wiki first.</p>
<p>Unfortunately, our tech leadership at this time, who had inherited this culture but didn't fully grasp its significance, wasn't spending much energy on preserving and promoting it. To be fair, they had a lot of other fires to fight, and it's understandable that, like an increasing share of the population, they took it for granted too.</p>
<p>I was part of (middle) upper management at this time, and though we were spending a lot of energy on nurturing culture, knowledge management wasn't included in the features we were supposed to be promoting.</p>
<p>Some teams across the organization started experimenting with different documentation systems; ones that had features that MediaWiki lacked, or supported documentation that was generated from code, or were just more familiar to them from previous jobs. Most of these were implemented in a half-hearted way, contained siloed information, weren't well maintained, and didn't uphold the core value the wiki was built on: that knowledge should be free, transparent, and discoverable across the whole company.</p>
<p>About a year before I left, Dan popped over to let me know he was leaving, and was off to do something new. I was sad to see him go, but not particularly surprised. He'd been working tirelessly on this stuff for years, and was clearly exhausted trying to keep this thing alive in spite of a leadership that wasn't fighting very hard to keep it from disintegrating.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>leadership is hard too</div><div class="admonitionContent_BuS1"><p>I want to give the Vistaprint leadership of this era some grace, since I've come to know how incredibly hard it is to optimize investment decisions across so many different dimensions. I don't mean to disparage them, but I think it's important to acknowledge this as a major factor in the loss of something that was really special and important.</p></div></div>
<p>After Dan left, the degradation accelerated. All the same factors that contributed to the exponential adoption seemed to be working in reverse. At one point I remember someone sending me a Word doc on Slack for something that would have 100% been a wiki article a few years before. It turns out, while this person had used the wiki, they weren't sure how to create an article, and didn't know if anyone would ever look for it there.</p>
<p>This moment reminded me of stories of medieval British peasants, walking past the ruins of Roman aqueducts, wondering where these strange, huge structures had come from, and what kind of people could have built such grand, otherworldly things. Maybe they had been built by giants?</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="tips-for-building-your-own-culture-of-knowledge-management">Tips for building your own culture of knowledge management<a href="https://labaneilers.com/knowledge-management-culture#tips-for-building-your-own-culture-of-knowledge-management" class="hash-link" aria-label="Direct link to Tips for building your own culture of knowledge management" title="Direct link to Tips for building your own culture of knowledge management" translate="no">​</a></h2>
<p>I think about this experience as just one data point in my broader understanding about what it takes to build (or change) a company's culture. To avoid overstating my point, I'll try to stick specifically to knowledge management.</p>
<p>Here's the lowdown:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="minimize-friction-for-contributors">Minimize friction for contributors<a href="https://labaneilers.com/knowledge-management-culture#minimize-friction-for-contributors" class="hash-link" aria-label="Direct link to Minimize friction for contributors" title="Direct link to Minimize friction for contributors" translate="no">​</a></h3>
<p>The experience for contributors needs to be as frictionless as possible. Every barrier you put in front of a potential contributor (having to create a PR, having to follow specific procedures, feeling like they have to ask permission) gets multiplied across every person in the organization.</p>
<p>Culture change is always an uphill battle; don't add unnecessary weight to your pack.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="access-control-is-counter-productive">Access control is counter-productive<a href="https://labaneilers.com/knowledge-management-culture#access-control-is-counter-productive" class="hash-link" aria-label="Direct link to Access control is counter-productive" title="Direct link to Access control is counter-productive" translate="no">​</a></h3>
<p>Don't fool yourself into thinking that locking your shit down is going to improve quality. Access controls are some of the worst kinds of friction; they incentivize information silos, and discourage potential contributors who approach content with an outsider's perspective.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Diversity of perspective is a thing</div><div class="admonitionContent_BuS1"><p>Don't underestimate the power of an outsider perspective: people who aren't marinating in your specific team's minutiae will spot your implicit assumptions, opaque jargon, missing details, and errors that accumulate over time... especially when you're documenting an evolving system like software.</p></div></div>
<p>Empower everyone to edit everything. In any reasonably healthy organization, the number of people creating good, quality content will vastly outnumber the people messing stuff up.</p>
<p>Either way, if you have employees deliberately producing bad edits, you've got bigger problems than knowledge management.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>OK, fine.</div><div class="admonitionContent_BuS1"><p>OK, so some content legitimately requires access controls (e.g. HR/legal policies). Sure, lock that down. Just be really careful not to slide down that slippery slope and start locking down content because you have abstract notions of ownership or quality control.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="dont-be-precious-with-content">Don't be precious with content<a href="https://labaneilers.com/knowledge-management-culture#dont-be-precious-with-content" class="hash-link" aria-label="Direct link to Don't be precious with content" title="Direct link to Don't be precious with content" translate="no">​</a></h3>
<p>Avoid imposing implicit, social friction. Encourage contributors not to worry about taxonomy and organization; it creates unnecessary stress which discourages contribution. A pristine taxonomy is worthless without good content.</p>
<p>Also, don't be too pedantic about style, structure, or tone. If you're successful with adoption, you'll have too much content to ever have manually reviewed, so quality is something that inevitably will have to be addressed via culture (and not by creating speed bumps). If you get it right, people will <em>want</em> to write good content by virtue of social incentives, and because they get positive feedback from their peers and managers... or because they want the content for their own team or future selves.</p>
<p>In this situation, coaching and continuous feedback are much more effective than gatekeeping.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>Dan pointed out that he thinks a big factor in the success of Vistaprint's wiki was the consistent effort his team put in as editors: keeping things tidy and organized behind the scenes. This effort was obviously a significant cost, but it enabled them to promote the culture of "don't worry, just write it down" which was a huge part of the magic that made it all work.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="prefer-a-single-source-of-truth">Prefer a single source of truth<a href="https://labaneilers.com/knowledge-management-culture#prefer-a-single-source-of-truth" class="hash-link" aria-label="Direct link to Prefer a single source of truth" title="Direct link to Prefer a single source of truth" translate="no">​</a></h3>
<p>From the perspective of information consumers, you really want the knowledge repository to feel like a single system. You don't want users to have to ponder which of 7 different intranet portals to visit, depending on the type of document. If you don't have an actual unified knowledge management system, a good solution might be a unified search portal, or even just a norm of linking any external content from your primary system.</p>
<p>While a single source of truth isn't necessarily as important for contributors, I haven't yet seen a system with multiple sources of documentation where the fragmented experience doesn't end up discouraging contributors. That said, I'd be curious if there's something that could be done with a federated system (e.g. a central system that scrapes individual sources, but generates content that contains links to edit the source).</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="you-need-a-jump-start-to-get-critical-mass">You need a jump-start to get critical mass<a href="https://labaneilers.com/knowledge-management-culture#you-need-a-jump-start-to-get-critical-mass" class="hash-link" aria-label="Direct link to You need a jump-start to get critical mass" title="Direct link to You need a jump-start to get critical mass" translate="no">​</a></h3>
<p>Knowledge management culture has a chicken/egg paradox component: consumers won't use it if you don't have sufficient content, and it's hard to incentivize contributors to create content unless they believe it will be used by consumers. You need to find a way to prime the system and get it to become self-sustaining.</p>
<p>This probably requires a few different strategies deployed in parallel:</p>
<ul>
<li class="">Recruit key, influential people (probably the people who are already information bottlenecks) to start creating content</li>
<li class="">Practice sustained engagement with teams. Use this to help encourage content creation, remove implicit barriers, and identify sources of friction or stress.</li>
<li class="">Consider ways to seed the new system with content pulled (probably via automation) from existing sources. Even if this isn't perfect, there's something about seeing incomplete content that gets people to want to fill in the details.</li>
</ul>
<p>And probably a bunch of others you'll need to discover incrementally along the way. Much like <a class="" href="https://labaneilers.com/developer-experience-is-a-product">developer experience is a product</a>, so is content contributor experience.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="you-need-a-team-to-sustain-and-nurture-it">You need a team to sustain and nurture it<a href="https://labaneilers.com/knowledge-management-culture#you-need-a-team-to-sustain-and-nurture-it" class="hash-link" aria-label="Direct link to You need a team to sustain and nurture it" title="Direct link to You need a team to sustain and nurture it" translate="no">​</a></h3>
<p>Probably most importantly, you'll need someone (and probably a team) championing and driving the culture change. Some companies refer to this role as a "Librarian", but I think it's a lot more than that title evokes. While it is important that this person/team has good instincts around information architecture, it's much more about being an effective evangelist and relationship builder. It's fundamentally about changing the behavior of a large group of people, and that's <em>legitimately hard</em>.</p>
<p>This ongoing battle to sustain culture has two fronts:</p>
<ul>
<li class=""><strong>Grassroots</strong>: contributors have to fully buy in, and understand that the effort they put into creating content benefits them in short order.</li>
<li class=""><strong>Managing up</strong>: If upper management doesn't understand the value of a knowledge culture, it's going to be very hard to sustain it. The management hierarchy, at every level, has to be reminded how much they're getting out of the culture, and that they're responsible for keeping it alive.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="lets-extrapolate">Let's extrapolate<a href="https://labaneilers.com/knowledge-management-culture#lets-extrapolate" class="hash-link" aria-label="Direct link to Let's extrapolate" title="Direct link to Let's extrapolate" translate="no">​</a></h2>
<p>It might be worth extrapolating a more general principle here about culture change: a great company culture can be a major driver for business success, but culture is a delicate, fickle thing. It can wither and die just as quickly as it was grown.</p>
<p>As leaders, we make decisions all the time about where to direct our limited resources. If there are aspects of your company culture that you believe really matter, take stock of what investment you're actually putting into them. I'm talking real, tangible resources; if you're not making real tradeoffs to sustain them, then you're choosing to let entropy take over. You may one day wake up and find that your teams are behaving in ways that are very counter to the culture you thought you valued.</p>]]></content:encoded>
            <category>leadership</category>
            <category>storytime</category>
        </item>
        <item>
            <title><![CDATA[That one time I did something important]]></title>
            <link>https://labaneilers.com/that-one-time-i-did-something-important</link>
            <guid>https://labaneilers.com/that-one-time-i-did-something-important</guid>
            <pubDate>Sun, 02 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This is the story of the most impactful accomplishment of my career (building Vistaprint's Studio), which happened to be as an individual contributor.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>This is the story of the most impactful accomplishment of my career (building Vistaprint's Studio), which happened to be as an individual contributor.</p><p>For those of us who've actively chosen to remain active technologists, and have resisted the pressure to join management, it's important to remember that innovation is ultimately driven by <em>individuals</em>.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/light-bulb-b6dce3539fc25404a46592f0fb550dc6.jpg" class="blog-image" alt="Light bulb with a fire in it">
<p>A commonly accepted notion in software engineering leadership is that managers have a much bigger potential for impact on a business than an individual contributor. This is certainly a credible argument, given that a great manager can have a huge impact through building a great team. They're responsible for recruiting the right people, steering the culture, and making the biggest decisions about what risks to take, what opportunities to pursue, etc. Ultimately, they're accountable for what the team delivers.</p>
<!-- -->
<p>And of course, a big team can deliver bigger outcomes, with bigger impact, than any individual contributor could on their own. Given this leverage, it's no wonder that it's very rare for companies to have pay scales for individual contributors that match those of managers, especially in senior management.</p>
<p>For those of us who find happiness and fulfillment in working directly with technology, our decision to avoid management can come with a significant economic penalty.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-penalty-for-optimizing-your-career-for-joy">The penalty for optimizing your career for joy<a href="https://labaneilers.com/that-one-time-i-did-something-important#the-penalty-for-optimizing-your-career-for-joy" class="hash-link" aria-label="Direct link to The penalty for optimizing your career for joy" title="Direct link to The penalty for optimizing your career for joy" translate="no">​</a></h2>
<p>I made an explicit decision a few years ago that I would leave management in order to get back to the things that originally drew me to technology. While I still think of what I do as leadership, I've come to terms with the fact that for the remainder of my career, I'm going to watch my former peers surpass me in titles, power, and especially in compensation.</p>
<p>At the beginning, this was a little hard on my ego, but over the past few years, I've come to a place of contentment. The amount I look forward to any given day of work is directly proportionate to the amount of uninterrupted time I have to work on engineering problems. I've decided my goal should be optimizing my career for happiness, so this tradeoff works for me.</p>
<p>But I want to push back a little on the idea that management is categorically more impactful than individual contribution. The concept is a bit of a tautology; managers take credit for the innovation and impact of the individual contributors they hire. But because of the nature of software, a single person with the right idea, at the right time, can manifest that idea into the world to great effect, sometimes without any organization supporting them at all.</p>
<p>I'd like to tell you the story of the most impactful thing I've ever done, which was as an individual contributor. Even though this was 18 years ago, I honestly don't know if I'll have another chance to do something quite like it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="do-you-remember-web-20">Do you remember Web 2.0?<a href="https://labaneilers.com/that-one-time-i-did-something-important#do-you-remember-web-20" class="hash-link" aria-label="Direct link to Do you remember Web 2.0?" title="Direct link to Do you remember Web 2.0?" translate="no">​</a></h2>
<p>Back in 2006, I joined Vistaprint to work on the team that owned its "Studio" application. Studio is Vistaprint's "Photoshop in the browser", where customers can customize and edit designs that will then be used on custom-printed products (most famously business cards, but they also have hundreds of other products). This was a client-side web app, written in JavaScript/CSS, with a backend built (at the time I joined) in VB.NET.</p>
<p>Let me just set the stage, and remind my readers what the web was like in 2006: Microsoft had absolutely dominated the browser space since winning the so-called "browser wars" back in the late nineties. Chrome didn't exist. Firefox had been around for 4 years, but it held a fraction of the market share of IE (version 7 at the time), and had virtually no appreciable advantages in user experience over IE. JavaScript was widely regarded as a toy language, and the browsers' engines were all equally, painfully slow. Honestly, none of us ever imagined JavaScript <em>could</em> be faster.</p>
<p>There was nothing like modern web frameworks like React or Vue. Even frameworks now considered legacy, such as jQuery, YUI, and MooTools, wouldn't have their first releases until later that year. The leading JavaScript frameworks at the time were Prototype and Dojo. Flash was still considered <em>the</em> technology for interactive web applications.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Comparison</div><div class="admonitionContent_BuS1"><p>There are now lots of apps that allow users to do sophisticated design in the browser (e.g. Canva, Figma). We didn't have any of the browser technologies back then that make this possible today: the <code>canvas</code> tag, SVG, WebAssembly, WebGL, or even half-decent JavaScript engines.</p><p>We were building web apps with mud, sticks, and gumption.</p></div></div>
<p>One thing that had happened the year before, however, was the initial release of Google Maps. While scrappy browser hackers had done some really cool and innovative stuff before, the effect of the launch of Maps was like a bomb going off in the web development world. It felt legitimately groundbreaking, and it was obvious at the time that this app was going to change the world.</p>
<p>I'm just putting that picture in your mind so you'll have a sense of what we even thought was possible at the time.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-origin-of-vistaprints-studio">The origin of Vistaprint's studio<a href="https://labaneilers.com/that-one-time-i-did-something-important#the-origin-of-vistaprints-studio" class="hash-link" aria-label="Direct link to The origin of Vistaprint's studio" title="Direct link to The origin of Vistaprint's studio" translate="no">​</a></h2>
<p>Vistaprint's first Studio had been built back around 2001, by a <a href="https://www.linkedin.com/in/businessintel/" target="_blank" rel="noopener noreferrer" class="">brilliant hacker</a> who embodied the type of contrarian scrappiness that was required to do anything on the web at the time. Despite the primitive browsers of the era, he was able to use a number of techniques, moderately well-known at the time (but not widely used), to do the kinds of things that web developers would later do with AJAX.</p>
<p>The most important of these hacks was to use multiply-nested <code>iframe</code>s (and a makeshift protocol that looked a little like JSONP) to communicate with the server without requiring a navigation on the main page. This allowed him to effectively simulate AJAX requests before browsers even had the capability.</p>
<p>What's even crazier about this is that the client-side code for Studio was used in an insane (but also brilliant) hack, where they <strong>loaded instances of IE <em>server-side</em></strong> in order to measure text bounding boxes, so that they could render documents for manufacturing. It turns out text rendering engines are really freaking complex, and given the fact that we relied on IE for text layout on the client, the most reliable way to ensure that customers' printed products were faithful to their browser renderings was to use the browser to render them on the server.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Hat tip</div><div class="admonitionContent_BuS1"><p>Despite all this hackery, I'm never going to criticize the folks that built what became a multi-billion dollar company. You do what you have to do to get a business off the ground, and Vistaprint's early team did incredibly creative and audacious stuff.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="studio-in-2006">Studio in 2006<a href="https://labaneilers.com/that-one-time-i-did-something-important#studio-in-2006" class="hash-link" aria-label="Direct link to Studio in 2006" title="Direct link to Studio in 2006" translate="no">​</a></h2>
<p>By the time I arrived in 2006, Vistaprint had built a very successful business on Studio, and had recently had their IPO, after delivering revenues of about $90MM in 2005. Studio was considered one of their major strategic "pillars", along with their novel manufacturing capabilities.</p>
<p>I spent the first couple of months working on Studio, trying to get my bearings, and wrap my head around what had become a pretty nasty mess of spaghetti over the past few years. Around that time, the team lead had decided to move away to follow his girlfriend out of state, and I was left in charge of an app I barely understood. It wasn't just me, though; my colleagues admitted that none of them had any confidence they could make any significant changes to the application without breaking things.</p>
<p>This is, as a matter of fact, exactly what happened... repeatedly. Vistaprint was growing its product portfolio, as well as trying to iterate to improve usability, and there had been a series of disastrous attempts to add new features to Studio to support this, each followed by an emergency rollback.</p>
<p>Beyond the maintainability challenges, you have to understand what the user experience for Studio was actually like. Because of all the complexity required to make the multiple-iframe communication mechanism work, along with years of features being layered on top, the user interface took a <em>minimum</em> of 60 seconds to load, even on a fast internet connection. About 20% of the time, something would fail (usually due to race conditions) and require a reload.</p>
<p>The app was deliberately styled to look like a Windows 95 desktop app, with CSS that had been carefully crafted to match the beveled edges, corporate grays, button styles, and fonts.</p>
<p>Studio only worked in IE. If you were unlucky enough to be using Firefox, Opera, or Konqueror, or you were on a Mac, you'd get redirected to a very limited, server-side, form-based page where you couldn't do any customization to your document other than edit the text.</p>
<p>The user interface was rife with bugs. We didn't have any real observability, but anecdotally, users would experience a blocking bug in at least 15% of sessions.</p>
<p>What's more, it was becoming less and less reliable to depend on IE to do server-side rendering, since over time, IE's text engine became more and more influenced by settings on Windows, graphics drivers, etc. We had a certain number of documents that just got printed completely wrong, and had to be manually modified by humans in our manufacturing plants.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="hitting-rock-bottom">Hitting rock bottom<a href="https://labaneilers.com/that-one-time-i-did-something-important#hitting-rock-bottom" class="hash-link" aria-label="Direct link to Hitting rock bottom" title="Direct link to Hitting rock bottom" translate="no">​</a></h2>
<p>It wasn't long before I caused my first major production incident by attempting a bug fix in Studio, despite having been through what felt like a very meticulous QA cycle. After the rollback, we calculated the losses from the incident at about $20K, and I felt pretty deflated. My boss helped to put things in perspective, noting that these kinds of losses were common in Studio, and my predecessor had caused many such incidents.</p>
<p>I spent a couple days feeling sorry for myself, and then resolve set in. I was having none of this. This was not OK.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kindling-and-a-spark">Kindling and a spark<a href="https://labaneilers.com/that-one-time-i-did-something-important#kindling-and-a-spark" class="hash-link" aria-label="Direct link to Kindling and a spark" title="Direct link to Kindling and a spark" translate="no">​</a></h2>
<p>Around this time, a couple engineers had been working on a new set of server-side text rendering services that we could use for simpler products that didn't require Studio (this was especially appealing at the time, because the conversion rate in Studio was so terrible). I saw a demo that they'd built, and found myself unable to stop thinking about it for several days. One evening, while trying to get to sleep, I had a crazy idea.</p>
<p>What if we could build a brand-new Studio from scratch, where the document's elements would be composed of a set of server-rendered images? The client-side code would just be an interface for moving, resizing, and opening an editor for these elements, which from the perspective of the client, would just be rectangles with a set of properties. These elements would be the same types of elements users could edit in the legacy studio (e.g. text boxes, images, vector shapes, etc), but we'd build server-side rendering services for each one, which would output transparent PNG images so they could be composited together on the client.</p>
<p>The user could then just double click on any of the rectangles to open an element-specific editor. So for text boxes, this would open a simple text-box editor, which would allow the user to type, and then we'd debounce the keypress events to trigger a refresh of the server-rendered text.</p>
<p>This way, the documents we produced client-side could be faithfully rendered server-side using the same text-layout engine, and we could remove a huge amount of complexity on the client.</p>
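<p>To make the idea concrete, here is a minimal sketch in Python (not the original JavaScript and C#). It models a text box as the client would see it, a rectangle with properties, and simulates a trailing-edge debounce over keypress timestamps. The <code>TextElement</code> fields, the <code>render.example.com</code> URL, and the query parameters are all hypothetical, invented for illustration.</p>

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class TextElement:
    """A document element as the client sees it: a rectangle plus properties."""
    element_id: str
    x: int
    y: int
    width: int
    height: int
    text: str


def render_url(element: TextElement, base: str = "https://render.example.com/text") -> str:
    """Build the URL of the server-rendered transparent PNG (hypothetical endpoint).

    Position is left out on purpose: in this sketch the client places the image
    itself, so moving an element never needs a new render.
    """
    query = urlencode({"w": element.width, "h": element.height, "text": element.text})
    return f"{base}?{query}"


def debounced_fire_times(keypress_times_ms: list[int], wait_ms: int) -> list[int]:
    """Return the times at which a trailing-edge debounce would request a render.

    Assumes keypress_times_ms is sorted ascending. A render fires wait_ms after
    a keypress only if no further keypress arrives before the wait elapses.
    """
    fire_times = []
    for i, t in enumerate(keypress_times_ms):
        is_last = i == len(keypress_times_ms) - 1
        if is_last or keypress_times_ms[i + 1] - t >= wait_ms:
            fire_times.append(t + wait_ms)
    return fire_times
```

<p>With keypresses at 0, 100, 200 and 1000 ms and a 300 ms wait, <code>debounced_fire_times([0, 100, 200, 1000], 300)</code> returns <code>[500, 1300]</code>: four keystrokes cost two server renders instead of four.</p>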
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-pitch">The pitch<a href="https://labaneilers.com/that-one-time-i-did-something-important#the-pitch" class="hash-link" aria-label="Direct link to The pitch" title="Direct link to The pitch" translate="no">​</a></h2>
<p>The next day, I spent some time with the engineers who had done the text rendering work, and we started working through the details of the idea. Once we felt like we had something viable, we brought my boss, <a href="https://www.linkedin.com/in/satishpai/" target="_blank" rel="noopener noreferrer" class="">Satish</a>, into the conversation. Having a shared experience of pain with Studio, he immediately arranged a meeting with <a href="https://www.linkedin.com/in/wendy-cebula-166a311/" target="_blank" rel="noopener noreferrer" class="">Wendy</a>, Vistaprint's head of "Capabilities Development" (I think she was technically the CIO at the time, but was still directly leading the Engineering team).</p>
<p>I explained the idea to her over the next half hour, and left with permission to suspend feature development on Studio, and to work with one of the rendering guys for a few weeks to build a prototype.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Why not Flash?</div><div class="admonitionContent_BuS1"><p>In the following few years, I had to address a lot of questions from folks about why I'd decided to use pure HTML/JavaScript instead of Flash. This was three years before the iPhone, and Steve Jobs's famous refusal to allow Flash to run on it. Flash was still considered by many to be the best choice for rich, interactive experiences.</p><p>The real reason we didn't want to use Flash was that it would have made us dependent on Flash's proprietary text rendering engine (like we were on IE before it) for server-side rendering. It also wasn't clear that it was possible to use Flash server-side, or that Adobe wouldn't change something at any point that would break our whole system.</p><p>This turned out to be another very fortunate decision.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="building-the-new-studio">Building the new Studio<a href="https://labaneilers.com/that-one-time-i-did-something-important#building-the-new-studio" class="hash-link" aria-label="Direct link to Building the new Studio" title="Direct link to Building the new Studio" translate="no">​</a></h2>
<p>Within a week, the two of us had a working version of Studio that could create a new document, and had some basic editing features, including text editing and drag/drop positioning for all document elements. The results were fairly stunning in contrast to the legacy studio:</p>
<ul>
<li class="">It loaded in just a couple seconds</li>
<li class="">It worked in IE, but also Firefox and Opera. It also worked on a Mac.</li>
<li class="">It was smooth, snappy and responsive</li>
<li class="">The feel of the pop-up text editor, which we were afraid might be weird, was totally fine.</li>
</ul>
<p>As soon as she saw the prototype, Wendy gave us the green light to go all in. I spent the next few months turning the prototype into a real replacement for the legacy Studio, adding support for each of the element types needed to support our most important products, including business cards and postcards. My colleagues on the rendering side worked on building a new version of the program that would transform Studio documents into press-ready PDFs, using the new server-side text rendering engine (and NOT using IE).</p>
<!-- -->
<figure><img src="https://labaneilers.com/assets/images/studio-afc32d1ae34635d765e1a7317a21d679.png" alt="Vistaprint's Studio, today"><figcaption>Vistaprint's Studio, today. The code has been rewritten, but it still uses the same architecture I helped create in 2006.</figcaption></figure>
<p>We launched via an A/B test shortly thereafter. Most A/B tests run on changes to Studio over the last few years had produced either negative or statistically insignificant results, and despite how much better our new version felt, we thought the odds of hitting a home run on the first pitch were pretty low. We would have been happy with breaking even, at which point we would have been able to take advantage of the more maintainable codebase, and focus on optimizing.</p>
<p>When the first A/B test came back, we were floored. Conversion rate was up by about 5 <em>points</em>. Out of the gate, this hack was immediately worth tens of millions of dollars a year for Vistaprint, just for one product! And we had done <em>zero</em> optimization.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-results">The results<a href="https://labaneilers.com/that-one-time-i-did-something-important#the-results" class="hash-link" aria-label="Direct link to The results" title="Direct link to The results" translate="no">​</a></h2>
<p>Over the next year, the effects of this success rippled throughout the company. A large number of engineers were redirected to execute changes throughout the system needed to replace the old IE-based document rendering with the new server-side rendering engine. A team was built around me to keep developing the new Studio, and we gradually added the features needed to support an increasing share of Vistaprint's product portfolio.</p>
<p>All this time, Vistaprint was growing like crazy. Each time we'd move a product over to the new Studio, we'd see a huge jump in conversion rate. New core capabilities were being built on top of the new Studio architecture, and a ton of new design content, products, and features were enabled. The process of rendering documents for manufacturing was far more efficient now that we had a reliable way to render documents that were faithful to the users' intentions, and we no longer needed a small army of humans to fix broken documents.</p>
<p>Every year we were growing revenue by hundreds of millions of dollars. It was an incredible ride.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reflecting-on-this-success">Reflecting on this success<a href="https://labaneilers.com/that-one-time-i-did-something-important#reflecting-on-this-success" class="hash-link" aria-label="Direct link to Reflecting on this success" title="Direct link to Reflecting on this success" translate="no">​</a></h2>
<p>I want to be clear that Vistaprint's success was due to <em>many</em> critical innovations and an enormous amount of work by many, many people, in areas like manufacturing, content design tooling, marketing, ecommerce, etc.</p>
<p>Also: even though I had come up with the core idea for the new Studio, it was based on many of the ideas from the old Studio, which itself required a lot of independent innovations. Beyond that, there's no way I could have come up with this idea, or had any hope of making it work, without the insight and skill of my teammates who had figured out the server-side rendering.</p>
<p>What's more, none of the engineering work I did was groundbreaking or mindblowing. I just synthesized some disparate ideas, from both inside and outside Vistaprint, and glued it all together with some (fairly decent) JavaScript and C#. I was just the right person, at the right place, at the right moment, with the right idea.</p>
<p>Even so, I still occasionally catch myself dwelling pridefully on this achievement. I imagine an alternate universe where I never joined Vistaprint, where they tried to incrementally improve the old Studio architecture. I don't see how they could have had the success that they did in <em>this</em> universe; the difference might be measured in hundreds of millions (maybe billions?) of dollars at this point.</p>
<p>I've done a lot of things I'm proud of since then, but I don't know how likely it is that I'll ever play such a pivotal role in building a multi-billion dollar company again.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="innovation-comes-from-individuals">Innovation comes from individuals<a href="https://labaneilers.com/that-one-time-i-did-something-important#innovation-comes-from-individuals" class="hash-link" aria-label="Direct link to Innovation comes from individuals" title="Direct link to Innovation comes from individuals" translate="no">​</a></h2>
<p>Thinking back on this episode in my career has been useful to remind myself that impact, especially via innovation, is ultimately driven by individual contributors. This is really important to remember for those of us who've chosen to optimize our careers around the joy of being a technologist, especially when the social and financial pressures to advance our careers through management are so potent.</p>
<p>My contrarian thesis aside, I have to acknowledge the complex interrelationship between ICs and managers when it comes to innovation. Technology leaders play their part to drive innovation by actively building a culture of empowerment and risk-taking, and being willing to make big bets on individuals with vision.</p>
<p>Perhaps it was only implied in my story, but this is exactly what Wendy had done for Vistaprint, long before I had arrived. She built an amazing team, and fostered a culture where engineers felt supported, trusted, and safe enough to invest their time where they thought opportunities existed.</p>
<p>I feel a great deal of gratitude to Wendy for having taken a risk in empowering me. It was a truly formative experience, and I still can't believe how lucky I was that the work I did had such a lasting effect on a great company.</p>]]></content:encoded>
            <category>leadership</category>
            <category>storytime</category>
        </item>
        <item>
            <title><![CDATA[Developer experience is a product]]></title>
            <link>https://labaneilers.com/developer-experience-is-a-product</link>
            <guid>https://labaneilers.com/developer-experience-is-a-product</guid>
            <pubDate>Sun, 26 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[The most important feature of an internal developer platform is that the team that builds it has to compete to win over their users.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>The most important feature of an internal developer platform is that the team that builds it has to compete to win over their users.</p><p>Figure out your initial value proposition, build a minimum viable product, get it in front of customers, listen, learn, and iterate.</p><p>Platforms imposed by a top-down mandate tend to fail.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/devx-soda-4173e49644dab69428bdd0a591175300.jpg" class="blog-image" alt="Developer Experience Soda">
<p>Over the past 15 years, I've been working on one form or another of internal developer platform. Even long before, while working at small startups, I inevitably ended up building (or curating) some little web framework, a build system, and slapping together scripts to package and deploy our stuff reliably. No one ever told me to do this, it was just obviously necessary.</p>
<p>In these cases, I was building a product for myself and my immediate team members, so it was a pretty tight feedback loop with the customer. I'd put a little extra effort to make things nice for other developers on my team, and also out of a bit of pride in making something that felt elegant.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="developer-experience-in-the-monolith">Developer experience in the monolith<a href="https://labaneilers.com/developer-experience-is-a-product#developer-experience-in-the-monolith" class="hash-link" aria-label="Direct link to Developer experience in the monolith" title="Direct link to Developer experience in the monolith" translate="no">​</a></h2>
<p>At the first larger company I worked for, I worked on improving the developer (and user) experience on top of a giant, pre-existing monolithic app, with a lot of custom tooling. One needed custom tooling when dealing with a monolith of several million lines of code being concurrently modified by 200 developers, especially since there wasn't really any off-the-shelf tooling available that could handle this scale.</p>
<p>Since there was already an established build and deployment system, I was mostly focused on improving the experience of web developers. At that time, the challenge was mostly around providing a sprawling army of mostly backend developers with a decent library of web UI components, and achieving some semblance of brand consistency.</p>
<p>This whole thing required some culture change, and a lot of outreach. I had no power to enforce usage of our web framework, nor any power to force web designers to work within the constraints we'd defined together. To get the designers on board, we needed to build some trust, listen to their concerns, and help them see we were trying to help them realize their vision with greater fidelity.</p>
<p>For developers, it just required that our framework was easier and better at helping teams make their pages look like what the designers wanted. Ultimately, no one ever had to force anyone to use our framework, it just made things easier for everyone, so they did it.</p>
<p>The next time we had to do a brand refresh of the site, it only took a couple people a week or so, whereas the last rebrand had been a major project across the whole company that took months. This was a small win against the entropy which was slowly devouring our monolith.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="microservice-babies">Microservice babies<a href="https://labaneilers.com/developer-experience-is-a-product#microservice-babies" class="hash-link" aria-label="Direct link to Microservice babies" title="Direct link to Microservice babies" translate="no">​</a></h2>
<p>A few years later, it was becoming apparent that we had been gradually losing productivity in our monolith, and there were some factions interested in pursuing a <a href="https://en.wikipedia.org/wiki/Service-oriented_architecture" target="_blank" rel="noopener noreferrer" class="">service-oriented architecture</a>. A new platform team started working on a set of tooling to enable teams to stand up independent services outside of the monolith.</p>
<p>In a marked contrast to the proprietary infrastructure we'd been using for the monolith, they were toying around with a bunch of different open source and vendor tools. After some prototyping and getting an initial MVP built, some teams started using their stuff.</p>
<p>Unfortunately, the fate of this particular platform was to fizzle. In retrospect, there was a lot going against it:</p>
<ul>
<li class="">We weren't yet using a public cloud (they were targeting on-prem infrastructure)</li>
<li class="">Kubernetes and containers were in their infancy</li>
<li class="">We were legitimately deluded about what it would take to make microservices actually work. Seriously, we were like little babies.</li>
</ul>
<p>These were pretty strong headwinds, but there was another factor I can see in retrospect, which was even more critical:</p>
<p><em>The platform team didn't spend enough time learning from their customers, or trying to understand the actual problems they were facing</em>.</p>
<p>I remember several specific tales of teams working on building services outside our monolith using their tooling, running into some friction, and experiencing something other than empathy from the platform team. At least once I remember a VP getting involved to put pressure on a team that was expressing reservations and looking for alternative approaches.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Caution: Fuzzy recollection</div><div class="admonitionContent_BuS1"><p>My recollection may not be 100% accurate, since it was a while ago, and I wasn't privy to all the goings-on. I have the impression the team gained the backing of leadership, who provided some degree of pressure on teams to adopt their tooling.</p><p>I'm not sure to what degree this pressure was applied in practice, but I do remember that the teams believed usage of the platform was expected of us.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="building-a-great-platform-for-the-wrong-customer">Building a great platform for the wrong customer<a href="https://labaneilers.com/developer-experience-is-a-product#building-a-great-platform-for-the-wrong-customer" class="hash-link" aria-label="Direct link to Building a great platform for the wrong customer" title="Direct link to Building a great platform for the wrong customer" translate="no">​</a></h2>
<p>What was wrong with the product? Let me explain:</p>
<p>The toolkit was built as a set of Ruby gems, referenced via a root gem that composed them to enable some higher-level operations. Each gem was responsible for interacting with some of the platform's parts, such as Artifactory, Jenkins, or whatever deployment tool we were using (I think at one point it was <a href="https://octopus.com/" target="_blank" rel="noopener noreferrer" class="">Octopus Deploy</a>?). The tool would scaffold out a rakefile, with predefined tasks (e.g. build, deploy) the user could execute with the <code>rake</code> CLI. The user could then customize their rakefile, combining these gems to implement all the custom processes their service needed, but a lot of the low-level details would be taken care of within the gems.</p>
<p>Here's where the problems started: the gems were fairly coarse-grained, strongly opinionated, and had pretty limited extensibility. They were also composed and packaged in a way that made it hard to replace any single gem with an alternative implementation. The options for a user that had an unsupported use case were pretty limited:</p>
<ul>
<li class="">Get the tooling owners to implement the feature</li>
<li class="">Implement the feature and try to convince the owners to take the patch</li>
<li class="">Implement the feature from scratch outside of the pre-existing gems</li>
</ul>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>hindsight</div><div class="admonitionContent_BuS1"><p>I think that monkey patching may also have been an option. I don't think any of us had enough Ruby experience to know that was a thing.</p></div></div>
<p>All of these options were particularly unappealing, partially because there was virtually no Ruby experience to be found among our developer population. But more importantly, after a number of teams' feature requests were met with apathy (and a bit of paternalistic "you're doing it wrong"), the "brand" of the tooling began to suffer.</p>
<p>Despite strong top-down pressure to use the tooling, teams openly rebelled and began piecing together their own bespoke solutions. In the long run, management gave up the fight, because ultimately, they just cared that business problems were getting solved.</p>
<p>In retrospect, I think leadership's decision to push the tooling was its death knell. The platform team lost its incentive to win the trust of the developers, and got caught up in their own vision. They built a beautiful product, it just didn't happen to be the product we needed.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Reality check</div><div class="admonitionContent_BuS1"><p>I should note that despite how things turned out with this particular platform, I can't deny that this team's work had a huge influence on the way I've thought about platform engineering ever since.</p><p>This is one of the rare occasions on which I've had the wherewithal to learn from others' failures instead of my usual approach of repeating all the same mistakes myself.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubernetes-emerges-from-the-chaos">Kubernetes emerges from the chaos<a href="https://labaneilers.com/developer-experience-is-a-product#kubernetes-emerges-from-the-chaos" class="hash-link" aria-label="Direct link to Kubernetes emerges from the chaos" title="Direct link to Kubernetes emerges from the chaos" translate="no">​</a></h2>
<p>A <a class="" href="https://labaneilers.com/let-a-thousand-flowers-bloom">few years and a regime change later</a>, we had a whole lot of teams individually managing their own build/deployment tooling. This was, in no small part, a reaction to the bad experiences many of us had with the aforementioned platform team. It seemed obvious to me at the time that there was a lot of waste in having every team have to rediscover their own solution, but I also acknowledged that the alternative of a central platform team managing this for everyone hadn't worked so well last time.</p>
<p>Meanwhile, management had blessed adoption of AWS. At the time, we had a vague and naive idea that AWS was a ready-to-use platform, and hadn't yet come to terms with its true character: an extremely powerful, but low-level set of primitives. They had a few offerings at the time that looked a little like a PaaS if you squinted, but we seriously underestimated the amount of boilerplate glue scripts we had to write to, for example, get a service built and deployed on ECS or Elastic Beanstalk.</p>
<p>One team in particular had been toying around with Kubernetes and was having some success. While I'd used docker a bit, and had been following the orchestrator wars (mostly as a lurker on Hacker News), I didn't yet see what the big deal was. But smart people I respected were saying good things about it, including words I liked, like "rolling deployments", "autoscaling", and "self-healing".</p>
<p>I had just spent the previous 2 months trying to help another team, who had been struggling to execute a basic blue-green deployment with CloudFormation. Then I saw a demo in which a <code>kubectl apply</code> of a single <code>deployment.yaml</code> file executed a seamless rolling update of a service within a minute, and I was sold.</p>
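<p>For readers who haven't seen one, here is a rough sketch of the kind of manifest such a demo might apply, built as a Python dict and printed as JSON (which <code>kubectl apply</code> accepts alongside YAML). The service name, image, and surge settings are placeholders I've made up; this is not the file from the demo.</p>

```python
import json


def rolling_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Build a minimal Kubernetes Deployment manifest that uses a rolling update."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": labels},
            "strategy": {
                "type": "RollingUpdate",
                # Start one extra pod at a time and never drop below the
                # desired replica count while the update is in progress.
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; pipe the output to: kubectl apply -f -
    manifest = rolling_deployment("my-service", "registry.example.com/my-service:1.2.3")
    print(json.dumps(manifest, indent=2))
```

<p>Changing only the image tag and re-applying is what triggers the rolling update: Kubernetes replaces pods gradually, within the limits set by <code>maxSurge</code> and <code>maxUnavailable</code>.</p>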
<p>As I learned more about the abstractions Kubernetes was built around, my thoughts returned to the idea of creating a developer platform. It seemed possible that containers and the Kubernetes API might be the membrane we needed to give developers autonomy over all things they cared about, while enabling central management of the stuff they didn't. The <a class="" href="https://labaneilers.com/devops-is-a-stew">ingredients of the devops stew</a> were finally all out on the counter.</p>
<p>It took some convincing, but I managed to get some of the influential developers on board with the idea that we'd create a new platform team, and attempt to build a scaled-up, multi-tenant version of the Kubernetes solution they had pioneered. We started the team, and spent most of the first month learning how to build and operate a cluster with <a href="https://github.com/kubernetes/kops" target="_blank" rel="noopener noreferrer" class="">kOps</a> (EKS either didn't exist yet, or was too new to consider seriously).</p>
<p>We got a couple of the teams to try it out, and found that it was, indeed, a Kubernetes cluster; it allowed us to define workloads and roll them out reliably. This was a huge improvement. But it didn't take long until the teams using it had accumulated a bunch of shell scripts and additional tooling to manage a few other things:</p>
<ul>
<li class="">Authentication to Kubernetes, Artifactory, and other services</li>
<li class="">Running docker builds (passing in build args, ssh-agent sockets, managing cache volumes, etc)</li>
<li class="">Defining per-environment configurations</li>
<li class="">Syncing secrets between our secret store and Kubernetes</li>
<li class="">Managing load balancers, DNS, and certs</li>
<li class="">Orchestrating integration tests with a bunch of docker containers</li>
</ul>
<p>Once again, each individual team was re-solving the same problems, each with their own flavor of tradeoffs and bugs. Clearly, we had a lot more opportunity to provide value here.</p>
<p>Coincidentally, I was fighting a little burnout around this time, and ended up deciding that 13 years was long enough in one company. I never got to take this particular platform further, but the shape of the problem space had become a lot clearer in my head.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="three-developer-platforms-later-lessons-learned">Three developer platforms later, lessons learned<a href="https://labaneilers.com/developer-experience-is-a-product#three-developer-platforms-later-lessons-learned" class="hash-link" aria-label="Direct link to Three developer platforms later, lessons learned" title="Direct link to Three developer platforms later, lessons learned" translate="no">​</a></h2>
<p>Over the next 7 years, I iterated on this idea three more times at two different companies, each time on top of Kubernetes. The results have been increasingly compelling with each iteration, and I've added a lot of key elements to the approach. The central idea has become:</p>
<p><em>The platform encapsulates the operational, cultural, and security opinions of the organization, gluing together the company's chosen infrastructure and tooling.</em></p>
<p>There are a lot of principles and patterns underneath this high-level idea, but there are a few, universal key dimensions along which you have to strike a balance:</p>
<ul>
<li class="">Finding the right line between things that have to be standardized, and things where there's value in flexibility and autonomy for teams.</li>
<li class="">Adding enough power so that the platform can support all the use cases in your company, while also having a small number of simple, default paths that work for the vast majority of cases.</li>
<li class="">Creating the right extensibility points, allowing teams to solve their own problems without the platform team being a bottleneck, but still maintaining enough coherence in the core aspects of the platform so it can evolve and improve over time.</li>
</ul>
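<p>One way to picture the first two trade-offs is a config resolver with sensible defaults, a few locked settings, and everything else open to teams. This is only an illustrative sketch: the setting names, default values, and the <code>resolve_config</code> function are hypothetical, not taken from any platform I've built.</p>

```python
# Defaults that give most services a working "paved path" with no config at all.
PLATFORM_DEFAULTS = {
    "replicas": 2,
    "port": 8080,
    "log_format": "json",
    "registry": "registry.example.com",
}

# Settings the organization has chosen to standardize; teams cannot change them.
STANDARDIZED_KEYS = {"log_format", "registry"}


def resolve_config(team_overrides: dict) -> dict:
    """Merge a team's overrides onto the platform defaults.

    Standardized keys are rejected; any other key, including ones the
    platform does not know about, passes through as an extension point.
    """
    locked = STANDARDIZED_KEYS.intersection(team_overrides)
    if locked:
        raise ValueError(f"standardized settings cannot be overridden: {sorted(locked)}")
    return {**PLATFORM_DEFAULTS, **team_overrides}
```

<p>A team that passes nothing gets the defaults, a team that passes <code>{"replicas": 5}</code> gets five replicas with everything else unchanged, and an attempt to override <code>log_format</code> fails loudly. Where to draw the line between the two sets is exactly the judgment call described above.</p>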
<p>The right balance in these dimensions is highly dependent on the culture and values of your organization, but there's one thing I'm pretty sure is universal, which is how you find that balance:</p>
<p><strong>Treat your developer experience like a product.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="finding-product-market-fit">Finding product-market fit<a href="https://labaneilers.com/developer-experience-is-a-product#finding-product-market-fit" class="hash-link" aria-label="Direct link to Finding product-market fit" title="Direct link to Finding product-market fit" translate="no">​</a></h2>
<p>I don't think this is any different from how a startup would approach things:</p>
<ul>
<li class="">Observe your developers, listen to them, learn about their pain</li>
<li class="">Formulate hypotheses about how you can alleviate that pain</li>
<li class="">Build a minimum viable product</li>
<li class="">Get it in front of developers</li>
<li class="">Listen, learn, and iterate</li>
<li class="">Pivot if what you're building doesn't resonate</li>
</ul>
<p>Don't fall in love with your own vision. When developers ask for a feature, don't dismiss them, even if you don't see where it fits on your roadmap. Regardless of the implementation details they may be stuck on, they're giving you critical information about their pain points. Don't squander that opportunity.</p>
<p>If you're exceptionally visionary, you may have innovative, paradigm-shifting ideas for solutions that developers don't even know they need. That's great, but you should slow your roll. Use the scientific method: test and learn. Maybe you have the <a href="https://thenewstack.io/solomon-hykes-leader-open-source-world-needs/" target="_blank" rel="noopener noreferrer" class="">wisdom of Solomon Hykes</a>, but the odds are against you. In reality, 99% of ideas you think are novel aren't actually new, they just got quietly discarded because they didn't work in practice.</p>
<p>For an internal developer platform, you don't have to be particularly innovative, and you certainly don't have to be original. In fact, it's usually a lot better to shamelessly steal ideas from successful platforms outside your company. Bias towards open-source tools and vendor or cloud-provider products. Rip off the CLI interface of a popular PaaS product your developers are already using for their side hustle.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="congratulations-youre-a-brand-manager">Congratulations, you're a brand manager<a href="https://labaneilers.com/developer-experience-is-a-product#congratulations-youre-a-brand-manager" class="hash-link" aria-label="Direct link to Congratulations, you're a brand manager" title="Direct link to Congratulations, you're a brand manager" translate="no">​</a></h2>
<p>And like a startup, you're also the steward of your product's brand. You have to earn trust with your customers, show them that you're listening to their feedback, and that you're committed to making their lives better.</p>
<p>Your brand is also relevant to stakeholders besides your direct customers, including leadership, security teams, product owners, etc. If they don't understand your value proposition, there's a good chance they'll be asking uncomfortable questions at a moment when it's least helpful.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="remember-you-dont-have-the-monopoly-you-think-you-do">Remember, you don't have the monopoly you think you do<a href="https://labaneilers.com/developer-experience-is-a-product#remember-you-dont-have-the-monopoly-you-think-you-do" class="hash-link" aria-label="Direct link to Remember, you don't have the monopoly you think you do" title="Direct link to Remember, you don't have the monopoly you think you do" translate="no">​</a></h2>
<!-- -->
<figure class="blog-image"><img src="https://labaneilers.com/assets/images/monopolist-3436af0af6fa53f9c102588885db78c3.jpeg" alt="Monopolist"><figcaption>You don't want to be this guy.</figcaption></figure>
<p>In a few cases, especially with the latest developer platform I've worked on, I've had to fend off requests from leadership who'd like to accelerate adoption by cranking up pressure on teams to use our stuff. Certainly, there are benefits to the organization in standardizing (especially for security and cost management). But each time I push back.</p>
<p>For one thing, we haven't needed to do anything to drive demand; teams are migrating services whenever they can spare a sprint... because they like what we've built and they know they have a say in the direction we take it. We're going to get to 100% adoption at some point soon, and we won't have ever forced anyone's hand.</p>
<p>I think this principle is pretty universal for teams working on internal tooling. When you're tempted to use management to force people to use your product, step back and consider the big picture. You don't have the monopoly you think you do. Companies evolve and change; new executives and managers come into power, technologies evolve, and the business climate changes.</p>
<p>If you want to stay on top, you have to acknowledge that you're always competing for your customers' business. If they're happy with the platform, and trust you to keep improving it, they'll defend you against shifting tides. If they're not, they'll abandon ship as soon as another option presents itself.</p>]]></content:encoded>
            <category>devops</category>
            <category>platform-engineering</category>
            <category>kubernetes</category>
            <category>storytime</category>
        </item>
        <item>
            <title><![CDATA[Prometheus vendor death match]]></title>
            <link>https://labaneilers.com/prometheus-vendor-death-match</link>
            <guid>https://labaneilers.com/prometheus-vendor-death-match</guid>
            <pubDate>Sun, 12 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We evaluated a number of observability vendors, with a focus on metrics, and did detailed PoCs with both Chronosphere and Grafana Cloud. Both are excellent products, and have slightly different strengths.]]></description>
            <content:encoded><![CDATA[<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>TL;DR</div><div class="admonitionContent_BuS1"><p>We evaluated a number of observability vendors, with a focus on metrics, and did detailed PoCs with both <a href="https://chronosphere.io/" target="_blank" rel="noopener noreferrer" class="">Chronosphere</a> and <a href="https://grafana.com/products/cloud/" target="_blank" rel="noopener noreferrer" class="">Grafana Cloud</a>. Both are excellent products, and have slightly different strengths.</p></div></div>
<!-- -->
<img src="https://labaneilers.com/assets/images/fighters-e3cc7c7007d4b8cf1e695e68a87f5a28.jpg" class="blog-image" alt="Death match">
<p>At work, we're in the process of rebuilding our metrics pipeline, as we've outgrown our old self-managed TIG (Telegraf, InfluxDB, Grafana) solution. We've had this solution in place for many years, and it's served us well. Especially given the increasingly predatory pricing models of observability vendors, it's been extraordinarily cost-effective.</p>
<p>But over the last couple years, as we've grown, we've started to hit the limits of what we can handle with a single, vertically scaled instance of InfluxDB (especially using InfluxDB v1). It was increasingly stressful to keep it running smoothly, and we had to be very vigilant about cardinality, as it's very easy to accidentally introduce a cardinality explosion that can bring down the entire database.</p>
<!-- -->
<p>Just upgrading InfluxDB would have been similar in scope to moving to a new vendor, since InfluxDB v2 has a new query language, and we would have had to rewrite all our queries and dashboards anyways. So we decided to take the opportunity to do a proper RFP, and see if we could find a vendor to take this entire problem off our plate.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-landscape">The landscape<a href="https://labaneilers.com/prometheus-vendor-death-match#the-landscape" class="hash-link" aria-label="Direct link to The landscape" title="Direct link to The landscape" translate="no">​</a></h2>
<p>We decided to focus specifically on metrics, since we needed to limit the scope of our evaluation by some criteria. Observability products vary drastically across many dimensions, and metrics were the area where we were in the most pain.</p>
<p>The vendors we considered fell into a few categories:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-dominant-players">The dominant players<a href="https://labaneilers.com/prometheus-vendor-death-match#the-dominant-players" class="hash-link" aria-label="Direct link to The dominant players" title="Direct link to The dominant players" translate="no">​</a></h3>
<p>These include both <a href="https://www.datadoghq.com/" target="_blank" rel="noopener noreferrer" class="">Datadog</a> and <a href="https://newrelic.com/" target="_blank" rel="noopener noreferrer" class="">New Relic</a>, which are both well established and <em>very</em> feature-rich. They're also known for being <em>extremely</em> expensive, and having pricing models that are difficult to predict or control. I've talked to some friends who've worked with them, and they said although they were great products, it was pretty typical that the cost would be 30% over an already bloated budget every year. But because they were so locked in, every year they'd have to renew, and the sales people would show up looking for another pound of flesh.</p>
<p>Another thing we noticed about the dominant players was their transparently conflicted relationship with <a href="https://opentelemetry.io/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry</a>. While OTEL support features prominently in their marketing materials, their documentation tells a different story. Customers who choose to instrument their systems with OTEL SDKs will find that they're missing a whole lot of the best features of these products. The sales folks were not exactly subtle about recommending we run the evaluation using their proprietary agents instead.</p>
<p>Yuck on both fronts.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-up-and-comers">The up-and-comers<a href="https://labaneilers.com/prometheus-vendor-death-match#the-up-and-comers" class="hash-link" aria-label="Direct link to The up-and-comers" title="Direct link to The up-and-comers" translate="no">​</a></h3>
<p>There are a few interesting, smaller players in the space. We looked at SigNoz, Logit, and a few others. They all appeared to be offering basically the same thing: a hosted, Prometheus-compatible backend along with a Grafana-based front end. They all had very competitive pricing, but we felt a bit concerned at how immature they were, and decided against doing a full evaluation.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-cloud-provider-option">The cloud-provider option<a href="https://labaneilers.com/prometheus-vendor-death-match#the-cloud-provider-option" class="hash-link" aria-label="Direct link to The cloud-provider option" title="Direct link to The cloud-provider option" translate="no">​</a></h3>
<p>Since we're an AWS shop, we also considered using <a href="https://aws.amazon.com/prometheus/" target="_blank" rel="noopener noreferrer" class="">AWS's Managed Prometheus offering</a>, which would have simplified some of the operational complexity of running a Prometheus backend (e.g. <a href="https://thanos.io/" target="_blank" rel="noopener noreferrer" class="">Thanos</a> or <a href="https://cortexmetrics.io/" target="_blank" rel="noopener noreferrer" class="">Cortex</a>) ourselves. Doing some back-of-the-napkin math, we realized that, if we didn't do anything differently, we'd end up spending about 3x what we were currently spending on InfluxDB. Plus, we'd still have to manage our own Grafana instance.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-goldilocks-zone">The goldilocks zone<a href="https://labaneilers.com/prometheus-vendor-death-match#the-goldilocks-zone" class="hash-link" aria-label="Direct link to The goldilocks zone" title="Direct link to The goldilocks zone" translate="no">​</a></h3>
<p>We also looked at a few vendors that were in the middle of the price distribution, such as <a href="https://chronosphere.io/" target="_blank" rel="noopener noreferrer" class="">Chronosphere</a> and <a href="https://grafana.com/products/cloud/" target="_blank" rel="noopener noreferrer" class="">Grafana Cloud</a>. These were also Prometheus-compatible, with Grafana front-ends, but both companies were reasonably established, and had similar-looking feature sets.</p>
<p>Chronosphere was the first vendor we decided to evaluate, because their sales pitch included something we hadn't heard from any other vendor: they'd provide a way for us to manage costs with powerful, centralized ingestion controls, and as a result, could offer us predictable pricing.</p>
<p>This piqued our interest, not just for managing costs, but because we'd long had problems with cardinality. At any moment, cardinality from any given service could unexpectedly explode, based on the decisions of a single programmer. For example, we've occasionally had instances of programmers inserting a metric label where the value is a unique ID (such as a customer ID), which could have hundreds of thousands of possible values. Previously, we had to be extremely vigilant about this, and pounce on any team that introduced cardinality explosions before they could bring down our InfluxDB backend.</p>
<p>So having a way to manage cardinality, centrally, was very enticing.</p>
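<p>To put rough numbers on the customer-ID example, here's a minimal sketch of the arithmetic. The label names and counts are made up for illustration; the point is only that the worst-case number of series for a metric is the product of the distinct values of each label, so one unbounded label dominates everything else. Real series counts are usually lower than this bound, since not every combination actually occurs.</p>

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on active series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_values.values())


# Hypothetical label counts, for illustration only.
before = {"service": 40, "endpoint": 25, "status_code": 5}
# The same metric after someone adds a customer ID as a label.
after = {**before, "customer_id": 200_000}

print(worst_case_series(before))  # 5000
print(worst_case_series(after))   # 1000000000
```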
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="evaluating-chronosphere">Evaluating Chronosphere<a href="https://labaneilers.com/prometheus-vendor-death-match#evaluating-chronosphere" class="hash-link" aria-label="Direct link to Evaluating Chronosphere" title="Direct link to Evaluating Chronosphere" translate="no">​</a></h2>
<p>We decided to proceed with a PoC of Chronosphere. We started with some changes to our metrics pipeline infrastructure, adding <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry Collector</a>s to help capture and redirect our current metrics data (which was coming mostly from telegraf), so that we could send metrics to both our in-house InfluxDB and Chronosphere concurrently.</p>
<p>We had, as part of a previous set of experiments, already set up some common Prometheus metrics infrastructure in our Kubernetes clusters, including kube-state-metrics, node-exporter, and cadvisor. We were able to easily point these at Chronosphere as well.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-sheer-volume-of-metrics">The sheer volume of metrics<a href="https://labaneilers.com/prometheus-vendor-death-match#the-sheer-volume-of-metrics" class="hash-link" aria-label="Direct link to The sheer volume of metrics" title="Direct link to The sheer volume of metrics" translate="no">​</a></h3>
<p>The first thing we realized was that we were sitting on an <em>enormous</em> amount of cardinality. Chronosphere reported that we were generating over <strong>8 million active series</strong>, and our sales engineers were flabbergasted about how we were even able to handle all of it with a single InfluxDB server.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Fun fact</div><div class="admonitionContent_BuS1"><p>Actually, every vendor we talked to was certain we were mistaken when we reported to them we were handling 8-9 million active series in a single InfluxDB; they assured us that this wasn't possible.</p><p>And yet, somehow we were doing it.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-chronosphere-control-plane">The Chronosphere control plane<a href="https://labaneilers.com/prometheus-vendor-death-match#the-chronosphere-control-plane" class="hash-link" aria-label="Direct link to The Chronosphere control plane" title="Direct link to The Chronosphere control plane" translate="no">​</a></h3>
<p>Our sales engineers immediately got to work helping us learn how to use their "control plane" feature, which allows you to write fairly arbitrary rules which can select metrics by virtually any criteria (names, label values, and/or combinations based on boolean expressions), and perform complex transformations on them, including:</p>
<ul>
<li class="">Drop them entirely</li>
<li class="">Aggregate away high-cardinality labels</li>
<li class="">More complex transformations, such as changing the temporality of the counters (e.g. change a "delta" counter to a "cumulative" counter)</li>
</ul>
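<p>As a rough illustration of what "aggregating away" a label means, here's a small conceptual sketch. It is not Chronosphere's rule syntax or implementation (the function name and sample data are hypothetical); it just shows that once a high-cardinality label is dropped, all the series that differed only by that label collapse into one, with their values summed.</p>

```python
from collections import defaultdict


def aggregate_away(samples, drop_labels):
    """Sum sample values across every series that differs only in the
    labels being dropped. Each sample is a (labels dict, value) pair."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels, in a stable order, identify the new series.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return dict(totals)


# Hypothetical samples: three series before, two after dropping customer_id.
samples = [
    ({"service": "checkout", "customer_id": "c1"}, 3.0),
    ({"service": "checkout", "customer_id": "c2"}, 5.0),
    ({"service": "search", "customer_id": "c1"}, 2.0),
]
result = aggregate_away(samples, {"customer_id"})
print(result)
# {(('service', 'checkout'),): 8.0, (('service', 'search'),): 2.0}
```

<p>Summing is the natural choice for counters; other metric types might call for a different aggregation.</p>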
<p>It was immediately clear that their control plane was <strong>extremely powerful</strong>. We did a bit of analysis on the highest cardinality metrics coming in, and by cross-referencing them with a JSON export of our existing Grafana dashboards, were able to create a relatively small number of rules that reduced our cardinality by about 60%. It took a bit more work to go farther, but we eventually got our cardinality down to about 1.7 million active series.</p>
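<p>The cross-referencing step can be sketched roughly like this. This is an illustrative approximation, not the actual analysis: it assumes each exported dashboard is a dict with a top-level <code>panels</code> list whose entries carry a <code>targets</code> list (nested row panels are not handled), and that per-metric series counts are already available from somewhere. The metric names and counts are made up.</p>

```python
def target_strings(dashboard: dict) -> list[str]:
    """Collect every string value found under panels[].targets in one dashboard."""
    found: list[str] = []

    def walk(node) -> None:
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str):
            found.append(node)

    for panel in dashboard.get("panels", []):
        walk(panel.get("targets", []))
    return found


def unused_metrics(series_counts: dict[str, int], dashboards: list[dict]) -> list[tuple[str, int]]:
    """Return (metric, active series) pairs never mentioned by any dashboard,
    largest first. Substring matching errs on the side of keeping a metric."""
    haystack = "\n".join(s for d in dashboards for s in target_strings(d))
    unused = [(name, count) for name, count in series_counts.items() if name not in haystack]
    return sorted(unused, key=lambda pair: pair[1], reverse=True)


# Hypothetical dashboard export and series counts, for illustration only.
dashboards = [
    {"panels": [{"targets": [
        {"measurement": "cpu_usage"},
        {"query": "SELECT mean(value) FROM http_requests"},
    ]}]}
]
series_counts = {
    "cpu_usage": 1200,
    "http_requests": 90000,
    "procstat_memory": 450000,
    "diskio_reads": 30000,
}
print(unused_metrics(series_counts, dashboards))
# [('procstat_memory', 450000), ('diskio_reads', 30000)]
```

<p>Metrics at the top of that list are the cheapest wins: lots of series, and nobody is looking at them.</p>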
<p>At this point, since we had a handle on our active series volume, Chronosphere's sales folks gave us an initial price, which turned out to be very reasonable.</p>
<p>Holy cow, it looked like this just might work!</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="using-chronosphere">Using Chronosphere<a href="https://labaneilers.com/prometheus-vendor-death-match#using-chronosphere" class="hash-link" aria-label="Direct link to Using Chronosphere" title="Direct link to Using Chronosphere" translate="no">​</a></h3>
<p>Once we had addressed the initial concerns around affordability, we got to work evaluating the product's overall fit. We had a bunch of teams convert some of their InfluxDB-backed dashboards and alerts over to Chronosphere, and started to get a feel for how it would be to use it day-to-day.</p>
<p>Since Chronosphere's UI was based on Grafana (v7), it turned out to be very similar to our self-managed Grafana/InfluxDB from a developer perspective, with the main differences being:</p>
<ul>
<li class="">The PromQL language</li>
<li class="">Much better query performance</li>
</ul>
<p>After a few weeks of playing with the product, we were satisfied it would do the job. We gave it the thumbs up.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="evaluating-grafana-cloud">Evaluating Grafana Cloud<a href="https://labaneilers.com/prometheus-vendor-death-match#evaluating-grafana-cloud" class="hash-link" aria-label="Direct link to Evaluating Grafana Cloud" title="Direct link to Evaluating Grafana Cloud" translate="no">​</a></h2>
<p>Initially, we had sort of written off Grafana Cloud, since the price they gave us originally, based on our active series in InfluxDB, was in the same range as New Relic and Datadog. However, this was before we realized that they had a feature that was similar to Chronosphere's control plane, called <a href="https://grafana.com/products/cloud/metrics/prometheus-cardinality-optimization/" target="_blank" rel="noopener noreferrer" class="">Adaptive Metrics</a>.</p>
<p>We told the Grafana sales team that, using Chronosphere, we'd been able to reduce our metrics to under 2 million active series, and asked for a new quote based on the assumption we could use Adaptive Metrics to get similar results in Grafana Cloud.</p>
<p>They came back with a price that was almost exactly the same as Chronosphere. The race was back on!</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="using-adaptive-metrics">Using Adaptive Metrics<a href="https://labaneilers.com/prometheus-vendor-death-match#using-adaptive-metrics" class="hash-link" aria-label="Direct link to Using Adaptive Metrics" title="Direct link to Using Adaptive Metrics" translate="no">​</a></h3>
<p>Once we updated our metrics pipeline to export to Grafana Cloud, and had a chance to start playing with Adaptive Metrics, we were disappointed to find that it wasn't nearly as powerful as Chronosphere's control plane. The biggest difference was that you could only target metrics based on their names, and not their labels or values. This was a big limitation, as we had a lot of rules we had written in Chronosphere that did things like:</p>
<ul>
<li class="">Drop all metrics from a specific service, except for a few key ones</li>
<li class="">Drop a particular metric generated by a telegraf plugin (e.g. procstat or diskio), but not for services in an "allowlist"</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="but-we-really-really-liked-grafana-cloud">But we really, really liked Grafana Cloud<a href="https://labaneilers.com/prometheus-vendor-death-match#but-we-really-really-liked-grafana-cloud" class="hash-link" aria-label="Direct link to But we really, really liked Grafana Cloud" title="Direct link to But we really, really liked Grafana Cloud" translate="no">​</a></h3>
<p>Aside from cardinality management, where Chronosphere clearly had the lead, we found a lot of areas where we preferred Grafana Cloud:</p>
<ul>
<li class="">They had a more modern, polished user experience (both used Grafana as a front-end, but Grafana Cloud has the latest version, while Chronosphere's was pinned to v7, which is very old)</li>
<li class="">Their documentation was significantly better</li>
<li class="">They had support for multiple data sources, including CloudWatch, ElasticSearch, and Athena (which were important to us)</li>
<li class="">They were strong leaders in the open source observability community</li>
<li class="">Grafana Labs was a larger and more established company, with a more robust and mature product portfolio</li>
<li class="">It seemed credible that we may eventually be able to migrate our traces and logs to them as well, giving us a unified observability platform</li>
</ul>
<p>It was clear that, besides the discrepancy in cardinality management, we'd prefer to go with Grafana Cloud. However, if we wanted to make this work, we'd need to find a way to handle the use cases that Adaptive Metrics wouldn't cover.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="taking-another-look-at-the-otel-collector">Taking another look at the OTEL collector<a href="https://labaneilers.com/prometheus-vendor-death-match#taking-another-look-at-the-otel-collector" class="hash-link" aria-label="Direct link to Taking another look at the OTEL collector" title="Direct link to Taking another look at the OTEL collector" translate="no">​</a></h3>
<p>It turns out that the <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OTEL Collector</a> (which I mentioned we were already using) is an insanely useful swiss-army knife for building observability pipelines. It can collect metrics, traces, and logs in virtually any format, run a pipeline of transformations, and output them in virtually any other format.</p>
<p>I knew that the OTEL collector had a number of <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor" target="_blank" rel="noopener noreferrer" class="">processors</a> available, though we hadn't used them much previously. I wondered if we could use these to replicate some of the more advanced metrics selector functionality that Chronosphere offered.</p>
<p>It took me a bit of time to figure it all out, mostly because the OTEL collector documentation isn't amazing, but I was eventually able to replicate pretty much all of the advanced "drop" rules we needed using the OTEL collector's processors.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Check it out</div><div class="admonitionContent_BuS1"><p>Check out <a class="" href="https://labaneilers.com/fun-with-otel-collectors-and-metrics">some of the tricks I used to replicate some of Chronosphere's drop rule features in the OTEL collector</a></p></div></div>
<p>In the end, with the combination of Grafana Cloud's Adaptive Metrics and the OTEL collector processors, we were able to get our total cardinality down to a similar level as we had with Chronosphere. While the resulting solution was a bit more complicated, it was an acceptable tradeoff given the other advantages of Grafana Cloud.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://labaneilers.com/prometheus-vendor-death-match#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>The experience of running a head-to-head evaluation of two vendors, especially given the penetration of OpenTelemetry and Prometheus in the market, was a real eye-opener. I'm more bullish than ever on OTEL (and cloud-native standardization initiatives in general), and I think it's going to continue to reshape the observability landscape in the coming years.</p>
<p>I should point out that, even though we selected Grafana Cloud, I think Chronosphere would have also been an excellent choice. I think it might even be a better choice for a company that meets a few criteria:</p>
<ul>
<li class="">Your biggest pain point is cardinality and/or cost management</li>
<li class="">You have a large number of metrics producers that would be hard to corral into a uniform schema</li>
<li class="">You don't have a lot of third party metrics sources (e.g. CloudWatch, ElasticSearch) that you want to query directly (Chronosphere integrates with those data sources by eagerly scraping them and converting them to Prometheus metrics... which can increase costs for sources, like CloudWatch, that charge by the API call)</li>
<li class="">You're OK with a slightly less polished user experience (or you're willing to wait for Chronosphere to catch up)</li>
</ul>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Confession</div><div class="admonitionContent_BuS1"><p>The sales engineering team at Chronosphere was <em>absolutely amazing</em>. They put in a <em>ton</em> of work helping me adapt our existing Influx-centric pipeline to work with Prometheus and OTEL. Plus they had to put up with me, who required a remedial education in Prometheus concepts before we could do anything.</p><p>They were so patient, knowledgeable, and great to work with, I feel legitimately <em>terrible</em> (on a personal level) we eventually decided to go with a competitor.</p></div></div>
<p>That said, Grafana Cloud has been a great fit for us. Their support and customer success teams, in particular, have been really effective in helping get our team ramped up and successful. Given this experience, we're interested in expanding our use into their logging (Loki) and tracing (Tempo) products. I'll let you know how that goes.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="addendum">Addendum<a href="https://labaneilers.com/prometheus-vendor-death-match#addendum" class="hash-link" aria-label="Direct link to Addendum" title="Direct link to Addendum" translate="no">​</a></h3>
<p>I reached out to both Grafana Labs and Chronosphere with a draft of this post. I'm glad I did, because Chronosphere let me know that due to feedback like ours, they've been investing in some of the areas in which they were weakest relative to Grafana Cloud, namely UI quality:</p>
<ul>
<li class=""><a href="https://chronosphere.io/learn/chronosphere-bolsters-oss-contributions-for-perses-otel/" target="_blank" rel="noopener noreferrer" class="">https://chronosphere.io/learn/chronosphere-bolsters-oss-contributions-for-perses-otel/</a></li>
<li class=""><a href="https://chronosphere.io/resource/chronosphere-lens-solutions-brief-pdf/" target="_blank" rel="noopener noreferrer" class="">https://chronosphere.io/resource/chronosphere-lens-solutions-brief-pdf/</a></li>
</ul>
<p>They're the primary force behind <a href="https://github.com/perses/perses" target="_blank" rel="noopener noreferrer" class="">Perses</a>, which is a competitor for OSS Grafana (which Chronosphere was previously using for visualizations). They weren't specific about the details, but my guess is the monolithic design of Grafana, combined with its <a href="https://www.gnu.org/licenses/agpl-3.0.en.html" target="_blank" rel="noopener noreferrer" class="">AGPL license</a>, limited their ability to integrate it effectively into their product without having their proprietary UI be infected with the AGPL redistribution terms. Perses is permissively licensed (Apache 2) and backed by the Linux Foundation.</p>
<p>It looks like it's designed to be modular and embeddable, as well as be more IaC/GitOps-friendly than Grafana. The project is very young, but I'm excited to see some more open-source visualization options available.</p>]]></content:encoded>
            <category>opentelemetry</category>
            <category>devops</category>
            <category>platform-engineering</category>
            <category>observability</category>
        </item>
        <item>
            <title><![CDATA[Fun with OTEL collectors and metrics]]></title>
            <link>https://labaneilers.com/fun-with-otel-collectors-and-metrics</link>
            <guid>https://labaneilers.com/fun-with-otel-collectors-and-metrics</guid>
            <pubDate>Sat, 11 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[As part of an evaluation of Prometheus compatible monitoring solutions, I found the need to push our use of the OTEL Collector to handle some use cases like creating metrics allowlists, renaming metrics, or adding and modifying labels.]]></description>
            <content:encoded><![CDATA[<img src="https://labaneilers.com/assets/images/otel-logo-f623e3ec0a4ae9e63e181d96f85dbfa9.png" class="blog-image" alt="OpenTelemetry Logo">
<p>As part of an <a class="" href="https://labaneilers.com/prometheus-vendor-death-match">evaluation of Prometheus compatible monitoring solutions</a>, I found the need to push our use of the <a href="https://opentelemetry.io/docs/collector/" target="_blank" rel="noopener noreferrer" class="">OTEL Collector</a> to handle some use cases like creating metrics allowlists, renaming metrics, or adding and modifying labels.</p>
<p>Here are some examples, based on what I learned, of the crazy and powerful things you can do with OTEL collector <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor" target="_blank" rel="noopener noreferrer" class="">processors</a> to manipulate metrics.</p>
<!-- -->
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inserting-static-labels">Inserting static labels<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#inserting-static-labels" class="hash-link" aria-label="Direct link to Inserting static labels" title="Direct link to Inserting static labels" translate="no">​</a></h3>
<p>As part of a multi-account AWS strategy, we have many Kubernetes clusters, spread across AWS accounts for each of our teams. We wanted to make sure that all metrics coming from Kubernetes clusters contain labels with metadata about which cluster and account they came from (beyond what comes with the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/k8sattributesprocessor/README.md" target="_blank" rel="noopener noreferrer" class="">k8sattributes processor</a>).</p>
<p>We run the OTEL collector as a daemonset (so it runs on every node in our clusters), and all telemetry from our pods goes through it.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>NOTE</div><div class="admonitionContent_BuS1"><p>Since these OTEL collectors are deployed in Kubernetes, we can inject environment variables into the pods with this static information (in our case these environment variables are set via Terraform).</p></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">attributes/cluster-metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">actions</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> upsert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"${CLUSTER_ENV}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> env</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key 
atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> upsert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"${CLUSTER_LABEL}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cluster_label</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> upsert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"${CLUSTER_NAME}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cluster</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" 
style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> upsert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"${CLUSTER_TEAM}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> team</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> upsert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"${CLUSTER_REGION}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> region</span><br></span></code></pre></div></div>
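<p>For reference, here's a minimal sketch of how those environment variables could be set on the collector container in the daemonset's pod spec. The container name, image, and all of the values are hypothetical placeholders; only the variable names come from the processor config above.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"># Hypothetical excerpt from the collector daemonset's pod template.
# All values are placeholders; in our case they'd be filled in by Terraform.
spec:
  containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib
    env:
    - name: CLUSTER_ENV
      value: "production"
    - name: CLUSTER_LABEL
      value: "example-label"
    - name: CLUSTER_NAME
      value: "example-cluster"
    - name: CLUSTER_TEAM
      value: "example-team"
    - name: CLUSTER_REGION
      value: "us-east-1"
</code></pre></div></div>
<p>The collector expands the <code>${CLUSTER_ENV}</code>-style references when it loads its config, so every metric that passes through the processor picks up the same static values.</p>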
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inserting-dynamic-labels">Inserting dynamic labels<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#inserting-dynamic-labels" class="hash-link" aria-label="Direct link to Inserting dynamic labels" title="Direct link to Inserting dynamic labels" translate="no">​</a></h3>
<p>We have a bunch of legacy services, deployed outside Kubernetes, that don't have an <code>instance</code> label (which is idiomatic in Prometheus). These metrics (generated by <a href="https://github.com/influxdata/telegraf" target="_blank" rel="noopener noreferrer" class="">telegraf</a>) do have a <code>host</code> label, however, so we copy it into an <code>instance</code> label, also using the <code>attributes</code> processor. The <code>insert</code> action only adds the label when it doesn't already exist, so metrics that already have an <code>instance</code> label are left untouched:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">attributes/instance-label</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">actions</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> insert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">from_attribute</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> host</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> instance</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="replacing-useless-labels-with-useful-ones">Replacing useless labels with useful ones<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#replacing-useless-labels-with-useful-ones" class="hash-link" aria-label="Direct link to Replacing useless labels with useful ones" title="Direct link to Replacing useless labels with useful ones" translate="no">​</a></h3>
<p>When we scrape <a href="https://github.com/kubernetes/kube-state-metrics" target="_blank" rel="noopener noreferrer" class="">kube-state-metrics</a>, the <code>pod</code> and <code>namespace</code> labels on the metrics are the pod and namespace of the kube-state-metrics pod itself. This isn't so useful; we don't care about the kube-state-metrics pod names, we only care about the pods that are the subject of the metrics.</p>
<p>Here's a trick where we use the <code>attributes</code> processor to remove the kube-state-metrics pod/namespace labels, and then the <code>metricstransform</code> processor to rename the exported pod/namespace labels to replace them.</p>
<p>This way, when users are querying for metrics on their pod, they can just use the <code>pod</code> label, and don't have to worry about the implementation details of how kube-state-metrics is scraped:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Delete the pod and namespace labels which refer to the kube-state-metrics</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># pod itself, not the pods the metrics refer to.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">attributes/kube-state-metrics</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">include</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">match_type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> regexp</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">metric_names</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"^kube_.+$"</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">actions</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> delete</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> pod</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> delete</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> namespace</span><br></span></code></pre></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Rename the exported_pod and exported_namespace labels to pod and namespace</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metricstransform/kube-state-metrics</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">transforms</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">include</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"^kube_.*$"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">match_type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> regexp</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token 
punctuation" style="color:#393A34">:</span><span class="token plain"> update</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">operations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> update_label</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">label</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> exported_namespace</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">new_label</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> namespace</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> update_label</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">label</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> exported_pod</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span 
class="token key atrule" style="color:#00a4db">new_label</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> pod</span><br></span></code></pre></div></div>
<p>You'll need to make sure that the <code>attributes/kube-state-metrics</code> processor runs before the <code>metricstransform/kube-state-metrics</code> processor in your pipeline, so that the old labels are deleted before the exported ones are renamed to take their place.</p>
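<p>Processors run in the order they're listed in a pipeline, so the ordering can be sketched like this. The receiver and exporter names are assumptions for illustration; substitute whatever your config already defines.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv">service:
  pipelines:
    metrics:
      # Assumed receiver; it must be defined under "receivers:" elsewhere.
      receivers: [prometheus]
      processors:
      # 1. Delete the pod/namespace labels of the kube-state-metrics pod itself.
      - attributes/kube-state-metrics
      # 2. Rename exported_pod/exported_namespace to pod/namespace.
      - metricstransform/kube-state-metrics
      # Assumed exporter; it must be defined under "exporters:" elsewhere.
      exporters: [prometheusremotewrite]
</code></pre></div></div>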
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="renaming-metrics">Renaming metrics<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#renaming-metrics" class="hash-link" aria-label="Direct link to Renaming metrics" title="Direct link to Renaming metrics" translate="no">​</a></h3>
<p>Sometimes, we'd find older services had metrics that had been named in various problematic ways, so we wanted a way to rename metrics (e.g. to adhere to a naming convention). Here's a use of the <code>metricstransform</code> processor that renames all metrics with a <code>badsuffix</code> to have a <code>goodsuffix</code> instead:</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>NOTE</div><div class="admonitionContent_BuS1"><p>The double dollar sign (<code>$$</code>) in the config below is intentional: the OTEL collector would otherwise interpret <code>${1}</code> as an environment variable reference. Writing <code>$$</code> escapes the dollar sign, so it's passed through as a literal <code>$</code>, and <code>${1}</code> is then used as the reference to the regular expression's first capture group.</p></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">metricstransform/fix-suffix</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">transforms</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">include</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ^(.</span><span class="token important">*)_badsuffix$</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">match_type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> regexp</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">action</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> update</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">new_name</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"$${1}_goodsuffix"</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="truncating-long-label-values">Truncating long label values<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#truncating-long-label-values" class="hash-link" aria-label="Direct link to Truncating long label values" title="Direct link to Truncating long label values" translate="no">​</a></h3>
<p>Grafana Cloud has a maximum label length of 1024 characters. Any metrics with labels exceeding this length will be dropped before they're ingested. Here's a nifty transform that truncates all label values to this length:</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>Why would anyone have a label value that long? Well, there's no <em>good</em> reason. But sometimes, <em>just sometimes</em>, a distracted programmer may accidentally include an entire stack trace as a label value.</p><p>Not naming names.</p></div></div>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">transform/truncate-labels</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metric_statements</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">context</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> datapoint</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">statements</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> truncate_all(attributes</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> 1024)</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="filtering-metrics-with-arbitrary-queries">Filtering metrics with arbitrary queries<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#filtering-metrics-with-arbitrary-queries" class="hash-link" aria-label="Direct link to Filtering metrics with arbitrary queries" title="Direct link to Filtering metrics with arbitrary queries" translate="no">​</a></h3>
<p>Here's where we get to the use case that we really needed: a way to drop large swaths of metrics entirely, based on arbitrary queries. The most common pattern we were trying to replicate was an "allowlist", where we drop most metrics, except for those that meet some criteria.</p>
<p>The OTEL collector has a <code>filter</code> processor to do this, and it supports the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/README.md" target="_blank" rel="noopener noreferrer" class="">OpenTelemetry Transformation Language</a> (OTTL), which allows you to write complex expressions to represent your filter criteria:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">filter/drop-rules</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">error_mode</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ignore</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metrics</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">datapoint</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># An "allowlist" that drops all metrics from 'some-service' except for</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># the two specified.</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">&gt;</span><span class="token scalar string" style="color:#e3116c"></span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      resource.attributes["service.name"] == "some-service" and</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      metric.name != "some_service_important_metric" and</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      metric.name != "up"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># A similar allowlist, but using a regex to match the service name</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># for various sized fluent-bit daemonset pods (we have small, medium, </span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># and large variants of fluent-bit).</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">&gt;</span><span class="token 
scalar string" style="color:#e3116c"></span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      IsMatch(resource.attributes["service.name"], "fluent-bit-.*") and</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      metric.name != "fluentbit_output_dropped_records" and</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      metric.name != "up"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># A drop rule that drops a specific cadvisor metric for all services</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># except for those in a specific namespace.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">&gt;</span><span class="token scalar string" style="color:#e3116c"></span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      resource.attributes["service.name"] == "cadvisor" and</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      metric.name == "container_file_descriptors" and</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" 
style="color:#e3116c">      (not IsMatch(attributes["namespace"], "^someservice.*$"))</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># A drop rule that shows you can use more complex boolean expressions</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># with parentheses to group conditions.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">&gt;</span><span class="token scalar string" style="color:#e3116c"></span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      attributes["telegraf"] == "1" and (</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">        IsMatch(metric.name, "^internal_(agent|gather|memstats|serializer|statsd|write)_.*") or</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">        IsMatch(metric.name, ".+[-_]request[-_]metrics[-_](median|sum)$") or</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">        IsMatch(metric.name, ".+_stddev$")</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      )</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Another complex drop rule, where we're dropping metrics matching a</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># regex for all services, but with a list of exceptions.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">&gt;</span><span class="token scalar string" style="color:#e3116c"></span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      IsMatch(metric.name, ".+[-_]request[-_]metrics[-_]upper$") and not (</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">        attributes["service"] == "service1" or</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">        attributes["service"] == "service2" or </span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">        metric.name == "inconsistently_named_service_request_metrics_upper"</span><br></span><span class="token-line" style="color:#393A34"><span class="token scalar string" style="color:#e3116c">      )</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="limitations-of-the-otel-collector">Limitations of the OTEL collector<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#limitations-of-the-otel-collector" class="hash-link" aria-label="Direct link to Limitations of the OTEL collector" title="Direct link to Limitations of the OTEL collector" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="it-cant-do-aggregations">It can't do aggregations<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#it-cant-do-aggregations" class="hash-link" aria-label="Direct link to It can't do aggregations" title="Direct link to It can't do aggregations" translate="no">​</a></h3>
<p>While the OTEL collector has many powerful processors, it doesn't currently have the ability to do aggregations (i.e. drop a particular label from a metric and create a new metric by aggregating the metrics that had that label). This is a much harder problem to solve than just dropping metrics, since all the OTEL collector instances that could process any metric you'd want to aggregate would need to coordinate with each other, creating some scaling challenges.</p>
<p>Both <a href="https://grafana.com/products/cloud" target="_blank" rel="noopener noreferrer" class="">Grafana Cloud</a> and <a href="https://chronosphere.io/" target="_blank" rel="noopener noreferrer" class="">Chronosphere</a> offer powerful features around metrics aggregation.</p>
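<p>To make "aggregation" concrete, here's a minimal sketch of the operation itself, with the distributed part left out. It assumes every sample for a metric is already sitting in one process, which is exactly what a fleet of collector instances can't assume. The function name, label names, and sample data are all made up for illustration; this isn't an OTEL collector API.</p>

```python
from collections import defaultdict


def aggregate_without_label(samples, label_to_drop):
    """Sum sample values across all series that differ only by label_to_drop."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The new series identity is every label except the dropped one.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != label_to_drop))
        totals[key] += value
    return dict(totals)


# Hypothetical samples: two pods for one service, one pod for another.
samples = [
    ({"service": "checkout", "pod": "checkout-a"}, 3.0),
    ({"service": "checkout", "pod": "checkout-b"}, 4.0),
    ({"service": "search", "pod": "search-a"}, 5.0),
]

# Dropping "pod" leaves one aggregated series per service.
aggregated = aggregate_without_label(samples, "pod")
```

<p>The hard part isn't this loop. It's that <code>checkout-a</code> and <code>checkout-b</code> may be processed by different collector instances, so no single instance ever sees both values to add up.</p>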
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="it-cant-convert-delta-to-cumulative-counters">It can't convert delta to cumulative counters<a href="https://labaneilers.com/fun-with-otel-collectors-and-metrics#it-cant-convert-delta-to-cumulative-counters" class="hash-link" aria-label="Direct link to It can't convert delta to cumulative counters" title="Direct link to It can't convert delta to cumulative counters" translate="no">​</a></h3>
<p>When I evaluated Chronosphere, I was delighted to find that it had a feature to change the temporality of metrics (e.g. change a "delta" counter to a "cumulative" counter), and I was hoping to replicate it with the OTEL collector. While the OTEL collector does have a <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/deltatocumulativeprocessor" target="_blank" rel="noopener noreferrer" class="">converter processor in the works</a>, it's still early in development.</p>
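<p>For anyone unfamiliar with the distinction, here's a rough sketch of what the conversion involves: a delta counter reports the change since the last report, a cumulative counter reports the running total, so converting means remembering a total per series. The class and series name below are hypothetical, and this ignores the things a real processor has to handle (resets, out-of-order points, state that expires).</p>

```python
class CumulativeCounter:
    """Keeps a running total per series so deltas can be reported as cumulative values."""

    def __init__(self):
        self._totals = {}

    def add_delta(self, series, delta):
        # Add the reported change to whatever total we have seen so far.
        total = self._totals.get(series, 0) + delta
        self._totals[series] = total
        return total


counter = CumulativeCounter()
# Hypothetical delta reports for one series over four intervals.
cumulative = [counter.add_delta("requests", d) for d in [5, 0, 3, 2]]
```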
<p>In our case, we were able to work around this with some hackery in telegraf (setting the <code>delete_counters</code> setting of the <a href="https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md" target="_blank" rel="noopener noreferrer" class="">statsd plugin</a> to <code>false</code>).</p>]]></content:encoded>
            <category>opentelemetry</category>
            <category>devops</category>
            <category>platform-engineering</category>
            <category>observability</category>
        </item>
        <item>
            <title><![CDATA[The impertinent programmer]]></title>
            <link>https://labaneilers.com/the-impertinent-programmer</link>
            <guid>https://labaneilers.com/the-impertinent-programmer</guid>
            <pubDate>Sun, 05 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[It was 2004, and I had a huge chip on my shoulder.]]></description>
            <content:encoded><![CDATA[<p>It was 2004, and I had a huge chip on my shoulder.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wait-you-need-some-background-first">Wait, you need some background first<a href="https://labaneilers.com/the-impertinent-programmer#wait-you-need-some-background-first" class="hash-link" aria-label="Direct link to Wait, you need some background first" title="Direct link to Wait, you need some background first" translate="no">​</a></h2>
<p>Let's back up for a minute... It was 2000, and I had been hired for my first actual job as a programmer at a company called Corex (makers of CardScan, a business card scanner). At this point I had a few years of employment under my belt, but this was as a graphic designer who did a little programming on the side. I was pretty clear about this in the interview for Corex, but I guess I did well enough on some programming problems (or there was some misunderstanding?) that my new boss was reasonably confident I could grow into a programmer who did a little graphic design.</p>
<p>They dropped me onto a team comprised entirely of people who had gone to engineering schools, wrote C++, and used words like "orthogonal". They gave me a web-oriented project, put me under the watchful eye of a cranky Russian PhD who would write COM objects that I could script against, and left me to decide how to glue this all together with a true gem of 2000-era web tech: ASP and VBScript.</p>
<p>I crammed books on programming most nights in bed, trying desperately to incorporate some high-level theory to make sense of the trial-and-error hacking I was doing during the day. The feeling of barely keeping myself from drowning eventually gave way to the sense of floating; I was making real, tangible progress, and I was having a ton of fun using a computer to put pixels on a screen.</p>
<p>Three years into this job, I got the opportunity to join a startup that had recently spun out of Corex called <a href="https://www.zoominfo.com/" target="_blank" rel="noopener noreferrer" class="">ZoomInfo</a> (they're still around). I'm not 100% certain exactly what number employee I was, but I was definitely in the first ten. Most folks on the team were roughly the same age as me, but like the previous job, they had all studied computer science, and knew things like whatever a "heapsort" is for. They were all C++ slingers too, and once again, I was the rookie who was hanging with the pros.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ok-so-back-to-2004-and-that-giant-chip">OK, so back to 2004 and that giant chip<a href="https://labaneilers.com/the-impertinent-programmer#ok-so-back-to-2004-and-that-giant-chip" class="hash-link" aria-label="Direct link to OK, so back to 2004 and that giant chip" title="Direct link to OK, so back to 2004 and that giant chip" translate="no">​</a></h2>
<p>At this point, despite my questionable background, I'd earned enough trust to be put in charge of a small team working on our B2B products. The main product was also built with ASP (this is still classic ASP; .NET existed, but was still the bleeding edge) with an era-appropriate amount of CSS and JavaScript. The app got its data from a massive SQL Server that had been populated by web crawlers, using some natural language processing. This was some legitimately groundbreaking stuff, written by a <a href="https://www.linkedin.com/in/michel-decary-61b7239b" target="_blank" rel="noopener noreferrer" class="">French Canadian genius</a> whose brilliance probably made the rest of the team feel like, well, how I felt around all of them.</p>
<p>The core UI of the product was pretty simple: you'd search for people by attributes (what industry they were in, what titles they may have had, what their field of expertise was), get a list of results, and click on one of them to get to the detail view.</p>
<p>I had inherited the implementation of this from some more senior members of the team a year previously, and had been slowly dolling it up, adding features and generally just making it look spiffier.</p>
<p>One thing had been bothering me since the first time I used the product, though: this detail view page took at least six or seven seconds to load. I mean, the page had a lot of great stuff on it, but it got annoying waiting for it on every click. It didn't seem to be a huge deal to anyone on the dev team, but the sales people occasionally grumbled about it.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Do you remember the 2000s?</div><div class="admonitionContent_BuS1"><p>Keep in mind, in 2004, a lot of us had just gotten our first cable modems, so we were used to going to get coffee while a big page loaded.</p></div></div>
<p>The page was built (as lots of ASP applications were at the time) out of C++ COM components, glued together into a UI with VBScript. With some very crude debugging, I could see that all the time was being spent in the C++ code, which I still had very little ability to wrangle.</p>
<p>I went to my boss (who may have been one of the original authors of said C++ component... I'm not 100% sure), and asked about the performance bottleneck. He told me that C++ was the fastest, most efficient, and most powerful language we had available, and that there was really not much that could be done to make this component any faster. The real engineers had already taken their pass at this, so I should let it go and get back to delivering that feature the sales team wanted.</p>
<p>This felt like a dam breaking; at this moment, I felt like I had endured the subtle condescension of these "real" engineers for 4 years now, and despite how much I had grown, I realized they were always going to view me as second-rate.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="programming-out-of-spite">Programming out of spite<a href="https://labaneilers.com/the-impertinent-programmer#programming-out-of-spite" class="hash-link" aria-label="Direct link to Programming out of spite" title="Direct link to Programming out of spite" translate="no">​</a></h2>
<p>I knocked out the feature the sales team wanted within an hour of this conversation, and instead of going back to JIRA for more work, I made the deliberate decision to carve out some time to prove my boss wrong.</p>
<p>First, I had to decipher this godforsaken C++ code. I hacked in some printf statements to log each SQL query as it ran, recompiled, and within another hour was able to replicate the data access pattern in an interactive SQL Server query window. As expected, these queries did indeed take about 6-7 seconds to run.</p>
<p>This component had a fancy, object-oriented design, which elegantly encapsulated each row as an object which was responsible for fetching its own data. The effect was that it was hitting the database like a machine gun, running <strong>a separate query for every row in the grid it was rendering</strong>. In most cases it was running like <strong>60 queries</strong> to render this page.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Fun fact</div><div class="admonitionContent_BuS1"><p>I didn't realize this yet, since C++ was inscrutable to me, but there was also no connection pooling configured for this C++ driver, so each query was establishing a new connection to the DB.</p></div></div>
<p>So I concocted 4 very, very dumb queries to retrieve all the same information that the C++ component did, and ran them in the SQL Explorer. After a little bit of tuning indices, the whole thing ran in like 300 milliseconds.</p>
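<p>If you haven't run into this pattern before, here's a small sketch of the difference, using Python and an in-memory SQLite database rather than the original C++ and SQL Server. The tables, column names, and data are invented for illustration, and the "dumb" version is collapsed to a single join here (the real fix used a handful of queries).</p>

```python
import sqlite3

# Hypothetical schema: 60 people, each with one job title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE job (person_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(i, f"person-{i}") for i in range(1, 61)])
conn.executemany("INSERT INTO job VALUES (?, ?)", [(i, f"title-{i}") for i in range(1, 61)])


def titles_one_query_per_row(conn):
    """Each row fetches its own data: one query for the list, then one per row."""
    queries = 1
    ids = [row[0] for row in conn.execute("SELECT id FROM person ORDER BY id")]
    titles = {}
    for person_id in ids:
        row = conn.execute("SELECT title FROM job WHERE person_id = ?", (person_id,)).fetchone()
        queries += 1
        titles[person_id] = row[0]
    return titles, queries


def titles_single_query(conn):
    """The dumb version: ask the database for everything in one set-based query."""
    rows = conn.execute(
        "SELECT p.id, j.title FROM person p JOIN job j ON j.person_id = p.id"
    ).fetchall()
    return dict(rows), 1


slow_titles, slow_queries = titles_one_query_per_row(conn)
fast_titles, fast_queries = titles_single_query(conn)
```

<p>Both functions return the same data; the only difference is how many round trips it takes to get there, and each round trip costs more when every query also opens a fresh connection.</p>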
<p>At this point, I was pretty confident I was on to something, so I snuck a little time over the next couple days to wrap this all up into a VBScript function (with proper connection pooling) and replace the C++ component altogether. I fired up my localhost server, started clicking around from the search screen into the details pages... and it felt <strong>insanely fast</strong>!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="err-so-what-do-i-do-with-this-now">Err, so what do I do with this now?<a href="https://labaneilers.com/the-impertinent-programmer#err-so-what-do-i-do-with-this-now" class="hash-link" aria-label="Direct link to Err, so what do I do with this now?" title="Direct link to Err, so what do I do with this now?" translate="no">​</a></h2>
<p>At this point I was in a bind. In the sober light of day, I realized that in a fit of rage, I had engaged in an <strong>unauthorized product improvement</strong>. I really wanted to show my boss, but I was honestly a bit afraid.</p>
<p>So I tested the water by showing a couple of my colleagues, who initially thought it was some kind of trick. Once they realized this was for real, there was no keeping the horse in the barn. Within a few minutes, my boss caught wind and came over to see what was going on.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Reality check</div><div class="admonitionContent_BuS1"><p>Hold up a second, though. Did I give you the impression that my boss was some caricature of a feckless micromanager? Remember, this story is from the perspective of a twenty-something who's been deranged by insecurity.</p><p>Truthfully, my boss was a lovely person, and had been a really important mentor to me. Seriously, he had <em>attended my wedding</em>, and we kept in touch for years after I left.</p><p>The tone here is mostly for effect, though does reflect my emotional truth at the time.</p></div></div>
<p>So once I explained what I did, he was actually fairly impressed, and called the CEO over to see it too. The CEO was a legitimately intimidating Israeli guy (who had worked in the Mossad), but even he had a hard time moderating his delight over this unexpected gift.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="looking-back-this-was-fairly-trivial">Looking back, this was fairly trivial<a href="https://labaneilers.com/the-impertinent-programmer#looking-back-this-was-fairly-trivial" class="hash-link" aria-label="Direct link to Looking back, this was fairly trivial" title="Direct link to Looking back, this was fairly trivial" translate="no">​</a></h2>
<p>In retrospect, this was a really elementary little exercise of basic engineering. I've done many more difficult and interesting things since, but this one still sticks out in my mind, because it was my youthful impertinence that pushed me to do something no one else thought I could do... or even try to do.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="20-years-later">20 years later...<a href="https://labaneilers.com/the-impertinent-programmer#20-years-later" class="hash-link" aria-label="Direct link to 20 years later..." title="Direct link to 20 years later..." translate="no">​</a></h2>
<p>Fast forward to just a few days ago.</p>
<p>I had a meeting scheduled with a guy at work who I hadn't met before. The invite said he had some questions about Helm and Kubernetes, so I thought, perfect, I can help.</p>
<p>Right out of the gate, the meeting went a little sideways, when I realized he was asking me for help finding a way to deploy his app to our Kubernetes cluster <em>without</em> using the client tooling that my team had built for this purpose. He was generally skeptical of any internal tools, and assumed they must obviously be inferior junk, and would just get in his way.</p>
<p>It took me a second, since I was self-aware enough to recognize I was feeling defensive, so I took a deep breath and asked some questions about his app. After a few minutes of questions, we had a pretty good understanding of what he wanted. Before I did any advocacy, I clarified that I didn't believe in forcing our tools on anyone, and that he was free to make any decision that made sense to him and his team. But I asked him to take a few moments to listen while I listed some of the problems that were solved in our platform tooling that he would have to replicate if he decided not to use it.</p>
<p>This included stuff like IAM integration, Service/Ingress integration with AWS load balancers, cross-platform docker builds, configuration management, ephemeral environments, and test orchestration. I gave him a chance to ask some questions, to help him understand what he was getting himself into. The tide turned a little bit when, while arguing that IAM integration shouldn't be so hard, he said he could just inject some (long lived) AWS credentials into his pods. At this point, one of his colleagues realized he was advocating doing something that was totally bonkers (and a violation of our security policy).</p>
<p>After this, he opened up a little and we were able to figure out that a lot of what he believed about our platform tooling was mistaken, and it actually did all the things he wanted. He agreed he'd start out with our stuff, and let us know how it went.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-nerve-of-that-guy">The nerve of that guy!<a href="https://labaneilers.com/the-impertinent-programmer#the-nerve-of-that-guy" class="hash-link" aria-label="Direct link to The nerve of that guy!" title="Direct link to The nerve of that guy!" translate="no">​</a></h2>
<p>It took me about an hour to unwind from that meeting. I was so miffed! Our team's platform tooling has been wildly successful, and it's been a couple years since we needed to do any proactive advocacy for it. Demand has been spreading mostly by word-of-mouth, as our developer teams have been really happy with it.</p>
<p>I paced my kitchen while obsessing over the interaction. How impertinent! Doesn't he know that I've already solved these problems? He just casually dismissed all the work my team has done over the last 2 years! He thinks it's trivial; he'll just whip out some shell scripts to solve everything. It can't be that hard.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="recalling-the-virtues-of-impertinence">Recalling the virtues of impertinence<a href="https://labaneilers.com/the-impertinent-programmer#recalling-the-virtues-of-impertinence" class="hash-link" aria-label="Direct link to Recalling the virtues of impertinence" title="Direct link to Recalling the virtues of impertinence" translate="no">​</a></h2>
<p>At this moment I remembered my own experience as an impertinent young programmer, and my mind began to settle. I realized he was offering me a gift: the perspective of someone who, however naive, might have ideas or insight that I was missing. It had been a while since I'd faced this kind of skepticism, and I realized this was a good thing: it's important to have someone keep you on your toes.</p>
<p>This was a reminder that impertinence can be a virtue: a fuel to do cool stuff. Hopefully the next time I meet an impertinent upstart programmer, I'll smile and keep my thoughts to myself.</p>
<p>Or wait... maybe I should be really condescending to get them fired up? I'm gonna have to think about that.</p>]]></content:encoded>
            <category>platform-engineering</category>
            <category>programming</category>
            <category>storytime</category>
        </item>
        <item>
            <title><![CDATA[Finding my outside voice]]></title>
            <link>https://labaneilers.com/finding-my-outside-voice</link>
            <guid>https://labaneilers.com/finding-my-outside-voice</guid>
            <pubDate>Sun, 28 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[For most of my career, I've found that I tend to quickly develop a reputation in whatever company I'm working at. I've never been the best programmer, but I've got some breadth, creativity, and critical thinking skills, and I'm good at synthesis and communication. This has helped me see the big picture in moments where a novel idea was needed, and I was able to connect some talented people to come up with some cool stuff together.]]></description>
            <content:encoded><![CDATA[<figure class="blog-image"><a title="Shanmugamp7, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Small_pond.jpg"><img width="512" alt="Small pond" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/89/Small_pond.jpg/512px-Small_pond.jpg"></a></figure>
<p>For most of my career, I've found that I tend to quickly develop a reputation in whatever company I'm working at. I've never been the best programmer, but I've got some breadth, creativity, and critical thinking skills, and I'm good at synthesis and communication. This has helped me see the big picture in moments where a novel idea was needed, and I was able to connect some talented people to come up with some cool stuff together.</p>
<!-- -->
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-small-pond">The small pond<a href="https://labaneilers.com/finding-my-outside-voice#the-small-pond" class="hash-link" aria-label="Direct link to The small pond" title="Direct link to The small pond" translate="no">​</a></h2>
<p>Because of this, people have mostly taken me seriously within my companies. In my work as both a platform engineer as well as in engineering leadership, this has come in pretty handy. I do a lot of internal communications, mostly with developers, but also across organizations, including occasionally the C-suite, where the objective is often to influence behavior in some way. For example:</p>
<ul>
<li class="">I have to convince team managers to commit some of their team's time to test out a new observability vendor and get me feedback before a purchasing deadline</li>
<li class="">I need to convince a product owner that if their team spends some time migrating their app to our new platform, it will pay off by making their team faster and their product more reliable</li>
<li class="">I need to convince developers they need to adopt some new front-end standards, because mobile web is actually a thing now, and we need to make sure our site works on phones (lol, this one is a bit dated now)</li>
<li class="">I need to convince the CTO that an assumption they made isn't correct, and we need to pivot ASAP</li>
</ul>
<p>In all these kinds of communications, I usually start with writing. Not only does it help me clarify my own thoughts, but I feel like it's a conscientious way to engage with someone when you're obviously trying to influence them. It gives them a chance to read, absorb, and process, without being put on the spot. Then it can be a lot easier to have subsequent conversations.</p>
<p>Writing is an even more essential tool for communicating to a larger audience, and interestingly, the same dynamic applies. Sending an email to a whole division of a company has a slightly different flavor than to an individual, but I've found that the more you treat it like an invitation for further conversation, the more effective it is at actually influencing behavior.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="my-inside-voice">My inside voice<a href="https://labaneilers.com/finding-my-outside-voice#my-inside-voice" class="hash-link" aria-label="Direct link to My inside voice" title="Direct link to My inside voice" translate="no">​</a></h2>
<p>So I ended up writing a lot of internal documentation, emails (and, increasingly, multi-paragraph Slack messages), Confluence articles, and occasionally some slides for a presentation. In all of these, I've found a particular voice that feels appropriately authoritative, but also approachable and informal, with a bit of self-deprecating humor, and the occasional pop culture reference thrown in.</p>
<p>These communications have always been relatively well received, and usually effective. I've also found that developers tend to read my emails at a much higher rate than those from others (sometimes more than those from senior executives).</p>
<p>But at some level, I've always known that a big part of this is my reputation and pre-existing standing in the company, and I'm honestly not sure how big. It's easy to be confident about my communication when I know a big chunk of the people in my audience personally, and understand their context, concerns, and environment.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="finding-my-outside-voice-is-a-little-scary">Finding my outside voice is a little scary<a href="https://labaneilers.com/finding-my-outside-voice#finding-my-outside-voice-is-a-little-scary" class="hash-link" aria-label="Direct link to Finding my outside voice is a little scary" title="Direct link to Finding my outside voice is a little scary" translate="no">​</a></h2>
<p>I've been procrastinating on starting this blog for over 15 years. I wrote my first entry a few months before my youngest child was born... and she's now in high school. I have to admit it's partially because I'm a bit nervous about leaving my comfortable little corporate bubble where everyone knows me.</p>
<p>Part of me thinks all these "great insights" I've been wanting to share for years are going to get absolutely shredded in the daylight, given the tech world is filled with brilliant people, and on any single topic I think I know anything about, there's someone who knows a whole lot more. Perhaps I'm absolutely full of crap, or perhaps my insights are just boring and obvious.</p>
<p>Something changed in the past few months, and the urge to start sharing some ideas has finally overcome my fear of getting pilloried. I'm finally getting to the point where, however masochistic it may be, <strong>I'm really starting to crave external feedback</strong>. I want a way to test out some of these crazy ideas in the open, where I can see how a more diverse and knowledgeable population reacts to them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mustering-the-energy">Mustering the energy<a href="https://labaneilers.com/finding-my-outside-voice#mustering-the-energy" class="hash-link" aria-label="Direct link to Mustering the energy" title="Direct link to Mustering the energy" translate="no">​</a></h2>
<p>The other obstacle was finding the time and energy to do the technical work to get this thing up and running. My site used to be on Vistaprint's website platform (which I helped build back in the day, so it was free for me), and had since been (forcibly) migrated to Wix. Frankly, working with Wix... did not make me happy.</p>
<p>At my current job, I finally found a <a href="https://docusaurus.io/" target="_blank" rel="noopener noreferrer" class="">static site generator</a> that I like a lot for documentation, so that gave me the push I needed to export my stuff out of a languishing WordPress, and get it published to GitHub Pages.</p>
<p>At some point soon I have to figure out SEO, and probably get over my disdain for social media and start putting myself out there.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://labaneilers.com/finding-my-outside-voice#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>I remember as a kid, I was sometimes loud, and my teachers and parents would occasionally admonish me to use my inside voice. Perhaps it was good advice at the time. For me, though, I think it's finally time to find a voice for speaking outside.</p>
<p>Hopefully those mean kids across the street won't throw rocks at me again.</p>]]></content:encoded>
            <category>writing</category>
        </item>
        <item>
            <title><![CDATA[Let a thousand flowers bloom]]></title>
            <link>https://labaneilers.com/let-a-thousand-flowers-bloom</link>
            <guid>https://labaneilers.com/let-a-thousand-flowers-bloom</guid>
            <pubDate>Sat, 27 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Experiments in progressive engineering management]]></description>
            <content:encoded><![CDATA[<img src="https://labaneilers.com/assets/images/nasturtium-f07b4c5dbec595ccdf330d464cac0492.jpg" class="blog-image" alt="Less than a thousand flowers">
<p>I've been reflecting recently on a really formative period in my career, when I had a chance to be part of a massive experiment in progressive engineering management.</p>
<p>About 3 and a half years before I left Vistaprint, I was asked to join the Engineering leadership team by our (relatively) new VP of Engineering, Erin DeCesare (who is now the <a href="https://www.ezcater.com/company/team/erin-decesare/" target="_blank" rel="noopener noreferrer" class="">CTO of EZCater</a>). She was a particularly bold leader in terms of her progressive management ideas, and was rapidly reshaping the organization with a strong set of values around empowerment and servant leadership.</p>
<!-- -->
<p>We were responsible for about 200 developers, who were organized into squads (a.k.a. two-pizza teams), who were then loosely grouped into "tribes" (an idea borrowed from <a href="https://www.atlassian.com/agile/agile-at-scale/spotify" target="_blank" rel="noopener noreferrer" class="">Spotify</a>). The real difference from the previous regime was a pretty extreme amount of autonomy given to teams; they could choose their own technologies, work processes, architectures.</p>
<p>On top of this, Erin was pushing farther into some even more progressive empowerment concepts. For example:</p>
<ul>
<li class="">Managers were given a lot of coaching on servant leadership, and the ones who weren't able to evolve were managed out</li>
<li class="">Teams would be supported by embedded agile coaches, who helped them optimize team health properties, such as psychological safety and culture of feedback</li>
<li class="">Teams were given the freedom to decide what work to pull from our enterprise backlog</li>
<li class="">Teams would decide how to distribute bonuses between them (this one went a bit sideways)</li>
</ul>
<p>All this stuff was a bit mindblowing to me, but I was doing my best to commit to making it all work. It was an extraordinary amount of cat-herding, managing through influence, and general chaos, but it felt crazy enough that it just might work.</p>
<p>We set up some initial organizational mechanisms to attempt to manage all this chaos, since the whole point was to build a better team that could deliver features and products that made our customers happy. We set up feedback loops, elevated thoughtful and visionary people, and set up structures to help make sane architectural decisions.</p>
<p>Then, things got very interesting.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="then-we-watched-the-experiments-unfold">Then we watched the experiments unfold<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#then-we-watched-the-experiments-unfold" class="hash-link" aria-label="Direct link to Then we watched the experiments unfold" title="Direct link to Then we watched the experiments unfold" translate="no">​</a></h2>
<p>We had something like 30 squads, each running themselves in almost any way they saw fit. There was a lot of diversity between teams; folks with different backgrounds, different technology preferences, stronger or weaker opinions, different seniority levels, etc.</p>
<p>What I witnessed was 30 different teams running 30 different experiments into what makes a team successful... or not.</p>
<p>Some examples:</p>
<ul>
<li class="">Some teams stuck conservatively to working in our monolith, and did a release every 3 weeks; others spun up Kubernetes clusters in AWS with kops and deployed their apps via Helm charts several times a day.</li>
<li class="">Some were absolutely religious about automated testing, and obsessive about code coverage; others had a more deliberate risk-management strategy.</li>
<li class="">Some teams spent a long time getting observability working, and others relied more on signals from external SRE and QA.</li>
<li class="">Some worked via consensus and mostly made decisions together; others had one or two very senior folks who set the direction for the team.</li>
<li class="">Some teams did all their work in pairs, or via "mobbing", while others only interacted with their teammates when they were stuck.</li>
</ul>
<p>There was also a lot of variation in the adoption of the agile philosophies we were coaching the teams into adopting.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="things-i-noticed">Things I noticed<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#things-i-noticed" class="hash-link" aria-label="Direct link to Things I noticed" title="Direct link to Things I noticed" translate="no">​</a></h2>
<p>After running in this mode for about 9 months, there were a few themes I noticed which have really informed my perspective today:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="engineers-will-fill-all-available-space-with-engineering-work">Engineers will fill all available space with engineering work<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#engineers-will-fill-all-available-space-with-engineering-work" class="hash-link" aria-label="Direct link to Engineers will fill all available space with engineering work" title="Direct link to Engineers will fill all available space with engineering work" translate="no">​</a></h3>
<p>One of the "tribes" (4 squads, around 25 people) was given an objective, some constraints, and 6-12 months to deliver a new platform for managing our product catalog. We were replacing our "legacy" product system because it had grown too creaky and complex over the years, and we wanted a bunch of new features, especially for marketers to be able to manage content without needing engineering time.</p>
<p>Yeah, I know, this is a classic case of <a href="https://en.wikipedia.org/wiki/Second-system_effect" target="_blank" rel="noopener noreferrer" class="">second-system syndrome</a>, but everything just seemed so <em>obvious</em> to us at the time;  we were smart people, and it seemed achievable to us. So we did some design, broke the work up and gave pieces to each of the squads.</p>
<p>As it turns out, our teams invented engineering problems to fill all available space. They created an elaborate network of microservices, built SPAs out of hip new JavaScript frameworks, integrated some truly abominable "enterprise" vendor products, and designed massively complex processes and workflows to solve every use case that we failed to account for in the legacy system.</p>
<p>It started to hit me that we had jumped the shark when I noticed that a microservice one of the teams built could have been implemented as a single text file. Once that clicked, I started realizing, to my horror, that everywhere I looked, the entire system was like this. And then, through the haze of the groupthink, it occurred to me:</p>
<p><strong>We'd have been better off giving this entire objective to a team of 5 people for a month</strong>.</p>
<p>A small, constrained team would have had no choice but to build something small and simple that worked. Then they would have had to evolve it incrementally, which as we now all know, is the <a href="http://principles-wiki.net/principles:gall_s_law" target="_blank" rel="noopener noreferrer" class="">only way to build a working system</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="agile-can-be-so-powerful-or-so-horrible">Agile can be so powerful, or so horrible<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#agile-can-be-so-powerful-or-so-horrible" class="hash-link" aria-label="Direct link to Agile can be so powerful, or so horrible" title="Direct link to Agile can be so powerful, or so horrible" translate="no">​</a></h3>
<p>Some teams got really religious about their agile methodology (usually Scrum), and used agile "rules" as a weapon against their PO, manager, and occasionally each other. They'd spend a lot more time playing games with ticket management, ceremonies, and storypoints than they did thinking about customers, products, or using their common sense.</p>
<p>One of the best defenses against teams going this direction was having great agile coaches (NOT the high priests from big consultancies who spend their time promoting seminars on LinkedIn). A good agile coach can provide a sort of continuous intervention for a team; they hold up a mirror, allow them to see their own dysfunction, and help re-center them on what matters.</p>
<p>From observing these coaches in action, I formed two separate beliefs:</p>
<ul>
<li class="">
<p>I've noticed that <strong>great agile coaches tend to also be very product-centered</strong>, and are genuinely interested in developing technical and domain knowledge from the team they're working with. They ask a lot of questions, and develop insights that are specific to the practical limitations the team is managing. They don't just waltz in and start unloading a bunch of dogma.</p>
</li>
<li class="">
<p>It's interesting that <strong>when you see a healthy, high-performing team, agile methodologies are sort of invisible</strong>; they're there, but they sort of melt away into the background. The team is just focusing on the actual work: how to meet a customer's need, what hypotheses they should be testing, or thinking critically about business requirements in order to find an 80/20 solution.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-cult-of-test">The cult of test<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#the-cult-of-test" class="hash-link" aria-label="Direct link to The cult of test" title="Direct link to The cult of test" translate="no">​</a></h3>
<p>We had a few teams that drank the TDD kool-aid <strong>hard</strong>. One in particular had agreed as a team to take testing really seriously. They set about implementing extensive test suites, held book groups about <a href="https://en.wikipedia.org/wiki/Behavior-driven_development" target="_blank" rel="noopener noreferrer" class="">BDD</a>, and spent hours trying to compile Gherkin files so their business partners could define test cases (which they never actually did). They believed in the singular truth of the <a href="https://martinfowler.com/bliki/TestPyramid.html" target="_blank" rel="noopener noreferrer" class="">test pyramid</a>, and went big on unit tests, shunning higher-level approaches as lacking in virtue. They set up CI/CD to fail builds when code coverage dipped below 95%.</p>
<p>Things started to go sideways fast. Because their test suites were targeting low-level implementation details, every feature they implemented broke a gazillion tests. They were terrified to refactor anything (including the tests) because it would entail weeks of work and an avalanche of merge conflicts. They'd get nothing done, sprint after sprint. The result wasn't amazing quality, it was utter paralysis.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Deep Thoughts</div><div class="admonitionContent_BuS1"><p>There's definitely some lessons to learn here specifically about testing practices, but I think this is really a more general case of a failure to apply <strong>continuous improvement</strong>. The intervention wasn't to parachute in and tell them how to do testing better, it was to have them pause and reflect, re-focus on what they were trying to accomplish, and <strong>give them permission to try something new</strong>.</p></div></div>
<p>With some perspective and coaching, they started to re-think their testing strategy. They began to see that some risks were more important to mitigate than others, and some testing techniques gave you a lot more bang for your buck. They came to the conclusion that 95% on a code coverage report wasn't a business objective, since we were building a <em>freaking e-commerce site</em>, not pacemakers or cruise missiles. They gradually woke up from their fog, and came up with some much more clever testing strategies, such as testing the high-level interfaces and APIs, which were far more stable, and gave them the freedom to refactor implementation details.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="moderately-strong-teams-outperform-superstars-surrounded-by-a-meh-team">Moderately strong teams outperform superstars surrounded by a meh team<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#moderately-strong-teams-outperform-superstars-surrounded-by-a-meh-team" class="hash-link" aria-label="Direct link to Moderately strong teams outperform superstars surrounded by a meh team" title="Direct link to Moderately strong teams outperform superstars surrounded by a meh team" translate="no">​</a></h3>
<p>We had a few squads where 4-5 (out of 6) team members were solid "A-" or "B+" players. In contrast, there were other teams held together by one very strong "superstar" lead surrounded by "B"s and "C"s. It was strikingly obvious to everyone that the teams with the more homogenous, moderately strong members significantly outperformed the teams with the superstar.</p>
<p>My take on the underlying cause was that the superstar would get randomized trying to support the rest of the team, and didn't have enough time to do anything innovative. If they ever did manage to spend some time on something interesting, it was generally too advanced for the team to run with, and the momentum fizzled. The rest of the team became helpless and dependent on the superstar to make decisions.</p>
<p>Of course, there was occasionally the more enlightened superstar, who would spend their energy trying to elevate their lower-performing team members. The effectiveness of this was highly dependent on the latent potential of the rest of the team, and usually didn't work without management intervening to shore up the team with additional strong engineers to support the superstar.</p>
<p>In general, once the ratio of strong to weak members dipped below 2:1, the team veered into unhealthy territory. The members of strong teams will support and help each other, but there has to be balance between them; if it gets asymmetrical, it starts to drain everyone.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Blinding flash of the obvious</div><div class="admonitionContent_BuS1"><p>Who knew, you need good people to have a good team.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="strong-pos-are-critical">Strong POs are critical<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#strong-pos-are-critical" class="hash-link" aria-label="Direct link to Strong POs are critical" title="Direct link to Strong POs are critical" translate="no">​</a></h3>
<p>We had Product Owners deployed to each squad; usually a PO would cover 2-3 squads. The difference between a great PO and a bad one was <strong>very</strong> stark. POs ultimately decide what the teams work on, so it seems fairly obvious that they'd have an outsized influence on the team.</p>
<p>At the core of the role is having great instincts about customers, the product, and the problem space. But there are other key factors which are underappreciated in POs:</p>
<ul>
<li class="">Empathy and rapport with developers</li>
<li class="">A willingness and interest in understanding technical limitations and tradeoffs</li>
<li class="">An understanding that they're not only responsible for the customer experience, but the long-term health of the technology it's built on</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="looking-back">Looking back<a href="https://labaneilers.com/let-a-thousand-flowers-bloom#looking-back" class="hash-link" aria-label="Direct link to Looking back" title="Direct link to Looking back" translate="no">​</a></h2>
<p>I imagine for most people working there at this time, being the guinea pigs in a giant experiment might not have been the ideal work experience. Actually a lot of folks thrived, and did some amazing work. But some reeled from the unrelenting waves of change, and others understandably just threw in the towel.</p>
<p>For me, though, it was very different. I got to see all this from a bird's-eye view, but also on the ground, since I spent time with nearly every team. It was a bit like watching a whirlwind from inside- it gave me a sense of how seemingly small, well-meaning and thoughtful inputs can have huge, unintended effects that ripple across the whole system.</p>
<p>Like a lot of younger engineers, I used to occasionally express casual and flippant disregard for out-of-touch upper management. OK, admittedly I still might feel this way from time to time, but this experience left me with a very different understanding of what it takes to manage a large engineering organization, and gave me a dose of humility and appreciation for how challenging it is.</p>
<p>I'm incredibly thankful to Erin for taking me with her on this journey- I really couldn't have had more learning jam-packed into a few short years of my life.</p>]]></content:encoded>
            <category>leadership</category>
            <category>engineering-management</category>
            <category>storytime</category>
        </item>
        <item>
            <title><![CDATA[DevOps is a stew]]></title>
            <link>https://labaneilers.com/devops-is-a-stew</link>
            <guid>https://labaneilers.com/devops-is-a-stew</guid>
            <pubDate>Sat, 20 Apr 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[DevOps, Microservices, Cloud, Automation and Infrastructure as Code, Containers and orchestrators, Continuous Deployment, and Platform Engineering all need each other]]></description>
            <content:encoded><![CDATA[<figure class="blog-image"><a title="jeffreyw, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Irish_Stew_(10320713316).jpg"><img alt="Irish Stew (10320713316)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Irish_Stew_%2810320713316%29.jpg/256px-Irish_Stew_%2810320713316%29.jpg"></a></figure>
<p>When learning a new recipe, especially when dabbling in cuisine from different cultures, I find it really important to make sure one is really precise in their understanding of the words used in the recipe. I've had a few unfortunate misunderstandings that resulted in... gastronomic disaster.</p>
<p>Similarly, I find that I can't responsibly use the word "DevOps" without testing that the person I'm talking to knows which meaning I'm using. Here's some examples of what someone may think I mean when I say "DevOps":</p>
<!-- -->
<ul>
<li class="">The whole category of stuff that happens after you do a git commit, that magically makes your code turn into running software</li>
<li class="">A type of engineer that does the cloudy, opsy stuff</li>
<li class="">A style of operations where there's lots of automation and infrastructure as code</li>
<li class="">A culture where developers and operations people collaborate more tightly (vs the bad old days of "throw it over the wall")</li>
<li class="">A style of software development where the software engineers figure out how to deploy and operate everything themselves</li>
</ul>
<p>The thing I usually mean is similar to those last two, but here's a crisper version:</p>
<blockquote>
<p>An engineering management philosophy in which teams are responsible for operating the software they build, in order to create a virtuous feedback loop which incentivizes the team to make their software highly reliable and operable.</p>
</blockquote>
<p>I think this is the most useful meaning of the word, mostly because a big percentage of the other meanings are covered by preexisting words, like "operations", "CI", or "infrastructure as code". I mean, yeah- there's definitely a specific modern flavor of operations, but I find it more useful to use more specific words about those practices.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>Quick Rant</div><div class="admonitionContent_BuS1"><p>Some companies have a team called "DevOps". When I hear this, my eyebrows become raised, and I wonder to myself if somebody just decided to rename the "Operations" team.</p><p>You know, the team the developer teams can "throw it over the wall" to.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="th-devops-flavored-stew">The DevOps-flavored stew<a href="https://labaneilers.com/devops-is-a-stew#th-devops-flavored-stew" class="hash-link" aria-label="Direct link to The DevOps-flavored stew" title="Direct link to The DevOps-flavored stew" translate="no">​</a></h2>
<!-- -->
<img src="https://labaneilers.com/assets/images/stew-ingredients-4b2452688fcef2aa8dc345a93429c980.jpg" class="blog-image" alt="Stew ingredients">
<p>The thing I find really interesting about the engineering-management philosophy definition of "DevOps" is how interdependent it is on a whole bunch of other ingredients that coincided historically with it:</p>
<ul>
<li class="">Microservices</li>
<li class="">Cloud</li>
<li class="">Automation and Infrastructure as Code</li>
<li class="">Containers and orchestrators</li>
<li class="">Continuous Deployment</li>
<li class="">Shift to automated testing and observability vs manual QA</li>
<li class="">Platform Engineering</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="microservices">Microservices<a href="https://labaneilers.com/devops-is-a-stew#microservices" class="hash-link" aria-label="Direct link to Microservices" title="Direct link to Microservices" translate="no">​</a></h3>
<p>About 10 years ago, the company I was working for had outgrown our monolith, and we reluctantly started on the journey to microservices, and the journey was still ongoing when I left (about 6 years ago).</p>
<p>The thing that surprised us first was the sheer magnitude of the overhead of managing all the operational stuff that had been solved problems in the monolith. The first teams that started extracting their own services spent many weeks just trying to replicate a small fraction of what we had previously taken for granted: reliable builds, rolling deployments, centralized logs, metrics, alerting, feature flags and associated tooling, etc.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>Flashback</div><div class="admonitionContent_BuS1"><p>Sometimes looking back on this time, I think about just how <strong>adorable</strong> it was that we thought we were going to move to microservices but everything else was just going to continue working as-is. We were so cute.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cloud">Cloud<a href="https://labaneilers.com/devops-is-a-stew#cloud" class="hash-link" aria-label="Direct link to Cloud" title="Direct link to Cloud" translate="no">​</a></h3>
<p>At that point, we started toying with some public cloud (AWS), and found that there was a lot of excitement from teams using it. The fact that their infrastructure could be fully automated through reasonable APIs alleviated a whole bunch of the pain we were feeling trying to automate deployments to on-prem servers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="infrastructure-as-code">Infrastructure as code<a href="https://labaneilers.com/devops-is-a-stew#infrastructure-as-code" class="hash-link" aria-label="Direct link to Infrastructure as code" title="Direct link to Infrastructure as code" translate="no">​</a></h3>
<p>After building out some of this cloud automation with shell scripts, we quickly discovered that we needed some more powerful ways to manage infrastructure as code. I think at that point we played with CloudFormation and some early Terraform. We were still struggling though, caught between the low-level (infrastructure-as-a-service) nature of EC2, and the relatively immature platform-as-a-service offerings AWS had at the time. We made a little headway with tools like Spinnaker and Octopus, but deployments were still relatively slow and risky.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="containers-and-orchestrators">Containers and orchestrators<a href="https://labaneilers.com/devops-is-a-stew#containers-and-orchestrators" class="hash-link" aria-label="Direct link to Containers and orchestrators" title="Direct link to Containers and orchestrators" translate="no">​</a></h3>
<p>Around this time, Docker was making waves, and we started experimenting with it and early versions of (pre-EKS) Kubernetes and ECS. The speed and ease of deployments, relative to what we had been doing with hand-rolled automation of EC2 and autoscale groups, was game-changing. Suddenly, treating infrastructure as immutable felt natural, and deployments were cheap and fast.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="continuous-deployment">Continuous deployment<a href="https://labaneilers.com/devops-is-a-stew#continuous-deployment" class="hash-link" aria-label="Direct link to Continuous deployment" title="Direct link to Continuous deployment" translate="no">​</a></h3>
<p>The teams that had adopted containers, Kubernetes, and ECS quickly discovered the power of continuous deployment. While it was technically possible previously, deployments were slow enough that teams were still batching up changes and doing big-ish releases (maybe a couple times a week). Now, the opportunity presented itself to deploy any given feature the second it was ready.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="shift-to-automated-testing-and-observability-vs-manual-qa">Shift to automated testing and observability vs manual QA<a href="https://labaneilers.com/devops-is-a-stew#shift-to-automated-testing-and-observability-vs-manual-qa" class="hash-link" aria-label="Direct link to Shift to automated testing and observability vs manual QA" title="Direct link to Shift to automated testing and observability vs manual QA" translate="no">​</a></h3>
<p>As the braver teams started to actually practice continuous deployment, they found that there was an increase in the number of bugs that would remain undiscovered, sometimes for days. In retrospect, our culture had been too reliant on having a QA team, who was organized around doing manual regression tests on big batch releases. Teams began to re-discover the need for some essential ingredients of continuous deployment:</p>
<ul>
<li class="">Robust observability and alerting</li>
<li class="">Running automated regression testing in CI</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="platform-engineering">Platform Engineering<a href="https://labaneilers.com/devops-is-a-stew#platform-engineering" class="hash-link" aria-label="Direct link to Platform Engineering" title="Direct link to Platform Engineering" translate="no">​</a></h3>
<p>At this point there was a huge and increasing gap in maturity between teams who had invested significantly in their operational capabilities, and teams that hadn't. It was also clear that to get to even a baseline level of continuous deployment required months of investment from <strong>every</strong> team... and I don't think we'd even come to grips with the reality of maintenance on all that stuff.</p>
<p>It became obvious we needed to find a way to share the capabilities between teams, so we started experimenting with ways to reclaim some of the abilities we used to have with our monolith- but in a way that worked in a world of autonomous teams and distributed systems.</p>
<p>We quickly realized that some things were a no-brainer to centralize: running Kubernetes clusters, CI/CD, and observability infrastructure, in particular. We also started playing with integrating other opinions and best practices into tooling, and trying to find the balance between operational uniformity and developer freedom. At some point in the past few years, we started calling this "Platform Engineering".</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="flavoring-is-key">Flavoring is key<a href="https://labaneilers.com/devops-is-a-stew#flavoring-is-key" class="hash-link" aria-label="Direct link to Flavoring is key" title="Direct link to Flavoring is key" translate="no">​</a></h2>
<p>Looking back, it's actually hard for me to imagine, in any practical way, how any of these practices could exist on their own. I mean, you can gnaw on a potato, but it's hard to call that a meal without the rest of the ingredients.</p>
<p>Historically speaking, there's a few more tasty bits sprinkled into this stew:</p>
<ul>
<li class="">Servant leadership and <a href="https://www.youtube.com/watch?v=nzynH2BmoJM" target="_blank" rel="noopener noreferrer" class="">Intent-based leadership</a></li>
<li class="">The <a href="https://theleanstartup.com/" target="_blank" rel="noopener noreferrer" class="">Lean Startup</a> varietal of Agile</li>
</ul>
<p>These are really key to building a modern engineering culture where developers can flourish; you can't have a high-performing team without enlightened leadership.</p>
<p>I should note that I've tasted a version of this stew, but with these flavorings replaced with "command and control" and "waterfall". That stew tasted like garbage.</p>]]></content:encoded>
            <category>platform-engineering</category>
            <category>devops</category>
            <category>engineering-management</category>
        </item>
    </channel>
</rss>