Skip to main content

OpenTelemetry's secret weapon

· 10 min read
TL;DR

As OpenTelemetry's adoption has surged, it's drawn increasing criticism: it's complex, isn't fully matured, and its user experience can feel... unpolished. While these are valid gripes, I think we've hit an inflection point where OTel's benefits outweigh its pain points, especially when compared to the alternative of proprietary telemetry pipelines and lock-in with the dominant (and outrageously expensive) vendors.

In the next year or so, I think its benefits are going to increase dramatically due to its secret weapon: semantic conventions. These conventions allow any observability vendor to create the same rich, powerful, out-of-the-box user experiences that the dominant players had locked-down via their ownership of the entire telemetry pipeline.

A superhero holding a secret weapon

Now that OpenTelemetry has gained such significant traction, it's starting to attract a lot of attention beyond the hardcore observability community. While most of what I read about OTel is pretty positive, it also draws its fair share of shade.

Honestly, I get it. As a big user of OTel, I've spent plenty of hours rage debugging OTTL filters in OTel collectors, desperately searching for SDK examples that actually work, or pulling my hair out trying to figure out which version of a protobuf schema changed and broke telemetry from my Swift clients. And like a lot of the haters, I also get frustrated that the overall level of complexity in the specifications leaks into the implementation details of every part of the project.

All that said, I'm incredibly thankful that OTel exists. While it's still in its awkward teenage years, it's already changing the industry dramatically. Despite the challenges that currently exist, I think OpenTelemetry is the only reasonable path forward for observability.

So, like, I totally already know what OpenTelemetry is, but just give me a little refresher

OpenTelemetry is an umbrella project with the broad and audacious mission to provide a set of open tools and standards for building telemetry pipelines.

WTF is a telemetry pipeline?

A telemetry pipeline is a set of tools that create, collect, and process telemetry data before sending it to a backend (e.g. Prometheus, DynaTrace, Honeycomb, etc) for storage and analysis. This includes:

  • The libraries/agents that instrument your application, and the code that uses them to record events, take measurements, collect performance data, and emit it as metrics, logs, traces, etc.
  • Any intermediate tools that collect the data, process/filter it, and forward it to a backend (e.g. DogStatsD, LogStash, Fluent Bit, Telegraf, etc)

The before times

Before OTel, the tools available were mostly proprietary and vendor-specific. This has some big downsides:

  • Black boxes: Telemetry agents/libraries/SDKs (the thing you install into your app to collect metrics, logs, etc) were closed source, not modifiable or verifiable by users. If they didn't work exactly as you need, you could really only hope to persuade your vendor to add a feature for you.
  • Vendor lock-in: Since most telemetry agents are vendor-specific, switching between them can be very labor intensive, which gives vendors enormous leverage to develop predatory, rent-seeking pricing strategies

OK, so what actually is OpenTelemetry?

OpenTelemetry provides an alternative ecosystem, starting with some open specifications:

  • A standard telemetry protocol (OTLP) that defines the low-level shape of telemetry data (including traces, metrics, and logs) and how it gets transmitted
  • A lightweight, standardized API that can be used to add instrumentation to your code
  • SDKs that implement the API and configure your applications to export telemetry data
  • Semantic conventions that define the meaning of fields in telemetry data

And then, more practically for users, there's some specific software:

  • Language-specific SDKs: A set of (mostly language-specific) SDKs that implement the SDK and API specifications. This is what you'd install into your apps (as an alternative to a vendor-specific agent). You'd use the APIs to sprinkle manual instrumentation throughout your first-party code, and framework/library authors would use the API to add it to theirs.
  • Auto-instrumentation: A vast ecosystem of auto-instrumentation libraries and tools that automatically instrument all the most popular frameworks (e.g. Django, Express, Spring etc) and libraries (http, Postgres, Redis, Mongo, etc), using the API under the hood.
  • The OpenTelemetry Collector: A server application that's a sort of "universal translator" for telemetry data. It can accept telemetry in virtually any format you can imagine (e.g. statsd, Prometheus, Zipkin, InfluxDB, Loki, etc), apply arbitrary transformations, filters, and enrichments, and then export the data to an equally astonishing number of destinations formats (e.g. Jaeger, Prometheus, Splunk, Honeycomb, DataDog, Kafka, ClickHouse, etc).
  • The OTel ecosystem: A huge ecosystem of tooling (e.g. Kubernetes operators, helm charts, eBPF agents) that compose the SDKs, collector, protocols, etc, to create new capabilities.

And the part of OTel that astonishes me the most:

  • Vendor support: Basically every observability vendor you can think of now supports OTel.

Wait, why are all the vendors supporting OpenTelemetry?

The most incredible thing about OpenTelemetry is that it even exists at all, much less that it's supported by virtually all observability vendors. Why would they voluntarily back a project that promotes vendor-neutrality, and undermines their ability to lock-in customers?

Early on, there was certainly an advantage to smaller vendors adopting OTel as a competitive differentiator. What's surprising is how much pressure this put on the industry, which slowly moved up-market, like a growing tsunami, until even the most dominant vendors felt pressure to offer at least some support (or lip-service for support).

The actual threat of OpenTelemetry

Before OTel, the dominant vendors had for years enjoyed an advantage that wasn't particularly obvious: their end-to-end ownership of the telemetry pipeline gave them full understanding of the meaning of the data they collected.

If you've ever used one of the big tools, like DataDog or New Relic, you've probably experienced the customer experience benefits of this capability first hand:

  1. Install the vendor's proprietary agent into your apps (directly, or through some platform-wide integration)
  2. Open up the vendor's dashboard, and instantly see gorgeous dashboards with all kinds of pre-built visualizations and alerts, all out of the box. Revel in the insights and ease of use.
A few more steps

I forgot to mention a couple other steps:

  1. Get your first bill, freak out at the number of digits, and instruct your team to agonize over every instrumentation choice in a futile effort to manage costs
  2. As the contract renewal negotiation approaches, realize you're completely fucked now that you've fully integrated their proprietary telemetry pipeline

The obvious part is that OTel threatens this advantage by giving us the ability to create our own telemetry pipelines. But the real kicker, hidden in the list of stuff above that OTel does, is what I think the most powerful and underappreciated part of the project:

  • Semantic conventions that define the meaning of fields in telemetry data

Why is this item so important?

Semantic conventions are what made the dominant observability tools so damn useful

When you open that first dashboard in DataDog or NewRelic, you feel like you're standing in the Pentagon war room or something. There's so many graphs with data you've been dying to visualize- especially about your infrastructure- that were clearly very carefully curated. Your Kubernetes clusters are suddenly transparent and stripped of all their mystery. Within just a few seconds you can get a broad sense of the state of your system, and then drill down with a few clicks to find individual logs representing anomalies. The tooling gives you the ability to correlate data from different signals, allowing you to jump back and forth between aggregate performance graphs and individual exemplar pod data.

The dominant, legacy vendors have been able to create this user experience because they've utilized their end-to-end ownership of the telemetry pipeline to establish (proprietary) conventions about the meaning of the data they're collecting.

  • They scoop up all that rich Kubernetes and cloud provider metadata (pods, nodes, namespaces, EC2 instance type, region, AMI ID, etc) and use it to enrich metrics and log lines with labels
  • They collect performance data using specific conventions and units
  • They auto-instrument your applications using language-specific instrumentation agents that know how to instrument all the most popular libraries and frameworks, using consistent metadata across all of them
  • They track the lifecycle of requests through your system, and correlate signals using identifiers sent through as (e.g.) custom HTTP headers

By the time the telemetry data gets to their system, they already know an enormous amount about what it all actually means. They can then use this to create higher-level visualizations which understand all that Kubernetes and cloud provider metadata, supply canned alerts that handle a bunch of common Kubernetes failure modes, or build maps of services that show all the runtime relationships by using the correlation IDs they injected into your HTTP requests.

This is why I think of OpenTelemetry's semantic conventions as a secret weapon. Now all observability vendors have the ability to go beyond just being able to ingest data from an OTel-based pipeline; they can now add this kind of rich, powerful, out-of-the-box user experience powered by a deep understanding of the meaning of the data they're collecting.

And increasingly, they are!

It's not perfect, but it's already better than the alternative

So when I read posts criticizing OpenTelemetry for its complexity, or its lack of maturity, or claiming that it's designed by committee, or it's hindered by having to satisfy the lowest common denominator, I can simultaneously agree with all that, while still believing that it's still the best way forward.

Like most technical choices, this is fundamentally about tradeoffs, and I think OpenTelemetry has hit the inflection point where its benefits have begun to outweigh its pain points for many, many situations. I'm acutely aware that there's still a lot of friction in getting OTel set up, but you've got to compare it to the alternatives.

OpenTelemetry has finally become a workable enough solution that allows an organization, with enough effort, to develop a top shelf observability capability - without having to trade-in their firstborn child in exchange for, say, a few months of custom metrics.

And OTel is only going to get better. So many brilliant people are contributing, vendors are investing heavily, the ecosystem is growing, and adoption is surging.

An OTel anecdote that brings me joy

I'd like to leave you with one of my favorite moments in OpenTelemetry from the last year:

The OpenTelemetry community implemented a DataDog receiver for the OTel collector. When I say the community, I'm saying it specifically did NOT come from DataDog; it was driven by community members and competing vendors (e.g. Grafana Labs, Splunk). This receiver, along with some supporting processors, allows organizations that are stuck using DataDog's proprietary agents to redirect their telemetry data to any vendor that supports OTel.

Imagine how DataDog sales felt the first time this came up in a contract negotiation!