Are we ready for Observability 2.0?
Observability 2.0 is a vision of observability that seeks to replace the traditional "three pillars" of observability (metrics, logs, and traces) with a single source of truth: wide events.
This vision is compelling, but there are a number of obstacles that make it difficult to adopt in practice. We're now thinking about Observability 2.0 as a philosophy we can work towards gradually.

At SimpliSafe, we manage a pretty large system of microservices. Our customers entrust us to protect their homes and families, so we take the reliability of our systems seriously, and observability is a big part of that.
Like most companies today, our observability strategy is built around the "three pillars" approach: metrics, logs, and traces. Of the three, we're currently the most dissatisfied with our logging tooling, and have been working on finding a better product.
We're already a Honeycomb customer, and in our conversation with them, they ended up making a pretty interesting case that we should consider a new approach: ditch the three pillars and make traces the center of our strategy. We had up to this point been thinking very incrementally, and here was Honeycomb, coming in hot with a bold and revolutionary vision: Observability 2.0.
WTF is Observability 2.0?
The term "Observability 2.0" was coined by Charity Majors (the CTO and co-founder of Honeycomb) as a shorthand for a new vision of observability, defined in opposition to the "three pillars" model (metrics, logs, and traces).
If you're somehow one of the 5 technology people on the internet that isn't already aware of Charity's blog, you should really stop reading this drivel and head there immediately. She's a fountain of insight on technology, engineering management, observability, and many other topics.
Also, I love her writing style, and her use of profanity as a persuasive device is fucking delightful.
Here's the gist of Observability 2.0:
- Collect telemetry in the form of "wide events" (usually traces, but sometimes structured logs), which can contain an arbitrary number of key/value pairs with high-cardinality values
- Your observability tooling should allow you to query these events to answer arbitrarily complex questions about your system
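To make "wide event" concrete, here's a sketch of what a single one might look like for one HTTP request. Every field name here is invented for illustration- the point is just that one record carries the whole context of the request, usually as the attributes on a span or a structured log line:

```ts
// A hypothetical wide event for a single request: one record, lots of
// attributes, many of them high-cardinality. Field names are invented.
const wideEvent = {
  timestamp: "2024-05-01T12:34:56.789Z",
  service: "alarm-api",
  endpoint: "POST /v1/arm",
  status_code: 200,
  duration_ms: 187,
  // the high-cardinality fields you'd never put on a metric label:
  customer_id: "cus_8f3a91",
  device_id: "dev_4417b2c9",
  firmware_version: "3.12.7",
  build_sha: "a1b2c3d",
  trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
  feature_flags: ["new-arm-flow"],
};

console.log(JSON.stringify(wideEvent));
```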
In this vision, traditional metrics (i.e. timeseries emitted directly from your apps) are considerably less valuable, because, by definition, the individual events are stripped of their context when they're aggregated into timeseries. Since cardinality is the main driver of cost in a timeseries database, you have to choose which attributes you care about (ensuring they're low/moderate cardinality), and drop the rest. This means you also have to know what questions you'll want to ask of your system in advance of deploying the code.
With wide events, on the other hand, you keep all that rich, high-cardinality data, and then you get to slice and dice it in ways you couldn't have predicted you'd need in advance. You can use wide events to derive new metrics on the fly, and use the rich attributes to dig into specific anomalies in ways that are impractical with pre-aggregated timeseries data.
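Here's a toy version of that "slice and dice" step- roughly what an observability backend does for you at query time when you group by an attribute nobody planned for. The event shape follows the sketch above; a real tool does this in its query engine, not in your application code:

```ts
// Minimal shape of a wide event, matching the sketch above.
type WideEvent = {
  status_code: number;
  duration_ms: number;
  firmware_version: string;
};

// Derive an ad-hoc "metric": error rate grouped by an arbitrary attribute.
function errorRateBy(
  events: WideEvent[],
  key: keyof WideEvent
): Map<string, number> {
  const buckets = new Map<string, { errors: number; total: number }>();
  for (const e of events) {
    const group = String(e[key]);
    const b = buckets.get(group) ?? { errors: 0, total: 0 };
    b.total += 1;
    if (e.status_code >= 500) b.errors += 1;
    buckets.set(group, b);
  }
  return new Map([...buckets].map(([k, b]) => [k, b.errors / b.total]));
}

// e.g. errorRateBy(events, "firmware_version") can surface a bad firmware
// rollout even though nobody pre-declared firmware_version as a metric label.
```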
Furthermore, in Observability 2.0, logs become somewhat redundant with traces, too. Traces are an ideal superset of logs: structured events, but with additional conventions around how they represent units of work and their relationships (i.e. spans with durations, parents, siblings, and children).
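To see what "traces as a superset of logs" looks like in code, here's a minimal sketch using the OpenTelemetry JS API (it assumes a tracer provider is already configured elsewhere, and the span/attribute names are made up): span events play the role of log lines, while the spans themselves add the durations and parent/child structure.

```ts
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service");

async function handleCheckout(orderId: string) {
  // The parent span: a unit of work with a start time and a duration.
  await tracer.startActiveSpan("handle-checkout", async (span) => {
    span.setAttribute("order.id", orderId);

    // A child span, automatically parented to the active span above.
    await tracer.startActiveSpan("charge-card", async (child) => {
      // A span event is essentially a structured log line pinned to the span.
      child.addEvent("payment.attempted", { "payment.provider": "stripe" });
      child.end();
    });

    span.end();
  });
}
```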
This is a pretty compelling vision: one universal source of truth for all your observability data. You don't have to suffer the cost of 3 pillars when you can derive all the same value from wide events, and you get to simplify your whole practice around a smaller, more powerful set of tools.
Rock and roll! This sounds amazing! Sign me up!
The gap between vision and reality
To be fair, the idea of Observability 2.0 is something my colleagues at SimpliSafe and I have been thinking about for a while now. Charity has written and spoken a whole lot about it, and because of the elegance of the idea (and her fantastic writing), we already understood the core ideas and had experience with Honeycomb, so we were primed to consider something a bit more forward-thinking.
When Honeycomb came back to us with a concrete plan, we realized it was finally time to take off our incrementalist hats and start considering what it would really mean to go all in on traces as the source of truth.
It didn't take long for us to start enumerating concerns.
In defense of traditional metrics
I may have oversimplified the argument against metrics; Charity's actual position is a bit more nuanced. For instance, she still sees value in using metrics to monitor infrastructure (especially third-party infra). It's just that for the code you're writing yourself- the code at the core of your business that's changing constantly- metrics alone aren't sufficient to understand what's going on.
This is absolutely true. Metrics alone are insufficient to really understand the behavior of your system. You absolutely need high-cardinality data such as logs and traces to answer many types of questions.
But I would argue that traditional metrics are still necessary as a complement to wide events due to the huge and fundamental cost discrepancy between the two.
Let's talk about cost
The reason we still need metrics comes down to the enormous discrepancy in cost between metrics and wide events. Metrics are extraordinarily cheap- like a fraction of the cost of logs or traces- and the comparison gets more and more one-sided as you scale up.
- Timeseries metrics are incredibly compact and lightweight. Cost is driven by the number of timeseries you're producing: unique combinations of label/value pairs (e.g. pod, request_path, response_code). These pairs get stored once, and then the rest of the cost is basically just a number stored every interval (30 seconds or so). The tradeoff is that you lose the context of the individual events from which the aggregate measurement was derived.
- Wide events are fundamentally a more heavyweight signal than metrics, since each event is basically its own bag of string attributes. Vendors generally charge for each span/event (or by number of bytes) ingested. You also pay a cost at runtime: there's more data to generate, encode, buffer, and transmit.
And let's just be clear, if money were no object, of course I'd rather have full-fidelity wide events than metrics for everything!
I'd also prefer to dine exclusively in restaurants with Michelin stars, but I don't have the means to do that either.
Let's compare costs as you scale
As the frequency/volume of events increases, the overhead of wide events grows increasingly impractical in terms of computation, network, and storage... and ultimately dollars. This comes down to three fundamental reasons:
- metrics are aggregated into a single value over an interval, whereas wide events are not
- metric labels are stored only once per long-running series, unlike wide events, in which all attribute values are stored in full for every event
- wide events, by design, carry a lot more data (all the high-cardinality goodness you don't want on your metrics)
For example: imagine you have 10 pods which each handle 100 requests per second. You can instrument this with a single histogram metric, at the cost of a few active series (let's say 10 for the sake of argument) per pod. With traces, you're producing 100 spans per second, per pod (one per request).
- With metrics, you have 100 active timeseries. Using list pricing for Grafana Cloud of $8 for 1K series, this will cost 80 cents per month.
- With traces, you're producing 1000 spans per second. Using list pricing for Honeycomb of 100M events for $130, this will cost ~$3300 per month.
Now let's imagine you find a bottleneck, and optimize this code to be 10x faster. Now each of your 10 pods can handle 1000 requests per second. Awesome! Let's scale up our marketing spend and 10x traffic:
- With metrics, you'll still have 100 active timeseries, at 80 cents per month.
- With traces, you're now producing 10,000 spans per second, which will cost ~$33K per month.
I wasn't kidding, right? And this is only the cost from Honeycomb; you're also paying the cost in terms of runtime overhead for creating all those spans.
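If you want to sanity-check that math, here's the back-of-envelope calculation as a tiny script, using the same assumptions as above: 10 series per pod, 30-day months, and the list prices quoted (the small difference from the ~$3,300 figure above is just rounding, and real bills obviously depend on contracts, retention, and so on):

```ts
const pods = 10;
const requestsPerSecondPerPod = 100; // change to 1_000 for the 10x scenario
const seriesPerPod = 10;             // per the example above
const secondsPerMonth = 30 * 24 * 60 * 60;

// Metrics: cost scales with active series, not with traffic volume.
const activeSeries = pods * seriesPerPod;
const metricsDollars = (activeSeries / 1_000) * 8; // $8 per 1K series per month

// Traces: cost scales with the number of events ingested.
const spansPerMonth = pods * requestsPerSecondPerPod * secondsPerMonth;
const tracesDollars = (spansPerMonth / 100_000_000) * 130; // $130 per 100M events

console.log({ activeSeries, metricsDollars, spansPerMonth, tracesDollars });
// => { activeSeries: 100, metricsDollars: 0.8, spansPerMonth: 2592000000, tracesDollars: 3369.6 }
```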
There's now a number of new, ingest-time aggregation tools (e.g. Grafana Cloud's Adaptive Metrics or Chronosphere's aggregation rules) which can drastically reduce active series by aggregating away unneeded high-cardinality labels.
In the example above, if you only cared about the request duration percentiles across all pods (and not per pod), you could aggregate away the pod label, resulting in a single histogram for all 10 pods- at a cost of 8 cents per month regardless of how many pods you scale to.
We're using Grafana's Adaptive Metrics at SimpliSafe, and it's reduced our metrics bill by about 80%.
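The actual rule formats are vendor-specific, so what follows is only a conceptual sketch (not Grafana's or Chronosphere's real API) of what "aggregating away a label" means: series that differed only by the dropped label collapse into one.

```ts
// A series sample, identified by its label set.
type Sample = { labels: Record<string, string>; value: number };

// Drop one label and merge the series that become identical without it.
// Ingest-time aggregators do something like this (plus histogram-aware
// merging) before the data ever reaches long-term storage.
function aggregateAway(samples: Sample[], dropLabel: string): Sample[] {
  const merged = new Map<string, Sample>();
  for (const s of samples) {
    const { [dropLabel]: _dropped, ...kept } = s.labels;
    const key = JSON.stringify(Object.entries(kept).sort());
    const existing = merged.get(key);
    if (existing) {
      existing.value += s.value; // summing works for counters and histogram buckets
    } else {
      merged.set(key, { labels: kept, value: s.value });
    }
  }
  return [...merged.values()];
}

// Ten pods' worth of request-duration bucket series carrying a "pod" label
// collapse into a single set of series once "pod" is aggregated away.
```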
Having the ability to choose metrics vs wide events for a given use-case can give you a lot more flexibility to find a good balance between the richness of telemetry data and its cost.
Wait, so how is anybody using wide events?
You may think the cost scaling example above is contrived (which is true)- you'd probably use a lot more labels on your metrics in practice, which increases the number of series. But calculating the total cardinality you'd generate is complicated, because the interaction between metrics labels produces a sparse matrix; you don't pay for all possible combinations of label values, only the combinations that actually occur.
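Put differently, your active series count is the number of distinct label combinations you actually observe, not the Cartesian product of every possible value. Here's a quick sketch of how you'd estimate it from a sample of traffic (the labels are hypothetical):

```ts
type RequestLabels = { method: string; path: string; status: string; pod: string };

// Count the label combinations that actually occur in traffic. This number,
// not |methods| x |paths| x |statuses| x |pods|, is what drives your bill.
function countActiveSeries(requests: RequestLabels[]): number {
  const seen = new Set<string>();
  for (const r of requests) {
    seen.add(`${r.method}|${r.path}|${r.status}|${r.pod}`);
  }
  return seen.size;
}
```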
So let's talk about the real world, specifically telemetry at SimpliSafe. If we were to send all of our traces to Honeycomb, it would be an order of magnitude more expensive than what we're currently paying for all our timeseries metrics.
But wait, you ask: there are plenty of other companies happily using tracing- how are they affording it?
Yep- they're probably doing what we're doing with our traces: sampling.
Parents, have you talked to your kids about sampling traces?
Sampling is a whole topic in itself; the gist is that you randomly select one out of every N traces, drop the rest, and rely on statistical extrapolation when you run queries. The attributes necessary to do sampling are actually built into the OpenTelemetry spec.
There's some tooling available that can use these attributes to do tail sampling (we use Honeycomb's Refinery for this). Honeycomb also does a nice job of extrapolating sampled values at query time based on the sample ratio, so it feels pretty seamless to a user- sometimes you forget you're working with sampled spans.
We're currently sampling at a 30:1 ratio (keeping roughly 1 in 30 traces), which, at our volume, is usually sufficient to get accurate enough aggregate measurements.
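For the curious, here's roughly what the simplest in-process version of "keep 1 in 30 traces" looks like with the OpenTelemetry Node SDK. This is a hedged sketch- we actually do tail sampling in Refinery, which runs as a separate service, and the package/class names are as of recent OTEL JS releases:

```ts
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

// Keep ~1 in 30 traces. ParentBased makes child spans follow the root span's
// decision, so you get whole traces rather than random orphaned spans.
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(1 / 30),
  }),
});

provider.register();

// The query-time extrapolation (multiplying counts back up by the sample
// rate) happens in the backend; in our setup that's Honeycomb, using the
// sample rate that Refinery attaches to each event it forwards.
```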
I had dinner recently at an Observability community event, and met a nice guy in a similar position to mine. When I found out he was using Honeycomb too, I asked him what sampling ratio they were using.
After I asked, I suddenly felt awkward, like maybe that was too personal a question to ask someone I'd just met.

Sampling is complicated and error prone
There are a lot of moving parts required to make sampling work at scale. All it takes is one programmer in one service making a mistake with an attribute that controls sampling behavior, and you can end up with whole classes of missing traces- or, conversely, with sampling accidentally disabled and your bill blown up. We've had both of these happen, and in a couple of cases, we've had queries return very misleading results.
Just to be clear, these misleading results weren't Honeycomb's fault- it was errors on our side related to the subtleties of OTEL sampling configuration.
Luckily, in these cases, we had traditional metrics available that directly contradicted the wrong conclusions we were drawing from the traces, and saved us from making some pretty bad decisions.
So if you're doing sampling on your wide events, which you almost certainly are to manage costs, traditional timeseries metrics are extremely valuable for corroborating results from your queries against wide events.
Sampling breaks forensic use cases
The other big problem with sampling is that it makes it virtually impossible to rely on traces for forensic/diagnostic use cases. If you're trying to diagnose an issue for a specific customer, client device, or other specific request, the chances of having a trace available are slim.
Gut check: imagine trying to investigate a potential security breach with anything other than 100% of logs/traces. Who would put themselves in that position?
So for these forensic use cases, we end up needing our unsampled logs, and not traces.
The historical inertia of metrics
I'll finish up my defense of metrics with a couple other eminently practical reasons which make it hard to imagine giving up traditional metrics entirely:
- Metrics are boring, simple, tried and true. Virtually every mature industry uses metrics to drive their business and operations.
- Metrics are ubiquitous. Everything supports metrics. Most every tool in the cloud-native ecosystem exposes Prometheus metrics, and very few emit traces (though this is slowly changing). Cloud providers also expose telemetry primarily as metrics (e.g. CloudWatch, Azure Monitor, etc).
Logs vs traces
While Charity's Observability 2.0 vision of wide events technically includes structured logs, they generally play second fiddle to traces. Either way, it's worth a quick tangent to enumerate reasons why logs are still valuable:
- Logs are simple and ubiquitous. Everything writes logs. Nearly all OSS/third-party infra produces logs, and very little of it produces traces.
- The developer experience for logs is dead simple, and great by default. Use printf(), start your app from a terminal, and watch the logs flow through stdout. It's a beautiful thing.
- Compared to traces, logs are a much more natural way to model events which aren't tied to requests made across services (e.g. lifecycle events, background jobs, etc).
- Logs are mature and battle-tested. They've been used since the dawn of computing.
Here's a fun example: If you're using traces only, and are not saving logs, where would you look to diagnose an app crashing? OTEL instrumentation isn't going to politely wrap up the current span and flush its buffer when your app panics. You're just going to lose any record of the cause of the panic.
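You can claw back a little of this with a best-effort flush on the way down, but it only helps for errors you can actually intercept- an OOM kill, SIGKILL, or native crash still eats whatever spans were sitting in the buffer. A sketch, assuming `provider` is the NodeTracerProvider you created during OTEL setup:

```ts
import type { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";

// `provider` is whatever NodeTracerProvider your OTEL setup created.
declare const provider: NodeTracerProvider;

process.on("uncaughtException", async (err) => {
  console.error("fatal:", err); // the humble log line still wins here
  try {
    await provider.forceFlush(); // push any buffered spans to the exporter
    await provider.shutdown();
  } finally {
    process.exit(1);
  }
});
```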
Here are a few more examples we've experienced where the OTEL tracing ecosystem's immaturity has bitten us:
- With certain node.js Promise libraries, trace context can get lost across async function chains, leading to orphaned spans (see the sketch after this list)
- Long-running spans (more than a few seconds) are generally an unsolved problem- the behavior isn't well defined, and spans can get dropped or orphaned.
- It's not currently possible to configure sampling to capture related traces: single logical operations that involve multiple requests, async events, etc.
- Context propagation can get broken across a whole system by one service that's using an older OTEL SDK, listens via an esoteric protocol, or is just a little buggy
- Graceful shutdowns are a lot trickier with OTEL tracing, because you have to first drain your actual connections/queues, then ensure you're flushing all in-progress spans before exiting. Any mistakes: you guessed it, dropped spans.
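Here's what the manual fix for the first item above looks like: when a callback or promise chain escapes the SDK's automatic context tracking, you can re-bind the span to the context yourself with the OpenTelemetry context API- exactly the kind of plumbing you'd rather not have to know exists.

```ts
import { context, trace } from "@opentelemetry/api";

const tracer = trace.getTracer("worker");

// A callback-style job runner whose continuation would otherwise lose the
// active span (e.g. via a non-standard Promise implementation).
function processJob(job: { id: string }, done: () => void) {
  const span = tracer.startSpan("process-job", undefined, context.active());
  span.setAttribute("job.id", job.id);

  // Explicitly make `span` the active span for the duration of the callback,
  // so anything instrumented inside it becomes a child instead of an orphan.
  context.with(trace.setSpan(context.active(), span), () => {
    try {
      done();
    } finally {
      span.end();
    }
  });
}
```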
The path forward for Observability 2.0
I'm not writing this to dump on OpenTelemetry, tracing, or the overall vision of Observability 2.0. I'm still a huge fan of the idea of simplifying signals to minimize telemetry sprawl, and I've experienced the benefits of OTEL, tracing, and wide events firsthand.
Honeycomb is indeed a fucking pleasure to use- it's super fast, very powerful, and intuitive. Honeycomb's mere existence has prompted a virtual tidal wave of innovation across the observability space as competitors struggle to react to its power and elegance.
But despite how seductive the vision of Observability 2.0 is, when it came time to make a bold decision to go all in... we equivocated, put our incrementalist hats back on, and decided to continue with the three pillars for the time being. Observability 2.0 is just a bit too cutting edge at the moment.
The obstacles that prevented us from saying "yes" today may be solvable in time. I hear there are companies that have already gone all in on Observability 2.0, and I've got to believe it's possible.
To close the gap for a company like SimpliSafe, here are the problems we'd need solved:
Affordability
Unless we can get the cost of wide events down, sampling will continue to be necessary, and sampling undermines the idea that we can discard the three-pillars model:
- Sampling traces makes them useless for diagnostic and forensic use cases, requiring you to retain logs
- The complexity of sampling traces makes it more important to retain traditional metrics to corroborate results
This is the catch-22: sampling and Observability 2.0 are fundamentally incompatible.
I think that in our three-pillars world, sampling traces is actually a great tradeoff. You get most of the value, and you pay a fraction of the cost.
But I'm only OK with this tradeoff because I have metrics and logs to cover the gaps.
I'm seeing some pretty exciting things happening to make logs/traces more cost-efficient, such as products built on OSS columnar databases like ClickHouse, as well as other tracing projects like Grafana Tempo and Quickwit.
And I'm certainly not going to count out Honeycomb- knowing them, they'll continue to chip away at inefficiencies and find more ways to make their product more and more affordable, especially as they gain greater economies of scale.
Developer experience
The developer tools available for tracing are legitimately a lot more complex than those available for logs. We'll need ubiquitous tracing developer tools that approach the ease of use of printf() debugging, and that make it just as easy to validate that your tracing instrumentation is working the way you intended.
The developer experience for OTEL SDK configuration is still... a work in progress. I hope to see more projects that package up the OTEL SDKs as a "distribution" to provide an installation experience that's as simple as vendor agents like NewRelic/DataDog's. Let's just admit that 200 lines of boilerplate before you can send a single span is a bit much.
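To be fair, the minimal happy path has gotten shorter than it used to be- roughly the sketch below, using the Node SDK and auto-instrumentation (package names as of recent OTEL JS releases; the endpoint and service name are placeholders). The line count balloons once you add resources, propagators, samplers, metrics, and log correlation, and every language's SDK is a little different.

```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "alarm-api", // placeholder
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces", // placeholder collector endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush in-flight spans on graceful shutdown.
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```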
Documentation is also a big opportunity for improvement. I found it impossible to get an SDK properly configured without digging through the source code. We need a lot more examples, and more focus on use-cases.
And please, for the love of god, let's not add another layer of YAML to "simplify" configuration.
Instead of just complaining about OTEL SDK configuration and docs, I'm going to shut up and start contributing.
OTEL is amazing, thank you to everyone who's worked on the project. We appreciate you!
Reliability and maturity
Beyond installation and configuration, the OTEL tracing ecosystem needs some time to ripen: correct behavior (e.g. graceful shutdowns, context propagation, etc) needs to get easier to achieve, edge cases (crashes/panics) need better handling, and we need standards for things like long-running spans, related traces, and unfinished spans.
Ecosystem and adoption
As long as most infrastructure and third-party tools are emitting only logs and metrics, it's going to be hard to go all in on tracing for first-party telemetry. It's possible that OpenTelemetry's trajectory will continue, and more and more OSS tools will start emitting traces, but there's a lot of ground to cover to hit critical mass.
As a litmus test: an Observability 2.0 product would need some sort of drop-in Kubernetes infrastructure monitoring solution similar to what you get out-of-the-box with DataDog/NewRelic, or at least something comparable to the Prometheus Operator- and those solutions are currently almost entirely driven by metrics.
Being comfortable with incrementalism
Throughout my career, I've generally been biased towards incrementalism over revolutionary changes. I'm pretty stingy about my innovation tokens, and I tend to want to save them for things that will drive our core business strategy. Observability is something I want to keep boring and predictable.
Perhaps Observability 2.0 isn't a purist standard we should aim to achieve in black and white terms. Rather, it's a philosophy we can apply incrementally, in which we gradually get more and more value from our observability spend.
Right now, we're thinking about how to better use the three pillars we have:
- Seek out tools that integrate and visualize data from across multiple pillars (e.g. using exemplars to link metrics to traces- see the sketch after this list)
- Look for ways to reduce the waste of overlapping, duplicate pillars
- Try to utilize each pillar for its strengths, with tooling that uses them to complement each other
- Look for more sustainable ways to manage costs
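Here's the kind of thing the first item above means in practice- a hedged sketch (not our production setup) that stamps the active trace ID onto every structured log line, so you can jump from a log straight to the corresponding trace, much like exemplars let you jump from a metric to a trace:

```ts
import { trace } from "@opentelemetry/api";

// Wrap your structured logger so every line carries the active trace/span IDs.
function logWithTraceContext(message: string, fields: Record<string, unknown> = {}) {
  const spanContext = trace.getActiveSpan()?.spanContext();
  console.log(
    JSON.stringify({
      message,
      ...fields,
      trace_id: spanContext?.traceId,
      span_id: spanContext?.spanId,
    })
  );
}

// logWithTraceContext("arming system", { device_id: "dev_4417b2c9" });
```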
Hopefully in a few years, the ecosystem will have evolved around the Observability 2.0 vision, and we'll be in a position to be a bit braver about our next steps.