
4 posts tagged with "opentelemetry"


OpenTelemetry's secret weapon

· 10 min read
TL;DR

As OpenTelemetry's adoption has surged, it's drawn increasing criticism: it's complex, it isn't fully mature, and its user experience can feel... unpolished. While these are valid gripes, I think we've hit an inflection point where OTel's benefits outweigh its pain points, especially when compared to the alternative of proprietary telemetry pipelines and lock-in with the dominant (and outrageously expensive) vendors.

In the next year or so, I think its benefits are going to increase dramatically due to its secret weapon: semantic conventions. These conventions allow any observability vendor to create the same rich, powerful, out-of-the-box user experiences that the dominant players had locked down via their ownership of the entire telemetry pipeline.
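To make that concrete, here's a minimal sketch using the Python SDK (the service name, route, and attribute values are hypothetical, not from the post). Because the attribute keys come from the shared semantic conventions rather than from any vendor's agent, any OTel-compatible backend can recognize this span as an HTTP server request and build the same latency and error views on top of it.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a basic tracer that prints spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("GET /api/cart") as span:
    # These attribute names come from the OTel HTTP semantic conventions,
    # not from any particular vendor, so every backend can interpret them
    # the same way.
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.route", "/api/cart")
    span.set_attribute("http.response.status_code", 200)
```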

A superhero holding a secret weapon

Now that OpenTelemetry has gained such significant traction, it's starting to attract a lot of attention beyond the hardcore observability community. While most of what I read about OTel is pretty positive, it also draws its fair share of shade.

Honestly, I get it. As a big user of OTel, I've spent plenty of hours rage debugging OTTL filters in OTel collectors, desperately searching for SDK examples that actually work, or pulling my hair out trying to figure out which version of a protobuf schema changed and broke telemetry from my Swift clients. And like a lot of the haters, I also get frustrated that the overall level of complexity in the specifications leaks into the implementation details of every part of the project.

All that said, I'm incredibly thankful that OTel exists. While it's still in its awkward teenage years, it's already changing the industry dramatically. Despite the challenges that currently exist, I think OpenTelemetry is the only reasonable path forward for observability.

Are we ready for Observability 2.0?

· 18 min read
TL;DR

Observability 2.0 is a vision of observability that seeks to replace the traditional "three pillars" of observability (metrics, logs, and traces) with a single source of truth: wide events.

This vision is compelling, but there are a number of obstacles that make it difficult to adopt in practice. We're now thinking about Observability 2.0 as a philosophy we can work towards gradually.
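As a rough illustration (the field names here are mine, not from the post), a wide event is a single, context-rich record emitted per unit of work: one place to put everything you might later want to query, instead of scattering it across metrics, logs, and traces.

```python
# One wide event per request, instead of separate metrics, logs, and traces.
wide_event = {
    "timestamp": "2024-05-01T12:34:56Z",
    "service.name": "checkout-api",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "http.request.method": "POST",
    "http.route": "/v1/orders/{id}",
    "http.response.status_code": 200,
    "duration_ms": 84.2,
    "customer.id": "cust_12345",
    "feature_flags": ["new-checkout-flow"],
    "db.query_count": 3,
    "error": None,
}
```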

A bee looking through a telescope

At SimpliSafe, we manage a pretty large system of microservices. Because our customers entrust us to protect their homes and families, we take the reliability of our systems seriously, which makes observability especially important to us.

Like most companies today, our observability strategy is built around the "three pillars" approach: metrics, logs, and traces. Of the three, we're currently the most dissatisfied with our logging tooling, and have been working on finding a better product.

We're already a Honeycomb customer, and in our conversation with them, they ended up making a pretty interesting case that we should consider a new approach: ditch the three pillars and make traces the center of our strategy. We had up to this point been thinking very incrementally, and here was Honeycomb, coming in hot with a bold and revolutionary vision: Observability 2.0.

Prometheus vendor death match

· 13 min read
TL;DR

We evaluated a number of observability vendors, with a focus on metrics, and did detailed PoCs with both Chronosphere and Grafana Cloud. Both are excellent products, and have slightly different strengths.

Death match

At work, we're in the process of rebuilding our metrics pipeline, as we've outgrown our old self-managed TIG (Telegraf, InfluxDB, Grafana) solution. We've had this solution in place for many years, and it's served us well. Especially given the increasingly predatory pricing models of observability vendors, it's been extraordinarily cost-effective.

But over the last couple years, as we've grown, we've started to hit the limits of what we can handle with a single, vertically scaled instance of InfluxDB (especially using InfluxDB v1). It was increasingly stressful to keep it running smoothly, and we had to be very vigilant about cardinality, as it's very easy to accidentally introduce a cardinality explosion that can bring down the entire database.

Fun with OTEL collectors and metrics

· 6 min read
OpenTelemetry Logo

As part of an evaluation of Prometheus-compatible monitoring solutions, I needed to push our use of the OTEL Collector to handle use cases like creating metric allowlists, renaming metrics, and adding or modifying labels.

Here are some examples, based on what I learned, of the crazy and powerful things you can do with OTEL collector processors to manipulate metrics.
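To give a flavor of what that looks like, here's a sketch of a collector processors fragment (not the post's actual configs; the processor names and metric names are illustrative) that allowlists metrics, renames one, and upserts a label:

```yaml
processors:
  # Drop everything except an explicit allowlist of metrics.
  filter/allowlist:
    metrics:
      include:
        match_type: strict
        metric_names:
          - http.server.request.duration
          - process.cpu.time

  # Rename a metric as it passes through the pipeline.
  metricstransform/rename:
    transforms:
      - include: system.cpu.utilization
        action: update
        new_name: cpu_utilization

  # Add (or overwrite) a label on every data point.
  attributes/add-env:
    actions:
      - key: deployment.environment
        value: production
        action: upsert
```

These processors would then be listed, in order, under a metrics pipeline in the collector's `service:` section.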