LIMITED TIME OFFER! I’m ready for my next adventure as a DevRel advocate/Technical Evangelist/IT Talespinner. If that sounds like something you need, drop me a line by email or on LinkedIn.
My blog on pricing from the other day caught the attention of the folks over at MetricFire, and we struck up a conversation about some of the ideas, ideals, and challenges swirling around monitoring, observability, and its place in the broader IT landscape.
At one point, JJ, the lead engineer, asked, “You blogged about gearing up to get a certification in OpenTelemetry. What is it about OTel that has you so excited?”
I gave a quick answer, but JJ’s question got me thinking, and I wanted to put some of those ideas down here.
OTel Is the Best Thing Since…
Let me start by answering JJ’s question directly: I find OpenTelemetry exciting because it’s the biggest change in the way monitoring and observability are done since Traces (which came out around 2000, but weren’t widely used until 2010-ish).
And Traces were the biggest change since… ever. Let me explain.
See this picture? This was what it was like to use monitoring to understand your environment back when I started almost 30 years ago. What we wanted was to know what was happening in that boat. But that was never an option.
We could scrape together metrics from network and OS commands, and we could build scripts and db queries that gave us a little more insight. We could collect and (with a lot of work) aggregate log messages to spot trends across multiple systems. All of that gave us an idea of how the infrastructure was running and let us infer what might be happening topside. But we never really knew.
Tracing changed all that. All of a sudden we could get hard data (and get it in real-time) about what users were doing, and what was happening in the application when they did it.
It was a complete sea change (pun intended) for how we worked and what we monitored. Even so, tracing didn’t remove the need for metrics and logs. Together they became the famous (or infamous) “three pillars” of observability.
Recently, I started working through the book “Learning OpenTelemetry,” and one of the comments that struck me was that these aren’t really “three pillars” at all, because they don’t combine to hold up a unified whole. Authors Ted Young and Austin Parker re-framed the combination of Metrics, Logs, and Traces as “the three browser tabs of observability,” because many tools put the effort back on the user to flip between screens and put it all together by sight.
On the other hand, OTel outputs can present all 3 streams of data as a single “braid.”
From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O’Reilly Media, Inc. Used with permission.
It should be noted that despite OTel’s ability to combine and correlate this information, the authors of the book point out later that many tools still lack the ability to present it that way.
Despite it being a work in progress (but what, in the world of IT, isn’t?), I still feel that OTel has already proven its potential to change the face of monitoring and observability.
OTel is the Esperanto of Monitoring
Almost every vendor will jump at the chance to get you to send all your data to them. They insist that theirs is the One True Observability Tool.
In fact, let’s get this out in the open: There simply isn’t a singular “best” monitoring tool out there any more than there’s one singular “best” programming language, car model, or pizza style.* There isn’t a single tool that will cover 100% of your needs in every single use case.
And for the larger tools, even the use cases that aren’t part of their absolute sweet spot are going to cost you (in terms of hours or dollars) to get right.
So, you’re going to have multiple tools. It goes without saying (or at least it should) that you’re not going to ship a full copy of all your data to multiple vendors. Therefore, a big part of your work as a monitoring engineer (or team of engineers) is to map your telemetry to the use cases it supports, and thus to the tools you need to employ for those use cases.
That’s not actually a hard problem. Sure, it’s complex, but once you have the mapping, making it happen is relatively easy. But, as I like to say, it’s not the cost to buy the puppy that is the problem, it’s the cost to keep feeding it.
Because the tools you have today are going to change down the road. That’s when things get CRAZY hard. You have to hope things are documented well enough to understand all those telemetry-to-use-case mappings.
(Narrator: they will not, in fact, have it documented well enough.)
Then you also have to hope your instrumentation is documented and understood well enough to know how to de-couple from tool x and instrument for tool y in a way that maintains the same capabilities.
(Narrator: this is not how it will go down.)
But OTel solves both the “buying the puppy” and the “feeding the puppy” problems. My friend Matt Macdonald-Wallace (Solutions Architect at Grafana) put it like this:
OTEL does solve a lot of the problems around ‘Oh great! now we’re trapped with vendor x and it’s going to cost us millions to refactor all this code’ as opposed to ‘Oh, we’re switching vendors? Cool, let me just update my endpoint…’
Not only that, but OTel’s ability to create pipelines (for those who are not up to speed on that concept, it’s the ability to identify, filter, sample, and transform a stream of data before sending it to a specific destination) means you can selectively send the same data stream to multiple locations. Your security team can get their raw, unfiltered syslog while it’s still on-premises, while some of the data – traces, logs, and/or metrics – goes to one or more vendors.
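To make that more concrete, here’s a minimal sketch of the fan-out idea using the Python SDK. The filtering and sampling the security team cares about would normally live in a collector pipeline (configured separately); the service name and collector endpoint below are placeholders I made up for illustration:

```python
# A sketch, not production config: one stream of spans, two destinations.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout"})  # hypothetical service
)

# Destination 1: an on-prem collector (hypothetical endpoint) that can filter,
# sample, and transform the data before any of it leaves the building.
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector.internal:4317")
    )
)

# Destination 2: stdout, standing in here for "a second tool or vendor."
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("place-order"):
    pass  # this one span is delivered to both destinations
```

And switching vendors really is mostly a matter of changing that endpoint (or the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable); the instrumentation code itself doesn’t change.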
Which is why I say: OTel is the Esperanto of observability.
OTel’s Secret Sauce Isn’t OTLP
…it’s standardization.
Before I explain why the real benefit of OTel is not OTLP, I should take a second to explain what OTLP is:
If you look up “What is the OpenTelemetry Protocol (OTLP)?” you’ll probably find some variation of “…a set of standards, rules, and/or conventions that specify how OTel elements send data from the thing that created it to a destination.” This is technically true, but also not very helpful.
Functionally, OTLP is the magic box that takes metrics, logs, or traces and sends them where they need to go. It’s not as low level as, say, TCP, but in terms of how it changes a monitoring engineer’s day, it may as well be. We don’t use OTLP so much as we indicate it should be used.
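Here’s what I mean, as a small sketch in Python (the tracer name and function are made up): the instrumentation code never speaks OTLP itself. OTLP only enters the picture when an OTLP exporter gets wired in, either in code or via environment variables like OTEL_EXPORTER_OTLP_ENDPOINT.

```python
from opentelemetry import trace

# Nothing in this code knows (or cares) that OTLP exists.
tracer = trace.get_tracer("billing")  # hypothetical instrumentation scope

def charge_card(order_id: str) -> None:
    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("order.id", order_id)
        # ... do the actual work here ...
        # How this span leaves the process is decided by whatever exporter
        # (OTLP or otherwise) was configured when the SDK was set up.
```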
Just to be clear, OTLP is amazingly cool and important. It’s just not (in my opinion) AS important as some other aspects.
No, there are (at least) two things that, in my opinion, make OTel such an evolutionary shift in monitoring:
Collectors
First, it standardizes the model of a 3-tier architecture with a collector (not an agent) in the middle. For us old-timers in the monitoring space, the idea of a collector is nothing new. In the bygone era of everything-on-prem, you couldn’t get away with a thousand (or even a hundred) agents all talking to some remote destination. The shift to cloud architecture made that possible, but it’s still not the best idea.
Having a single load-balanced system (or a small number of them) take in all the data from multiple targets – with the added benefit of being able to process that data (filtering, sampling, combining, etc.) before sending it forward – is not just A Good Idea™. It can have a direct impact on your bottom line, because you only send the data you WANT (and in the form you want it) out the egress port that racks up such a big part of your monthly bill.
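The collector itself is configured through its own pipeline definitions rather than in application code, but the “only send what you want” idea shows up even at the SDK level. A minimal sketch, with an arbitrary 10% sampling ratio I picked for illustration:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of traces at the source. A collector in the middle can make
# smarter, centralized decisions (tail sampling, filtering, aggregation) before
# anything crosses that expensive egress boundary.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```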
Semantics
Look, I’ll be the first to tell you that I’m not the world’s best developer. So, the issue of semantic terminology doesn’t usually keep me up at night. What DOES keep me up is the inability to get at a piece of data that I know should be there, but isn’t.
What I mean is that it’s fairly common for the same data point – say, bandwidth – to be referred to by a completely different name, in a completely different location, on devices from two different vendors. And maybe that doesn’t seem so weird.
But how about the same data point being different on two different types of devices from the same vendor? Still not weird?
Now let’s talk about the same data point being different on the same device type from the same vendor, but across two different models. Getting weird, right? (Not to mention annoying.)
But the real kicker is when the same data point is different on two different parts of the same DEVICE.
Once you’ve run down that particular rabbit hole, you have a whole different appreciation for semantic naming. If I’m looking for CPU or bandwidth or latency or whatever, I would really REALLY like for it to be called the same thing and found in the same semantic location.
OTel does this, and does it as a core aspect of the platform. I’m not the only one to have noticed it, either.
Several years ago, during a meeting between the maintainers of Prometheus and OpenTelemetry, an unnamed Prometheus maintainer quipped, “You know, I’m not sure about the rest of this, but these semantic conventions are the most valuable thing I’ve seen in a while.” It may sound a bit silly, but it’s also true.
From Learning OpenTelemetry, by Ted Young and Austin Parker, Copyright © 2024. Published by O’Reilly Media, Inc. Used with permission.
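To make that concrete, here’s a sketch of what the consistency looks like in practice, using attribute names from the OTel semantic conventions for HTTP (the exact names are versioned in the semconv spec, so check the current release; the span itself is made up):

```python
from opentelemetry import trace

tracer = trace.get_tracer("storefront")  # hypothetical instrumentation scope

# No matter whose SDK, service, or backend is involved, an HTTP request is
# described with the same semantic-convention attribute names.
with tracer.start_as_current_span("GET /cart") as span:
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.response.status_code", 200)
    span.set_attribute("server.address", "shop.example.com")
    span.set_attribute("url.path", "/cart")
```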
Summarizing the Data
I’ll admit that OpenTelemetry is still very (VERY) shiny for me.
But I’ll also admit that the more I dig into it, the more I find to like about it. Hopefully, this blog has given you some reasons to check out OTel, too.
* OK, I lied. 1) Perl 2) The 1967 Ford Mustang 390 GT/A and 3) deep dish from Tel Aviv Kosher Pizza in Chicago