ACT NOW! LIMITED TIME OFFER! OPERATORS ARE STANDING BY! I’m ready for my next adventure as a DevRel advocate / Technical Evangelist / IT Talespinner. If that sounds like something you need, drop me a line via email or on LinkedIn.
I’ve specialized in monitoring and observability for 27 years now, and I’ve seen a lot of tools and techniques come and go (RMON, anyone?), and more than a few come and stay (rumors of the death of SNMP have been – and continue to be – greatly exaggerated). Lately I’ve been exploring one of the more recent improvements in the space – OpenTelemetry (which I’m abbreviating to “OTel” for the remainder of this blog). I wrote about my decision to dive into OTel recently.
For the most part, I’m enjoying the journey. But there’s a problem that has existed with observability for a while now, and it’s something OTel is not helping. The title of this post hints at the issue, but I want to be more explicit. Let’s start with some comparison shopping.
Before I piss off every vendor in town, I want to be clear that these are broad, rough, high level numbers. I’ve linked to the pricing pages if you want to check the details, and I acknowledge what you see below isn’t necessarily indicative of the price you might actually pay after getting a quote on a real production environment.
New Relic charges:
- 35¢ per GB for any data you send them…
- …although the pricing page doesn’t make this particularly clear.

Datadog has a veritable laundry list of options, but at a high level, they charge:
- $15-$34 per host
- 60¢-$1.22 per million NetFlow records
- $1.06-$3.75 per million log records
- $1.27-$3.75 per million spans
Dynatrace’s pricing page sports a list almost as long as Datadog’s.
Grafana – it must be noted – is open source and effectively gives you everything for free if you’re willing to do the heavy lifting of installing and hosting it yourself. But their pricing can be summed up as:
- $8.00 for 1k metrics (up to 1/minute)
- 50¢ per GB for logs and traces, with 30 days of retention
This list is by no means exhaustive. I’ve left off a lot of vendors, not because they lack consumption-based pricing, but because it would just be more of the same. Even for the vendors above, the details here aren’t complete. Some companies not only charge for consumption (ingest), they also charge to store the data, and charge again to query it (looking at you, New Relic). Some companies push you to pick a tier of service, and if you don’t, they’ll charge you an estimated rate based on the 99th percentile of your usage for the month (looking at you, Datadog).
It should surprise nobody that what appears on the pricing page isn’t even the final word. Some of these companies are, even now, looking at redefining their interpretation of “consumption-based pricing” in ways that might make things even more opaque (looking at you AGAIN, New Relic).
Even with all of that said, I’m going out on a limb and stating for the record that each and every one of those price points is so low that even the word “trivial” is too big.
That is, until the production workloads meet the pricing sheet. At that point those itty bitty numbers add up to real money, and quickly.
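To make “quickly” concrete, here’s a minimal back-of-envelope sketch in Python. The 35¢-per-GB rate is New Relic’s ingest price from the list above; the 500 GB/day volume is a number I invented purely for illustration, and the math ignores storage, query, per-host, and every other line item.

# Back-of-envelope ingest cost. Illustrative only: the rate comes from
# the pricing list above; the daily volume is an assumption of mine.
PRICE_PER_GB = 0.35      # USD per GB ingested
DAILY_INGEST_GB = 500    # hypothetical mid-sized production estate
DAYS_PER_MONTH = 30

monthly_cost = PRICE_PER_GB * DAILY_INGEST_GB * DAYS_PER_MONTH
print(f"Ingest alone: ${monthly_cost:,.2f}/month")  # Ingest alone: $5,250.00/month

Half a terabyte a day isn’t hard to reach once logs, traces, and flow records are all in the pipe – and that’s before anyone has queried a single byte.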
The Plural of Anecdote
I put this question out to some friends, asking if they had real-world sticker-shock experiences. As always, my friends did not disappoint.
“I did a detailed price comparison of New Relic with Datadog a couple years ago with Fargate as the main usage. New Relic was significantly cheaper until you started shipping logs and then Datadog was suddenly 30-40% cheaper even with apm. [But] their per host cost also factors in and makes APM rather unattractive unless you’re doing something serverless. We wanted to use it on kubernetes but it was so expensive, management refused to believe the costs with services on Fargate so I was usually showing my numbers every 2-3 months.” – Evelyn Osman, Head of Platform at enmacc
“All I got is the memory of the CFO’s face when he saw the bill.” – someone who prefers to remain anonymous, even though that quote is freaking epic.
And of course there’s the (now infamous, in observability circles) whodunit mystery of the $65 million Datadog bill.
The First Step is Admitting You Have a Problem
Once upon a time (by which I mean the early 2000s), the challenge with monitoring (observability wasn’t a term we used yet) was how to identify the data we needed, then get the systems to give up that data, and then store that data in a way that made it possible (let alone efficient) to use in queries, displays, alerts, and such.
That was where almost all the cost rested. The systems themselves were on-premises and, once the hardware was bought, effectively “free”. The result was that the accepted practice was to collect as much as possible and keep it forever. And despite the change in technology, many organizations’ reasoning has remained the same.
Grafana Solutions Architect Alec Isaacson points out that his conversations with customers sometimes go like this:
“I collect CDM metrics from my most critical systems every 5 seconds because once, a long time ago, someone got yelled at when the system was slow and the metrics didn’t tell them why.”
Today, collecting monitoring and observability data (“telemetry”) is comparatively easy, but – both as individuals and as organizations – we haven’t changed our framing of the problem. So we continue to grab every piece of data available to us. We instrument our code with every tag and span we can think of. If there’s a log message, we ship it. Hardware metrics? Better grab them, because they’ll provide context. If there’s network telemetry (NetFlow, VPC Flow Logs, streaming telemetry), we suck that up too.
But we never take the time to think about what we’re going to do with it. Ms. Osman’s experience illustrates the result:
“[They] had no idea what they were doing with monitoring […] all the instrumentation and logging was enabled, then there was lengthy retention ‘just in case’. So they were just burning ridiculous amounts of money.”
To connect it to another bad behavior that we’ve (more or less) broken ourselves of: back in the early days of the “lift and shift” (often more accurately described as “lift and shit”) move to the cloud, we not only moved applications wholesale, we moved them onto the biggest systems the platform offered. Why? Because in the old on-prem context you could only ask for a server once, and therefore you asked for the biggest thing you could get, in order to future-proof your investment. This turned out to be not only amusingly naive but horrifically expensive, and it took everyone a few years to understand how “elastic compute” worked and to retool their applications for the new paradigm.
Likewise, it’s high time we recognize and acknowledge that we cannot afford to collect every piece of telemetry data available to us – and moreover, that we don’t have a plan for that data even if money were no object.
Admit It: Your Problem Also Has a Problem
Let me pivot to OTel for a moment. One of the key reasons – possibly THE key reason – to move to it is to remove, forever and always, the pain of vendor lock-in. This is something I explored in my last blog post and was echoed recently by a friend of mine:
“OTel does solve a lot of the problems around ‘Oh great, now we’re trapped with vendor X and it’s going to cost us millions to refactor all this code,’ as opposed to ‘Oh, we’re switching vendors? Cool, let me just update my endpoint…’” – Matt Macdonald-Wallace, Solutions Architect, Grafana Labs
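To make “just update my endpoint” concrete, here’s roughly what that swap looks like with the OTel Python SDK. The endpoints and API-key header below are placeholders I made up; in many setups you wouldn’t even touch code, since the standard OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS environment variables do the same job.

# Sketch: switching trace destinations with the OTel Python SDK.
# The endpoints and API key below are invented placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Yesterday: spans went to Vendor A.
# exporter = OTLPSpanExporter(endpoint="https://otlp.vendor-a.example:4317")

# Today: spans go to Vendor B. None of the instrumented code changes.
exporter = OTLPSpanExporter(
    endpoint="https://otlp.vendor-b.example:4317",
    headers={"api-key": "REDACTED"},
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)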
To be very clear, OTel does an amazing job at solving this problem, which is incredible in its own right. BUT… there’s a downside to OTel that people don’t notice right away, if they notice it at all. That problem makes the previous problem even worse.
OTel takes all of your data (metrics, logs, traces, and the rest), collects it up, and sends it wherever you want it to go. But OTel doesn’t always do it EFFICIENTLY.
Example 1: Log Messages
Let’s take the log message below, which comes straight out of syslog. Yes, good old RFC 5424. Born in the ’80s, standardized in 2009, and the undisputed “Chatty Cathy” of network message protocols. I’ve seen modestly-sized networks generate upwards of 4 million syslog messages per hour. Most of it was absolutely useless drivel, mind you. But those messages had to go somewhere and be processed (or dropped) by some system along the way. It’s one of the reasons I’ve suggested a syslog and trap “filtration system” since basically forever.
Nitpicking about message volume aside, there’s value in some of those messages, to some IT practitioners, some of the time. And so we have to consider (and collect) them too.
<134>1 2018-12-13T14:17:40.000Z myserver myapp 10 - [http_method="GET"; http_uri="/example"; http_version="1.1"; http_status="200"; client_addr="127.0.0.1"; http_user_agent="my.service/1.0.0"] HTTP request processed successfully
As-is that log message is 228 bytes – barely even a drop in the bucket of telemetry you collect every minute, let alone every day. But for what I’m about to do, I want a real apples-to-apples comparison, so here’s what it would look like if I JSON-ified it:
{
  "pri": 134,
  "version": 1,
  "timestamp": "2018-12-13T14:17:40.000Z",
  "hostname": "myserver",
  "appname": "myapp",
  "procid": 10,
  "msgid": "-",
  "structuredData": {
    "http_method": "GET",
    "http_uri": "/example",
    "http_version": "1.1",
    "http_status": "200",
    "client_addr": "127.0.0.1",
    "http_user_agent": "my.service/1.0.0"
  },
  "message": "HTTP request processed successfully"
}
That bumps the payload up to 336 bytes without whitespace, or 415 bytes with. Now, for comparison, here’s a sample OTLP Log message:
{
  "resource": {
    "service.name": "myapp",
    "service.instance.id": "10",
    "host.name": "myserver"
  },
  "instrumentationLibrary": {
    "name": "myapp",
    "version": "1.0.0"
  },
  "severityText": "INFO",
  "timestamp": "2018-12-13T14:17:40.000Z",
  "body": {
    "text": "HTTP request processed successfully"
  },
  "attributes": {
    "http_method": "GET",
    "http_uri": "/example",
    "http_version": "1.1",
    "http_status": "200",
    "client_addr": "127.0.0.1",
    "http_user_agent": "my.service/1.0.0"
  }
}
That (generic, minimal) message weighs in at 420 bytes (without whitespace; it’s 520 bytes all-inclusive). It’s still tiny, but even so the OTel version with whitespace is 25% bigger than the JSON-ified message (with whitespace), and more than twice as large as the original log message.
Once we start applying real-world data, things balloon even more. My point here is this: If OTel does that to every log message, these tiny costs add up quickly.
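If you want to check those numbers against your own payloads, the measurement itself is trivial. Here’s a small Python sketch that re-serializes the three representations above and compares their compact (whitespace-free) sizes. Exact counts will drift by a few bytes depending on whitespace and field order, so treat the output as illustrative rather than authoritative.

import json

# The same event in its three forms, condensed from the examples above.
raw_syslog = (
    '<134>1 2018-12-13T14:17:40.000Z myserver myapp 10 - '
    '[http_method="GET"; http_uri="/example"; http_version="1.1"; '
    'http_status="200"; client_addr="127.0.0.1"; '
    'http_user_agent="my.service/1.0.0"] HTTP request processed successfully'
)

attrs = {
    "http_method": "GET", "http_uri": "/example", "http_version": "1.1",
    "http_status": "200", "client_addr": "127.0.0.1",
    "http_user_agent": "my.service/1.0.0",
}

jsonified = {
    "pri": 134, "version": 1, "timestamp": "2018-12-13T14:17:40.000Z",
    "hostname": "myserver", "appname": "myapp", "procid": 10, "msgid": "-",
    "structuredData": attrs, "message": "HTTP request processed successfully",
}

otlp_style = {
    "resource": {"service.name": "myapp", "service.instance.id": "10",
                 "host.name": "myserver"},
    "instrumentationLibrary": {"name": "myapp", "version": "1.0.0"},
    "severityText": "INFO", "timestamp": "2018-12-13T14:17:40.000Z",
    "body": {"text": "HTTP request processed successfully"},
    "attributes": attrs,
}

def size(payload):
    # Bytes of the compact (no extra whitespace) UTF-8 serialization.
    if isinstance(payload, str):
        return len(payload.encode("utf-8"))
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))

baseline = size(raw_syslog)
for name, payload in [("raw syslog", raw_syslog),
                      ("JSON-ified", jsonified),
                      ("OTLP-style", otlp_style)]:
    print(f"{name:<11} {size(payload):>4} bytes "
          f"({size(payload) / baseline:.2f}x the original)")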
Example 2: Prometheus
It turns out that modern methods of metric management are just as susceptible to inflation.
- A typical Prometheus metric, formatted in JSON, is 291 bytes.
- But that same metric converted to OTLP metrics format weighs in at 751 bytes.
It’s true that OTLP has a batching function that mitigates this, but batching only helps with transfer over the wire. Once the data arrives at the destination, most (though not all) vendors unbatch it before storing, so each metric goes back to being roughly 2.5x larger than the original. As my buddy Josh Biggley has said,
“2.5x metrics ingest better have a fucking amazing story to tell about context to justify that cost.”
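To see where the extra bytes come from, here’s a hedged sketch comparing a single counter sample in a flat JSON layout of my own devising against the same sample wrapped in a hand-built approximation of the OTLP/JSON resourceMetrics envelope. Neither payload is SDK output and the label set is invented, so the ratio it prints won’t match the 291- and 751-byte figures above exactly – the point is the per-datapoint envelope, not the precise multiplier.

import json

# One counter sample in a simple, flat JSON layout (field names are mine).
prom_style = {
    "name": "http_requests_total",
    "labels": {"method": "GET", "status": "200",
               "instance": "myserver", "job": "myapp"},
    "value": 1027,
    "timestamp_ms": 1544710660000,
}

# Roughly the same sample wrapped in an OTLP/JSON-style envelope. This is a
# hand-built approximation of the resourceMetrics structure, not SDK output.
otlp_style = {
    "resourceMetrics": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "myapp"}},
            {"key": "host.name", "value": {"stringValue": "myserver"}},
        ]},
        "scopeMetrics": [{
            "scope": {"name": "myapp", "version": "1.0.0"},
            "metrics": [{
                "name": "http_requests_total",
                "sum": {
                    "aggregationTemporality": 2,  # cumulative
                    "isMonotonic": True,
                    "dataPoints": [{
                        "attributes": [
                            {"key": "method", "value": {"stringValue": "GET"}},
                            {"key": "status", "value": {"stringValue": "200"}},
                        ],
                        "timeUnixNano": "1544710660000000000",
                        "asDouble": 1027,
                    }],
                },
            }],
        }],
    }],
}

def compact_size(doc):
    # Bytes of the whitespace-free JSON serialization.
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))

flat, wrapped = compact_size(prom_style), compact_size(otlp_style)
print(f"flat JSON sample: {flat} bytes")
print(f"OTLP-style JSON:  {wrapped} bytes ({wrapped / flat:.1f}x)")

Batching amortizes the resource and scope wrappers across many data points, which is why the gap looks smaller on the wire than it does once records are unbatched at the destination.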
It’s Not You, OTel, It’s Us. (But It’s Also You)
If this all feels a little hypercritical of OTel, please give me a chance to explain. I honestly believe that OTel is an amazing advancement, and anybody who’s serious about monitoring and observability needs to adopt it as a standard – that goes for users as well as vendors. The ability to emit the braid of logs, metrics, and traces while maintaining their context, regardless of destination, is invaluable.
(But…) OTel was designed by (and for) software engineers. It originated in that bygone era (by which I mean “2016”) when we were still more concerned about the difficulty of getting the data than the cost of moving, processing, and storing it. OTel is, by design, biased to volume.
The joke in this section’s title notwithstanding, the problem really isn’t OTel. We really are at fault – specifically, our unhealthy relationship with telemetry. If we insist on collecting and transmitting every single data point, we have nobody to blame but ourselves for the sky-high bills we receive at the end of the month.
Does This Data Bring You Joy?
It’s easy to let your observability solution do the heavy lifting and shunt every byte of data into a unified interface – especially if you’re a software engineer who (nominally, at least) owns the monitoring and observability solutions.
It’s even easier if you’re a mere consumer of those services, an innocent bystander. Folks who fall into this category include those closely tied to a particular silo (database, storage, network, etc.); helpdesk and NOC teams who receive the tickets and provide support but aren’t involved in either the instrumentation or the tools it feeds; and teams with more specialized needs that nevertheless overlap with monitoring and observability, like information security.
But let’s be honest: if you’re a security engineer, how can you justify paying twice the cost to ingest logs or metrics when perfectly good standards already exist and have served you well for years? Does that mean you might be using more than one tool? Yes. But as I have pointed out (time and time and time and time and time and time again), there is not (and never has been, and never will be) a one-size-fits-all solution. In most situations there’s not even a one-size-fits-MOST solution. Monitoring and observability have always been about heterogeneous implementations. The sooner you embrace that ideal, the sooner you will begin building observability ecosystems that serve the needs of you, your team, and your business.
To that end, there’s a serious ROI discussion to be had before you go all-in on OTel or any observability solution.
<EOF> (for now)
We’ve seen the marketplace move from per-seat (or per-interface, per-chassis, or per-CPU) pricing to a consumption model before. And we’ve also seen technologies move back (like the way cell service went from per-minute and per-text charges to unlimited plans with a flat monthly fee). I suspect we may see a similar pendulum swing with monitoring and observability at some point in the future. But for now, we have to contend both with the prevailing pricing model as it exists today and with our own compulsion – born at a different point in the history of monitoring – to collect, transmit, and store every bit (and byte) of telemetry that passes beneath our noses.
Of course, cost isn’t the only factor. Performance, risk, and more all need to be considered. But at the heart of it all is the very real need for us to start asking ourselves:
- What will I do with this data?
- Who will use it?
- How long do I need to store it?
And of course: who the hell is going to pay for it?