Recently, I made a comment about the idea of there being a “best” monitoring tool:
In fact, let’s get this out in the open: There simply isn’t a singular “best” monitoring tool out there any more than there’s one singular “best” programming language, or car model, or pizza style.* There isn’t a single tool which will cover 100% of your needs in every single use case.
The comment got some pushback, both privately and in a few forums, so I wanted to dig into what I meant and why I felt that way. But before I do that, I want to set the record straight: I stand by what I said about there being no “best” monitoring tool. But I was flat-out lying about the other stuff. The hills I’m willing to die on are:
- The best programming language is Perl
- The best car is the 1967 Ford Mustang 390 GT/A
- The best pizza style is deep dish, and I’m partial to getting it from Tel Aviv Kosher Pizza in Chicago
With that cleared up, let’s get back to monitoring and observability.
Zoom Zoom
What really got me thinking about the false concept of a “best” monitoring tool was a video my son shared with me, comparing a Lucid Air Sapphire, a Bugatti Chiron, and a Tesla Plaid to see which was the fastest production car of all time.
Disclaimer: I am NOT a car person. My son Kaleb (who is 21, in his 2nd year of university to become a mechanical engineer, and on two different Baja SAE teams) very much is.
After watching the video, Kaleb pointed out that if the race (which was on a quarter-mile track) had been a half-mile instead, the Bugatti would have won, hands down. This was because of… reasons. (I honestly couldn’t follow what he was saying at this point. I leave it to the reader to imagine Star-Trek-like technobabble.)
My point in all this is that “best” – the fastest car in this case – is highly subject to other variables. The type of track (this was on a track that had been pre-treated with VHT, which makes it super sticky and affects traction), the distance, and even things like altitude and weather – these all can impact the ultimate outcome.
But I’m talking about more than external factors. What is “best” can be affected by the ultimate use case. Cost is the easiest one that comes to mind. Yes, a Bugatti might be the fastest. But possibly the “best” car is a Honda Civic, because you value cost and reliability over speed. Or perhaps a Kia Sedona is “best” because you need more seats. Or a Ford F150. Or a Ryder 18-wheeler.
This all relates to monitoring and observability in ways that are both important and, sadly, novel to a lot of IT practitioners. We get so caught up in the “speeds and feeds” aspect of tools and solutions – how many flows per second, how many traces per collector, maximum ingest before backpressure occurs – that we often fail to stop and say, “Do I need that? Will I ever need that?”
I shared this with my friend Kevin Sparenberg (another car guy, who writes occasionally on his blog: blog.kmsigma.com). He added:
Monitoring tools are like any other tool. You have your favorite hammer, but it’s not appropriate in every scenario. You don’t (or shouldn’t) use a sledge for setting a nail, nor would you use a claw hammer for forging. You might have a favorite, but you need to consider the type of work you need accomplished.
(Besides being a car guy, he’s a home-repair weekend warrior. This is just one of the many reasons why we’re friends.)
“Best” Isn’t Always Best
Part of the blame can be placed at the feet of vendors. I’ve worked at a few, and it’s rare to find one that has the tools to help customers quantify volume and cost before implementation. Most simply say, “Let’s just get this installed, see where it lands, and tune from there,” blithely ignoring how the level of effort (not to mention the political maneuvering in the C-suite) required to implement a new tool makes the sunk-cost fallacy a near-certainty.
But that’s only part of the blame. The other part rests at our feet – the monitoring engineers who need to better shoulder the responsibility of understanding and speaking for the needs of our organization. Because if we don’t, who will? Much of this comes back to things I’ve already ranted about:
- If we don’t have a plan for the monitoring and observability data being collected, any cost is going to seem like too much. If you have a plan, you’ll know exactly how much the data is worth and be able to evaluate the cost of a tool against it.
- Learn to speak the language of business: frame things NOT in the technical terms you find familiar and comfortable, but in terms that make it clear to the business why a new tool is needed and what value it will provide.
- Solve the problems your organization is actually having. Once again, it’s easy to get swept up in a vendor’s vision. But if your company isn’t having any of the problems that vision describes, it’s all wasted time and money.
Kevin added another nuance. Hyper-focusing on a single metric isn’t just sloppy; it can lead to real problems down the road:
I would even go so far as to revisit that zero-to-60 metric. What about zero-to-60-to-zero? Sure, any car can get to 60, but which car is able to do that AND return to status quo quickly? If you need a good “bad” example, look at the Gen 1 Dodge Viper. As my dad said, “plenty of giddy-up, virtually no whoa.”
How does that translate to monitoring and observability? Think about alerting. LOTS of tools are able to detect and trigger alerts based on extremely specific (and sensitive) triggers. But far fewer have the controls to detect and stop alert storms.
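To make “controls that stop alert storms” slightly more concrete, here’s a minimal sketch in Python of the kind of rate-limiting logic such a control might apply. Everything in it is hypothetical – the AlertStormGuard name, the thresholds, and the notify/summarize/suppress actions are made up for illustration, not any particular vendor’s API:

```python
from collections import deque
import time

class AlertStormGuard:
    """Collapse alert storms: if a rule fires more than max_alerts times
    within window_seconds, send one summary and suppress the rest."""

    def __init__(self, max_alerts=10, window_seconds=60):
        self.max_alerts = max_alerts
        self.window_seconds = window_seconds
        self.recent = {}  # rule name -> deque of recent firing timestamps

    def action_for(self, rule_name, now=None):
        now = time.time() if now is None else now
        firings = self.recent.setdefault(rule_name, deque())
        # Age out firings that have fallen outside the window
        while firings and now - firings[0] > self.window_seconds:
            firings.popleft()
        firings.append(now)
        if len(firings) < self.max_alerts:
            return "notify"      # normal behavior: page someone
        if len(firings) == self.max_alerts:
            return "summarize"   # send a single "storm in progress" alert
        return "suppress"        # stay quiet until the storm clears

guard = AlertStormGuard(max_alerts=5, window_seconds=30)
print(guard.action_for("disk-full"))  # "notify" the first few times, then it quiets down
```

The zero-to-60-to-zero analogy maps directly: triggering the alert is the giddy-up; the windowed suppression is the whoa.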
My point in all this is to remember that “best” always has to be weighed against YOUR values:
- Your real-world, actual business needs
- Your cost-to-benefit ratio
- Your team’s skillsets
- Your tolerance for toil and effort during the transition
- Your willingness to support one more tool in perpetuity
- …and so on.
Good Enough is Usually Good Enough
There is a story that is equal parts old, hilarious, and fake. It involves a hapless young man (it’s ALWAYS a dude) who decides to mount a JATO rocket to his car to see just how fast he can go. And in the ensuing chaos, this young man (supposedly) earned himself a posthumous Darwin Award.
(Once again, I have to emphasize that this story is 100% fake and was even debunked on the very first episode of MythBusters.)
However, one of the aspects of the story that makes it funny (at least in my opinion) is the sheer ridiculousness of it all. Sure, lots of folks want a fast car. Many of those folks are willing to spend a little extra for a car that’s a little faster than the norm. Fewer (but not zero) people might also be willing to go to great lengths to acquire not only a fast car but “the fastest” car.
But strapping a rocket to the top of an old car – one that was clearly never intended to be used this way? That is some serious, janky, automotive slapstick. It tickles our funny bone by evoking those grainy sepia-tinted images of early 20th-century “flying machines” that were nothing more than 2 umbrellas strapped to a piston. As the final image of the JATO story fades in our mind, we can almost see the end card saying, “You just couldn’t leave well-enough alone, could you?”
Likewise, we need to foster a habit of self-restraint and technical reflection in our monitoring discipline. We have to recognize when our excitement about a tool’s ability to process 8 million log messages a second clouds our own ability to step back and say, “Why do I even HAVE 8 million log messages a second?” And if there’s a good reason for that volume, follow up with the question “Why do I need to send every single one of those messages across the internet to a vendor’s storage?”
I’m not saying there’s nobody in the world who might need that. You might have a good reason. I just want to suggest you take a second to consider things before you end up creating your own JATO-powered observability disaster.
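If you do decide some of that volume isn’t worth shipping, the fix doesn’t have to be exotic. Here’s a minimal, purely hypothetical sketch (the level names, sample rates, and forward callback are my assumptions, not any specific pipeline’s configuration) of trimming log volume at the edge before paying to send it to an external backend:

```python
import random

# Hypothetical per-level sampling rates; tune them to taste (or to budget).
SAMPLE_RATES = {
    "DEBUG": 0.0,   # drop debug noise entirely
    "INFO": 0.1,    # keep roughly 1 in 10 informational messages
    "WARN": 1.0,    # keep everything that might require action
    "ERROR": 1.0,
}

def ship_if_worth_it(record, forward):
    """Forward a log record (a dict with a 'level' key) only if it survives
    level-based sampling; everything else dies quietly at the edge."""
    rate = SAMPLE_RATES.get(record.get("level", "INFO"), 1.0)
    if random.random() < rate:
        forward(record)
```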
The Solution is Both-And, not Either-Or
This is the thing most vendors won’t say, or at least won’t admit within earshot of investors and board members, because it flies in the face of the marketing hype and sales pitches they’ve worked so hard to craft.
You need more than one monitoring and observability solution. You need to decide which systems (and more fundamentally, which data from those systems) each tool will handle. And you’re going to have to split your budget (time, effort, skills, money) between those tools.
This is an ugly but unavoidable truth. Over 27 years of working with monitoring and observability tools, the number of times I’ve seen a company that had exactly one monitoring solution is: zero. Even the ones that insist otherwise turn out, after a little digging, to have at least a few pockets of the environment that use something else, whether that’s a class of system (mainframes, minis, Windows NT servers, 6509s); an organization or team that just went their own way or was acquired but never fully integrated; or a location that – due to its distance from the main organization, either functional or geographic – has to maintain its own set of tools. And most orgs don’t have “at least a few pockets”. They have a full suite of overlapping solutions.
You are going to have – more likely, you already DO have – multiple monitoring solutions in place. This is one of those tech realities that is simple but not easy – like supporting more than one operating system (whether servers or desktops), moving from branch-based development to feature flags, or building a multi-language, multi-cloud app. It’s the cost of doing real business in the real world.
Observability, like life itself, is messy. It also, to quote Ian Malcolm, finds a way. So will you. Here’s how:
Plan and prepare to identify data types based on complex criteria: it might be a combination of location, system type, data type, and even time frame. Expect that, based on those parameters, you’ll then filter and transform the data before sending it in the correct direction.
Also expect that some data types are so valuable, you’ll end up sending the same data in more than one direction. But also prepare to put boundaries in place so you aren’t doing that all the time because that gets expensive fast.
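As a sketch of what that routing might look like – all of the rule and destination names below are placeholders, not a real product’s configuration – think of it as a set of match-and-route rules, where high-value data is allowed to fan out to more than one destination and everything else lands somewhere cheap:

```python
# Each route pairs a match test with one or more destinations.
ROUTES = [
    # Security events are valuable enough to send in two directions
    (lambda r: r["type"] == "security", ["siem", "cheap_archive"]),
    # Data-center metrics go to the metrics/APM tool
    (lambda r: r["type"] == "metric" and r["site"] == "datacenter", ["metrics_tool"]),
    # Only warning-and-above logs go to the (expensive) log platform
    (lambda r: r["type"] == "log" and r["level"] in ("WARN", "ERROR"), ["log_platform"]),
]
DEFAULT_ROUTE = ["cheap_archive"]  # everything else goes somewhere inexpensive

def route(record):
    """Return every destination this record should be sent to."""
    destinations = []
    for matches, targets in ROUTES:
        if matches(record):
            destinations.extend(targets)
    return destinations or DEFAULT_ROUTE

# A security event fans out to two places; an INFO log falls through to the archive.
print(route({"type": "security", "site": "branch", "level": "INFO"}))
print(route({"type": "log", "site": "branch", "level": "INFO"}))
```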
As I said, you need to be ready to support multiple tools. But you should have a plan for how you’ll keep track of those tools, identify each one’s primary use case, and even set boundaries on the things they will NOT be permitted to monitor, so your stable of solutions doesn’t spiral out of control.
One way to do that is to set, for every data type or use case, a definitive choice for a primary tool that handles that data and a secondary one that you use as a gut-check. For larger organizations or more important data sets, you might have a tertiary, but draw the line there.
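In practice, that can be as unglamorous as a lookup table. Here’s a hypothetical sketch (the tool names and data types are placeholders, not recommendations) of what writing down those primary/secondary/tertiary choices might look like:

```python
# Hypothetical registry: which tool owns which data type, and in what role.
TOOL_REGISTRY = {
    "network_flows":   {"primary": "flow_tool",  "secondary": "packet_capture"},
    "app_traces":      {"primary": "apm_vendor", "secondary": "oss_tracing"},
    "security_events": {"primary": "siem",       "secondary": "log_platform",
                        "tertiary": "cold_archive"},  # important enough to justify three
}

def owner_for(data_type, role="primary"):
    """Look up which tool is on the hook for a given data type and role."""
    return TOOL_REGISTRY.get(data_type, {}).get(role)
```

The point isn’t the data structure; it’s that the decision is written down, so the next “best” tool has to argue its way into a slot instead of simply accumulating.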
Another way (complementary to the first) is to understand that some tools are cheap and do a lot of things mostly OK, so you can spread them like peanut butter across the enterprise, while others are expensive (in time, effort, or money) and only do certain things well. Spread THOSE like caviar – only on the systems where they’ll do the most good.
Finally, differentiate between management tools that also have a monitoring component and true monitoring and observability solutions. You shouldn’t get rid of the management tools, but you also shouldn’t make them the primary source of truth for enterprise monitoring, because they’re usually vendor- or system-specific, and leaning on them that way leads (again) to the explosion of tools I cautioned against earlier.
A Brief Buyer’s Guide
A natural question to ask next is, “How do I know WHICH tools to get?”
Once again, my buddy Kevin has some wise observations:
Is it even worth mentioning bake-offs? Do people even do that anymore? Maybe tool A has these features, but tool B has these other ones. And they both share some other capabilities. We need all of it, but can’t get it in one package.
There’s nothing wrong with that. Leon’s point about having multiple tools is on target. Bias your decisions to picking the right tool for the right job (but at the same time, try not to collect too many tools. This ain’t Pokémon).
Don’t buy for “this tool has a neat feature we don’t need, but maybe someday we’ll want it.” Buy (or deploy) for what you need now and in the near future. IT (and the business) is always growing and evolving. What you THINK is important today may be irrelevant in six months. Hopefully not, but thinking too far ahead – the infamous “five-year plan” – is just a waste of your effort and time.
Taking a Victory Lap
The point of this blog is pretty simple: There’s no such thing as “best”, and that goes for everything from cars to programming languages all the way to observability solutions. But more essential is my point about WHY there isn’t a single, specific “best” – it’s because context matters. Use case matters.
To be more nuanced, there probably is a “best,” but what is best is extremely particular to you and your circumstances. So, the lesson in all this is to make sure you are clear about those circumstances and that you’re always weighing them against the “we’re the best” marketing hype you’ll hear from many vendors.