Transcript
Olimpiu Pop: Hello, everybody, I’m Olimpiu Pop, an InfoQ editor. And I have in front of me Marcin. Marcin, please introduce yourself.
Marcin Grzejszczak: Hey, Olimpiu. My name is Marcin Grzejszczak. I’ve been working on the Spring and Micrometer open-source teams for the past decade, mainly focusing on the topic of contract tests, so Spring Cloud Contract, continuous delivery and deployment, and Spring Cloud Sleuth, so distributed tracing. And recently, for the last couple of years, Micrometer, mainly Micrometer Observation and Micrometer Tracing.
Olimpiu Pop: Okay. So, all the cool stuff that happened in the last decade?
Marcin Grzejszczak: You said it. I did not.
Olimpiu Pop: It’s getting close now. We started with a monolith, criticising its ugliness. And then we realised that at points, monoliths are good, and at other points, microservices are good. But what’s necessary is always to know what’s happening under the hood.
If it doesn’t work in a monolith, it will be worse in a microservices setup [01:17]
Marcin Grzejszczak: Yes, absolutely. And recently, I started doing mentoring sessions. In one of the sessions I had, somebody said that they’re planning their career and they don’t know what to pick. And they said, “Yes, maybe microservices are the way to go”. And I’m like, “If you can’t write proper code in a monolithic app, you’ll make it even worse in a distributed one”. So, I am absolutely not surprised that with the emergence of the trend for microservices, people thought that certain principles no longer matter. It’s quite the opposite. It’s always the principles that matter. And if we do not know those, then when we move to distributed systems, we’ll make things far worse over there.
Olimpiu Pop: Yes, I agree with that. Lately, I spoke to two of the pioneers of microservices. Or no, not microservices, but distributed systems. One of them was Sam Newman, who, even though everybody might call him the father of microservices, said he was in the room when the term was coined, but he was not the one who coined it. But everybody knows his book on microservices.
Marcin Grzejszczak: A fantastic book.
Olimpiu Pop: Yes. Well, the new version is coming this autumn. And the other one is Luca Mezzalira. It’s the same story; he’s a father figure when it comes to micro-frontends. Everybody agrees that micro-frontends or microservices aren’t always the best choice. So, it’s better to see what you have and what the trade-offs are that you’re embracing while doing that. But that’s a whole different story. We have a decade, let’s say, since microservices were adopted. In any of the distributed architecture patterns, such as micro-frontends, microservices, or data meshes, some form of consolidation is underway. People are understanding what needs to be done and all this kind of stuff.
And one thing that I see more and more, and I’m pleased about it, is OpenTelemetry. Observability is becoming a key measure for technical people to understand what’s actually under the hood and, in turn, the impact on the end customer’s user experience.
Marcin Grzejszczak: You mean the observability, right?
Olimpiu Pop: Yes, exactly. This is a significant portion of the budget for the last period.
Regardless of the technology, observability matters [03:58]
Marcin Grzejszczak: Yes. Observability is what I’ve spent the last 12 years on, more or less. Before joining Pivotal, we were creating open source projects that also involved distributed tracing, and part of that was donated to Spring Cloud Sleuth. So, for the past 12 years or so, I’ve been working on observability. Consolidation, I do see that. However, quite a few people are mixing terms, which you just did. Instead of saying observability, you said OpenTelemetry, which is the elephant in the room here. OpenTelemetry was founded after OpenZipkin, a service that allows you to visualise latency between different applications, was established. It also involved a standard for header-based tracing context propagation.
Let’s use proper terms here so we’re on the same page. Regardless of the tool you’re using, it’s the observability that matters. I fully agree that, irrespective of whether you have microservices or a monolithic application, without proper metrics or observability, you have no fundamental understanding of your app. You can’t answer any question. So, a business department comes to you and says, “I think our app is slow”. How can you say that if you don’t have any metrics? Is it slower than, what, yesterday, an hour ago? And how slow was it an hour ago? What does “slow” even mean for us? During the mentoring sessions I mentioned, one person shared with me, “Hey, in our company, people want to rewrite Java apps to Go apps, because Java is slow”. This is what they say.
And I’m like, what? How did they measure that? What does it mean that it’s slow? Do you even have metrics to measure that? But let’s go even further. You actually touched on this topic with microservices, that sometimes it’s not a good idea to start with microservices. Still, it’s always a good idea to ask proper questions before we decide on anything in IT. You want to rewrite from Java to Go? Fine. Do you have people who will support it? How many people know Go? Do you have proof that it’s going to be faster? Let’s ask ourselves a question. Is it even a problem that it’s now in Java? I thought we were past this; I made that mistake myself. That was around 12 years ago, when I attended a conference and discovered MongoDB. And I’m like, wow, this is amazing.
And I came back to my job and said, “We need Mongo”. And my manager said, “Fantastic, what for?” And I’m like, “We need to store JSONs. So, that’s what Mongo does”. And my manager said, “We don’t have Mongo at this point, so will you be the one who’s going to set it up for all of our environments, including production? Will you support this? Will you set it up in a high-availability fashion?” And then I said, “You know what? PostgreSQL is great for storing JSONs”. So, I thought this fashion-driven development or conference-driven development was gone. But apparently, it’s not. So, even though we are a decade in, or more than a decade, into the microservices approach, some things are still the same. I think we need to continue repeating certain obvious facts and certain truths, because there’s rotation within the industry. And apparently, we just have to keep repeating things.
The Lord of Observability [08:05]
Olimpiu Pop: That’s for sure. Thank you, first of all, for clarifying the confusion I created. I agree with you, observability is critical. In the end, yes, there were two standards, and now OpenTelemetry is the only one in that regard. And that was one of the consolidations I was mentioning. But anyway, Java has been around for a long time, and it will remain one of the lingua francas of the internet. It has everything. It grows a lot. And now, with the cadence of two releases per year, you see it adapting. And definitely, it’s essential to understand what’s happening there. So, how did you see the evolution? As you mentioned, you have been here for the last 12 years in the observability space. How did it evolve during this period? How do you see it? Is the direction we are currently in appropriate or still in need of improvement?
Marcin Grzejszczak: That’s a great question. And I was there. It’s like a quote from The Lord of the Rings. I was there 10,000 years ago, at the very beginning, when there was just Zipkin. And I’ve been there with Adrian Cole. I’m saying hello to Adrian. All the vendors were in the same room, discussing and doing workshops related to observability. Then OpenTracing was founded, as far as I remember, which tried to solve specific problems. However, the biggest issue with OpenTracing was its lack of backwards compatibility, which meant that things were constantly breaking. Then we had OpenCensus that tried to fix the same thing. And now you have the merge of OpenTracing and OpenCensus, which is OpenTelemetry.
Has it evolved appropriately? I have strong opinions on that, so if you are interested in me presenting those opinions, I can do so, since in my view it tried to address problems that weren’t there in the first place. The order of actions taken with OpenTelemetry was fascinating to me, because what is most important is agreeing on things that go over the wire. So, the protocol is the most important. There was already one, B3 from Zipkin, so the W3C one was created. Okay, fair enough. But the first step was making the API, which defined how it should handle spans and other signals. I found that curious, as the protocol was the most important part.
Then, the second most crucial aspect is semantic conventions, but everything was done in the opposite direction. And I don’t want to delve into, let’s say, the specifics of problematic political discussions. Political in terms of politics within the space that we’re talking about, not the politics as such. I found it bizarre when people in the CNCF Slack channel asked about Micrometer, and those moderating the space from OpenTelemetry told others not to ask questions about Micrometer in the same channel. That was very bizarre to me.
All in all, I have never had any problems with any competitors of the libraries or companies I’ve worked with. There is space for everybody. We’re very kind and open-minded people. I’ve been talking to and instrumenting frameworks that are direct competition for Spring, and I have never had any problem with that. Additionally, Micrometer Observation, tracing, and metrics can integrate with OpenTelemetry, which indicates our openness to collaboration. But the willingness wasn’t mutual. And here we can put a full stop.
Olimpiu Pop: Yes. Let’s leave the politics to the politicians. So, the rest is politics, and we’ll keep on the techie side, because I think that’s-
Marcin Grzejszczak: Absolutely.
Observability for the Java ecosystem [12:47]
Olimpiu Pop: … better. So, currently, is Micrometer the way to go for instrumenting a Java application, or even more than that, to achieve observability? Or what do you say?
Marcin Grzejszczak: So, it all depends.
Olimpiu Pop: Obviously.
Marcin Grzejszczak: And then it depends on what? So, Micrometer is a very mature project. The community drives us, and we care a lot about the API. So, we do our best not to break any compatibility, and we’ve been very successful at doing so. I’m biased, because I work on the project. But it’s an excellent starting point, especially with the Micrometer Observation API. That’s something very nice that we built. I shouldn’t brag about this, but I’m pleased with what we built, because it completely inverts the way you think of doing observability. On top of that, several frameworks are instrumented with it, making this process completely transparent to the user.
When we consider developers writing business code, it’s often the case that they don’t need to instrument anything manually. If there is a critical part of the business process that they want to time separately, they can use the Micrometer Observation abstraction, because a timer and a span are very similar in concept, so they can instrument once and get multiple signals out of it. However, I would use the entire Micrometer portfolio for the business-related measurements of your application, which you have to add manually, as they are specific to your business needs.
So, from the technology’s point of view, you can’t know upfront that, for example, the number of cars you have sold in the last minute is the metric you want to have, and that based on that you should have alerts and dashboards. You should discuss with your business what it means that the business is doing well. What kind of metric defines that? Let’s say you have great technical metrics, fast responses, and everything is fantastic, and still your business is in a terrible state, because from a business perspective the metrics are showing terrible results. So, if you don’t know the answer to this, it means you don’t have proper metrics.
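To make the Observation point above concrete, here is a minimal sketch of the Micrometer Observation API as described; the observation name, the tag, and the business logic are illustrative, and in a Spring Boot application the ObservationRegistry would typically be auto-configured and injected rather than created by hand:

    import io.micrometer.observation.Observation;
    import io.micrometer.observation.ObservationRegistry;

    public class CarSaleService {

        // Illustrative: in Spring Boot this registry is usually auto-configured.
        private final ObservationRegistry registry = ObservationRegistry.create();

        void sellCar() {
            // One instrumentation point; with metrics and tracing handlers registered,
            // this single observation yields both a timer and a span.
            Observation.createNotStarted("car.sale", registry)
                    .lowCardinalityKeyValue("showroom", "berlin")
                    .observe(() -> {
                        // business logic to be measured goes here
                    });
        }
    }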
Olimpiu Pop: Okay. So, more or less what I understand from what we discussed until now, and from one of your previous points, is that out of the box, Micrometer provides you, let’s say, the plumbing: it covers the metrics you should care about when looking at how the application runs from a technical point of view. Memory, connections, and so on are the bread and butter of what the framework can bring you. And then you have the opportunity to think like a product developer, where you’re thinking about the benefit for your business. And that’s something you should instrument yourself and make sure it makes sense for the company you are running.
Marcin Grzejszczak: Absolutely. So, the way I articulate my thoughts is not that bad. But I liked what you said. As a product developer, especially in the age of AI, the fact that you know how to code doesn’t mean a lot. It’s still critical, especially considering the security vulnerabilities that the vibe coding approach introduces. Recently, I attended a conference where several presentations on AI were given. Most of them were related to security. The speakers presented various data points associated with the security of AI-generated code, and the results were terrifying. As a product developer, your current advantage lies in your knowledge of the business, which AI does not yet possess, nor can it communicate with the business as effectively as humans can.
So, you emphasised the significant thing. As a product developer, one of the key things is knowing that the product you’re building works fine. To answer that, always ask questions like why and how. What does it mean that my product is working fine? How can I measure it? When does it mean that it’s not working fine? What can I do to address and remediate that?
Olimpiu Pop: Yes. Being a product developer is something that I’ve been working towards for a long time. So, while you focused on observability, I concentrated on product development, which was the main challenge. Usually, developers came in, and they were very technical. They wanted to code, they didn’t care about the product, and they just said, “Okay, somebody else has to come”. So, they wanted to be code monkeys, where someone else provided the formula or something similar, but they didn’t grasp the implications. And that’s why I am looking at observability, because it gives you a laser that cuts through everything. It can provide feedback on how the actual users are behaving, and that allows you to have a rapid feedback loop. But what’s odd is that there is a lot of data, and it’s unclear which of it is most important.
Observability in a heterogeneous architecture [18:50]
So, that’s why it’s important, the point you made earlier… When driving a car, you ensure the tank is full and everything is in order, but then it’s essential to get to the destination. And that allows us to drive in the right direction. What would be your advice on using Micrometer in such a setup? There’s probably less of this than there was some time ago, but one of the promises of microservices was that you can live in a very heterogeneous environment, where those who want to write Go applications for the sake of it, or Java applications, or TypeScript, or whatever, can all live together.
Well, it’s challenging, because if you have a manager like the one you mentioned earlier, you’ll have other discussions, but some people aren’t that lucky. So, how would Micrometer look in a heterogeneous environment? How do you ensure that the context moves from one application to the other, so that you have a broad view of everything that you have there?
Marcin Grzejszczak: Sure. That’s a great question, and we can look at that from a couple of angles. So, if we’re talking about metrics, that’s not an issue, because Micrometer gives you an abstraction over the metrics-collecting systems. So, if the TypeScript app is using Prometheus, Micrometer can also use Prometheus to aggregate metrics. So, that’s not a problem. You should use the same tags, the same names, and so on. Here we’re quite well covered. Context propagation is the more interesting part. For those unfamiliar with distributed tracing, it involves setting an identifier for the entire business process. For instance, the application receives a request, and then it moves through 25 other applications. Let’s say we have 26 hops of requests; then we will have one so-called trace identifier. So, it’s an identifier for the same business process, regardless of who is processing it, and we have at least 25 span identifiers.
So, a span is, in general, a single operation that is being processed. The problem here is that for the identifier to be propagated, it must be propagated by some means. The way to achieve this is, for example, by enriching HTTP headers. The only problem is that if I call it trace ID and Olimpiu calls it correlation ID, we have a problem, because we will not be able to talk to each other and understand that we’re talking about the same thing. This is why standards were formed for context propagation. There are different standards. For Zipkin, that was B3, and there’s a W3C one as well. So, in terms of multi-language support, our job as a library or a tool is to ensure that we adhere to the standards, which means that if we put the proper data in the headers, we can assume that the recipient will know what to do with it.
So, Micrometer Tracing just delegates the work to a concrete tracer, and that library does the actual work. That’s OpenZipkin Brave or OpenTelemetry as such. They know how to enrich the given transport, or what kind of headers should be put into it; Micrometer just delegates the work to them. If that work is done correctly, then we can safely assume that if the TypeScript application knows how to propagate that further on, the context will be passed on appropriately. That’s more of a library’s job. Typically, the end user, the developer, does not have to know or care about this, because usually the frameworks will take care of that.
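For reference, the two context propagation formats mentioned here look roughly like this on the wire; the identifier values below are purely illustrative:

    # W3C Trace Context: a single traceparent header
    # version - trace-id - parent (span) id - flags
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

    # Zipkin B3, multi-header variant
    X-B3-TraceId: 80f198ee56343ba864fe8b2a57d3eff7
    X-B3-SpanId: e457b5a2e4d86bd1
    X-B3-Sampled: 1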
Olimpiu Pop: Okay. So, what I’m hearing from you is that the frameworks, to call them broadly, even though there are libraries and so on and so forth, are doing the heavy lifting, as they should.
Marcin Grzejszczak: Correct.
Olimpiu Pop: And I, as a user, shouldn’t care about that, because everything is happening correctly, as long as I’m using the same approach, whatever. Let’s say we start from the browser where we have the input, we have the first HTTP request that hits the… I won’t go into details, the first application in the chain, and then that cascades all over the place to all applications, as long as I’m using the same implementation across. So, I don’t know, the …
Marcin Grzejszczak: The same standard. The same context propagation format. Yes.
Olimpiu Pop: Okay. So, it’s about the standard, and then everything just happens. And then the trace ID is passed through the headers, as it should be, because that’s the place for the metadata.
Marcin Grzejszczak: Correct.
Olimpiu Pop: And that will allow me to see those things properly aggregated, and I’ll see the call and then all that stuff?
The devil lies in the details – multi-threaded applications instrumentation [24:41]
Marcin Grzejszczak: Correct. Plus, because the devil always lies in the details, remember about threads. I’m not sure how it works in different languages; I know about Java. But typically, the trace or tracing context, let’s call it, is put in a thread local. And this is the way it’s then retrieved and pushed forward. If you’re changing threads, that context would normally be lost, so we instrument things, so that you do not have to, in such a way that we know how to propagate things between threads. And for that, we created a library for context propagation in Micrometer.
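As a rough sketch of what is described here, this is how the Micrometer context-propagation library can be used to carry thread-local values, such as the tracing context, across a thread boundary; the executor and the task body are illustrative, and the exact factory names may differ slightly between library versions:

    import io.micrometer.context.ContextSnapshot;
    import io.micrometer.context.ContextSnapshotFactory;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ContextHandOff {

        public static void main(String[] args) {
            ExecutorService executor = Executors.newFixedThreadPool(4);

            // Capture the thread-local values (for example, the current trace/span)
            // on the calling thread...
            ContextSnapshot snapshot = ContextSnapshotFactory.builder().build().captureAll();

            // ...and restore them on the worker thread for the duration of the task.
            executor.submit(snapshot.wrap(() -> {
                // work submitted here sees the same tracing context as the caller
            }));

            executor.shutdown();
        }
    }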
Olimpiu Pop: Okay. That changes things. So, back in the day, when everything was normal and you had smaller applications and contexts like that, you had the logs, because that was what you had at that point. And you had the thread ID, and that was more than enough. But now, obviously, it’s a multi-dimensional space. On one hand, you’re looking at the distributed level, where you’re moving from one context to another, so from one, let’s say, microservice to another service. But then, inside those, you might be making different parallel calls, and that needs to be taken into account as well.
Marcin Grzejszczak: Instrumented, correct. And let’s keep the ball rolling. What about reactive apps?
Olimpiu Pop: Obviously.
Marcin Grzejszczak: That’s the same problem. And I’ve been working on the problem of instrumenting reactive apps for eight or nine years. And here, Derek Kultgen, so the lead of Project Reactor. Hello, Derek. Thank you so much for your help with fixing the problems in Project Reactor. We sat down for weeks and analysed with pen and paper to understand the flows, et cetera. We’ve also managed to integrate proper instrumentation into Reactor processing within the reactive streams. We put a lot of effort into ensuring that people using Project Reactor, Micrometer, and Spring know that everything is working fine. I don’t recall the last time I filed an issue about Reactor’s instrumentation, so we’ve finally resolved it.
Micrometer – what’s in the box? [26:56]
Olimpiu Pop: Congratulations. Well, Reactor is being used a lot in the MCP space, with all the hype and everything, so it must be working as expected, especially on the observability side. Then, as a developer, will I have an easy life, and what are the boundaries? You mentioned earlier that you’ve instrumented numerous projects, providing plug-and-play, out-of-the-box functionality that delivers the observability we need. So, as a developer, I’ll take the library, drop it in, compile it, run it, and that’s it. And I have to focus on instrumenting only the aspects that I should care about, because the application’s business flow is something you can’t know. Is there anything else I should know, like odd cases that are not covered, that are very complex to instrument, or that are too complicated?
Marcin Grzejszczak: So, multithreading, that’s for sure, because you can do some business processing with executor services and things like that. So, then you use libraries, like context propagation, to ensure that the executor service that you’re using is properly wrapped. It works by capturing a snapshot of the current thread-local values before handing work to a new thread, and then we restore them in the new thread. If you do it by hand, it won’t work out of the box unless you have the wrapper. If you’re using frameworks like Spring or Spring Boot, then things should work out of the box, because we also instrument them depending on what you have on the classpath. So, if you have Micrometer on the classpath, there’s nothing you have to do. If you’re not using Spring, we instrument the various components, but you typically glue it together yourself anyway.
So, you have to ensure that you’re using our wrappers. And the most important thing is that the vast majority of instrumentation is done in the projects that we are instrumenting. That’s the big differentiator between the philosophy of Micrometer and other frameworks out there, because we believe that having one, let’s say, go-to project that has dependencies on all the projects out there in the ecosystem is unmaintainable. This is why I instrumented Apache Camel: the instrumentation is there in Camel, it’s part of its code base. There are several other projects that we… Resilience4j, for example, is another library that has a similar approach.
Instrumentation should live there, because when they add new features, fix bugs, or make other changes, and they break observability, they will know immediately, because it’s in their code base. Whereas, if you have a go-to-project-like approach, which Spring Cloud Sleuth was in this case, that’s unmaintainable, because you find out that things are broken post-factum. For instance, I instrument Resilience4j. Resilience4j breaks something. And I only know about that the moment I upgrade their version. It’s already too late, as the users will not be able to upgrade to the newest version due to my broken instrumentation. This is why we completely changed the approach and decided that the instrumentation should take place within the given project.
Also, because sometimes, especially in the Sleuth days, I had to do some hideous hacks to ensure that the instrumentation takes place: that I retrieve the headers, do some stuff with them, and properly close everything. Whereas, if the instrumentation happens within the project, you have access to package-scope classes and things like that. This is how it should be done.
Olimpiu Pop: Well, it sounds normal. This would be like electricity. If it works, you shouldn’t care about it. Only if it doesn’t work do the headaches arise. What else should I know about Micrometer or observability that I didn’t ask at this point?
Marcin Grzejszczak: The key thing to remember about observability, especially in terms of distributed tracing, is that what you produce at the end of the day is data, tracing information or observability-related information, and the same applies to metrics. But let’s focus on tracing right now. Ingestion of the data costs money. The more data you ingest, or have someone else ingest, the more you pay. This is why there are frameworks that provide a lot of metadata and tags for your application. And that’s great, because you have more insight into what happens with your application when you look at traces. But on the other hand, remember that at the end of the day, there are vendors who earn money from that. And they have the right to do so, but the fact that you have a lot of tags implies significant costs.
This is why you should always think about that. For example, with Micrometer, you have an option to have a filter that allows you to post-factum add certain tags, or remove them. Always consider the cost, because adding observability can increase it, and then it turns out that you’re paying gigantic bills for it. We, as product developers, exist to provide business value, generating revenue for the company. If we don’t keep this in the back of our minds, and we say, “Hey, we want observability, we want as much data as possible”, without calculating the costs, the cost can be much bigger than the benefits.
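On the metrics side, one mechanism for this in Micrometer is a MeterFilter registered on the registry; a minimal sketch, with made-up tag names, showing how tags can be added to or dropped from every meter before they are published:

    import io.micrometer.core.instrument.Tags;
    import io.micrometer.core.instrument.config.MeterFilter;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    public class TagCostControl {

        public static void main(String[] args) {
            SimpleMeterRegistry registry = new SimpleMeterRegistry();

            registry.config()
                    // add a cheap, low-cardinality tag to every meter
                    .meterFilter(MeterFilter.commonTags(Tags.of("region", "eu-west")))
                    // drop a high-cardinality tag before it inflates the ingestion bill
                    .meterFilter(MeterFilter.ignoreTags("userId"));
        }
    }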
So, as developers, regardless of the case, whether it’s observability or not, our decisions can have a very concrete impact on the business. We should always think twice before deciding to do anything that could impact the business side. For information about Micrometer, please visit our website at micrometer.io, where you can find the documentation. Last year, we invested a lot of time in rewriting the docs and making them look much nicer. We value our users’ feedback. It’s invaluable to us. So, please visit our issue tracker. So, go to GitHub and the Micrometer project to provide any feedback you’re willing to share, because we want the product to serve you as well as it can.
Olimpiu Pop: Thank you for that. And from what I know, it’s version seven, the current major version of Micrometer, or at least-
Marcin Grzejszczak: Version seven will be the next major version of Spring. So, right now, we have Spring Boot 3.4. And Micrometer, now we have 1.15.
What’s next on the roadmap? [34:29]
Olimpiu Pop: Anything that you’re excited about on the roadmap that would be important in the way things are happening?
Marcin Grzejszczak: The most critical part of our roadmap is following what the users want. So, you can check our issue tracker to see what has the most thumbs-ups. To ensure that your concerns are addressed in the next minor, simply give the issue enough thumbs-ups. In general, what the users want the most is what will be done.
Olimpiu Pop: Thank you. Thank you for that. So, if we want to have our voice heard, it’s essential to promote it and give it our support. Thank you for the insightful conversation, Marcin. All the best with everything that’s coming your way.
Marcin Grzejszczak: Thank you so much.