World of Software

High-Resolution Platform Observability

News Room
Published 28 August 2025, last updated 10:50 AM

Transcript

Martin: Welcome to High-Resolution Platform Observability. My name is Brian Martin. My pronouns are he, they. I work at IOP Systems. My GitHub is github.com/brayniac. I focus on performance and optimization, which means I spend a lot of time benchmarking and profiling and trying to make things more efficient. I work on a few open-source projects.

All of these actually started at Twitter: Rezolus, which is a performance telemetry agent; Pelikan, which is an in-memory caching framework; and rpc-perf, which is a benchmarking tool. I think these synergize. It’s really hard to write a high-performance service without having a good benchmarking tool. You can’t tell where your performance gaps are without good observability. It’s been very interesting getting to work on those three and have them complement each other. My Twitter peers and I cofounded IOP Systems. We’re doing software testing with rich telemetry to get performance insights. It’s similar to what we were doing at Twitter, actually. At Twitter, I did a bunch of performance work. Even before that, I was doing hardware and OS qualification performance testing. I’ve been doing this stuff for a very long time.

Platform Observability

When we talk about observability, people like to say that there are three pillars of observability. We have logs, which tell us when things happen. We have metrics, which give us visibility into our operating systems, our services, our hardware. We have traces, which try to follow the path of one thing through our infrastructure and show us where the time went. Platform observability is really more than just those three pillars. We need to understand why we even need these things. Our use cases are generally to answer questions about our infrastructure health, our utilization, and our performance.

None of this data is useful without the ability to visualize, analyze, and report on it, and to take action or fire off a pager. With that in mind, I want to focus specifically on metrics, because most of my background is in metrics rather than logging or tracing at scale. I want to talk about ways in which metrics for health, utilization, and performance have all led me totally astray and have painted incorrect pictures.

Types of Metrics

Since I’m talking about metrics, I think there are three primitive types of metrics. We’ve got gauges, counters, and distributions. Just so we’re all on the same page, this is going to go quick, but gauges are instantaneous readings. It’s like how fast you’re going down the highway right now. Your speed is going to go up or down. It really doesn’t tell us any history. It doesn’t tell us how far you’ve traveled. You can try to figure that out by measuring with a stopwatch and checking your speedometer very frequently, but it’s not the right tool for that job.

Next, we have counters. This is a tally clicker, but your odometer on your car is another good analogy. It tracks how many times things have happened, how many miles you’ve gone. The mileage on your car doesn’t decrease. It’s always monotonically non-decreasing, so it’ll go up or stay the same. It does capture some degree of history. You can use it with time to figure out what the average speed was between two points in time. Similar to gauges, you have to sample at very high frequency to get good insights into things. I’ll touch on that in a little bit.

The last one is distributions, so things like histograms. These measure a whole collection of observations, basically. It’s maybe how fast all the cars on the road are going right now. You’ll have some that are going slower, some that are going faster. Normally, we care about summary statistics from these. Maybe we want to know what the fastest car on the highway is. Maybe we want to know what the 90th percentile is and see if most people are staying within the speed limit. Normally, we use something like a sketch or a histogram to record these observations and produce our summary percentiles.
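As a rough sketch of how a histogram turns a stream of observations into summary percentiles, here is a toy Python version with simple linear buckets. The class and bucket layout are my own invention for illustration, not any particular metrics library:

```python
# A minimal fixed-bucket histogram: linear buckets, illustrative only.
class Histogram:
    def __init__(self, bucket_width, num_buckets):
        self.width = bucket_width
        self.counts = [0] * num_buckets

    def record(self, value):
        # Clamp anything beyond the last bucket into it.
        idx = min(int(value // self.width), len(self.counts) - 1)
        self.counts[idx] += 1

    def percentile(self, p):
        # Walk buckets until we've covered p percent of observations,
        # then report the upper edge of that bucket.
        total = sum(self.counts)
        target = total * p / 100.0
        seen = 0
        for i, c in enumerate(self.counts):
            seen += c
            if seen >= target:
                return (i + 1) * self.width
        return len(self.counts) * self.width

h = Histogram(bucket_width=5, num_buckets=40)  # speeds 0..200 in 5 mph buckets
for speed in [55, 60, 62, 65, 68, 70, 72, 75, 90, 110]:
    h.record(speed)
print(h.percentile(90))  # -> 95, the upper edge of the bucket holding the p90 observation
```

Real implementations use logarithmic or log-linear buckets so a few hundred counters can cover many orders of magnitude, but the walk-the-buckets percentile logic is the same.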

For all of these metric types, the sampling interval comes into play as well. When we’re talking about metric usage, we want to know whether our systems are healthy or not. Whether our resources are being fully utilized or if we still have headroom. We want to know this for capacity planning type stuff. Detecting saturation is very important. We want some signals about how things are running on our systems. Since we want our infrastructure to be reliable, we need to know when things have gone wrong. We need to have a little light come up on a dashboard to tell us, you got a problem with your transmission, or you are very low on fuel. Please fix that before you get stranded. It can happen.

Sad Server Story

I want to start with a story about a sad server. We get a call that the database is acting slow. We have performance metrics, so we’re able to look and see that one backend has higher latency. We have a performance metric that shows this. We also have utilization metrics that tell us that the CPU was about the same. Our traffic patterns haven’t changed. We have logs that tell us there have been no deployments. What’s happened here? We must be missing something. Maybe the server is unhealthy. That’s one of the other use cases that our metrics are trying to answer for us. When I’m looking to figure out why a host is unhealthy, I often reach for really simple tools and run through a performance debugging checklist. Since all our health indicators in our dashboards were showing things were normal, our utilization dashboards were showing things were normal, I wanted to, knowing it was a database, start from the ground up. When I think database, I think storage. In this case, I reached for using dd to just read from the database volume and then write to it.

Reads looked great, but writes were abysmally slow, on the order of a couple of megabytes per second. Databases don’t like that. No, we didn’t have a health metric that told us this SSD was worn out. When I pulled up smartmontools, it was easy to see that this drive had reached the end of its expected number of writes. The failure mode was such that it was throttling things. We had health metrics for our drives; we just didn’t have coverage for all the drive models in our infrastructure. I think that’s a pretty big oversight there. Let’s fix that. Armed with fixed metrics that are getting ingested into our observability system, we’re able to generate a report. We’re able to look fleet‑wide at our SSD wear levels. We’re able to remediate all the hosts that are going to reach a similar state pretty soon. We were not far away from having this happen on many other hosts.

Sad Switch Story

Sometimes you do have all the metrics, though. Sometimes you’re not missing something. This story starts with a report of data corruption. Multiple services were seeing weird strings in responses. In some cases, payloads were failing to deserialize. We got something weird happening. I was looking at a cache node that had a corrupted value persisted into memory. I was able to go look at that. Yes, it was garbled. The cache software is pretty simple. It’s not doing anything wacky. What’s going on here? I start looking at the host health. I was able to see that TCP checksum errors were elevated on the host. It took us a while to realize, but slowly this large team of people who were responding to these actually very widespread reports of data corruption, we all start to realize that we’re looking at hosts that are in the same handful of racks in our data center. That means we have a localized problem.

Since we know it’s a problem with a few racks, we’re able to jump in, drain traffic from those racks. We’re able to get the site stabilized. Now we’re at a point where we need to figure out what’s going on here. Did we have some faulty switches that were corrupting packets? This shouldn’t really happen. Ethernet and TCP both have checksums. Even if a bit’s getting flipped, TCP should detect that corruption, and the packet will be dropped and retransmitted by the sender. We saw that we had elevated TCP checksum errors, so we know the checksums are catching the corruption. We know we have bad checksums. How can we reproduce this corruption that is making it through to our application? We shouldn’t be seeing this at all.

At this point we’re coming up with some wild theories about just the right bits in the right spots in the packet getting flipped in such a way that the TCP checksum was still valid. It could theoretically happen. I worked on building a tool. Yao helped me by working on the server side of it while I worked on the client. Basically, it was a glorified echo server. It would generate a random payload, all plain text so we could easily look at it in Wireshark and see what was happening. It signed the payload with a CRC32 that it also sent along to the echo server. If it got the payload back and a bit had flipped, it would be able to detect it. We have a few racks drained. This thing’s able to blast traffic very fast. If we’re seeing this weird type of corruption, we should be able to reproduce it. Of course, things don’t always work so nicely. Even though we were seeing TCP checksum errors on about 10% of the packets, we weren’t able to actually reproduce this until we ran this test service the way we were running some of our production services.
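The corruption-detection idea can be sketched roughly like this, assuming nothing about the real tool beyond what is described above. The payload format here is my own invention:

```python
import random
import string
import zlib

def make_payload(n=64):
    # Random plain-text body so corrupted bytes are easy to spot in Wireshark.
    body = "".join(random.choices(string.ascii_letters + string.digits, k=n))
    crc = zlib.crc32(body.encode())
    return f"{body}:{crc:08x}".encode()

def check_payload(data):
    # The echo server sends the payload back unchanged; verify body against CRC.
    body, _, crc_hex = data.decode().rpartition(":")
    return zlib.crc32(body.encode()) == int(crc_hex, 16)

p = make_payload()
assert check_payload(p)                   # clean round trip verifies

flipped = bytearray(p)
flipped[0] ^= 0x01                        # simulate a single bit flip in flight
assert not check_payload(bytes(flipped))  # the CRC32 catches it
```

Blasting payloads like this through the drained racks and checking the CRC on each echo is enough to detect corruption that slips past the transport layer.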

After moving it into a containerized networking setup, we were able to trivially reproduce this. We saw it right away after we did that. It turned out there was a one-line kernel bug that resulted in the TCP checksum being ignored in this particular case. Corrupted packets were just forwarded through the virtualized network stack to our application, which then persisted the corruption into our caches. What could we have done differently in this case? We weren’t looking at rack-level metrics. We had the metrics. We had TCP checksum error counters. Any of the engineers could have looked at any of these hosts and seen that they were elevated. Service owners probably shouldn’t be doing that. We need to look at the metrics at the right level so we can alert the networking team. Once we had alerting figured out for those, we were able to point out the problems. We were able to get some of our kernel people to submit a fix for that.

Takeaways from Health, Utilization, and Performance Metrics

What can we take away from these stories about health metrics? First of all, coverage is critical. There aren’t really that many different types of hardware or even that many different models of hardware in your fleet. It’s not just reasonable, but totally necessary and feasible, to have good health instrumentation on each of these things. You need a light that’s going to go off and tell you when there’s a problem.

The next is you need to be able to look at the metrics at the right level. You need to scope them along your failure domains. Then you can conclude: all the hosts connected to this top-of-rack switch are seeing TCP checksum errors. Let’s respond to that. Let’s not respond to data corruption after the fact. Shifting over to utilization metrics, I think these are trickier. We care about rates of utilization. We don’t necessarily care about the absolute level. We normally have some sort of counter metric and we’re taking a rate of it. This is going to cause its own problems. Here we’re looking at a CPU utilization chart. This was running in a VM, and I produced this for this talk.

Looking at this chart on the far left, we can see that the CPU utilization was a little below 20%. We start getting these spikes in utilization that are getting more intense as we move towards the right. We can start to see that we actually have spikes going up to 100% CPU utilization. I’m going to tell you that this isn’t really the case about what’s going on. Our workload is not actually changing here. This has a steady state workload being applied to it. Where I misled us was that I changed the Prometheus sampling interval.

On the left-hand side, we have minutely metrics. Then we have finer sampling intervals until we get down to secondly metrics at the end. We’re not able to see any CPU saturation until we hit 5-second sampling intervals. This workload was keeping a CPU busy for 10 seconds out of every minute. We can’t see that with minutely metrics. We only see it at 5-second intervals because that’s half the length of the burst. Nyquist sampling theorem stuff comes in here. You just can’t accurately see it otherwise. Then, at 1-second sampling, we’re able to better estimate the true length of the saturation.
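The effect is easy to reproduce in a few lines. This simulation assumes a cumulative CPU busy-time counter and a workload that pegs one core for 10 seconds out of every 60, then takes rates at different sampling intervals:

```python
# Simulated CPU-time counter: a core is 100% busy for 10 seconds of every
# minute and idle the rest. The counter accumulates busy seconds.
seconds = 180
busy = [1.0 if (t % 60) < 10 else 0.0 for t in range(seconds)]
counter = []
total = 0.0
for b in busy:
    total += b
    counter.append(total)

def utilization(counter, interval):
    # Rate over each sampling interval: delta(counter) / delta(time).
    samples = counter[interval - 1::interval]
    prev = 0.0
    rates = []
    for s in samples:
        rates.append((s - prev) / interval)
        prev = s
    return rates

print(utilization(counter, 60))      # minutely: every interval reads ~0.167
print(max(utilization(counter, 1)))  # secondly: peaks at 1.0, saturation visible
```

At minutely resolution every interval shows the same flat ~17% average; only finer sampling reveals that the core is fully saturated for part of each minute.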

Looking at another example where our utilization metrics can lead us wrong here. I’m not playing any tricks with bursty traffic. This is a Redis server under a steady state load. It’s just secondly metrics the whole time. This chart’s also lying to us. Really, in that setup, the Redis server was configured with six I/O threads. This VM had eight cores. We’re only using the ones that are all lit up bright white or yellow. We had two cores sitting there stranded idle. That other chart was showing about 75% CPU utilization, so we can see that we weren’t fully loading the CPU. If we thought we had headroom there, we would have been mistaken. Hopefully, those two charts drive home the importance of looking at utilization metrics being a little tricky. Our sampling interval really changes the story of what we’re going to be able to see. We’re operating services that typically have sub-secondly responses. Anything interactive is going to be sub-secondly.

If you’re operating something like cache, you’re probably single or double-digit milliseconds. Your minutely metrics aren’t going to cut it. Even secondly metrics might not be enough. Very brief periods of saturation can really mess up a whole bunch of requests. You can have a service impacting stuff with very brief periods of saturation. If you have something that’s bursty with a regular pattern, you might not ever sample those points that you need. Everything might get smoothed out. Like our health metrics, we need to be able to look at both aggregates as well as disaggregated metrics. We weren’t able to see that CPU was the bottleneck for our Redis server and that there was probably an application configuration issue there, something that could be adjusted. We weren’t able to see that by just looking at overall CPU utilization for that VM or container. We need to look at things at the right scope in order to be able to detect issues like this.

Lastly, metrics that are centered around performance. You can probably argue that performance metrics can be health metrics and utilization metrics, and these are all maybe interrelated to a degree. I think a good example of a performance metric is your request latency.

Digging into performance metrics, they have their own pitfalls. We’re not always looking at simple counter and gauge metrics. Often, we need to use histograms. That’s because we care about maybe our tail latencies, maybe our p99.9 latency. Just like with our counters, when we look at a histogram to try and make sense of it, we’re looking over some period of time. Especially if you’re defining an SLO around a percentile latency, do you mean that over 1 second? Do you mean that over a minute, 5-minute, 15-minute period that you’re going to hold yourself accountable to your customer? It really changes things. Little brief spikes, little intermittent changes can be totally masked by the interval over which you’re reporting on the histogram.

A second challenge with histograms is that, since you use bucketing to reduce the number of counters you need to keep in memory, those buckets have width, and that width introduces its own imprecision into the values you’re able to read out of them. I already said that our summary metrics can smooth things out. I think that’s really important to understand and think about when you’re defining an SLO or you’re evaluating the performance of your service. We also can’t aggregate summary metrics in any meaningful way. If you’re collecting your p99.9s from 1,000 hosts, there’s really not much you can do with that. You can try averaging them, or you can take a percentile of percentiles, but neither is an accurate way of looking at the distribution of latencies for your service. You need to collect the full distribution from each of those hosts, merge those histograms together, and then you can report out from the merged result.
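A quick simulation shows why percentile-of-percentiles goes wrong. The two hosts and their latency distributions here are hypothetical:

```python
import random

random.seed(42)

# Two hosts with very different latency distributions (ms).
host_a = [random.gauss(10, 1) for _ in range(10_000)]    # fast host
host_b = [random.gauss(100, 10) for _ in range(10_000)]  # slow host

def p99(values):
    ordered = sorted(values)
    return ordered[int(len(ordered) * 0.99)]

# Averaging per-host p99s...
avg_of_p99 = (p99(host_a) + p99(host_b)) / 2
# ...versus merging the raw distributions and then taking p99.
true_p99 = p99(host_a + host_b)

print(avg_of_p99)  # roughly (12 + 123) / 2, around 68 ms
print(true_p99)    # dominated by the slow host, around 120 ms
```

The averaged per-host p99s land nowhere near the p99 of the combined traffic, which is dominated by the slower host; only merging the full distributions gives the right answer.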

A lot of the time we only capture summary metrics, and we have to do really dirty things with them in our dashboards and hope that we’re going to see the right thing. Then there’s bucket width. Whether you’re using a logarithmic or a linear histogram has a very big impact on the resolution and accuracy across a very wide range of values. We might want to be able to store counts for any 64-bit value in a couple hundred buckets. You can imagine that there’s quite a bit of error that can come in there. You need to be able to report some summary metric. Do you pick the low end of that bucket to report out, or the upper end? Do you try to come up with some number in the middle of that bucket and hope that it represents things? These are tough questions and a lot of things to think about.
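For concreteness, here is a sketch of one common log-linear bucketing scheme; the grouping parameter and helper names are mine, and real histogram libraries differ in details. Values are grouped by power-of-two magnitude, and each magnitude is split into a fixed number of linear sub-buckets, which covers the full 64-bit range in a few hundred buckets at the cost of bounded relative error:

```python
GROUPING = 3  # 8 sub-buckets per power of two -> worst-case ~12.5% bucket width

def bucket_index(value, grouping=GROUPING):
    # Map a non-negative integer to its bucket index.
    if value < (1 << grouping):
        return value  # small values get exact, width-1 buckets
    power = value.bit_length() - 1
    sub = (value >> (power - grouping)) & ((1 << grouping) - 1)
    return ((power - grouping + 1) << grouping) + sub

def bucket_range(index, grouping=GROUPING):
    # Inverse: the inclusive [low, high] range of values in this bucket.
    if index < (1 << grouping):
        return (index, index)
    power = (index >> grouping) + grouping - 1
    sub = index & ((1 << grouping) - 1)
    low = (1 << power) + (sub << (power - grouping))
    return (low, low + (1 << (power - grouping)) - 1)

idx = bucket_index(100)
print(idx, bucket_range(idx))  # -> 36 (96, 103): values 96..103 share one bucket
```

When reporting a percentile from such a histogram, all you know is that the observation fell somewhere in that [low, high] range, which is exactly the low-end/high-end/midpoint question above.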

Where Is Platform Observability Headed?

With the pitfalls of various types of metrics for our various use cases behind us, I want to talk about where platform observability is going, and what is around to give us better insights into how our infrastructure is running. I think a very important thing is hardware performance counters, read through the CPU’s PMU (performance monitoring unit). These tell us how things are running on the CPU. We can track the number of instructions being retired, the number of cycles that have elapsed. We can look at things like our cache hits and misses. We were talking in some of the other talks about how for LLMs the cache hit rate can really be a big factor, and why you wind up so memory bound running inference on a CPU. We can look at our frequency. We can tell if frequency scaling is doing anything weird to us. There’s a whole bunch of other things you can look at. You can look at branch prediction hits and misses. You can look at all sorts of things that give you some indication about whether the processor stalled out waiting for backend things to happen, waiting for reads out of memory. Where these get tricky is that there’s a limited number of hardware counters that you can have programmed at a time.

If you are doing this just using perf on the command line and you don’t know about this, it’s going to silently multiplex these things together. If you’re trying to do something like divide your cache hits by your cache misses, it might have recorded those numbers from totally different points in your runtime; you aren’t guaranteed to have overlap there. You can do things like pin those counters. If you keep it to a small set, you can actually just look at these all the time. I like to track instructions and cycles and frequency-related stuff a lot. There are enough counters on any modern CPU to do that all the time. I can guarantee that these counters are always running and that you can get accurate information out of them.
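Once instructions and cycles are pinned and always running, the derived metrics are just deltas between readings. A sketch with made-up counter values (not real PMU output):

```python
# Two readings of pinned counters, one second apart. Values are invented
# for illustration.
t0 = {"instructions": 1_000_000_000, "cycles": 800_000_000, "time_s": 0.0}
t1 = {"instructions": 3_400_000_000, "cycles": 2_400_000_000, "time_s": 1.0}

d_instr = t1["instructions"] - t0["instructions"]  # 2.4e9 retired
d_cycles = t1["cycles"] - t0["cycles"]             # 1.6e9 elapsed
d_time = t1["time_s"] - t0["time_s"]

ipc = d_instr / d_cycles           # instructions per cycle
freq_ghz = d_cycles / d_time / 1e9 # effective frequency

print(ipc)       # -> 1.5 instructions per cycle
print(freq_ghz)  # -> 1.6 GHz effective frequency
```

Because the counters are pinned rather than multiplexed, both deltas cover exactly the same window, so the ratio is meaningful.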

In a world where power footprint is growing exponentially, it looks like, we need to be able to look at our energy efficiency. We have things like RAPL, which is Running Average Power Limit. We can actually look at things along the domain of CPU or DRAM and know how much power is being consumed. We can look at things like NVML, which is the NVIDIA Management Library that allows us some really cool metrics from our GPUs, such as how much power is being consumed. It tracks this both as a gauge, so we can look at instantaneous current draw. Good to know.

We can also look at the integrated view of that, measured in joules, and see how much energy a training run is costing you. Like, how much power went into that. You can do this down at the individual inference level and figure out, how much does it cost to generate an image or predict a token? I think increasingly we’ll see these metrics have a lot of importance in our worlds, because contending with the power limits in our data centers is going to be a big issue.
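The gauge-to-energy step is just numerical integration of power samples over time. A sketch with made-up readings, not real RAPL or NVML output:

```python
# Integrating a power gauge into energy: sample instantaneous draw (watts)
# at a fixed interval and accumulate joules. Numbers are invented.
samples_w = [310, 480, 495, 500, 498, 320]  # GPU power draw, one sample/second
interval_s = 1.0

energy_j = sum(w * interval_s for w in samples_w)
print(energy_j)            # -> 2603.0 joules over this 6-second window
print(energy_j / 3.6e6)    # the same energy expressed in kWh
```

With energy per run and joules per inference in hand, cost per token or per image is just a division away.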

eBPF – Extended Berkeley Packet Filter

Lastly, I’m going to talk a little bit about eBPF. I think this is where things are changing most rapidly and where we’re able to get some really cool and unique insights into our systems. The TL;DR about eBPF is, basically we give some code to the kernel and say, run this when something happens. This could be like some kernel function getting called. It could be hitting some tracepoint within the kernel. It could be even like probing a user space function. We can have the kernel do something on our behalf, like increment a counter. Now we can track really fine-grained events, and expose those metrics into user space and collect those into our observability stacks. To show some of the power, let’s look at some examples. Here is a BPF sampler that happens to be attached to the raw tracepoint for when a block request completes. We have this BPF program that’s going to get called every time. It gets some arguments from the kernel. It gets access to the request struct, the error code, and the number of bytes.

If we have that number of bytes, we can do really cool things. We can convert it to an index and increment a histogram bucket. Now we know our block I/O size distribution. I was once working with a team and a hardware vendor, where the vendor was trying to help us optimize the storage performance for this application. When the storage vendor asked, what does your workload look like? The service team was like, I don’t know, it looks like Hadoop. That doesn’t really tell you what’s happening. They want to know what your I/O sizes are. Are you read heavy? Are you write heavy? 4K, 16K, 32K? It matters a lot. This gives us the ability to dig in and look at these things when our teams don’t actually know what’s going on. We get insight into our workload. We can start characterizing that while we’re characterizing performance, and it’s just really cool.

We talked about block I/O sizes. If we trace when a request completes, we can trace when it was enqueued as well. Now we can measure the latency of these things. We can actually know how long our block I/Os are taking, not just how big they are. Super powerful. This actually would have also caught that SSD that was being super slow. You can backstop your health metrics with performance metrics when you have insight into these things. It’s really powerful. We can also look at syscalls. It’s really helpful to know the system call profile of your service.
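The enqueue/complete pairing works by stashing a timestamp per in-flight request and computing the delta at completion. In a real eBPF program the timestamps would live in a BPF map keyed by the request, attached to the block-layer tracepoints (block_rq_issue and block_rq_complete); this Python sketch just models the bookkeeping:

```python
import time

# Timestamps for requests that have been issued but not yet completed.
# In eBPF this would be a BPF map keyed by the request pointer.
inflight = {}

def on_block_rq_issue(request_id):
    inflight[request_id] = time.monotonic_ns()

def on_block_rq_complete(request_id, latencies):
    start = inflight.pop(request_id, None)
    if start is not None:
        latency_ns = time.monotonic_ns() - start
        latencies.append(latency_ns)  # in a real sampler, record() into a histogram

lat = []
on_block_rq_issue(0x7F00)
on_block_rq_complete(0x7F00, lat)
print(len(lat))  # -> 1: one latency observation recorded
```

Feeding those latencies into a histogram gives the block I/O latency distribution alongside the size distribution.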

You might have an extra read for every request, because you’re checking to see whether you’ve actually run out of data to read from your socket buffer. You can actually profile things this way. You can also look at things like the latency of each of these system calls and see if there’s some kernel performance regression, or something else weird happening on the system. Maybe you have a bunch of mutexes. You have a bunch of locking that’s happening. You can actually look at how long those locks are taking. You can see if the program is sleeping for 100 milliseconds just consistently. It’s really neat when you’re able to dig into these things and understand what the application is doing, across that boundary with the kernel.

Another really cool one, TCP packet latency. This is named after one of the tools in the main BPF tools repository. It’s maybe a little confusingly named, but I like this one a lot. What it’s essentially doing is measuring from the time the kernel says, there’s data on this socket, go ahead and read it whenever. It measures from that point to when your service actually calls read on that socket. Imagine being able to point at something like this and say, no, it’s not the network. No, it’s not the operating system. It’s not the server. It’s your service just taking a very long time to read from this socket. It’s not an infrastructure problem. There’s something that you need to change about your application. I think we’re going to see just a lot more of this stuff integrated. At what cost? eBPF is touted as being pretty cheap. I think that’s true when you’re doing things like plain counter increments. I think it’s totally viable to trace every syscall entry point and increment some counter.

The problem is if you need to do a hash lookup on the path here, a lot of the times you’re looking at the entry and exit points for something where you have some high cardinality dimension to index on, like socket address pairs. You really can’t do much about that. Yao and I were talking, and it’s maybe possible that a better or more performant hash table could be implemented to deal with this case. I think in the meantime, there’s some perf tricks that you can use today to avoid that.

I like to be able to run my eBPF samplers all the time. They’re really great in performance environments, but being able to see things in production is very important. The first trick is to avoid those hash lookups and use arrays wherever possible. If you’re basically keying on a process or a thread ID, just use that as the index. Just have an array that’s big enough to hold values for each of those. It’s not that much memory, really. You’re talking maybe 32 megabytes worth of memory to store the last timestamp for every process. Just index into it directly. Don’t use a hash lookup on that path; it’s going to be way too expensive. Somewhat related is that you can memory map these arrays into user space.

Now you don’t need to make a function call every time you want to read a value from this array, because you’re going to want to pull these things in. You’re going to want to read these counters with high frequency, especially in a profiling case. You want to memory map it so you can do plain memory reads. You don’t want to have to make a function call for every value in that array. There is a batch lookup as well, but even then the overhead is still higher than just memory mapping. That brings me to the last trick, which is: use plain arrays. It might be very tempting to use the per-CPU array type. It initially seems really convenient, but you can’t memory map it.

Now you’re paying some pretty significant cost in user space for this telemetry agent that you want running in production, on probably a small slice of the resources available for this type of tooling. It’s much better to just use a plain array: use the CPU core number plus some counter offset, and index into the array that way. This allows you to give each core its own cache line or set of cache lines. You want to pad this out so you don’t have false sharing. Once you do this, the increments for each of these CPUs are still very fast. You’re able to read this and have per-CPU counters across the host. It’s very powerful. These all help make eBPF sampling in production, I think, pretty viable.
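The per-CPU layout trick is just index arithmetic: give each CPU its own cache-line-aligned slice of one plain array. A sketch of that layout, with sizes that are typical x86 values and a helper of my own naming:

```python
CACHE_LINE_BYTES = 64
COUNTER_BYTES = 8                                    # u64 counters
SLOTS_PER_LINE = CACHE_LINE_BYTES // COUNTER_BYTES   # 8 counters per line
NUM_COUNTERS = 6                                     # counters tracked per CPU
LINES_PER_CPU = -(-NUM_COUNTERS // SLOTS_PER_LINE)   # ceil: pad to whole lines
STRIDE = LINES_PER_CPU * SLOTS_PER_LINE              # slots per CPU, padded

def slot(cpu, counter):
    # Each CPU owns a padded, cache-line-aligned region of the plain array,
    # so increments from different cores never touch the same cache line.
    return cpu * STRIDE + counter

NUM_CPUS = 4
counters = [0] * (NUM_CPUS * STRIDE)
counters[slot(2, 5)] += 1  # CPU 2 increments its sixth counter
print(slot(2, 5))          # -> 21
```

Because each CPU's region starts on a cache-line boundary, the plain array behaves like a per-CPU map for write performance while remaining memory-mappable for cheap reads from user space.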

I think the future is eBPF. We’re going to see more cases where we need to be able to explain what’s happening on shorter timescales and where we’ll want to know the distribution of things. Maybe we can even do things like look at our performance counters every time we enter some user space function, and then come out of it and see how many cycles were spent in that function. It’s maybe not quite doable right now. There’s still some user probe overhead that you might not want to pay. I think you could do things maybe on context switch when the task switches out where maybe you can do some per process accounting and not have the overhead of enabling perf at the cgroup level. I’ve seen that used in production, and I could actually measure the impact that that sort of setup had on latency for a cache where even it was just enabled for 10 seconds out of a minute. It wasn’t super high resolution. Wasn’t really capturing the whole minute anyway. You could just see your tail latency spike for that period of time.

I think eBPF gives us the ability to provide much cheaper performance counters. We’re able to do away with reading things out of sysfs and procfs. Those are really silly for high resolution sampling. You’re asking the kernel to take all these numbers it has internally, turn them into text so you can read and parse them, store them as counters again, and then export them again. It’s ludicrous. It’s really silly. It doesn’t work super well. The other thing is those files really don’t have very fine-grained time resolution. Some of those counters are incremented at a granularity of maybe 100 milliseconds, so you might not be able to get the really fine-grained reads of your counters that are going to be important when you’re operating services in the millisecond latency range.

Then, of course, being able to get our histograms out of things, being able to look at the distribution of events, being able to characterize not just the performance of the system but the characteristics of the workload. I think these combined are going to give us new insights into our systems. I use all of these performance tricks in Rezolus, and I’ve found it to be just super helpful.

Questions and Answers

Participant 1: How easy do you find it to correlate hardware performance metrics with software performance metrics?

Martin: That can be tricky. I think that’s where being able to probe the boundaries between components is really where you’re going to start seeing correlations. Maybe your writes are slow and you’re able to see that your syscall writes are slow, but you’re also able to look at your block I/O request latency and see that those are slow. I think as we start probing those boundaries between components, whether it’s user space and kernel space or kernel and hardware, and with the ability to aggregate and slice and dice things on the right dimensions, I think those insights will fall out of it. I’ve had pretty good experience using these types of things to find hardware performance issues. While that host with the toasted SSD was up, I actually implemented that BPF sampler, and was like, you can see just how bad write performance is to that disk.

Participant 2: On eBPF, so I’m mostly familiar with it with our Kubernetes workloads, specifically with like OTel and how they’re doing the Go instrumentation. Elevated permissions is always a concern for eBPF because you have to have essentially sudo to be able to get access to the user space. Do you foresee that improving somehow?

Martin: Yes, I'm hoping that the capabilities system will allow us more fine-grained access to things. I think, inherently, especially where we want to be able to probe kernel space like this, it's just always going to have to be trusted code to some extent. I think there are a lot of cases where that's manageable. These samplers don't tend to be very long in terms of lines of code, so I think you're able to do security review of this type of stuff fairly easily. I can see where we're probing into user space runtime performance, that might get a little nasty, especially if you're doing reads of user space data. Obviously, you can do things like intercept your TLS/SSL calls and get access to the plain data. There is some risk there, but I think it comes down to having concise samplers that get access to just what you need to have those metrics that you care about. Hopefully that keeps the security footprint small enough that you can manage it.

Participant 3: Do you have any tips for looking at histograms over time? We can look at the histogram right now, but I want to compare it to, say, what it looked like 12 hours ago. What does it look like off-peak?

Martin: Histograms always get hard that way because you wind up mixing things with different dimensions. If your workload has changed, your volume of requests has changed, it's very hard to compare those distributions. I was just talking to someone whose QPS had changed, and they wanted to compare the distributions between those two points. They had actually seen that their tail latency appeared to increase, but this could just be that, let's say, things take a slow path in some cases, and you always have this baseline noise where some things are going slow path.

If that happy path traffic decreased, it shifts those percentiles. I think that's really hard. Looking at heatmaps in something like Prometheus definitely gives you some good insight, but I think there are always mental hazards in terms of interpreting this data. That's the case for counters and gauges and all that too. I think there's some level of educating people that that can happen.

Then, being able to disaggregate those things. Say you're serving from your cache and your Redis instance returned a value at some point. If you're able to split those out from the cases where you had to go to the database, and look at the distributions of those two things separately, I think that gives you much better explainability. I think it always comes down to what level you're looking at the metrics at.

Participant 4: How many people are using eBPF in production? Because we are really looking at, in production, what tool you guys use and make it useful. Even from the AMD side, I’m looking into it.

Participant 5: Since you have used eBPF extensively, can you name some limitations of this?

Martin: Having to know what kernel functions you want to trace is always a big limitation. I don't think tracepoints are always where you want them to be. You're dealing with an unstable interface, because you're looking at what the kernel is doing internally. It's not a stable ABI. I think there's always an issue there where, like, the fields in the task structs change, so you have to handle that. The kernel can change things. If you care about your kernel probes a lot, that's going to be unstable. I think for a lot of the core stuff, hopefully you don't hit that too much, but it's definitely one of the issues.

One of the other things with eBPF is that since it's such a limited environment, you have to be very disciplined in what you're writing. You can't have unbounded for loops and things like that. To index into that histogram, I had to write a function that was basically just branching to count the leading zeros. You actually can't call the builtin to count leading zeros either. There are a lot of limitations in what you're able to do in BPF code: the instability of the interface that you're trying to probe, and the limitations of what you're able to write and pass through the verifier. If you're looking at something like uprobes, those are basically done with breakpoints, which still has a lot of performance overhead. There's some work being done to try and make that better; I haven't kept up with where that is right now. A lot of what I get the most use from is kprobes and tracepoints.

Participant 6: I work at a smaller shop that’s not going to have a low-level engineer doing this stuff themselves. Do you think we’ll see this bubble up to OS or Kubernetes cluster level visibility where we won’t have to write the entry points ourselves?

Martin: Yes. Rezolus is open source. It's available on IOP Systems' GitHub. I think you could run that. The actual BPF samplers there are very concise. You can do your own security audit if you're concerned about the attack surface there.
