Transcript
Fleming: Who can see a face in this image? What about this one of Danish electrical sockets? This one of some birds? That's weird: we all know there are no actual faces in these images, but pretty much all of you will put your hands up and say you can see one. It turns out the human brain is constantly searching and scanning for faces, and this particular phenomenon is called pareidolia. It's the tendency for your perception to try and attach meaning to the things it sees. Your brain is constantly looking for signal in noisy images, noisy scenes, and things like that. I think that as performance engineers, we have a very similar problem.
Anybody who's spent any time looking at performance charts is constantly searching for performance changes, regressions, and optimizations in the data. Just like your eyes can be tricked when you're looking for faces, and you'll see faces where there are none, we constantly get tripped up by this, and we see regressions where they don't exist. This is a time series. Your eyes are probably drawn to the spike there. That's an interesting data point. I'm not quite sure what's going on there. It would depend on the test and the system and a bunch of other factors, but I can tell you now that this is noise and not repeatable. It's a total fluke of the system. This data point could have come from absolutely anywhere.
Unlike with the faces in the images, we don't have millions of years of evolution to quickly filter this out. We don't know within an instant that there's no actual face there. This takes much more critical reasoning. It's much harder to be disciplined and teach yourself that actually you need to do something else. You can't just eyeball the charts. You wouldn't ever want to sacrifice this ability, though. Even though your eyes can lie to you — they can trick you into thinking you see things that don't exist — you still want that ability to look across a crowded restaurant and see an old colleague, an old friend. You wouldn't give up that for anything in the world. By the same token, your eyes, your intuition, and your pattern matching capabilities are incredibly valuable when we look at performance data.
Background
This talk is going to cover noise and benchmarking. We'll look at: where does noise come from? How do we define it? How do we measure it? Can we ever truly eliminate it? Also, the meat of the talk is going to be how you deal with noise in your performance data.
I’m co-founder and CTO at Nyrkiö. I used to be a Linux Kernel maintainer for a number of years. I’ve been a part of performance teams for various companies around the world. I’ve written a couple of research papers on automated distributed systems testing and performance testing. There’s a big caveat with this whole talk, which is that I’m not a mathematician. I have friends that are mathematicians, but everything in here I’ve approached from having a benchmarking issue and trying to find the mathematics to analyze it, to find solutions to the problems. I don’t recommend you do that. It’s incredibly time consuming and laborious. Just steal everything out of this talk.
What is Noise, and How Can We Measure It?
If we're going to talk about noise, which we are for the rest of the presentation, we need some working definition. I really like this one from Nate Silver's The Signal and the Noise: "I use noise to mean random patterns that might easily be mistaken for signals". I think this captures quite well the problem I talked about with pareidolia, which is seeing things that aren't there. Because crucially, what we do as performance engineers is make decisions based on our analysis of the data. If we make decisions based on bad data, we'll make bad decisions.
This is incomplete, I think, for our purposes, because I personally talk about results being noisy — like, there's a lot of noise in these results, these are noisy results. I don't think that definition fully covers it, because when we talk about noisy results, what we're really talking about is variation in the results. What I mean by this is, if your results have a wide range of values, a wide spread, a lot of dispersion, that can make analyzing the data much more tricky. This is great, because variation is a mathematical concept, which means we can use math to solve these problems and to work with this data.
You're all pretty familiar with standard deviation. One of the measures of dispersion, or spread in results, that I really like is the coefficient of variation, which is just the standard deviation over the mean, times 100. It's quite simple to calculate; NumPy has functions for calculating it. Crucially, because you're normalizing by the mean, you get a percentage, which I personally find quite intuitive for describing how noisy results are. You can say the throughput of this benchmark is 50,000 RPS plus or minus 5%, and because it's normalized by the mean, it allows you to compare across different runs of tests as well, which is quite helpful. Anybody who's spent any time in Ops has probably seen that mean in the denominator and is freaking out a little bit.
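As a rough sketch of that calculation with NumPy (the throughput numbers here are made up purely for illustration):

```python
import numpy as np

# Hypothetical throughput samples (requests per second) from repeated benchmark runs.
throughput = np.array([49800, 50550, 50210, 48900, 51020, 50370, 49640, 50110])

mean = np.mean(throughput)
std = np.std(throughput)          # population std; pass ddof=1 for the sample std
cv_percent = std / mean * 100     # coefficient of variation as a percentage of the mean

print(f"{mean:.0f} RPS +/- {cv_percent:.1f}%")
```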
We all know that means get pulled to either side by outliers, by the max and the min, so it's not what you would call a robust measure of dispersion. Still, for a normal distribution you can assume that most values fall within a certain range of the mean, and anything outside of that range is statistically unlikely. That's a good way to detect performance changes, because if you have a value all the way out here that this distribution says is very unlikely, there's probably something going on that warrants analysis.
This distribution, if you're familiar with it, is the normal distribution. Unfortunately, many benchmarks and datasets don't actually follow it, so we need something more robust — a non-parametric test. If you have a cache with a fast path and a slow path, you get a bimodal distribution, so you get two of these peaks. If you have a fanout in a microservice architecture and you're hitting many services for each request, you'll get a long tail in latency, which, again, doesn't follow this shape.
We need something that doesn’t rely on the mean. For that, I quite like the interquartile range, if for no other reason than when you’re doing benchmark runs, and you’re collecting metrics, usually you’re collecting percentiles for something. I know I do for latencies. The IQR is just you take the 75th percentile and subtract the 25th and that’s it. It’s a really straightforward measure of dispersion around the middle half. It ignores outliers, or if you flip it around, it’s a good way to detect outliers, because you’re ignoring the things on the outer edges. It doesn’t suffer from that skew that you get with the mean.
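Here's a hedged sketch of the IQR calculation, again with made-up latency numbers; the 1.5 × IQR outlier rule at the end is just the common Tukey-style convention, not something specific to any benchmark tool:

```python
import numpy as np

# Hypothetical latency samples in milliseconds, including one outlier.
latencies = np.array([12.1, 11.8, 12.4, 12.0, 35.9, 12.2, 11.9, 12.3, 12.5, 12.0])

p25, p75 = np.percentile(latencies, [25, 75])
iqr = p75 - p25   # spread of the middle half; the 35.9 ms outlier barely moves it

# A common convention: anything beyond 1.5 * IQR from the quartiles is a suspected outlier.
low, high = p25 - 1.5 * iqr, p75 + 1.5 * iqr
outliers = latencies[(latencies < low) | (latencies > high)]
print(f"IQR={iqr:.2f} ms, suspected outliers={outliers}")
```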
Maybe, though, you want to include all the data — because remember, the IQR is a measure of the middle half of your dataset: it measures things around the median and ignores the outliers. It's simple to calculate, but if you want something more all-encompassing, slightly more complex but still a really good, robust measure is the median absolute deviation. It is more complex, so I'm going to describe it.
To me as a non-mathematician, this is scary. You measure the distance of each data point from the median — you just subtract the median from it and take the absolute value, basically. That gives you a new set of data points, and then you find the median of those. It's like two levels of median. What that ends up doing is giving you a much more stable measure of dispersion, because you focus on these medians. It is more complicated, but it uses all the data points. Depending on what your performance results look like, this may be a more accurate, useful measure.
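A minimal sketch of the two levels of median, with made-up timings:

```python
import numpy as np

# Hypothetical benchmark timings in milliseconds, with one wild outlier.
timings = np.array([101.2, 99.8, 100.5, 100.1, 250.0, 99.9, 100.7, 100.3])

median = np.median(timings)
abs_dev = np.abs(timings - median)   # distance of each data point from the median
mad = np.median(abs_dev)             # median of those distances: two levels of median

print(f"median={median:.1f} ms, MAD={mad:.2f} ms")
# For comparison, np.std(timings) is blown up by the 250 ms outlier,
# while the MAD barely notices it.
```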
Where Does Noise Come From, and Why Can’t You Escape It?
At least now we have some way to measure noise in our system. Since we can measure noise in the benchmarks, maybe we want to look at where it comes from — maybe we'd want to do something like minimize the noise in a test. There are a few places where, in my experience, noise comes from, and it pays to be aware of them, because, unfortunately, the answer to the question of where noise comes from is: absolutely everywhere. The reason for that is the non-determinism in modern systems. By non-determinism I mean something usually related to execution order: you can't guarantee the exact sequence of events happens the same way every time. What's interesting is that the reason we have this non-determinism is usually for better peak performance.
If you think of caches or sharing of resources — things where, if a resource is idle, you try to use it efficiently — that leads to non-determinism, because it depends on the load on the system, on exactly what's happening, on the events before it. This is the source of a lot of the problems in benchmarking. Crucially, in modern systems, it's fundamental: no one's going to sacrifice peak performance for more stable results, not usually anyway. We have to have ways of dealing with this. In my experience, the number one place that benchmark result instability comes from for CPU-bound benchmarks — and it still trips me up at least once every three or four months — is the CPU frequency changing.
In modern CPUs, the clock speed of a CPU is related to the power that it draws: the higher the frequency, the more power it draws. What modern operating systems will do is crank down the frequency to save power. This makes sense if you have a light load on the system, but during benchmarking there are usually periods where we have a very heavy load on the system. What the operating system will do, unless you tune it, is start off at a low frequency, and then once the OS notices that there's a high load on the system, it will gradually start cranking up the frequency. This isn't instantaneous. It does take some time.
Unfortunately, for us as performance engineers, it’s an indeterminate amount of time every time you do it. There’s no guarantee it will happen exactly the same way every time at the same point in the test. This creates random fluctuations in your charts and your metrics. If you are lucky enough to be running benchmarks on bare metal machines, and you own the hardware, you can lock the frequency. Linux in particular has CPU frequency drivers, and you can pin the frequency to a specific value and basically get around this whole problem. If you run benchmarks in the cloud where maybe you don’t have permissions to do that, then this is going to be an issue for you.
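For completeness, here's a rough idea of what pinning can look like through the Linux cpufreq sysfs interface. This is only a sketch: it assumes a bare metal Linux host, root privileges, and a driver that exposes these files, and the exact behavior varies by driver, so treat it as illustrative rather than a recipe.

```python
import glob

# Sketch only: pin every core to its maximum frequency via the cpufreq sysfs
# interface. Assumes a bare metal Linux host, root privileges, and a driver
# that exposes these files; behavior varies by driver, so verify on your system.
for cpufreq_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq"):
    with open(f"{cpufreq_dir}/cpuinfo_max_freq") as f:
        max_khz = f.read().strip()

    # The performance governor stops the kernel from scaling the clock down.
    with open(f"{cpufreq_dir}/scaling_governor", "w") as f:
        f.write("performance")

    # Pin min == max so the frequency can't move during the benchmark run.
    with open(f"{cpufreq_dir}/scaling_max_freq", "w") as f:
        f.write(max_khz)
    with open(f"{cpufreq_dir}/scaling_min_freq", "w") as f:
        f.write(max_khz)
```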
Here's a little example of that frequency scaling in action. I had a REST API-based benchmark running in a public GitHub runner, and measured the CPU frequency during the benchmark run. You can see it bouncing around as the benchmark runs, in relation to the load and exactly what's going on. As you might imagine, this produces really unstable results. Crucially, they're unstable in different ways on different runs. If you run the same thing twice, you get different amounts of noise, which is really problematic if you're trying to figure out whether the software change you just introduced caused a performance change. It makes that very difficult to do.
The takeaway from this is: don't run benchmarks in GitHub runners. Garbage collection is another source of noise in benchmark results. I know things are much better recently, with recent versions of Java, for example — you can tune this — but it's amazing how many times it catches people out, particularly on older versions. I was running a throughput-based benchmark on Apache Cassandra, which is written in Java, and saw these dips in the throughput, and figured out that the dips are actually related to garbage collection cycles — they line up pretty well. I'm sure many of you are still running Java 11, maybe Java 8? These are still problems there; I know things are better in more recent versions. The last one to cover, just generally: resource contention is also a major source of noise.
In particular, as a former kernel maintainer, I often think about resource contention at the lock level — locks that are contended — but it scales up to distributed systems and microservices and things like that, so you can have services that contend on specific endpoints in different microservices. The really nefarious thing about this one is that while you measure the CPU time or the request latency for your particular benchmark, resource contention often happens because of totally unrelated entities accessing the resource. It's almost like a hidden cost. This can make it really tricky to eliminate that noise and get useful results.
One solution I've heard people advocate for all these problems — where noise comes from, and the possibility that the metrics we're using are somehow flawed — is proxy metrics. Proxy metrics are actually pretty good. They do help in certain situations. When I say proxy metric, I mean you're measuring something that the user doesn't care about, but you do. Which is a crass way of saying you're measuring something that has no visible user impact but makes your life easier as a developer, or a performance engineer, or somebody who is responsible for the performance of software.
A good example of this would be instructions per cycle, maybe allocations, things like that. There's no one obvious way to do it, no clear winner, no silver bullet, and proxy metrics also suffer from problems. Here's a C++ microbenchmark for a Kafka-compatible streaming application. There's the time series, so you've got the benchmark being run after every commit for a couple of months or so. You can see with your eyes these two bands of values: a high one at about 1.35 million, and a low one at about 1.2 million. This chart actually illustrates why you wouldn't ever want to completely give up your ability to eyeball charts — not as the initial step, I don't believe in eyeballing charts to find regressions — but as a follow-up.
If you use other methods, which we'll talk about later, to find the regressions, then eyeballing as part of the analysis is really powerful, because I imagine everyone here looks at this and figures out it's a bit weird: either it's noisy or there's something off with the results. What I would hope to have seen, assuming no performance change, is just a really stable line somewhere. This is a proxy metric, because what we're actually measuring in this chart is instruction counts for each of the runs: you run the benchmark and you count how many instructions it took. It looks like there are two values. What I found out on some analysis was that if you run on an AWS m5d.12xlarge instance, the fine print of the documentation says you will either get a Skylake or you will get a Cascade Lake. This explains why we have two bands of values.
If you get a Cascade Lake, you use more instructions for the exact same code; if you get a Skylake, you use fewer instructions. I never found out why — I just found this out. One thing to note about eyeballing charts is that your choice of visualization is also really important, because maybe you can see the banding there, but I don't think it's completely obvious — you could probably have a good guess that this looks like two different values. If you did something like a scatter plot instead, it looks a lot more obvious: you see the banding clearly. Having the right visualization tool is really important. This is the problem with proxy metrics, or at least one of the dangers to be aware of: this is a different metric for the exact same runs on the same hardware.
For each of the runs, we measure instructions, and we measure request latency in nanoseconds. The problem with proxy metrics is that they are proxies: they don't measure something your user cares about. Your user probably cares about time, and proxy metrics are usually something that's not time-based — allocations or instructions. The user doesn't care which microarchitecture the benchmark ran on; they don't see that difference, and it's completely missing from this chart. Proxy metrics can be good, but they are not without some rough edges.
We talked a lot about the nature of noise and how to measure it. I like to have this mental model of the different kinds of performance tests you can run. Microbenchmarks will be analogous to unit tests, something very small. The surface area of the thing you’re testing is very minor. Maybe it’s one function, maybe it’s a couple of instructions, but definitely no I/O, something very small. As you scale up the scope of the thing you’re testing, you go to benchmarks. Then, end-to-end tests really should be representative of what your users see. I’m talking like full-blown server and client architectures. Maybe you use a staging environment, but it should be totally representative of the things your users and customers are doing.
Each of these types of test has an intrinsic amount of noise within it, because it's linked to the scope of the test. As you broaden the scope and you go from microbenchmark to benchmark to end-to-end test, the amount of stuff you're testing is bigger, and the sources of noise get bigger too. What I've found is that a microbenchmark can usually be tuned and pushed to have a minimal amount of noise. It's true that microbenchmarks are sometimes more susceptible to noise because you're measuring a very small thing on very small scales.
Things like cache alignment really show up here, but you tend to have more control over cache placement with microbenchmarks, so you can squeeze out the noise. That's less true at the end-to-end test side of the spectrum, and that's actually a worthwhile tradeoff. You can minimize noise for end-to-end tests — I've seen InfoQ talks from the financial trading folks where they minimize object allocation in Java on the fast path, and all that's really good — but crucially, end-to-end tests should match whatever noise your users are seeing. If you're going to tune it away in your test, you need to do it in production too, otherwise it's not truly representative.
Detecting Performance Changes in Noisy Data
Let's talk a little bit about detecting changes in performance data, particularly since hopefully I've convinced you that some amount of noise is going to be there. You can't really escape it; you can tune things, but you're going to find noise. Hopefully I've also convinced you why you wouldn't eyeball charts as a first step. Just don't do it. It's good for subsequent analysis, but you don't do it initially. It's too easy to find false positives, it's time consuming, and it's super boring, I've found — just staring at charts every morning. There are better things to do with your time. The next step up is probably thresholds. We talked about measures of noise.
Maybe you’re thinking, I could use that measure of noise, maybe multiply by a fudge factor, and anything outside of that is a potential regression. That’s definitely one way to do it. It’s certainly better than eyeballing things. You’re using a bit of statistics there. I have found that thresholds tend to balloon. If you have a system that works well and you’re getting value from it, you have a threshold per test.
Sometimes you'll expand to per metric per test, sometimes per metric per test per release. It very quickly explodes and gets out of hand. These all require a certain amount of tuning as well. Imagine you have a huge shift in performance because you implemented an optimization — that's a good change, but maybe you now have to go and tweak all the thresholds. Or, if you reduce the noise, maybe you've got to go and tweak the thresholds too. It's better than eyeballing, but it's still not quite there. This is just a copy of the graph from before that shows how you could maybe use thresholds. The three-sigma rule is a good rule for normally distributed data: if a value is more than three standard deviations from the mean, it's probably an unlikely value, so maybe that's a performance change. Again, you can do this, and I know systems that do. I think SAP wrote a paper on this many years ago. These approaches tend to be quite high overhead for the developer.
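As a sketch of what such a threshold check might look like — the baseline window, the made-up numbers, and the three-sigma fudge factor are all illustrative choices that would need per-test tuning:

```python
import numpy as np

def exceeds_threshold(history, new_value, n_sigma=3.0):
    """Flag new_value if it sits more than n_sigma standard deviations from
    the mean of the recent history. Purely illustrative: the history window
    and the fudge factor both need the per-test tuning described above."""
    baseline = np.asarray(history, dtype=float)
    return abs(new_value - baseline.mean()) > n_sigma * baseline.std()

# Hypothetical throughput history (RPS) and a suspicious new run.
history = [50100, 49800, 50300, 50050, 49900, 50200]
print(exceeds_threshold(history, 46800))  # True -> worth a closer look
```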
I'm going to talk now about change point detection, which was new to me when I learned about it a few years ago. I'll cover what it is, and then a little bit of the backstory and some of the details. Change point detection works really well in exactly the situation I described where thresholds don't: when you have a step change in performance. This is a benchmark over time, some time series data that measures throughput — it going up is good, you're getting more performance out of the system. With thresholds, you have this problem that maybe you need to keep tweaking them as the performance changes, even though these are good changes.
Change point detection basically sidesteps that, because what it will do is find the places in a distribution or a time series where there are fundamental shifts in the underlying distribution — which, for us, is a fancy way of saying it finds performance changes in your metrics. You can guess where there might be some here. To me, these look like change points: you have a persistent and fundamental shift in the performance of this metric. Change point detection, as a general concept, has been around for decades. It's used in biomedical research, in financial trading, fraud detection, a whole bunch of stuff. It's only in the last five years or so that it's been used for software performance.
I first learned about it from a paper from MongoDB, which was done in 2020. It's actually how I came to meet my co-founder. This was, I think, the paper that introduced it to a lot of people. Then a few years later, in 2023, some colleagues from DataStax and I wrote a follow-up paper where we basically improved on the first one. Now it's used in a bunch of different companies and in different products. It's really starting, I think, to gain traction as a useful tool in your toolbox for measuring performance and detecting changes. Because the field is so old and vast, there are many algorithms, and there's no way I could possibly cover them all. What I will do is focus on E-divisive, because that's the one used in the Mongo paper and in the paper that I co-wrote, and provide a general framework for thinking about change point algorithms.
In pretty much all change point algorithms, there are basically three components to the process. You need some method for moving through the time series and finding things to compare. Then you need a method to do the comparison. Then you need a third, final method to make some determination about the things you found. What this looks like is: the search and the comparison work hand in hand, and they go through the time series trying to find candidate change points — things that look like change points — but they don't make a determination about whether they actually are change points. That happens in the filter function. What you're looking for is statistically significant changes in the distribution.
You have these three functions, and the way you implement each of them varies. For the search, you can use pretty much any search technique. You can do sliding windows side by side, moving through the time series and comparing the things in the windows. You can do top-down, where you take your time series, split it into chunks, and then compare the chunks. Bottom-up is essentially the reverse of that, where you start with small chunks and merge them together. E-divisive uses a top-down search; it's a bisection-based strategy. I've covered a couple of times this idea about the distribution of your data, whether it's non-normal, and whether a method is parametric. Non-parametric means you don't assume any properties of the underlying distribution, which for non-normal data is pretty handy. If you know your data is normal — probably not, but maybe it is — you can use standard maximum likelihood estimation type stuff.
For the non-parametric methods, you can make further implementation choices based on whether you have multiple metrics or a single metric. You can find all of these on Wikipedia, basically — you can use any of these mathematical tools for comparing distributions, or comparing samples from distributions. The filter is the final step, and again it takes into account the nature of the data. You can use a Student's t-test, which is actually how Hunter, the tool from the paper I co-wrote, works. Or you can do permutation-based tests. There's a bunch of other stuff you can do. The point here is that, for both the comparison and the filter, you need to take into account the shape of your data — what it actually looks like — to get the best results.
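To give a feel for what the comparison and filter steps can look like, here's a hedged sketch using SciPy on made-up data: a non-parametric Mann-Whitney U test for the comparison and a t-test (a Welch variant here, for illustration) as the filter. Real change point implementations do considerably more than this.

```python
import numpy as np
from scipy import stats

# Made-up throughput series with a shift partway through.
before = np.random.default_rng(0).normal(50000, 400, size=30)
after = np.random.default_rng(1).normal(47500, 400, size=30)

# Comparison: a non-parametric test makes no assumptions about the distribution.
u_stat, u_p = stats.mannwhitneyu(before, after)

# Filter: a t-test as the significance check (Hunter uses Student's t-test;
# the Welch variant here is just for illustration). A tiny p-value suggests
# the two segments really do come from different distributions.
t_stat, t_p = stats.ttest_ind(before, after, equal_var=False)

print(f"Mann-Whitney p={u_p:.2e}, t-test p={t_p:.2e}")
```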
I'm going to walk very briefly through the E-divisive algorithm. This is not supposed to be all-encompassing — the steps are pretty intricate, and there's some matrix math in there which I won't go into. Hopefully it just gives you an idea of how a top-down, non-parametric CPD algorithm works. We start off with one time series, and we find some place in it where we want to cut it into two chunks. Then we recurse on the left and the right-hand sides. Let's take the right-hand side first and do the exact same steps again: we search and find a candidate change point on the right-hand side. Then we apply the filter — is this a significant change? No, so that side is done. The filter method is used as the stopping criterion in a bisection-based algorithm.
Then we go to the left, chop that in half, and do the same thing on each of those little chunks. Because the field is so vast, there are many frameworks, libraries, and projects that implement CPD — it's not just used for detecting performance changes, it's used for a whole range of different stuff. The signal-processing-algorithms project is the one from MongoDB; it's what they wrote for their paper in 2020. Ruptures is a Python library with a vast collection of CPD algorithms. If you want to just try some out, that's a really good one to get started with: you can plug your data in and go nuts. Perfolizer, created by Andrey Akinshin, is used in BenchmarkDotNet, and again has multiple CPD algorithms you can take a look at. Hunter is the open-source project for the paper I co-wrote, which is based on the MongoDB library but uses Student's t-test and gets more deterministic results.
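If you do want to plug some data into Ruptures, a minimal sketch might look like this; the synthetic series, the choice of binary segmentation, and the penalty value are all illustrative, not recommendations:

```python
import numpy as np
import ruptures as rpt

# Synthetic time series: a benchmark whose throughput drops partway through.
rng = np.random.default_rng(42)
signal = np.concatenate([
    rng.normal(50000, 500, size=60),   # stable region
    rng.normal(46000, 500, size=60),   # after a hypothetical regression
])

# Binary segmentation is a top-down search, similar in spirit to E-divisive's bisection.
algo = rpt.Binseg(model="rbf").fit(signal)
change_points = algo.predict(pen=10)   # indices where the distribution appears to shift
print(change_points)                   # the last index returned is simply the series end
```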
I'd like to say CPD is amazing and you should use it no matter what, every time, for everything. Unfortunately, that's not true, because of the way CPD works with these candidate change points: you need a certain number of data points to decide whether there's a performance change or not. For Hunter, it's between 4 and 7. What this means is that, as new data points come in, change points can be calculated after the fact. You need to be able to handle a system where, even though you've recorded your benchmark run and moved on, maybe days later you'll get a performance change notification. I've found this works really well. CPD in general works really well as the backstop in a performance testing framework.
Maybe you have some faster, less precise measures of change — using some of those measures of noise and threshold tricks to find performance changes — and then CPD acts like the umpire in cricket, right at the back with all the data, making determinations across the whole data range about whether there's a performance change. You need multiple data points to run these algorithms. Also, compared with something like threshold-based techniques, where you're using coefficient of variation or something, CPD is more computationally expensive. It just is. The more data you have, the more metrics you have, the longer it takes to calculate.
Key Takeaways
Your brains are amazing. They are amazing pattern matchers. Sometimes we get tricked, and for performance charts in particular, it’s difficult to correct ourselves. You need to use some statistical tools, particularly because noise is inevitable, it’s everywhere. Even if you try and minimize it, there are limits to what you can do. CPD is a good tool to have in your toolbox for working with noisy benchmark data.
Questions and Answers
Participant 1: You said one of the drawbacks of CPD is that sometimes it takes days until you notice, because you need several data points. Do you know any way that I could detect performance change or regression immediately in a noisy environment, or is this completely hopeless?
Fleming: You can still detect performance changes in noisy environments immediately. It depends on the magnitude of the change. I have a lot of faith in CPD, and so I will do things like use alerts based on it, or alert the whole team. With detecting changes immediately in a PR, for example, I don’t put as many restrictions on it, I don’t block the PR. I don’t necessarily alert anybody, because it tends to be less reliable. You can do it, but if you have a really huge performance drop, you can catch that immediately. Depending on the noise and the magnitude of the change, that becomes more difficult. It’s easy for small changes to slip through. One of the nice things about CPD is it catches this death by a thousand cuts problem. You can totally do it.
Participant 2: In those examples that you showed, there was quite obvious cliffs of change. Over time on a project, there might be so many incremental, or such an incremental performance degradation, is there a risk that it might not be significant enough as a single data point to trigger an alert or a concern? If you take a step back and look at its journey over the last 12 months, actually, you’ve gone from something that was highly performant to now something that is the opposite.
Fleming: This is the death by a thousand cuts thing I alluded to. CPD, depending on the algorithm you pick, alters the size of the distribution it looks at — window size is the wrong word, because there are specifically window-based methods, but that's the idea. It can go back over the whole dataset and see those soft, slow declines, and it will flag a change point for them. Exactly where it places the change point depends on the cleanliness of your data and things like that, but it's certainly capable. I've seen it work where it definitely noticed a degradation, even though it was minor, over a long period of time. In the most extreme case, if you compared the data on the left-hand side with the data on the right-hand side, you would see that they were different. That's all that's really required for CPD: the ability to pick some size of distribution and compare the two. It should still catch that.
Participant 3: You were talking about the problem of detecting changes quickly at the front edge. Is there an opportunity to use CPD with different parameters to detect that change rapidly, or does that become almost like an anti-pattern in a way?
Fleming: One of the things that I like — and Hunter actually has a way to do this — is you can use CPD to find the most recent stable region of a time series. One of the problems other methods have, if you're trying to use the whole time series, is that things get washed out as the performance changes. With CPD, you can say: give me the most recent change point, and from that point to the end you have a stable region. That, used in combination with basically any kind of statistical test, is really powerful.
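As a rough sketch of that idea — the change points themselves are just hard-coded here, standing in for whatever CPD implementation you use:

```python
import numpy as np

def latest_stable_region(series, change_points):
    """Return the slice of the series after the most recent change point.
    The change_points list would come from whatever CPD implementation you use."""
    start = change_points[-1] if change_points else 0
    return np.asarray(series[start:], dtype=float)

# Hypothetical series, with change points (indices 40 and 120) supplied by CPD.
series = np.random.default_rng(7).normal(50000, 400, size=200)
stable = latest_stable_region(series, [40, 120])

# Any simple statistical check can now run against just the stable region,
# e.g. flag a new run more than three standard deviations from its mean.
new_run = 47000
print(abs(new_run - stable.mean()) > 3 * stable.std())
```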
Participant 4: Have you got any views on how you can make performance engineering a concern of regular developers?
Fleming: There are a number of things that we could do. I'd like to see things done differently — I'd like to see it taught a lot more. It seems to be one of those things that people have to grapple with by themselves, which works, that's totally fine, but I think as senior people in tech, we should be teaching this to more people. It is a skill. It's totally learnable by everybody. I think that would help a lot: just people who know what they're doing teaching other people. You could do brown bag lunch talks, things like that.
I've done those at companies I've worked for, and they're very well received, because most people haven't had it explained in a comprehensive way. There's a lot of value in these kinds of talks, where it's condensed, there's a narrative to it, and it's thought through, rather than sending people blog posts — which are cool, but there's a difference in sitting somebody down and walking them through an entire talk specifically about performance engineering. So education, I think, is one.
I think having the stakeholders in projects agree up front about the performance characteristics of things would make this a concern for many teams. This is like NFRs and things like that, and pushing that conversation to the start. One of the things that I did for one of the distributed systems projects I worked on — we were adding a new feature, adding consistency to an eventually consistent database — was to get people to agree at the front on what the expected performance characteristics were, and then implement performance tests at the start, so that I could show people the numbers. Numbers are really powerful. I think if you have that, it really helps. Then the third one is tying it to revenue. If you can show people how bad performance impacts revenue, everybody listens to that.
Participant 5: There was one bit where you had a choice to make in your functions about like, is it normally distributed? If it is, I’ll use a maximum likelihood estimator, otherwise, I’ll use something else. How do you test whether your distribution is normal or not?
Fleming: I’ve got two answers. The short one is, assume it’s not, and just don’t use any of the parametric methods. There are ways to test for normality, if you Google it. That’s what I do, I would just Google, normality test. There are ways you can test for it. It becomes cumbersome. It depends on the benchmark. It depends on the dataset. I think a lot of the time, it’s easier just to assume it’s not.
Participant 6: For people from a data science or analytics background, tests like these may come super naturally. Have you had any luck collaborating with colleagues in the data science department, for instance, on how these methods work for monitoring these systems?
Fleming: No, I haven’t. I would love to do something like that.