Transcript
Lawall: I’m going to talk about Opening the Box: Diagnosing Operating System Task-Scheduler Behavior on Highly Multicore Machines. Basically, our situation is that we have our machines. They can be simplified as applications on the top, hardware on the bottom. The application does what the user wants. The hardware provides all the resources to make that possible. We need something in the middle to mediate between them, to ensure all the interactions are safe, correct, not introducing security violations, and so on. That’s the operating system. That’s going to be the subject of our talk.
The operating system, as the piece of software that mediates access to the hardware, can have an important and unexpected influence on the performance of our applications. Sometimes, we need to actually understand that. Most of the time, hopefully, we can just ignore the operating system and hope everything works for the best. Sometimes, for example when changing from one operating system version to the next, things might be unexpected. Performance might be different. Then one has to actually understand what’s going on. In this talk, we’re focusing on Linux, just because it’s open source, available, and familiar to me. We’re focusing on the task scheduler. The task scheduler is the thing that figures out what tasks should be running on what core at what time. What we’re interested in is diagnosing performance regressions. Our target is large multicore machines, so server-class machines.
The Task Scheduler of the Linux kernel
The task scheduler of the Linux kernel, what is its role? Basically, it has two things to do. One of them is to decide where each task should run: this one should run on core 27, this one should run on core 148. Then, once those tasks have been placed on those cores, maybe there are three different tasks on a particular core, and it should decide which of those gets to run at a particular point in time. If you’re familiar with the Linux kernel source code, we’re talking about the files core.c and fair.c. We’ll think about how this task scheduler can impact the performance of applications.
Basically, the problem is that to do these things, the task scheduler needs to make decisions. The other problem is that a general-purpose operating system, such as the Linux kernel, or Windows, or macOS, any of these operating systems that people are just using for doing random things, has essentially no information on which it can base these decisions. No real information. It can get some hints from previous behavior, but it doesn’t know what’s going to happen in the future.
The problem is that poor decisions can slow tasks down. This can have not just a short-term effect, but can go on for some time and have a visible impact on how the application runs. This is a slide that we’re going to come back to at the end. This is where we’re going. This is an example of performance impact. The red line here is the vanilla Linux kernel. On the x-axis we have a bunch of runs. I’ve sorted the runs by running time: the fastest runs are on the left, the slowest runs are on the right. It’s not chronological. The y-axis is the running time. If you focus on the red line, the vanilla Linux kernel, you can see that the running time slowly increases. Some runs are a little bit faster than others.
At the end, actually what we’re concerned about is this thing over here. Things are going up sharply. Same inputs, same computation, but very different running time. These numbers here are random. How badly things go just depends on some unknown thing that happens during the execution. This is where we’re going. You’ll see this slide again later.
The Basic Issues for Scheduling (Work Conservation and Locality)
First, I’m going to start out with, what are the basic issues for scheduling? One very important issue, which will come up over and over during the talk, is called work conservation. The basic idea of work conservation is: we have a task that’s waking up, and where should we put it? Say there are four cores available. None of them are doing anything. We can put it anywhere. It doesn’t really matter. There we have our task 1. We’ve put it on core 0. Now we have another task, task 2. It’s waking up, and we ask ourselves the same question, where should it go? It could go on core 1, core 2, or core 3. Those are all available. Putting it on core 0 would not be a very good choice. If we put task 2 on core 0, either it will have to wait for task 1 to finish, or maybe it will preempt task 1, and then task 1 will have to wait for task 2 to finish. That’s all very unfortunate, because there’s plenty of empty space over here for those other tasks to run.
The idea of work conservation is that no core should be overloaded if any core is idle. This is just a concept that you can think about. It’s not the overriding concept; there could be another good decision in certain situations. But it is a concept that is important in many cases. Another concept, which is perhaps in contradiction with work conservation, is locality. For locality, we already have task 1 running on core 0, and we have the same question again: we have a task 2, where should we put it? Now we have this little two-bar thing here, which means it’s a two-socket machine. We have basically two NUMA nodes. Now we might be concerned about memory accesses. We again have our situation with our three empty cores. If task 2 is sharing a lot of memory with task 1 or communicating a lot with task 1, then it could be a good idea to put task 2 on core 1.
On the other hand, if task 2 has no interaction with task 1, it might be a good idea to put it over here so that it doesn’t destroy the cache of task 1. There are some different tradeoffs. Again, we have to know what’s going to happen in the future. We have absolutely no idea about that information. That’s what makes everything complicated.
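To make the work conservation idea concrete, here is a minimal sketch, in ordinary C, of work conservation expressed as a predicate over per-core run-queue lengths. This is purely illustrative; it is not how the kernel expresses the concept.

```c
#include <stdbool.h>

/* Illustrative only: work conservation says no core should be overloaded
 * (more than one runnable task) while some other core is idle. */
bool work_conserving(const int *nr_running, int ncpus)
{
    bool any_idle = false, any_overloaded = false;
    for (int i = 0; i < ncpus; i++) {
        if (nr_running[i] == 0)
            any_idle = true;
        else if (nr_running[i] > 1)
            any_overloaded = true;
    }
    return !(any_idle && any_overloaded);
}
```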
A Challenge
Basically, we have seen these principles. If you do things right, things go well. If you do things wrong, things go badly. Overall, the task scheduler controls the application’s access to the CPU, and so it controls the performance of the application. Work conservation and locality are not just about the cost of scheduling itself: a placement decision affects all of the execution of the application after the task has been put in a particular place, and so it can have a real impact on application performance.
Brendan Gregg, who is well known for his work on performance tools such as perf, says we need to be able to understand performance quickly and completely. This talk is pushing you in the direction of completely, so you’re not just saying, this is faster, this is slower, I don’t really know what’s going on, but this is better because it’s faster, something like that. Here we want to really understand what the operating system is doing and why, and how it’s impacting the performance of the application. I’ve been working on this problem for the last year, actually, so I’m not sure if I can really say that we’re doing things quickly, but at least we try to understand things completely. The problem that we have to understand is the operating system, and the scheduler itself.
We don’t really want to have to look at the operating system in the first place, and the scheduler is buried deep within that operating system, so how can we actually understand what it’s doing, what decisions it’s making at what point, and so on? That’s what we’re getting at in this talk.
Tooling
Fortunately, some help is available. There’s a tool which is called trace-cmd. You run your application here, and trace-cmd record will record all the scheduling events, and then you can see precisely what’s going on at every nanosecond, basically. Then you can use trace-cmd report on your trace, and then you get this output. This is obviously extremely abbreviated. You can have gigabytes of data coming out, because many things happen on your machine all the time, and this trace is going to include all of them. This trace gives you all the information. You know what command’s running, what its PID is, what it’s doing, what time, and so on. The trace is sorted chronologically, which is in some sense helpful, but it’s hard to really trace through what’s going on. Here things are happening on core 26, core 62, and so on. They don’t really have anything to do with each other, so it’s hard to look at this textual output and actually understand what’s going on with the application.
Fortunately, there’s also another tool that comes with it, called KernelShark. KernelShark will show you the different CPUs that are on your machine and what each CPU is doing at each unit of time. Then we have the same output we saw on the previous slide for when you want more detail. This is a very nice tool. You can zoom in, and then you get more information about a particular point. It’s a bit frustrating to use if you have a very big machine, because here we see two cores, and it’s already taking up most of my slide just to see those two cores. It uses a lot of space, and it’s a bit too friendly in certain directions that are not useful for all tasks. It’s definitely useful for certain tasks, but maybe not for everything. Actually, in my own work, I found it useful to have an overview of all of my cores at once to see what’s going on everywhere, and this tool was not really completely successful in that direction.
Basically, we want to be able to visualize these traces in some way, and we’re going to see a bunch of different ways we can visualize them. We want to see all the activity on all the cores at once. We want it to scale to multiple gigabytes of data. I don’t know about your applications, and I don’t know if that’s enough to understand them, but that’s at least enough for what I was looking into. I found it useful also to have something that would produce files that I could save, that I could compare, that I could share with my colleagues, and so I wanted to produce PDFs. Then, something that my colleagues complain about a lot is that I’ve totally given up on interactivity. With KernelShark, you can zoom. It’s very nice. You see something immediately. With all of my tools, you have to actually run the tool again and say what range you want. I haven’t actually found any drawing tool that gives both the interactivity and the scalability up to the size of the data I was interested in.
Basically, I’m proposing four tools. This is the current state of the work. Dat2graph is a horizontal bar graph. It’s like KernelShark: it shows you all the cores and what they’re doing at any point in time. Running_waiting is a summary: how many tasks are running, and how many tasks are waiting to be able to run? That gives you an idea of the overloads on your machine, whether there are overloads in some regions of your execution. Stepper is like zooming in even more. It shows you all the cores, all the tasks on the cores, the command name, the PID. It’s good when you know there’s a problem at a particular point and you want to see exactly what’s going on in the machine at that point. Finally, there’s events, which interleaves custom events with the scheduling activity. Eventually, you’ll need to make your own kernel events with your own trace_printks, basically printfs that print out the information that you want. The events tool manages those and shows how they interact with the scheduling. We’ll see how all of these tools can be useful in debugging a particular performance problem. All these tools are publicly available. The URL will be at the end of the talk.
The example that we’re going to look at is based on an application from the NAS benchmark suite. These are high-performance computing applications. The application we’re considering is BT, the Block Tri-diagonal solver. The only thing that we need to know about it is that it has N tasks on N cores. If we have 64 cores, we have 64 tasks. 64 tasks on 64 cores sounds like an extremely trivial scheduling problem. You just put one task on each core and let them do what they want, and then everything should be fine. Everything would be fine if the Linux kernel scheduler actually did that. Unfortunately, we’ll see that it does not. Here’s a bunch of runs of the application, 20 runs. Again, I’ve sorted them by runtime. We have here the fastest run. There’s a bunch of runs that are close to that. Maybe they have some locality issues, for example.
Then you see, like I mentioned before, we have this upward trend at the far end. Most of our runs run perfectly well as we expect them to, but every now and then there’s one that is much slower, and we want to know why. This dotted line compares the fastest run and the slowest run. You can see it’s about a 20% overhead for the slowest runs. To start out with, we can just get to know our benchmark. This is a scientific computing application. The standard thing to do with scientific computing applications is just to pin all the threads onto the cores and let them do what they want. Our interest is to find out how the Linux kernel scheduler is going to manage these threads. We don’t want to pin them, but it’s interesting to compare the difference.
If we pin them, we get even better performance, and we don’t get this huge jump at the end. We just get a small increase. Here I have not actually pinned specific threads to specific cores. I have said that the first half of the threads go on one socket. This is a two-socket machine. Put half of them on one socket and half of them on the other socket, and let them do whatever they want beyond that. The fact that we see a difference between the execution times suggests that memory locality is actually an issue for this benchmark. That gives us some idea of what the benchmark does and what its issues are.
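As an aside, for readers who want to try this kind of socket-level pinning themselves, here is a minimal sketch in C using sched_setaffinity. It assumes cores 0-31 belong to socket 0 and cores 32-63 to socket 1, which is one common numbering but not universal, and it is not necessarily how the experiment in the talk was set up.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Restrict the calling thread to all the cores of one socket, rather than
 * pinning it to a single core.  Assumes a contiguous per-socket core
 * numbering, which is a simplification. */
static void pin_to_socket(int socket, int cores_per_socket)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < cores_per_socket; c++)
        CPU_SET(socket * cores_per_socket + c, &set);
    sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}
```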
Now we can make some runs and try to see what’s going on in those runs. This is the tool dat2graph. We run the application with trace-cmd and obtain a dat file. It may be fairly big. Then we can pass that dat file to dat2graph, and we get output like this. On the x-axis, we have the execution time. On the y-axis, we have the set of cores. These lines here represent the different tasks that are running on the different cores. Basically, there is a different color for each PID, so for each different task, but there’s only a finite number of colors. For example, here we have two green tasks. They’re obviously not the same PID. It’s a bit approximate, but I’ve never really found that to be a problem in practice. Maybe we could use random colors, but when I tried that, the colors sometimes came out too close to each other anyway, and so it wasn’t really beneficial.
Basically, what we see here is that these tasks run along. Sometimes here we have a task that was on this core. These two tasks have switched with each other, but there’s not much other excitement going on here. Here’s a somewhat slower run, not very much slower, just a little bit slower. Here is a very slow run. Before we were around 8, 8-and-a-half seconds, and now we have jumped up to over 10 seconds.
Gaps, and Their Impact
Now we’re going to look into how this run is different from the first run, and what’s going on. Basically, the issue that we find is that once we start looking at these slower runs, we start to see these gaps. There’s a gap here. There’s a gap here. What does a gap mean? A gap means that the task was running along, and now one core is not being used at all. Using fewer cores means less CPU time available, so it seems natural that we should get slower. If we look here, this one actually has a huge gap. This is the slowest run, and it has a huge gap. These gaps seem to be important with respect to the performance. What’s the impact of a gap? Maybe there’s just nothing to do at those points, and in that case the gap doesn’t matter. But think back to what I said previously: this is N tasks on N cores. There probably is something to do, so this is probably not our situation.
The other situation is that if some core is idle, that means that some other core is probably overloaded, and the tasks on those overloaded cores are going to be running at half speed. Basically, you have two people standing in the same place, and they can’t both walk forward at the same time. One of them walks forward for a little while, then the other one, and so they go half as fast as all of the other tasks that can run freely by themselves. This is also an application that synchronizes very often. We have these two tasks that are walking forward slowly, and all those other fast tasks will be waiting for the slow ones. That will slow everybody down, not just those two slow tasks. That can have a big performance impact.
Basically, we want to see which situation we’re in. I’ve hinted that we’re definitely in the second one, but we can see that in practice. The tool running_waiting is going to show us where there are overloads. Here again the x-axis is the time. Now the y-axis is the number of threads, and the green line shows how many are running. We have a 64-core machine, and so it’s not surprising that we have 64 running threads. When we are up at this line, that means that everything is as we hope it to be.
Then the red line is the total number of all of the threads, which means that the difference between the red line and the green line is the number of waiting threads. These are tasks that are waiting in the run queue of some core. They would like to run, they are able to run, but there’s no space on that core for them to run. That’s slowing things down. You see lots of red here. This graph is made with a graphing tool that has a certain resolution, and what we’re really interested in is the width of these lines; since they’re lines, they have a minimum width, and so this is perhaps a bit misleading. What we can do instead is focus on what we’re really interested in, which is the application. The application threads are the ones that have been forked since the start of the application. If we just focus on the forked ones, then we have quite a different picture. This is the fast run. In the fast run, most of the time, the application is running at full speed on all of the cores, and things are as they should be.
On the other hand, this is the slow run. In the slow run, you can clearly see that there is a problem. We saw the gap in this region here, and what we see from this particular graph is that, up to this point, we have 64 threads running. Then we have only 63 threads running at any point in time, but we have 64 threads that want to run. One of our application threads wants to run, but it’s not able to run. This gap is indicating a work conservation problem. Basically, somewhere on the machine there is an idle core, and somewhere else on the machine there is a task that wants to use that idle core, but it’s not able to get there for some reason. This is a long overload period. It starts around 7 seconds and lasts about 3 seconds, which is basically one-third of the execution time. Things are not going as we hope they should. This is really undesirable behavior for the application.
Now we can try to understand what happened, what caused this bad situation to arise. We can go back to dat2graph. Dat2graph has an option called color-by-command. Color-by-command colors all the tasks not by their PID, but by the command they’re executing. These blue lines are BT, the application we’re interested in. At the beginning of our gap, we have a green line, and the green line is called migration. Then you have to ask yourself, what does migration do? Migration is a task that comes from the kernel. It’s a kernel thread, and its purpose is basically to take some happily running task on some CPU, yank it off that CPU, and put it some other place.
Basically, migration wakes up, it preempts the running task. Then the running task can be moved off to some other CPU, and then migration goes to sleep once its evil work has been done. Migration is used for load balancing. Now our question is, why do we need load balancing? We have N threads on N cores, the load should be balanced. The kernel should just deal with it. Clearly, the kernel was dealing with it before. We can see what’s going on.
Is Load Balancing Needed?
Now we go to the stepper tool, where we see exactly what’s happening at a particular point in the execution. Here we have all of our cores, and here we have the running tasks. This bar separates the running task from the waiting tasks. Here we have a task that is waiting. It has just woken up, and it has not been able to run yet. It will get to run very shortly. This is the situation just before our migration task runs. We can see down here what event is happening at the kernel level. The event says sched_move_numa. This is a special kind of load balancing, which is actually not with respect to the load, but with respect to the memory usage. With NUMA balancing, you study all of your tasks and what memory they are accessing, and if you find out that a task that’s running on one NUMA node is actually accessing a lot of data on another NUMA node, then you move it over there, on the assumption that it will be happier.
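To give a rough idea of the mechanism, here is a conceptual sketch only, not kernel code: the kernel periodically samples which node a task’s memory accesses hit, via NUMA-hinting page faults, and prefers the node with the most hits. The types and the MAX_NODES constant here are made up for illustration.

```c
#define MAX_NODES 8

/* Made-up bookkeeping: faults observed per NUMA node for one task. */
struct numa_stats { long faults_per_node[MAX_NODES]; };

static int preferred_node(const struct numa_stats *s, int nr_nodes)
{
    int best = 0;
    for (int n = 1; n < nr_nodes; n++)
        if (s->faults_per_node[n] > s->faults_per_node[best])
            best = n;
    /* If this differs from the node the task is running on, NUMA balancing
     * may swap it with a task there, or simply move it (the sched_move_numa
     * case we see in the trace). */
    return best;
}
```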
Basically, what the kernel is doing here is moving this task. This is one NUMA node, this is the other NUMA node. It’s moving the task from this NUMA node into this space. This looks like a really excellent idea, because now it will be over there with its own memory, and this space here looks like it’s empty, so that will be a good place for it. But we know we have N tasks on N cores, we have 64 of these BT threads, so that space is not going to be empty very long.
There is some other BT task that was actually already planning to run on this socket, in this place. 214 is the one that we moved, and even before this 214 task ever gets to run on core 51, the task that had wanted to be there before has woken up. Actually, here’s another overloaded core, but there’s another space for it, so hopefully that won’t get moved over soon. This is the migration task here that will go away, then we’ll have two slots, and this one will get moved in there. Basically, we are now in a situation where we have two sockets, each one with 32 cores, and this one over here has 31 threads running on it, and this one has 33 threads running on it. That’s a bad thing.
Assessment
Now we understand exactly what’s going on. We have NUMA balancing, whose goal is to address locality issues. Moving a task to another NUMA node, if it has a locality issue, should improve its performance in theory. Remember, this might be important for BT, because when we had everything perfectly placed according to the NUMA nodes, the performance was better. There are two types of NUMA balancing. One of them is a swap: you take a thread here and a thread there, and you exchange them. That seems fine. We stay with the same number of threads on each core, so no work conservation issues.
The other one is the one that we actually had in our situation, which is move, and in that one you just take some thread over here and put it on the socket over here, and hope for the best. The problem is, in this case, we risk introducing an overload, and that’s actually what happens in our case. We have an overload, but the scheduler is supposed to manage these kinds of issues for us. The really important question is, we have this overload, why does it not get corrected? Why does the scheduler not just realize that this is a very bad situation? It never resolves it, the application just ends. Why does it take so much time for it to figure out what to do, or why does it spend so much time not doing anything at all?
To understand that, we need to actually delve into the code and figure out what the load balancer does, because that’s what we hope will do the work for us. What we have now is an overload, not a NUMA issue, and load balancing should help us deal with overloads. Load balancing has several parts. First, it has a bunch of filters. The first filter is should_we_balance. Basically, what happens with load balancing is that on every clock tick, each core gets to reflect on whether it wants to steal load from some other part of the machine. The problem is that we don’t want all of those cores to be stealing tasks from other parts of the machine at the same time, because they will have the same vision of what’s going on, and they will all naturally try to steal the same task. That task will then bounce around the machine with no practical benefit. Should_we_balance addresses that issue.
Then, if we are the core that should be allowed to balance at the current moment, we have to find who we are going to steal from. We find the busiest group, and in our situation, that’s going to be the busiest socket. It should be obvious to you which is the busiest socket: it’s the one with the 33 threads on it. Then we want to find the busiest queue. The core that is going to steal is just going to identify one other core that it should steal from, and then it will take some threads from that core. If it needs to, it can then go around again; there’s a loop, and it can steal from some other cores later. In our situation, we just have one extra thread on one socket that we will need to steal, so there’s not that complexity.
Basically, we see if the current CPU is the one that should steal, and then we see if that CPU can find another CPU that it can steal from. If any of these conditions are violated, then we just abort, and we try again in the future. If we manage to find a core that we can steal from, then we look through its different tasks and see if there are one or more that we can steal. If we are able to steal something, that’s great, that’s success. There’s a bunch of heuristics here that help decide which task you should steal. If you’re not able to steal anything, there’s another kind of stealing, which is called active balance. Active balance is the last resort. Active balance is able to steal a task that’s running. Otherwise, you only steal tasks that are waiting, because they don’t have any context loaded on the CPU, so they’re easier to steal.
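To fix ideas, here is a heavily simplified sketch of that decision flow, loosely following load_balance() in kernel/sched/fair.c. The step names (should_we_balance, find_busiest_group, find_busiest_queue, detach_tasks, need_active_balance) are real kernel functions, but the structure is abbreviated, the LB_ABORT/LB_SUCCESS/do_active_balance names are placeholders of mine, and the real code has many more cases, so treat this as pseudo-C rather than the actual implementation.

```c
/* Pseudo-C sketch of the load_balance() decision flow (abbreviated). */
static int load_balance_sketch(struct lb_env *env)
{
	if (!should_we_balance(env))
		return LB_ABORT;        /* let some other core do the balancing */

	struct sched_group *group = find_busiest_group(env);
	if (!group)
		return LB_ABORT;        /* heuristics say nothing is busy enough */

	struct rq *busiest = find_busiest_queue(env, group);
	if (!busiest)
		return LB_ABORT;        /* no single queue judged worth stealing from */

	if (detach_tasks(env) > 0)
		return LB_SUCCESS;      /* stole one or more waiting tasks */

	if (need_active_balance(env))
		return do_active_balance(env);  /* last resort: steal a running task */

	return LB_ABORT;
}
```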
Custom Tracing
Basically, we have this flowchart, and we want to understand why we do not end up at the success places in the flowchart. Since we never manage to steal anything, and nothing ever runs on this core, it seems that we always end up in the abort cases. We want to figure out why this unfortunate thing happens. All of the graphs that we have made before are based on events that already exist in the Linux kernel. In this case, we want to put in some specialized things, because we want to understand how we get through this function, and that’s not something that other people have wanted to do before. Basically, we want to add custom events into the kernel. Just think about how you debug things in the same situation in user space: you might put in prints, or you might use a debugger. It’s not super easy to use a debugger at the kernel level, but you can put in prints. There’s a nice function called trace_printk. It works just like ordinary printf at the user level, with the same percents and formats and everything. It’s very simple to use.
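For illustration, the kind of instrumentation being described looks something like the following, with trace_printk() calls at the abort points of load_balance() in kernel/sched/fair.c. trace_printk() is a real kernel API; the event strings and the exact placement here are just my own guess at what was done.

```c
	/* Inside load_balance(), at each early-exit path (sketch): */
	if (!should_we_balance(&env)) {
		trace_printk("lb abort: should_we_balance cpu=%d\n", this_cpu);
		goto out_balanced;
	}

	group = find_busiest_group(&env);
	if (!group) {
		trace_printk("lb abort: no busiest group cpu=%d\n", this_cpu);
		goto out_balanced;
	}

	busiest = find_busiest_queue(&env, group);
	if (!busiest) {
		trace_printk("lb abort: no busiest queue cpu=%d\n", this_cpu);
		goto out_balanced;
	}
```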
Then we have another tool, called events, which shows when these events happen and how that relates to the other scheduling activities. I’ve put trace_printks at the different places in this flowchart, at each place where we fail. The events tool has lots of options, and one of them is this cpu 40. Basically, I have compiled my kernel, installed my kernel, run my application under this new kernel, and obtained a new trace. I can use dat2graph on the trace to figure out where the gap is and what core it’s on, and then, because there are lots of events, I use the events tool to focus on the core that has all the idle time. That’s why we only see events on this core. There are lots of events that happen on the other ones, but they’re not interesting. What we’re interested in is what this core does and how that prevents it from actually being able to steal anything.
You can look over here. Here we have the names of the little strings that I put in my trace_printks, so the names of my events. I also know how many of those events there are and what percentage of the total number of events they represent. If you do a little math, you can see that we start load balancing 864 times, and every single one of those times we fail for one of these different reasons. Most of the time we fail because we are not able to find a busiest group. This might seem surprising, because we clearly have a busiest group: one socket has 33 threads on it, and the other has 31 threads on it.
The Linux kernel has various categories and heuristics for deciding that one socket is busier than the other. Often it fails at just this level. This is 693 cases out of 864. The next case is that we succeed in finding the busiest group, but then we don’t find a busiest queue. These are the little red diamonds. You can see that there are many green Xs, which are find_busiest_group failing. There are also many red diamonds, and those red diamonds are scattered around a lot. There are 149 of them.
Again, this is failing because we have a bunch of different heuristics. Even though we see that there is an imbalance between the different cores, the Linux kernel developers have decided that the correct heuristic in this situation is that that imbalance is not enough to motivate doing anything. It’s surprising because we have an imbalance situation that persists for multiple seconds, which is a long time from the point of view of an operating system.
Again, these are heuristics: they don’t know the future, and they don’t really know that much about the past either. According to its decision-making process, the kernel makes this particular choice. Then there are 21 cases where we managed to find a busiest group and a busiest queue, but we still don’t manage to steal anything. This is the active balance case. Active balancing is a violent operation, because we’re stealing a task that’s actually happily running on some other core. There’s a threshold, and you have to get over a threshold of failures before you can actually do this active balancing. There are these very few cases where we even reach active balancing, but we always fail to get over the threshold. Our counter value is only 0 or 1, and it has to be greater than 4. It’s all a bit arbitrary, and we don’t make it.
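The gate being described looks roughly like the final check in need_active_balance() in kernel/sched/fair.c, sketched below from memory; the special cases are elided and the exact form and threshold vary by kernel version.

```c
static int need_active_balance(struct lb_env *env)
{
	struct sched_domain *sd = env->sd;

	/* ... special cases elided ... */

	/* Only allow active balancing once ordinary balancing has failed
	 * more than a small number of times in this scheduling domain. */
	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2);
}
```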
Observations
Basically, as I’ve summarized so far, find_busiest_group fails a lot, but from time to time we get through it and reach find_busiest_queue. Find_busiest_queue almost always decides that the load is balanced. This is due to something called a migration type, which in this case is chosen to be migrate_util. It’s not very important what that means exactly. The thing that will allow us to get into the active balancing is called migrate_task. We want it to be migrate_task instead of migrate_util. What we’re going to do now is, having identified the problems, just violently fix them and see if that makes things work. I’m not necessarily going to propose a correction for the Linux kernel; I’m just going to do something that makes my application give the right performance.
Then, once we validate that our hypotheses are correct, we can at some future time think about what is the better way to fix the Linux kernel. The problem with the Linux kernel is that it has to work for everything; it can’t just work for this specific application. That’s the direction in which we’re going. We’re going to switch to migrate_task, getting rid of this migrate_util thing, which will get us through find_busiest_queue. Then we will try to do normal stealing of waiting threads. That’s going to fail as well, because for all of our threads that are on that other socket, NUMA balancing has decided they would prefer to be on that socket. We actually have a tradeoff between memory locality and work conservation. Then we fall down to the active balancing, and that’s the last hope.
As I said, the threshold doesn’t get high enough. We can see how that happens here. If I change from migrate_util to migrate_task, we still have most of our failures being find_busiest_group. Find_busiest_queue fails only one time. We end up with a lot of these need_active_balance failures with our counter: since we reach that code more often, the counter goes from 0 to 1 to 2 to 3 to 4, but it never gets to the 5 that we need to get over the threshold to actually do the active balancing. That suggests that the threshold is the bottleneck; maybe it’s too high, or maybe it shouldn’t be there at all. To investigate the problem, we are just going to get rid of it.
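Concretely, the two experimental probes amount to something like the following rough sketches; these are diagnostic hacks rather than proposed kernel fixes, the exact locations vary by kernel version, and this is my reading of the change rather than the actual patch.

```c
	/* Probe 1: where the imbalance is characterized (around
	 * find_busiest_group()/calculate_imbalance()), force task-count-based
	 * balancing instead of utilization-based balancing. */
	env->migration_type = migrate_task;	/* instead of migrate_util */

/* Probe 2: drop the failure-counter threshold that gates active balancing. */
static int need_active_balance(struct lb_env *env)
{
	/* ... */
	return 1;	/* was: sd->nr_balance_failed > threshold */
}
```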
Resulting Performance
We want to avoid the find_busiest_queue failures, and we did that already. We want to avoid this threshold, this counter failure, and so we’re going to do that now. Basically, I just modified the Linux kernel to make the situation be as I want it to be. Then we end up with the graph that you saw at the beginning. The red line is vanilla Linux 6.7. The green line is when we use migrate_task. These lines are ordered in a certain way, but the numbers are actually fairly random, so using migrate_task can give you worse performance or better performance than the original Linux kernel. It doesn’t really have any impact at all. We fail early, or we fail later, but we still fail, so it doesn’t make any difference.
The nice thing is that when we put the two changes together, switching to migrate_task and removing the threshold that gates the active balancing, then we get this blue line. The blue line shows a very slow increase in the running time, and the difference between the fastest time and the slowest time is not that great; actually, I don’t know what percentage it is. We can also look at this from a different point of view, which is work conservation. That’s what you see at the bottom. This shows what percentage of the time the application spends with a work conservation problem. At the fastest times, with all of the different variants, it has basically no work conservation problem, maybe a very small amount of time.
For the middle times, there are some work conservation problems; sometimes the issue is maybe work conservation, sometimes maybe memory locality. It’s kind of a tradeoff. When things go badly, we’re spending up to a third of our time with a work conservation problem, that is, with idle cores and overloaded cores. The nice thing is that if we make all of our changes, then in terms of work conservation, the problem has basically gone away. There are executions that are slower than other executions, but the problem is not work conservation. We could go on and try to understand what the rest of the problem is, but I don’t know what it is, I haven’t done that yet. I just want to emphasize that the changes I made, I don’t consider them to be a solution that we should use for all applications; they’re just a way of probing the system and understanding what’s going on.
Conclusion
To understand the scheduler behavior, you have to have a precise understanding of what’s going on at every point in time. What’s going on at every point in time is like a really huge amount of information. You need some tools to help you organize the data, so you can zoom in on the part that’s actually interesting for you. We have these four tools that we have found interesting in a general situation. Maybe we have too many tools, maybe we should just have one tool. It’s awkward to have to remember the different lists of arguments that different tools have, and so maybe just one tool would be good. Maybe you could also have a domain-specific language that would let you program the thing that you are interested in seeing. You wouldn’t have to program an entire visualization tool.
It’s a lot of work to make the colors come out in a nice way, make the thing visible, allow you to zoom, and all these kinds of things. You know what things you are interested in seeing, and so it would be nice to be able to easily specify what things you want to see, what their relationships to each other should be, what format they should be presented in, and so on. This is the URL where you can find the tools: https://gitlab.inria.fr/schedgraph/schedgraph.git. There’s a little bit of documentation. There’s a link to a paper that gives a description of how it all works.
Questions and Answers
Participant 1: You mentioned the possibility of a DSL to specify visualization of large quantities of events. That seems to be something that would be pretty generically useful for the analysis of distributed systems as well, because of the problem of collecting lots of event data and then needing to somehow visualize it. It seems to be something that comes up a lot in all sorts of performance analysis, not only of a single system, but of larger systems as well.
Lawall: I don’t even have an answer for my particular case. I did make a language and then I proposed it to some colleagues and I said, what does this specification do? They had no idea. I don’t think my language was a success. So far, I don’t have an answer. It is a good point that maybe such a language, or even these visualization tools, could also be interesting for other large-scale systems, if you can generate your data in the format they expect.
Participant 2: To what degree does the scheduler take into account the L0, L1, L2 caching and the different number of processes that they might encompass, or is it just not important?
Lawall: The first question is about the different levels of cache. I didn’t emphasize it because we were only interested in the inter-socket thing, but there’s actually a hierarchy of load balancing. The bottom of the hierarchy is the hyper-threads, and then there are the things on the same socket, and then across the machine. It favors moving things around locally. I think that addresses that issue.
Participant 2: Is there any way we can give hints to the operating system about how it should schedule?
Lawall: There are some knobs that you can turn. There are some runtime knobs, but I think they’re more for favoring performance versus energy consumption. I am not sure if there are any hints that would address this specific very low-level issue.
Participant 3: Were there any specific gaps you saw in the visualization tools you were using for these massive-scale problems? Also, when you’re actually evaluating the movement of tasks around, is there any kind of quantification of how much it costs to really move a task, because context switching is always expensive? What’s the threshold at which moving actually makes sense?
Lawall: First question is visualization tools. These graphs are all made with a quite old tool, which is called JGraph. Basically, if your data is very large, then you may want to have some coffee while it’s dealing with the issue. Actually, I had an intern who used Plotly, I believe. The nice feature was that it allowed interactivity. The not-so-nice feature was as soon as the data got a bit larger, actually nothing happened at all.
Participant 3: Task switching and context switching is always expensive.
Lawall: Part of it is the question that was already asked about the cache levels. Part of it is the fact that active balancing is the least favorite option. It’s very hard to actually get into that. That’s the most costly thing, because it preempts active tasks. There are actually several context switches involved in that. The whole scheduler is just a large collection of heuristics. Sometimes they’re very intertwined. A certain heuristic over here actually has some impact over there, things like that, and it’s very complex.
Participant 1: You ended up adding printks and recompiling your kernel. Do you think it would be feasible to get the same information by using a kprobe and eBPF, or would that be too intrusive? A lot of people find, for better or worse, recompiling the kernel to be intimidating, but the eBPF tooling has gotten better.
Lawall: I’m not an expert on eBPF. I would just point out that what we’re doing involves a specific if, or something like that. It’s not on function boundaries. I don’t know if eBPF is limited to that or not, but there are specific ifs that we need to modify. The specific changes involved an if test, something like if x && y, and I want to get rid of y. I’m not sure that eBPF can actually make that kind of change at the kernel level.