In the previous posts (first, second) we have implemented a Floyd-Warshall algorithm (in four variants) and a routes reconstruction algorithm. In these posts we made our way through the basics of all-pairs shortest path problem, data representation in memory, parallelism, vectorisation and how we can adapt algorithms to data specifics.
In this post we will continue our journey and explore a new, more efficient, way to solve all-pairs shortest path problem. However, this time, besides employing vector and parallel capabilities of the CPU we also will employ L1, L2 and L3 caches.
Sounds interesting? Then let’s write some code 🙂
Well, not so fast – before exploring the theory and code behind the algorithm I suggest to spend a few moments and revise basic information about “CPU caches”.
If you have a solid understanding about the topic you are free to skip it. In case you decided to read it (because you are just curious or you want to revise the basics) please take in account that the following information is an oversimplified overview.
– Author’s note
What is cache?
Most probably you heard something about CPU cache, at least in advertisements of a new Intel or AMD chips or during any kind of hardware discussions with friends or colleagues. Here, we aren’t going to discuss what CPU cache exactly is or who is better in implementing it Intel, AMD or Apple.
Instead, we will try to build a mental model of the cache and learn how to use it to:
- … better understand limits we can or can’t push when writing high-performance code in C# and .NET.
- … better understand application behaviour in high-performance scenarios on different platforms, with different parameters.
- … brag in front of our friend when someone says “cache”.
So, let’s start from the basics… what a heck is a cache?
Imagine yourself sitting in an empty apartment in front of a PC. You are working on a new feature and you are doing great. You are blazing fast, no errors and no mistakes. You are in a perfect flow.
You are the CPU.
Periodically, you need to consult documentation. All documentation you need is available in the library across the road.
Library is the main memory.
You are free to enter the library anytime and read everything you need. However, every visit to the library takes time and prevents you from advancing your work, which doesn’t sound productive. To improve the situation, you started to write down pieces of the documentation on sticky notes and put them on the display.
Sticky note on a display is the first level cache.
Now when you need something from the documentation you first look at sticky notes and only if information isn’t there you have to visit the library.
When you found what you need on sticky note – it is called a cache hit, otherwise – a cache miss.
Having sticky notes on the display reduced time spent on “library visits” but didn’t remove it completely because you can’t know what documentation you would need in advance. There is a natural limit of how much useful information you can bring back from the library.
The amount of useful information you can bring from the library is called a cache line.
You can’t have millions of sticky notes on the display, so when there is not enough space you have to throw some of them into trash.
When you throw away sticky note because of space limitation – it is called an eviction from cache.
To stop throwing useful information away (and stop visiting library when you need it again) you bought yourself a “small box”. Now, when you bring “new” sticky notes from library and need to free up some space on the display, instead of throwing “old” sticky notes into trash you put them into a “small box”. This way, if you need this information once again, you will be able to pick it up from the “small box” and put it back on the display and save yourself a library visit.
“Small box” is the second level cache.
The “small box” can fit much more sticky notes than the display, however, it is not unlimited. Therefore, when some of sticky notes don’t fit in the “small box” you still had to throw them in the trash. That is why you decided to buy a “huge box”. Now, when you bring new sticky notes from the library you first make copies of them and put these copies into a “huge box” and only then you put originals on the display. If you need to replace sticky notes on the display, you as previously, put them into a “small box” and if “small box” is overloaded you throw them into trash. However, this time if you need this information once again – you just make a copy of the sticky note from the “huge box” and put it back on the display.
“Huge box” is the third level cache (aka LLC)
“Huge box” doesn’t eliminate “library visits”, however, if the size of a “huge box” is considerably large, then you will end up having a lot of related pieces of documentation in it, which in turn will allow you to do a significant amount of work without interruption.
That is basically it.
The above model isn’t full nor ideal but, at least I hope so, it clearly illustrates the main idea of why modern chips have multiple levels cache — to reduce number of round trips to main memory, and keep CPU busy as long as possible.
What might not be so obvious, from the model, is that all caches (L1 “sticky notes on a display”, L2 “small box” and L3 “huge box”) have different access time (aka latency) [1] and size.
Here is an example for Intel i7-6700 (Skylake) CPU:
Cache |
Access time (aka Latency) |
Size |
---|---|---|
L1 |
4 cycles |
64 KB |
L2 |
12 cycles |
256 KB |
L3 |
42 cycles |
8 MB |
Table 1. L1, L2 and L3 caches access times and sizes of Intel i7-6700 (Skylake) processor.
Please note, the above numbers are just one example. Different CPUs have different characteristics. However, the overall relation more of less remains the same — L1 is the smallest but also the fastest one, L2 is somewhere in between, and L3 is the largest but also the slowest one.
– Author’s note
To truly understand these numbers, it is critical to know how much time it takes to access main memory. On the same Intel i7-6700 (Skylake) CPU accessing main memory takes around 42 cycles + 51 nanosecond. You can notice, that besides cycles, we also have a fixed interval of “51 nanosecond” here. This isn’t a mistake because cycles can be converted (approximately) to nanoseconds using a very simple formula:
In our example, it would be:
Converting access time values in previous table from cycles to access time gives us the following:
Element |
Access time (aka Latency) |
---|---|
L1 |
1 ns |
L2 |
3 ns |
L3 |
10.5 ns |
Memory |
10.5 ns + 51 ns = 60.5 ns |
Table 2. L1, L2, L3 caches and main memory approximate access times in nanoseconds of Intel i7-6700 (Skylake) processor.
Even taking in account that this is a very approximate conversion, which ignores a lot of details, it still looks stunning — accessing L1 cache is 60 times faster than main memory.
Of course in real application it isn’t so simple, and cache related optimisations doesn’t result in x60 performance improvements, however, as I hope we would be able to see in this post, they still can significantly improve application performance.
A reader familiar with the topic for sure would notice that I haven’t mentioned anything about cache modes (exclusive and inclusive), as well as anything related to instruction caches and other things like memory alignment, hardware and software prefetch, and so on. I did that intentionally for simplicity purposes.
However, if you just stumbled on this note and want to know more, you can start from this Wikipedia page and if you want to start thinking about low-level performance I can recommend you a book by Denis Bakhvalov and a paper by Ulrich Drepper
– Author’s note
Blocked Floyd-Warshall algorithm
Blocked version of Floyd-Warshall algorithm (aka Blocked Floyd-Warshall) was introduced in [2]. The algorithm works with a graph G
of vertices V
, represented as weight matrix W
. It splits the matrix into blocks of equal sizes (creating a new matrix of blocks B
) and uses these blocks to calculate all-pairs of shortest paths between all vertices in a graph.
This definition doesn’t sound obvious, so here is an example.
Imagine a 8×8 weight matrix W
(see Picture 1a). We split the matrix into 2×2 blocks to create a new 4×4 matrix of blocks B
(see Picture 1b).
Every block of the matrix B
includes a part of all paths of the matrix W
, for instance (see Picture 2):
B[0,0]
includes paths from vertices 0 and 1 to 0 and 1B[0,3]
includes paths from vertices 0 and 1 to 6 and 7B[2,0]
includes paths from vertices 4 and 5 to 0 and 1- … and
B[2,3]
includes paths from vertices 4 and 5 to vertices 6 and 7
All vertices, represented by these four blocks are tightly related. Just looking at the matrix B
it is noticeable that you can move from vertex 4 to 0, then from 0 to 7 and by those find a path from 4 to 7 through vertex 0, which, then can be compared with the existing path from 4 to 7, which can be updated if necessary with the shortest one (see Picture 3).
If you remember the Floyd-Warshall algorithm, then these manipulations must sound extremely familiar because they resemble what it does:
algorithm FloydWarshall(W) do
for k = 0 to N - 1 do
for i = 0 to N - 1 do
for j = 0 to N - 1 do
// Select the shortest path between existing path from i to j and
// a new path from i to k to j, and replace the existing path
// with the smallest
//
W[i, j] = min(W[i, j], W[i, k] + W[k, j])
end for
end for
end for
end algorithm
// where W - is a weight matrix of N x N size,
// min() - is a function which returns lesser of it's arguments
However, Floyd-Warshall algorithm works with matrix W
(not B
). Fortunately, it can be easily rewritten into a Procedure
which accepts three blocks B1
, B2
and B3
instead of matrix W
:
function Procedure(B1, B2, B3) do
for k = 0 to L - 1 do
for i = 0 to L - 1 do
for j = 0 to L - 1 do
B1[i, j] = min(B1[i, j], B2[i, k] + B3[k, j])
end for
end for
end for
end function
// where B1, B2 and B3 - are blocks from a block matrix B,
// L - is a size of the blocks (B1, B2 and B3 are L x L size)
// min() - is a function which returns lesser of it's arguments
If we invoke the the Procedure
with B1 = B[2,3]
, B2 = B[2,0]
and B3 = B[0,3]
, it will recalculate all paths from vertices 4, 5 to vertices 6, 7 through vertices 0, 1.
The important moment here is while Procedure
indeed recalculates paths from vertices 4, 5 to vertices 6, 7 (through 0 and 1), these paths aren’t guaranteed to be THE SHORTEST because recalculation relies on the existing paths stored in blocks B[2,0]
and B[0,3]
.
Luckily, prior to recalculating the paths in block B[2,3]
, we can recalculate paths in blocks B[2,0]
and B[0,3]
using … the same Procedure
but with different input parameters (see Picture 4):
- To recalculate
B[2,0]
we invoke theProcedure
withB1 = B[2,0]
,B2 = B[2,0]
andB3 = B[0,0]
. This recalculates all paths from vertices 4, 5 into 0, 1 through 0 and 1. - To recalculate
B[0,3]
we invoke theProcedure
withB1 = B[0,3]
,B2 = B[0,0]
andB3 = B[0,3]
. This recalculates all paths from vertices 0, 1 into 6, 7 through 0 and 1.
You might have already noticed — recalculation of blocks B[2,0]
and B[0,3]
relies on the data from block B[0,0]
in the same way as recalculation of block B[2,3]
relies on blocks B[2,0]
and B[0,3]
.
Luckily (again), prior to recalculating the paths in blocks B[2,0]
and B[0,3]
, we can recalculate paths in B[0,0]
using … the same Procedure
but (again) with different input parameters (see Picture 5):
- To recalculate
B[0,0]
we invoke the Procedure withB1 = B[0,0]
,B2 = B[0,0]
andB3 = B[0,0]
. This recalculates all paths from vertices 0, 1 into 0, 1 through 0 and 1.
It might look surprising a bit (to recalculating the block by the block itself) but if you remember, we inferred the code of the Procedure
from the Floyd-Warshall algorithm, which means, when all input parameters are set to the same block, the Procedure
completely resembles the Floyd-Warshall algorithm and recalculates paths within the block:
function Procedure(B, B, B) do
for k = 0 to L - 1 do
for i = 0 to L - 1 do
for j = 0 to L - 1 do
B[i, j] = min(B[i, j], B[i, k] + B[k, j])
end for
end for
end for
end function
In combination, the process of recalculation of paths from vertices 4, 5 to vertices 6, 7 through vertices 0 and 1 is the following:
- Invoke
Procedure
withB1 = B[0,0]
,B2 = B[0,0]
andB3 = B[0,0]
to calculate all paths from vertices 0, 1 to vertices 0, 1 (represented by a blockB[0,0]
) through vertices 0 and 1 (see Picture 6a). - Invoke
Procedure
with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3] to calculate all paths from vertices 0, 1 to vertices 6, 7 (represented by a blockB[0,3]
) through vertices 0 and 1 (see Picture 6b). - Invoke
Procedure
withB1 = B[2,0]
,B2 = B[2,0]
andB3 = B[0,0]
to calculate all paths from vertices 4, 5 to vertices 0, 1 (represented by a blockB[2,0]
) through vertices 0 and 1 (see Picture 6c). - Invoke
Procedure
with B1 = B[2,3], B2 = B[2,0] and B3 = B[0,3] to calculate all paths from vertices 4, 5 to vertices 6, 7 (represented by a blockB[2,3]
) through vertices 0 and 1 (see Picture 6d).
In code, these steps can be represented by four sequential invocations of the Procedure
:
Procedure(B[0,0], B[0,0], B[0,0])
Procedure(B[0,3], B[0,0], B[0,3])
Procedure(B[2,0], B[2,0], B[0,0])
Procedure(B[2,3], B[2,0], B[0,3])
Interestingly, by slightly adjusting the above code we can recalculate paths from vertices 4,5 to 2,3 through 0 and 1 (all we need to do, is to replace B[0,3
] with B[0,1]
):
Procedure(B[0,0], B[0,0], B[0,0])
Procedure(B[0,1], B[0,0], B[0,1])
Procedure(B[2,0], B[2,0], B[0,0])
Procedure(B[2,1], B[2,0], B[0,1])
… and if we replace B[0,1]
with B[0,2]
… I think you got the idea. We can continue and calculate all paths between all vertices through vertices 0 and 1 (see Picture 7):
In code, the implementation is just a repetition of the same steps but for different blocks:
// Recalculate block B[0,0] (aka "diagonal" block)
//
Procedure(B[0,0], B[0,0], B[0,0])
//
// Recalculate blocks B[0,1], B[0,2] and B[0,3] (aka "horizontal" blocks)
//
for i = 1 to 4 do
Procedure(B[0,i], B[0,0], B[0,i])
end
//
// Recalculate blocks B[1,0], B[2,0] and B[3,0] (aka "vertical" blocks)
//
for i = 1 to 4 do
Procedure(B[i,0], B[i,0], B[0,0])
end
//
// Recalculate blocks:
// B[1,1], B[1,2], B[1,3],
// B[2,1], B[2,2], B[2,3],
// B[3,1], B[3,2], B[3,3]
// (aka "peripheral" blocks)
//
for i = 1 to 4 do
for j = 1 to 4 do
Procedure(B[i,j], B[i,0], B[0,j])
end
end
Plain and simple. This code recalculates all paths between all vertices through vertices 0 and 1.
To recalculate remaining combinations of vertices we need to … repeat the above algorithm but starting from blocks B[1,1]
, B[2,2]
and B[3,3]
as “diagonal” blocks (see Picture 8):
In code, the complete, Blocked Floyd-Warshall algorithm looks surprisingly simple:
algorithm BlockedFloydWarshall(B) do
// Iterate over all "diagonal" blocks
//
for m = 0 to M do
// Recalculate "diagonal" block
//
Procedure(B[m,m], B[m,m], B[m,m])
//
// Recalculate "horizontal" blocks
//
for i = 0 to M do
//
// Here, we jump over the "diagonal" block
//
if i != m then
Procedure(B[m,i], B[m,m], B[m,i])
end
end
//
// Recalculate "vertical" blocks
//
for i = 0 to M do
//
// Here, we jump over the "diagonal" block
//
if i != m then
Procedure(B[i,m], B[i,m], B[m,m])
end
end
//
// Recalculate "peripheral" blocks
//
for i = 0 to M do
//
// Here, we jump over the all "horizontal" blocks
//
if i != m then
for j = 0 to 4 do
//
// Here, we jump over the all "vertical" blocks
//
if j != m then
Procedure(B[i,j], B[i,m], B[m,j])
end
end
end
end
end
end
// where B - is a block matrix of M x M size
The algorithm iterates over all “diagonal” blocks and sequentially recalculates all paths between all vertices through vertices of the “diagonal” block.
Eventually, algorithm recalculates all paths between all vertices through all vertices.
That is why both algorithms have the same space and time complexities of O(n2) and O(n3) respectively.
– Author’s note
Now, when we are done with the theory, it is time to implement it in C#.
Experimental Environment
All experiments in this post are executed on a single randomly generated complete graph of 4800 vertices.
The experimental machine was equipped with 2xIntel Xeon CPU E5-2620 v4 (2.10GHz, 16 physical and 32 logical cores in total) under control of Windows Server 2019 operating system.
We use the block of size 120×120 because it has been found to demonstrate best performance characteristics on the experimental hardware [3].
The code was compiled and executed used .NET SDK 8.0 and .NET Runtime 8.0 (x64, AVX2). You can find the all code in the GitHub repository.
All implementations of Floyd-Warshall algorithm in this post are identical to the implementations from the previous post (except minors variable and method names).
Basic Implementation (aka Baseline)
We start the C# implementation from implementation of the Procedure
:
static void Procedure(
Span<int> B1,
Span<int> B2,
Span<int> B3,
int block_size)
{
for (var k = 0; k < block_size; ++k)
{
for (var i = 0; i < block_size; ++i)
{
for (var j = 0; j < block_size; ++j)
{
var distance = B2[i * block_size + k] + B3[k * block_size + j];
if (B1[i * block_size + j] > distance)
{
B1[i * block_size + j] = distance;
}
}
}
}
}
The code is almost an exact translation of previously demonstrated pseudo-code into C#. The only noticeable difference is the representation of all input blocks as a combination of Span<T>
and size of the block.
If you are a bit confused how it is possible and why we access an array by multiplying things, don’t be, I will explain it in a moment.
– Author’s note
The algorithm itself looks just a bit different from the pseudo-code (we denote this implementation as “Baseline”):
private static void Baseline(
int[] B,
int block_count,
int block_size)
{
var lineral_block_size = block_size * block_size;
var lineral_block_row_size = block_count * lineral_block_size;
for (var m = 0; m < block_count; ++m)
{
var offset_mm = m * lineral_block_row_size + m * lineral_block_size;
var mm = new Span<int>(B, offset_mm, lineral_block_size);
//
// Calculate "diagonal" block
//
Procedure(mm, mm, mm, block_size);
//
// Calculate "horizontal" and "vertical" blocks in the same loop
//
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
var offset_mi = m * lineral_block_row_size + i * lineral_block_size;
var im = new Span<int>(B, offset_im, lineral_block_size);
var mi = new Span<int>(B, offset_mi, lineral_block_size);
Procedure(im, im, mm, block_size);
Procedure(mi, mm, mi, block_size);
}
}
//
// Calculate "peripheral" blocks
//
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
var im = new Span<int>(B, offset_im, lineral_block_size);
for (var j = 0; j < block_count; ++j)
{
if (j != m)
{
var offset_ij = i * lineral_block_row_size + j * lineral_block_size;
var offset_mj = m * lineral_block_row_size + j * lineral_block_size;
var ij = new Span<int>(B, offset_ij, lineral_block_size);
var mj = new Span<int>(B, offset_mj, lineral_block_size);
Procedure(ij, im, mj, block_size);
}
}
}
}
}
}
You definitely have noticed, that Baseline
and Procedure
calculate a lot of offsets, which are then used to create Span's
and address values within the Span
‘s.
Here is how it works. Imagine a 8×8 square matrix W
(see Picture 9a). In our heads, we see matrix as a square but in memory, we can represent “square” as an array of 8×8 = 64 values (this is called a “row-major” or sometimes “lineal” representation), where every value can be accessed using a simple formula:
This is exactly what we do in the Procedure
to access block values (we use block_size
instead of 8
constant from the formula).
When it comes to the block matrix, we can apply absolutely the same thinking. Imagine a 4×4 block matrix B where every block is a 2×2 square matrix (see Picture 9b). We can represent this matrix as an array of (2×2)x(4×4) = 4×16 = 64 values (again), where every block is put into array using the same logic we applied to matrix W
, one by one. This allows us to find “beginning” of any block using the formula:
You can see the 16
and 4
calculated inside the Baseline
as lineral_block_row_size
and lineral_block_size
respectively.
These manipulations are essential for algorithms performance. Lineal representation is extremely cache friendly (especially in our case, where we access elements sequentially most of the time). They also allow us to access any block of the matrix B
(and any value in the any block of matrix B
) in constant time, which is also health for performance.
Now let’s see Baseline in action.
Experimental Results
Here are the experimental results of the Baseline implementation:
Algorithm (Baseline) |
Graph |
Mean (s) |
Error (s) |
StdDev (s) |
LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall |
4800 |
172.881 s |
2.0421 s |
1.9101 s |
6,903,951,087 |
Blocked Floyd-Warshall |
4800 (120) |
200.075 s |
0.0578 s |
0.0540 s |
59,423,676 |
Table 3. Experimental results of “Baseline” implementation of Floyd-Warshall and Blocked Floyd-Warshall algorithms.
Surprisingly, the Blocked Floyd-Warshall is about 15% slower than Floyd-Warshall. This is unexpected … or not?
Think about the code of both algorithms. It is trivial – read three values from memory, sum two values, compare two values, and write one value back.
Modern CPU’s are extremely optimised for single-threaded execution. They have multi-stage pipelines and can process multiple operations (with no data dependencies) simultaneously. They can detect lineal memory access (which is why we use it in both algorithms) patterns and automatically pre-load data from main memory into cache (hardware pre-fetch).
All these factors united, produce the result when algorithm with cache-friend memory access (and if you take a look at LLC Miss / Op column which shows number Last Level Cache misses, it is obvious that Blocked Floyd-Warshall is more cache friendly) has no performance benefits or even worse, works slower because it does additional work to manipulate the data in the right way (for instance, instantiation of Span's
, stack push-pop, extra loops and so on).
We can run the same code under a profiler (for instance Intel VTune) and clearly see that none of the algorithm is “bound” (i.e. limited) by the memory:
Algorithm (Baseline) |
Clock Ticks |
Instructions Retired |
CPI Rate |
Memory Bound |
---|---|---|---|---|
Floyd-Warshall |
364,956,900,000 |
1,771,270,200,000 |
0.206 |
3.0% |
Blocked Floyd-Warshall |
449,603,700,000 |
1,925,290,500,000 |
0.234 |
0.0% |
Table 4. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of Intel VTune profiler for “Baseline” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms.
I have extracted four (out of many) parameters:
- Clock Ticks – is the number of CPU clock ticks spent executing our algorithm.
- Instructions Retired – is the number of Instructions CPU processed.
- CPI Rate – is the number of Clock Ticks / Instructions Retired.
- Memory Bound – is the high-level, aggregative metric which represents impact of memory (when CPU is waiting for data to come up from memory) on the execution.
You can see, that the number of Instructions Retired is slightly bigger for the Blocked Floyd-Warshall algorithm compared to Floyd-Warshall. This is expected because it does a bit more job to arrange calculations. They both have low CPI Rate, which means that every “cycle” (i.e. clock tick) the CPU was able to process up to five instructions.
These two observations in combination with low Memory Bound basically tell us, that CPU had worked to its full capacity without stalling at all.
Let’s see, if we can make the Blocked Floyd-Warshall algorithm to outperform Floyd-Warshall.
Vectorised Implementation (aka “Vector”)
The first thing we can do to improve performance of the purely CPU bound algorithm is to force CPU to do even more operations simultaneously. Yes, I am talking about vectorisation.
In the previous post, you have already saw the vectorised implementation of Floyd-Warshall algorithm and if you didn’t, then don’t worry because it is almost a complete copy of the Procedure
:
static void Procedure(
Span<int> B1,
Span<int> B2,
Span<int> B3,
int block_size)
{
for (var k = 0; k < block_size; ++k)
{
for (var i = 0; i < block_size; ++i)
{
// Read vector of values from B2
// (to get existing paths from "i -> k")
//
var ik_vec = new Vector<int>(B2[i * block_size + k]);
//
// Vectorized loop
//
var j = 0;
for (; j < block_size - Vector<int>.Count; j += Vector<int>.Count)
{
// Read vector of values from B1
// (to get existing paths "i -> j")
//
var ij_vec = new Vector<int>(
B1.Slice(i * block_size + j, Vector<int>.Count));
//
// Read vector of values from B3
// (to get existing paths "k -> j")
// and sum them with values from B2
// (to get new paths "i -> k -> j")
//
var ikj_vec = new Vector<int>(
B3.Slice(k * block_size + j, Vector<int>.Count))
+ ik_vec;
//
// Compare vector of existing paths from B1
// with a vector of new paths
//
var lt_vec = Vector.LessThan(ij_vec, ikj_vec);
if (lt_vec == new Vector<int>(-1))
{
continue;
}
//
// Copy all "i -> k -> j" paths which are smaller than
// existing "i -> j" paths back into B1 block
//
var r_vec = Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec);
r_vec.CopyTo(B1.Slice(i * block_size + j, Vector<int>.Count));
}
//
// Non-vectorized loop.
//
for (; j < block_size; ++j)
{
var distance = B2[i * block_size + k] + B3[k * block_size + j];
if (B1[i * block_size + j] > distance)
{
B1[i * block_size + j] = distance;
}
}
}
}
}
As previously, we implement vectorisation using – Vector and Vector<T> types. The implementation has two loops: one vectorised (to calculate majority of paths) and one non-vectorised (to calculate remaining paths when size of the block is not dividable by size of the vector).
There is nothing to vectorise in the main algorithm and because interface of the Procedure
doesn’t change, we can plug in the new code and run the experiments.
Experimental Results
Here are the experimental results of the Vector implementation:
Algorithm (Vector) |
Graph |
Mean (s) |
Error (s) |
StdDev (s) |
LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall |
4800 |
73.251 s |
0.4993 s |
0.4671 s |
6,353,509,854 |
Blocked Floyd-Warshall |
4800 (120) |
64.792 s |
0.0318 s |
0.0282 s |
50,703,019 |
Table 5. Experimental results of “Vector” implementation of Floyd-Warshall and Blocked Floyd-Warshall algorithms.
This time Blocked Floyd-Warshall is around 12% faster than Floyd-Warshall algorithm.
Running the same code under the profiler also presents us with a different picture:
Algorithm (Vector) |
Clock Ticks |
Instructions Retired |
CPI Rate |
Memory Bound |
---|---|---|---|---|
Floyd-Warshall |
161,668,500,000 |
490,026,600,000 |
0.330 |
6.8% |
Blocked Floyd-Warshall |
144,314,100,000 |
488,193,300,000 |
0.296 |
0.3% |
Table 6. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of Intel VTune profiler for “Vector” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms
We can see a significant reduction of the Instructions Retired in both algorithms (which is expected because vector instructions do multiple operations at once and CPU does multiple vector instructions per clock tick).
We can also see the change in the Memory Bound – it is doubled for Floyd-Warshall and almost not changed for Blocked Floyd-Warshall algorithm.
Still, the CPI Rate is low which means (again), the CPU is not stalling (almost). Therefore, there is a room for improvement.
Let’s see, if we can apply parallelism here.
Parallel-Vectorised Implementation (aka “Parallel Vector”)
Parallelisation of Blocked Floyd-Warshall algorithm is a pure form of “Task Parallelism”, where “Task” is a recalculation of a single block (single invocation of the Procedure
).
The simplest approach for implementation is to use message passing (i.e. sending custom messages), CountdownEvent
and ThreadPool
.
Here is the code of the custom message:
private readonly struct ParallelMessage
{
// A reference to a countdown event to signal
// when the message is processed
//
public readonly CountdownEvent Event;
public readonly int[] Matrix;
public readonly int OffsetB1;
public readonly int OffsetB2;
public readonly int OffsetB3;
public readonly int BlockSize;
public readonly int SpanLength;
public ParallelMessage(
CountdownEvent Event,
int[] Matrix,
int OffsetB1,
int OffsetB2,
int OffsetB3,
int BlockSize,
int SpanLength)
{
this.Event = Event;
this.Matrix = Matrix;
this.OffsetB1 = OffsetB1;
this.OffsetB2 = OffsetB2;
this.OffsetB3 = OffsetB3;
this.BlockSize = BlockSize;
this.SpanLength = SpanLength;
}
}
The message is a readonly struct
(for performance reasons, to avoid allocations) which includes all information required to initialise three Span's
for B1
, B2
and B3
blocks and invoke the Procedure
.
The process of conversion of ParallelMessage
into the invocation of the Procedure is handled by an ParallelProcedure
:
static void ParallelProcedure(
ParallelMessage message)
{
var B1 = new Span<int>(message.Matrix, message.OffsetB1, message.SpanLength);
var B2 = new Span<int>(message.Matrix, message.OffsetB2, message.SpanLength);
var B3 = new Span<int>(message.Matrix, message.OffsetB3, message.SpanLength);
Procedure(B1, B2, B3, message.BlockSize);
// Signal the message has been processed
//
message.Event.Signal();
}
You can see, that besides information about the blocks, the ParallelMessage
includes a reference to CountdownEvent
, which ParallelProcedure
signals when calculations are complete.
Here is the code:
private static void ParallelVectorOptimisations(
int[] matrix,
int block_count,
int block_size)
{
var iteration_sync = new CountdownEvent(0);
var lineral_block_size = block_size * block_size;
var lineral_block_row_size = block_count * lineral_block_size;
for (var m = 0; m < block_count; ++m)
{
var offset_mm = m * lineral_block_row_size + m * lineral_block_size;
var mm = new Span<int>(matrix, offset_mm, lineral_block_size);
Procedure(mm, mm, mm, block_size);
// We calculate how many signals we expect to receive
// (when all signals are received the event itself becomes signalled)
//
// We expect to have one row of blocks...
// ("horizontal" blocks except diagonal block)
//
// ...and one column of blocks
// ("vertical" blocks except diagonal block)
//
// Hence, 2 * block_count - 2 = 2 * (block_count - 1)
//
iteration_sync.Reset(2 * (block_count - 1));
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
var offset_mi = m * lineral_block_row_size + i * lineral_block_size;
var message_x = new ParallelMessage(
iteration_sync, matrix, offset_im, offset_im, offset_mm,
block_size, lineral_block_size);
var message_y = new ParallelMessage(
iteration_sync, matrix, offset_mi, offset_mm, offset_mi,
block_size, lineral_block_size);
ThreadPool.QueueUserWorkItem(ParallelProcedure, message_x, false);
ThreadPool.QueueUserWorkItem(ParallelProcedure, message_y, false);
}
}
// Here we are waiting for all "horizontal" and "vertical" blocks
// to be recalculated
//
iteration_sync.Wait();
// Now we expect all blocks except one column and one row
// which essentially mean if we have a 4x4 matrix of blocks
// we expect to recalculate 3x3 blocks
//
// Hence, (block_count - 1) * (block_count - 1)
//
iteration_sync.Reset((block_count - 1) * (block_count - 1));
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
for (var j = 0; j < block_count; ++j)
{
if (j != m)
{
var offset_ij = i * lineral_block_row_size + j * lineral_block_size;
var offset_mj = m * lineral_block_row_size + j * lineral_block_size;
var message = new ParallelMessage(
iteration_sync, matrix, offset_ij, offset_im, offset_mj,
block_size, lineral_block_size);
ThreadPool.QueueUserWorkItem(ParallelProcedure, message, false);
}
}
}
}
// Waiting for all "peripheral" blocks to be recalculated
// before moving to the next iteration, with next "diagonal" block
//
iteration_sync.Wait();
}
}
The implementation is straightforward (except, maybe the usage of CountdownEvent
and .NET concurrency primitives).
Let’s see how combination of vectorisation and parallelism impacts the performance.
Experimental Results
Here are the execution results:
Algorithm (Parallel Vector) |
Graph |
Mean (s) |
Error (s) |
StdDev (s) |
LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall |
4800 |
32.311 s |
0.0542 s |
0.0480 s |
4,277,045,385 |
Blocked Floyd-Warshall |
4800 (120) |
3.396 s |
0.0014 s |
0.0013 s |
90,435,311 |
Table 7. Experimental results of “Parallel Vector” implementation of Floyd-Warshall and Blocked Floyd-Warshall algorithms.
This isn’t a mistake. Introduction of the parallelism into Blocked Floyd-Warshall algorithm results in almost x20 performance improvement compared to vectorised version, while Floyd-Warshall speedup is much more moderate – around x2.
Running the code under the profiles reveals the reasons:
Algorithm (Vector) |
Clock Ticks |
Instructions Retired |
CPI Rate |
Memory Bound |
---|---|---|---|---|
Floyd-Warshall |
2,038,598,100,000 |
606,769,800,000 |
3.360 |
78.9% |
Blocked Floyd-Warshall |
254,444,400,000 |
483,594,300,000 |
0.526 |
0.0% |
Table 8. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of Intel VTune profiler for “Vector” implementations of Floyd-Warshall and Blocked Floyd-Warshall algorithms.
The Memory Bound is crushing 78.9% – which essentially means, most of the time, the CPU was waiting for data from memory instead of doing any kind of calculations.
This might seem unexpected because before the stats of both algorithms were quite close. But don’t forget the important part – modern CPU’s are extremely good at single-threaded execution but it is very different when it comes to multi-threading.
Just look at the results of parallel (non-vectorised) version of Floyd-Warshall algorithm:
Algorithm (Parallel) |
Graph |
Mean (s) |
Error (s) |
StdDev (s) |
LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall |
4800 |
27.629 s |
0.0101 s |
0.0094 s |
4,571,752,038 |
Table 9. Experimental results of “Parallel” implementation of Floyd-Warshall algorithm.
It is FASTER (27.629 seconds vs 32.311 seconds) than vectorised version because it puts less pressure on memory:
Algorithm (Parallel) |
Clock Ticks |
Instructions Retired |
CPI Rate |
Memory Bound |
---|---|---|---|---|
Floyd-Warshall |
1,907,472,000,000 |
2,449,542,900,000 |
0.779 |
30.5% |
Table 10. Clock Ticks, Instructions Retired, CPI Rate and Memory Bound parameters from the output of Intel VTune profiler for “Parallel” implementations of Floyd-Warshall algorithms.
The Floyd-Warshall algorithm is bound by memory and its scaling is limited (even using more cores will result in just a tiny speedup). However, this doesn’t apply to Blocked Floyd-Warshall which actually CAN BE improved because currently it is not limited nor by CPU nor by memory. But this is a completely different story.
Conclusion
In this post we implemented one more algorithm for solving all-pairs shortest path problem. Along the way, we learned basics of caching, reiterated on linear memory access, vectorisation and basic concurrency capabilities of .NET.
We also saw with our own eyes how fast can be cache-aware (aka cache-friendly) algorithms, especially when it comes to parallel execution.
I hope you enjoyed the reading and THANK YOU for staying to the end.
References
- You can find mentioned latency values and more on 7-Zip LZMA Benchmarks.
- Venkataraman, G., Sahni, S., Mukhopadhyaya, S. A Blocked All-Pairs Shortest Paths Algorithm // Journal of Experimental Algorithmics (JEA). Vol 8. 2003. P. 857 – 874.
- Karasik, O. N., and A. A. Prihozhy. “Tuning Block-Parallel All-Pairs Shortest Path Algorithm For Efficient Multi-Core Implementation.” Системный анализ и прикладная информатика 3 (2022): 57-65.