Optimizing Custom Workloads with RISC-V

News Room | Published 7 August 2025

Transcript

Henry: The talk is going to be about optimizing custom workloads with RISC-V. I’m sure that you’ve all heard about the latest developments in the tech industry around AI. It’s been everywhere, and we’re integrating it absolutely everywhere, from search engines to cars, even dishwashers. It’s something we’re all spending a lot of time developing software for. Getting access to the hardware, though, has been very tough. We just don’t have enough hardware. How many of you are using NVIDIA GPUs in the cloud and are struggling to get access to them? The good news is there’s really a boom in the development of viable and credible alternatives to NVIDIA and the other main GPU providers.

For example, AMD with their MI300 and upcoming MI350. Intel with Advanced Matrix Extensions and the Gaudi platform. ARM with their Scalable Matrix Extension, and plenty of other hardware startups, among which is the one I’m part of. In this talk, I want to talk about RISC-V: how it fits into this rapidly evolving ecosystem, how the community is working together to make it a reality and able to tackle the challenges of today and tomorrow, and how you can take advantage of it and participate in it.

I’m Ludovic. I’m a software engineer and team lead at Rivos. We’re a hardware company based in Santa Clara. My role there is to manage the managed runtimes teams, which take care of OpenJDK, Python, and Go; the system libraries team, which is, for example, contributing to OpenBLAS and contributing libraries for our accelerator; and everything around profiling. Rivos is a hardware company making a CPU with an accelerator, and our goal is to sell, basically, a server. Nonetheless, we’re investing very heavily in software, because we understand that a box without the software in it is just a box. It allows you to hold some books. That’s pretty much it.

As part of that, I’m also a Language Runtimes Working Group lead at RISE, which I’m going to go into in more detail afterwards. RISE is a collaborative effort to accelerate the development of open-source software for the RISC-V architecture. The goal here is really to develop software for RISC-V. The focus of this working group is, again, on everything OpenJDK, Go, Python, .NET, JavaScript, Rust, WebAssembly. I’m not the only one working on all of that; there are quite a few members contributing. In these topics, we’re really looking at investing in the compilers.

For example, in OpenJDK C1, C2. In the runtimes, so class loading, everything that you all know and love about Java. Also, the ecosystem, which is very important. It’s not enough to just have the runtime working, like OpenJDK, the Python runtime, the Go runtime. You need all the libraries: all the Python libraries, all the Go libraries, all the Java libraries also have to be available. For example, Apache Spark. We went and submitted patches for all the dependencies of Apache Spark which have a native dependency, and made sure they’re available on RISC-V. If tomorrow you want to use Apache Spark on RISC-V, it just works because the dependencies are available.

AI’s Expanding Influence and Hardware Demands

Just to come back to AI. AI is everywhere. I think at least 60% or 70% of the presentations at QCon are about AI. It’s in search engines, for example Google Gemini; even OpenAI now has a search engine. There are autonomous systems, for example Waymo with all the cars out there; it’s pretty impressive what they’re able to do. And IoT, for example Apple Intelligence. Not exactly IoT, it’s a smartphone, but the idea is it’s very small and has to have very low energy usage.

Then, of course, you have OpenAI, and Anthropic, and a few others, which provide a lot of big LLM models. We can see that AI is really driving demand for high-performance and very specialized hardware. I mentioned NVIDIA before, for example. Why is that? Here we can see a graph of the petaflop demand for different models over the years. We can see that a while back, it would not even take a petaflop of total compute to train a model. You can also see that starting more or less in 2010, 2015, there was a huge acceleration in the need for compute. The problem is, CPUs are able to do teraflops of operations.

Basically, a thousand billion floating-point operations per second. Given that the need now is in the petaflops, so a million billion floating-point operations per second, you really need something faster. That’s where GPUs lately have entered the petaflops era. For example, an NVIDIA H100 is able to do 3.9 petaflops per GPU, so that’s 3.9 million billion floating-point operations per second. It’s an absolutely crazy number, because it’s literally 4,000 times faster than a CPU. And it’s still very small; it’s not a mainframe. Where that comes from is really having very dedicated silicon, called Tensor Cores, on the GPU that can do these very specific operations.

The limited availability and the high cost of GPUs are still a big issue, like I mentioned earlier. An NVIDIA DGX H100, so basically just a server with a bunch of GPUs in it, costs $250,000, more or less. A rack of those, so basically a fridge-sized cabinet full of GPUs, consumes around 120 kilowatts of energy. That’s per rack. The problem with that is that in data centers until about five years ago, a normal rack would consume around 24 kilowatts.

Basically, a fifth of what’s needed here. What that means is all the existing data centers just can’t feed that. Now you need to build whole new data centers with whole new levels of energy provision, cooling, and everything that goes with it. The problem is building a data center costs billions. It’s very expensive. Getting access to GPUs to fill up the data center takes years. If you go to NVIDIA today and you want to buy a GPU, they’re going to put you at the back of the list and you’re going to get it delivered in a year. Access to GPUs on the cloud is still a big pain. You have spot instances, but they are not always available and are hard to come by. Luckily, a lot of startups are coming in and starting to have new offerings, which challenges NVIDIA in that sense.

Current Solutions and Their Limits

Among the existing solutions, most are proprietary. You have, for example, NVIDIA. On the left here is basically an exploded view of a DGX server with the different switches and everything. You can see it’s a pretty complex and complete offering. In the middle, you have, for example, the Intel Gaudi offering, which is an accelerator from Intel. You also have an offering from AMD, which is not shown here, with the MI300 and MI350. You have the Google TPUs. You have AWS Inferentia. You have AWS Trainium. You have Microsoft Maia. You can see there are a bunch of options, but the problem is they’re all proprietary.

Also, the software stack is a big pain point. NVIDIA has an amazing software stack with CUDA. Everything supports CUDA: PyTorch, TensorFlow, everything. Not everything is going to support, for example, the software stack of AMD, of Intel, of AWS, of Microsoft. All of these companies need to go into all these widely used projects and make sure that they slide in their backend, for example ROCm for AMD, or Microsoft’s software stack, to use their accelerator. They have to go and retrofit all of this in PyTorch and TensorFlow, and then it has to trickle down to thousands of other software projects out there that would need it.

NVIDIA really has a very tight grip on all of this. That’s really where the GPU dominance comes from. That’s also why NVIDIA is the most valuable company at the moment. I think it was worth $3.6 trillion. I think the second most valuable company is Apple. A lot of people have an iPhone; not everyone has an NVIDIA GPU. The domination has no end in sight. No one today has an offering that is able to compete with them. AMD has an amazing GPU, or accelerator, but it’s not there yet. NVIDIA’s dominance comes from amazing performance, they’re among the fastest, the fastest along with AMD, and from the completeness of their offering. They have GPUs. They have interconnect to connect the GPUs together. They have cooling. They have servers. They have everything.

Basically, you go, you buy a rack and you sign a big check, and you get something fast and complete, and the software works. That’s where really NVIDIA has a moat. CUDA is used everywhere, everyone uses CUDA. AMD is making some progress with their ROCm stack, but it’s not there yet. It’s the second best, but it’s not there yet.

Then, Intel, Google, AWS, Microsoft, and everyone else also have a lot of work to do. Another challenge is the cost challenge. Like I said, it’s a bit more than a million dollars per rack, at $250,000 or so per server. With several servers per rack, it’s more than a million dollars for one rack. That’s only the capital expenditure. That’s only how much you pay to have the rack. Because then you need to pay for the electricity and the cooling for the 120 kilowatts of energy. That costs a lot of money. For example, xAI recently announced they delivered a big GPU cluster with 100,000 GPUs. It cost them billions of dollars for the data center and all the GPUs that come in it, all the interconnect, all the cooling, the redundancy, and everything. It’s a very big business. It’s very costly.

The second challenge is really the power. Like I said earlier, a GPU consumes around 700 watts. That’s for one GPU. You need to build a whole new data center to be able to hold that. That’s for the current generation of GPUs. The next generation of NVIDIA GPUs is going to consume more than 1000 watts per GPU again. Even the data centers which were built for this generation of GPUs are not good anymore. We need to build bigger.

Again, that’s a data center. It’s a whole building with the cooling, the energy, and everything. You have to have this constant rebuilding of things, which is extremely costly. Here we have a graph, a projection from Goldman Sachs on the power requirements. What we can see is that until 2019, early 2020, data center power demand was fairly stable. It’s not that the number of machines was stable; it’s that even though there were more machines, they were using less energy, so the overall energy demand was flat. With the explosion of GPUs and GPUs in the cloud, around 2020, 2022, there’s a bit of COVID in there as well, we see the energy demand just exploding. And that’s just for the data centers.

Goldman Sachs estimates that by 2030, data centers will use 8% of the total energy in the U.S., compared to 3% in 2022. The share of energy used by data centers will more than double, knowing that the overall market for energy is also going up. Power is a big challenge for the current GPU and AI wave we are riding right now.

Overall, we need solutions because the way we are going is fine for now, it’s what we have, but it’s not viable in the long term. We need to do better. We need adaptable, cost-efficient, and open solutions. One of the main things is they need to be sized and optimized for the problem at hand. You’re not going to have an NVIDIA GPU in an iPhone; it just has different needs. You cannot have 1000-watt GPUs everywhere. You cannot have one in a dishwasher. You cannot have one in a toothbrush. Not every application and model requires petaflops. We also need smaller things for workloads that don’t need that much compute. Not everything requires training speeds with data-center-scale scale-out.

For example, you don’t need 100,000 GPUs for everything. If you are doing a recommendation for a small website, you don’t need 10,000 GPUs. You don’t need a fully connected cluster of GPUs. You need something smaller, more cost-efficient for your specific needs. That’s largely going to be driven by more competition, which is going to drive lower cost and more availability.

RISC-V: A Modular, Open Approach

That’s where RISC-V comes into play, the way I see it. What is RISC-V? RISC-V is an open standard ISA. It’s free to use and customize. What is an ISA? Examples of other ISAs are x86 and ARM64. An ISA is basically a specification of the assembly instructions, of how a machine should behave, and of what registers are available. Basically, it’s a specification of what the machine should look like so that you can write software for it. Vendors can come, and by implementing this ISA, they are able to just have the software work on top of it. It’s where software and hardware meet.

If both respect the specification, then everything just works together. The foundation taking care of RISC-V is called RISC-V International, or RVI. There are more than 4,600 members: companies, universities, and also individuals. Some example members are Rivos – the company I work for – Google, NVIDIA, Intel, SiFive, Qualcomm, AMD, Huawei, Synopsys. There are plenty of hardware or hardware-related companies being part of it, and even companies which have nothing to do with hardware, just because they want to be part of it. The building block of RISC-V is really extensions. An extension is going to specify something.

Then, by combining all these extensions, you get the RISC-V spec. Then vendors or hardware manufacturers, like Rivos or SiFive, for example, are going to say: I’m going to make hardware with this extension, this extension, and this extension. Any software that respects that is going to work on it. The idea of the extensions is that they are proposed, discussed, and ratified by RVI members.

For example, we at Rivos can go to RVI and say: I want to add an instruction doing this, this, and that, useful in this context; here’s software as a proof point that it helps. Then we are going to discuss it with Google, with Intel, with NVIDIA, with whoever wants to be part of it, and we are going to ratify it together. Once everyone agrees on the shape and form and language and everything of the spec, it becomes part of the RISC-V spec, and everyone can go and implement it.
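
As a small aside on how software sees these extensions in practice, here is a minimal C sketch using the preprocessor macros that GCC and Clang define for RISC-V targets; which macros are defined depends on the compiler version and the -march string you build with (for example -march=rv64gcv), so treat this as an illustration rather than an exhaustive list.

```c
/* Sketch: compile-time detection of RISC-V extensions.
 * Assumes a GCC/Clang RISC-V target; macro availability varies with the
 * compiler version and the -march string (e.g. -march=rv64gcv). */
#include <stdio.h>

int main(void) {
#if defined(__riscv)
    printf("RISC-V target, XLEN = %d\n", __riscv_xlen);
#if defined(__riscv_vector)
    printf("Vector extension (V) enabled at compile time\n");
#else
    printf("Vector extension (V) not enabled\n");
#endif
#else
    printf("Not a RISC-V target\n");
#endif
    return 0;
}
```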

The other interesting thing about RISC-V is that it really scales from IoT to supercomputers. For example, if anyone has an iPhone, you have RISC-V CPUs in there. There is not just one CPU in an iPhone. A phone is composed of dozens of processors: there are, for example, sound processors, DSP processors, the Bluetooth processor, and everything. We know for a fact that some of them are RISC-V, just because it’s free. Apple knows how to make CPUs, so why not use RISC-V and not pay fees to ARM for all of these processors? There are already billions of processors on SoCs being shipped yearly.

I think in 2024, something like 4 billion or 5 billion System on Chips were shipped using RISC-V. The forecast is that by 2031, it will represent around 20 billion SoCs shipped with RISC-V, representing around 25% of the market share. Of course, we hope that the data center is going to be a good foothold. Where there’s going to be a much stronger foothold is everything consumer, like Bluetooth headphones, for example, a big one. It also scales all the way up to supercomputers.

For example, the Barcelona Supercomputing Center is developing a chip based on RISC-V for supercomputers for their own use cases. It really goes from Bluetooth headsets to supercomputers. There is also a growing ecosystem with contributions coming from academia and industry. For example, the Barcelona Supercomputing Center is academia, Rivos is industry. What follows from that is that RVI is very focused on the hardware spec, on defining what the hardware should do.

Then, the complementary foundation to that is called RISE, for RISC-V Software Enablement. It is focused exclusively on software. That’s the foundation where I lead one of the working groups. A member of RVI can also be a member of RISE, of course. We see members like Google, Intel, NVIDIA, SiFive, Qualcomm, Red Hat, Canonical, and a bunch of others. We’re working on everything software: kernel, compilers, libraries, runtimes, tooling, some common firmware, emulators, anything you can think of that would allow developers to target RISC-V, just to make sure that all of this software works on RISC-V out of the box. RISE contributes exclusively upstream. We don’t fork anything. We’re not interested in that, because, again, we care that everything just works. We don’t want to keep things for ourselves. Everything we do is open source. Everything goes upstream.

Linear Algebra: The Backbone of AI

That was my introduction to RISC-V. Here, I want to dive back into AI and really use an example, the example of linear algebra and OpenBLAS, to highlight how it’s working. Linear algebra is really the backbone of AI. It’s the math of AI. It’s all about vector and matrix operations. For example, here on the right, we have a schema of how a Llama model works with its different parts. For example, you have multi-head attention, which is part of the prefill phase here.

Then you have the decode phase here. Each of these boxes is going to be a matrix multiplication. It’s going to be linear algebra. And it’s not just going to be five or so matrix multiplications for one model; a model is going to be hundreds of matrix multiplications. For every new token, you are going to have multiple matrix multiplications, and pretty large ones at that. It’s an extremely compute-intensive problem. The complexity is O(n^3), meaning that if you have a matrix of size 4000, you literally need around 64 billion floating-point operations to get the matrix multiply result. The good news is it’s highly parallelizable. That’s where GPUs really shine, because GPUs are great at doing a lot of things in parallel. That’s where you really benefit on GPUs, for example, from having Tensor Cores, which are highly specialized pieces of hardware that do this specific problem of matrix multiplication just a lot faster.
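
To make that 64 billion figure concrete, counting each multiply-add as a single operation, an n-by-n matrix multiply needs n^3 of them:

$$ n = 4000: \quad n^3 = 4000^3 = 6.4 \times 10^{10} \approx 64 \text{ billion multiply-adds} $$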

Just to give you an example, with TinyLlama running on llama.cpp, which is all on CPU, we can observe around 90% of the time spent in matrix multiplication. That’s just for one model. If we double the speed of matrix multiplication, we cut the total runtime of the model roughly in half.
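
As a rough sanity check on that claim, Amdahl’s law with 90% of the time in matrix multiplication and a 2x speedup of that part gives:

$$ \text{speedup} = \frac{1}{(1 - 0.9) + 0.9 / 2} = \frac{1}{0.55} \approx 1.8 $$

so the total runtime does drop to a bit more than half.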

Here we’re taking OpenBLAS as an example. Why OpenBLAS? It’s just a critical library for performance optimization in the space of linear algebra. It’s completely open source. It supports many architectures: x86, ARM, PowerPC, LoongArch, RISC-V, and many more. It has kernels optimized for micro-architectural details: for Skylake, there’s a kernel optimized for that, for Haswell, for Sapphire Rapids, and that’s just for Intel. AMD is going to have optimizations for Bulldozer, for Zen, for basically every generation of CPU they have. ARM is going to have optimizations as well, again, just to make it fast. It’s also used extensively by many projects, among which NumPy, PyTorch, pandas, Apache Spark. It’s basically the de facto standard, just because the license is permissive and it’s open source. There are some alternatives, like Intel MKL, which is optimized for Intel CPUs by Intel; that’s why it’s mostly used only on Intel. BLIS is another alternative. It’s also open source, and a bit cleaner, but it’s not as fast.

OpenBLAS Optimization Journey: Basics

What is matrix multiplication? The very basic idea is that you have three nested for loops. Matrix A is going to be of size M times K, B of size K times N, and C, the result, of size M times N. The simplest implementation is to run those loops and do C += A times B. That’s very much the basics. If you implement that in Python, it’s not going to be very fast. If you implement that in C, it’s going to be a bit faster.
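
As a point of reference, here is roughly what that naive version looks like in C, assuming column-major storage with leading dimensions (the BLAS convention, which comes up again below). This is an illustrative sketch, not the actual OpenBLAS code.

```c
/* Naive column-major GEMM sketch: C += A * B.
 * A is M x K, B is K x N, C is M x N; lda/ldb/ldc are leading dimensions.
 * Illustrative only -- not the OpenBLAS implementation. */
void gemm_naive(int M, int N, int K,
                const float *A, int lda,
                const float *B, int ldb,
                float *C, int ldc) {
    for (int j = 0; j < N; j++)              /* column of C and B */
        for (int i = 0; i < M; i++)          /* row of C and A */
            for (int k = 0; k < K; k++)      /* walk the inner dimension */
                C[j * ldc + i] += A[k * lda + i] * B[j * ldb + k];
}
```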

This implementation is still not going to be that fast because you have to rely on the compiler to accelerate things. The problem is, the compiler is a generic compiler. It’s not a matrix multiply optimizer. It’s not going to know all the tricks you could use to optimize matrix multiply very specifically, like loop unrolling, auto-vectorization, and a few others we’re going to see later. So what does OpenBLAS do to go faster? It uses a two-by-two block: it takes these four values here and these four values there, and it outputs into these four values. It goes a bit faster because by reading only four plus four elements, you do eight multiply-adds; the compute grows as O(n^3) while the reads grow as O(n^2), so blocking improves the ratio of compute to memory traffic. It also does FMA, Fused Multiply-Add. As you can see, we’re doing a multiplication and an add.

That could be a multiply instruction followed by an add instruction. The idea is to just use an FMA, so a multiply plus an add in a single instruction. Instead of taking two cycles, we take one cycle. These are pretty basic things, but they are things that the compiler may not know how to do because, again, it’s a generic compiler, not a matrix multiply optimizer. These are the small things it misses in the generic case.
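
Here is a minimal sketch of that two-by-two register-blocking idea, again column-major: each pass over k reads two values from A and two from B and feeds four accumulators, and depending on compiler flags each of those multiply-adds can be contracted into a single FMA instruction. The real OpenBLAS kernels use much larger blocks written with intrinsics or assembly; this only shows the principle.

```c
/* 2x2 register-blocked GEMM sketch (column-major), C += A * B.
 * Assumes M and N are even for brevity; illustrative only. */
void gemm_2x2(int M, int N, int K,
              const float *A, int lda,
              const float *B, int ldb,
              float *C, int ldc) {
    for (int j = 0; j < N; j += 2) {
        for (int i = 0; i < M; i += 2) {
            float c00 = 0, c10 = 0, c01 = 0, c11 = 0;
            for (int k = 0; k < K; k++) {
                /* two loads from A and two from B feed four multiply-adds */
                float a0 = A[k * lda + i];
                float a1 = A[k * lda + i + 1];
                float b0 = B[j * ldb + k];
                float b1 = B[(j + 1) * ldb + k];
                c00 += a0 * b0;   /* each of these can become one FMA */
                c10 += a1 * b0;
                c01 += a0 * b1;
                c11 += a1 * b1;
            }
            C[j * ldc + i]           += c00;
            C[j * ldc + i + 1]       += c10;
            C[(j + 1) * ldc + i]     += c01;
            C[(j + 1) * ldc + i + 1] += c11;
        }
    }
}
```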

There’s another issue in exactly how the matrices are stored: the concept of row-major versus column-major. Row-major means you store the rows first: A11, A12, A13, then A21. Column-major means you store the columns first: A11, A21, A31, then A12. You have a 2D array; in which order do you store the data? That’s basically the idea. BLAS in general, so BLAS, OpenBLAS, BLIS, all of them, use column-major as the default because that’s how the original implementation was, and that’s how every derivative is implemented. It has, of course, a huge impact on memory access, so on cache hits, cache misses, and everything. For small matrices, OpenBLAS has a specific kernel for each shape, because you also have the concept of transposed versus non-transposed.

The default is column-major, but you may say that my matrix B, for example, is transposed, and so it’s actually going to be stored as row-major. Those are the different combinations you can have for A and B: A can be transposed or not, B can be transposed or not, and C is always in column-major order. To handle the four combinations of A and B, you have a specific kernel for each of the four cases.
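
A quick way to see the difference is the indexing formula itself. The two macros below (hypothetical helpers, just for illustration) show where the same logical element (i, j) lands in memory in each layout; marking a matrix as transposed effectively swaps which formula applies to it, which is why there are separate kernels for the four cases.

```c
/* Element (i, j) of a matrix with leading dimension ld.
 * Hypothetical helper macros to show the two layouts. */
#define IDX_COL_MAJOR(i, j, ld) ((j) * (ld) + (i))  /* columns are contiguous */
#define IDX_ROW_MAJOR(i, j, ld) ((i) * (ld) + (j))  /* rows are contiguous */
```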

For larger matrices, having a kernel per case doesn’t work as well, because there’s a faster way of doing it. If A is transposed and B is not transposed, that’s the fastest case in terms of memory and cache accesses. So if A, for example, is not in the right format, you make a copy of A in the right format, and then you do the matrix multiply. The point is that the transpose is fairly cheap compared to the matrix multiply, so you’re willing to pay this upfront cost to get a faster matrix multiply. That’s a tradeoff you can make. Again, it’s absolutely not something the compiler can do, because it doesn’t understand any of that. These are all things that OpenBLAS does by hand in the way the code is implemented.
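
A sketch of that copy-then-multiply idea: if A arrives in the wrong layout, pack it once into a buffer in the layout the kernel prefers, then run the kernel. The packing is O(n^2) and the multiply is O(n^3), so the upfront cost is amortized. The function name here is hypothetical, not the OpenBLAS API.

```c
#include <stdlib.h>

/* Hypothetical packing step: copy a column-major M x K matrix A into a
 * row-major buffer so a kernel can stream rows of A contiguously. */
float *pack_A_to_row_major(int M, int K, const float *A, int lda) {
    float *packed = malloc((size_t)M * K * sizeof *packed);
    if (!packed) return NULL;
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++)
            packed[(size_t)i * K + k] = A[(size_t)k * lda + i];  /* O(M*K) copies */
    return packed;  /* caller frees; the O(n^3) multiply amortizes this cost */
}
```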

OpenBLAS Optimization Journey: Vectorization

That was for the generic optimizations. Auto-vectorization helps with a lot of this, but the compiler cannot optimize the memory access patterns. The effectiveness of auto-vectorization is pretty limited, and that’s why it requires us to write handwritten kernels. That’s where vectorization comes in. The idea of vectorization is that you have specific instructions that are able to apply one instruction to multiple data. If, for example, you have an array of eight values, instead of doing eight multiplies followed by eight additions, or eight Fused Multiply-Adds, you use one vectorized FMA to do all eight values in a single instruction.

The idea is that instead of taking eight cycles, you take only one cycle because you have one instruction. I’m simplifying a lot, but that’s the idea. The RISC-V Vector Extension was ratified in September 2021. We have started to see hardware shipping with it in the middle of this year. The hardware is pretty limited, but it allows us to actually start playing with something specific. The vector extension in RISC-V is similar to ARM SVE, which stands for Scalable Vector Extension. The idea is that it’s vector length agnostic, as in, the spec says: here’s how you do a multiply between two vectors. The spec also says the vector can be 64 bits, 128 bits, 256 bits, any power of two between 64 bits and 16k.

The hardware is free to implement it as 128, 256, 512 bits, and so on. For performance reasons, you also have ways to write code which is vector length specific: you know the vector is 256 bits, so you write instructions for that explicitly. That’s how it’s done in OpenBLAS, again just for performance reasons. You basically have multiple backends in OpenBLAS, which get selected based on the capabilities of the hardware. If you’re running on hardware with 128-bit vectors, we use the 128-bit kernel. If we’re on 256 bits, we use the 256-bit kernel.
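
To show what the vector-length-agnostic style looks like in practice, here is a minimal sketch of y += a * x using the ratified RVV C intrinsics; the exact intrinsic prefixes can differ between toolchain versions, so treat the names as indicative rather than definitive.

```c
#include <stddef.h>
#include <riscv_vector.h>

/* Vector-length-agnostic y[i] += a * x[i].
 * The same binary runs on 128-, 256-, or 512-bit implementations because
 * the active vector length vl is requested at run time. */
void saxpy_rvv(size_t n, float a, const float *x, float *y) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);             /* lanes this round */
        vfloat32m1_t vx = __riscv_vle32_v_f32m1(x, vl);  /* load x */
        vfloat32m1_t vy = __riscv_vle32_v_f32m1(y, vl);  /* load y */
        vy = __riscv_vfmacc_vf_f32m1(vy, a, vx, vl);     /* fused multiply-add */
        __riscv_vse32_v_f32m1(y, vy, vl);                /* store y */
        x += vl; y += vl; n -= vl;
    }
}
```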

Again, it’s really a tradeoff. You have a lot more code duplication, because, again, 128, 256, 512, but the performance is better. Here, the focus is really on performance, not ease of maintainability. Of course, you want the code to be as clean as possible, but the goal is performance. You just want things to be fast. That’s really one of these cases where so many cycles are spent on that problem, that it’s worth spending a bit more engineering time to save machine time later. That’s where it really needs platform micro-architecture specific optimizations, rather than just letting the compiler do it for you.
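
As for the run-time kernel selection mentioned just above, here is a hedged sketch of the principle; OpenBLAS has its own dynamic-architecture machinery, and this only illustrates querying the implemented vector length and picking a kernel family accordingly.

```c
#include <stddef.h>
#include <stdio.h>
#include <riscv_vector.h>

/* Sketch: pick a kernel family based on the implemented VLEN. */
int main(void) {
    /* VLMAX for 8-bit elements at LMUL=1 equals VLEN in bytes. */
    size_t vlen_bits = __riscv_vsetvlmax_e8m1() * 8;
    if (vlen_bits >= 256)
        printf("selecting the 256-bit kernels\n");
    else
        printf("selecting the 128-bit kernels\n");
    return 0;
}
```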

OpenBLAS Optimization Journey: Transpose

That was vectorization. Just to come back to transpose, I want to use it as an example of some of the things RISC-V also allows us to do. Transpose is an O(n^2) problem with a low cache hit rate, which is the main problem. It’s mostly a backend-bound problem, as in the bottleneck is accessing memory. The instructions are fairly simple: if you want to transpose, you load A11, A21, A31, and you just store them. It’s literally just loads and stores from memory, and you shuffle things around. The instructions are very simple, but they spend all of their time just waiting for data to arrive from RAM. There are instructions in RISC-V which are very generic and allow you to shuffle data within vectors. Because they’re very generic, they’re very costly both in terms of silicon and in terms of energy, and they’re pretty difficult to implement in an optimal manner. The idea here is: what if we had specific instructions that do this specific operation? The instructions are simpler because we already know what they do.
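
To see why it is purely memory bound, here is what the plain scalar version looks like in column-major C: almost no arithmetic, just loads and stores where one side of the copy is always strided. This is an illustrative sketch.

```c
/* Out-of-place transpose sketch, column-major: B = A^T.
 * A is M x N with leading dimension lda; B is N x M with ldb.
 * Nearly zero arithmetic -- the strided accesses dominate the runtime. */
void transpose(int M, int N, const float *A, int lda, float *B, int ldb) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            B[i * ldb + j] = A[j * lda + i];  /* B(j, i) = A(i, j) */
}
```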

Because such dedicated instructions are simpler, the silicon is simpler, smaller, more power efficient, and lower latency. RISC-V has the concept of a vendor-specific extension. The idea is that in the ISA, in the encoding space for instructions, there is a region reserved where the vendor can do whatever they want. For example, we at Rivos can say: I’m going to define an instruction which is not part of the spec, but which is going to do X, Y, and Z, and it’s going to live in my space. I’m not going to collide with anything else. That’s it. What this really allows, for Rivos, for SiFive, or for any of the other vendors, is to innovate. Because we can figure out on our own: our customers care about this workload; if we want to accelerate this workload, we can add this instruction.

This instruction is also going to help this other workload. We can put it in silicon. We can update a few software stacks to use it. We can get customer feedback on it. Then we can verify: yes, it’s actually helpful; yes, it allows for lower energy usage, without RVI ever being involved yet. This really allows us to innovate and to iterate without getting bureaucracy in the way. Once we have identified that this instruction is very helpful, the problem is that we are not going to be able to go into every piece of software in the world and have it use our instruction, because, again, this instruction is only available on our hardware. That’s where standardization comes in to help.

We go to RVI and say, “We have this instruction. It’s very helpful in these use cases. What about we standardize it? Because that’s going to allow other vendors to have access to it, and we’re going to be able to have more software supporting it”. It’s really a tradeoff, with the carrot being software support. Because if it’s part of the spec, more people are going to do the work, and we’re not going to have to do the work for everyone. That’s what we did for transpose. We have been contributing it to RVI, and now it’s basically going to be part of the RISC-V spec, and we’re going to be able to go to OpenBLAS and get it supported there.
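
To make the vendor-specific encoding space described above a bit more tangible, here is a hedged sketch of how a vendor-defined instruction can be reached from C with the GNU assembler’s .insn directive, using one of the custom opcode spaces RISC-V reserves for vendors. The opcode and function fields below are purely illustrative; this is not a real Rivos instruction, and it would only do something useful on hardware that actually implements it.

```c
#include <stdint.h>

/* Hypothetical vendor instruction in the custom-0 opcode space (0x0b).
 * The funct3/funct7 values are made up for illustration; executing this
 * on hardware that does not implement the instruction will trap. */
static inline uint64_t vendor_shuffle(uint64_t a, uint64_t b) {
    uint64_t d;
    __asm__ volatile(".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
                     : "=r"(d)
                     : "r"(a), "r"(b));
    return d;
}
```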

OpenBLAS Optimization Journey: Going Further

Just to go a bit further, there is a matrix multiply extension also being discussed at RVI, because matrix multiply is such an expensive problem. Right now, it’s using the vector extension, but a matrix extension can be even faster and even more energy efficient. It’s a bit similar to Tensor Cores, where you basically define vectors or blocks of data and just let the hardware do the matrix multiply for you. You don’t have to run all the loops. You don’t have to do a two-by-two, or a four-by-four, or an eight-by-eight, or anything like that. You just tell the hardware: take this data, do the matrix multiply, output it there, and you’re good to go. There are different ways to implement it, from a fully integrated facility where you just reuse the vector registers, all the way to an attached facility where the matrix extension uses its own set of registers. They are different approaches we could take as part of the spec, each with different advantages.

For example, the attached facility allows it to act as a full coprocessor, which allows us to be even more energy efficient, even faster, and everything, like Tensor Cores. They are basically different things. What’s fascinating to me is that all of these discussions are happening in public. You have people from Rivos, from Google, from NVIDIA, all these companies at RVI, discussing how to do matrix extensions together so that they can converge on the spec and make sure that the software of the world can support RISC-V and the matrix extension on RISC-V. There are actually two specs. It’s not impossible that both get accepted, which is perfectly fine; they’re basically different ways of doing it.

All of the discussions I just mentioned are happening on mailing lists, and these mailing lists are open. If you’re interested, you should go check them out, see who is contributing, see who is looking at that. That’s something really specific to RISC-V. It’s really its nature, what makes RISC-V what it is. It is not RVI making the spec. Again, I really want to stress that. RVI is just the conduit for the spec; it’s really the RVI members who are making the spec. If you are working at a company that wants to get involved in this space, you go to RVI, you become a member, and you’re able to influence the RISC-V spec.

What are the expected gains? Higher throughput, lower latency, better performance in general. Higher performance per watt with more efficient silicon. The difficulty is that it needs to be future-proof. For example, we can say we want to support FP32, FP16, FP8, INT8, and FP4, or BF16. Then, let’s say tomorrow someone comes up with FP2 or some other new format, who knows? We need to be able to say, in the future, let’s have another extension that adds this other format. We really have to be future-proof, because we don’t want to get ourselves into a position where we’re not able to evolve the spec, because we know the formats are going to change. The algorithms are also going to change. Right now, it’s all about dense matrices. Maybe tomorrow it’s going to be about sparse matrices. We have to have enough flexibility in the spec to allow us to be future-proof and able to adapt in the future.

RISC-V vs. Other Architectures

I want to come back a bit to RISC-V and the other architectures, and go through a small exercise of comparing them to highlight the strengths of each. I think it was in the AMD talk that compute bound versus memory bound came up. This is really a big thing, especially on CPUs, because most CPUs only have access to DRAM, which is fairly fast and fine for compute-bound work; compute bound is when the bottleneck is the matrix multiply itself, not getting the data to do the matrix multiply. Memory bound is when you just don’t get data fast enough. To alleviate that, when data is not coming fast enough, you have technologies like HBM, high bandwidth memory, which has bandwidths of around 1.5 terabytes per second. It’s expensive. It has to live on the chiplet. It’s pretty hard; you cannot just extend it. That’s not possible. Then you have something even more expensive but even faster with SRAM.

Then, where you can have around 90 gigs of HBM on a chiplet, like a chip approximately this size, with SRAM you can have 1 gig, 2 gigs. It’s usually used for L2 and L3 caches. It’s very expensive; you don’t get that much because it’s not dense enough. Here, RISC-V takes care of the compute. RISC-V is a compute spec, not a memory spec. Nothing stops anyone from having a RISC-V CPU with HBM attached. What kind of memory is used does not matter to the software; it’s not part of the spec. You could imagine these dynamic setups where a vendor decides: I’m going to have a RISC-V CPU with HBM attached. That’s being done in the industry already.
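
One way to make the compute-bound versus memory-bound distinction concrete is arithmetic intensity, the ratio of operations to bytes moved. For an n-by-n FP32 matrix multiply versus a transpose (counting a multiply-add as two FLOPs):

$$ \text{GEMM: } \frac{2n^3 \ \text{FLOPs}}{3n^2 \times 4 \ \text{bytes}} = \frac{n}{6} \ \text{FLOPs/byte} \qquad \text{transpose: } \frac{\approx 0 \ \text{FLOPs}}{2n^2 \times 4 \ \text{bytes}} \approx 0 \ \text{FLOPs/byte} $$

So a large matrix multiply can be made compute bound, while a transpose stays memory bound no matter how fast the core is, which is why the memory attached to the core matters so much.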

The current RISC-V offering is very focused on NPUs, Neural Processing Units, also called TPUs, xPUs, AIPUs, whatever you want to call them. It’s pretty generic. Just to give a few examples: Tenstorrent has the Grayskull as a PCIe card. SiFive has an offering with something equivalent to an Attached Matrix Extension, like a coprocessor. SpacemiT, a Chinese company, has the X60 with something equivalent to an Integrated Matrix Extension, so new instructions making use of the vector registers. You already have some offerings out there. You can already get your hands on them and start playing with the existing hardware.

What’s the advantage of RISC-V? I think the main strengths are flexibility, the open community, and cost effectiveness. RISC-V is everywhere, from Bluetooth headphones to supercomputers. It’s even on NVIDIA GPUs, again, as these little processors that help the big things go forward. You can work with existing IP vendors. For example, you can go to SiFive, to Andes, to Microchip, and say: I want a processor for this specific use case, for automotive, for space, for phones, for whatever. You work with them and say: I want this IP block, this IP block, this IP block, and I’m going to make a CPU with your help. That you can already do. There is a pretty healthy IP market already available. You can also buy whole CPU designs, for example from SiFive, and integrate very specific IP.

For example, Andes Custom Extensions, Codasip Studio; there are a few other offerings. This is not an endorsement, just a list of what’s available. You’re able to take an Andes CPU and say: I want to add an instruction that does this specific C code, and this C code is going to be transformed into silicon. If you have a workload, for example, with a very hot kernel, you can literally have silicon for this kernel. “All you need to do” is go to Andes, use their tools, and work with them to have a CPU or a RISC-V core made specifically for that. If you notice that you spend 95% of your time in this specific kernel, and you can accelerate it 10 or 15 times, you reduce the cost by nearly as much. This openness eases the software ecosystem enablement because it just makes it easy. Everyone knows what the spec is, so the hardware vendor can program against it.

The software knows what to use. I think it’s not too crazy to say that RISC-V is the new Linux, in the sense that Linux really opened a lot of doors by being an open-source operating system. It made hardware a lot more accessible to people, because people could just go and hack on the kernel, which they couldn’t do with Solaris, with Windows, and so on. RISC-V really allows you to do the same thing: you’re able to go and hack on the spec, hack on the hardware. You have space in the spec to actually do what you need and develop your own thing, really custom to what you need to do.

The weakness, really, is ecosystem maturity and developer tooling. A lot of work is going into software. The software is catching up, but it’s just not there yet. My personal expectation is that by the end of 2026 things should be in a much better state. We’re not going to be matching x86 and ARM performance by then, because the compilers are just not going to be there, but I think something like 97%, 98% of it will be there. The last 2% is going to take 20 years, but that’s another problem. The most critical libraries, like OpenBLAS, for example, and a few others, will be optimized, because there are maybe 100 such libraries, and there are enough contributors to get through those 100 libraries in a few years.

The most commonly used packages will also be available, just because we have enough people getting excited about this and making them available, but not everything will be available just yet. For example, Kubernetes right now is not available out of the box. As in, you cannot just go to kubernetes.io, download the distribution, and have it just work.

Broader Challenges in AI/ML with RISC-V

Same thing with Python, for example. Python is a big problem right now, especially with AI: there are just no Python packages for RISC-V on pypi.org, because the tooling doesn’t allow it yet. We are contributing through RISE to the PyPA, the Python Packaging Authority, the group taking care of everything packaging for Python. We are making sure that it supports RISC-V, but there are some dependencies, like Linux distributions, which are not going to be available until next year. We are also working with them to make sure they have the offerings we need. There is a lot of work in progress, but it just takes time to deliver. Another big thing where RISC-V can really help is energy efficiency. As we saw at the beginning, energy is a big concern; the operating cost alone is huge. We need to have solutions with a competitive total cost of ownership. We need to develop the specifications to allow that, and we need to have the software using them.

Collaboration and Community Opportunities

If you are looking to collaborate on that, here are some links: there are technical forums, mailing lists, and GitHub repositories for RISC-V. There is the RISC-V International YouTube channel, with a lot of very interesting presentations. Please get involved, whether with software, hardware, or anything else; we are looking for everything. Something you can do in your day-to-day is port your favorite software to RISC-V. That’s just going to make it that much easier for future users. Check the RISE Developers Appreciation program: if you port software to RISC-V, we are going to give you money. You can even run RISC-V software on your CI of choice, for example GitHub Actions. I gave a presentation on how we did it for a project, which highlights some of the steps you can take.

Reflections, and Open Questions

How can RISC-V close the gap with proprietary solutions? Openness and collaboration. That, I think, is really what’s going to set us apart. One of the open questions is the balance between innovation and standardization. Again, vendor-specific extensions allow vendor-driven innovation, but standardization allows broader software adoption.

Conclusion

RISC-V’s modularity enables workload-specific optimizations. That openness is, I think, key to RISC-V. OpenBLAS shows us how it’s possible, how we could tweak RISC-V to make even OpenBLAS go faster. Again, collaboration is crucial to make it happen.

 
