An open-source, independently developed Linux kernel module called GreenBoost aims to augment the dedicated video memory on NVIDIA discrete GPUs with system memory and NVMe storage. GreenBoost is intended as a CUDA caching layer to more easily run large language models that otherwise won't fit within a graphics card's dedicated VRAM.
GreenBoost was announced today by independent open-source developer Ferran Duarri, who is developing it as a multi-tier GPU memory extension for Linux. The GPLv2 driver doesn't replace NVIDIA's official Linux kernel drivers but complements them: it's a dedicated kernel module paired with an NVIDIA CUDA user-space shim library that transparently leverages it for the expanded memory access. This means it doesn't require modifying the CUDA user-space software itself, which transparently enjoys the expanded memory capacity thanks to your system RAM and any NVMe SSD storage.
The developer noted he wanted to run a 31.8GB model (glm-4.7-flash:q8_0) with a GeForce RTX 5070 12GB graphics card. Existing approaches like offloading some layers to the CPU worked but dropped the token performance due to the system memory lacking CUDA coherence. The other option, going for a smaller quantization, of course leads to lower output quality.
As for how GreenBoost works, today’s announcement in the NVIDIA Forums explains:
“1. Kernel module (`greenboost.ko`)
Allocates pinned DDR4 pages using the buddy allocator (2 MB compound pages for efficiency) and exports them as DMA-BUF file descriptors. The GPU can then import these pages as CUDA external memory via `cudaImportExternalMemory`. From CUDA’s perspective, those pages look like device-accessible memory — it doesn’t know they live in system RAM. The PCIe 4.0 x16 link handles the actual data movement (~32 GB/s). A sysfs interface (`/sys/class/greenboost/greenboost/pool_info`) lets you monitor usage live. A watchdog kernel thread monitors RAM and NVMe pressure and signals userspace before things get dangerous.
2. CUDA shim (`libgreenboost_cuda.so`, injected via `LD_PRELOAD`)
Intercepts `cudaMalloc`, `cudaMallocAsync`, `cuMemAllocAsync`, `cudaFree`, and `cuMemFree`. Small allocations (< 256 MB) pass straight through to the real CUDA runtime. Large ones (KV cache, model weights overflowing VRAM) are redirected to the kernel module and imported back as CUDA device pointers. There is one tricky part worth mentioning: Ollama resolves GPU symbols via `dlopen` + `dlsym` internally, which bypasses LD_PRELOAD on those symbols. To handle this, the shim also intercepts `dlsym` itself (using `dlvsym` with the GLIBC version tag to bootstrap a real pointer without recursion) and returns hooked versions of `cuDeviceTotalMem_v2` and `nvmlDeviceGetMemoryInfo`. Without this, Ollama sees only 12 GB and puts layers on the CPU.”
Those wanting to learn more about this GPLv2-licensed open-source GreenBoost implementation can find the experimental code via this GitLab repository.
