News

Netflix Uncovers Kernel-Level Bottlenecks While Scaling Containers on Modern CPUs

News Room | World of Software | Published 13 March 2026, last updated 8:59 AM

Engineers at Netflix have uncovered deep performance bottlenecks in container scaling that trace not to Kubernetes or containerd alone, but to the CPU architecture and the Linux kernel itself. In a detailed blog post, Netflix technologists explain how their move to a modern container runtime exposed surprising contention on global mount locks in the kernel’s virtual filesystem (VFS), revealing that hardware topology and lock contention can limit how hundreds of concurrent containers scale, even on powerful cloud servers.

The issue first surfaced as nodes running Netflix workloads began stalling for tens of seconds under high concurrency, with simple health probes timing out and container creation freezing. Investigations showed the mount table ballooning dramatically during the startup of many-layer container images, straining the kernel’s global mount lock as containerd executed thousands of bind mount operations to map user namespaces for each image layer. With every container requiring dozens of mounts and unmounts, the cumulative workload easily exceeded 20,000 mount syscalls during large bursts, all needing access to the same kernel lock, a classic concurrency bottleneck deep in the operating system.
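The amplification described above can be sketched with back-of-envelope arithmetic. The per-container figures below are illustrative assumptions chosen to match the scale the article describes, not Netflix's measured values:

```python
# Illustrative model of mount-syscall amplification during a container
# startup burst. All figures are assumptions, not measured Netflix data.

containers_in_burst = 400   # containers starting concurrently on one node
layers_per_image = 30       # one bind mount per image layer
ops_per_layer = 2           # mount at startup + umount at teardown

syscalls_per_container = layers_per_image * ops_per_layer
total_mount_syscalls = containers_in_burst * syscalls_per_container

print(f"{syscalls_per_container} mount-family syscalls per container")
print(f"{total_mount_syscalls} mount-family syscalls in the burst")
# Every one of these serializes on the same global mount lock in the VFS.
```

With these assumed figures the burst produces 24,000 mount-family syscalls, comfortably past the 20,000 mark cited in the article, and every one of them queues on the same lock.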

Netflix’s performance team found that not all CPU architectures behave the same under this load. On older dual-socket AWS r5.metal instances (with multiple NUMA domains and mesh-based cache coherence), high concurrency accelerated contention on shared caches and global locks, severely degrading performance. By contrast, newer single-socket instances such as AWS m7i.metal (Intel) and m7a.24xlarge (AMD) with distributed cache architectures scaled much more smoothly, with fewer stalls even as container counts climbed. Analysis revealed that factors like NUMA effects, hyperthreading, and cache microarchitecture significantly influenced how global lock contention propagated through the system.

Netflix engineers confirmed that hardware design matters at scale: NUMA-induced remote memory access latency and competing hyperthreads exacerbated lock waits, while distributed cache designs reduced bottlenecks. For example, disabling hyperthreading improved latency by up to 30% in some configurations, and single-socket instances avoided cross-domain memory penalties entirely. These experiments demonstrated that achieving reliable scaling for container-heavy workloads requires understanding both software concurrency and hardware behavior.
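A minimal queueing sketch shows why a single global lock degrades as concurrency rises: if each operation holds the lock for a fixed time, the last of N contenders waits behind all N-1 predecessors, so tail latency grows linearly with concurrency even though each individual operation is cheap. The hold time below is a made-up illustrative number:

```python
# Toy model of a serialized critical section: N contenders, each holding
# one global lock for `hold_us` microseconds. Purely illustrative.

def tail_wait_us(contenders: int, hold_us: float) -> float:
    """Wait time of the last contender: it queues behind all the others."""
    return (contenders - 1) * hold_us

hold_us = 50.0  # assumed lock hold time per mount operation
for n in (8, 64, 512):
    print(f"{n:4d} contenders -> last waiter stalls {tail_wait_us(n, hold_us) / 1000:.2f} ms")
```

The model ignores NUMA entirely; on the dual-socket machines the effective hold time itself also inflates, because cache-line ownership of the lock bounces between sockets, which compounds the linear queueing effect.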

Armed with this insight, the team explored two major mitigations: adopting newer kernel mount APIs that use file descriptors to avoid global locks entirely, and redesigning how overlay filesystems are built so that the number of mount operations per container drops from linear in the number of layers (O(n)) to constant time per container (O(1)). Netflix chose the latter as it can be deployed more broadly without requiring newer kernels, eliminating mount contention in practice. By grouping layer mounts under a common parent, the mount load on the kernel falls dramatically, smoothing container startups even under high load.
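The difference between the two mount layouts can be illustrated by counting operations per container. The helper functions below are a hypothetical sketch of the accounting, not containerd's or Netflix's actual API:

```python
# Hypothetical sketch contrasting two ways of assembling an overlay
# filesystem for an image with n layers. Function names are illustrative.

def mounts_per_layer_scheme(layers: list[str]) -> int:
    """Old scheme: one bind mount per layer, plus the overlay mount
    itself -- O(n) mount syscalls per container."""
    return len(layers) + 1

def grouped_scheme(layers: list[str]) -> int:
    """New scheme: layers pre-arranged under one common parent, so a
    single overlay mount references all of them at once -- O(1)."""
    return 1

layers = [f"layer{i}" for i in range(30)]
print(mounts_per_layer_scheme(layers))  # 31
print(grouped_scheme(layers))           # 1
```

For a 30-layer image the per-layer scheme issues 31 mount syscalls per container where the grouped scheme issues one, which is why the kernel's global lock stops being the choke point under bursty startups.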

Netflix also addressed the hardware side by routing demanding workloads toward CPU architectures that handle global locks more gracefully, combining hardware-aware scheduling with software improvements. Their findings highlight a broader lesson for organizations at scale: achieving predictable performance in distributed systems often demands co-design across the stack, from container orchestration and filesystem usage to kernel internals and CPU microarchitecture.

The Netflix team published the deep dive to share these performance insights with the broader engineering community, emphasizing that bottlenecks in modern cloud platforms can arise in places few developers typically consider, and that solving them may require both low-level system tweaks and a clear understanding of the hardware running your workloads.

Several organizations have published best practices that closely align with Netflix’s findings on container scaling and kernel-level contention. These guides emphasize hardware-aware workload placement, particularly understanding NUMA topology, cache-coherence design, and hyperthreading behavior when running high-density container workloads. They also favor single-socket architectures, or carefully chosen instance families that minimize cross-domain memory latency, and recommend bare-metal or dedicated instances for system-intensive operations. At the software level, the Kubernetes and container-runtime communities advocate reducing global lock contention by minimizing mount and unmount operations, consolidating filesystem layers, and adopting newer kernel APIs where possible to avoid shared bottlenecks.
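Hardware-aware placement starts with knowing a node's topology, which on Linux can be read directly from sysfs. A minimal sketch (Linux-only; it returns an empty list where the sysfs tree is absent):

```python
import os

def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> list[str]:
    """List NUMA node directories (node0, node1, ...) from sysfs.
    Returns [] on systems without this sysfs tree."""
    if not os.path.isdir(sysfs_root):
        return []
    return sorted(d for d in os.listdir(sysfs_root)
                  if d.startswith("node") and d[4:].isdigit())

nodes = numa_nodes()
print(f"NUMA nodes visible: {nodes or 'sysfs not available'}")
# A single entry (e.g. ['node0']) means no cross-domain memory penalty;
# multiple entries mean remote-access latency is in play.
```

A scheduler or placement policy can use this signal to steer mount-heavy, lock-sensitive workloads toward single-node machines, the hardware class the article found scaled most smoothly.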

In addition, organizations such as Google and Meta emphasize deep, system-level observability as a core scaling practice, using tools like eBPF, perf, and flame graphs to detect hidden kernel stalls and lock contention under concurrency. Cloud providers also recommend leveraging local ephemeral storage for image caching, optimizing overlay filesystems, and tuning runtime configurations to reduce startup amplification effects. Together, these practices reflect a broader industry shift toward hardware-software co-design, where predictable container scaling depends not only on orchestration and runtime improvements, but also on understanding CPU microarchitecture, filesystem behavior, and kernel internals, the same cross-stack approach highlighted in Netflix’s analysis.
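One cheap observability signal for the article's own diagnosis, the ballooning mount table, can be sampled from procfs. A sketch, assuming a Linux /proc is mounted:

```python
def mount_count(path: str = "/proc/self/mountinfo") -> int:
    """Number of mounts visible in the current mount namespace.
    A sudden spike during container-start bursts is the symptom
    Netflix observed. Returns 0 if procfs is unavailable."""
    try:
        with open(path) as f:
            return sum(1 for _ in f)
    except OSError:
        return 0

print(f"mounts in this namespace: {mount_count()}")
```

Polling this counter from a node-health agent, and alerting on its derivative rather than its absolute value, catches mount-table growth before health probes start timing out; deeper drill-down then falls to the eBPF and perf tooling mentioned above.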
