Groupe SNCF, a major railway operator, has successfully migrated from traditional VM-based Kubernetes deployments to a cloud-native platform built on Talos OS and OpenStack, addressing significant operational challenges while navigating complex organizational change. Following Thomas Comtet’s talk at TalosCon 2025, InfoQ interviewed the Senior Staff Engineer about this migration.
The organization’s Kubernetes journey began in a highly restrictive DMZ landing zone with limited services and mandatory Virtual Machine (VM) usage. This initial implementation, built from scratch on existing VMs, became what the team described as a “monster” that was extremely difficult to maintain and operate.
When the project expanded to a more traditional intranet zone with standard VLANs and services, the team took a fundamentally different approach. Rather than simply deploying another Kubernetes distribution, they architected a comprehensive cloud-native platform addressing all the concerns that surround Kubernetes itself: networking, load balancing, storage, and operations.
The solution combined OpenStack as the private cloud foundation with Talos OS as the Kubernetes operating system. This architecture provided the automation capabilities needed for dynamic storage provisioning, load balancing, and network subnet manipulation from day one.
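To make "dynamic storage provisioning" and "load balancing" concrete, the following is a minimal sketch of the kind of Kubernetes objects such a platform lets application teams declare. It is an illustration under stated assumptions, not SNCF's actual configuration: the article does not say which storage or load-balancer integrations are used. The sketch assumes the upstream OpenStack Cinder CSI driver (provisioner name `cinder.csi.openstack.org`) and an OpenStack cloud controller that fulfils `LoadBalancer` services; the object names `block-standard`, `demo-data`, and `demo-web` are hypothetical.

```python
import json


def storage_class(name: str, provisioner: str) -> dict:
    """Build a StorageClass manifest; volumes are created on demand by the provisioner."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "reclaimPolicy": "Delete",
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def volume_claim(name: str, storage_class_name: str, size: str) -> dict:
    """Build a PersistentVolumeClaim that requests a volume from the given class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


def load_balancer_service(name: str, app_label: str, port: int) -> dict:
    """Build a Service of type LoadBalancer; the cloud integration allocates the balancer."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": app_label},
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


# Hypothetical names; the provisioner string is an assumption about the setup.
manifests = [
    storage_class("block-standard", "cinder.csi.openstack.org"),
    volume_claim("demo-data", "block-standard", "10Gi"),
    load_balancer_service("demo-web", "demo-web", 443),
]

if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be applied directly.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

With objects like these, a team requests storage or a load balancer declaratively and the underlying cloud creates the resource, which is the contrast with ticket-based provisioning that the article draws.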
The most significant hurdles were organizational rather than technical. Introducing cloud-native concepts to teams accustomed to traditional IT operations required a fundamental shift in mindset. Legacy teams excelled at scripting, ticket-based workflows, and reactive operations, but cloud-native practices emphasized immutable infrastructure, GitOps, and atomic rollbacks.
Rather than attempting to retrain existing teams, the organization created new teams aligned with cloud-native principles, allowing both approaches to coexist autonomously. This decision acknowledged that changing deep-rooted operational habits and perspectives requires more than training; it requires different organizational structures.
The technical implementation presented its own challenges. The OpenStack team was still maturing when the Kubernetes platform launched on top of it, creating a demanding client relationship from the start. The cloud-native team required sophisticated capabilities immediately: automated storage, dynamic load balancing, and subnet manipulation.
As Comtet put it: “When we started, OpenStack was brand new and still being deployed. We immediately built our entire Kubernetes cloud-native platform on top of it: automated storage, load balancing, subnet management, everything. We weren’t easy clients with simple needs. Both teams were running in parallel: them deploying OpenStack, us building the cloud-native platform on top.”
This necessitated extremely close collaboration, with teams maintaining constant communication about changes and their impacts. Despite the challenges, this tight integration ultimately strengthened both platforms.
For the Kubernetes-focused team, Talos OS proved ideal. Most team members were Kubernetes experts rather than operating system specialists, and Talos provided a production-ready, secure-by-design solution out of the box. Two engineers working with Talos daily particularly appreciated its configuration-driven approach and minimalistic design.
Reflecting on the journey, the team identified one significant improvement opportunity: the two-year research phase exploring bare-metal Kubernetes solutions. The team spent considerable time committed to a Kubernetes-centric approach before eventually pivoting to the OpenStack-based solution. This transition might have occurred six months to a year earlier with more openness to alternative perspectives.
However, the organizational challenges of working with legacy teams were unavoidable. Cultural and operational transformation simply requires time and cannot be rushed.
The immediate roadmap focuses on scaling the existing platform and onboarding more applications to achieve return on investment. The next milestone involves migrating highly critical applications to the cloud-native platform, demonstrating confidence in its stability and capabilities.
Edge deployment decisions remain under evaluation given the long-term nature of industrial rollouts and the diversity of potential edge locations within railway operations.
