Airbnb engineering has published a detailed account of how it maintains high availability during Istio upgrades across tens of thousands of pods and thousands of VMs, all without downtime. The company’s service mesh infrastructure supports workloads in both Kubernetes and VM environments, handling tens of millions of queries per second at peak. Despite the complexity, Airbnb has completed Istio upgrades 14 times to date.
The key challenge lies in coordinating upgrades across diverse workloads owned by different teams. To address this, Airbnb designed an upgrade pipeline that “guarantees” zero downtime, enables gradual rollouts, supports rollback, and ensures all workloads are updated within a fixed timeframe.
Technically, the process relies on a canary-style dual-version deployment of Istio control planes, each distinguished by a revision label (e.g., 1-24-5, 1-25-2). Workloads are pinned to specific revisions via the mutating webhook, which injects the appropriate istio-proxy sidecar. An upgrade gradually moves selected workloads to the new revision according to distribution rules defined in a rollouts.yml file.
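To make the idea of distribution rules concrete, the following is a minimal Python sketch of one way a workload could be mapped to a revision from a percentage split. It is not Airbnb's implementation: the schema of rollouts.yml is not described in the account, so the `ROLLOUT` dictionary, the `pick_revision` function, and the hash-bucket scheme are all assumptions made for illustration.

```python
import hashlib

# Hypothetical rollout rules: share of workloads (in percent) per Istio revision.
# The real rollouts.yml schema is not public; this shape is an assumption.
ROLLOUT = {"1-24-5": 80, "1-25-2": 20}


def pick_revision(workload: str, rollout: dict[str, int]) -> str:
    """Map a workload name to a revision, deterministically and by weight."""
    if sum(rollout.values()) != 100:
        raise ValueError("revision percentages must sum to 100")
    # A stable hash puts every workload in a fixed bucket from 0 to 99.
    bucket = int(hashlib.sha256(workload.encode()).hexdigest(), 16) % 100
    upper = 0
    # Sorted order puts the older revision first, so raising the newer
    # revision's share only ever moves workloads forward, never back.
    for revision, percent in sorted(rollout.items()):
        upper += percent
        if bucket < upper:
            return revision
    raise AssertionError("unreachable when percentages sum to 100")
```

Because the bucket depends only on the workload name, repeated evaluations give the same answer, and a workload that has moved to the new revision stays there as the split shifts from 80/20 toward 0/100.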
To eliminate manual label updates across numerous teams, Airbnb uses Krispr, an internal mutation framework. During CI, Krispr injects the correct revision label into workload specs based on the rollout configuration. It also continuously migrates older pods via admission-time mutation, so that all workloads, including inactive ones, move to the new revision within four weeks.
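The label injection described above can be pictured with a short Python sketch. Krispr is internal to Airbnb and its API is not public, so the `inject_revision_label` function here is purely hypothetical. The sketch assumes the workload is a Kubernetes Deployment manifest loaded as a dictionary and that the revision is carried in Istio's standard `istio.io/rev` label on the pod template.

```python
import copy

# Istio's standard label for selecting a control-plane revision.
REVISION_LABEL = "istio.io/rev"


def inject_revision_label(deployment: dict, revision: str) -> dict:
    """Return a copy of a Deployment manifest with the revision label set.

    Hypothetical helper for illustration; not Krispr's actual interface.
    """
    mutated = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    template = mutated.setdefault("spec", {}).setdefault("template", {})
    labels = template.setdefault("metadata", {}).setdefault("labels", {})
    # Overwrite any stale revision so the pod template always matches the rollout.
    labels[REVISION_LABEL] = revision
    return mutated
```

Applying the same mutation both in CI and at admission time means the label is a derived value rather than something each team has to maintain by hand.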
For VM workloads, Airbnb uses mxagent, a daemon that polls version tags on each host and atomically upgrades both the istio-proxy and its configuration when necessary. A central controller (mxrc) coordinates VM rollouts, respecting health checks and upgrade safety thresholds similar to Kubernetes’ maxUnavailable semantics.
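A threshold in the spirit of maxUnavailable can be sketched in a few lines of Python. The internals of mxrc and mxagent are not described in the account, so the `Host` record, the `next_batch` function, and the exact budget rule below are assumptions chosen only to illustrate how a controller might cap concurrent upgrades.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    version: str   # istio-proxy version currently running on the VM
    healthy: bool  # result of the most recent health check


def next_batch(hosts: list[Host], target: str, max_unavailable: int) -> list[str]:
    """Pick the hosts that may be upgraded now without exceeding the budget.

    Hypothetical logic: hosts that are already unhealthy consume the budget,
    so a struggling fleet pauses the rollout instead of compounding the outage.
    """
    unavailable = sum(1 for h in hosts if not h.healthy)
    budget = max(0, max_unavailable - unavailable)
    pending = [h.name for h in hosts if h.healthy and h.version != target]
    return pending[:budget]
```

Calling `next_batch` repeatedly as health checks report back advances the rollout in small steps, and it returns an empty list whenever too many hosts are already unhealthy.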
Airbnb is not alone in tackling service mesh upgrades at scale; other companies have approached the problem somewhat differently:
Netflix has introduced its own zero-config service mesh. Instead of relying on a heavy control plane model, Netflix designed a mesh that automatically manages service discovery, retries, and traffic routing without requiring manual configuration. In doing so, Netflix sidesteps the coordination challenges of multi-version Istio upgrades, while still gaining the traffic management and reliability benefits that a service mesh provides.
LinkedIn, which runs one of the largest Kubernetes deployments, uses a mix of canary deployments and traffic mirroring for upgrades to core infrastructure, including Kafka and networking layers. For its service-to-service communication stack, LinkedIn has experimented with Envoy-based solutions but leans on gradual rollout pipelines with mirrored traffic for safety. This approach is conceptually similar to Airbnb’s dual Istio revisions: both allow traffic to be validated against new versions before flipping fully.
As one of the creators of Istio, Google Cloud itself has pioneered multi-revision control planes for customers, similar to Airbnb’s implementation. GKE now allows operators to run multiple Istio versions side by side, easing rollouts and rollback. Google is also pushing Ambient Mode, which replaces sidecars with lightweight data-plane proxies, reducing upgrade blast radius significantly. Airbnb has expressed interest in Ambient Mode, signaling alignment with Google’s next-gen mesh direction.
Uber runs an internal mesh framework built on Envoy that integrates closely with its custom service discovery system. Its upgrade strategy often involves progressive deployment by cluster rather than fine-grained revision pinning. Uber has invested in tooling to automate rollback and enforce SLA monitoring during upgrades, somewhat mirroring Airbnb’s mxagent + mxrc setup for VMs.
These comparisons illustrate a broader industry trend: investing in advanced rollout frameworks or mesh innovations to balance complexity, reliability, and operational control.
Looking ahead, Airbnb plans to explore Istio’s Ambient Mode for a more lightweight mesh setup and to split its mesh into multiple meshes to limit blast radius and enhance isolation.