Transcript
Berenberg: One Network sounds very unusual in an era when everybody is building more things, and we were no exception. As Google, we were latecomers to the cloud computing game. With that, our teams started building, and within a few years, by 2020, we had organically grown to more than 300 products, with a lot of infrastructure and multiple network paths. Our customers noticed that the products were not integrated, and we noticed that our developer velocity was actually low, because every time we needed to release a new feature, we had to release it n times, on every network path, on different infrastructure.
The scariest and most important part was policy. Policy is something cloud providers worry about day and night, because all cloud infrastructure is controlled by policy. How do you make sure that policies are enforced on every path, without exception, without needing to modify those 300 products?
Why Is Networking Complicated?
Let's look at why networking is complicated. On the left, you see the Google prod network. Google has its own network, as you know, and it runs Search and YouTube. On that we build cloud products: for example, Borg is our container orchestration system, and on it we built Cloud Run. On top of Cloud Run sits its own networking. Then there is the virtual networking of GCP itself, which is called Andromeda. On top of it we build different runtimes, like Kubernetes (GKE) and GCE, which is Compute Engine with the VMs.
Each of those, again, has its own networking, and GKE, as you know, Kubernetes has its own layer of networking. On top of that there are service meshes. Then the same thing happens in multi-cloud, on customer premises, or in distributed cloud, where the layers repeat. Then what happens? In each environment we build applications, and these applications, again, run on different runtimes and have different properties. This combination of infrastructure on different network paths and different runtimes created an n-squared problem, what I usually call Swiss cheese: something works here, something doesn't work there. What's the solution? The solution is what we called One Network. It's a unified service networking overlay.
One Network (Overview)
What is the goal of One Network? We want to define policy uniformly across services within the constraints I just explained: heterogeneous compute and networking, different language runtimes, and the coexistence of monolithic services and microservices, all across different environments, including multi-cloud, other clouds, and public and private clouds. The solution is One Network, which came out of that frustration. You can think of it as: why do I need so many networks? Can I have one network? One Network it is. Policy is managed at the network level. We iterate towards One Network, because it's such a huge architectural change, to manage cost and risk. It's very much open source focused, so all the innovation went into open source, and some went into Google Cloud itself.
How do you explain One Network? We build on one proxy. Before that, every team had its own, and there were basically proxy wars floating around. One control plane to manage those proxies. One load balancer, wrapped around the proxy, to manage all runtimes: GKE, GCE, Borg, managed services, multi-cloud. Universal data plane APIs to extend the ecosystem, so we can extend with both first-party and third-party services. Uniform policies.
Again, it's across all environments. When I presented this particular slide in 2020, everybody said it just sounded too good to be true. It was, at that time. Who is going to benefit from One Network? Everybody, actually. These are the roles we identified that benefit from One Network. They range from people who care about security policy, to DevOps and networking folks who care about network policy, to SREs who care about provisioning large numbers of microservices or services, to application developers who want to manage their own policy without needing to interact with the platform admins or platform engineering folks. There's the breadth, the depth, and the isolation, all at the same time, so it's partitioned as well as universal. Everybody cares about orchestration of large environments, and everybody cares about observability.
One Network Principles – How?
What are the principles we built One Network on? We built on five principles. We build on a common foundation. Everything as a service. We unify all paths and support all environments. Then we create an open ecosystem of what we call service extensions, which are basically pluggable policies. We then apply and enforce these policies on all paths uniformly.
1. Common Foundation
Let's start with the first one. This is the One Network pyramid I put together while thinking about how to explain the narrowing scope of the layers. We start with the general-purpose Envoy proxy, and we'll talk more about it. It's an open-source proxy available on GCP, on-prem, anywhere. Then we wrap around it GCP-based load balancers, which work for VMs, containers, and serverless. On top of that you can build a GKE controller, and now you have the GKE gateway, which uses the same underlying infrastructure but serves only GKE services and workloads, and understands the behavior of GKE deployments.
The top of the pyramid is where you don't see the gateway at all, because it's fully integrated into, for example, Vertex AI, our AI platform. It's just an implementation detail. All of that uses the same infrastructure across products and across paths. All of these layers are controlled by a single control plane, which we call Traffic Director. It has a formal API and everything. When I say single, it doesn't mean a single deployment; it's the same control plane that can run regionally or globally, or be specialized per product if there is a need for isolation. It's the same binary running everywhere, so you can control and orchestrate it the same way.
This is the One Network architecture, the North Star. I want to walk you through it from left to right; you can see the different environments. It starts from mobile, goes to the edge, then to the cloud data center, then on to multi-cloud or on-prem. There are three common building blocks: a control plane, Traffic Director, that controls all of these deployments; open-source APIs between Traffic Director and the data planes, called the xDS APIs; and then the data planes themselves. The data planes are all open source based: they're either Envoy or gRPC, both of which are open-source projects. Having an open-source data plane allows us to extend to multi-cloud, to mobile, and basically anywhere outside of GCP, because it's no longer proprietary.
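To make that control-plane-to-data-plane relationship concrete, here is a minimal sketch of an xDS management server built with the open-source go-control-plane library, the same family of APIs that Traffic Director speaks as a managed product. The port and the empty snapshot cache are assumptions for the sketch, not anything from the talk.

```go
package main

import (
	"context"
	"log"
	"net"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// The snapshot cache holds per-node xDS configuration (listeners, routes,
	// clusters, endpoints). A real control plane would compute snapshots from
	// its service registry and health signals and call SetSnapshot.
	cache := cachev3.NewSnapshotCache(true /* ADS */, cachev3.IDHash{}, nil)

	// serverv3 implements the xDS protocol on top of the cache.
	srv := serverv3.NewServer(ctx, cache, nil)

	grpcServer := grpc.NewServer()
	// Envoy proxies and proxyless gRPC clients connect to this Aggregated
	// Discovery Service endpoint to receive their configuration.
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", ":18000") // port chosen arbitrarily for the sketch
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(grpcServer.Serve(lis))
}
```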
Talking a little bit about the Envoy proxy: it came out in 2016, and we really like it. The reason is that it was a great, new, modern proxy with advanced routing and observability as first-class features. It got immediate adoption, and Google invested in it heavily. We like it not just because it's a great proxy, but because it's a platform with these amazing APIs. It has three sets of APIs. There are configuration APIs between the control plane and the data plane, the proxy itself; they configure it and are eventually consistent, and they provide both management plane and control plane functionality. There are generic data plane APIs. There is external AuthZ, which does allow and deny, so you can easily plug in any AuthZ-related system. There is an API called external processing, so you can plug in basically anything behind it: it can modify the body and return it back. It's very powerful. Then there are the WebAssembly binary APIs for proxy-Wasm.
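To give a sense of how small such an AuthZ plugin can be, here is a hedged sketch of an external authorization server that speaks Envoy's ext_authz gRPC API via the go-control-plane generated stubs. The header name and port are invented for the example; this is a toy policy, not anything Google ships.

```go
package main

import (
	"context"
	"log"
	"net"

	authv3 "github.com/envoyproxy/go-control-plane/envoy/service/auth/v3"
	"google.golang.org/genproto/googleapis/rpc/status"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// authServer implements envoy.service.auth.v3.Authorization.
type authServer struct{}

// Check is called by the proxy for every request on a route with the
// ext_authz filter enabled; a non-OK status denies the request.
func (a *authServer) Check(ctx context.Context, req *authv3.CheckRequest) (*authv3.CheckResponse, error) {
	headers := req.GetAttributes().GetRequest().GetHttp().GetHeaders()
	// Toy policy: require a token header (header name made up for this sketch).
	if headers["x-example-token"] == "" {
		return &authv3.CheckResponse{
			Status: &status.Status{Code: int32(codes.PermissionDenied)},
		}, nil
	}
	return &authv3.CheckResponse{
		Status: &status.Status{Code: int32(codes.OK)},
	}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":9001") // port chosen arbitrarily
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	authv3.RegisterAuthorizationServer(s, &authServer{})
	log.Fatal(s.Serve(lis))
}
```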
There are also specialized APIs. Envoy had, right away, RLS, the rate limiting service. It's interesting that this could have been achieved via external AuthZ, which is more generic because it's also allow and deny, but here it was specialized. We expect to see more of these specialized APIs in the future. For example, we're thinking about an LLM firewall API: you classify incoming traffic as AI traffic and apply rules that are specific to AI traffic. You can do safety checks. You can do blocklists. You can do DLP. The data plane itself also has filters.
The Envoy proxy has filters, both L4, which is TCP, and L7 HTTP filters. There are two types. One type is linked into Envoy, and that determines ownership: if we as the cloud provider link them, then they can only be our filters, and if a customer runs Envoy on their own, then the filters are theirs. We cannot mix and match. WebAssembly filters are a runtime where both first-party and third-party code can be loaded into the data plane. Google invests heavily in the open-source proxy-Wasm project, and we actually released a product on it. These filters can be chained, and they can be request based, response based, or request-response, depending on how you need to process the traffic. All of that is configured by Traffic Director.
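For a feel of what such a pluggable Wasm filter looks like, here is a rough sketch using the open-source proxy-wasm-go-sdk, built with TinyGo per that SDK's instructions. The header name is invented, and exact entry points can differ between SDK versions, so treat this as an illustration of the shape rather than a definitive implementation.

```go
package main

import (
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm"
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm/types"
)

func main() {
	proxywasm.SetVMContext(&vmContext{})
}

type vmContext struct{ types.DefaultVMContext }

func (*vmContext) NewPluginContext(contextID uint32) types.PluginContext {
	return &pluginContext{}
}

type pluginContext struct{ types.DefaultPluginContext }

func (*pluginContext) NewHttpContext(contextID uint32) types.HttpContext {
	return &httpContext{}
}

type httpContext struct{ types.DefaultHttpContext }

// OnHttpRequestHeaders runs in the proxy for every request on the listener
// where this Wasm plugin is attached.
func (*httpContext) OnHttpRequestHeaders(numHeaders int, endOfStream bool) types.Action {
	// Toy policy: stamp a header on every request (header name invented).
	if err := proxywasm.AddHttpRequestHeader("x-one-network-policy", "applied"); err != nil {
		proxywasm.LogCriticalf("failed to add header: %v", err)
	}
	return types.ActionContinue
}
```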
Talking about Traffic Director: it's an xDS server. It combines two things: rapidly changing dynamic configuration, for example weights and health, with static configuration, that is, how you provision a particular piece of networking equipment. The magic we put behind Traffic Director we call GSLB, the Google Global Service Load Balancer. It's a globally optimized control plane. It's the same algorithm Google uses to route traffic for Search, YouTube, and Gmail; anything you do with Google goes through this load balancer behind the scenes.
GSLB optimizes for RTTs and the capacity of the backends. It finds the best path and the best weights to send traffic to. It also has centralized health checking, so you don't need to do n-squared health checking from the data plane. At one point we noticed that if you do n-squared health checking, you end up with 80% of the throughput through the data center being health checks alone, leaving only 20% for actual traffic. Removing that 80% overhead is great. It's also integrated with autoscaling, so when a traffic burst occurs you don't scale up step by step; you can scale up in a single step because you know how much traffic is coming, and in the meantime traffic is redirected to the closest available capacity. Traffic Director also handles policy orchestration: when an administrator creates a policy, it is delivered to Traffic Director, and Traffic Director provisions all the data planes with that policy, where it is enforced.
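This is not Google's actual GSLB algorithm, but a toy sketch of weighting backends by both RTT and spare capacity gives a feel for the kind of split the control plane computes and pushes to the data planes. The scoring function and region names are invented for illustration.

```go
package main

import "fmt"

// Backend describes one regional serving location as a global control plane
// might see it: measured round-trip time from the client's region and how
// much capacity (QPS) the backend still has available.
type Backend struct {
	Name      string
	RTTMillis float64
	SpareQPS  float64
}

// weights returns a normalized traffic split: closer backends with more
// spare capacity get a larger share. Deliberately simplistic.
func weights(backends []Backend) map[string]float64 {
	scores := make(map[string]float64)
	total := 0.0
	for _, b := range backends {
		if b.SpareQPS <= 0 {
			continue // drained or saturated backends receive no traffic
		}
		s := b.SpareQPS / (1.0 + b.RTTMillis)
		scores[b.Name] = s
		total += s
	}
	for name := range scores {
		scores[name] /= total
	}
	return scores
}

func main() {
	split := weights([]Backend{
		{Name: "us-central1", RTTMillis: 12, SpareQPS: 5000},
		{Name: "europe-west1", RTTMillis: 95, SpareQPS: 8000},
		{Name: "asia-east1", RTTMillis: 160, SpareQPS: 0}, // drained
	})
	fmt.Println(split)
}
```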
2. Everything as a Service
The second principle is everything as a service. This is actually a diagram of a real service, an internal cloud service. Think about how we manage such a service. There are different colors; they mean something. There are different boxes; they mean something. The lines go everywhere. How do you reason about such an application? How do you apply governance? How do you orchestrate policy across it? How do you manage these small independent services, or group them into different groups? One Network helps here.
Each of these microservices is represented as a service endpoint, and that lets us group these service endpoints and orchestrate policy over them without touching the services themselves; everything is done on the network. There are three types of service endpoints. There's the classic service endpoint for a customer workload, where you take a load balancer, put it in front, and you've got a service endpoint. You hide, for example, a shopping cart service in two different regions behind it. That's typically how a service is materialized.
The second is a newer one, where there is a relationship between a producer and a consumer. For example, a SaaS provider builds a SaaS on GCP and then exposes it to the consumer via a single service endpoint, materialized through a Google Cloud product called Private Service Connect. There's a separation of ownership: the producer doesn't have to expose their architecture to the consumer. The consumer doesn't know about anything the producer is running. The only thing they see is a service endpoint, and they can operate on it. In this case, the producer can be a third party outside of your company, or even your own shared service.
If you have a shared service within your company and you want multiple teams to use it, this is the type of architecture you want, because you want to separate your implementation from its consumption, and then allow every customer or consumer to put their own policy on the service endpoint. You can expose a single service to a consumer through a service endpoint, or as many services as you want. There are also headless services. These are typically defined by service meshes, where they sit within a single trust domain.
In this case, services are materialized purely as abstractions, because there is no gateway and no load balancer; each of them is just a set of IPs and ports of the backends. An example of this is AI, obviously. We look at a model as a service endpoint. Here the producers are model creators, and the consumers are GenAI application developers. The producer's inference stack, for example, is hidden behind Private Service Connect, so nobody even knows what it's doing there. Then different applications connect to that particular service endpoint.
3. Unify All Paths and Support All Environments
The third principle is to unify paths and environments. Why would we want to do that? To apply uniform policies across services. To unify paths, we first have to identify them. You can see here the eight paths we identified. This is a generalization; there are many more, but we generalized them to eight.
Then, for each of them, we identify the network infrastructure that implements the path, where the policy is applied. You can see there is an external load balancer for internet traffic, an internal load balancer for internal traffic, service meshes, an egress proxy, even mobile. Let's look at them one at a time. GKE gateway and load balancer: typically that's how services are materialized. What we did was take Envoy, which was the original deployment, and turn it into a managed load balancer, and we spent more than a year hardening it in open source so it could serve internet traffic. We also have global and regional deployments.
Global deployments are used by customers who have a global audience, who care about cross-regional capacity reuse, or who in general need to move traffic around, versus regional deployments for customers who care about data residency, especially data in transit, or who treat regionalization as an isolation and reliability boundary. We provide both. It's all connected to all runtimes.
The second deployment here is the service mesh. Istio is probably now the most used service mesh. The most interesting part is that it very clearly defines what service-to-service communication needs: service discovery, traffic management, security, and observability. Once you separate these into independent areas, it's easy to plug each of them in independently. For the Google product, we have Cloud Service Mesh, which is Istio based but backed by Traffic Director, and it supports the gateway APIs as well as the Istio APIs. It works for VMs, containers, and serverless. That is out.
Google has had a service mesh for more than 20 years, since forever, before service meshes were a thing. The difference between Google's service mesh and Istio or any other service mesh is that ours was proxyless. We had a proprietary protocol called Stubby. We had a control plane that talks to Stubby; we provision Stubby with the configuration and everything. It basically was a service mesh in the same way you see it now. We exposed this proxyless service mesh notion to our customers and to open source, where gRPC uses the same APIs to the control plane as Envoy does.
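In open-source gRPC, this proxyless mode is exposed through the xds name resolver. Roughly, a client opts in as in the sketch below, assuming a recent grpc-go and a bootstrap file referenced by the GRPC_XDS_BOOTSTRAP environment variable that points at the xDS control plane; the service name is invented for the example.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Blank import registers the "xds" resolver and balancer, which read
	// the bootstrap file referenced by the GRPC_XDS_BOOTSTRAP env var.
	_ "google.golang.org/grpc/xds"
)

func main() {
	// "xds:///wallet-service" is an illustrative target: endpoints, weights,
	// and routing policy for it are delivered by the control plane over xDS
	// instead of being resolved via DNS, with no sidecar proxy in the path.
	conn, err := grpc.NewClient("xds:///wallet-service",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to create channel: %v", err)
	}
	defer conn.Close()
	// Use conn with any generated gRPC client stub as usual.
}
```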
Obviously, that reduces resource consumption, because there is no extra proxy. It's basically super flat, with no maintenance: you don't need to install a proxy, there is no lifecycle management, there is no overhead. A similar but slightly different deployment architecture is GKE Dataplane V2, which is based on Cilium and eBPF in the kernel. It simplifies GKE deployment networking. Scalability is improved because there is no sidecar. Security is always on, and observability is built in. For L7 features, it automatically redirects traffic to the L7 load balancer in the middle.
Mobile we actually didn't productize. It's a concept, but we ran it for a couple of years, and it's very interesting. It extends One Network all the way to mobile, and it brings very interesting behaviors, because, as opposed to workloads or compute-centric deployments, a mobile device cannot keep a persistent connection to the control plane due to power consumption. The handshake is a little bit different. It also requires Traffic Director to build a cache to support it. We tried it on 100 million devices, or a simulation of devices, and it actually worked very nicely. It uses Envoy Mobile. This is an evolution from the Envoy proxy as it's typically used to Envoy Mobile, a library that is linked into the mobile application.
One of the interesting use cases here is: if you have 100 million devices and one of them goes rogue, or you need to track one down, having a control plane lets you identify that particular device, deliver a configuration to it, get whatever observability you need, or shut that device down. The value is there. The second project, also a work in progress, is control plane federation. Think about multi-cloud or on-premises, where a customer outside of GCP runs a very similar deployment, but it's not GCP managed. You're running your own Envoy, or your own gRPC proxyless mesh, with a local Istio control plane. In this architecture, we use the local Istio control plane to do the dynamic configuration and to make sure the health of the backends is propagated, so if the connection between Traffic Director and the local xDS control plane breaks, the deployment on-prem or in the other cloud keeps functioning just fine until they reconnect.
You still have a single pane of glass, and you can manage any number of these on-prem, multi-cloud, or point-of-sale deployments. You can imagine them going into the thousands, all from a single place. Bringing it all together, this is what it looks like. We already saw this picture. You have all the load balancers and the mesh, and they go across environments, including mobile and multi-cloud.
4. Service Extension Open Ecosystem
That was the backbone. How do we use the backbone to enable a policy-driven architecture? We introduced the notion of service extensions. For each API we discussed before, whether it's external AuthZ for allow and deny or the external processor, at every point there is the possibility of plugging these policies in. For example, a customer wants its own AuthZ; they don't like the AuthZ we provide. They can plug it in. Another example is Apigee, our product for API management. Having service extensions changes the paradigm of how API management is done, because previously you would need a dedicated API gateway to do API management.
Here, API management becomes ambient. It's available everywhere because One Network is so big, and you can plug in at any point. The same API management is available at any point, whether it's at the edge, in service-to-service communication, on egress, or in the mesh. You get this change from a point solution to an ambient presence of policies or value-added services. Another example is a third-party WAF. We have our own WAF, but our competitors can bring their own. It's an open ecosystem. The customer can pick and choose which WAF to use on the same infrastructure. They don't have to plug in additional things and then try to tie them together. It's all available.
The One Network architecture is all there. Before, we discussed how it looks at one point, and now you can see how you can plug this in at any point, everywhere, whether it's routing policies, security services, API management, or traffic services. How does it actually work? We have three types of service extensions. One of them is service plugins, which are Wasm based. That just went to public preview. Then there are Callouts. Those are serverless, essentially: you give us the code, and we run it for you.
Typically, people like to have those at the edge, where you can immediately do header manipulation and other small functions. Then you have Service Callouts, which are essentially SaaS services plugged in. Here there is no restriction on size or ownership; it's just a callout to a service. Then there is what we call the PDP proxy. It's an architecture that allows plugging multiple policies in behind a proxy, for caching, and not just for caching: you can then combine this policy and that policy and make a decision. It's like operating on multiple policies.
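Callouts of this kind are typically wired in through Envoy's external processing API described earlier. Here is a hedged sketch of a minimal callout server using the go-control-plane stubs; the port is invented, and the sketch only acknowledges header phases without mutating anything, which is where a real callout would apply its logic.

```go
package main

import (
	"io"
	"log"
	"net"

	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

// calloutServer implements envoy.service.ext_proc.v3.ExternalProcessor.
type calloutServer struct{}

// Process handles a bidirectional stream: the proxy sends request/response
// phases (headers, body, trailers) and the callout answers each one.
func (s *calloutServer) Process(stream extprocv3.ExternalProcessor_ProcessServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		var resp *extprocv3.ProcessingResponse
		switch req.Request.(type) {
		case *extprocv3.ProcessingRequest_RequestHeaders:
			// A real callout would inspect or mutate request headers here;
			// this sketch just tells the proxy to continue unchanged.
			resp = &extprocv3.ProcessingResponse{
				Response: &extprocv3.ProcessingResponse_RequestHeaders{
					RequestHeaders: &extprocv3.HeadersResponse{},
				},
			}
		case *extprocv3.ProcessingRequest_ResponseHeaders:
			resp = &extprocv3.ProcessingResponse{
				Response: &extprocv3.ProcessingResponse_ResponseHeaders{
					ResponseHeaders: &extprocv3.HeadersResponse{},
				},
			}
		default:
			// Body and trailer phases are omitted from this sketch; a real
			// callout handles whichever phases its processing mode enables.
			continue
		}
		if err := stream.Send(resp); err != nil {
			return err
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":8443") // port chosen arbitrarily
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	extprocv3.RegisterExternalProcessorServer(s, &calloutServer{})
	log.Fatal(s.Serve(lis))
}
```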
Each of these policies is represented as a service too, and they're managed by AppHub, which is our services directory. Going into the future, we're looking at managing all of these through the marketplace and having lifecycle management, because there are going to be a lot of them. The question will be how to pick this WAF versus that WAF. You need a recommendation system. You need to worry about which one to choose.
5. Apply and Enforce Uniform Policies
Last but not least: how do you apply and enforce uniform policies? We came up with a notion of four types of orchestration policies. One of them is the creation of segmentation. If you have a flat network and you want to segment it, you can do so by exposing one service and not publishing others. That forces traffic through a chokepoint, and there you can apply policies. It's easy to do because everything is a service, so it's easy to control which services are visible and which are not.
The second one is to apply a policy to all paths. In practice, every application has multiple paths leading to it. For example, it's a very common deployment to have internet traffic and internal traffic going to the same service, the same endpoint. When something happens, how do you quickly apply a policy to all paths? You're protecting the application and the workload behind it, rather than worrying about whether you covered all paths or not. You should hand that knowledge to a programmatic system that can orchestrate over the paths.
The third one is to apply policy across an application. An application defines a perimeter, a boundary within which all of its services and workloads reside. It's typically a business boundary: for example, an e-commerce application contains a frontend, a middle tier, and a database, while a different application contains the catalog and other things. One application can call into a service of another application. Within a given application, a security administrator can say: on the boundary, these are the policies.
For example, nobody but this one service can talk to the internet; everything else inside the application cannot. That means the policy needs to be applied to each of the workloads, for example, not to have public IPs or not to allow egress traffic. The fourth one is to deliver policy at the service level. That is management at scale. Imagine you need to provision every VM with a firewall or some configuration, and you have a thousand of them. Instead, you can group these VMs into a single service, set the policy on the service, and then orchestrate it onto each individual backend.
This is how policy enforcement and policy administration are done, through the concepts of the policy administration point, the policy decision point, and the policy enforcement point. We spoke about the One Network data planes, which are the policy enforcement points. Policy providers supply service extensions, and One Network provides the policy administration points. Basically, this allows customers to express policy at a larger granularity, over a group such as an application or a set of workloads, and lets the system orchestrate it.
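The PAP/PDP/PEP split is a general pattern rather than a Google-specific API. A minimal conceptual sketch, with names invented here, shows how a data-plane enforcement point stays decoupled from the policy logic behind it, using the "only the egress service may talk to the internet" example from above.

```go
package main

import "fmt"

// RequestContext carries the attributes an enforcement point can observe.
type RequestContext struct {
	Source, Destination, Path string
}

// Decision is what a policy decision point returns.
type Decision struct {
	Allow  bool
	Reason string
}

// DecisionPoint (PDP) evaluates policy; in One Network terms it could be a
// Wasm plugin, an ext_authz service, or a remote callout.
type DecisionPoint interface {
	Decide(req RequestContext) Decision
}

// EnforcementPoint (PEP) sits on the data path (a load balancer, a sidecar,
// or a proxyless gRPC client) and only enforces what the PDP decided.
type EnforcementPoint struct {
	PDP DecisionPoint
}

func (e *EnforcementPoint) Handle(req RequestContext) {
	d := e.PDP.Decide(req)
	if !d.Allow {
		fmt.Printf("deny %s -> %s: %s\n", req.Source, req.Destination, d.Reason)
		return
	}
	fmt.Printf("forward %s -> %s%s\n", req.Source, req.Destination, req.Path)
}

// denyEgress is a toy PDP: only one named source may reach the internet.
type denyEgress struct{ allowedSource string }

func (p denyEgress) Decide(req RequestContext) Decision {
	if req.Destination == "internet" && req.Source != p.allowedSource {
		return Decision{Allow: false, Reason: "egress restricted to " + p.allowedSource}
	}
	return Decision{Allow: true}
}

func main() {
	// The administration point (PAP) would distribute this policy to every
	// enforcement point on every path; here we wire one by hand.
	pep := &EnforcementPoint{PDP: denyEgress{allowedSource: "egress-proxy"}}
	pep.Handle(RequestContext{Source: "frontend", Destination: "internet", Path: "/"})
	pep.Handle(RequestContext{Source: "egress-proxy", Destination: "internet", Path: "/"})
}
```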
Let's take a couple of examples of how this works. There's the notion of a service drain at Google. Imagine it's 3 a.m. and you got paged. What are you going to do? You drain first and think second. That is how Google SREs operate. What does drain mean? You administratively move traffic away from wherever you got paged, or from that service. What are you going to drain? You can drain a microservice. You can drain a zone or a region. The microservice could be VMs, containers, serverless, or a set of microservices. The traffic just moves away administratively. Nothing gets brought down; it's still up, it just no longer receives traffic. Your mitigation is done, because the traffic has actually moved. Hopefully, the other regions are not affected. You're free to debug whatever happened with the running deployment. Once you've found the problem, you undrain, slowly trickle traffic back, more and more, get back to a working setup, and wait until the next 3 a.m. page.
How does it work with One Network? You can see traffic coming in through different data planes: through the application load balancer, and through the service mesh, whether the gRPC proxyless mesh or the Envoy mesh. It's all going to region 1. Then we apply the drain via the xDS APIs, and traffic moves across all of them at the same time. Here we showed it fully moved, but you can imagine moving 10% of the traffic, 20%, however much you need to move. Or you can drain all at once.
Another example is CI/CD canary releases, where we want to direct traffic to a new version. You can see here there are different clients. There are actual humans going through the application load balancer via some website. There's a call center going through the internal load balancer, point of sale going through the Envoy sidecar service mesh, and even multi-cloud and on-prem coming in, for example, from the proxyless mesh. There are two versions of the wallet service, v1 and v2. We provision the change at the top, it delivers the configuration, and off we go. The traffic moves to v2.
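Under the hood, a canary like this is essentially a weighted-cluster route that the control plane pushes to every data plane over xDS, so the same split applies on every path at once. Here is a hedged sketch of such a route using the go-control-plane route types; the service, domain, and cluster names are invented for the example.

```go
package main

import (
	"fmt"

	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// walletCanaryRoute builds a route that splits traffic for a wallet service
// between v1 and v2. A control plane would push this to every data plane
// (load balancers, sidecars, proxyless gRPC) over xDS.
func walletCanaryRoute(v2Percent uint32) *routev3.RouteConfiguration {
	return &routev3.RouteConfiguration{
		Name: "wallet-routes",
		VirtualHosts: []*routev3.VirtualHost{{
			Name:    "wallet",
			Domains: []string{"wallet.example.internal"},
			Routes: []*routev3.Route{{
				Match: &routev3.RouteMatch{
					PathSpecifier: &routev3.RouteMatch_Prefix{Prefix: "/"},
				},
				Action: &routev3.Route_Route{
					Route: &routev3.RouteAction{
						ClusterSpecifier: &routev3.RouteAction_WeightedClusters{
							WeightedClusters: &routev3.WeightedCluster{
								Clusters: []*routev3.WeightedCluster_ClusterWeight{
									{Name: "wallet-v1", Weight: wrapperspb.UInt32(100 - v2Percent)},
									{Name: "wallet-v2", Weight: wrapperspb.UInt32(v2Percent)},
								},
							},
						},
					},
				},
			}},
		}},
	}
}

func main() {
	// Start the canary at 10% and ramp up over time; draining a backend is
	// the same mechanism with its weight set to 0.
	rc := walletCanaryRoute(10)
	fmt.Println(rc.GetVirtualHosts()[0].GetRoutes()[0].GetRoute().GetWeightedClusters())
}
```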
One Network of Tomorrow
One Network of tomorrow, and where we are today, bringing it all together. It's the same picture. We're basically done with the central part. Multi-cloud: we have connectivity, and we've extended it to multi-cloud. We are working on the federation. We have edge, obviously. We are working on mobile. It is a multi-year investment. The One Proxy project around Envoy started in 2017. One Network started in 2020. Our senior executives committed to a long-term vision. The effort spans more than 12 teams. We have, so far, delivered 125 individual projects. The majority of Google Cloud's network infrastructure supports One Network. And because it's all open source based, open-source systems can be plugged into it.
Is One Network Right for You?
Is One Network right for you? The most important thing to consider is whether you have executive support to do something like this. I wouldn't recommend anybody do it on their own. Beyond that, organizational goals also need to be considered. Is policy something your company worries about a lot? Is compliance super important? A multi-cloud strategy, developer efficiency related to the infrastructure: those are the things to consider when embarking on such a huge project. Plan for the long-term vision, but execute on short-term wins.
That basically turned out to be the success story, because we went at it without one big-bang outcome. We were just doing one project at a time, improving networking one project at a time, closing holes in the Swiss cheese. We didn't talk much about generative AI here. That's why we decided to ask Gemini to write a poem about One Network and draw the image. Here it is. It's actually a pretty nice poem. I like it. Feel free to read it.
Questions and Answers
Participant: With all these sidecar containers, with Envoy running beside all these services, what kind of latency is introduced as a result of adding all these little hops?
Berenberg: They add under a microsecond when they're local. That's how the network operates. We didn't introduce anything new.
Participant: Today there are firewalls and load balancers, but you're also now adding an additional layer of proxy beside each service, which doesn't exist today.
Berenberg: No, we didn't. What we did was normalize all the load balancers to be Envoy based, and normalize all the service meshes to be Envoy based plus gRPC based. Whatever people were running, we continue to run. We just normalized what kind of network equipment we run.
Participant: For organizations that don't already have or use a service mesh, introducing this means that where before service A talked directly to service B, now it's service A, proxy, proxy, service B.
Berenberg: That's correct. If you consider the Envoy proxy as a sidecar, it introduces latency. That's why we made the mesh proxyless with gRPC: it doesn't introduce any latency, because it only has a side channel to the control plane. You don't have to switch context to anything. It's under 1 microsecond, I believe.