Evaluating and Deploying State-of-the-Art Hardware to Meet the Challenges of Modern Workloads

News Room | Published 19 June 2025

Transcript

Weekly: I’m Rebecca Weekly. I run infrastructure at Geico. I’ll tell you a little bit about what that actually means. I’ll introduce the concept, and ultimately how we made our hardware choices for the efforts that we’re going through. I’m going to give the five-minute soundbite for what Geico is. Geico was founded in 1936. It’s been around a very long time. The sole purpose of what we live to do is to serve our customers, to make their lives better when things go bad. How do we do that? What does that look like? Just so you have a sense, we have over 30,000 employees. We have over 400 physical sites from a network perspective.

My team is responsible for connecting all of those different sites from a networking perspective. Then we have our primary regional offices. Those are 21 offices. Those are also connected. Then we have six on-prem data centers today. We have a pretty massive cloud footprint. My team owns the hybrid cloud across that experience, from the vended compute and storage down. I do not own any aspect of the Heroku-like platform stack. I really am at the vended compute and vended storage down across the clouds. That’s where I try to focus.

Geico – Infra Footprint

In order to tell you how I got to making hardware purchases, I want to take you back through how Geico ended up with the infrastructure footprint that we have today. In 2013, Geico, like many enterprises, made the decision to go all in on the cloud. We are going to do it. We're going to exit our on-prem data centers and we're going to go all in on the cloud. At that time, we had our six physical data center sites and all of our regional footprint, as I mentioned before. The reasoning was maybe an interesting thought process, so I'll share it a little bit. I wasn't there, but my understanding is that the decision was driven by a desire to give their developers more agility. They were feeling very constrained by their on-prem footprint in terms of their developer efficacy.

The thought was, if we go to the cloud, it has these fantastic tools and capabilities, we’re going to do better. This does not sound like a bad idea. The challenge was, they didn’t refactor their applications as they went to the cloud. They lifted and shifted the activities that were running on an SVC style storage SAN and on an on-prem footprint with old-style blade servers and separate L2, L3 network connectivity with subdomains all over, and a network segmentation strategy that really is something to behold. They moved that to the cloud, all in, 2014.

Fast forward to 2020: at that point, they were 80% cloud-based. Nearly 10 years into the journey, they had gotten almost all the way there. Costs for serving approximately the same load had gone up by 300%. Reliability dropped by two nines because of the surface area. I love the cloud. I worked at a cloud service provider. I am a huge fan of the cloud. Please take this with all due respect. It had nothing to do with any singular cloud. It had everything to do with how many clouds were selected. Every line of business chose their cloud du jour, generally associated with the anchor that they had. If they were using EDW, they're going to end up in IBM.

If they were using Exadata, they’re going to end up in Oracle. This is how you end up with such a large proliferation and surface area, which when you add the complexity of wanting to have singular experiences for users, not shipping your org chart, creates a lot of latency, a lot of reliability challenges, egress fees, all sorts of fun, not-planned-for outcomes, which is how you go from a fairly flat compute load over a period of time, but get a 300% increase in cost. Not ideal, and a very large reliability challenge.

As I mentioned, not one cloud, not two clouds, but eight clouds. All the clouds. Including a primary cloud, absolutely. One that had more than half of the general spend, but every other cloud as well, and PaaS services and SaaS services, and many redundant services layered on top to try and give visibility or better composability to data. You just end up layering more things on top to try to solve the root problem, which is, you’ve increased your surface area without a strategy towards how you want to compose and understand and utilize your data to serve your customers. Should we start at the customer? That’s what our job is, is to serve them.

At that time, just to give you a sense of what that footprint became, just one of our clouds was over 200,000 cores and over 30,000 instances. Of that cloud, though, our utilization was on average 12%. In fact, I had scenarios where I had databases that had more utilization when doing nightly rebuilds than in the actual operation of those databases. Again, it’s a strategy that can be well done, but this is not the footprint of how to do it well.

What Changed?

Let's really dig into why: what changed, and how we got on this journey of looking at our infrastructure as a potential source of core differentiation and optimization. One factor was rising cloud costs. A 2.5% increase in compute load over a 10-year period. A 300% increase in cost. Also, when you look at the underlying features, there was one cloud in which we were spending $50 a month for our most popular instance type, a VM instance type. That was actually running on an Ivy Bridge processor. Does anybody remember when Ivy Bridge was launched? 2012. I'm paying $50 a month for that instance. That is not even a supported processor anymore. It was one of those fascinating choices that was the right choice when they moved, when that was a current processor type.

Once people go to a cloud, they often don't upgrade their instances to the latest and greatest types, especially VMs, because that would be more disruptive to the business. You end up with instances that are massively slow and potentially massively overprovisioned for what the business actually needs. Rising cloud costs, that's how we got there. Number two, the premise of technology and developers being unlocked didn't pan out. Why didn't it pan out? Lack of visibility. Lack of consistency of data. Lack of actual appetite to refactor the applications. Lifting and shifting. We did things like take an ISV product that was working on-prem and migrate it to the cloud; then that ISV later created a cloud offering, but we couldn't take advantage of it because of where we were licensed and how we had built custom features on top of it. This is the story of every enterprise.

In every enterprise, whether you started with an ISV or an open-source stack, you start to build the features you need on top of it, and then it becomes very disruptive to move to their managed service offering, or anything else, in that transition. Unfortunately, it became even harder to actually deliver new services and features, because we now had to add the elements of data composition, of SLOs across different clouds, and increasing egress fees. The last thing was time to market. I just hit on it.

Ultimately, as a first-party tech group, our job is to serve a business. Their job is to serve our customers. If I can't deliver them the features they need because the data they want to use or the model they're looking at is in this cloud or this service model and my dataset is over here, I can't say yes. The infrastructure was truly in the way, and 80% of it was cloud infrastructure. That was where we asked: what do we have to do? How do we fix this to actually move forward?

The Cost Dilemma

I gave some of these numbers and I didn't put a left axis on it because I'm not going to get into the fundamentals of the cost. You can look over on that far side, my right, your left, and see our peak cost structure when we made the decision to look at this differently. We still had our on-prem legacy footprint, that's the dark blue on top, and our cloud costs. You can see the relative more-than-2x cost across those two because, again, we hadn't gotten rid of the legacy footprint; we couldn't. We still had critical services that we had not been able to evacuate despite 10 years. Now, how am I changing that footprint? The green is my new data center that came up in July, and another new one that we're starting to bring up now. You're seeing part of the CapEx costs, not all the CapEx costs, in 2024, and the final edge of the CapEx costs in 2026.

Then the end state over here is the new green for a proposed percentage. You all know Jevons paradox is such that as you increase the efficiency of the computation, people use it more. My assumption is the 2.5% growth rate, which is absolutely accounted for in this model, is not going to be the persistent growth rate. It’s just the only one that I can model. Every model is wrong, it’s just hopefully some are informative. That was the attempt here in terms of the modeling and analysis.
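
Just to make the shape of that modeling concrete, here is a minimal sketch of a projection with a flat compound growth rate. Every number in it is hypothetical; the real analysis was a multi-year P&L with far more inputs.

    # Minimal cost-projection sketch (hypothetical unit costs, not Geico's figures).
    # Assumes a flat 2.5% annual growth in compute load, which Jevons paradox says
    # is probably optimistic once computation gets cheaper.
    def project_costs(years=7, growth=0.025, base_units=1_000.0,
                      cloud_unit_cost=100.0,    # $ per normalized unit-year in the cloud
                      onprem_unit_cost=45.0,    # $ per normalized unit-year on-prem (opex)
                      onprem_capex=150_000.0):  # one-time buildout cost
        units = base_units
        cloud_total, onprem_total = 0.0, onprem_capex
        for _ in range(years):
            cloud_total += units * cloud_unit_cost
            onprem_total += units * onprem_unit_cost
            units *= 1 + growth
        return cloud_total, onprem_total

    cloud, onprem = project_costs()
    print(f"7-year cloud: ${cloud:,.0f}   7-year on-prem: ${onprem:,.0f}")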

The green is the net-new purchase of CapEx. The blue is the cloud. I keep hearing people say we’re repatriating. We are looking at the right data to run on-prem and the right services to keep in the cloud. You can call that what you would like. I’m not a nation state, therefore I don’t understand why it’s called repatriation. We are just trying to be logical about the footprint to serve our customers and where we can do that for the best cost and actual service model. That is in perpetuity. We will use the cloud. We will use the cloud in all sorts of places. I’ll talk about that a little bit more in the next section.

As an insurer, and as in many regulated industries, we have compliance and audit rules that mean we have to keep data for a very long time. We store a lot. How many of you are at a cloud service provider or a SaaS or higher-level service? If you are, you probably have something like an 80/20 or 70/30 split of compute to storage.

Most people in financial services are going to look at something more like a 60/40 split of storage to compute, because we have to store data for 10 years for this state, 7 years for that state. We have to store all of our model parameters for the choices around how we priced and rated the risk of that individual. We have to be able to attest to it at any time if anything happens in any of those states for anybody we actually insure. We don't just have auto insurance. We are the third largest auto insurer. We also have life, motor, marine, seven different lines of business that are primaries, and underwriting. There are lots of different lines of business that we have to store data for, for a very long time. That is what a lot of us look like. We have a lot of storage.
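
To make that storage math concrete, here is a minimal sketch of how retention rules turn into a steady-state footprint. The lines of business, retention periods, and data volumes are all hypothetical placeholders.

    # Hypothetical retention-driven storage estimate (illustrative numbers only).
    RETENTION_YEARS = {"auto": 10, "life": 7, "marine": 7}            # per line of business
    TB_GENERATED_PER_YEAR = {"auto": 400, "life": 50, "marine": 20}   # new data per year

    def steady_state_tb(retention, yearly_tb):
        # At steady state you hold roughly retention_years worth of each year's data.
        return sum(retention[lob] * yearly_tb[lob] for lob in retention)

    print(f"Steady-state footprint: {steady_state_tb(RETENTION_YEARS, TB_GENERATED_PER_YEAR):,} TB")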

Storage is very expensive in the cloud. It is very expensive to recompose when you need to at different sites. That entire cycle is the biggest cost driver of why that cloud is so expensive. I want to use the cloud where the cloud is good, where it’s going to serve my end users with low latency in all the right ways.

If I have somebody who is driving their car as a rental car in Italy, I don’t build infrastructure anywhere outside of the U.S. I need to have the right kinds of clouds for content dissemination at every endpoint where they might use their insurance or have a claim. There’s always going to be cloud usage that makes perfect sense for my business. This is not the place where we want to be guardrailed only to that, because it’s a regulated low margin business. Reducing my cost to serve is absolutely critical for being able to actually deliver good results to my end users. This is the cost dilemma that I was facing, that our leadership team was facing, that as we looked at it, we said, this is what we think we can do given the current load, given what we know about our work. That’s how we got the go ahead.

Hybrid Cloud 101

I’m going to start with, how do we make a decision? How do you actually do this at your company? Then, go and delve into the hardware, which is my nerd love of life. First, alignment on the strategy and approach. I just gave you the pitch. Trust me, it was not a four-minute pitch when we started this process. It was a lot of meetings, a lot of discussions, a lot of modeling, seven-year P&L analyses, all sorts of tradeoffs and opportunities and questions about agility.

Ultimately, when we got aligned on the strategy and approach, then it's making sure you have the right pieces in place to actually drive a cloud migration and solution. You got to hire the right people. You got to find the right solutions for upskilling your people that want to do new things with you. That was not a small effort. There's lots of positions within Geico tech across the board. That has been, I think, an incredible opportunity for the people coming in to live a real use case of how and why we make these decisions. It's been a lot of fun for me personally to build a team of awesome people who want to drive this kind of effort.

Number three, identify your anchor tenants and your anchor spend. What do I mean by that? I’ve now used the term twice, and I’m going to use it in two different ways. I’m going to use it this way for this particular conversation. Your anchor services or anchor tenants are the parts of your cloud spend you don’t want to eliminate. These are likely PaaS services that are deeply ingrained into your business. It may be a business process flow that’s deeply ingrained, like billing, that has a lot of data that you don’t necessarily want to lose. Or it might be something like an innovative experience. Earlier sessions talked about generative AI and talked about hardware selection for AI.

There are so many interesting use cases for the core business of serving our customers in their claims, whether that’s fraud detection and analysis, whether that’s interactive experiences for chatbots, for service models, where we want to take advantage of the latest and greatest models and be able to do interesting things for our developers. Those are the kinds of use cases that are tying us to various clouds. Maybe CRM services. Every business is different, but whatever they have, you need to work with your partners. The infrastructure has to serve the business. We don’t get to tell them what to do. They have to tell us what they need to do. Then we look for the opportunities to optimize the cost to serve across that footprint.

Identify those anchor services, and then the data. How is the data going to flow? What do you need the data for? Who are the services and users, and what are they going to need? How do we keep it compliant? How do we keep it secure across the footprint? Those are much more difficult conversations, because everyone wants their data right where they are, but there are massive cost savings from putting it on-prem. What needs to actually be there? How do you create the right tiering strategy with your partners? Here you start to talk in different terms than many businesses use. They don't know SLOs. That's not their life. They aren't going to be able to tell me a service level objective to achieve an outcome. They will give you a feeling or a pain point from an experience that they've had.

Then, I have to go figure out how to characterize that data so that I have a target of what we can't get worse than for their current experience, or where they have pain today, and so where we need to get to, to actually improve the situation. Model your data; model and understand what is needed where. Then, making sure you have real alignment with your partners on the data strategy, whether for sovereignty (I don't personally have to deal with sovereignty) or certainly for compliance and audit, is absolutely critical. It requires a lot of time. It has a lot of business stakeholders.
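
One way to capture that characterization is a simple "no worse than" record per workload that pairs the measured baseline with the committed target. This is just a sketch; the field names and values are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class WorkloadSLO:
        # Hypothetical structure for recording a "can't get worse than" target.
        name: str
        current_p99_latency_ms: float   # measured baseline today
        target_p99_latency_ms: float    # what we commit to after the move
        availability_target: float      # e.g. 0.999
        data_residency: str             # "on-prem", "cloud", or "either"

    claims_lookup = WorkloadSLO(
        name="claims-lookup",
        current_p99_latency_ms=850.0,
        target_p99_latency_ms=850.0,    # start by promising "no worse", then improve
        availability_target=0.999,
        data_residency="on-prem",
    )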

It is the most often overlooked strategy whether you’re going to the cloud or coming from the cloud. It will cost you if you don’t take the time to do that correctly. It will cost you in business outcomes. It will cost you in your customer service experience. It will certainly cost your bottom line. This is the step everyone forgets. Know what they don’t want to let go of, because you will not serve them well if you take it away. Know what they need to have access to, and make sure it is the gold. It is the thing that they have access to always.

Now you got that. You’ve got a map. You’ve got a map of your dependencies. You’ve got a map of your SLOs. You’ve got a map of where you’re trying to go. Now you need to look at your technology decisions. You get to choose your hardware, your locations for your data centers, your physical security strategies, all sorts of fun things. Then, you got to figure out what you’re going to expose to your users to actually do the drain, to actually move things from one location to another. You need to really understand what you’re building, why you’re building it, and then how you’re going to move people to actually create the right scenario to actually execute on this cost savings and vision. Then you create a roadmap, and you try and actually execute to the roadmap. That is the overview of what I’m about to talk about.

1. Start with Your Developers

Me, personally, I always want to start with my developers. I always want to start with my customer. For me, infrastructure, we’re the bottom. We’re the bottom of the totem pole. Everybody is above us. I need to know my data platform needs. I need to know all my different service layer needs on top of the platform, whether it’s your AI, ML. Then, ideally, you also need to turn that into an associate experience and a business outcome.

This is a generic stack with a bunch of different elements for the data and control plane. No way that I can actually move the ephemeral load or the storage services if I haven’t exposed a frontend to my developers, particularly my new developers, that is consistent across the clouds. Start with a hybrid cloud stack. What are you offering for new developers? Stand it up tomorrow. It’s the most important thing you’re going to do in terms of enabling you to change what’s happening below the surface. We start there. We start with our developers, what we need to expose. Those are good conversations. What do they need? If you have one team that needs one particular service, they should build it. If you have a team that’s going to actually build a service that’s going to be used by 4 or 5 or 7 or 12 different applications, that’s one that belongs in your primary platform as a service layer. Kafka, messaging, that’s a good example of one that you probably are going to want to build and have generically available across clouds. That’s the way.
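
As a sketch of what "build it once, expose it everywhere" can look like for something like messaging, here is a provider-agnostic interface with a stand-in implementation. The class and method names are hypothetical, not Geico's platform API; in production the same contract would be backed by Kafka or a managed equivalent.

    from abc import ABC, abstractmethod
    from collections import defaultdict

    class MessageBus(ABC):
        """One developer-facing contract, regardless of which cloud runs underneath."""

        @abstractmethod
        def publish(self, topic: str, payload: bytes) -> None: ...

        @abstractmethod
        def subscribe(self, topic: str, handler) -> None: ...

    class InMemoryBus(MessageBus):
        # Stand-in implementation; the platform team would swap in a Kafka-backed
        # one without application code changing.
        def __init__(self):
            self._handlers = defaultdict(list)

        def publish(self, topic: str, payload: bytes) -> None:
            for handler in self._handlers[topic]:
                handler(payload)

        def subscribe(self, topic: str, handler) -> None:
            self._handlers[topic].append(handler)

    bus = InMemoryBus()
    bus.subscribe("claims", lambda msg: print("got", msg))
    bus.publish("claims", b"new claim filed")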

2. Understand your Cloud Footprint

I will actually talk about turning the cloud footprint into a physical footprint and how to think about those problems. Our primary cloud had over 100 different instance types. This is what happens when you lift and shift a bunch of pets to the cloud. You end up with a lot of different instance types because you tried to size them to match what you had on-prem. The good news about the cloud is that there's always a standard memory-to-compute ratio. You're going to get 4 gigs or 8 gigs or 16 gigs or 32 gigs per vCPU. That's the plan. That's what you do. So you have a good view of provisioned capacity. Note I say provisioned. Utilization is a totally different game, and how you actually measure your utilization is where you're going to get a lot of your savings. It's important that you understand provisioned capacity versus utilization. What did we do? We took that big footprint of 103-or-whatever different instance types and we turned it into a set of 3 primary SKUs.

A general-purpose SKU, a big mem SKU, and an HPC style SKU for all the data analytics and ML that is happening on our company’s footprint. That was the primary. Then we had a bunch of more specialty variants. I don’t particularly want to keep a bunch of specialty variants for a long time. Again, not where infrastructure on-prem is going to be your best cost choice in savings. For now, there are certain workloads that really did need a JBOF storage, cold storage SKU. This is a big savings opportunity for us. There were definitely reasons why we got to that set of nine.
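
A minimal sketch of that consolidation exercise, keyed off the memory-per-vCPU ratio. The SKU names and thresholds below are hypothetical; the instance shapes are just examples.

    # Map a zoo of cloud instance shapes onto a small set of on-prem SKUs
    # using the memory-per-vCPU ratio (illustrative thresholds only).
    def target_sku(vcpus: int, mem_gib: int) -> str:
        ratio = mem_gib / vcpus
        if ratio >= 8:
            return "bigmem"            # memory-bound databases, caches
        if ratio <= 2:
            return "hpc"               # compute-dense analytics / ML
        return "general-purpose"       # everything else

    fleet = [("m5.2xlarge", 8, 32), ("r5.4xlarge", 16, 128), ("c5.9xlarge", 36, 72)]
    for name, vcpus, mem in fleet:
        print(f"{name:12s} -> {target_sku(vcpus, mem)}")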

This one, we could have a longer debate about. The provisioned network capacity to your instance is very hard to extract. Each cloud is a little bit different in this domain. You can definitely see how much you're spending on your subnets. You can see the general load across them. You can see the various elements of the network topology designed into your cloud. Your instance type will usually tell you your provisioned capacity, but actually understanding how much of that network interface you're using, in terms of actual gigabits, is very hard. Different clouds have different choices. You can do a lot of things with exporters. If you're using any kind of OpenTelemetry exporter and you have those options, you can try.
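
If you do go the exporter route, a sketch of what that can look like from inside an instance is below, using OpenTelemetry plus psutil to publish per-NIC byte counters. The meter and metric names are hypothetical, and this only sees what the guest OS sees.

    # Assumes the opentelemetry-sdk and psutil packages are installed.
    import psutil
    from opentelemetry import metrics
    from opentelemetry.metrics import CallbackOptions, Observation
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (ConsoleMetricExporter,
                                                  PeriodicExportingMetricReader)

    reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
    meter = metrics.get_meter("network-utilization")

    def observe_nic_bytes(options: CallbackOptions):
        # Report cumulative bytes per NIC; a backend can turn these into gigabits used.
        for nic, counters in psutil.net_io_counters(pernic=True).items():
            yield Observation(counters.bytes_sent, {"nic": nic, "direction": "tx"})
            yield Observation(counters.bytes_recv, {"nic": nic, "direction": "rx"})

    meter.create_observable_counter("nic.bytes", callbacks=[observe_nic_bytes], unit="By")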

The hardware-level metrics that those of us who build on-prem are used to, Perfmon, everything that you can pull out of PMU counters, everything you can pull out of your GPU directly if you're actually in that space, you do not get. That is probably the hardest area. Once you've got a good sense of compute, something to think about in general when turning a cloud footprint into an on-prem one is that you now have to provision for failover. You have to double it. Also, if your utilization is, let's say, 8%, 12%, 14%, you can hope your users will get towards 60% in your on-prem footprint as you move them towards containers. There's hope and there's necessity. Nothing moves overnight. You can do some layering to assume you're going to get better utilization because you'll have better scheduling, because you'll have more control.

Ultimately, you still have to leave a buffer. I personally chose to buffer at the 40% range. Everyone has a different way they play the game. It’s all a conversation of how fast you can manage your supply chain to get more capacity if you take less buffer. Double it and assume you’re going to lose 40% for failover for HA, for all the ways in which we want to make sure we have a service level objective to our end users for inevitable failures that are happening on-prem.
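
Put together, the translation from a cloud fleet to an on-prem order is simple arithmetic. The sketch below is one possible reading of the "size for target utilization, double for failover, keep a 40% buffer" rule; the exact accounting is a judgment call, and the defaults are only the illustrative numbers from this talk.

    def onprem_cores_needed(cloud_provisioned_cores: int,
                            observed_utilization: float = 0.12,  # what the cloud fleet actually uses
                            target_utilization: float = 0.60,    # what better scheduling should reach
                            failover_factor: float = 2.0,        # double for failover / HA
                            ha_buffer: float = 0.40):            # headroom kept for failures
        used_cores = cloud_provisioned_cores * observed_utilization
        base = used_cores / target_utilization            # size for steady-state load
        return base * failover_factor / (1 - ha_buffer)   # then double and keep the buffer

    print(f"{onprem_cores_needed(200_000):,.0f} on-prem cores for a 200,000-core cloud fleet")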

3. Focus On your Route to the Cloud

Let’s talk about the network. Enterprises have fascinating franken-networks. I did not know this, necessarily. Maybe I knew this as an associate working at a company in my 20s. What happens? Why did we get here? How does this happen? What used to happen before COVID, before the last 15 years of remote work, is people had to go to the office to do their work. The network was the intranet, was the physical perimeter. This is what most of these businesses were built to assume. You had to bring your laptop to the office and plug in to be able to do your service. Then, during COVID, people had to go remote. Maybe that was the only first time it happened. Maybe it happened 10 years before that because they wanted to attract talent that could move and go to different locations, or they just didn’t want to keep investing in a physical edge footprint.

Whatever the reason, most of those businesses bolted on a fun solution three hops down to try to do what we would normally do as a proxy application user interface to the cloud. Assume you have a large edge footprint and a set of branch offices on, let's say, MPLS, which is rather expensive; it can easily double your cost per gig. In some places you'll probably see it 5, 6, 10 times more expensive per gig. You're probably underprovisioned in your capacity because you paid so much for this very low-jitter connectivity, which someone sold you. That runs on a cloud, too. I'm not counting that in my eight clouds. I just want you to know that MPLS is done through a cloud, and you don't actually know that it's a distinct thing two or three hops away from where you are. Lots of failure domains, because you're probably single-sourced on that.

Then, it gets backhauled to some sort of a mesh ring. You’ve got some mesh that is supporting, whether it’s your own cloud, whether it’s their cloud, there’s some mesh that is supporting your connectivity. That goes into maybe your branch office, because, remember, all your network protocol and security is by being on-prem, so you’ve got to backhaul that traffic on-prem. Then that will go out to a CNF, some sort of a colocation facility, because that’s probably where you were able to get a private network connection, which if you are regulated, you probably wanted a private network connection to your cloud provider. That’s the third hop. Now that’s maybe the fourth hop, it depends.

Then you go to a proxy layer where you actually do RBAC, role-based access control. Does this user, does this developer, does this application have the right to access this particular cloud on this particular IP? Yes, it does. Fantastic. You get to go to that cloud or you get to have your application go out to the internet. I talk to a lot of people in my job at different companies.

Most of us have some crazy franken-network like this. This is not easy to develop on. Think about the security model you have to enforce. Think about the latency. Think about the cost. It’s just insane. This is your route to the internet. This is going through probably an underprovisioned network. Now you have to think through, where do I break it? How do I change the developer compact so that this is their network interface? Wherever they are, any of these locations, it goes to a proxy, it goes out to the network. That’s it. That makes your whole life simpler. There’s a lot of legacy applications that don’t actually have that proxy frontend, so you have to build it. You have to interface to them. Then you manage it on the backend as you flatten out this network and do the right thing. It’s probably the hardest problem in most of the enterprises, just to give you a sense of that network insanity.

4. Simplify your Network and Invest in Security at all Layers

Again, all those boxes are different appliances. Because you have trusted, untrusted, semi-trusted zones, which many people believe is the right way to do PCI. Makes no sense. In the cloud, you have no actual physical isolation of your L2 and your L3, so if you promulgated this concept into your cloud, it’s all going on L3 anyway. You’re just doing security theater and causing a lot of overhead for yourself, and not actually doing proper security, which would be that anybody who’s going across any network domain is encrypted, TLS, gRPC. You’re doing the right calls at the right level and only decrypting on the right box that should have access to that data.
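
In code, "encrypt across every domain and decrypt only where you must" is mostly just standard TLS on every client and server. Here is a minimal gRPC client-side sketch; the certificate path and service address are placeholders.

    import grpc

    # Traffic crossing any network domain stays encrypted; only the service that
    # should see the data holds the key material to terminate TLS.
    with open("ca.pem", "rb") as f:
        credentials = grpc.ssl_channel_credentials(root_certificates=f.read())

    channel = grpc.secure_channel("claims.internal.example:443", credentials)
    # A generated stub (e.g. claims_pb2_grpc.ClaimsStub(channel)) would make the
    # actual calls; payloads stay encrypted end to end across every hop.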

That encrypt-everywhere approach is the proper security model, whether it's credit card information or personal information. The security theater is not ok. It's not a proper model for anything, and it's causing a lot of overhead for no real advantage. A nice fully routed core network topology works, and it doesn't even have to be fully routed. You can actually look at your provisioning, your failure rates, your domains, and come up with the right strategy here. That zoned appliance setup is not the right strategy. Maybe I'll put one more point on it. Once you look at this franken-network and the security model you have, regardless of the provisioned capacity that somebody is experiencing in the cloud, it's usually going to be the latency that hits them long before the bandwidth. There is a correlation between loaded latency and your actual bandwidth.

Fundamentally, the problem to solve is the hops. Make the best choice of network interface card and backbone that you can. Interestingly enough, because the hyperscalers generally buy at the 100-gig-plus increments, even though I could have gotten away with 25 gig or lower, it made sense to go where the sweet spot of the market is. You're not going to change your network design for at least five to seven years. It's just not going to happen. Better to overprovision than underprovision. Go for the sweet spot in the market. It's not 400 gig, but 100 gig is a pretty good spot, and 25 gig might be fine for your workloads and use cases.

5. Only Buy What You Need

Only buy what you need. I already gave you my rules around where you want to have doubling of capacity from your cloud footprint to your on-prem footprint, and how you want to buffer and think through your capacity in those zones. When you're actually looking at your hardware SKUs, it is very important to only buy what you need. I have a personal bias on this. I'm going to own it. A lot of people who sell to the enterprise sell a lot of services. Whether that's a leasing model. Whether that's call-me-if-you-need-anything support. Whether that's asset management and inventory. Whether that's management tools to give you insights or DCIM tools to give you insights. These, to me, don't add value. Why don't they add value? Supply chain has been a real beast for the last four years.

If every part of my service flow is locked in, running on somebody's DCIM that only supports them as a vendor, or a management portal that only supports them as a vendor, I have lost my ability to take advantage of other vendors who might have supply. When I say only buy what you need, I mean buy the hardware. Run open source. It's actually quite excellent. Your developers are probably already using it if they've been using the cloud, at least at the infrastructure layers. I'm not talking about PaaS layers. Truly at the infrastructure layers, they're probably running Linux. They're probably running more open-source choices.

If that's the case, I personally looked at ODM hardware. I like ODM hardware. ODM is a model; you can even buy from somebody who's a traditional OEM in an ODM style. It basically means being able to purchase the hardware that you want and have visibility into your firmware stack, your BIOS, your maintenance, so that you can actually deploy and upgrade later if you need to. That's important to me, because right now I have a massive legacy footprint but a bunch of developers building net-new stuff. The memory ratios I have right now may not be what they want in the next two or three years, or the storage, or fill in the blank.

Doing this work, and taking a model here of 1,000 cores, 1 terabyte of memory, and, yes, 1 petabyte of storage, just to normalize things out, we got about 50% or 60% less. That's with all the bundling I mentioned. Even doubling your capacity and buffering 40%, it's still that much cheaper than the equivalent primary SKUs for vended compute and storage capacity. That has nothing to do with PaaS, nothing to do with the awesome things in cloud. This is very specific to my workloads and my users. Your mileage may vary.
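
A minimal sketch of that normalization is below. Every dollar figure is made up for illustration; the point is the method of comparing a normalized bundle, not the numbers.

    # Compare cloud vs. on-prem monthly cost for a normalized bundle:
    # 1,000 cores, 1 TB of memory, 1 PB of storage (hypothetical unit prices).
    def monthly_cost(core_price, tb_mem_price, tb_storage_price,
                     cores=1_000, mem_tb=1, storage_tb=1_000):
        return cores * core_price + mem_tb * tb_mem_price + storage_tb * tb_storage_price

    cloud = monthly_cost(core_price=30.0, tb_mem_price=500.0, tb_storage_price=25.0)
    onprem = monthly_cost(core_price=12.0, tb_mem_price=200.0, tb_storage_price=10.0)

    print(f"cloud ${cloud:,.0f}/mo   on-prem ${onprem:,.0f}/mo   "
          f"savings {(1 - onprem / cloud):.0%}")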

6. Drive your Roadmap

Finally, go take that puppy for a ride. You've got to go do it. It's great to have a model. It's great to have a plan. You have to actually start that journey. We started our journey in about June of 2023. The decisions started being made, and the analysis began, in February of 2023. We made our first contacts towards actually buying new hardware and looking at new data center location facilities in June of 2023. We issued our RFPs, got our first NDAs signed, and started our evaluations on those.

We did our physical site inspections and made sure that we understood and knew what we wanted to contract for based on latency characteristics. By basically July of this year, we had our first data center site up and built, of the new variety that is actually geo-distributed. That was not actually taken into account in the six data centers we had previously; they were a failover design. Then we had our first units delivered in September. We had everything up, running, and debugged, and started to serve and vend capacity and compute by the end of this month for our first site.

Then our second site is coming up next year. Those didn't happen overnight by any means. If I were to show you, alongside the hardware path, the building-out of a new Kubernetes flow and the ramping-up of OpenStack for our fleet management on-prem, those were happening in very similar timeframes: end of last year to build up the first, to really give a consistent hybrid cloud experience for our end users and onboard them there, running on top of public clouds but getting away from the vendor-locked SDKs into true open source. That gives us the capability of migrating later, so that you stop the bleed as you prepare the new hardware you want to have underneath.

Lessons Learned

I have one more: things I wish someone had whispered into my ear when we started this journey. You cannot underestimate how much you need the right team. To go from buyers to builders, you need the right folks who can do that. It doesn't mean you can't teach people. It doesn't mean you can't work together. You need the right senior technical talent and the right leaders in the right spots who've done this before. You need at least two years. You need a leadership team who understands it's going to take two years, and that they have to be wanting and willing. I've seen too many partners or friends on this journey say, yes, it seems like a good analysis, but six months in, nine months in, there's a new leader and we're out. You're not going to succeed in anything if you don't have the willpower to stick with it. It's going to be at least two years. Hardware and software don't come together, or they shouldn't. You need to really think through your software experience, your user experience, your hybrid cloud. You can do that now.

There are so many ways in which you get locked into vendors. You get locked into the services and the use cases of Amazon or Microsoft, love them all. You can start to break that juggernaut immediately, and you should, for whatever reason: whether you're coming on-prem, whether you're doing a hybrid cloud strategy, whether you want to find a different cloud for a specific use case. There are a bunch of GPU-focused clouds because it's hard to get GPUs in the big clouds. Whatever your reason is, understanding what is anchor, and that you're going to keep it, and taking everything else out of the proprietary stacks gives you autonomy in making decisions as a business. If you care about margins whatsoever, it's a good choice. Detailed requirements.

If there's anything I found on this journey that I had to say to myself over and over again, it is: do not let perfect be the enemy of dumb. It's all going to change anyway. That's the point. Take the best signal you can from the data you have and make the best decision you can. Document your assumptions, your dataset, and what might change it, and then go. Just go. Try to create an environment for your team where it's ok, where you're going to screw up and it's ok, because there's no way to get it right straight out of the gate.

The best thing you can do is talk to your customers and make sure you really understand their requirements in their language. If you don't have those conversations, you are definitely wrong. Maybe the other thing that is interesting is that open is not so open, depending on which layer of the stack you're looking at. Even if you think a managed Kubernetes service should in theory be the same as Kubernetes, no, it's not. They've built all sorts of fun little things on the backend to help you with scaling.

Breaking away from it, even where you think you've chosen a more reasonable methodology, can be hard. I would be remiss not to say that on this journey there are a lot of people in the open-source community who have helped us. That has been wonderful. Whether it's the CNCF and the Linux Foundation, OpenStack, OpenBMC, or the Open Compute Project, this community of co-travelers is awesome. We are very grateful for them, and we're members of most of these organizations.

Questions and Answers

Participant: The two years, the timeframe that you said, is it per data center?

Weekly: For me, that two-year roadmap is to go from six on-prem data centers to two data centers. Again, whether you do two or three is a choice for every company. You need two, because you want to have active-active. Unless you have a truly active-passive footprint, which maybe you do. Most companies want an active-active footprint, so you need two physical sites. If you have only two physical sites, you’re going to be writing your recovery log to the cloud. That is your passive site. That is your mirror. If you would rather do that on-prem, then you would want a third site. That’s a choice. It should come down to your appetite for spending in the cloud, where and why and how you want to think through your active-active and your recovery time. Cloud’s a great recovery place. It tends to have pretty good uptime when you have one of them. We’ve had some consternation given our journey and our experience in the cloud.

Again, I think that’s very much to the user, if you want to do three sites versus two. That’s the two years for the six to the two, or three. The longest end is usually the contracting on the frontend, doing the assessment, doing the site assessment, making sure they have the right capacity. Depending on what you’re purchasing from a data center colocation facility provider, I’m a huge fan of colocation facilities, if you are less than 25 megawatts, you don’t need to be building your own data center. Colos are great. They have 24-7 monitoring. They have cameras. They have biometrics. They are fantastic. They’re all carrier neutral at this stage. If you looked 10, 12 years ago, you might have been locked into a single service provider from a network perspective. All of them have multi-carrier at this stage. It’s a fantastic way.

Colos tend to have interesting pricing in terms of retail versus commercial versus high-end users, where you are actually having a colo built for you for your primary use. Most enterprises are going to be over retail but way under commercial use. Pricing is different there than in maybe other places. All the cost models I showed are very much in that over-retail range; if you're under retail, the model does not hold. If you're over retail size, they're going to show pretty similar economics from a site perspective. Colo facility buildout is usually 8 to 12 weeks. If you're using a colo provider that doesn't have a carrier you want to use, getting network connectivity to that site can be very time consuming. Outside of that, everything else is pretty easy to do.

 
