Building Resilient Platforms: Insights From 20+ Years In Mission-Critical Infrastructure

Transcript

Liste: I’m going to talk about building resilient platforms. As the prior talk went over how to consume cloud platforms, I’m going to talk about how to build these platforms. The talk is from the perspective of a builder: I’ve spent over 25 years building various platforms that support critical applications. We all build platforms. We all consume platforms in some shape or form, so hopefully this talk resonates with you, either as a consumer or as a builder. While my perspective has been about building infrastructure platforms, we also build software development platforms, messaging platforms, and, for that matter, platforms for banking, as the prior talk spoke about. This talk is applicable to pretty much anyone. You build software for someone else, and so this is about the principles of how you build software that is consumed by other people at scale.

I am currently the head of infrastructure at American Express. I’ve been there a couple years. I was at JPMorgan Chase. I was there for nearly 10 years. Then I was at Goldman Sachs for 10 years before that. I happened into banking, so my background was in network and telecom. Prior to that, I built basically big networks for internet, cable, as well as offshore networks. I did a lot of different things lower in the stack and really ended up in New York, because my wife’s a New Yorker. We had a baby. She said, we’re moving back to New York. You can come with me, or you can stay where you are, but I’m moving back. I moved to New York.

Back then, the only good jobs for a low-level infrastructure person were in financial services. I’ve stayed there ever since. It’s been a great field because, as you can imagine, the banks care a lot about uptime. They care a lot about resiliency, and they invest a lot in the underlying substrates. I’ve been lucky to be in a field where I’ve had continuous investment in doing really interesting platform work at scale.

What is a Platform?

Let’s start with the definition of a platform. I pulled this out of a dictionary. My art is completely GenAI. One of the things I’ve been thrilled with in GenAI is that I could finally get content that is copyright free. This is all created by an engine. I asked it to create platform art in the style of Edward Hopper. Every image here looks nearly the same in the background, but I wanted to make sure that I could put in imagery that no one will sue me for. This is all my creative imagery generated by an image generator. Now, the platform definition. The dictionary definition of a platform, if you start with the top, is a raised platform like what I’m standing on, where people or things can stand. It’s a great metaphor.

If you think about platforms in your life, like this is a platform, a train platform. You could extend it further and say, what about city platforms like the train system, sewage? These are all platforms that you don’t really think about. They’re just there. They hide things underneath them that you don’t need to worry about and never think about. You take them for granted. The technology definition of a platform is a set of integrated technologies that are used as a base to develop other applications or processes. Cloud. Now, I’ve been working on what I thought of as, we never called it cloud, but 25 years of developing platforms that others consume to write their software on top of. My job has always been to do that in a way that no one knows I’m there, in the best possible manner as a platform builder. People, they never thank you. They don’t know you’re there. They don’t appreciate the work.

Ultimately, you’re doing a really good job when they never call, because then the platforms work. They’re there. They’re consistent. They perform. For the platform builders, thank you for doing this as well. For those of you that are consumers, appreciate that you only see the tip of the iceberg when you consume a platform. When you’re using any of the cloud platforms we spoke about earlier, Azure, AWS, GCP, great cloud platforms, but the complexity that goes into building them is luckily underappreciated.

In fact, I’m using a power platform right now. There’s power in my laptop. There’s power in this room. The complexity that goes into generating that power is something that I take completely for granted. I’m incredibly happy that there are specialists that do that on behalf of us each and every day and allow us to do other things. Platforms are all about being able to leverage those components so you can focus on your job at hand.

The Principles of Infrastructure Platforms

I’m going to talk about these principles that I wrote down. These came out of a white paper I wrote a few years ago to use internally, to really talk to my broader team of how should you think about building and what has helped me in my journey of building platforms over these 25 years. These are in no particular order. Some are more important than others, I think, or they apply at different times. This is what I’m going to cover during this talk, is these principles.

1. Deliver an Intuitive Experience

We start with the first principle. I said they are in no particular order of importance, but this one is probably one of the most important ones. Great platforms are intuitive. They deliver an intuitive experience. I started using public cloud about 20 years ago. It was not an intuitive experience. It was actually pretty gruesome. The amount of complexity, the amount of difficulty, the obtuseness of those platforms made it very hard to consume them. Using a public cloud today is a really intuitive and great experience. Why? Because they’re integrated platforms, they’re intuitive, they hide and mask all that complexity under the hood. They appear magical. They make magical things happen. I love this quote from Arthur C. Clarke, which says that any sufficiently advanced technology is indistinguishable from magic. I’d say that’s true for great infrastructure platforms. They operate and they build magically.

All the things that you don’t have to worry about as a consumer are done because these platforms build all that together. I’m going to talk a bit about how to make that magic work and how that assembly fits together. The other quote is from Steve Jobs, and I’ll talk a bit about this simplicity point. I speak about hiding complexity. Great platforms, and I think about when I’ve been very successful in building what I’ve been very proud of as infrastructure platforms, really have hidden the complexity away. As Steve Jobs said, simplicity is the ultimate sophistication. It is very hard to make something appear very simple. It is very easy to expose complexity. It is not easy to make something that just works in a way that is incredibly simple and intuitive. What I want to talk about is the complexity that goes into making that happen.

2. Build Common and Interchangeable Components

The principle number two here, which is, build common interchangeable components. I spoke about great platforms are integrated, so things work well together. When you go to a cloud provider, and we spoke about multi-cloud earlier. This applies if you run on-prem, if you run in private cloud, public cloud, use multiple clouds, the fact is that you can run your software. The prior talks, we had a really complex banking system. Now, there’s a lot of different components, and that will use messaging, databases, web tiers, multiple different components. They all interlock together and work together. Platforms have to have common interchangeable components.

A simple example is observability. Those of you that have been in this industry a long time probably remember when observability was completely fragmented and it was impossible to easily troubleshoot across components, and that even held true in the early days of public cloud. Now when you use a public cloud platform, you get common logging, common observability, common telemetry. You can have common dashboards and consoles around that. The beauty is that they’re interchangeable components. Lego, as a metaphor, is a really valuable one. Think about how Lego has only a very finite set of blocks that interlock, but you can build an infinite number of shapes from it.

The point about making this common and interchangeable is that these platforms dictate to the service providers, meaning the people building databases, messaging engines, and so on. You do not get to choose what kind of observability to use. You do not get to choose how to do identity, and so on. Because they insist on common interchangeable components, you as a consumer get the benefit that this just works together. I think of that as freedom from choice.

From the service providers, again, people building the components in there, they don’t get to choose certain things because the platform insists on this commonality that then interlocks it all together. Or identity. Imagine you have a three-tier application, let alone something as complex as the prior talk. You can deploy it, and it seamlessly manages identity. It manages how you deploy that application on top of that stack, which is, of course, the beauty of having interchangeable components.
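To make the "freedom from choice" idea concrete, here is a minimal sketch of a platform that dictates logging and identity to every service while leaving the service logic open. All of the names (`PlatformService`, `Identity`, the `"user"` role) are hypothetical and chosen for illustration; they are not from the talk or from any real platform.

```python
import logging
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Hypothetical platform-issued identity for a caller."""
    principal: str
    roles: frozenset


class PlatformService(ABC):
    """Base class every service on this hypothetical platform must extend.

    The platform, not the service team, decides how logging and identity
    work. The service team only decides what happens inside `handle`.
    """

    def __init__(self, name: str) -> None:
        self.name = name
        # Common logging: every service gets a logger in the same namespace.
        self.log = logging.getLogger(f"platform.{name}")

    def call(self, caller: Identity, request: dict) -> dict:
        # Common identity check, applied the same way for every service.
        if "user" not in caller.roles:
            self.log.warning("denied principal=%s", caller.principal)
            raise PermissionError(caller.principal)
        self.log.info("request principal=%s", caller.principal)
        return self.handle(request)

    @abstractmethod
    def handle(self, request: dict) -> dict:
        """Service-specific logic; the only part a service team writes."""


class DatabaseService(PlatformService):
    def handle(self, request: dict) -> dict:
        return {"service": self.name, "rows": [request["query"]]}


class MessagingService(PlatformService):
    def handle(self, request: dict) -> dict:
        return {"service": self.name, "published": request["topic"]}
```

Because both services inherit `call`, a consumer sees the same identity behavior and the same log namespace whichever component they use, which is the interlocking property described above.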

3. Use the Three S’s: Stability, Security, and Scalability

I’ve been in financial services for the last 20-plus years, which means that the software that runs on top of the platforms that I built tends to be mission critical to the organization. It could be trading systems, banking systems, credit card processing systems, but they’re all systems that do not tolerate downtime. They do not tolerate security breaches. They do not tolerate that you cannot scale with their business. We’ve had to build around the premise of what I call the three S’s: stability, security, scalability. They’re non-negotiable. In real estate, for example, people sometimes say you can only get two out of three. You can get location, price, or size. You can only choose two. You can never get all three.

In this case, you have to have all three. You cannot opt out of any of these three. If you think about how you build, you have to really think about, how am I going to achieve these three things forever? Not just when I launch a platform, but every day since, it has to be able to do this. Stability is probably the primary one: it just has to work, and work all the time, and work consistently. That’s easy to say if you never change anything. It’s easy to achieve stability if nothing ever changes. But if nothing ever changes, you probably won’t have a secure platform, because you’re not patching it, you’re not dealing with vulnerabilities. You have to patch it, which means that you then have chaos in the environment that is now impeding your scalability.

Then, on scalability, I think of building for 10x. Every platform builder wants to be wildly successful. Think about a public cloud provider like AWS. I don’t think they would have imagined 20 years ago that they would be at the scale they are today. You have to anticipate that. One of my bosses described it as being like World War Z, where the zombies are coming at you, and you can’t stop them. Great platforms have that kind of experience, where you have customers who love what you’re building because what they were dealing with prior was not giving the same experience, which means that you have to build for huge scale.

Anticipate that people are going to over-consume you. Most platforms fail for this reason. They don’t scale with their customers. You built something that worked great until too many customers came and used it, and now it doesn’t work great anymore because you have too many bottlenecks across your stack. Being able to balance these three is really complex, and as I said, non-negotiable. The one that is negotiable, I didn’t put it on here because it is actually something you can negotiate, is cost.

Sometimes you can invest a lot in building platforms, sometimes less so, depends on the business you’re in. I will say that cost can fluctuate as in how much you’re willing to spend. To achieve these three S’s, you have to essentially spend more or less over time to manage through that. I will say that these three, even though they always hold true, you sometimes over-invest. I’ve had cases where I have not been able to sufficiently keep up with scalability, and so we then dial back on our patching cycle to scale more to go back to that. It’s not that these are always 100%, but there’s a certain floor that you never can go beneath for you to be successful with this.

4. Be Evergreen

Fourthly, and this relates to it, be evergreen. Security and scalability are critical. To be constantly secure or secure enough, you’ll never be 100% secure, but you have to be sufficiently secure, you have to be continuously maintaining and managing your underlying environment, and you have to be evergreen. That’s really hard, because being evergreen means you have to patch, and again, think at scale. The plants I’ve run, just to give you an order of magnitude, is tens of thousands of servers, hundreds of thousands or millions of VMs and containers and databases and messaging brokers, and so on. Imagine that you’re maintaining an environment that has millions of widgets.

All of those widgets have a life cycle, and you have to maintain and update those widgets on a continuous basis. Sometimes the cycle is every quarter. Sometimes every six months. Sometimes every year. Sometimes it’s actually every couple of weeks. To manage that underlying plant without customer disruption, or with minimal customer disruption, is really hard. Being evergreen is something you have to think about up front, and manage through. For a lot of you that write cloud-native software, this is not so much of a problem, because you can probably do rolling upgrades: your apps are horizontally scalable, and so they can deal with individual instances being down. A lot of what we run in financial services is not written in a way that lends itself very well to that.
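The rolling upgrade mentioned here can be sketched in a few lines. This is an illustrative toy under stated assumptions, not how any particular platform does it: `rolling_upgrade`, its `upgrade` and `healthy` callbacks, and the instance names are all hypothetical, and real rollouts would also drain traffic and wait between batches.

```python
from typing import Callable, List


def rolling_upgrade(
    instances: List[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Upgrade instances batch by batch; halt at the first unhealthy one.

    Only one batch is touched at a time, so the remaining instances keep
    serving. Returns the instances confirmed healthy after upgrade.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    done: List[str] = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for instance in batch:
            upgrade(instance)
        for instance in batch:
            if not healthy(instance):
                raise RuntimeError(
                    f"{instance} unhealthy after upgrade; halting rollout "
                    f"after {len(done)} healthy upgrades"
                )
        done.extend(batch)
    return done


# Usage with an in-memory stand-in for a fleet of four app instances.
versions = {"app-1": 1, "app-2": 1, "app-3": 1, "app-4": 1}
upgraded = rolling_upgrade(
    list(versions),
    upgrade=lambda name: versions.__setitem__(name, 2),
    healthy=lambda name: versions[name] == 2,
    batch_size=2,
)
```

The halt-on-failure check is the design choice that matters: a bad patch stops after one batch instead of reaching the whole fleet. A stateful database that cannot run as several interchangeable instances has no equivalent of `batch_size`, which is why the talk treats it as the harder case.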

Then think about, how do I quiesce and patch my database that cannot tolerate downtime? How do I manage those change windows? How do I coordinate with my clients and client-side software? Even more so, what if I upgrade the client contract, the SDK, the API they use? How do I manage through that? There’s an enormous amount of complexity that goes into staying evergreen, but you have to. Because when you get behind on this, and I have at times been so far behind on managing this that we had major security issues, let alone something like Log4j, it’s incredibly hard to catch up. I think that thinking through this up front and working through this is something you have to do. It’s very easy to stop doing this, but if you stop doing it, you cannot catch up, in my experience. It needs eternal focus.

5. Avoid Undifferentiated Heavy Lifting

I’m going to talk about two angles of staying focused. The first angle is principle five, avoid undifferentiated heavy lifting. It’s very easy to overbuild and do engineering for the sake of engineering. Especially when you work with a lot of smart infrastructure engineers like, couldn’t we do this? We could build our own database. We could build our own messaging broker. I’ve had teams who had done that. Even we could write our own operating system. There are some things that are good ideas and some things that aren’t. If you’re in a financial institution, you don’t need to write your own database. There are plenty of great database engines out there that you can leverage, at least in this day and age. You don’t need to write your own messaging broker. It’s very important to stay focused on what is it that your clients care about, and how can you build on top of that and avoid undifferentiated heavy lifting. That, again, requires vigilance. I’ll speak about a couple different points subsequently. Only build what is necessary.

An example that I’ve used is that we have made databases enterprise ready. What I mean by that is a lot of my job has been, for example, taking a Postgres engine and wrapping around it things like automatic failover and backups, the compliance-ready things that we need for those databases to operate at scale in financial services, but not rewriting the Postgres engine. Use what is there already and then wrap and build a minimum set of controls that you need around it, but no more. Again, that’s very hard when you work with a lot of smart people who get carried away with all the great things we could do.
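As a rough illustration of "wrap, don't rewrite", the sketch below puts a thin control plane (failover and backups) around an engine it treats as a black box. `Engine` is a hypothetical in-memory stand-in rather than a real Postgres client, `ManagedDatabase` is an invented name, and replication between primary and standby is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Engine:
    """Stand-in for an existing database engine instance (hypothetical)."""
    name: str
    up: bool = True
    data: Dict[str, str] = field(default_factory=dict)

    def write(self, key: str, value: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class ManagedDatabase:
    """Thin control-plane wrapper: failover and backups, nothing more."""

    def __init__(self, primary: Engine, standbys: List[Engine]) -> None:
        self.primary = primary
        self.standbys = list(standbys)
        self.backups: List[Dict[str, str]] = []

    def write(self, key: str, value: str) -> None:
        try:
            self.primary.write(key, value)
        except ConnectionError:
            # Automatic failover: promote a standby, then retry once.
            self._fail_over()
            self.primary.write(key, value)

    def _fail_over(self) -> None:
        # Promote the first standby that is up; give up if none is.
        for candidate in self.standbys:
            if candidate.up:
                self.standbys.remove(candidate)
                self.primary = candidate
                return
        raise ConnectionError("no healthy standby to promote")

    def backup(self) -> None:
        # Snapshot the primary's data; a real system would ship this elsewhere.
        self.backups.append(dict(self.primary.data))
```

Nothing in `ManagedDatabase` reimplements storage or query logic; it only adds the controls around the engine, which is the boundary the talk argues for.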

Some technology companies, and I’d love to work at some of them at times, get the freedom of saying that innovation is part of your job. In our job, our role is really to provide platforms for business software. You have to really think about what is required for that and what is not required for that, and apply what I think was Occam’s razor against it. Do truly only the necessary parts. There’s this quote: if you want to go fast, go it alone, but if you want to go far, go together. You need to build on top of others to really build what the business needs.

6. Be Opinionated

The sixth one here, the corollary to that is to be opinionated. Not only will you be tempted to build unnecessary things, but trust me, your clients are going to want everything. You need to be opinionated as a platform owner, saying, these are the things we’ll do and here are the things we won’t do. You won’t be the most popular person, but that’s actually a good thing. It’s good to say no and do fewer things well than a lot of things poorly. Again, to use my database example, do you really need 10 different SQL engines out there? Do you need MySQL, Postgres, Oracle, Sybase, MS SQL? No, you probably can focus down.

Think about what is it that your clients really need, and focus on the mainstream, the 80%. Recognize the fact that you won’t be able to please everybody, and sometimes your clients are going to adapt their software to your platform, not the other way around. You cannot adapt your platform to every client use case, and so you have to live with the fact that you can only do so much with your resources and you have to be very opinionated about it and say no.

In my experience, when I’ve said yes too much, it’s always ended in tears because I’ve ended up building more things that I couldn’t sustain over time, rather than fewer things that I could do really well. Of course, my clients end up saying, you can’t maintain it, it’s not sustainable, it’s not stable, because I spread the resources too thin across too many things. The experience has been that it’s better to say no more often than yes. You have to deliver what gives the most value for the most people. You also have to retire technical debt. You have to be willing to say, this is no longer commercial to do anymore, I’m going to kill this thing that you love and you hold dearly, but it has to go to bed. It has to go to the chop house. No client will ever enjoy that because that means they have to port, they have to rewrite their application. It’s work for them. There’s no incentive for them to say yes to that. The magic is in how you work with your client base to manage through that and be opinionated around it.

7. Be Long-Term Greedy

The seventh one, which plays into the two prior, be long-term greedy. Think about the things that you can sustain over time rather than the things that you can do immediately. I use this analogy a lot with my teams. How many have dogs and kids? I’m sure your kids don’t walk the dogs as often as you do, so when you brought that puppy into the house, it’s so cute, the kids want it, I love to have a puppy, but you’re not bringing a puppy into your house, you’re bringing a dog. You’re bringing something into your house that you have to sustain for at least 10 years, most likely even longer. The dog has to be walked, has to be fed, has to be cared for. You have to have conviction up front that I really want that dog. Yes, everyone wants a puppy, but you can’t return it or you shouldn’t return it. I know people do cruel things, but, ideally, when you bring that puppy in, you made a long commitment to sustaining it. Platforms are no different.

Once you have clients running on it, you are committed for years to sustain it. That is a contract you made with your clients. You start using my stuff, I need to support you through thick and thin. Think about, as I spoke about before, the compromises on the three S’s, the sustainability. You have to be long-term greedy. Think about, am I willing to take another dog into my house and care for it and feed it? If you don’t have conviction around that, don’t say yes. I ask my teams these questions all the time. Are you really willing to do that? Are you willing to pay for that team, let’s say 2, 3 feature teams, for the next 10 years? Are you willing to sustain that? Are you willing to do all the nitty-gritty around managing vulnerabilities and patching and so on? If not, you shouldn’t do it. You should say no. It’s, again, easy for engineers to get very enamored by the opportunities, the possibilities, how sexy it is, something new. I’ve made mistakes several times.

For example, I made an early decision on container platforms and container hosting. First, we built our own hosting platform on top of Tomcat. Then we built another hosting platform on top of Docker. Then we did Mesosphere. Now, of course, we’re running Kubernetes like everyone else. I regret a couple of those moves because I had clients on all of those and trying to get those clients off those platforms was really hard.

I let too many puppies into the house, and I should have said no to a couple of them, and said, “We can wait a bit. The industry is really moving. Let’s wait. Let’s see what matures. Let’s make a decision in six months from now or a year from now, but we don’t need to make it right now”. Despite the fact that everyone’s clamoring they want it tomorrow, why are you so late? I think you have to deal with that kind of criticism. I will also say that I’m a big believer in letting the community innovate. What I mean by the community in my case is I don’t have open customers. The customers that I have are the customers that work in the same institution I do, but often they’re off building things.

For example, I built Kafka as a service several times over, but it was built first by customer groups that said, this new thing out there called Kafka, we think it is really powerful. Then we’d see Kafka pop up here and over there and over there. Finally, we said, this must be a thing, because six different groups are building Kafka, maybe we should offer it as a service. Waiting a bit for that incubation to happen is a bit of the magic there. Don’t be too early and don’t be too late. If you’re too late, there are all these different Kafka variants that you can no longer manage and a lot of controls issues, which is not a good place to be. Making that choice too early is also very difficult. There’s a lot of art in getting that balance right.

8. Share Responsibility

Number eight, share responsibility. You and your clients have a contract. Often, it’s implicit, as in, this is what I’m going to do for you. Then the client assumes this is how they need to behave in that ecosystem. I recommend making it as explicit as possible. There’s always an implicit contract between a platform provider and a platform consumer. The more explicit it is, the better. It’s not just about the API contract or the SDK, whatever the software layer is. If you can also be explicit on the Amazon one here, which is about controls. These are controls I’ll manage. These are controls you’ll manage, around uptime, SLOs, patching cycles. The more that you can be explicit about this, the better. Because then the clients understand, the customers understand what they can expect from the platform, and behavior that is normal.

For example, you might tell customers that I reserve the right to once a quarter take down your infrastructure so I can patch it. You put it up there up front. For every database that cannot be quiesced, such as stateful databases, I’ll tell you up front that once a quarter I reserve a 24-hour patch window that I need to do my job. If you don’t tell your clients that, they’re going to fight it tooth and nail. They’ll say, “I can’t take any downtime. I can’t live with it. You never told me about this. It’s unfathomable you do this to me”. Definitely, being explicit around responsibility is incredibly important. The more explicit you are, the better.

Writing down SLOs, measuring them, talking about them. Writing down, again, not just your functionals, but your non-functionals as well, is incredibly important in this. The original Bezos quote on this is about how we build muck so you don’t have to. Yes, we build muck, and we’ll tell you what the muck is. Because, ultimately, that allows you then to really understand how to interact. The other way to think about it, as a customer, is that a platform can do many things for you, but it can’t do everything. I think of it like snowflakes versus ice cubes. Yes, a platform can vend many different shapes of ice cubes, but infinitely beautiful snowflakes are all different. Be very clear on this: these are the things we can build for you and these are the things we can’t do. Being clear about what I can and cannot do is very important in this.
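One way to see why writing the contract down matters is to put numbers on it. The sketch below is an assumption-laden illustration: the `ServiceContract` fields, the 99.9% target, and the 2,160-hour (90-day) quarter are made up for the example, with only the 24-hour quarterly patch window taken from the talk. It measures availability against the SLO while excusing planned downtime that stays inside the declared window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceContract:
    """Hypothetical explicit contract between platform and client."""
    availability_slo: float                 # e.g. 0.999 means 99.9%
    patch_window_hours_per_quarter: float   # declared up front


def measured_availability(
    contract: ServiceContract,
    hours_in_quarter: float,
    planned_down_hours: float,
    unplanned_down_hours: float,
) -> float:
    """Availability over the hours the contract actually promises.

    Planned downtime counts against the SLO only where it exceeds the
    patch window that was declared up front.
    """
    excused = min(planned_down_hours, contract.patch_window_hours_per_quarter)
    promised_hours = hours_in_quarter - excused
    counted_down = unplanned_down_hours + (planned_down_hours - excused)
    return (promised_hours - counted_down) / promised_hours


def meets_slo(
    contract: ServiceContract,
    hours_in_quarter: float,
    planned_down_hours: float,
    unplanned_down_hours: float,
) -> bool:
    availability = measured_availability(
        contract, hours_in_quarter, planned_down_hours, unplanned_down_hours
    )
    return availability >= contract.availability_slo


# Illustrative numbers: a 90-day quarter and a 24-hour declared patch window.
contract = ServiceContract(availability_slo=0.999, patch_window_hours_per_quarter=24)
quarter_hours = 90 * 24
within_budget = meets_slo(contract, quarter_hours, planned_down_hours=20, unplanned_down_hours=1)
over_budget = meets_slo(contract, quarter_hours, planned_down_hours=20, unplanned_down_hours=3)
```

With these numbers, 20 hours of planned patching plus 1 hour of unplanned downtime gives 2139/2140, which meets 99.9%, while 3 unplanned hours gives 2137/2140, which does not. If the same 20 planned hours had never been declared, they would all count as downtime and the SLO would be missed by a wide margin, which is the argument for stating the window up front.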

9. Abstract, Don’t Obfuscate

Number nine, abstract, don’t obfuscate. I’ll explain what I mean by that. First of all, every client is different. When we have thousands or tens of thousands of clients, or in cloud providers’ case, millions of clients, every client is different in terms of what they need. You have to meet them where they are. Some customers want to interact through a UI. They don’t really want to know what goes on under the hood. They just want to use a UI and that generates the underlying configuration and code for them. They’re very happy. They have no desire to know what’s under the hood. That’s me and my car, I have no desire to know what’s under the hood. I want the car to get me from A to B reliably and in a consistent manner. What actually happens on it, I couldn’t care less.

Imagine that user interface. In cloud, that’s very common too. A lot of customers are very happy being at that level of abstraction. Some want to go a level below. They want to use things like Terraform or other configuration, where they write their YAML and so on, and configure it that way. Some might want to be all the way down in the nitty-gritty of the weeds. Sometimes they want to, sometimes they need to. That’s why I talk about don’t obfuscate: just because a client is operating at a high level of abstraction doesn’t mean they can do so all the time. Let’s say something breaks. You’re in the UI and you’re leveraging that to interact with your cloud. Something goes wrong. Now you’re trying to look under the hood to see, what actually did I configure? How’s it working? How does it assemble together?

If you’re not able to get into the details because a platform obfuscated that for you, that’s a bad place to be. My experience has been, build multiple levels of abstraction but allow clients to use any of them. Allow them to be all the way in the weeds, as in literally down at the API level, and expose as much of that as possible, all the way up to, here is a CLI or a UI that you leverage to do all of this for you, but then allow them to introspect into the machinery if they choose to.
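A small sketch of "abstract, don't obfuscate": a one-line high-level call that expands into a plan the client can still read and edit before anything is applied. The function names, the size-to-replica mapping, and the API action names are all hypothetical, invented only to show the layering.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApiCall:
    """Lowest level: one explicit call against a hypothetical platform API."""
    action: str
    params: Dict[str, object]


@dataclass
class Plan:
    """Middle level: a declarative list of calls the client can read or edit."""
    calls: List[ApiCall] = field(default_factory=list)

    def describe(self) -> List[str]:
        return [f"{c.action} {sorted(c.params.items())}" for c in self.calls]


def plan_web_app(name: str, size: str = "small") -> Plan:
    """Top level: one friendly call that expands into the plan beneath it."""
    replicas = {"small": 2, "large": 6}[size]
    return Plan([
        ApiCall("create_network", {"name": f"{name}-net"}),
        ApiCall("create_instances", {"app": name, "count": replicas}),
        ApiCall("create_load_balancer", {"app": name, "port": 443}),
    ])


def apply_plan(plan: Plan) -> List[str]:
    """Pretend to execute the plan; returns the actions in the order applied."""
    return [call.action for call in plan.calls]
```

A client who only wants the high level calls `apply_plan(plan_web_app("shop"))` and never looks further. A client who is troubleshooting can call `describe()` to see what was actually configured, or change `plan.calls[1].params["count"]` directly, because the lower layers are exposed rather than hidden.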

Again, there’s a very modern version of this. When I started developing, I started coding in assembly, which was incredibly painful. Actually, sometimes I even had to write machine code, which was even more painful. Then, of course, compilers became mainstream, and I started writing mainly C code. I could still introspect my assembly and my machine code. Now, of course, there’s Copilot. I read this great article in The Times about vibe coding. A journalist created his own application on an iPhone to scan the contents of his fridge and tell him every day what nutritious meals to cook that day. He, a journalist with no engineering background whatsoever, generated the code for his iPhone, the middleware, and the part that ran in the cloud. He generated it, pushed it, and it worked. Imagine this: his level of abstraction is all the way up there.

Now, of course, he couldn’t really troubleshoot it, but he certainly has an ability to generate. These layers of abstraction have always been there in our industry. It’s not new, and it’ll always be the case. As in, we are always building new levels of abstraction, but ultimately, what does it generate? It generates machine code that runs on the system. Luckily, and I’m actually very grateful for that, I don’t have to write assembly anymore. I think any of you that have dealt with that are probably grateful for that, too. Sometimes, you need to be able to introspect. It’s no different than a mechanical engineer being able to take a look at my car and fix it for me. It’s important to think about that obfuscation. Let’s say you could never fix a car. You had to throw it out. That wouldn’t be a good level of obfuscation. Working through that is very important.

10. Stand on the Shoulders of Giants

Principle 10 is about open source, which has been an incredible boon, especially in financial services. I think of it as standing on the shoulders of giants. It’s not my quote; this is Isaac Newton, who spoke about how all of his innovation came from prior science. If he’d seen further, it was from standing on the shoulders of giants. I love that quote, because it expresses what open source has done for us. This is not unique or new. In the last 20 years, everything that we’ve been able to do in the industry I’ve been in, not absolutely everything, but a lot of it, has come through the ability to leverage open source and open standards. It has been incredibly instrumental in being able to do what we do, because what do we get out of that? Enormous mind share. We share engineering resources across the board. We get to leverage innovation that others have done. We get to stay current. Think about it, we get to see the code. Open source, of course you get to see the code.

As an example of the benefit, I mentioned earlier taking a database like Postgres and making it enterprise ready. That meant we wanted to see how Postgres worked, and we wanted to build a control plane around it. With open source, you can do all that, and it’s straightforward. We’ve never believed in open source because it’s free. There’s no such thing as a free lunch, and that’s really true. Someone ultimately has to pay for those developers that are involved in the open-source community. I’m very grateful to any of you that are in the open-source community, but you still get paid somehow. For us, it’s never been about not paying. It’s been about having this community that could collaborate on open source. When I first started in financial services, there was none of this.

Then Linux came around, and for every single financial institution, at least that I know of, Linux is core to its mission. Trading systems, banking systems, it’s all leveraging Linux very broadly. Of course, since then, Kafka, Cassandra, Kubernetes, Postgres, MySQL, Terraform, and so on, there’s a huge amount of open source that goes into this.

A lot of this, you’d say, is a mix of open source and proprietary parts wrapped around it, but nonetheless, this whole movement has been instrumental for our ability to build platforms at scale. It is not that we don’t use proprietary closed source software. Of course, we do. I think most of you probably do at times. It’s not to say that doesn’t have value. Of course, it does. I’m not trying to denigrate that. Rather, I’d say that open source itself plays a really instrumental role in being able to build. There’s also the last point, which is portability. In multi-cloud, again, take Postgres as an example: you have a Postgres engine you can run on Amazon. They have Aurora Postgres.

You can do the same in Azure. You can do the same in GCP. You could write your application, your schema, to Postgres, and you can pretty easily port between these clouds. The benefit of open source is that it’s driven these common building blocks that are broadly available everywhere. Kubernetes is an example, Kafka, and so on. That really gives a huge amount of leverage. You don’t have to rewrite your application every time you’re choosing your hosting platform. You have to rewrite a little bit, but you don’t have to rewrite everything. There’s huge value in that.
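As a minimal sketch of that portability point: if the application and schema target plain Postgres, the part that changes between clouds can be as small as the endpoint. The hostnames, the `ENDPOINTS` table, and the `postgres_dsn` helper below are hypothetical placeholders, not anything from the talk, and a real migration would still involve the "little bit" of rewriting mentioned above.

```python
from urllib.parse import quote

# Hypothetical managed-Postgres endpoints, one per cloud (placeholder hostnames).
ENDPOINTS = {
    "aws": "pg.aws.example.com",
    "azure": "pg.azure.example.com",
    "gcp": "pg.gcp.example.com",
}

# The schema is written once against Postgres and reused on every cloud.
SCHEMA = """
CREATE TABLE IF NOT EXISTS orders (
    id    BIGINT PRIMARY KEY,
    total NUMERIC(12, 2) NOT NULL
);
"""


def postgres_dsn(cloud: str, database: str, user: str) -> str:
    """Build a standard postgresql:// connection URI for the chosen cloud.

    Only the host differs between clouds; the URI scheme, port, database
    name, and the schema above stay the same.
    """
    host = ENDPOINTS[cloud]  # raises KeyError for an unknown cloud
    return f"postgresql://{quote(user)}@{host}:5432/{quote(database)}"


if __name__ == "__main__":
    for cloud in ENDPOINTS:
        print(cloud, postgres_dsn(cloud, "orders", "app"))
```

The application code that consumes the URI does not need to know which cloud it is talking to, which is the leverage the common building block provides.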

11. Build Culture, the Rest Takes Care of Itself

Then, finally, culture. I spoke about really how we build. This is how you build teams. The reason this is so important is, when you build a platform, you’re building multiple different components. You’re not all building that with a single team. It’s impossible. At scale, you have many different teams who are building different components that all need to fit together. For them to do so in a consistent manner, you have to have the right culture.

My experience has been, if you have the right culture, it drives great teams. Great teams build great products. You can get lucky. I’ve sometimes confused those things. I have been able to build, at times, great products, and thought I could repeat it, and found that that didn’t really work because the teams were not built in a way that was repeatable. Don’t confuse getting lucky with being good at this. Being good at this means that you have to focus on the culture. That’s where I spend most of my time nowadays: on culture and strategic outcomes. There are three dimensions of culture that I really care about. First of all, I’m a big believer in empowerment, empowering teams to make as many decisions as they can without having to ask anyone for permission. I described earlier how great platforms integrate. It’s freedom from choice. Here are the things where you need to innovate.

For example, you’re the Postgres team building Postgres as a service. You have freedom to do that as you see fit. Here’s where you don’t have freedom. You’re going to use observability in this way, using OTel and other parts. You will plug into the identity system in this way. They get freedom from choice. When it comes to how you build a database engine, the world is your oyster. Do it however you see fit. The empowerment is incredibly important around the right culture and how you incent people to do that.
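One way to picture "freedom from choice" is as a small set of mandated integrations that every service must declare, with everything else left to the team. This is only an illustrative sketch: the `MANDATED` table, the value `"platform-identity"`, and the `guardrail_violations` function are hypothetical names, not a description of how any real platform enforces this.

```python
# Platform-wide decisions that service teams do not get to choose.
# The values are illustrative assumptions ("OTel" is from the talk,
# "platform-identity" is a made-up name for the shared identity system).
MANDATED = {
    "observability": "opentelemetry",
    "identity": "platform-identity",
}


def guardrail_violations(spec: dict[str, str]) -> list[str]:
    """Return one message per mandated setting the spec gets wrong.

    Keys outside MANDATED are ignored on purpose: that is where the
    team is free to build however it sees fit.
    """
    return [
        f"{key} must be {required!r}, got {spec.get(key)!r}"
        for key, required in MANDATED.items()
        if spec.get(key) != required
    ]


if __name__ == "__main__":
    postgres_service = {
        "observability": "opentelemetry",
        "identity": "platform-identity",
        "storage_engine": "team-specific-choice",  # free choice, not checked
    }
    print(guardrail_violations(postgres_service))  # no violations expected
```

The design choice is that the check is deliberately narrow: it constrains only the shared integrations and says nothing about how the database engine itself is built.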

Secondly, diversity of thought. I have never had good teams that were all built from the same kind of people, from the same background with the same experience, over time. It doesn’t lend itself well to good outcomes because they don’t challenge each other sufficiently, they end up building things that are not sufficiently creative, and they don’t anticipate problems in the way teams that are more diverse of thought do. Being explicit about that is really important because it’s very easy to build monochromatic teams, as in, friends hire friends. They hire people they know, who have done the same jobs they have.

Before you know it, you now have a team of 10 Bobs, and all the Bobs think the same way, do the same things, have the same background. You end up with a team that is not as good as a team that has people from completely different backgrounds. You have to be explicit about it: explicit about how you hire, explicit about who you promote, and explicit about how you manage it. Because if you don’t, as I said, it’s very easy to fall into monochromatic teams and they do not do the best job, in my experience, at scale. That’s the second dimension.

Then, third, being intentional about the leaders. It sounds a bit cruel, but you have to be very team oriented, which means sometimes you will sacrifice individual rights for team rights, because great teams need the right dynamics, which means that someone who is great over here might not be great over there. Being very explicit about moving people around, being very explicit about what success is, and dealing with that is hard, but it’s really necessary to do this at scale. This takes months to years, not days to weeks. I’m now two years into my current job and I expect to be in it for a long time, and getting the culture right is something I wake up and focus on every single day. That is the most important thing in my job. I think for anyone who runs platform teams at scale, that is the critical part.

Key Takeaways

Those are the principles. Hopefully, I’ve given you food for thought along the way, but let’s go full circle. Platforms are pervasive in our lives. What I’ve tried to illustrate a bit is the principles of building them and how to assemble them together. To some degree, we’re all platform consumers, and to a degree, whatever we write, offer, or publish out to the world is consumed by others. Being very aware of those circumstances and thinking about that is really important. Obviously, platforms build on top of platforms. They’re a layer cake. For example, we’re using a power platform today. We’re using a platform that provides us light and power. That is a platform, a utility platform.

Right now, as you think about cloud, we build on top of cloud and other platforms and so on. Think of these as just turtles upon turtles, layers upon layers, and it keeps going down indefinitely. Great platforms are built with strong foundations, which means that you just take them for granted. They have permanency. You don’t think about them. It is a job of unsung heroes. No one ever calls me up and says, I’m so happy your cloud is working today, how amazing. No, they call me when it breaks. How on earth could you let this break? It’s no different: you don’t call Amazon or Azure and thank them for the amazing work they do, you take it for granted. That is the job: to be transparent, to be unknown. Being unsung is how you know you’re doing a really good job. Because when you’re not doing your job is when people know your name and start calling you. That’s not where you want to be.

Questions and Answers

Participant 1: You mentioned an example where you regretted moving from one tech stack to another, with Docker and Kubernetes. Hindsight is 20/20, so looking back, you can see that. What advice would you give to someone who’s trying to make that decision now, and how would you think about it?

Liste: It’s an art, not a science, trying to figure out the mechanics of how to get it perfect. I would say, first of all, be cognizant of where the industry is. If it’s early in the life cycle, like it was in our case with the container example I used, we were a bit too early. I knew the folks at Google that worked on Kubernetes at the time. I knew the folks at Twitter who were working on Mesosphere. Of course, we’d been working with containers as an underlying Linux construct for a while, and Solaris before that. I was enamored with the technology, and I knew it held a lot of potential. I was impatient. I was like, I’d love to move to this construct. I think it brings so much value. Let’s go do it.

First, we did it ourselves in a proprietary manner. Then we went to Twitter. I think that if I’d taken a step back and been a bit more aware of all the innovation that was happening in the community, I would have taken a deep breath and said, let’s wait a year and then evaluate, and we’d have been far better off. If it’s later in the life cycle, like now, it’s an easy decision. Let’s say you’re still at the stage where Mesosphere was still a really credible alternative and Kubernetes had come out from Google. Then at least it’s reached a certain degree of stability. I recommend at least waiting for that inflection point. It’s hard sometimes to know when you’re in the middle of it.

For example, with GenAI today, we are doing a lot of work with GenAI. We have not yet brought GPUs on-prem at scale because there’s just too much movement going on. I don’t know which model is going to win. Maybe they all will. I don’t know exactly the business model, how much of it will be with third parties, how much of it yourself. We have deliberately said, we’ll run all of this with third parties for now. At some point we might end up building, but it’s too early. I think it’s having that patience. As I said, it’s hard.

The challenge is, if you get it wrong, you will have clients running on that, and getting clients off is hard. No one ever wants anything taken away from them, ever. They’ll fight it tooth and nail. I’ve lived that experience many times over and had people cuss me and be incredibly angry with me, asking, how could I? Those battle scars lead me to be very careful about when I get in, and about evaluating that. I would recommend that you think about the industry, think about the inflection points, and then think about the consequence of getting it wrong and how painful it will be, and try to avoid doing that.

Participant 2: You mentioned ruthlessly killing off tech debt, but at the same time, having an evergreen platform that has a long horizon. Do you have any reflections on balancing the two?

Liste: Think about the evergreen. That’s part and parcel of the same thing, which is, if you are good at maintaining versioning and being evergreen, that by itself retires tech debt ruthlessly. I often use an OKR that I call 0114. I want zero people involved in deploying and patching. I want to be able to patch everything within a day, and to patch everything at least every 14 days. Let’s use this example of Postgres. Can you patch every version of Postgres, if you had to, in less than one day? Can you do so at least every 14-day cycle? If you think about that mindset, it means that you build the automation, you build all the scaffolding around it, so that you can do so on an ongoing basis. I think that a lot of the work is in getting teams to where they can do so as an afterthought. It’s not complicated.

The ruthlessness of it is that I think of automation as dogmatic. I’m not pragmatic with automation at all. I’m dogmatic. If you haven’t automated everything, you have work to do. Everything has to be automated. Because as soon as you get to scale, if you haven’t automated everything, you will fail, sooner or later. I’ve ended up running something that I thought was, again, a puppy, and I had 10 clients on it, then it became very successful. Now all of a sudden, I’m running 10,000 instances. I can no longer patch it manually. I haven’t written automation. I can’t keep up anymore. That spiral of death is awful. That ruthlessness has to be in, how do I make sure that when I do it, I can build all of that up front, I can put it into the system, I expect to infinitely scale, and I manage it.
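The 0114 objective can be read as three measurable checks over a fleet’s patch history. The sketch below is one possible interpretation under stated assumptions: the `PatchRun` record, its fields, and the `check_0_1_14` function are hypothetical, and treating "zero people" as "zero manual steps recorded per run" is an assumption rather than something the talk specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PatchRun:
    """One fleet-wide patch rollout (hypothetical record shape)."""
    started: datetime
    finished: datetime
    manual_steps: int  # human interventions needed during the rollout


def check_0_1_14(runs: list[PatchRun], now: datetime) -> dict[str, bool]:
    """Evaluate a patch history against the three parts of 0114.

    zero_people    : no run needed any manual step
    within_one_day : every run finished within 1 day of starting
    every_14_days  : no gap longer than 14 days between completed runs,
                     including the gap from the latest run until `now`
    """
    ordered = sorted(runs, key=lambda r: r.finished)
    zero_people = all(r.manual_steps == 0 for r in ordered)
    within_one_day = all(
        r.finished - r.started <= timedelta(days=1) for r in ordered
    )
    finish_times = [r.finished for r in ordered] + [now]
    gaps = [later - earlier for earlier, later in zip(finish_times, finish_times[1:])]
    # With no runs at all, the cadence target cannot be considered met.
    every_14_days = bool(ordered) and all(g <= timedelta(days=14) for g in gaps)
    return {
        "zero_people": zero_people,
        "within_one_day": within_one_day,
        "every_14_days": every_14_days,
    }


if __name__ == "__main__":
    history = [
        PatchRun(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 6), 0),
        PatchRun(datetime(2024, 1, 12, 0), datetime(2024, 1, 12, 5), 0),
    ]
    print(check_0_1_14(history, now=datetime(2024, 1, 20)))
```

A check like this only measures the outcome; meeting it at 10,000 instances depends on the automation and scaffolding described above.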

Then the other part about being ruthless with tech debt is, as I spoke about before, getting the platform decisions wrong means you have to take clients off something to retire it. They will hate it. They will hate you. You don’t have a choice. At that point you have to say, “You will no longer have that platform that you love because we cannot afford to sustain it anymore. There are only 10 people left using it. I’m sorry”. It’s hard. It will always be hard. Just as you’re a giver, you’re also a taker. You have to be willing to take away people’s toys. You need a thick skin.

Rettori: I want to remind everybody that in public cloud, tech debt is real debt, so just have that in mind.
