Transcript
Renato Losio: In this session today, we are going to chat about platform engineering patterns for scalable software delivery.
I would like to say a couple of words about today’s topic, platform engineering patterns. Today we won’t discuss AI directly, but as our software systems grow in complexity, engineering organizations tend to adopt platform engineering. We would like to understand why, and what it takes to architect a successful IDP, internal developer platform. My name is Renato Losio. I’m a cloud architect by day, so I’m a practitioner myself. I mostly work with AWS stuff. I’m an editor here at InfoQ.
We are joined today by four experts coming from different companies, different countries, different sectors, different backgrounds. We’ll discuss the core components of modern platforms, the role of platform teams, and how to measure success and strategies. We’ll discuss failure stories maybe, personal stories, things that hopefully all practitioners will enjoy. I would like to give the chance to each of our panelists to introduce themselves and share their journey in platform engineering.
Boyan Dimitrov: My name is Boyan. I’m the CTO of SIXT. We’re a global rent-a-car mobility provider. We are offering different services in over 100 countries, helping people move in the most convenient way for them. Before joining SIXT, I spent about 12 years in different startups, and this is when I first got in touch with platform engineering. Over at SIXT, we are now on the 10th year of our platform engineering journey. We have about 40 platform engineers who are supporting a tech team of 800 across our backend, mobile, web, data, and back-office platforms. Over the course of those 10 years, we’ve definitely made quite a few things right. We’ve had our fair share of mistakes, me personally included.
Cat Morris: I’m Cat Morris. I work at Syntasso, which is a startup based out of London. We have built Kratix, which is an open-source platform framework. All about how to do platform engineering at scale. Before that, I worked at Thoughtworks for 4 years as part of the enterprise modernization platform and cloud service line. Working really closely with platform engineering teams to figure out and scale their platform. Both of those roles as product manager. My interest in this area is platform product management and how you can use product management techniques to scale your platforms.
Martin Reynolds: I’m Martin. I’m a field CTO at Harness, where we really have a software delivery platform that helps people take out a lot of the toil and manual steps from code all the way through to production. The thing that actually brought me here, though, that brought me to Harness in the first place, and it feels a long time ago now, I was a software engineer for 20-plus years, and then put my first product in the cloud, which switched me over to DevOps because I had to operate it. Then, over the following 10 years, we took this journey of putting another 40 products in the cloud. That DevOps became platform engineering as we built out the platform for those products to be deployed at. It really changed my trajectory because a lot of what I was doing was helping engineering teams really adopt a platform and push it forwards and be able to use it easily. When I had the opportunity to come and join Harness, to just spend my entire time helping teams do that, that was perfect for me.
Mike Fielder: I’m Mike Fielder. I live in New York City in the United States. I work at the Python Software Foundation as the sole PyPI safety and security engineer, helping secure the Python package ecosystem and registry from account takeovers, malware. This is a wonderful spot in open source that I’ve been doing for the last couple of years. Over the past 30 years, I’ve been working at a variety of startups and enterprises, from 6 people around a table to 50,000-person corporations in the field of systems administration on to DevOps or platform, and have managed teams, have been a contributor on these, and just fluctuated back and forth in the last couple years. I’ve very much been enjoying working in the open-source space. I’m excited to talk about some of the experiences I’ve had in platform engineering, encouraging adoption, and maybe some of the failures.
When to Build a Platform vs. Use External Tools
Renato Losio: Speaking of adoption, I’ve probably heard 10 different definitions of platform engineering from 10 different people in my life. Still today, I’m not 100% sure that I know what I mean by platform engineering, but we’ll get to that. I’d like to start with my main challenge when I work with and discuss platform engineering: when do I know that I’m ready for, or need to build, a platform, versus just using some external tool or doing nothing? Do you have any advice?
Mike Fielder: You hit the nail on the head with the big question: what is platform engineering? The definition will change. Any one of us will probably give you a slightly different definition. Same thing with DevOps, or system administration, or any term that our industry has not codified in stone.
Generally speaking, my understanding of platform engineering and the teams that I tried to build were at a scale at which the software engineers, developers, and product owners in the company are feeling constrained by not being able to focus on parts of their pipeline delivery in order to execute the company vision. What do I mean by that? I mean, if you’re the sole developer at a company and it’s a small company and nobody is pressuring you to deliver 5,000 new features a week, you can afford to own your infrastructure and build out, I want to deploy this service on this server this way. You add another person to that team, now you have to coordinate and you have to agree upon how you do things. You have to codify that. Add another two developers to that team. You have to keep more people in synchronization. You have to keep more systems in synchronization as your footprint grows.
The platform engineer is basically another layer of abstraction between the software developer and the compute that they need to run on, or the resources that they might need to run on. It’s that old adage of, you can solve any problem in software with another layer of abstraction, except for too many layers of abstraction. I think it’s when you’ve realized that it is time to abstract some of these concerns or competencies away from people who maybe aren’t the best at it, or have no interest in it and have interest in other areas, that’s when you want a platform engineering team to manage that abstraction layer to enable those other folks.
Cat Morris: There’s another layer on top of this as well around the common capabilities in your organization. It might be that something’s very painful for that team, but they’re the only team that are going to do it ever. At that point, it’s probably not worth it to invest in a platform engineering team because then you’re just essentially making that one team larger. If you’ve got a bunch of teams that are doing something very similar and isn’t really adding to that end user customer value, that’s when platform engineering teams can really shine.
Renato Losio: What’s your experience with that? You mentioned quite a large team. When did you start to think about that?
Boyan Dimitrov: When we had a problem. We didn’t start by asking what the latest buzzword around DevOps was, finding out it was platform engineering, and saying, let’s do that. At SIXT, we serve, as I mentioned, customers in over 100 countries. That comes with a lot of different expectations, starting with how our customers want to be served and what kind of experience they’re seeking. Also, as you can imagine, 100 countries bring a lot of different compliance and governance complexities in terms of what we’re building and how we’re integrating it, and that’s all something that our engineering teams had to deal with and focus on in the past.
Plus, because they were running in a you build it, you run it mode, they also had to deal with how they deploy their applications. That was breeding fragmentation that was basically causing a lot of different technologies to be put in the mix, essentially doing the same thing, which they also had to take care of. We wanted to solve for focus, and we wanted to solve for improving, in the end, our customer experience, and focusing on our internal business. Therefore, we started investing in reducing this fragmentation, figuring out, which are basically also, as Cat mentioned, the common denominators there between all those engineering teams. That’s how we started our platform engineering journey, by solving this problem.
Then we saw that we were getting a lot of extra benefits from that direction. We started doubling our bet, increasing its footprint, moving beyond what most people mean today when they speak of platform engineering: a combination of a backend deployment platform, plus all your CI/CD bits, plus maybe some back-office frontends to manage it all. We ventured into web, into data, into mobile, and full-blown internal back office, hundreds of different internally facing back-office applications, not only technology facing, but also business user facing. So far, it seems to be paying off pretty well.
Martin Reynolds: There were some words in there that I picked up on specifically, because I think they overlap. Often, my experience has been that complexity in individual products and services is definitely a driver, because you have teams that are spending time maybe building infrastructure, deploying, that is often a driver for like, we need to reduce the complexity so they can spend time building products, not building infrastructure. I think that comes up quite often. I also think those commonalities, that when you start to see that you have more than one team doing the same thing, but all doing it differently, that is not a great use of time, because, essentially, you’re reworking. Taking almost that product style view of saying, here is a common thing that we are doing, then the outcome is a common outcome, but how do we do it in a way that is easy to consume and makes it easy for the engineers, so that they can focus more time on building products? The bigger the organization, the more those complexities show up.
Generally, people want their product teams to be building products. That’s really where you start to say, how do we take away some of those pains to help them drive a shared experience? There’s a whole other piece that I would add on the back of that. Having a platform doesn’t mean you can’t have variation inside it, but the more complexity and variation that you have at the backend, the more management you need to do. You also don’t always get the economies of scale, in terms of cost and savings, that you might get if you consolidated some of those approaches. There’s a whole thing about adoption that I hope we’ll get to later. Those are all drivers.
Often, seeing how much time is spent in those different things is a really good way to say, this is something we should probably pull out and do in a shared way. Here is another thing that we should pull out and do in a shared way, because the overall benefit outweighs any individualism that the teams might want. You want to have this freedom with guardrails approach, I think, which would be my generic approach to enabling a platform.
The Sweet Spot with Platform Engineering, Across Different Size Orgs
Renato Losio: I read an interesting article from Google that described how they do platform engineering at Google. One piece of advice they had was, yes, tailor platform engineering for each business unit and application to achieve the best outcomes. I thought, that’s pretty cool for Google. I was wondering, in my day job, working with companies at a much smaller scale, when I think about a company of 50 people, 50 developers, 50 practitioners, or whatever, where do you find the sweet spot? Boyan, for example, mentioned his case, and that’s already quite a big one. What’s your experience working with smaller companies, because I don’t want to build a tool for every developer, more or less?
Mike Fielder: I think that’s absolutely an interesting perspective, because we all read articles written by very large companies, by like Google, and say, we should do what Google does. Then we forget that we don’t have the money that Google has. We don’t have the personnel that Google has. We also don’t have the constraints or demands that Google has. I think a lot is often lost in contextualization of what these solutions are, and whom they are for, and when is the right time. When Martin talks about reducing duplication, or Cat mentions getting product outcomes to users, I think that comes back to, what incentive does this company have? What are they trying to do? If you’re trying to be in compliance, then you need a compliance team. That’s where platform engineering can very much offload a lot of concerns by saying, we’ve figured out a way to maintain compliance via software development. Do these steps in our process, and you will be in compliance.
If you avoid these steps, you are now non-compliant. That’s one concern, and that’s an incentive. A startup that maybe hasn’t reached the level of compliance that regulation requires, or is electing to defer that to a later date, may choose scale, or developer velocity, or deployment velocity instead. We often think about the cloud as this utility function. We get compute that historically I had to order from a hardware provider, wait six weeks until it got to my data center, rack and stack it, and then at the end of all of that, I would still plug it into power. That was a utility from somebody else. I didn’t have to manufacture my power. I didn’t have to manufacture my air conditioning. That was a utility function. Now the compute part is a utility function that many of us take advantage of by saying, I’d like to make an API call, and I get servers instantaneously, and I can continue to scale.
Platform engineering is another concept that could be part of this utility function. When software developers inside of a company say, I have an idea, I’d like to throw it in front of customers in order to validate whether this idea is a good one or not, or do people like it, the faster they can do that, the better, because then they can learn whether or not this was a good idea or a bad idea to people, like, can we make money? Are we reaching our incentive? Turning platform engineering into a utility function of an organization gives them a leverage of scale that an individual team doesn’t have to consider, how do I deploy a new idea, a new application in front of customers very quickly while maintaining compliance and security, and having cost management and all the other parts? I have a platform engineering team that helps me do that. That’s a lot to take on and a lot of responsibility for a platform engineering team.
I think starting at a smaller level, you look at, what are the friction points and incentives in the organization, and say, what can I save here? Where can I get leverage? Do I need a platform engineering team to also own end user authentication, so it’ll be an internal SSO service? Then federate that between services, so services and teams don’t have to invent one or have security concerns. It’s, just plug in to the thing that we already have and you will probably do fine. I think there’s a flip side of that, of folks saying, I’m bespoke. I’m too special for the platform engineering. You have a generalized solution that doesn’t work for me. I think that’s a valid statement, but one that always has to be backed up with some evidence. Just because you don’t like it doesn’t mean you get to opt out. If you do have distinct concerns or constraints that a platform engineering team is not yet able to serve, then that’s a, ok, I’m diverging from that.
Deploying Platforms Based on Cell-Based Architecture, and the Cost Factor
Renato Losio: What role will deploying a platform based on cell-based architecture play, to perhaps achieve product versus cost effectiveness? Does anyone have any advice or wants to comment?
Martin Reynolds: That cell-based architecture, I don’t think it specifically breaks from a platform, and it depends what you’re calling your platform and your landing zone. Because, my experience was housing multiple products in different verticals with different compliance requirements, things like healthcare versus education versus finance, they all had slightly different regulatory requirements, auditing requirements, and those kinds of things. However, just because you have a platform, it doesn’t mean that you can’t have those isolated type cell-based architectures. You can still have them isolated, they can still have their own CI/CD pipeline, even though it might be based on a templated pipeline.
For example, for CI/CD, you might say, here is the template to deploy to that architecture. There are some places you can expand on it, but it’s your implementation of that template. You have your commonality, but you also have your uniqueness. I think those two overlap. In terms of costs, generally when you’re doing something at scale, you get the benefits of scaled pricing. If you’re deploying lots of services in similar base architectures, whether that be Kubernetes or serverless functions or whatever that is, when you’re doing that at scale in a common way, that means that you can drive those discounts and the cost savings and everything else.
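Editor’s sketch (not from the panel): the templated pipeline with a few extension points can be illustrated in Python. The stage names, the hook names, and the `ProductPipeline` class are all hypothetical and do not correspond to any particular CI/CD product. The platform owns the ordered template, and each product supplies its own implementation by filling in the hooks.

```python
from dataclasses import dataclass, field

# Ordered template owned by the platform team (hypothetical stage names).
# Plain entries are fixed stages; entries starting with "hook:" are the
# extension points a product team is allowed to fill in.
TEMPLATE = ("build", "security-scan", "hook:pre-deploy", "deploy", "hook:post-deploy")


@dataclass
class ProductPipeline:
    """One product's implementation of the shared template."""

    product: str
    extensions: dict[str, list[str]] = field(default_factory=dict)

    def extend(self, hook: str, step: str) -> None:
        """Attach a product-specific step to one of the template's hooks."""
        if f"hook:{hook}" not in TEMPLATE:
            raise ValueError(f"template has no extension point named {hook!r}")
        self.extensions.setdefault(hook, []).append(step)

    def render(self) -> list[str]:
        """Expand the template into the ordered steps this product will run."""
        steps: list[str] = []
        for entry in TEMPLATE:
            if entry.startswith("hook:"):
                steps.extend(self.extensions.get(entry.removeprefix("hook:"), []))
            else:
                steps.append(entry)
        return steps


# A product in a regulated vertical adds its own check before deployment.
healthcare = ProductPipeline("healthcare-portal")
healthcare.extend("pre-deploy", "audit-log-check")
print(healthcare.render())
```

Because the fixed stages live in the template rather than in each product’s configuration, a team can add steps but cannot drop the shared ones, which is one way to read “commonality plus uniqueness”.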
As long as you have a really good way in your platform to be able to say, this cost belongs to this product, then you can really show those two and say, you just added this ridiculous queuing system into your thing, and it’s added this much cost into your deployed item in the platform because you decided that you wanted to have something that was huge. I think there’s a balance there, but generally a platform would give the right visibility and everything else, and still give the freedom for those cell-based architectures while being compliant with those targeted landing zones.
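Editor’s sketch (not from the panel): attributing cost to a product usually comes down to tagging every resource with its owner and summing by tag. The line items, product names, and amounts below are invented for illustration; a real platform would read them from its cloud provider’s cost export.

```python
from collections import defaultdict

# Hypothetical billing line items, each tagged with the owning product.
line_items = [
    {"product": "checkout", "resource": "kubernetes-nodes", "monthly_cost": 1200.0},
    {"product": "checkout", "resource": "message-queue", "monthly_cost": 4300.0},
    {"product": "search", "resource": "kubernetes-nodes", "monthly_cost": 900.0},
    {"product": None, "resource": "orphaned-volume", "monthly_cost": 75.0},
]


def cost_by_product(items):
    """Sum monthly cost per product; untagged spend is reported, not hidden."""
    totals = defaultdict(float)
    for item in items:
        totals[item["product"] or "untagged"] += item["monthly_cost"]
    return dict(totals)


def largest_item(items, product):
    """Return the single most expensive line item for one product."""
    owned = [item for item in items if item["product"] == product]
    return max(owned, key=lambda item: item["monthly_cost"])


print(cost_by_product(line_items))
print(largest_item(line_items, "checkout")["resource"])
```

Keeping an explicit "untagged" bucket makes gaps in tagging visible, which matters if the goal is to show a team exactly what a design decision added to their bill.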
The Common Mistakes When Starting with Platform Eng
Renato Losio: What’s one thing you would have done differently when starting with platform engineering? What’s the most common mistake you see in people starting with platform engineering?
Cat Morris: I see it all the time, and I used to do this all the time myself as well, and that was starting with the day-0 experience. By that, I mean deploying something brand new for the first time. I have never seen a platform engineering team be spun up in an organization where they are starting with nothing. There’s always a team that’s doing something already. You’re already deploying things in some way or some form. It might not be scripted or it might not have CI/CD, but you’re already starting from somewhere. How often are people actually spinning up new things? Maybe you’re a super-amazing organization that’s creating a new product every week and flinging it out there. Chances are you aren’t, though. I’ve not worked with any of those, but I work with a lot of insurance companies and financial services organizations where there’s a lot of stuff already.
Thinking about maybe a slightly different way of, how do I upgrade and maintain the things that already exist? Is there a way of getting those services onto the platform with some abstraction layer in between so that you can carry on doing what you do in a way that makes sense to you as a development team? Under the hood, you’re now more compliant and that’s owned by the platform itself. You’re treating it like a different model of how do you upgrade and move in the future. Why this has gone wrong so much for me is you invest all this money and effort into a use case that happens far more infrequently than those change cycles. It just meant that we didn’t have that ROI for the platform team as effectively as if we’d thought about, how do we upgrade? How do we patch? How do we maintain these things going forward? How do we consume more of those existing services onto the platform?
Renato Losio: Did you start from scratch as well?
Boyan Dimitrov: It’s always a blend. We did a lot of things from scratch, but we also took with us things that used to work well. I do agree that not defining the ROI at the very beginning is one of the biggest mistakes I keep seeing as I talk to folks. I would add to that, not getting top management, executive leadership involved in this mission with a clear measure of what success and failure mean. It has happened to me so many times: when I attend some of those more open roundtables, I listen to a story that usually looks something like this. In company X, we had this issue with complexity, every team doing their own thing, and that was slowing us down.
Therefore, we got this mission from a VP of engineering or the CTO to create this platform team, which solves it. Solving it would mean, as an example, we will be running 10 different message buses in this company. We’ve agreed that we’re going to build this 11th, which is going to become the one and it’s going to solve it for all. That’s the mission of the team. That mission is only known to this team alone. They start building something and there is no step two. There’s literally no step two.
Then, in our discussions, it will be, guys, how did you solve for this problem? I would usually go with, that’s the wrong way to begin with, because there is no well-defined success here. Everybody is busy. Business teams in big companies, especially, they’re not slow because people are just slow in execution. They just have a lot of things to do. Migrating off whatever they’re using, it has been working for them in the past, is also something that has to be planned and it has to lead to a bigger outcome. In other words, this needs to be part of a common company mission. We’re doing X, the ROI is clear, and we’re all on this boat and we’re doing it.
Otherwise, what tends to happen is that a well-intended platform team, or whatever you call it, ends up building just one more fragment in an already very fragmented landscape. There is zero adoption. Maybe you’ll have one or two POC teams, which were gracious enough to test your technology, but there is no strategy for how you go from 2% to 100%. That’s when a lot of those initiatives die, because within a year or two, somebody checks, what’s happening with this project? Nothing. Let’s kill it and start from scratch. I’ve seen that so many times.
Martin Reynolds: I love this. I genuinely have seen some really great examples of this. When I was listening to Cat, too, some of the things she was saying just massively resonated. I worked with a couple of organizations, including one of the largest banks in the UK, and they took a day-2 approach rather than a day-1 approach to designing and building their platform. They tried to take that product management approach too. They had the management sponsorship from above, but they also needed to get groundswell from below. They engaged with a lot of the teams, they talked about what they needed, what the outcomes they needed were, and they built the platform with them. They had these very specific targets: we’re going to get 500 of our 9,000 engineers on version 1 of the platform because we’re going to meet their outcomes. They spent lots of time doing communications with the engineers as well, “This is what we’re building. Are you excited by this? Does it meet everything you need?” Really applying those product style things to it.
Then that became, now we’re going from 500 to 2,000, now we’re going from 2,000 to 5,000. Now we’re going from 5,000 to, we want everybody on this. That didn’t mean that they had everybody deploying to exactly the same target infrastructure; they’re a big bank, and they had a lot of target infrastructures. What they had was a common platform that allowed a good audited way for them to get to production.
In a bank, it really does have to be audited and meet lots of stage gates. It really has to go through all those processes. What they did is they did it in a way that they had the sponsorship from above and they were removing the blockers, but they were also building the groundswell. They were building the adoption. They were building, how do I get my app onto here? I’ve seen that done even when I was still working personally in that side. We were literally like, ok, we’ve got like 40% of our apps that are deployed to something that looks like this architecture. How do we build some automation that says, if you can tick box 1, 2, 3, 4, 5, we can onboard you onto the platform?
That’s where we went so that we had a path for them to get there. You have to clearly articulate those benefits. We’re going to do all the security scanning that you need. We’re going to make sure you have all the auditing that you need. We’re going to make sure that you have silent deployments. Actually, we’re going to make sure you can have canary deployments or blue-green deployments. They’re an option that is available to you in that platform. You don’t have to build them, but you just have to meet these five criteria, if you’re deploying to EKS on AWS as an example. Then we just extended out those landing zones and repeated those processes.
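Editor’s sketch (not from the panel): the “tick boxes 1 to 5 and we can onboard you” automation can be modeled as a fixed list of criteria checked against each application’s declared traits. The panel does not say what the five criteria were, so the ones below are invented placeholders, as are the `App` class and the sample applications.

```python
from dataclasses import dataclass

# Hypothetical onboarding criteria for one landing zone; the actual five
# criteria are not given in the discussion.
CRITERIA = (
    "containerized",
    "deploys_to_eks",
    "has_health_endpoint",
    "config_from_environment",
    "has_named_owner",
)


@dataclass(frozen=True)
class App:
    name: str
    containerized: bool = False
    deploys_to_eks: bool = False
    has_health_endpoint: bool = False
    config_from_environment: bool = False
    has_named_owner: bool = False


def unmet_criteria(app: App) -> list[str]:
    """List the criteria this app still fails, in checklist order."""
    return [name for name in CRITERIA if not getattr(app, name)]


def can_onboard(app: App) -> bool:
    """An app is eligible only when every criterion is met."""
    return not unmet_criteria(app)


def eligible_share(apps: list[App]) -> float:
    """Fraction of the portfolio that could be onboarded today."""
    return sum(can_onboard(app) for app in apps) / len(apps)


ready = App("billing-api", True, True, True, True, True)
legacy = App("report-batch", containerized=True, has_named_owner=True)
print(can_onboard(ready), unmet_criteria(legacy), eligible_share([ready, legacy]))
```

Returning the unmet criteria rather than a bare yes or no gives each team a concrete path onto the platform, and the eligible share is the kind of number that can back an adoption target.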
Addressing Resistance to Change/Refactoring
Renato Losio: How do you address the issue that some teams remain strongly attached to past work and are resistant to rework or refactoring? That I think goes even farther than platform engineering. Any advice?
Mike Fielder: This goes into the depths of human nature and how to influence the change that you want. I am not a behavioral psychologist or anything like that, but having managed lots of teams and gone through a bunch of different experiences with change management, I think one of the biggest learnings, and this also dovetails into that previous question, what do I wish I had done differently? I think understanding why you’re doing this is key. Very often, there’s a platform engineering lead and then there are some platform engineering contributors, and the contributors are doing something for developers who are like three steps away, and may not have direct interaction or influence. Getting crisp on why it is you’re doing something, and having that written down so that everybody can get on the same page, is important so that you can then turn around, look at other individuals in the organization, and say, “This is what we’re doing. Does this not match with what you want? If this isn’t what you want, then let’s talk about what it is you want”.
To somebody who is attached to the old way, I’ve always tried to maintain that the only constant in an organization is change. If you stay still, then you will fall behind. We are going to change. It’s, how are we going to change and what are we going to change? When it comes to folks who aren’t happy or excited about change, then I start to pose the question of, what is it about the change that you don’t like? What is it that is holding you back? Are you perceiving this as harder, more work? Then maybe I need to do a better job of storytelling, showing that by adopting this practice, you no longer have to do these 10 other things that you used to have to do, and finding the value. For a team that is resistant, maybe their way is better. Maybe it is compliant. Maybe they can achieve those goals in their manner.
Then it’s not on the platform engineers anymore, it’s on that team to prove that they are compliant and within the company’s orders and regulations. Other than that, if they still are maintaining non-compliance, it’s no longer the platform engineer’s job. It’s now an HR problem, and these people are no longer doing what the company wants to do, especially if the platform engineering team does have that executive level buy-in of here’s what we’re trying to do.
Martin Reynolds: I feel like it’s the hotel and the house thing. It’s like, we in the platform engineering team, we’ve built this amazing hotel with all these different kinds of rooms that you can come stay in. We’ll take care of all the utilities and the power. You don’t get to choose to paint the walls pink or whatever it is that is your particular thing; however, you can choose not to be on the platform, but then you’re building a house and you have to do all the foundations. You have to make sure that you have all the power and light and water. You have to make sure it’s stable and it’s not going to fall down, and all of those things, and they’re your responsibility as a team. Just so we’re clear, you can do that. You’re not going to get extra people to do that, but you’re responsible for all those things, versus, come stay in our nice hotel. Yes, if you build the house, you can paint your walls pink, and that is absolutely fine if that is your choice, but you are responsible for the build and maintenance of that.
The way I’ve seen that played out in organizations, especially product-led organizations, is generally product has a certain amount of resource and people, and it’s like, you could invest three people from your team to building this house, or you could have half a person that’s responsible for making sure it can be deployed to the platform. Do you want to build more product or do you want to build your house? You have that balance, because then what happens if they don’t do that, then they’re not getting all those nice new features out as quickly as the product teams that are.
Mike Fielder: I love the hotel-house metaphor. The part of that metaphor that breaks down for me is that the person deciding between the house and the hotel actually has the ability to influence the hotel maker in this analogy: I’d love to move into your hotel, but I need the walls to be pink. Your traditional hotel is not going to take that request at face value at all. Inside your organization, because the platform team is not a third-party utility company, you do have the ability to influence and discuss with your platform engineering team how you’d like to shape the product.
Boyan Dimitrov: Maybe I can add just one other dimension to this. Collaborating on that level and discussing the colors of the walls is in some cases worth a discussion. If you do start this mission with an intent and you’re solving for a real problem, a good escalation has always helped in my experience, because there are certain levels where we could argue to death whether one likes chocolate or vanilla ice cream, and we will not settle it.
A platform engineering team will never settle that with certain other product engineering teams. That’s just a fact of life. In my experience, the fastest way to resolve it is a leadership intervention, which usually goes, in this company, for the greater good, we are going to go with vanilla ice cream and Mandalorian style. This is the way. You save months of back and forth and discussions and meetings, because in the end, often enough, there’s just a person in the room who can make that decision. There are always going to be certain flavors of expectations and options and small roadblocks.
For me, as long as we’re not talking about a fundamental departure from the experience or from what this team is fundamentally trying to do, but really about what kind of colors we’re going to pick for this room, this is not worthy of continuous discussions and months of lost time. It’s more worthy of quick escalation and hopefully resolution. If the organization is set on that mission, I think it always works pretty well. On the contrary, if it is not, then that’s one massive roadblock of many to come, in my experience.
Influencing Change from a Platform as a Product Perspective
Mike Fielder: From a platform as a product perspective, how would you deal with that? Because that feels like it’s relevant, because, generally, if you’re not treating platform as a product, then you’re probably missing a trick.
Cat Morris: When I’ve built platforms, I tend to focus on particular archetypes and particular problems that I’m dealing with. It will be certain sections of the business. If people aren’t super interested, that probably means that the problem you’re solving for isn’t super painful for them. That’s ok. It depends on what thing you’re building at that point in time, and the real ROI that you’re trying to see out of the platform, and the real change, and the benefits that you’re trying to get. Sometimes I’m ok, depending on what the mission is. I also like to remind people, and maybe this is a bit product-y and a bit less technical, of the human psychology behind it. There are lots of different types of risk takers in a business. Your early POC customers are going to be those early adopters who are happy to give everything a go and hack it together. If it breaks, they’ll give you feedback or try and fix it themselves, if you’ve got some inner sourcing model. It can be very powerful to focus on those early.
Trying to make every development team in your organization fit that profile is just going to lead you to a world of pain. Maybe they need to see more evidence that your platform is working for a particular team. We do this a lot in marketing and things like that as well: you need proof points. You need customers showing that they’ve used your product and shouting about the benefits they’ve got for someone to feel confident in it. You can’t just say like, you have to join the platform.
The benefits of why they’re going to join the platform have to be there too. If those teams are like, “Those benefits don’t matter to me. I’ve got Sarah over here who knows compliance really well, and she’s just looking after it for me. We’re compliant as a team”. Maybe that’s ok for them until you find a pain point that they’re willing to address. Or there’s a sufficient number of people, and your product is so mature that you’ve moved from the early adopters to those late adopters. That’s when you can see a lot of benefit too. Sometimes it’s just a bit of patience and a bit of time. Often, that gets forgotten in these numbers. You’re trying to be too aggressive with your targets.
When is a Platform Considered Successful?
Renato Losio: I was thinking in general, until now we have mostly agreed on everything. I was wondering if there’s any difference, like for example, how do you measure a successful platform? Do you have any specific metric that you look at, long-term, apart from adoption? How do you evaluate? Because you might have some skeptical developers. We said, you might not want to adopt it. You might prefer the wall to be green versus yellow. You might have as well the one that, yes, jumps on the bandwagon. He has to do it, so he does it, but maybe his velocity or his productivity went down. How do you actually measure the success of a platform engineering project or a platform in general?
Cat Morris: I’ll start with my big failure. On my second product management gig on a platform, we were trying to figure this out and we went through all of the NFRs or non-functional requirement types, and essentially picked two or three metrics from each of those areas of like, what would a secure platform look like? Let’s put two metrics on that. What would a scalable platform look like? Let’s put two metrics on that. It absolutely failed because none of those measures moved at all, like zero improvement in any of these. I think we had like 24 different metrics that we picked that we thought were important because the leadership couldn’t agree. They were like, this is important to me. This is important to me. It’s all important, therefore, we should measure it all. We should judge the success of the platform based on all of these measures improving, which means none of them improved. I know you said like, is it just adoption? For me, yes.
I think until you reach a certain level of platform maturity, just getting people onto the platform and just getting them to trust it and to use it and to see value in it, I think that can take you a long way. Because often those teams themselves will be thinking about those other metrics around, is it done fast enough? Am I secure enough? Am I reliable enough? If you don’t hit those things, they won’t join. It’s almost like that North Star, which is something that gets spoken a lot about in product management of like, what is that one thing that you want to see change? What is that behavior that means that you’re being successful? I think adoption is one of those key behaviors.
The only downside is that you have a limited number. What I like to remind people on platform teams is that you are not Google. You don’t have 2 billion people using your platform. You have the 2,000 engineers you have in your organization. At some point it does dry up and become less effective, and then it tends to be those boring business metrics of, is it saving us money?
Renato Losio: I’m actually happy to know I’m not the only one dealing sometimes with projects that have hundreds of KPIs under consideration. How can we keep KPIs when you have not two or three, but hundreds?
I don’t know if you had the same experience, if you measure it in any other way, or if you just measure adoption in your case? Do you have any advice in that sense?
Boyan Dimitrov: For several years, now, we are at 100%, so we don’t measure adoption. There is nothing but the platforms. The one and only metric which is evergreen is availability, because if that starts dropping for certain parts of the platform, obviously big issues.
On the other side, at that point we obviously look at the different experiences for the different parts of the platform, and we try to optimize based on the feedback that we get. Within that, we basically combine multiple approaches. For sure, certain pieces are part of our NFRs. We also go for specific dedicated surveys, which we vary, quarter to quarter, in terms of getting feedback from the different teams. What would they love to see in our developer experience improved? What would they love to see in our observability platform improved? We also mix that with deploying some of the platform engineers to spend time with our prod engineering teams to really understand and see for themselves how some of those more complex teams with more versatile environments are using the platform parts, and also try to work with them to see if there would be a hotspot of bigger optimizations down the line that we want to prioritize. It keeps going. As Cat also explained, it becomes like any other user-facing product. You’ll be looking at a number of metrics, obviously ROI for sure always there, always constant. We want to optimize more and more.
On the flip side, we also measure what is the satisfaction of the different engineering teams. Then, as I mentioned, we flavor with, let’s pick up certain pieces where we are either thinking, do we go left or do we go right? Or, let’s see what are the biggest opportunities ahead of us and see with a couple of POCs that we’ve prepared, how is the adoption of those new features? What is the early feedback? In any organization, regardless of the size, you always have people who are gravitating closer, willing to give a lot more feedback. Then, you want to keep those guys close so that you can experiment with them and work together with them to decide on the next steps. That’s more on our level. Before that, it was also for sure for us adoption, the main metric as we were scaling up.
Renato Losio: You start with adoption, and when, of course, adoption is not meaningful anymore, you start to look at other metrics.
Evaluating the Cost Benefit of Adding Changes to an Existing Platform
How do you evaluate the cost benefit of adding changes to a platform currently used by a team?
Martin Reynolds: I think it depends. Does it mean adding changes to the platform or the value of the platform that’s being used by a current team? Because I think they’re slightly different metrics. Often, it comes down to time, which ultimately comes to cost. How much time are you spending doing something that could be saved if it was moved into the platform and became more consistent, with the platform managing it rather than that team? Are you just transferring that work or is there a better way of doing it because it can be used multiple times? I think there is this evaluation of, I want to add something to the platform, but is it actually worth the investment, or is it actually just better that it’s just that one team that uses it?
Then they can just keep doing that because there isn’t that value proposition of adding it into the platform. It depends, because if you’re lucky to have 100% of people on your platform, which I’ve never personally got to 100% on the platform, so I’m a little envious when I heard that. I think it depends either way, because there’s the cost of the infrastructure, whatever, but often that’s not going to change dramatically. It’s normally down to, how much time are you going to save by doing that? Are you just moving the time somewhere else or is there an actual true time saving by moving or consolidating those things?
The Choice of Solution Stacks, and Power Dynamics between Platform vs. Software Eng
Renato Losio: I think Boyan mentioned the topic of when you have, at some point, to decide between the blue and yellow wall, or whatever, you have someone to take the decision, so you escalate. I see now a similar question about power dynamics between platform engineering versus software engineering when it comes to the choice of the solution stacks. Do you have any advice there?
Mike Fielder: I think there is absolutely a power dynamic, and that often has to do with how your company’s leadership has set up the inter-team communication culture. Sometimes for the first platform engineering hire, somebody will be hired in and executives might make a big fuss, we’ve got somebody from platform engineering, they’re going to save us, and that puts them up on a pedestal and may start to coerce other software engineering leaders into following whatever they say is true.
Now, again, they might be doing the right thing. I think in every situation it is a collaboration, and understanding what is needed for this team to succeed is important for both the software engineering side and the platform engineering side. Because if the software engineering side of the conversation isn’t represented in the platform engineering’s roadmap, designs, execution, then there will continue to be conflict and people will be unhappy and nobody will get along, and ultimately the business won’t do as well as if those roadblocks weren’t there.
Going back to Boyan’s point, having a clear escalation path, or, even earlier, having the platform engineering charter and purpose and purview very much carved out and saying, here’s what you get to decide on, here’s what you don’t get to decide on and need to escalate or collaborate, clarifies everything for everyone and reduces those contention points. We’ll find other contention points because we’re human, but at least those will be sanded away so that they aren’t as sharp and teams can focus on expressing their needs, and why they need those things, to each other, so that they can both execute to better outcomes. There is a power dynamic and there always will be. I think it’s, how do you navigate that and express that as leaders in an organization?
Conflict Between Teams
Renato Losio: Have any of you actually experienced that, not power dynamic, but maybe conflict between teams, or does anyone have horror stories to share?
Cat Morris: I’ve actually seen more conflicts between the teams that are using the platform about what that experience should look like rather than between the platform engineering team itself and one particular team. Often, you’ll have two, typically very influential, it’ll be like the most experienced or the most tenured engineers in the organization that are fighting head-to-head and expect the platform to behave a different way. That can be very painful. The tool choice is a good one, like, are we going to be MySQL or are we going to be something else? How we dealt with that was we just had different abstractions for different teams.
Some teams care a lot about exactly what fields should be provided and how big it should be and how it connects to things as a database, or any sort of thing that you want to provide as a service. Other ones are like, I just want a lot of space to store a thing in, I don’t really care about it.
Another might be like, I care that it’s NoSQL because I’m very important. You can provide those different abstractions and provide those different services on your platform based on priority in your team. Maybe they look the same and maybe you’re a little bit nefarious and they’re exactly the same under the hood, but the experience that the customer gets at the other end is fine. You’ve got to think about clever ways of dealing with that conflict as the platform. The fights between what platform wants and what the software team wants, I haven’t really experienced. Often, software teams are like, “We get it. You’re the platform team. You’re super overwhelmed. We’ll carry on doing what we’re doing already. You’re just adding a benefit to our lives”. They’re quite kind.
Low-Hanging Actionable Insights
Renato Losio: I’m thinking from the point of view of a practitioner, of a software architect who is convinced by the idea and says, I want to learn more on the topic or play more with platform engineering. Maybe he’s lucky that everything works and he has 100% adoption in his company and nothing to learn, or he may be on the total opposite side. I was wondering if there’s something as a practitioner you can do in just a few hours, like let’s say half a day, 4 hours, whatever, like in terms of reading a book, say a podcast, watch a series, try something, build something. What would be your advice? I would like each one of you to give some final advice of what you can build.
Mike Fielder: This might be the most popular, but go talk to your co-workers and find out what pain points they have and what is slowing them down from execution, and see if any of their stories are the same that you have. Then, if there is alignment, if there is overlap, spend a little time thinking about how you might solve that, how you might smooth that out for the two of you. Because you might both be working on different parts of code bases, you might work on different application stacks, but there may be some commonality. If you only have an hour, talk to your co-workers, find out what their pain points are, write them down and talk to more co-workers, and eventually you may have some good ideas of material platform level improvements that you could do if you don’t have a platform engineering team yet.
Martin Reynolds: I like the, find a commonality with a different team. That’s a great approach. I think sometimes, if you really want to get in there, just spend some time to understand what platform engineering is, because it is many things, but there are commonalities. It’s shared infrastructure, common ways of building that, understanding the difference around the CI/CD pipeline tools that you might use.
There’s a whole raft of things under the covers that you can actually get a fairly quick view of to understand how all those things plug together to help make a platform. It’s like saying, I want to know what all the Lego bricks are that make up a platform. I’m going to go and have a look at some of the Lego bricks. I don’t think platform engineering is something you can build in a few hours, but I think you can start to understand the Lego bricks, and say, infrastructure as code, maybe internal developer portals, maybe CI/CD tools, maybe shared services like authorization or whatever, these are the Lego bricks. I’m really interested in this Lego brick. I’m going to schedule some time later to look at that one.
Then schedule some time to look at a different Lego brick, because I think it makes the conversations easier the other way then. When you want to engage with a platform engineering team, you have at least some understanding of those Lego bricks and the conversation becomes easier.
Boyan Dimitrov: The best thing a practitioner can do, especially if he’s new to the field and wants to learn more, is go outside and ask people with experience who have been through a lot. So many companies have done so much in that space by now: learn from their failures, learn from their successes. I think that’s years distilled in a one-hour lunch break, which could propel you and your team much faster than trying and failing on your own, and trying to reinvent the hot water. If I have limited time, that’s what I would do.
Cat Morris: I will build on those with maybe some practical locations for those areas. One source of great info is all of the PlatformCon recordings. There are loads of talks on there on lots of different topics, worth a look. If you’re on Slack, the Cloud Native Computing Foundation have a really good Slack channel with platform engineering as one of their sub-areas, and they have coffee meetups that you can sign up to. They’re very chatty about all of the things that they’re interested in in platform engineering, and it’s not just AI. Also, that group has published a white paper on platform maturity called the Platform Maturity Model, which is really good at plotting where you are on that journey, and they’ve got descriptions of each of those levels and what it looks like.
Then it can help you figure out as an organization, where are we on the platform adoption journey? Is it very early days? Is there actually a platform already existing in the organization that I’ve just not realized or called by that name before, and where might I want to go? There’s a whole trove of information in there. You’ll probably find people, like Boyan suggested, that have done this before and are very willing to talk about the successes, the failures, what’s worked and what hasn’t.
What’s the Ideal Adoption Target?
Renato Losio: Do we always have to aim for 100% adoption, as we mentioned before, or is it ok not to have it?
Martin Reynolds: It’s fine not to have 100%. I think it depends on the organization and what the overall goal is. As you’re maturing towards an overall platform, I think sometimes there are just valid business reasons or valid technical reasons why either the platform isn’t ready for that yet, or that outcome can’t be achieved, and that is absolutely fine. I think both the engineering team or the product team and the platform team need to be aware, they both need to understand, and they both need to know where their responsibilities lie.
Don’t make that decision in isolation, do it with the platform team. Make sure it’s understood by leadership too. I think if you get all those things in place, then, yes, it is absolutely fine. Because there are always going to be things that maybe the platform isn’t ready for yet, or the platform will never do, because it just doesn’t make any sense, and for the life of the product, it will be the product’s responsibility to maintain that. I think, yes, it is ok, but the visibility of the decision, and having that decision logged, and reviewing that decision regularly, they’re the key things to making that successful, so that it doesn’t become like a fighting point or a friction point down the line.