Transcript
Grzesik: The core question that we wanted to start with is, how is it that some organizations can actually deliver software and do it quite well and do it consistently, and some can’t? We will not explore the can’t part. We will explore the ones that can, how that is possible, and what shared experiences Wojtek and I have from the places we worked at. The answer is usually not something on the surface, so it’s not trivial, which you probably know. One thing that we found to be at the core is something that comes down to culture.
The Invisible Engine of Success: Culture (Westrum’s Model)
Ptak: Speaking of culture, we really wanted to talk to you about one of the concepts that fits the topic well: Westrum’s organizational culture model. It’s well ingrained into DevOps as well. If you look for DORA and Westrum, you will find really good reading materials on the DORA website regarding the Westrum culture model. We wanted to touch briefly on it. You can think about which kind of culture you are in. Westrum spent his scientific career working in organizations that really work well, and that’s also how we know about his work in the DevOps realm. The model consists of three types of organizations. The first is pathological. These organizations have several features that tell you that they really are pathological.
One of them would be that, for instance, cooperation is really low between different teams, different departments. We know Conway’s Law, of course. Messengers, meaning the people, whistleblowers, they will be shot on sight. Meaning, of course, we will reject them, and so on. This is the type of organization that is most likely to die in the current world because, of course, collaboration between teams is very low. They do not innovate very well. The second type of organization that Westrum described was bureaucratic. This is the type of organization which is really rule oriented. We have rules and we stick to the rules, process heavy, bureaucratic heavy. Looking at the same features, you can guess what it means. Collaboration and cooperation will be quite modest. Messengers, they will be neglected.
Probably, if the rules approve, and they play by the rules, it’s fine, but they will be most likely neglected. Very important, of course, if there is an incident, we look for justice, usually, in this type of organizations.
Grzesik: Responsibilities are narrowing, so the scope that people take onto themselves narrows further and further, until eventually it ends up in the previous type, where nobody is responsible for anything. Here there is still some responsibility, but not too much.
Ptak: Then there is the third type, generative organizations, the ones that we really wanted to look into to see why they’re successful. These are the ones that you really want to work for. Very high collaboration and cooperation. Messengers: we train people to be whistleblowers. We train people to look for opportunities to learn. We look at every failure as an opportunity for learning. We really train people to do it.
Grzesik: Organizations seek signals wherever they appear, and they want it. They do it consciously because they absolutely want to be aware of what’s happening, and they want to make decisions based on that.
Background
My name is Andrzej Grzesik. I’ve been head of engineering, principal engineer. I now build distributed systems at a company called LMAX. It’s an exchange that does the high-performance Java thing, nanoseconds-counting Java. I like building systems. I’m proud to be a Java Champion. I also run a conference and a Java User Group, and I speak at conferences. I like technology for some strange reasons.
Ptak: I’m Wojtek. I’m former CTO. I also had my own company. Now I’m an engineering executive working with Revolut for 2 years, almost exactly. I’m responsible for Revolut Business. If you use Revolut Business, I’m the guy to talk to about the bugs, if you have any, of course. I’m also co-host of a community initiative called CTO Morning Coffee, where we really want to train the next generation of engineering leaders.
Revolut has a family of products. You probably know retail. It’s very popular in England. As far as I know, we’re number one in England. Business is also, as far as I remember, the number one B2B solution. We also have an app called Trader coming very soon in a reshaped form. That will also be a separate application, and there is Revolut for juniors. We’re definitely experiencing hyper growth.
Two years with Revolut, we grew two and a half times since I joined. We have over 40 million retail customers. In the business itself, for instance, that’s almost 70% now, year-over-year, and really accelerating. That’s important for me, because that sets the context that I’m working in a company that really grows really fast and it’s actually accelerating the growth. At the same time, it’s a very rapid product development. Usually, teams will have at least several deploys per day to production. Lead time for changes, one of the DORA metrics, is usually way less than three hours.
Grzesik: When I joined Revolut, I was head of backend engineering. Backend was 120 people; when I left it was 400, in under 3 years. That’s quite a growth. I’m quite proud of the things that were built there. As for all the examples about Revolut, that is what Wojtek is going to speak about.
How to Measure Culture
Ptak: Coming back to the Westrum model, there is a first practical thing that we want to give you so that you can recognize the type of organization: how to measure the culture. Ask your team to rate a set of statements on a scale from strongly disagree, through neither agree nor disagree, to strongly agree. Ask the following questions.
Grzesik: If you’re a leader, run a survey across all of your teams in all departments, and you’ll get signals. Take those statements, for example that information is actively sought, and see how people rate them from 1 to 5, 1 to 7, whatever scale you want, something that gives you a range. Then, if you notice good spots, good. If you notice bad spots, maybe teams, maybe departments, maybe some areas, you will have places to begin.
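As a rough illustration of the survey idea above, the sketch below averages agreement scores per team and statement and lists the pairs that fall below a cut-off. The team names, the sample scores, and the threshold of 3.0 are all invented for the example; only the statement "information is actively sought" comes from the talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical responses: (team, statement, score on a 1-5 agreement scale).
responses = [
    ("payments", "information is actively sought", 5),
    ("payments", "failure causes inquiry", 4),
    ("onboarding", "information is actively sought", 2),
    ("onboarding", "failure causes inquiry", 3),
    ("onboarding", "information is actively sought", 1),
]

def team_averages(rows):
    """Average score per (team, statement) pair."""
    buckets = defaultdict(list)
    for team, statement, score in rows:
        buckets[(team, statement)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}

def weak_spots(averages, threshold=3.0):
    """Pairs whose average falls below the (arbitrary) threshold."""
    return sorted(key for key, avg in averages.items() if avg < threshold)

averages = team_averages(responses)
print(weak_spots(averages))
```

The weak spots are only a starting point for a conversation, not a verdict on a team.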
Ptak: Hopefully you recognize Gene Kim, one of the people who really started the DevOps revolution. He has a podcast, which we definitely recommend. There are two episodes with Dr. Westrum.
Conway’s Law and Scaling Architecture
With companies like that, let’s talk about Conway’s Law and scaling the architecture. As we discussed, we are at the hyper growth scale. How to scale? Usually what would happen is you see something like this. What do you see?
Grzesik: A famous picture of teams, complexity, services, whatever you call it is there. Something that’s not immediately visible is the amount of connections, and how do you get from one far end to another? That’s actually a problem that organizations, as they scale up, get into. That’s something that I like to call problem of knowledge discovery. How do we know what we know?
How do we know who knows what? How do we know who are the good people to ask a question about the service, about how the code is organized? Who are the people to get approval from? All of those questions. What services do we have? How reliable are they?
Ptak: If I have an incident, why do I have it? How many services that I was dependent on are in the chain? Between the database and my endpoint, how many services are truly there?
Grzesik: If you have a payment flow that needs to be processed, which services are on the critical path, and so on. For the backend services, there was even a talk about Spotify’s Backstage at QCon. Backstage, if you haven’t been there, it looks like that. It’s a catalog of services which has a plug-in architecture, which gives an information radiation solution to the problem. That’s very nice and awesome because it allows services to be discovered, so people can know what services there are in the organizations. What’s the description? What’s the role? What’s the responsibility? What’s the materiality? Which means, what happens if it goes down? How important is it to business operation? What are the SLOs, SLAs? Aspirational and contractual expectations of quality? How do we monitor? How do we get there? Upstream, downstream dependencies.
Basically, what’s the shape of the system? If you have more than a certain number of services, you want that, because otherwise it’s hard to fish out from the code. Anything else that’s relevant. Backstage solves it for many people. Backstage has plugins, but not everybody uses Backstage. What does Revolut use?
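To make the catalog questions above concrete, here is a minimal sketch of what one catalog entry could hold. It is not the Backstage descriptor format and not Revolut's Tower; the field names, the materiality scale, and the sample service are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner_team: str
    description: str
    materiality: str          # hypothetical scale: "critical", "high", "low"
    slo_availability: float   # aspirational target, e.g. 0.999
    upstream: list[str] = field(default_factory=list)    # services this one calls
    downstream: list[str] = field(default_factory=list)  # services that call this one

def missing_fields(entry):
    """Names of descriptive fields that were left empty."""
    required = ("owner_team", "description", "materiality")
    return [name for name in required if not getattr(entry, name)]

entry = CatalogEntry(
    name="payment-flow",
    owner_team="payments",
    description="",
    materiality="critical",
    slo_availability=0.999,
    upstream=["ledger-state"],
)
print(missing_fields(entry))
```

A check like `missing_fields` is one way a catalog could nudge teams to keep the information radiated rather than implicit.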
Ptak: Revolut has its own solution, and I’m going to talk about a couple of points which are important. It’s called Tower. It gives us technology governance, so everything is described in the code. It’s trackable. It’s fully auditable. It’s fully shareable. It looks like that, a nice interface. I can go there, look for any service, any database, pretty much any component in the infrastructure, and get all of the details which we discussed, including the Slack handles of the teams, the Confluence pages with the documentation, SLOs, SLAs, logs, CI/CD pipelines. I know everything. Even if we have this massive number of services, I know exactly where to look for the information. Regarding the dependencies, here it is.
For instance, I can get all of the dependencies. We’re also working on an extension which will allow us to understand the event-based dependencies as well, so the asynchronous ones. That’s a big problem in a large distributed system. That’s actually coming very soon. As a leader of the team, I can also understand all of the problems that I have through several scorecards. We can also define our own scorecards, so I can, for instance, ensure that I have full visibility. What teams? How do they work? How do they actually maintain the services?
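The question of how many services sit in the chain behind an endpoint can be answered mechanically once direct dependencies are recorded. The sketch below walks a call graph breadth-first; the service names and the graph itself are invented, and it says nothing about how Tower computes this.

```python
from collections import deque

# Hypothetical call graph: service -> services it calls directly.
calls = {
    "business-api": ["payment-flow"],
    "payment-flow": ["ledger-state", "limits-state"],
    "ledger-state": ["postgres-ledger"],
    "limits-state": [],
    "postgres-ledger": [],
}

def transitive_dependencies(graph, start):
    """Every service reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

print(transitive_dependencies(calls, "business-api"))
```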
Systems Thinking
Coming back to our example, what else do you see?
Grzesik: We have a beautiful picture and we have a system, but as we build, as we have more services or we have more components, we have a system which is complex, because the business reality that we’re dealing with is complex. Now that we’ve introduced more moving parts, more connections, we’ve made a complex system even more complicated. Then, how can we deal with that? There is a tool that we all very much agree that is a way to go forward with that, and that tool is systems thinking.
Ptak: Systems thinking is a helpful model for understanding the whole system that we’re talking about, for instance, a FinTech bank solution. Complexity, as Andrzej said, can come from, for instance, compliance, [inaudible 00:13:47]. Complication is something we’re inviting. There are two important definitions that I really wanted to touch on. In systems thinking, one of them is randomness. Randomness of the system means that we cannot really predict it.
Grzesik: It’s things beyond our control. Things that will manifest themselves in a different random way that we have to deal with, because they are part of our team.
Ptak: They’re unpredictable. Or we see that as noise in the data. As I said, there is complexity, which is there by design. For instance, onboarding: in the business we’re present in over 30 countries. Onboarding any business is complex by definition. We cannot simplify it. It’s complex because, for instance, you need to support all of the jurisdictions. You need to make sure that you’re compliant with all of the rules. That is very well described in several books. The one that we’re using for this example is by Weinberg. Weinberg is a super prolific author, so a lot of books. This one comes from “An Introduction to General Systems Thinking”. He proposed a model where there are three types of complexity in our systems.
Grzesik: The very first one, the easiest one, is organized simplicity. That’s the low randomness, low complexity realm of well understood things. It can be things that we conquer with CRUDs. It can be things that we conquer with a known stack, known services, problems that we know how to solve. They are business as usual. They are trivial. There is nothing magical there. There shouldn’t be anything magical there. If we keep the number of them low, and we keep them at bay, they are not going to complicate our lives too much.
Ptak: If you make things more complicated, as you can see on the axis, so if you introduce randomness, you will get to the realm of unorganized complexity. You will get high randomness. If you have many moving parts, and each of them introduces some randomness, they sum up, sometimes even multiply. The problem is that the system actually gets really unorganized. Our job is to make sure that we get to the realm of organized complexity.
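One simple way to see how per-part randomness multiplies is a serial chain: if each part behaves as expected with some probability, and the parts are assumed to be independent, the chance that the whole chain behaves is the product. The figure 0.999 and the chain lengths below are made-up numbers, and independence is a strong simplifying assumption.

```python
def chain_success(per_component, components):
    """Probability that every part of a serial chain behaves, assuming independence."""
    return per_component ** components

# The same per-part reliability gets noticeably worse as the chain grows.
for n in (1, 10, 50):
    print(n, round(chain_success(0.999, n), 4))
```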
Grzesik: Which is where our system organizes. Business flows are going to use this technology in a creative way to solve a problem. That’s what we do when we build systems, not only technical, but in the process and people and interactions and customer support sense of things, so that a company can operate and people can use it, and everybody is happy.
How Do We Introduce Randomness into Our Systems?
There is a problem, as the system grows, it’s going to have a more broad surface area, and that’s normal because it’s bigger. Which means there’s going to be randomness that is going to happen there, and then there is some randomness that people want to introduce, like having multiple stacks for every single service.
Ptak: Can I have another database?
Grzesik: Can I put yet another approach to solving the problem that we have, because I like the technology for it?
Ptak: Can I get another cloud provider? You know where it’s going. How do we actually introduce that randomness into our systems? How do we make our system complicated and therefore slower? Because you need to manage the randomness. From our perspective, as we discussed, we see three really important sources of randomness, where you invite the problems into your organization. The first is the number of frameworks and tools that you have. If you allow each team to have their own stack, the randomness and the complication of the overall system, all of the dots that you can see connected, go through the roof.
Grzesik: Then you have problems like, there is a team that’s used to Java that has to read a Kotlin service, maybe they will be ok. Then they have to look at a Rust service and a Go service, and then, how do I even compile it? What do I need to run it? That gets complex. If there is a database whose workings I know, I use a mental model for consistency and scaling. Then somebody used something completely different. It becomes complex. Then there is an API that always speaks REST. Somebody puts a different style of API in there, and then you have to model that too. There is that complexity, which is sometimes not really life changing, but it just adds up.
Ptak: Another thing is differences in processes. A lot of organizations will understand agile as, let people choose whatever they want to do, make sure that they deliver. A lot of people will actually have their own processes. The more processes, the more different they are, the bigger problem we have, the bigger randomness. Same with the skills.
Grzesik: Same with skills. Both of those areas mean that the answer to, how do we solve a problem, or how do we reach a solution to a problem in our area, will differ across teams. That means that it’s harder to transfer learnings, and that means that you have to find two answers in the organization rather than one, and then apply the pattern in every single place. If you have a team that follows TDD, you know that you’re going to get tests. If there is a team that would like to do testing differently, then the quality of tests might differ across solutions. What we are advocating, what we have experience of not doing, is automating everything.
Ptak: You start to take care of the things that are not important to your business. You start to use the energy of your teams not to build stuff, not to build your products, not to scale, but to solve the problems that are really not important to your business. As somebody said, actually, we should be focusing on the right things.
What’s the Revolut approach? Simplified architecture by design. We try to really reduce the randomness. Simplicity standards are enforced, so you cannot just use whatever you want. We enforce a certain set of technologies. I’ll touch on it. It’s enforced by our tooling. We really optimize for a very short feedback loop. First, to talk about architecture. It’s designed, supported by the infrastructure, and by our architecture framework, service topology. Every service has its own predefined type. Every definition contains how it should behave, how it should be exposed, what it should be integrated with, and if and how it integrates with the database, and so on. The important ones are: frontend service, so the resource definitions.
Flow service, it’s the business logic orchestration. State service, which is a domain model. That gives us the comfort that we know exactly what to expect when you open a service. You know exactly what will be there, how it will be structured, and so on. Revolut’s architecture is super simple in this way. It’s really simple. It’s to the level where it’s really vanilla Java services with our own internal DDD framework, deployed on Kubernetes. That’s all. We use some networking services from Google. Persistence layer, that’s interesting. Postgres is used as the database, as the event store and event stream. We have our own in-house solution. Why? Why not use Kafka and so on, Pub/Sub? Because we know exactly how to deal with databases.
We know exactly how to monitor them, how to scale them. If you introduce a very important component to banking, such as the technology that is not exactly built for these purposes, you introduce the randomness, and you will need to build the workarounds around that randomness. Data analytics, of course, there is a set of the features. Here is an example, coming back to my screenshot. These things are enforced.
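The talk does not describe how the in-house event store works, so the following is only a generic sketch of the idea of using a relational table as an append-only event log that consumers read by position. It uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; the table layout, function names, and stream names are assumptions, not Revolut's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # increasing position in the log
    " stream TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)

def append(stream, payload):
    """Append an event and return its position in the log."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO events (stream, payload) VALUES (?, ?)", (stream, payload)
        )
    return cur.lastrowid

def read_after(position, limit=100):
    """Events with an id greater than `position`, oldest first."""
    return conn.execute(
        "SELECT id, stream, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (position, limit),
    ).fetchall()

append("account-1", "opened")
append("account-1", "funded")

last_seen = 0
for event_id, stream, payload in read_after(last_seen):
    last_seen = event_id  # a real consumer would persist this checkpoint
print(last_seen)
```

A production setup on Postgres would have to deal with things this sketch ignores, such as concurrent writers whose transactions commit out of id order.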
Grzesik: A stack of Java service, CI/CD has, what is it? Is it a template?
Ptak: It’s a definition, and we know exactly what to expect. The whole CI/CD and monitoring will be preset. When you define your service, everything will be preset for you. You don’t need to worry about things that should not be your problem. You’re not solving a business problem when you’re worrying about, for instance, CI/CD or monitoring. You need to focus on the business logic. That’s what we optimize for our engineers.
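A typed service definition of the kind described above could be sketched as follows. The three service types come from the talk; the rule that only state services may own a database, the preset names, and the validation function are hypothetical and only show how tooling might enforce such a topology.

```python
from dataclasses import dataclass
from enum import Enum

class ServiceType(Enum):
    FRONTEND = "frontend"  # resource definitions exposed to clients
    FLOW = "flow"          # business logic orchestration
    STATE = "state"        # domain model

# Hypothetical rule: which service types may own a database.
MAY_OWN_DATABASE = {
    ServiceType.FRONTEND: False,
    ServiceType.FLOW: False,
    ServiceType.STATE: True,
}

@dataclass
class ServiceDefinition:
    name: str
    type: ServiceType
    wants_database: bool = False

def validate(definition):
    """Return a list of rule violations for a service definition."""
    problems = []
    if definition.wants_database and not MAY_OWN_DATABASE[definition.type]:
        problems.append(f"{definition.type.value} services may not own a database")
    return problems

def presets(definition):
    """Defaults a service gets without any team configuration (names invented)."""
    return {"pipeline": "standard-java-service", "dashboard": f"{definition.name}-default"}

flow = ServiceDefinition("payment-flow", ServiceType.FLOW, wants_database=True)
print(validate(flow))
print(presets(flow))
```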
Heuristics of Trouble
Grzesik: We have information being radiated. We have things that are templated. We have a simple architecture. Then, still, how do you do it well? How can you answer that question?
Ptak: We wanted to go through heuristics of trouble. We really want to ask you to see how painful it is for you. We have some examples that you probably can hear in your teams. The first one would be.
Grzesik: Why do people commit to our code base? Have you ever heard it?
Ptak: That will be sign of blurred boundaries. The problem of no clear ownership, conflicting architecture drivers that lead to I don’t care solutions.
Grzesik: That’s this randomness that we mentioned before. It’s ok for people to commit to other services. Absolutely, it’s a model that the company I work at uses. I think your place also uses that. The thing is, somebody should be responsible. Somebody should review it. That’s the gap here. If somebody commits without thinking, that’s going to be strange.
Ptak: Another thing that you can hear in the teams. I wonder, when did you hear it? It’s them, whenever something happens.
Grzesik: “It’s them. This incident is not ours. They’ve added this. They should fix it”. If you connect it with Westrum’s bureaucratic model, this is exactly how it manifests in a place. In the grand scheme of things, if everybody works in the same company and everybody wants shared success of the company, this is not the right attitude. How do you notice? By those comments, in Slack, maybe in conversations, maybe by the water cooler, if you still go to the office.
Ptak: It’s a lack of ownership. We see the blame culture, fear of innovation, and actually good people will quit. The problem is that the others will stay. That’s a big problem for the organization. Really, the same, of course, deployable modules and teams. That’s also important to understand regarding the ownership. Another thing that you may hear: let me fix it.
Grzesik: There is a person, or maybe a team, that they’re amazing because they fix all the problems. They are so engaged. They run on adrenaline. They almost maybe never sleep, which fixes the problem. You’ve met them, probably. The problem that they generate is they create knowledge silos, because they know how things work, nobody else does. They also reduce ownership because, if we break something, somebody else is going to fix that. That’s not great. Because of how intensively they work, they risk burnout, which is a human thing, and it happens. Then somebody can operate at this pace and at this level for maybe a couple of months, maybe a couple of years. Eventually something happens, and they are no longer there. Maybe they decided to go on holidays in Hawaii for a couple of months: this happens, a sabbatical.
Ptak: God bless you when you have the incident.
Grzesik: Then you have a problem, and then what do we do? We don’t know, because that person or that team is the one that knows.
Ptak: Very connected. You have an incident, and you hear, contact them on Slack. The problem is, you have components which are owned by a central team, and only the central team has the knowledge, which they keep to themselves. It’s always like a hero’s guild. They will be forcing their own perspective. They will actually be reducing the accountability and ownership of the teams. They will actually be the bottleneck in your organization.
Another very famous one: you have a bug or an incident, you contact the team and you hear, create a Jira ticket. That’s painful for all of you. That’s a good sign of siloed teams and conflicting priorities. It means that there is very low collaboration. We don’t plan together. We don’t understand each other’s priorities. We very often duplicate efforts. How many times have you seen in your organizations, they won’t be able to build it, we need it, so we’ll build our own, or we’ll use our own, whatever. There goes the randomness. Another one that you may hear is that you ask the team to deliver something and they say, I need to actually build a workaround in our framework, because we’re not allowed to do it with our technology.
Grzesik: The problem that we have is not, how do you sort a list, but how do you sort a list using this technology, that language, using that version of the library, because it’s restricting on this database?
Ptak: That’s actually when technology becomes a constraint. It’s a very good sign that the randomness is really high. You are constrained by the technology choices that you made. There are probably too many moving parts.
Grzesik: Another aspect might be how it manifests. People will say, I’m an engineer. I want some excitement in my life. I’m going to learn another library, learn another language. The purpose of the organization is to build software well, and you can challenge that perspective. People can be proud of how well, and not least without bugs, they execute software delivery, but it requires work on the team. This is a very good signal. Another one is hammer operators. If you’ve met people who will solve every single problem using the same framework and the same tool, any technology that they are very fond of, even if it doesn’t fit, or even if they chose the technologies before knowing what the problem really is, that’s a sign of a constraint being built and being implemented.
Practical Tips for Increasing Collaboration and Ownership
Ptak: Actually, there is some good news for you, so we’re not made to suffer. Practical tips from the trenches. How to increase collaboration and ownership.
Grzesik: We know all the bad signals, or what things we can look for, so that we know that something is slightly wrong, or maybe there is a problem brewing in the organization. The problem is, it’s not going to manifest itself immediately. It’s going to manifest itself in maybe a year, maybe two years down the line. Some people will have gone. Some people will have moved on. We will have a place that slows down and cannot deliver, maybe introduces bugs. Nobody wants that. How do we prevent it?
Ptak: First one is, make sure that you form teams around boundaries. For instance, in Revolut, every team is a product team with their own responsibilities, very clear ownership, and most likely, with a service they own. I would recommend, if you know DDD, to go into strategic patterns and, of course, use business context. That’s very useful. The second thing is, there is a lot of implicit knowledge in the organization. Make it as explicit as possible.
Grzesik: Put ADRs out there. Put designs out there. Don’t use emails to transfer design decisions or design discussions. Put it in a wiki. It’s also written, but it’s also asynchronous. In the distributed organization that we work with, that makes it possible for people to ask questions and comment, and to know what the error was, what was decided, and why.
Ptak: If you’re a leader, I would encourage you to do continuous ownership refactoring. Look for the following signs. If there is peer review confusion: who should review it, how should we review it?
Grzesik: How can you measure that? The time to review is long because nobody feels empowered, or the correct person to review it.
Ptak: Another one would be, hard to assign bugs. The ones that are being moved between teams. We do have it, but we really try to measure it and understand which parts of the apps have this problem.
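The "bugs being moved between teams" signal can be measured from a ticket's assignment history. The sketch below counts how often each bug changed hands; the ticket ids, team names, and the cut-off of two reassignments are invented for the example.

```python
from collections import Counter

# Hypothetical assignment history: bug id -> teams it was assigned to, in order.
assignments = {
    "BUG-1": ["payments"],
    "BUG-2": ["payments", "onboarding", "payments", "cards"],
    "BUG-3": ["cards", "onboarding"],
}

def bounce_count(history):
    """Number of times a bug changed hands after its first assignment."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)

def bounced_bugs(all_assignments, min_bounces=2):
    """Bugs reassigned at least `min_bounces` times, most-bounced first."""
    counts = Counter({bug: bounce_count(h) for bug, h in all_assignments.items()})
    return [(bug, n) for bug, n in counts.most_common() if n >= min_bounces]

print(bounced_bugs(assignments))
```

Grouping the bounced bugs by component would then point at the parts of the system where ownership needs refactoring.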
Grzesik: I don’t have that problem because the place I work in does pair programming. There is no need to do PR reviews. If you do pair programming, you actually get instant review as you pair, which is awesome.
Ptak: Of course, incidents with unclear ownership. Look for these signals. How to deal with situations where you really don’t know who should own the thing. There are a couple of strategies which I would recommend. Again, clear domain ownership. Then the second one, if we still cannot do it, is proximity of the issue. We can say, it’s close to this team’s responsibilities. They’re the best to own it.
Grzesik: Or, they are going to be affected, or the product that they are responsible for is going to be affected. Maybe it’s time to refactor and actually put it under their umbrella.
Ptak: Sometimes we can have central teams or platform teams who can own it, or in the worst-case scenario, we can agree on the ownership rotation, but do not leave things without an owner.
Grzesik: Sometimes something will go wrong. Of course, never in the place we work at, never in the place you work at, but in the hypothetical organization in which something happens.
Ptak: I would disagree. I would wish for things to go wrong, because that’s actually the best opportunity to learn.
Grzesik: It’s a learning opportunity. There is a silent assumption that you do post-mortems. If you do have an incident, do a post-mortem. Some of the things that we can say about post-mortems, for example, first, let’s start with a template.
Ptak: I know it might be basic, but it’s really important: teach your team to own the post-mortems. Have a very clear and easy to use template. That’s ours, actually. We have, for instance, the important ones: the impact, with the queries or links to logs that I can use. We have several metrics to measure, to see how much better we get: mean time to detection, mean time to acknowledge, mean time to recovery. We do root cause analysis, 5 Whys. The important thing is that we will be challenged on that, and I will come back to that.
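The three metrics named above can be derived from timestamps recorded in each post-mortem. Definitions vary between organizations; this sketch assumes detection and recovery are measured from the start of the incident and acknowledgement from detection, and the incident data is invented.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents with the four timestamps a post-mortem might record.
incidents = [
    {"started": "2024-01-10T10:00", "detected": "2024-01-10T10:05",
     "acknowledged": "2024-01-10T10:07", "recovered": "2024-01-10T10:45"},
    {"started": "2024-02-02T14:00", "detected": "2024-02-02T14:15",
     "acknowledged": "2024-02-02T14:18", "recovered": "2024-02-02T15:15"},
]

def minutes_between(incident, start_key, end_key):
    """Elapsed minutes between two recorded timestamps of one incident."""
    start = datetime.fromisoformat(incident[start_key])
    end = datetime.fromisoformat(incident[end_key])
    return (end - start).total_seconds() / 60

def mean_minutes(rows, start_key, end_key):
    """Average elapsed minutes across all incidents."""
    return mean(minutes_between(row, start_key, end_key) for row in rows)

mttd = mean_minutes(incidents, "started", "detected")       # time to detection
mtta = mean_minutes(incidents, "detected", "acknowledged")  # time to acknowledge
mttr = mean_minutes(incidents, "started", "recovered")      # time to recovery
print(mttd, mtta, mttr)
```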
Grzesik: Also, what is not here is, who’s at fault? There is no looking for a victim.
Ptak: Because they’re blameless. We try to make them blameless.
Grzesik: It should also be accurate, which means it should tell a story. It could be a crime story or a science fiction story, depending on your take on literature, but it should tell a story about what happened. How did we get there? What could we potentially do differently? This is the actionable part, and we have to do it rapidly. Why? Because human memory is what it is. We forget things. We forget context. The more it lingers, the more painful it becomes.
Ptak: We come back to Westrum pathological organizations. You can recall that probably that won’t work in such an organization. Couple of tips that I would have. Create post-mortems for all incidents. Actually, with my teams, we’re also doing almost incidents, near miss incidents. When we almost got an incident in production, amazing opportunity to learn. We keep post-mortems trackable. There is actually a whole wiki that contains all of the post-mortems’ links, searchable, taggable, very easy to track, to understand what happened actually, and how we could improve the system.
Grzesik: I also keep them in the wiki. Your risk team or somebody in the organization may say that post-mortems should be restricted to the people that actually took part, or that maybe they shouldn’t be public knowledge. Maybe you’re a leader, maybe you’re empowered and in the correct position to fight it, or to escalate it to your CTO. This is the source of knowledge. This is a source of learning. It’s absolutely important not to allow that restriction to happen, because post-mortems are what people will learn from and what influences people’s further designs.
Ptak: Drill deeper. Root causes, we actually peer review our post-mortems, and we actually challenge them. It’s a very good learning opportunity for everyone, actually. I would encourage this as a great idea.
Grzesik: A very practical attitude. Find the person who is naturally very inquisitive. It can be a devil’s advocate kind of attitude. They are going to ask questions. They are not going to ask questions when people are trying to describe what happened, but they will ask the uneasy questions. That’s a superpower, sometimes, in moderation. If you have such an individual, expose them to some of the post-mortems, figure out a way of working together. That attitude is absolutely very useful.
Ptak: The two last items are: track action items, and celebrate improvements. The worst thing is to create a post-mortem and let it die, or to turn it into a bureaucratic, I do it for my boss, exercise. It might be very obvious knowledge, but if you want to improve the organization and improve the architecture, so the Reverse Conway Maneuver, I would recommend post-mortems as one of the things that really teach people to own things and to understand them. It may be basic for some, but it’s actually very useful.
Grzesik: Systems that we write will have dependencies internally, and they will have them externally. That’s something that we need to worry about. Making it explicit means knowing what they are. Then, that also means that you can have a look at, how is my dependency upstream and downstream doing? What are their expectations, aspirations, in terms of quality? Do we have circular dependencies? You might discover that you have, for example in a very big event-driven system, if you’ve never drawn what the loop is or which services certain processes follow. You might get there. Then it’s obviously harder to work.
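Circular dependencies in an event-driven system are easy to miss by eye but straightforward to detect once the flows are written down. Below is a minimal depth-first sketch; the event flow and service names are hypothetical.

```python
# Hypothetical event flow: service -> services that consume the events it emits.
event_flow = {
    "orders": ["billing"],
    "billing": ["notifications", "ledger"],
    "ledger": ["orders"],  # closes a loop back to "orders"
    "notifications": [],
}

def find_cycle(graph):
    """Return one cycle as a list of services, or None if the graph is acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # The node is already on the current path, so the path loops back to it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None

print(find_cycle(event_flow))
```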
Also, if you know what the dependencies for you, which are critical are, then you can follow their evolution. You can see what’s happening. You can maybe review PRs, or maybe just look at the design reviews that people in those services do. Of course, talking to them. In a distributed, very big team, talking on scale, to an extent, which means RFCs, design reviews, architecture decision records, whatever you want to call it, same thing. DDD integration patterns offer some ideas here. Since I mentioned ADRs or RFCs, what we found working really well is very specific takeouts of doing them.
Ptak: Challenge yourself. We call it a sparring session or SDR review. You invite people who are really good at being a devil’s advocate and, on purpose, have them review your RFC or ADR to make sure that it’s the best it can be.
Grzesik: I will recommend The Psychology of Making Decisions talk, if you want to make your RFCs better, because it already mentions a lot of things that we could have included here, but we don’t have to because that was already mentioned.
Ptak: In a large organization with a lot of dependencies, there is the question of how to make sure that you involve the right people with the right knowledge, people who can challenge you. You might depend on a system that they know, and you want them to challenge you on whether you have taken all of the problems in that system into consideration. What can help you? What’s the tool that can help you? It’s the RACI model. Use it for RFCs as well.
Grzesik: What is the RACI model? The RACI model assigns everyone involved in a problem to one of four categories: responsible, accountable, consulted, and informed. Who is responsible? The person who needs to make sure that something happens. A team lead, a head of area, somebody like that. Accountable: who is going to get the blame if it doesn’t get done, or doesn’t get done well. Again, a team lead, a head of area, maybe the CTO, depending on the problem. Consulted: who do you need to engage? Maybe security. Maybe ops. Maybe another team that you’re going to build a design with.
Ptak: These are your sparring partners.
Grzesik: Those are the people you will invite into an ADR. Then the people who are informed are the people who will learn what the consequences are. If they want to come, sure, they can, but they don’t have to. Which means, if you’ve done a few ADRs using this model and know which people to invite, then you almost have a template, not only for the document, what an ADR should look like, but also for who to engage and what kind of interactions you expect from every single group. Look at the benefits. What are they?
Ptak: Some of the benefits are clear: explicit collaboration and communication patterns. It really improves decision making. We don’t do maybes; we know exactly who to involve and who to, for instance, consult with. It really facilitates ownership. It’s very clear who should own it, who should be involved, and who should be told about the changes. I would encourage you to review it regularly. A very typical example from Revolut would be: responsible is usually the team owning a feature. Normally, I would be accountable. Consulted: we make sure that, for instance, other departments, other heads of engineering, other teams, or the CTO, if it’s a massive change, are consulted. Informed can be, for instance, a department or the whole company, so we know exactly how to announce any changes.
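The RACI split described above can be written down next to each RFC or ADR so that the invite list and the announcement list fall out of it. This is a minimal sketch under assumptions: the `Raci` class, its field names, and the example team names are hypothetical, and modelling “accountable” as a single name is a common RACI convention rather than something stated in the talk.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Raci:
    """RACI assignment for one decision record (hypothetical structure)."""
    decision: str
    responsible: List[str]
    accountable: str  # assumption: exactly one accountable party
    consulted: List[str] = field(default_factory=list)
    informed: List[str] = field(default_factory=list)

    def review_invitees(self) -> List[str]:
        """People expected in the sparring session, without duplicates."""
        ordered = self.responsible + [self.accountable] + self.consulted
        seen = set()
        invitees = []
        for name in ordered:
            if name not in seen:
                seen.add(name)
                invitees.append(name)
        return invitees

    def announcement_audience(self) -> List[str]:
        """People who only need to learn the outcome."""
        return list(self.informed)

# Hypothetical example; none of these names come from the talk.
adr = Raci(
    decision="Split the monolith build per team",
    responsible=["team-payments"],
    accountable="head-of-engineering",
    consulted=["security", "ops", "team-payments"],
    informed=["all-engineering"],
)

print(adr.review_invitees())
print(adr.announcement_audience())
```

Keeping the assignment as data rather than prose makes it easy to review regularly, as recommended above, and to reuse as a template for the next decision.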
Signals for Refactoring Architecture
Grzesik: We know how to make awesome designs.
Ptak: We know how to execute well, and your architecture needs scaling.
Grzesik: What could possibly go wrong?
Ptak: You have the system, and we need to talk about the thing that we would call architecture refactoring. Once again, we’ve got heuristics of pain. The first sentence that you may hear, it takes forever to build.
Grzesik: If you’ve ever seen it: people on your team say that the deploy pipeline is slow, so hours, not minutes, or there is a high percentage of red builds. Those are the signals. What is the consequence? The feedback loop slows down, and the time to deploy slows down too, which means small changes will not get into production so quickly. Then there is a tendency to slow down, be a bit bureaucratic, and maybe run the tests again. Maybe run the tests again and again, because some of them will be flaky or intermittent, whatever you want to call them. That’s a problem to track.
Ptak: You may hear something like that. It’s Monday and people are crying, for any reason.
Grzesik: People who dread going back to work.
Ptak: That’s usually a sign. A simplification, of course. It means a system that is hard to maintain, where simple changes are difficult. There is a very steep learning curve and onboarding curve. You need to repeat all of the code, possibly, because you don’t know whether, if you take spaghetti from one side of the plate, a meatball will fall off the other side. That’s typical of a system that is hard to maintain.
Grzesik: Who gives these signals? Senior and lead engineers, people who have been developing software for a while. They know at the back of their head that it should be a simple change. Then they learn that something isn’t right, that it actually is more complex. They’ve spent their third week doing something very trivial. That’s a signal. It is very hard to pick up on a day-to-day basis, because we want to solve the problem. We want to get it done, whatever it is. Then we’ll do the next thing that we want to get done. Surveys will capture that. Or take the people we onboard and ask them after a couple of weeks, maybe months: what is their gut reaction? Is it nice to work with? Is it nice to reason about? Do they get what’s happening?
Ptak: Another quote that you may hear, that there is a release train to catch. That’s a very good sign of slow time to production. We have forced synchronization of changes. We need teams to collaborate together to release something. That means that we have infrequent releases. Of course, that means that we’re really slow and we are not innovating quickly enough. That forces the synchronization of changes between teams, which means that actually they’re not working on the most important things at the time. Another sign and another quote that you can hear, we’re going to crash.
Grzesik: Performance issues, slow response times, frequent crashes. The number of errors in services. Of course, you can throw more services at it, spin up more, to work around it. It’s going to eventually lead to, hopefully, a decision to scale the architecture. We need to scale, not by adding more regions or more clouds, but by doing something different.
Ptak: How to scale and refactor architecture: tips. Of course, every organization ends up in this situation. The important thing is, when you have a large system with many moving components, as we’ve shown you, there is a temptation, when you want to refactor something, to decide, “This time we’ll make it perfect, like every greenfield project. This time, it will work”. Usually, what you want to do is review everything: the CI/CD, for instance, the patterns, the infrastructure. We can now rebuild the whole thing and it will be shiny. The problem is, it doesn’t work.
Grzesik: What do you do instead? You might have heard of theory of constraints. You might have been doing software optimization.
Ptak: Let’s apply it. First thing is, you need to identify what is the biggest pain of all.
Grzesik: Some examples, tests are taking too long, CI/CD too long, and so on. You can definitely come up with more examples. You pick one, and then you try to work with it, which is formally called exploit the constraint.
Ptak: You focus everything on it. You ignore the rest, and you fix it to the level where you actually remove it and get even better. It not only stops being a constraint, it stays removed as a constraint for a longer period of time.
Grzesik: Then, the very important last element: you build this list again from scratch, because the previous order will most likely no longer apply. How do you then use it?
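The loop just described (pick the biggest pain, fix it thoroughly, then rank everything again) can be sketched in a few lines. This is an illustration only: the pain names, the numbers, and the `next_constraint` and `run_rounds` helpers are made up, and a real “fix” is months of work rather than a function call.

```python
from typing import Callable, Dict, List

def next_constraint(pains: Dict[str, float]) -> str:
    """Identify the single biggest pain (highest cost) right now."""
    return max(pains, key=lambda name: pains[name])

def run_rounds(
    pains: Dict[str, float],
    fix: Callable[[str, float], float],
    rounds: int,
) -> List[str]:
    """Repeat: pick the biggest constraint, fix it, then re-rank from scratch."""
    current = dict(pains)  # do not mutate the caller's measurements
    worked_on: List[str] = []
    for _ in range(rounds):
        target = next_constraint(current)  # re-evaluated every round
        current[target] = fix(target, current[target])
        worked_on.append(target)
    return worked_on

# Hypothetical pain list, measured in minutes lost per change.
pains = {"build": 70.0, "tests": 40.0, "deploy": 25.0}

# Hypothetical fix: each focused effort removes 75% of that pain.
order = run_rounds(pains, fix=lambda name, cost: cost * 0.25, rounds=3)
print(order)
```

The point of recomputing `next_constraint` inside the loop is the “from scratch” step: after a thorough fix, the old ranking no longer tells you what to work on next.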
Ptak: There is a second approach that we can take. Let’s say you don’t have a very large pain point. It’s not like you’re crying, but you really want to optimize for something that you know you will need. It’s called the fitness function. We really recommend looking up several examples. One can be: we want, for instance, builds to be 10% quicker by the end of the quarter. That’s how it works. You define a metric, and you devote everything to improving it. What you can also do is combine the two approaches. Let me tell you how we did it. For the last two quarters, we’ve been working on Revolut Business modularization.
The biggest pain was build time. It took over an hour for us to build it, over two hours to release it to production. For us, it was way too slow. It was massive, over 2000 endpoints, nearly 500 consumers, nearly 20 teams involved in the project. That’s exactly how we did the theory of constraints. We focused on the build time. Now every team has their own service. We reduced the build times by 75%. Massive. We optimized for one thing only. We haven’t, for instance, refactored the architecture.
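A fitness function of the “builds 10% quicker by the end of the quarter” kind mentioned above can be expressed as a small automated check. This is a hedged sketch: the `build_time_fitness` name, the way build times are measured, and the sample numbers are assumptions, not how Revolut tracks it.

```python
def build_time_fitness(
    baseline_minutes: float,
    current_minutes: float,
    target_reduction: float = 0.10,
) -> bool:
    """Return True when the build got faster by at least `target_reduction`.

    `baseline_minutes` is the build time at the start of the period and
    `current_minutes` the latest measurement (both hypothetical inputs).
    """
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    reduction = (baseline_minutes - current_minutes) / baseline_minutes
    return reduction >= target_reduction

# Hypothetical measurements, in minutes.
print(build_time_fitness(60.0, 53.0))  # a bit over 11% faster
print(build_time_fitness(60.0, 58.0))  # only about 3% faster
```

Run on a schedule or in the pipeline itself, a check like this turns the quarterly goal into a signal the team sees continuously rather than at the end of the quarter.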
The Cold Shower Takeaways for Uncomplicating Architecture
Grzesik: You’ve probably heard this: a bad system will beat a good person every time. If the organization you work in has shown it to you in its history, or you have seen it in past lives, it’s a learning point. Which means the question that we keep asking ourselves when we try to design how teams and processes work in the places we work at is: can we build a system, or how do we build a system, that supports people in doing the correct thing and keeps giving the right signals?
Ptak: To build an antifragile system, because that’s what we’re talking about, learn from nature. Nature has its own way. Apply stress to your organization, to your system, to your architecture. Look for things to simplify, unify, and automate. We gave you several tools for how to do it. Learn from nature. You actually want to apply stress, to the architecture and to your teams. That’s very important.
Grzesik: Crucial element of that is short cycle time. Humans have a limited attention span, which means if the change and the effect is something we can observe, we’re going to learn from it. If it takes years or months from one to the other, it’s hard, and we’re probably going to do it less.
Ptak: We gave you several tips on how to build an organizational growth mindset. Definitely look at them. They may seem basic, but if you do them well and connect them, they will lead to teams improving themselves and owning things better, and you will be able to do what is famously called the Reverse Conway Maneuver. Make sure that ownership is explicit. You really want to look for the things that nobody owns, or where the ownership is wrong, and then refactor the ownership. We gave you several tips. That’s how you can work on the different Westrum factors.
Grzesik: If you do that, you can have a system in which a team, or a pair, a couple of engineers, can make a change, can make a decision that they need to deploy to production, and do it. Which means that, if they notice a problem, they take the correct action at the lowest possible level of complexity. Then the learning is already in.
Ptak: A lot of companies would say, we’re agile, and we can do different things. I would say, don’t. Impose constraints. If you really want to be fast, if you really want to scale up in the hyperscale fashion, impose constraints on the architecture and tooling, so people focus on the right things. Remove everything that is non-essential, so you reduce the randomness and the complication of the system. Most importantly, remain focused. A lot of organizations, of course, misunderstand agile. For instance, in retrospectives, teams are really good at explaining why they haven’t delivered. That’s the other side of the spectrum that we could be hearing.