Transcript
Amorim: I’ll talk a little bit about our history in terms of the DevOps platform and how we have scaled so that dozens of people can support thousands of engineers and thousands of systems, to deliver value to you guys, the millions of customers. Hopefully, you are customers.
A little bit of context if you don’t know what Revolut is. Revolut is the financial super app, and you can do basically anything regarding money in it; that’s the goal. Just to give you a perspective on the dimensions of the things that we are going to talk about during this session: at this point we have 50 million retail customers, people like you that use the application daily. We have 500 business customers. We are the number-one financial application in 11 countries. We support 36 currencies. We operate in over 160 countries worldwide. We are 10 years old. We are young, but we have impressive numbers.
This application, this functionality, was created by, at this point, 1,300 teams, 12,000 employees, of which 1,300 are engineers. Not all teams have engineering capabilities, just to point that out as well. We are a microservice-oriented company in the sense that we build our products around a microservice-oriented architecture. We also have 1,200 microservices. Actually, the numbers are closely related to one another. We have 1,100 databases. We also have other components that we need to take care of: libraries, and base images, either Docker containers or virtual machines; those are also included in the pack of things that we need to maintain. To support all of these engineers and all of these systems, we have 15 DevOps engineers split across two teams, and I’m leading one of them.
Professional Background
My name is Sergio Amorim. I’m a software engineer. I’ve been in the IT industry for a quarter of a century already. I’m driven by one passion, which is making life simpler for me, for the engineers, for the customers, whoever has a challenge that, in my perspective, IT can handle.
Product Team Needs
For us, for the DevOps, what are our client needs? What are the product team needs? We can divide this into three stages. The first stage is: I want to build my own application to serve a business need. I need somewhere I can put my code. I need a foundation in order to build something on it. I don’t want to do everything from scratch. I want CI/CD pipelines to see if everything is ok, so I can iterate super-fast, as we’ll see later on. We call this the development stage. Then we have the delivery phase. In terms of delivery, I have my artifact. I have my system that I’ve just built. I need to install it somewhere. Where am I going to install it?
At Revolut, our systems are primarily running in Kubernetes. You’ll need namespaces, service accounts, and all of the resources, deployments, stateful sets, things of that nature. You also need cloud resources like load balancers, Pub/Subs, things like that, or plain old virtual machines. Some workloads don’t run directly on Kubernetes. You also need NAT devices. We manage our own NAT devices, and we’ll check why later on. You also need buckets, because not everything is stored in databases. Some things are just files that we need to store: recordings, things of that nature. You also need secrets, because we are interacting with vendors. We need passwords to reach them, access through SFTP, or just plain old certificates. And why not DNS as well, because you need to reach revolut.com.
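In Kubernetes terms, the first items on that list can be pictured as a minimal manifest. This is a hypothetical sketch with illustrative names, not Revolut's actual resource definitions:

```yaml
# Hypothetical sketch — names are illustrative, not Revolut's conventions.
apiVersion: v1
kind: Namespace
metadata:
  name: alert-courier
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alert-courier
  namespace: alert-courier
```

The point of the talk is that no engineer should have to write even this much by hand.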
These are the typical things that engineers need in order to have the application up and running. Running also means that they need to operate it. They need to see how the application is working, so logging. They need to react to deviations of the system, so alerts. They also need to visualize, to be proactive, how the system is behaving, so dashboards. One thing that is important, as well, is for us to understand if that particular area has cost benefits. They need to understand, as well, how much the resources are costing.
If you think about all of these three stages, there’s a lot going on already in just setting up the application, before the engineering team has produced anything. Four years ago, it was something like this. The company is 10 years old, so we are just giving you the last 4 years’ perspective. Product teams had the ability to go into our observability tool. They had the power to go there, and they had to create their own alerts, their own dashboards, everything by hand. They had a lot of power, but this also means that they had to learn the tools, because we use third-party systems for that. They would need to learn how to set that up.
The thing is that the way one team was setting things up was probably very different from another team, so a little bit of deviation crept in as well. You know what happens. People tend to copy-paste things, and one error just gets accumulated over time. They were also able to create cloud resources or Kubernetes resources by going into Git and just doing a couple of pull requests, and things were great. Then they would run some jobs that would create those resources.
Speaking of jobs, and we use TeamCity for that, they would have to go into that platform and know how to deal with steps and all of that, and they would copy-paste their own jobs as well. A lot of effort for the teams. It turns out that sometimes they didn’t know how to do those things, so they asked for help. Usual channels for help are either service desks, or just plain old Slack, “I need help”. Us DevOps would go into those tools and sort things out. Four years ago, it was ok. We had a small number of services. Then we expanded, and we are still expanding, to all of those countries with all of those numbers that you saw. In terms of underlying systems, we have grown four times, if not more. How can we deal with all of those manual processes? We cannot.
The Change
What was the change? Let me go back a little bit to tell you what the vision is. We want the teams to be able to build and operate their own product from day zero. I have an idea, I want to focus on the business part of it. I want to build it. I want to ship it. I want to make value out of it on that particular day. With all of those systems, it would be impossible to do that if I had to do all of that setup manually. For that, we have three pillars. A catalog of our systems, and we’ll see how that plays out. We have well-defined patterns. Obviously, we have automation. Why? To leverage the catalog of the systems, to instantiate those patterns and deliver those assets for the engineers to then use to deliver the business value that they want. For the catalog, let’s do one thing: let’s meet Tower.
Tower is our developer portal. It’s more than just a plain old CMDB, a catalog of the system. It’s a way for the engineer to understand how the system is, how it behaves, certain things around it. It’s a product that we built in-house. We leverage a lot of that information. Here, in Tower, we can see certain components. I’ll show you two, just because those are my favorite ones. We’ll see in a moment as well how they play a part in this presentation. Let me jump into one specific one, which is the alert-courier. Here, very traditional as any other CMDB. We have the name, certain descriptions, a lifecycle, nothing fancy. Then comes a couple of things that for us are very important, the kind of component, which is a service, and the stack. Here comes the part of the patterns. We have a limited set of stacks that engineers can use in order to build their own systems. This ensures that we have some repeatability on it.
Another important thing is that we know the ownership of it. There is a particular team that owns it and is responsible for it, so accountability. We also have other things like source code, where the source code is. We have other things that are super important for us, which is how this component plays a part in the ecosystem. To whom does it connect? Why does it connect to that particular system? What databases does it use? It doesn’t end on this CMDB part. We also know where it’s going to be deployed into. We have here, ok, it has a deployment in production in Europe, and another one as well in development environment. We have things related to the quality of it. Does it have enough metadata? Does it have enough maintainers? These are important things as well for the maintainability of the system.
We also have other important topics from the DevOps perspective, such as its availability. Things like, it is a tier 2, it must be up 99.9% of the time. We can see its state over time, and luckily, in the last week, it was all green. All good. We also know the cost of it. This component costs $1 or $1.80 a day. Nothing fancy, but I can see the progression of it. I can see the trend of it, and see if it justifies the business value or not. Then I can see the changes as normal. For us, this developer portal is the foundation for what we are going to build next. This is not built by the DevOps platform team. It is another team that exists within the company. We are one of the main stakeholders of it, and you will know why.
Data Utilization
What can we do with that data? If you recall, in the initial state, product teams were able to do some things directly in the tools, and for others, they would ask us to do it. Tower by itself might not have a lot of value, or it might have, actually. When we receive requests, even if they are, “I need to do something and you still don’t have automation for it”, we go into Tower and we check if that particular component exists, and if that application is the right type, the right kind. It helps a lot in the review process. Why? Because we want to enforce patterns, so even if the thing is manual, we can still see if it fits, or if it will fit, the needs that we are going to have in the future. Obviously, we cannot stop there. This doesn’t help at all in scaling out to that dimension. We created a set of tools. We call it Rhea. It’s basically an ETL system. It picks up the information from Tower. We are a company that focuses a lot on configuration as code.
That particular Rhea, that particular ETL job, creates commits to the systems, and the systems, using a GitOps approach, consume them and affect the end system. With this part, we already alleviated some of the traditional manual work that the DevOps team was doing. Basically, now the team can focus on another part of the system. Product teams still need to do certain things manually. What’s the next stage? One of the aspects that I mentioned to you before was that teams require secrets, but the secrets are not free for all. I, as an owner of a service, cannot go into the secret management solution and fetch any secret. I need to have a policy, a control, in order to be allowed to go into it. A couple more concrete examples. If I want to connect to a database, I need to have the secret for that database, the username and the credentials.
In the catalog, I need to have a flow that says, yes, you can use that database. You can also start to see the loop being closed here. The engineer goes into Tower, the catalog, defines that there is a connection, and with that, he has access to the system. It feeds the loop back. Using that information, we can actually replace one thing that he did manually. Let’s automatically create the policies for the engineer for that particular component. If, in my case, the alert-courier, the example that I showed you, has access to a database, I grant that automatically, and the engineer doesn’t need to do anything. One step towards no configuration whatsoever.
Speaking of which, we know that it’s a Java application or a Python application deployed in a certain location. Let me create all of the necessary capabilities for them to create the Kubernetes resources, the cloud resources, the buckets, whatever they need. Note also the pattern: our transpiler, Rhea, our ETL job, just creates commits, and after that, our GitOps system consumes them and affects the third-party applications. Nothing fancy here. What you did manually, you can still do, but we are replacing the human part of it with automation. That’s one of the key principles that we have, as we are going to see as well. The learning is that we are not doing any R&D research, any humongous discovery here. We are just building on top of what we have, replacing, iterating, and making it better.
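To make the "catalog in, config files out" idea concrete, here is a minimal Python sketch of a Rhea-style transpiler. Everything here — the field names, file paths, and policy format — is an illustrative assumption, not Revolut's actual code:

```python
# Minimal sketch of a Rhea-style transpiler: one catalog entry in,
# a set of files to commit to downstream GitOps repos out.
# All names (fields, paths, formats) are illustrative assumptions.

def transpile(component: dict) -> dict[str, str]:
    """Turn one catalog entry into {path: file content} to commit."""
    files = {}
    name = component["name"]
    # One secret-access policy per declared database dependency.
    for dep in component.get("dependencies", []):
        if dep["kind"] == "database":
            files[f"policies/{name}-{dep['name']}.hcl"] = (
                f'path "database/creds/{dep["name"]}" {{\n'
                f'  capabilities = ["read"]\n'
                f'}}\n'
            )
    # One CI/CD pipeline definition per component.
    files[f"pipelines/{name}.yaml"] = (
        f"service: {name}\nstack: {component['stack']}\n"
    )
    return files

component = {
    "name": "alert-courier",
    "stack": "python",
    "dependencies": [{"kind": "database", "name": "alerts-db"}],
}
out = transpile(component)
print(sorted(out))
# → ['pipelines/alert-courier.yaml', 'policies/alert-courier-alerts-db.hcl']
```

A real GitOps consumer would then pick these files up from the commit and apply them to TeamCity, Vault, and so on.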
Demo
Let’s see a demo. On the right side, I have the two Git repositories. One is the one that has the pipelines. Nothing fancy. No changes whatsoever. Here, we have Rhea, the transpiler that I was talking about. Over here, I have the representation of Tower in text format. You saw Tower as a fancy UI. We started way smaller. Just like in the previous presentation, the focus was: do the right research and development, invest in what is needed. We started with a simple YAML catalog. It’s a YAML list of components, ownership, and so on. Then, we evolved it to a UI. Let me edit the text representation. As you can see here on the left, a database is being created. Nothing fancy here, as well. The service, alert-courier, is being created. A couple of things to point out here. It has a name. It has a team. In this case, it’s the ID of the team. It has a lifecycle. It has dependencies. You’ll see how they will play along.
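An entry in such a YAML catalog might look roughly like this. The field names are my illustration, not the exact Tower schema:

```yaml
# Hypothetical catalog entry; field names are illustrative.
- name: alert-courier
  kind: service
  stack: python
  team: 42          # team ID, resolved to ownership in Tower
  lifecycle: production
  dependencies:
    - kind: database
      name: alerts-db
```

Starting with a plain list like this, and only later adding a UI on top, is the "invest in what is needed" approach the talk describes.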
Including a database. I will now run the orchestration that picks up this information and computes the Git representation of what we would do over the systems. This was for the CI/CD. I didn’t present this in too much detail, but this is CI/CD. If I go here and do a git status, I can see that it created a bunch of files here, and these files are nothing less than what TeamCity consumes in order to create the pipelines. TeamCity consumes this automatically, without you doing anything, as part of their solution. Again, names, Git repositories being added. We actually consume that ownership information, and then we compute which channel to raise in when things go wrong and a build or deployment action fails. Let’s see the secrets part of it.
If I run the secrets part, and go here and check the status as well, there’s one file changed, the one that I didn’t check out before. This one is in Terraform; the previous one was XML, this is Terraform, but we export in multiple formats. If I do a diff, we see here that we have basically a namespace, a service account, and the policies that it has access to. In our case, we actually use HashiCorp Vault, so this turns into resources in Vault and policies in Vault, and the component now has access to these things. Nothing fancy. Nothing serious. It’s something that the engineers first did manually; we now do it in an automated way.
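The generated Terraform might look roughly like the following. This is a hedged sketch: the resource names and the policy path are illustrative, not the actual generated output:

```hcl
# Hypothetical sketch of the generated Terraform; names are illustrative.
resource "kubernetes_namespace" "alert_courier" {
  metadata {
    name = "alert-courier"
  }
}

resource "kubernetes_service_account" "alert_courier" {
  metadata {
    name      = "alert-courier"
    namespace = kubernetes_namespace.alert_courier.metadata[0].name
  }
}

resource "vault_policy" "alert_courier_db" {
  name   = "alert-courier-alerts-db"
  policy = <<-EOT
    path "database/creds/alerts-db" {
      capabilities = ["read"]
    }
  EOT
}
```

The GitOps pipeline applying a file like this is what replaces the manual creation of namespaces, service accounts, and Vault policies.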
The More Complex Cases
However, like I mentioned to you, we started with a YAML format, then we moved on to a UI approach. We know that sometimes the UI takes a little bit more time to react than a text file that can accommodate for that. During the session, I’m pushing a lot for patterns as well, but we also know that there are certain cases where the patterns don’t fit the need, either because it’s R&D or it’s a very special case, so there’s always a little bit of room for special cases. My recommendation to you, if you follow this pattern, is to also leave a little bit of space for additional metadata, simple metadata that you control, where engineers can go in. It’s not a free-for-all; there are very prescriptive patterns for using it, so that it fits in the system, and engineers tweak things that are then used to generate the files that the other systems use. Rhea at this point is our dispatcher to orchestrate several systems.
One advantage of this system is that if we also want to change the provider, let’s say that for vault policies we don’t want to use HashiCorp Vault, we want to use another thing, basically we adapt Rhea to generate another template that that other system can consume and interact with. This adds a little bit of flexibility as well. This is the current state: we manage automatically 10,000 pipelines, 37,000 policies, those accesses to certain parts of the secret management solution, 3,000 service accounts, 20,000 standard alerts, and 3,300 custom alerts; we’ll check that in a bit. This allows the engineers to do 100 deployments to production per day. There’s a lot of innovation already happening from the engineering perspective that we enabled without being directly inside of the loop.
Less Petting, More Breeding (Alerting Use Case)
Let me reiterate again: it’s important for us to have patterns. You cannot automate if you don’t know the pattern of it. How many engineers are here? “I have a different need, I’m different from the other ones”, most likely you are thinking that. The fact is that from the company perspective, you’re probably not that different from the others. If you think about it, you’re not that different. Your business need is different. The way that you develop should most likely be similar to one of your sibling teams. Why? You are probably not going to stay in that position during the entire lifecycle of your career. You’re going to change teams. Imagine you go into another team, and now you need to understand how they build everything from scratch; the commands to build the applications are different, the stack is different, you don’t recognize it. You wouldn’t like that. Put yourself in their shoes, and most likely, if you already know how to create and build and maintain the application, you just need to focus on the business part and the code part of it.
Let me give you an example of how this is an advantage for the engineers and also for the company. I told you as well that engineers had full control of observability. They were able to go there, tweak things around, create alerts, dashboards. This was, I have all the power, I can do anything. Wrong. The thing is that it created high noise. Why? First, the alerts were all different from one another.
The other thing is, my case is special. My alert is critical, I don’t care about the other ones. My alert is critical. The alert fires at 3 a.m.? I don’t care. Call me in the morning. Everybody was saying that theirs was critical, but nobody cared. We have, and we will see, thousands of alerts, and nobody was acting upon them. What did we do? We used that SLO configuration, which again started in YAML and then turned into a UI. Rhea, again in the middle here, is doing its job, creating the necessary configurations in the service monitoring, because we have different tools to monitor the services and the databases. It’s doing its thing, creating the policies, all of that, and alerts.
Then we have our own pigeon that receives the alerts, either from the monitoring tool or the database monitoring, and delivers the message to Slack. In this case, Slack is our communication tool, and we have our own tech support system, 24-7, that’s looking at it and acting upon it. The Slack message was unified. It always starts with the system that is breaking, or, potentially, has an issue. It has a name for the alert that is being raised. Again, there are certain alerts that the teams cannot even disable. Availability is one of them. Your service must have a notion of availability. That’s one example. You can tweak certain things like the tier, and where you are, but you cannot tweak much more than that. The environment where it’s breaking, and also the criticality of it. We also enforce certain rules for it when we have warnings and we have criticals.
The important thing is that it tags the owner team. The team that is responsible for that component is the team that is going to fix it, or potentially investigate and then escalate to other teams if necessary. Why is this important as well? Not only for the resolution of the incident, but also because we take noise very seriously. We don’t want noise. The system should alert when actually there’s something to act upon. One thing is one alert at one particular time.
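The unified message shape described above could be sketched like this; the exact format, field names, and team handle are assumptions, not the real Slack template:

```python
# Sketch of a unified alert message: system first, then alert name,
# environment, criticality, and the tagged owner team.
# The exact format and fields are illustrative assumptions.

def format_alert(system, alert, environment, severity, owner_team):
    # Enforce the rule that only two criticalities exist.
    if severity not in ("warning", "critical"):
        raise ValueError("severity must be 'warning' or 'critical'")
    return f"[{system}] {alert} ({environment}, {severity}) -> @{owner_team}"

msg = format_alert("alert-courier", "AvailabilityBelowSLO",
                   "prod", "critical", "platform-team")
print(msg)  # → [alert-courier] AvailabilityBelowSLO (prod, critical) -> @platform-team
```

Because every alert follows the same shape, the 24-7 support rotation can parse any message at a glance, which is exactly the noise-reduction goal.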
The other thing is having multiple alerts during a time window. If you have more than X amount of bugs, or X amount of unavailability over a big window, we create what we call a bug ticket. If you have more than X bug tickets, two in our case, you get an innovation block. Basically, that’s your error budget. You are not allowed to make changes. Only emergency changes, and only if the head of engineering and other people actually approve them; you, as the owner team, cannot bypass it. This feeds the loop in which the alerts are already defined out of the box. You don’t need to create them. It is actually used to put some rules and governance on top. The impact of this is that we had around 300 alerts per day 6 months ago. We were able, on the good days, to reach 30 just by making this change. With the benefit that the engineers don’t need to know the tool and all of that, and get all of that for free. Not all of the days are at 30, and the teams still need to understand what their components are doing.
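As a toy model, the error-budget gate might look like this; the threshold and the approval flow are illustrative, not Revolut's actual rules:

```python
# Toy model of the "error budget" gate described above: too many open
# bug tickets in the window blocks all non-emergency changes.
# The threshold and approval logic are illustrative assumptions.

BUG_TICKET_LIMIT = 2  # more than this -> innovation block

def change_allowed(open_bug_tickets: int, emergency: bool = False,
                   head_of_engineering_approved: bool = False) -> bool:
    if open_bug_tickets <= BUG_TICKET_LIMIT:
        return True
    # Over budget: only approved emergency changes go through,
    # and the owner team cannot bypass the approval.
    return emergency and head_of_engineering_approved

print(change_allowed(1))   # → True
print(change_allowed(3))   # → False
print(change_allowed(3, emergency=True,
                     head_of_engineering_approved=True))  # → True
```

The key design point is that the gate is computed from data the platform already has (the bug tickets generated from alerts), so governance comes for free with the automation.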
What About Datastores?
I’ve talked about stateless things: services, which usually are deployments in Kubernetes, or alerts that we can generate. What about databases? Like I mentioned to you at the beginning, we manage our own databases. Why? Because data is important. For us, data is critical, and we are the ones that know how to handle it best. Also, the cloud providers don’t provide certain functionalities that we do. At least a year ago, I’m not sure if this has changed, if you wanted to upgrade Postgres, which is what we use, from one major version to another, you needed downtime. You had to shut down your systems, upgrade them, and bring them up. It would sometimes take hours, if you are talking about 60-terabyte clusters. It would take a long time to do it. We are able to do this with a few seconds of unavailability. We will check this in detail. What are the typical operations that the engineers ask of us, or that we need to do?
New clusters, new databases within those clusters, upgrading the cluster horizontally or vertically, increasing the disk size, decreasing the disk size as well, doing resource tuning, whatever. A lot of actions that they need. For that, we also created a couple of extra tools. We created one called Robocoder, which helps ensure that the virtual machines and the operating systems are always up to date. It basically builds a new base image, applies the new patches, and then triggers the necessary rollouts. We’ll check how that happens. If you go into the configuration of the cluster, you can also tweak those things. We also have a change detector: if you want to do some special operation, it helps you by triggering it automatically. The engineer doesn’t need to go and run special jobs for that. It’s more of a GitOps approach: you change it, then we apply it. There are certain changes or certain events that need to be handled by the database system itself.
In Postgres, if you have a system that is replicated, there might be some instances that fall a little bit behind the master. That can be because the system is under load, or a client is misbehaving and hammering that particular instance. Sometimes it just falls so far behind that it cannot catch up with the main instance. This creates issues because the master instance now cannot free space; it has another one that is lagging behind, and it needs to store that data for the other one to catch up.
The consequence of this is that the disk can keep increasing, and then you have a master failure or a cluster failure because you ran out of space. A couple of things that we do. We do what we call a re-init operation over the cluster, to re-initialize one member and put it exactly like the other ones. Another thing we also look at: if you are running out of disk space, we automatically increase it. We increase it for the entire cluster, not just that particular instance. Cloud providers allow you to have automatic disk space increases, but if you have a cluster with four machines, they would probably end up out of sync on that. It’s something that we don’t want.
Unattended DB, Major Upgrade
How does the actual upgrade happen? The Robocoder auto-applies patching. The initial state, let’s say, is that we have a primary instance with a bunch of replicas. One replica has an important role, the analytical one, which we use for data warehouse analytics. The first thing that we do is upgrade that analytical instance. Let’s say that it was Postgres 14; we upgrade it to Postgres 15. For that one, we are able to have downtime if we want. That’s not a big deal. That’s the primary place where we do the experimentation of upgrading. When I say experimentation, we actually start in dev and then we go to prod. We know that prod, with its data, is sometimes a little bit different from dev. We start with a less critical one. Then, after the analytical one is upgraded and it’s completely in sync with the master, all data replicated, we do the same for the next replica. Then we do the same again for the other replica.
At this point, all of the instances except the primary are already on the new version of Postgres. Let’s call it 15. The previous one was 14. The primary is on 14, all of the other ones are on 15. Now comes the tricky part. What we do is the switchover. The primary instance now is a 15 one. In effect, the cluster is already upgraded. The clients now see a Postgres 15 instance on top of it. This is the critical moment, because rolling back is a little bit more tricky. If there are failures, we tend to roll forward. It also can cause a little bit of unavailability. What I mean by unavailability is that the primary serves write operations, so the clients will need to acquire a new database connection to the new primary instance. This is the downtime. It’s the time that it takes to react: “The primary is not this one anymore. It’s the other one. Let me acquire the connection”. Because of that, and because we also have tight constraints in terms of delays, we just reject the HTTP request and then retry again.
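That reject-and-retry behavior can be sketched with a small retry helper; the attempt count, delay, and exception type here are illustrative assumptions:

```python
# Minimal retry-with-backoff sketch for surviving the switchover window:
# a write against the old primary fails, and a retry shortly afterwards
# lands on the new primary. Parameters are illustrative assumptions.
import time

def with_retries(op, attempts=3, delay=0.05):
    for i in range(attempts):
        try:
            return op()
        except ConnectionError:
            if i == attempts - 1:
                raise               # gracefully fail the client's request
            time.sleep(delay)       # the new primary should be up shortly

calls = {"n": 0}
def write_op():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("primary moved")  # stale connection rejected
    return "ok"

result = with_retries(write_op)
print(result)  # → ok
```

Building every operation to be retryable, all the way up to the mobile client's "please retry the action", is what turns minutes of downtime into seconds.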
Our systems are also built to retry operations if needed, or to gracefully fail the client’s request. On your mobile, you can also get a “please retry the action”. We were able to do that because we built the operations to be resilient to it. This is the most critical part. If all goes well, it goes well, fine. Then the last thing that we need to do is actually upgrade the former primary, now a replica. Then the entire cluster is upgraded. Again, this is a new automation that we built on top, because before this it was done manually, host by host, and it would require a DevOps engineer to be there. Now we can do this just by automation, because every week we do this for the entire infrastructure. Just speaking about numbers, on a monthly basis we do around 3,000 operating system upgrades, 2,000 automatic configuration changes, and 200 version upgrades. We also always try to keep up with the latest version of Postgres, to stay on the cutting edge.
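The whole unattended upgrade order described above can be sketched as a plan; the host names and the helper itself are hypothetical:

```python
# Sketch of the unattended major-upgrade order: analytical replica first,
# then the remaining replicas, then switchover, then the former primary.
# Cluster layout and names are illustrative assumptions.

def upgrade_plan(cluster: dict) -> list[str]:
    steps = []
    # 1. Least critical first: the analytical replica (downtime tolerated).
    steps.append(f"upgrade {cluster['analytical']}")
    # 2. Then each remaining replica, one by one, waiting for sync.
    for r in cluster["replicas"]:
        steps.append(f"upgrade {r}")
    # 3. Switchover: promote an already-upgraded replica to primary.
    steps.append(f"switchover to {cluster['replicas'][0]}")
    # 4. Finally upgrade the former primary, which is now just a replica.
    steps.append(f"upgrade {cluster['primary']}")
    return steps

plan = upgrade_plan({"primary": "pg-1",
                     "replicas": ["pg-2", "pg-3"],
                     "analytical": "pg-analytics"})
print(plan)
# → ['upgrade pg-analytics', 'upgrade pg-2', 'upgrade pg-3',
#    'switchover to pg-2', 'upgrade pg-1']
```

The switchover in step 3 is the only moment clients notice, which is why it is the one the talk calls the critical moment.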
Learnings
Some learnings, and what are we going to do next? Learnings. Have a good-quality systems catalog. It doesn’t need to be super fancy, super rich in data, with 3,000 fields that nobody knows what to do with. Each field should have a purpose; it should be valuable. Try to minimize as well. Try to capture the essence of what you need. As you saw from Tower, I didn’t show you the edit example, but if I were going to create one, I wouldn’t have a lot of fields to manage. The more you present to the engineers or to the product owners, the more you push them away. Also, every field should have a value in the feedback loop. Think about that part. The point here is, less is more, because it also helps keep the catalog up to date. With that data, you can then automate and have more assurance that the automation will work. Not only that, the patterns are essential.
If you have fewer things to manage, less metadata to manage, even fewer possibilities to choose from, let’s call it the stack. As you saw, alert-courier was a Python service. We don’t have a lot of stacks. We have Java, Python, Scala, and ML services. We are very prescriptive in how things are built, because we want to enforce those rules to make it super-fast for you to start creating your own service.
At this point, you probably hate me because I’m saying that you cannot have that space. Sponsorship is super important. Our CTO actually pushes to have that standardization. He feels that you can only move faster if you are super quick in getting those ideas out and into the field. If you are doing everything by yourself, it’s not going to scale. Allow room for extensibility. That metadata case was one example. Let’s say that the company now decides that it’s important for you to have a new stack. It should be easy for you to integrate that stack. It shouldn’t be difficult. The challenge should be evaluating whether that stack is important for the entire company. Once that’s ok, it should be easy to incorporate.
Future Actions
Just a tease of the things that we are still thinking about doing and actually acting upon, so that we can scale even more, because we want to continue to scale. One of our goals is to reach the moon. We want to have fully zero config for any component. We haven’t reached that for all kinds of components. We have it for Java, for Python, for machine learning services. Web application engineers still need to create a lot of things by hand, including nginx configs, and all of that. Super tedious. Super repetitive. We want to eliminate that part as well, to make it easy. Source code repositories are still created by hand, so that’s another thing to eliminate. I also told you that we are increasing the disks. This basically just keeps increasing the cost. We need to reduce the disks as well. You probably know that cloud providers don’t give you this functionality: you pay more, they are happy with it, and you are stuck with it.
Another thing that is difficult is that we are allowing very easy onboarding of components. We create all of that space for the engineers almost at day zero, but we still need to clean things up by hand. Because it’s easy for engineers to create, they don’t put so much care or so much thought into: is my service actually needed? Is it not needed? Is my name correct? Things like that. It’s easy for them to create a new component and get all of those things. Then it’s up to us to clean those things up.
Last but not least, and this is our Achilles’ heel, is self-servicing. With all of those engineers, you might know how they deal with the service desk and support. We have thousands of people asking for help monthly. It’s one of the busiest channels that we have in the company for support requests. Sometimes the tendency is not to look in Confluence for solutions, not to talk with their functional manager or with peers, or even to search in Slack for previous conversations. They are just a little bit lazy about searching for solutions. Sometimes it’s not easy to find them, as well. This takes capacity from the team. We are 16; 2 of them are team leads, so there is already effort on the leadership part. Another two are what we call on duty, just attending to those requests. The capacity is actually limited because of that. We want to cut that part and give more self-servicing to the engineers.
Questions and Answers
Participant 1: I have two questions. The first is about the databases, because, for me, over 1000 databases is a big number. I assume that in this case, it means that application teams are responsible for their own database completely.
Amorim: We call them product teams because they are indeed product teams. They are responsible for building and operating their own resources. That doesn't mean they are able to jump into the machines and do all sorts of operations. As a platform team, we also need to provide the framework and the capabilities for them to understand how their system is behaving. With that, they can analyze issues and either solve them by themselves or escalate to other teams like us to help them solve their issues.
Participant 1: In that case, how are analytics handled for this data?
Amorim: What I've presented are the operational databases. Each of them has a particular instance used by the data warehouse to extract information and put it into the data warehouse system, where it is explored. That is handled by another team, which we call the data platform team. Actually, there are multiple teams in it that handle that part in a different way.
Participant 2: How do you handle the dynamic requirements of a team? For example, if somebody has to create two databases, three EC2s or something, is it driven by metadata, or do you have a workflow in the UI of the IDP that a user goes through?
Amorim: Tower is the source of truth in the sense that you first need to register that particular database. I only showed you services, but there was a connection to a database in that case, for the alert-courier. That database needs to be registered first. Just like a person creates a service, they would need to create a database as well. The database goes into a change management process where the team lead also reviews whether it makes sense. Internally, we also have what we call the system design review document. If you are creating new components or new services, they need to be thought about and agreed upon by the entities that make sense. It starts with that.
Participant 3: I have a question about the demo part. You have shown essentially a tool that takes a file in a repo, does some processing, and outputs a file in a different repo. Is that something that works just for day zero, or can developers still make changes on day 2, 100 days after they provision the original app, and have those reflected in the file in the other repo? If so, which challenges are you seeing with that?
Amorim: The demo that I showed you is our tool, but it's actually running all the time. Internally, it listens to changes in the Git repository. What happens is that Tower automatically dumps that YAML file that has all of the components and databases into the Git repository. It creates a commit, we listen to that commit, and we act upon it to then generate new commits. It's basically a transpilation stage. In terms of challenges, this is actually a challenge. We have multiple stages of transpilation, and we want to cut that out. We should listen to events created directly by Tower, consume them, and immediately interact with the systems, most likely still creating the commit, because we still want to have the change management on it.
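[Editor's note: a minimal sketch of the transpilation stage described above. All field names and outputs here are hypothetical illustrations, not Revolut's actual Tower schema or Rhea's real output.]

```python
# Hypothetical sketch: Tower dumps component metadata into Git; a listener
# reacts to each commit and generates downstream config (pipelines and
# permission sets) as new commits. This models only the pure transform step.

def transpile_component(component: dict) -> dict:
    """Turn one component entry from the Tower dump into a CI pipeline definition."""
    name = component["name"]
    pipeline = {
        "id": f"{name}-pipeline",
        "steps": ["build", "test", "deploy"],
        # Permission set derived from the owning team, as described in the talk:
        # teams get deploy rights without filing an access request.
        "permissions": {component["team"]: ["deploy"]},
    }
    if component.get("database"):
        # Components with a registered database also get a migration step.
        pipeline["steps"].insert(2, "migrate-db")
    return pipeline

def on_commit(changed_components: list[dict]) -> list[dict]:
    """Simulate reacting to a Tower-dump commit: one generated config per component."""
    return [transpile_component(c) for c in changed_components]
```

Keeping the transform a pure function of the dumped metadata is what makes chaining (or later collapsing) multiple transpilation stages tractable: each stage only ever maps one committed document to the next.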
Participant 4: With this configuration being very much driven by Git, do you store this configuration with the product teams themselves, or do you have a central place where you store all the core detail for the entire estate? How do you manage deployments and releases in this setup?
Amorim: Tower is becoming the source of truth for configuration. It's not there yet. Web applications are still managed by certain configs done directly in Git. We want to move them to Tower. It's still not yet the source of truth. Engineers still need to code certain things in YAML, in our own specific language, which is simpler than interacting directly with Terraform, for instance, but it still requires that part.
The engineers are responsible for deploying their own products. We have a process for it as well in terms of release management, but they have their own capabilities. In this case, Rhea is not only generating the pipelines, it's also generating a permission set for the engineers and the teams. It enforces certain rules as well, like I mentioned, in TeamCity. This automatically gives the teams permission to do certain actions, and one of them is deploy. They don't need to request, I want to have access to deploy to A, B, or C. If we go back to Tower, this team, which is actually mine, has the ability to deploy this component to production. It depends a little bit on the synergy of the people in it, but it has that capability. This is also done automatically by the system.
Participant 5: You mentioned about the database upgrades, and I saw that you are keeping replicas running from your primary across different versions. Are you using Postgres logical replication?
Amorim: Yes.
Participant 5: How do you deal with DDLs, schema changes, during the upgrade process?
Amorim: DDL still happens with logical replication, if I'm not mistaken. DevOps is composed of two teams, SRE and DevEx. I'm more on the DevEx side, so this is the responsibility of the SRE team. I believe it still happens with logical replication.
Participant 6: You mentioned that applications have got different SLOs, depending on what tier they’re in. Could you tell us some more about that?
Amorim: We have four tiers: 99.99, 99.9, 99.8, and then 99.5. Even that degree of freedom is very limited. If you are building a component, it's supposed to be up and running. If it's not, it's probably not useful. That's one of the SLOs that the engineers need to have; that's the one that is mandatory. Let's say that you also have write operations, that you are not only interacting with others, but you're also receiving. Then you also have what we call the timeliness SLO, and this is one the engineers need to opt in or out of, depending on the case. They can also define how much time it takes to reply to certain requests. If the average takes more than that, you'll also have a breach of the SLO. For availability, we use the Apdex score, which is the number of requests that were satisfied in time divided by the total number of requests.
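[Editor's note: the talk describes Apdex as a simple satisfied-over-total ratio; the canonical Apdex definition also gives half credit to "tolerating" requests, those slower than the target threshold T but within 4T. A sketch of the standard formula, with made-up latency values:]

```python
def apdex(latencies_ms: list[float], threshold_ms: float) -> float:
    """Standard Apdex: satisfied requests (<= T) count fully, tolerating
    requests (between T and 4T) count half, frustrated requests count zero."""
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# With a 250 ms target: 100 and 200 ms are satisfied, 500 ms is tolerating,
# 2000 ms is frustrated, so the score is (2 + 0.5) / 4 = 0.625.
score = apdex([100, 200, 500, 2000], threshold_ms=250)
```

Comparing such a score per request window against the component's tier (99.99 down to 99.5) is one plausible way the mandatory availability SLO could be evaluated.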