How Amazon Uses Guardrails In Software Development

Carlos Arguelles spoke about Amazon’s inflection points in engineering productivity at QCon San Francisco, where he explained that shifting testing left can help catch issues early. He suggested using guardrails such as code reviews and coverage checks. Your repo strategy, monorepo or multirepo, will impact the guardrails that need to be in place.

When a company is new, it has to move fast, so there are very few guardrails in place. It’s also generally small enough that you know all the people touching a codebase, Arguelles said. As your customer base grows and more and more people depend on your product, it becomes increasingly important to adopt best practices and ensure everybody adheres to them, he added. Those guardrails don’t come for free: they add friction and reduce your ability to move fast in places, so there’s often a tradeoff here.

As a company grows, investing in custom developer tools may become necessary, as Arguelles explained in Inflection Points in Engineering Productivity:

Initially, standard tools suffice, but as the company scales in engineers, maturity, and complexity, industry tools may no longer meet needs. Inflection points, such as a crisis, hyper-growth, or reaching a new market, often trigger these investments, providing opportunities for improving productivity and operational excellence.

The software development life cycle at Amazon starts with an inner developer loop, where an engineer iterates on a piece of code in their own workspace. By default, engineers run unit tests, gather and gate on code coverage and execute various linters every time they create a build, Arguelles said. When they submit a code review, the code review tool runs a number of additional tests on that code.

When those validations pass and the engineer has thumbs up from their reviewer, the code gets pushed to a repo and CI/CD shepherds that code change through various testing stages and increasingly exhaustive tests, including load and performance tests, before finally pushing to production, Arguelles explained:

We expect most of our code changes to reach production within hours of checkin, as all the guardrails should be automated. Once they reach production, canaries further validate the changes.
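The staged flow described above can be sketched as a sequence of automated checks that a change must pass in order, stopping at the first failure. This is a minimal illustration, not Amazon's tooling: the stage names, the run_pipeline function, and the simulated results are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical stage type: a name plus a check that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (promoted, names of the stages that passed).
    """
    passed: List[str] = []
    for name, check in stages:
        if not check():
            # A failed stage blocks promotion; later stages never run.
            return False, passed
        passed.append(name)
    return True, passed


# Illustrative stages only; each lambda stands in for a real automated check.
stages: List[Stage] = [
    ("unit-tests", lambda: True),
    ("integration-tests", lambda: True),
    ("load-and-performance-tests", lambda: False),  # simulated failure
    ("production-deploy", lambda: True),
]

promoted, passed = run_pipeline(stages)
print(promoted, passed)
```

Because every stage is an automated check rather than a manual sign-off, a change that passes all of them can move to production without waiting on a person, which is the property the quote above relies on.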

Finding issues earlier is always better, Arguelles mentioned. Shifting testing left, so that you catch issues in the inner developer loop (pre-submit) rather than in the outer developer loop (post-submit), becomes more and more important as your codebase grows:

This is because if a bad piece of code is submitted and merged, it now blocks N developers around you until it’s rolled back or fixed, and as your system grows, so does N.

Shifting testing left is non-trivial because, in order to run end-to-end integration tests effectively and reliably against unreleased code, you need to invest in ephemeral, hermetic test environments, Arguelles said.

One critical decision that Amazon made is that they operate in a multi-repo world, meaning every team has their own micro-repo. This is in contrast with other companies that have gone the route of a monorepo for the entire company (like Google).

Google uses a monorepo where ~100k engineers use a single repo with no branches. Everybody has to be committed to the health of the repo, because the blast radius of a bad code checkin is enormous; you could literally block thousands of people, Arguelles said. Pre-submit end-to-end testing is not just “nice-to-have,” but business-critical.

Amazon, on the other hand, chose to go the route of multirepos, with every team having essentially their own repo, Arguelles mentioned. This acts as a natural blast-radius reduction mechanism: a bad checkin can break an individual microrepo, but there are more guardrails in place to prevent it from cascading into other teams’ microrepos.

Arguelles mentioned that there’s unavoidable complexity in large systems. How you choose to do your development determines where in the software development lifecycle you need to deal with that complexity. Neither company was able to avoid it, but they’re tackling it either pre- or post-submit, he concluded.

InfoQ interviewed Carlos Arguelles about guardrails and micro vs mono repo.

InfoQ: How do guardrails at Amazon look?

Carlos Arguelles: For example, when Amazon was a much smaller company in 2009, you could quickly ssh to a production host to investigate an issue, tail logs, etc. Any manual interaction is inherently a dangerous thing to do – imagine you type the wrong command or accidentally press an extra “0” and bring down that host or cause data loss. So we put guardrails such that if you want to run a command on a production host, it needs to be code reviewed and there’s an auditable record of your action.

Another example is that you cannot submit code directly to a repo without going through code review and having a thumbs up from another person. We also now gate on code coverage drops to ensure all code going to production meets a minimum test coverage bar. Tooling like this provides scalable ways to encode and enforce best practices.
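A coverage gate of the kind mentioned in this answer could look roughly like the sketch below. The interview does not describe how Amazon's gate works, so the coverage_gate function, the 80% minimum bar, and the 0.5 percentage point allowed drop are assumptions chosen purely for illustration.

```python
from typing import Tuple


def coverage_gate(
    baseline_pct: float,
    new_pct: float,
    minimum_pct: float = 80.0,   # assumed minimum bar, not a documented value
    max_drop_pct: float = 0.5,   # assumed tolerated drop, in percentage points
) -> Tuple[bool, str]:
    """Decide whether a change may proceed, based on test coverage.

    Fails if coverage is below the minimum bar, or if it dropped by more
    than the tolerated amount relative to the baseline.
    """
    if new_pct < minimum_pct:
        return False, f"coverage {new_pct:.1f}% is below the {minimum_pct:.1f}% bar"
    drop = baseline_pct - new_pct
    if drop > max_drop_pct:
        return False, f"coverage dropped by {drop:.1f} points"
    return True, "ok"


# A small dip within tolerance passes; a large drop or a low total fails.
print(coverage_gate(85.0, 84.9))
print(coverage_gate(90.0, 88.0))
print(coverage_gate(85.0, 70.0))
```

Encoding the rule as a function that returns a pass/fail decision and a reason is what lets a build or code review tool enforce it on every change without relying on reviewers to remember it.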

InfoQ: What are the implications of working with micro repos compared to monorepo?

Arguelles: This is an area where the two companies I have worked at, Amazon and Google, differed significantly because of a foundational decision each company made decades ago: Google chose a monorepo and Amazon chose micro repos.

This decision impacts guardrails that need to be in place. For Amazon, end-to-end pre-submit testing was “nice-to-have,” but it was not business-critical.

The decision to have micro repos at Amazon did not come for free, though. In a monorepo, every service is naturally integrated, whereas in a multirepo world, you need to coordinate how changes safely cascade from repo to repo.
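One way to picture the coordination problem in a multirepo world is to treat repos as a dependency graph and roll a change out in dependency order, so that each consumer is rebuilt and tested only after the repos it depends on. The sketch below is a hypothetical illustration using Python's standard graphlib module (Python 3.9+). The repo names and the rollout_order function are invented, and nothing here describes Amazon's actual mechanism.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def rollout_order(depends_on: Dict[str, Set[str]], changed: str) -> List[str]:
    """Return the changed repo and its downstream consumers, dependencies first.

    depends_on maps each repo to the set of repos it consumes.
    """
    # Collect the changed repo plus everything that transitively depends on it.
    affected = {changed}
    grew = True
    while grew:
        grew = False
        for repo, deps in depends_on.items():
            if repo not in affected and deps & affected:
                affected.add(repo)
                grew = True
    # Order all repos so dependencies come first, then keep only affected ones.
    order = TopologicalSorter(depends_on).static_order()
    return [repo for repo in order if repo in affected]


# Hypothetical repos: each entry lists the repos that the key depends on.
deps: Dict[str, Set[str]] = {
    "shared-lib": set(),
    "orders-service": {"shared-lib"},
    "checkout-service": {"orders-service", "shared-lib"},
    "docs-site": set(),
}

print(rollout_order(deps, "shared-lib"))
print(rollout_order(deps, "orders-service"))
```

Repos outside the affected set, such as the hypothetical docs-site, are left alone, which mirrors the blast-radius reduction described earlier: a change only needs to be validated along the chain of repos that actually consume it.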
