Architecture in the Lead: Scaling Today, Shaping Tomorrow

News Room, published 3 September 2025, last updated 7:42 PM

Transcript

Ian Arundale: In this talk we’re going to take you through some of the secrets of how we deliver some of the UK’s biggest digital moments. The good, the bad, the ugly, and the outright terrifying. Before we get into that, since this is a presentation from the BBC, we thought that we might start with some time travel.

Matthew Clark: Time travel?

Ian Arundale: Let’s jump in the TARDIS. Let’s get in there.

Matthew Clark: Ian wants to take us on a journey back in time. Where should we go? I know, let’s go back to the first ever QCon conference of 1892. I was there, anyone else there?

Ian Arundale: Being part of QCon, that’s easy.

Matthew Clark: I think we had a bit of a diversity problem then as well.

Ian Arundale: I wasn’t thinking we’d go that far back actually. I’ve got a particular date in mind actually. How about the 12th of December 2019?

Matthew Clark: Quite specific, 12th of December 2019.

Ian Arundale: Ring any bells?

Matthew Clark: Anyone remember that one? It was, if you’re geeks like us, the date of the UK general election. This was the moment when Boris Johnson won with a handsome majority, promising to get Brexit done.

Ian Arundale: The interesting thing about that election though is it was only called six weeks before it actually happened. So late in fact that we’d already booked to have our Christmas party that very same night.

Matthew Clark: Was this really the image of the Christmas party?

Ian Arundale: No, not quite, this is actually what ChatGPT thinks a BBC party looks like.

Matthew Clark: Yes, it’s probably quite accurate.

Ian Arundale: I think it’s fair to say we’ve all been to The Broken Merge at some point in our careers.

Matthew Clark: Yes. It was all at the last minute and we were rushing through it, but it was very important. Here’s the BBC newsroom. If you don’t know how UK elections work, polls close at 10 p.m., and at that point the broadcasters like the BBC are allowed to show the exit poll, which is very important because it’s very accurate. It’s basically the moment when we find out who’s won. It’s a very important moment, millions are watching TV or online, we have to get this moment right.

Ian Arundale: Let’s set the scene. We’re all in the newsroom. We’re all responsible for the BBC website and apps. We can see from our analytics there’s more than a million people online right now waiting. This is what our page looks like. It looks a bit empty right now, but don’t worry, as soon as the results come in, this page will automatically update.

Matthew Clark: We have 5 seconds to 10:00. Here we go, polls are closed now. Brilliant. Here come the results. There’s no result.

Ian Arundale: It’s cached. Don’t worry about it. Just click refresh. It’ll come through.

Matthew Clark: Ok, refresh, refresh, refresh.

Ian Arundale: Again.

Matthew Clark: We did this for several minutes, got RSI in the process. There’s still no results. This is not looking good.

Ian Arundale: What’s going on here? Let’s have a look at our architecture from back then. This is roughly what it looked like.

Matthew Clark: Let’s see what’s going on. Has the data supplier sent us those exit poll results? Yes, they have. Have we ingested it into our database? Absolutely. Can we call the API and see those results? Yes, you can. Our website and our mobile apps are not showing the results. How is that possible? These are stateless servers that are just taking the API and spitting out HTML. How can they possibly not show the results?

Ian Arundale: There’s only one thing for it. Let’s dive into the cloud console and take a look at what’s going on.

Matthew Clark: Let’s look at what’s going on, except this is 2019, we haven’t finished our migration to the cloud yet. This is running on some remote servers in some data centers we don’t own. People who look after that aren’t working, it’s 10:00. They’re probably at the Christmas party. What do we do? We’ve effectively built something that is unmaintainable. What did we do? For that, you’re going to have to wait for the new series coming to BBC iPlayer next month.

Ian Arundale: No more promotions and no more ChatGPT images. I bet you’re all wondering how many GenAI images you’re going to see in this presentation.

Background

My name is Ian. I am Lead Architect at the BBC. I’ve been here for about 14 years. During my time here, I’ve been able to work on some incredible projects, including working with the team that launched iPlayer on TV. A little fun fact, that app also had to work across Internet Explorer 6.

Matthew Clark: What was this? This is a TV app from, what year was this?

Ian Arundale: This is back in 2011.

Matthew Clark: 2011. We still do write TV apps as HTML apps, and it had to work on IE6 as well.

Ian Arundale: 100%.

Matthew Clark: Brings back bad memories of the old days.

My name is Matthew. I’m Head of Architecture for all of our websites, mobile apps, and indeed, the TV applications that Ian just mentioned. One of my coolest moments was when we got a phone call from NASA back in 2016, 2017, asking if there’s any way we can get iPlayer working on the International Space Station. Turns out, we had a British astronaut up there, Tim Peake, big rugby fan, wanted to watch the Six Nations. He very kindly sent us this. The back story is that, like a lot of broadcasters, we have rights issues, we have to lock things down. We set things up to only be accessed by certain ISO country codes. There isn’t an ISO country code for the International Space Station, so we had to make one up. There we go. We are very proud of what we’ve done.

Ian Arundale: I’m not sure which one’s harder, getting iPlayer onto IE6 or into space.

Matthew Clark: IE6, definitely.

Ian Arundale: Yes, I think so too. One thing we are very proud of is what our architecture has allowed us to deliver. Did you know that the BBC News website is the biggest news website in the world?

Matthew Clark: By a country mile. Look at it.

Ian Arundale: Yes, not even close.

Matthew Clark: Second is the New York Times. Distant second. We’re very proud of that one. Since we’re showing off, one more stat for you. If you’re not from the UK, you may not have heard of iPlayer. It’s our video on-demand, VOD, platform. Think Netflix, Amazon Prime, Disney, and we are actually doing better. We’re growing faster than all of those are. Partly because we do a real breadth of stuff. We have our programs, but we have things like the general election that we do live streaming on as well. We have a nice breadth of stuff. Still, those are the big tech giants. They are massive. Bet they don’t have Christmas parties at that pub. They have a lot of money, huge engineering teams. We genuinely believe one of the reasons why we’re punching above our weight is our architecture unlocking a way to do things without the massive investment that some of those tech giants have. Now, of course, we’re talking about architecture.

This is a well-known fact, that every architecture is different. This quote from, I think it’s the very first chapter, actually, probably written by Neal Ford, or one of the authors of this one, “For architects, every problem is a snowflake”. What they mean by that is every architecture problem is different. The context is different: different organization, different situation, different existing things. You can’t just look on Stack Overflow. There is no one architecture solution for your thing. Why are we here, then, if our solution is different to what you need?

Non-Functional Requirements (NFRs)

Ian Arundale: There are some non-functionals that apply across domains, though. Why don’t we explore those?

Matthew Clark: Let’s look at the 142 non-functional requirements that you can see on arc42.org. Is it 151? I can’t remember. It’s a lot of them. To be fair to this site, it does group them. This slide is trying to bamboozle us. These are all really good. Let’s call some of them out. Fault isolation, intrusion, transparency, speed to market. These are all wonderful things. Now, our architecture, your architecture, will, of course, be based on the functional requirements. Hopefully, your organization knows what it needs. There are also the non-functional requirements. We do believe it’s worth the exercise of sitting down with a list like this and considering them. You probably want them all, or most of them. If you had to pick a few to really bake into your architecture, which ones would you pick? Let’s give that a go for our example. Big events like general elections at the BBC, what three would we pick? We’ve given that a go.

First of all, we’ve gone quite specific, because why not? Elasticity. This is scaling. Scalability, being able to handle growth. We’ve gone with this particular word because, as we’ll show in a bit, we find the BBC is particularly spiky. Massive audiences can turn up or big amounts of data to handle for short periods of time. Being able to scale very quickly, be elastic in our nature, is really powerful. Second one, resilience. We saw that with the example at the intro. If you mess up at 10:00 on election night, you’re in trouble. You can’t run the election again. You’ve got to get it right. You’ve got to be resilient. You’ve got to have built maintainable systems. Thirdly, we’re going to talk about security.

Ian Arundale: Security is another one. Really important to the BBC. Our audience reputation is the only thing that we’ve really got. Security is an incredibly important non-functional for us.

Matthew Clark: We’ll go through these three, because your list will be different, because your situation will probably be different anyway. That’s what we recommend as an exercise. Consider that list, which would be your top three?

Ian Arundale: Let’s talk about some of our technology choices. This is essentially our default stack at the BBC. We are serverless first, which brings us a bunch of NFRs out of the box. You think about speed of delivery, security, cost. A lot of those non-functionals that Matt just showed us, we get straight out of the box. This is about standing on the shoulders of giants and leveraging that infrastructure. That opens up our teams to focus on different business problems. If you were to look into a random team in the BBC and look at the tools in their toolbox, this is typically what you will find.

Matthew Clark: These are from Amazon, if you don’t know your different cloud provider serverless technologies. Of course, being the BBC, we have to say other cloud providers are available.

1. Elasticity

Those do provide all three of these, as well as many other NFRs, or at least help with them a bit. Let’s delve in a little bit to elasticity.

Ian Arundale: Let’s talk about scale. Let’s get some numbers out. Our peak during the general election last year was about 38,000 requests per second. That’s about 10 times our usual traffic levels, which hover around the 4,000 mark. What we do see is during denial-of-service attacks, that can increase to 100,000, 150,000, 200,000 requests per second. This is the traffic profile from the night. You can see it starts quite steady, that’s the 4K mark, up until about 10:00, where you see that huge spike where the exit poll is revealed. That’s not the end of the story, because you can see through the rest of the evening and into the following day, there are still spikes coming up, which are driven typically by things like breaking news alerts, which we can’t predict. When you can’t predict the traffic, that raises the question, how can we scale for events like this?

Matthew Clark: It’s interesting. That first one at 10 p.m., we did know it was going to happen, but we didn’t know how big it was going to be. Then the other ones we didn’t know were going to happen. On a normal day, you might not know what’s going to happen.

Ian Arundale: What’s our approach?

Matthew Clark: There’s a few different ways we do this. Not rocket science, but good standard stuff. First of all, we use a CDN, content distribution network. Basically, it’s borrowing other people’s servers out on the internet that can handle large amounts of connections from large amounts of users. They often have a cache as well, so if lots of people are requesting the same thing, the cache can respond with it, so it doesn’t then hit your system. What’s the best architecture design? It’s the one you don’t have to worry about. Outsourcing this problem to the CDNs. For our live video online, this is pretty much how it works. At a very high level, we have our live video stream. We encode it. We package it. Basically, it means putting it into chunks of a few seconds, and you give all of those chunks to the CDN to put in its cache.
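
To make the chunk-caching idea concrete, here is a minimal sketch of an origin serving pre-packaged segments with cache headers a CDN can hold on to. This is not the BBC’s packager; the plain Node server, file layout, and TTL values are illustrative assumptions.

```typescript
// Minimal origin sketch: finished video segments are immutable, so the CDN can
// cache them aggressively; only the playlist needs a short TTL.
// Paths, TTLs, and the plain Node http server are assumptions for illustration.
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";

const server = createServer(async (req, res) => {
  try {
    if (req.url?.endsWith(".ts")) {
      // A completed segment never changes: long TTL, safe for millions of viewers.
      const segment = await readFile(`./segments${req.url}`);
      res.writeHead(200, {
        "Content-Type": "video/mp2t",
        "Cache-Control": "public, max-age=86400, immutable",
      });
      res.end(segment);
    } else if (req.url?.endsWith(".m3u8")) {
      // The playlist changes every few seconds, so its TTL roughly matches the chunk length.
      const playlist = await readFile(`./segments${req.url}`, "utf8");
      res.writeHead(200, {
        "Content-Type": "application/vnd.apple.mpegurl",
        "Cache-Control": "public, max-age=4",
      });
      res.end(playlist);
    } else {
      res.writeHead(404).end();
    }
  } catch {
    res.writeHead(404).end();
  }
});

server.listen(8080);
```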

Then, when a few million people turn up, no worries, they all get the same chunk. They’re all watching the same thing. We don’t know, in theory, anyway, that there are millions of people watching it then. That’s all quite nice, actually, for video streaming. For the more dynamic stuff, it gets a bit harder. If you’re building a website or a mobile app with lots of interaction, lots of different content, lots of change, and lots of personalization, considering the user, are they signed in, where they are in the world, it gets a bit harder. Because you can’t just cache everything and give everyone the same thing. Let’s delve a little bit into that. There are a few patterns you can do for that.

First thing we do is we run a lot of our websites using Amazon Lambda, serverless. Tens of thousands of requests a second. Tens of thousands of Lambda invocations a second. Not a problem. Scales up very nicely. Set your limits right. Happy days. Off we go. Same with the APIs powering our apps. Scale that up. Lots of requests. All personalized. Not a problem at all. Works out, actually, not too expensive, either. There’s also caching involved. Let’s not do anything multiple times if we don’t have to. Behind that, we have APIs, of course. We call it a business layer, where we do our logic of understanding our content, and the user, and what to recommend them, and all these bits there. Again, we’ve put it behind Lambda. We’ve given it a big cache. Redis, or Valkey, as Amazon is now increasingly calling it, that’s not serverless, but they’re amazing. Those boxes, they can handle phenomenal levels of requests. The blue boxes, the actual code, is all running on serverless. It scales very happily.
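
As a rough illustration of that pattern, here is a minimal cache-aside sketch for a Lambda-style handler sitting in front of a shared Redis/Valkey cache. The ioredis client, key scheme, TTL, and renderPage helper are assumptions for the example, not the BBC’s code.

```typescript
// Sketch of the cache-aside pattern: serverless handlers in front, a shared
// Redis/Valkey cache so the same work is never done twice.
// Library choice (ioredis), key scheme, and TTL are illustrative assumptions.
import Redis from "ioredis";

const cache = new Redis(process.env.CACHE_URL ?? "redis://localhost:6379");
const TTL_SECONDS = 30;

// Hypothetical business-layer call; in reality this would hit content APIs.
async function renderPage(pageId: string): Promise<string> {
  return `<html><body>Page ${pageId}</body></html>`;
}

export async function handler(event: { pathParameters: { pageId: string } }) {
  const { pageId } = event.pathParameters;
  const key = `page:${pageId}`;

  // Serve from the cache when possible; only render on a miss.
  const cached = await cache.get(key);
  if (cached) {
    return { statusCode: 200, headers: { "Content-Type": "text/html" }, body: cached };
  }

  const body = await renderPage(pageId);
  await cache.set(key, body, "EX", TTL_SECONDS);
  return { statusCode: 200, headers: { "Content-Type": "text/html" }, body };
}
```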

Then, once the peak’s over, it doesn’t cost you any more. There’s no downsizing. You’re just paying for what you need. When you get further upstream to the actual datastores, this is when it gets harder, because databases aren’t quite as elastic. They don’t scale quite as quickly as compute does. Things like Dynamo for Amazon, if you throw 10 times the number of requests per second that you’ve scaled it to, it’s not going to be happy at all. What we do is we put a service gateway in the middle.

The job of this is to, again, cache like crazy. Because a lot of the time, you can serve the same content to different people. Recommendation engines and the user activity for things like continue watching, and other examples, you may not be able to scale them, but you can probably cache a reasonable amount. If necessary, do graceful degradation. One final thing to call out is pre-compute. For our recommendations, we cannot run those ML models quickly enough for when a user visits the page, so we pre-compute them. iPlayer pre-computes 30 million people’s recommendations several times a day. They’re probably not all going to turn up in one day, but they’re all there sitting ready in a database. If they do, they’re ready to go.
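
The pre-compute idea can be sketched as a batch job that scores recommendations ahead of time and writes them to a fast key-value store, so serving a page is just a read. The DynamoDB table name and the scoreRecommendations stand-in below are hypothetical.

```typescript
// Sketch of pre-compute: recommendations are generated ahead of time in a batch
// job and written to a fast key-value store, so a page view is just a lookup.
// Table name, model call, and batching are illustrative assumptions.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const db = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "precomputed-recommendations"; // hypothetical table name

// Stand-in for the real ML model, which is far too slow to run per request.
async function scoreRecommendations(userId: string): Promise<string[]> {
  return [`rec-for-${userId}-1`, `rec-for-${userId}-2`, `rec-for-${userId}-3`];
}

export async function precomputeBatch(userIds: string[]): Promise<void> {
  for (const userId of userIds) {
    const items = await scoreRecommendations(userId);
    await db.send(
      new PutCommand({
        TableName: TABLE,
        Item: { userId, items, computedAt: new Date().toISOString() },
      })
    );
  }
}

// At request time the page handler only does a cheap read of the stored row,
// which scales far better than running the model on demand.
```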

Ian Arundale: You mentioned Lambda there. We use Lambda at huge scale. In fact, just in one AWS account that serves the traffic for news and sport, we have seen more than 27 billion Lambda invocations in just a single month. We’re seeing this number increase month on month. We are huge advocates of serverless compute.

Matthew Clark: It’s not just the scaling that’s wonderful. The speed is wonderful as well. This is a good example. We are the second fastest news website in the world. Anyone from NDR here, the German news website? Your site is fast, but we’re coming for you. Here’s another great example of how we scale. Different kind of scaling this one. On the election night, not the 2019 one that we started the talk with, but last year’s general election in the UK, last July, we sent a camera to all 373 places where the counts happen. You may know how these counts happen.

Every area has one in sports center halls or town halls, that kind of thing. We sent a camera to each one. How did we do it? What we did was we found 373 volunteers like this woman here. We sent them a package, that black bit there. Inside there was a tripod and an iPhone with a 5G SIM card in it, all preloaded with an app ready to go that can do live streaming. All being well, if there was enough signal, you get to do a live stream. We got them for all but 4, 369 worked, 4 didn’t. That was pretty good, given that no broadcaster has ever gone that far before with broadcasting the live count.

How does that scale? Pretty straightforward, again, thanks to the cloud. We have our phones acting as cameras; they send it over 5G, or sometimes Wi-Fi if you got lucky at the venue, across to a receiver in the cloud. We have our own routing solution for handling this. If you’re into video technology, we used the GStreamer library, which is a wonderfully flexible live video library with which we can then add graphics, do monitoring, and distribute it elsewhere. We can send it to our transcoding and packaging solution, which you might recall from the previous slide, which then goes to a CDN and out to users. That stream can then be watched by millions of people, should they want to. I’m not sure millions would want to watch one of those, but it’s there if you did. This isn’t a talk on video streaming, but if you’re interested, we did manage to get 1080p at 10 megabits a second working over 5G.

A couple of proprietary formats, but RTMP still very much in the mix, if you know your streaming protocols. All the rest of it was running on the cloud. The wonderful thing about that is you can scale up to 373 of them, no problem at all. Might cost you a little bit, but 24 hours later, scale it down, gone away. Another kind of scaling, of the broadcast variety. That, I think, sums up where we got to with elasticity. We talked about caching, serverless, pre-computing those recommendations, and really making use of that elastic cloud for generating content.

2. Resilience

Ian Arundale: Let’s start to talk a little bit about resilience, because when people come to the BBC, they expect us to be there. No loading spinners, no 500 errors. Who remembers this? I imagine quite a few of us. This was the PM’s first address on COVID-19, and we streamed it live on iPlayer. Unsurprisingly, the whole UK decided to turn up, presumably to check if they were still allowed to go outside and buy toilet roll. That aside, some of those users actually saw this, which is a big problem. For the BBC, when we get it wrong, people notice. We end up in the news, and it’s really bad for our reputation. Matt’s talked a bit about elasticity, but what do you do when you can’t scale quickly enough? When you end up in a situation like this, where there’s a system that essentially is broken or struggling to keep up with demand, that’s exactly the situation we were in.

Matthew Clark: Here is a very high-level architecture of our site. The next one, speed is going to give us a problem.

Ian Arundale: This is the situation we were in, absolutely. The root cause was quite deep within our technical ecosystem. It was account services and personalization services that essentially experienced a slowdown. As a result of that, we saw cascading failures throughout the stack, which eventually made their way to the user. When you end up in a situation like this, there are three key takeaways.

The first one is that in distributed systems, the concept of backpressure and time budgets is incredibly important. When a system is struggling, it needs to have the ability to put its hand up and say, to the other systems that are consumers, you need to back off. Or those consuming systems need to be able to use things like timeouts to detect when a system is slowing down and then take that system off the critical request path.
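
A minimal sketch of that time-budget idea, assuming a fetch-based caller: give the dependency a fixed budget, and fall back rather than wait when it can’t answer in time. The URL, budget, and fallback in the usage comment are illustrative.

```typescript
// Sketch of a "time budget" on a dependency call: if the upstream doesn't answer
// inside its budget, we abort and degrade rather than letting the slowness cascade.
async function fetchWithBudget<T>(url: string, budgetMs: number, fallback: T): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) return fallback; // e.g. a 503 from a struggling upstream
    return (await res.json()) as T;
  } catch {
    // Timed out or failed: drop this dependency off the critical path for now.
    return fallback;
  } finally {
    clearTimeout(timer);
  }
}

// Usage (hypothetical endpoint): personalization is nice to have, so give it a
// tight budget and an empty fallback.
// const recs = await fetchWithBudget("https://example.internal/recommendations", 250, []);
```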

The second thing that we learned was feature toggles can be incredibly useful during operational events. Having the ability to dynamically change the product behavior on the fly is incredibly useful. I mentioned account services. We were able to go into some of our audience platforms and turn off mandatory sign-in, which was great. The problem that we had is that we didn’t have consistent feature toggles across web, apps, and TV. If you’re building for multiple devices, that’s something to keep in mind.
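
A toggle check of that kind might look like the sketch below: toggles fetched from a central config endpoint, cached briefly, with a safe default if the config service itself is unreachable. The endpoint, refresh interval, and toggle name are assumptions, not the BBC’s toggle platform.

```typescript
// Sketch of a central feature-toggle check, of the kind used to turn off
// mandatory sign-in during an incident. Endpoint and names are hypothetical.
type Toggles = Record<string, boolean>;

let cached: Toggles = {};
let lastFetch = 0;
const REFRESH_MS = 30_000;

async function getToggles(): Promise<Toggles> {
  if (Date.now() - lastFetch > REFRESH_MS) {
    try {
      const res = await fetch("https://config.example.internal/toggles.json");
      cached = (await res.json()) as Toggles;
      lastFetch = Date.now();
    } catch {
      // Keep the last known toggles if the config service is unreachable.
    }
  }
  return cached;
}

export async function isEnabled(name: string, defaultValue = true): Promise<boolean> {
  const toggles = await getToggles();
  return toggles[name] ?? defaultValue;
}

// e.g. (hypothetical toggle name and helper):
// if (!(await isEnabled("mandatory-sign-in"))) { skipSignInWall(); }
```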

Matthew Clark: It’s interesting. The first one’s about automatic detection of unhealthy systems and automatically doing something about it. The second ones are more manual, having switches.

Ian Arundale: It’s always good to have a manual lever when needed, 100%. The third thing that we learned was about end-to-end observability. I’m sure a lot of you will be feeling the pain of this. When you’ve got a bunch of distributed systems, it’s really hard to keep track of all the moving parts. In this example, we had to trace from client telemetry through more than 100 microservices to try and figure out and pinpoint where the problem was. Part of the problem we had is that our teams were using different dashboards and different tools, which just added another layer of complexity. What did we do about this? We decided to take a platform approach.

All of these learnings, we wanted to bake them directly into the platform. We started with the service gateway, which is a component that Matt mentioned before. It includes a public cache and a private cache, but the third part of the equation is circuit breakers. If you build on the BBC’s platform, the service gateway is the standard way of fetching content from external systems. Which means that if an upstream system breaks, the service gateway will be able to kick in and mask those errors.
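
In the spirit of that description, here is a minimal circuit-breaker sketch: after repeated failures the circuit opens and a fallback (for example, the gateway’s cached copy) is served instead of calling the broken upstream. Thresholds, timings, and the helper names in the usage comment are illustrative assumptions.

```typescript
// Minimal circuit-breaker sketch: after maxFailures consecutive failures the
// circuit opens for resetMs, and the fallback (e.g. a cached response) is served
// instead of calling the broken upstream. Thresholds are illustrative.
class CircuitBreaker<T> {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly call: () => Promise<T>,
    private readonly maxFailures = 5,
    private readonly resetMs = 30_000
  ) {}

  private isOpen(): boolean {
    return this.failures >= this.maxFailures && Date.now() - this.openedAt < this.resetMs;
  }

  async fire(fallback: () => T): Promise<T> {
    if (this.isOpen()) return fallback(); // upstream considered broken: mask it
    try {
      const result = await this.call();
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      return fallback(); // e.g. the last good response from the gateway's cache
    }
  }
}

// Usage with hypothetical helpers:
// const breaker = new CircuitBreaker(() => fetchUpstreamJson("/menus"), 5, 30_000);
// const menus = await breaker.fire(() => cachedCopyOf("/menus"));
```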

The second thing that we did is we started to build feature toggles directly into the platform. We’ve still got feature toggles on TV, web, and apps, those individual toggles in those spaces, but now we’ve got centralized toggles within the platform itself, which can essentially apply to the whole audience experience if needed. The third thing that we did is we moved to a centralized monitoring and alerting platform, which we can actually see on here. This is the service gateway actually in action. You can see on the left-hand side, we’ve got a bunch of errors from upstream systems shooting up. It’s actually timeouts, not errors. The system that we’re calling is just not responding. You can see on the right-hand side, the platform is still happily serving HTTP 200 to clients. That’s because the service gateway is kicking in with cached responses.

Matthew Clark: Timeouts are the worst. If you get a 500 response, it’s quite easy to handle that, but exhausting the number of connections you have with never responding connections, that’s awkward. This is protecting all the upstream systems.

Ian Arundale: This accounts for all of those use cases too.

Matthew Clark: These tricks, circuit breakers, feature toggles, and common monitoring really become a lot easier if you’re sharing things, rather than having to implement them multiple times. Let me look at an architecture principle we try and follow to make this a bit easier. The BBC has lots of inputs. For us, it’s all our content types. We have TV programs, podcasts, news articles, big long list of content. Nice variety of stuff there. On the right, we have a set of outputs, reaching all our different users. We have websites, we have apps, and specialist things such as for international audiences and younger audiences, that kind of thing. Your architecture, perhaps not the same boxes, but the same principle, because you’ve got a nice breadth of input.

The more input, the more cool stuff you can do. You probably have a nice breadth of output to try and reach more users, more customers. The tricky bit is how you join them together. Because it’s so easy to go, we need one of these things over here to appear in one of those things over there. Same with another one of these things to go over there, and another one to go there. Of course, before you know it, you’re into spaghetti architecture territory. Lots of integrations, expensive to change things. The nicer model, of course, is the hourglass design. Do you see the hourglass? There’s an hourglass there, or cocktail glass, if you prefer, where you have consistency in the middle. You’re decoupling all those inputs from the output. That brings a bunch of benefits. You’ve reduced the number of integrations, all those red lines we just saw. You’re ensuring that all of your lovely inputs on the left can be used by all of your lovely outputs on the right.

Of course, you provide that consistency on which to do some of those things Ian was talking about to help your resilience. For example, where does the circuit breaker go in this diagram? Very simple. In the middle, protecting all those outputs on the right from any one failure from something on the left. Precisely how you implement this will depend. You’ve obviously got to make sure you don’t make it a bottleneck or a single point of failure. Even if it’s just standardization, that shaped architecture is what we aspire to.

We have hundreds of systems at the BBC. Keeping track of them can be hard when you’re in a crisis moment, when stuff isn’t working. Understanding the bits and their relationships and which bits are not healthy is really key. We’ve got lots of different systems, tried lots of different ways of doing that. One way we found just to keep track of things that’s really simple is to just have a YAML file for every system you’ve got, describing some basic things about it. Here’s a simple example in which you can see it’s got an ID and a name. It’s got an owning team. It’s actually snipped off. There’s actually a lot more we put on here, such as contact details, API information, documentation, all those bits. You can put whatever you like in here. Perhaps the most interesting bit is our bottom one, dependsOn. That’s the relationship between the different systems. Once you have that, and we have one GitHub repo with several hundred of these YAML files matching our different systems, you can then write some scripts to do something with that.
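
One such script might look like the sketch below: read every system’s YAML file from the catalogue repo and build a dependency graph from the dependsOn field. The field names follow the description above; the directory layout and the js-yaml library are assumed choices.

```typescript
// Sketch of a catalogue script: read every system's YAML file and build a
// dependency graph from the dependsOn field. Layout and library are assumptions.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { load } from "js-yaml";

interface SystemRecord {
  id: string;
  name: string;
  owningTeam?: string;
  dependsOn?: string[];
}

export function buildDependencyGraph(catalogueDir: string): Map<string, string[]> {
  const graph = new Map<string, string[]>();
  for (const file of readdirSync(catalogueDir)) {
    if (!file.endsWith(".yaml") && !file.endsWith(".yml")) continue;
    const record = load(readFileSync(join(catalogueDir, file), "utf8")) as SystemRecord;
    graph.set(record.id, record.dependsOn ?? []);
  }
  return graph;
}

// The resulting map (system id -> ids it depends on) can be fed to a graphing
// library to draw the estate, or walked during an incident to find what sits
// upstream of an unhealthy component.
```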

For example, you can take a graphing library. We’ve used maxGraph, which is a nice TypeScript one. There are others, of course. You can make automatic diagrams of the relationships between things. Those of us who’ve done architecture for years know that all the automated architecture diagrams out there are rubbish. It is a real skill to make a good architecture diagram; that is still very much done by hand. I do think there’s a place sometimes for automated diagrams, if nothing else because they’re interactive. You can click on these and zoom in if you like and see the components within them, but also information about them, their health state, their owners, their documentation.

Once you’ve got all this, you can begin to do things like the web interface we’ve made, effectively a directory of all the systems. If you’re eagle-eyed, you can see some colors in there. That’s the health of those systems. You can build this as a destination to understand your estate. Brilliant in general, just for organizing yourselves and making sure you’re maximizing reuse, but particularly good for resilience, for keeping track of things during key moments. My favorite bit of this tool is when you ask it to go, since you know all of our systems and all our dependencies, can you draw it all out? What do you think of the BBC’s architecture?

Ian Arundale: It’s beautiful. The one thing I will say is it doesn’t look like an hourglass.

Matthew Clark: The hourglass slide’s looking a bit daft now. Reality is never the same as the theory. To be fair, you can abstract it away and it does look a bit better.

We’ve done resilience. Ian’s talked about circuit breakers and feature toggles and how shared platforms make it easier to implement those things. We looked at the hourglass architecture as a way of making that more of a reality, and we briefly looked at how you can do architecture as code quite simply, as a way of modeling your estate.

3. Security

Ian Arundale: Let’s move on to security. Security is our number one priority at the BBC. The reality is, we are facing threats from multiple angles, whether that’s nation states and governments who want to discredit us for our news output, hacker organizations who want to interfere with public service, or just script kiddies who are basically bored at the weekend and have got nothing better to do. We’re a target for all of these people.

Once again, similar to resilience, when we get it wrong, we end up in the news. The stakes are quite high for us. Reflecting back to 2019, I think it’s fair to say we had a disconnect between our values and our behaviors. We were saying that security was important, but, in reality, it was being left until the very end of the software development lifecycle. This is just as much a cultural issue as it is a technical issue. We think that the role of architecture has a huge part to play in bridging that gap, and ensuring that security happens earlier in the software development process. That’s sometimes referred to as shifting left. Just like security can’t just be InfoSec’s job, it can’t just be the architect’s job either. We need to move responsibility where it belongs, which should be with the teams themselves. We achieve this through threat modelling, which is essentially a way of bringing everybody together, technical and non-technical, and asking three simple questions. What are you building? What can go wrong? What are you going to do about it?

If we take those in order, what you’ll notice is each of these questions has an activity that goes with it, and a specific outcome that we want to achieve. Asking the question, what are you building, is really about getting everybody together, running a workshop, and trying to gain that shared understanding and shared context. What is your system? Do you understand your dependencies? What are the data flows? Personally, I can say I’ve learned loads from just basically getting a team into a room and going, what are we building? Because it brings out all of the assumptions and all of the misunderstandings about what we think we’ve built compared to what’s actually in production. The output should be a technical design. You want a diagram that outlines the whole context, and, importantly, that needs to be owned by the team.

The second question, what can go wrong? You’ve got the context. This is the fun bit. You get to act like an attacker now. You can think about, what in my system do I need to protect? If I wanted to attack it, how would I go about doing it? Even if you’re not a security expert or know how to attack systems, this is the perfect time to learn. You’ve got everybody in the room. You’ve got InfoSec in there with you. Great chance to learn more about that. The way that we support this is we use a structured framework called STRIDE, which essentially takes the whole world of security and groups it into six key buckets.
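
For reference, STRIDE’s six buckets are Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, and Elevation of privilege. Below is a sketch of how threats found in a session might be captured in a structured way; the record shape, risk scale, and example component are illustrative, not a BBC template.

```typescript
// Sketch of a structured record for threats found in a STRIDE session.
// The six categories are STRIDE's standard buckets; everything else is illustrative.
type StrideCategory =
  | "Spoofing"
  | "Tampering"
  | "Repudiation"
  | "Information disclosure"
  | "Denial of service"
  | "Elevation of privilege";

interface Threat {
  component: string;          // the box or data flow on the technical design
  category: StrideCategory;
  description: string;        // how an attacker could exploit it
  likelihood: "low" | "medium" | "high";
  impact: "low" | "medium" | "high";
  decision: "mitigate" | "accept" | "investigate";
}

// Hypothetical example entry, captured on top of the diagram from question one.
export const exampleThreat: Threat = {
  component: "results-ingest API",
  category: "Tampering",
  description: "Unauthenticated caller could inject fake results data",
  likelihood: "medium",
  impact: "high",
  decision: "mitigate",
};
```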

Then we work together across the whole technical design looking for these six specific areas. We capture them on top of the diagram that we already produced from question one. Which then brings us on to the third question, what are you going to do about it? I can’t emphasize enough how much this needs to be a multidisciplinary approach. You need to get everybody in the room, product, engineering, and InfoSec. It can’t just be InfoSec in the corner just saying no. It needs to be a conversation about risk tolerance, a conversation about likelihood, and a conversation about business impact.

The result of this should be changes to your design. Right at this stage is the cheapest point to make changes. If you’ve got a system that’s already built, you want to get those issues into the backlog so that the teams are working on them, or the business accepts the risks. Looking across each of these activities and outputs, hopefully you’ll have clocked a theme here: we’re using the design process, through the activities, to drive collaboration. Through the outputs, we’re shifting that responsibility for security back onto the teams, but doing it in a structured and supported way. What does this achieve? What we’ve done is we’ve shifted security left. We’ve brought it earlier into the development lifecycle, into the design stage. This isn’t the only area where we do it. We also do it throughout the other stages, too.

We build static analysis tools and dependency checks into our development processes, and we plug them into our platforms and shared CI and CD pipelines. We also do dynamic testing in live as well. We’ll do penetration testing. We’ve got active scanning tools that are basically checking to see that our outer shell, our public-facing APIs, are absolutely rock solid. There have actually been times where we’ve brought down the BBC website as a result of running these tools. Our perspective is that it’s better we find those vulnerabilities and get to fix them, even if that causes an outage, than the bad guys finding them first. Security is a process. It’s not a destination. It never stops. It can be quite daunting as well. If you’re not sure where to get started, these are two incredible resources, OWASP and the National Cyber Security Centre. If you want to learn more, these are the places to go and look.

Matthew Clark: Including the STRIDE framework on them.

Ian Arundale: If you want to learn about STRIDE, shifting left, threat modeling, you’ll find it all in there.

Just to wrap up on security, we’ve talked about using threat models to shift security left and influence your architecture at the design stage. We’ve talked about embedding tools into platforms. If you’re not sure where to start, take a look at those resources.

Do Architects Have NFRs Too?

Matthew Clark: Three NFRs done. I’ve just had an idea. We’ve looked at NFRs for our systems, but do we have NFRs as people, as technology leaders, as architects? For example, how’s your elasticity?

Ian Arundale: It’s not as good as that, I can tell you that.

Matthew Clark: Not as good as it is. Come on. Wrong word, maybe. Maybe adaptability, another word.

Ian Arundale: I can see where you’re going with that.

Matthew Clark: Very useful.

Ian Arundale: If only we had a list of such things.

Matthew Clark: If only we had a list of such things? I’m not sure the list of NFRs that we had before quite applies to people. As architects, we may use these NFRs and come up with the best architecture, but it’s all for nothing if we can’t persuade others to make it happen. These non-functionals, these softer skills, are absolutely critical to our role. Take, for example, architecture recruitment, which I do a fair amount of. I get emails through from recruiters. Here’s an example, a genuine one I’ve got.

Dear Matthew, here’s some stuff. Look at the technical skills, Agile, Java, Linux, Oracle. Wonderful. You could call them the functional requirements. I might need somebody who knows Oracle. Great. Really, beyond knowing they’re a great technologist, which is very important, I know they’re probably going to be able to pick one of these up. They may not be the best expert in the world, but as a great technologist you’re probably able to pick up these skills over time. What’s more interesting to me, can you persuade people? Can you explain complicated concepts to stakeholders? Can you inspire teams to make things happen? They’re the skills that I value the most in architects, because they’re the ones that really make a difference.

Let’s give this a go. Here’s a list of 10 non-functional requirements for a leader. Leader, architect, are we leaders as architects? Absolutely we are. We may not have that in our job title, but if we are not able to lead, we are not able to take people on the journey with us and make a real difference with our architecture. Here’s 10 non-functional requirements. I’ve taken them from a list from Harvard Business Review of what they believe to be the 10 most important leadership skills for the future.

I’ve rephrased them a little bit to make them sound more non-functional requirement-y, but you get the idea. Adaptability. Can you adapt to the changing world, changing technology? Influencing. Can you influence others, be it stakeholders, to take a different route? Inspiring. Can you inspire teams to go try a different technology, take a different route, make a different thing? Inclusivity. Can you knock down that ivory tower that architecture can sometimes be, to make sure that you’re getting everyone involved in the architecture, making everyone feel they own it. You can probably think of some others as well. Just like with our systems and our NFRs, we want them all, but some might be more valuable to your situation than others. I think it’s the same here, too.

Ian Arundale: It’s quite an interesting observation, actually, because just like everyone’s architecture is different, as an architect, we’re all different as well.

Matthew Clark: We’re all different as well. Our roles will be different. We work for different organizations. We have different positions in them. Maybe different cultures within them. Which ones are most important to your role will depend on that. I think it’s worth the exercise. Which ones are most important to your role? Then, what might you do? What tools and techniques do you have to make that a reality? To give an example of a tool and technique, an architecture canvas. We do this for some projects. It’s a common thing to do. Get a big piece of paper and, collectively as a team, work out things like, what is the business value we’re really delivering? What is the context of why we’re here? As Simon Brown has said in his books, it’s the context behind decisions that is more important than the technology decisions themselves.

Doing this exercise ticks off several of those NFRs. It’s influencing, by helping people see the big picture and therefore become more likely to adopt where you want to take them. It’s inclusive, because you can work together on completing a canvas like this. Of course, it’s great for communication, as you can show this to stakeholders and new starters. Techniques such as this one, I think, mean that you don’t have to be the world’s most experienced leader to have the leadership skills to make your architecture a reality.

What Did Happen at the 2019 UK Election?

Do you remember the story from the start? 2019, UK general election, 10 p.m., and the results have not appeared. Very embarrassing. Can’t repeat the day. What do we do? Of course, a lot of the engineers start opening up the code, trying to work out if there’s some if statement that’s doing the wrong thing. As an architect, I was able to step back, look at the capabilities. I remembered we had another project where I happened to have built a tool that let you access the logs. I realized it could be used in this case as well. Voila, we have some logs. A bit of analysis on that. Cannot connect with pre-production environment. Our production system was trying to call the test database, which had nothing in it. I think we’re lucky it had nothing in it. If it had had a different result in it, we might have been in even more trouble.

Ian Arundale: It could have been worse.

Matthew Clark: We have very much learned that lesson. In that case, we redeployed, and happy days. Results pop in. They were saved, just about. We learned a lot of lessons that night. 2024, the next general election, of course, last year, things were a lot smoother.

Ian Arundale: It’s fair to say that 2019 certainly kept us on our toes. By ’24, we were in a very different place. By baking elasticity, security, and resilience into our platform, we’d built an architecture that we could rely on, ready for a general election or whatever major event comes next.

Recap

Matthew Clark: Every architecture is different. Ours will be different to yours. That’s one of the wonderful things. Everything’s a snowflake. Great chance to go and explore and create new architectures. One way to help with that is to consider which are your most important NFRs. As Ian’s just said, we had three that we called out today: elasticity, resilience, and security. We quickly delved into a few bits from there. We looked at, for example, serverless. We looked at circuit breakers. We looked at shifting left, and getting those threat models in to influence your architecture. All that is for nothing if you cannot then influence those around you to make it a reality. Finally, we posed that question. Do we have NFRs? What are the NFRs of your role? Which ones are the most important? That perhaps is a question we should think about, because it’s those skills of making things happen that, we find, are the most important ones to be an awesome architect.

Questions and Answers

Participant 1: Just perhaps trivializing a little bit from the language that you used, on elasticity, you mentioned this isn’t rocket science. Is it a case that these days, elasticity, given the kind of cloud architectures we have, is a relatively easy challenge, whereas resilience and security are growing challenges? Do you think that’s a fair observation, one easy, two hard?

Matthew Clark: Not quite. I think you’re right. If you do it right with the cloud’s help, elasticity can be great, but there are always limits. As you get further into the stack, there are points where you just can’t scale quick enough. Depending on what it is, if you can cache things a lot, if you can pre-compute a lot of things, I think that’s true. If you want to do anything a bit more interesting, you’re always going to hit some scaling problem at some place. You might have just moved it further into your architecture.

Ian Arundale: It’s probably that classic thing where the 80% is easy, and it’s the 20% that makes elasticity difficult. I completely take your point. Resilience and security, it feels like they are becoming increasingly important across the board.

Participant 2: You mentioned earlier that you implemented systems that had backoffs and also ways to remove a slow service from the critical path. I was wondering if you could explain maybe an example of how that works in your systems.

Ian Arundale: I think there’s probably two key ways to do that. The first is the producing system changing its status code. I mentioned basically having a way to signal to the system that it’s starting to fail, so changing your HTTP status code is one way to do that. The other way is using the time budget. Essentially using timeouts in the calling system to keep track of how quickly its dependencies are responding. When you can detect there’s a bit of drift or that a system is starting to slow down, you can then start to pivot the behavior within that system.

Matthew Clark: Hard to get right, though. It’s very easy to turn a small problem into a big problem. Something like exponential backoff, where you start going, I’ll just give it an increasingly long amount of time before you recover, can be quite hard to get that right in practice, we find.
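
For reference, a common shape for that pattern is exponential backoff with jitter and hard caps, sketched below; the attempt counts and delays are illustrative, and as noted above the tuning is the hard part.

```typescript
// Sketch of exponential backoff with jitter. Without a ceiling and a retry
// limit, a retry storm can turn a small problem into a big one.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 100,
  maxDelayMs = 2_000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Exponential delay, capped, with random jitter so callers don't retry in lockstep.
      const delay = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt) * Math.random();
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```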

Participant 3: Presumably between the COVID 2020 broadcast and then 2024 election, you fixed all the problems and went in. How long did it take to change the architecture from what it used to be to what it is now? How did you test it? How did you test for scale? How do you simulate that kind of activity?

Ian Arundale: In terms of the specific issues that we had during COVID, we had to respond super quickly, because we knew that once we’d failed on that first, essentially, PM announcement, there were going to be more. There were only going to be more of these coming. We had to react really quickly. We made some tactical decisions around changing the way that the routing works. Going back to that point about how you get a system to back off, we started to implement essentially backup origins at the CDN level, so that when we didn’t get responses quickly enough, we used the CDN to switch to a static version of the application. That’s a long way of saying we fixed the problem there and then very quickly. Then there was this more strategic view that that can’t be the problem solved forever. How do we take a more holistic view across the estate? I’m not going to say that we’ve solved all the problems, but we’ve definitely made big strides in that direction.
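
The backup-origin idea can be sketched as a race between the dynamic origin and a time budget, falling back to a pre-published static copy when the origin can’t keep up. Real CDNs express this as origin-failover configuration; the URLs and budget below are hypothetical.

```typescript
// Sketch of the backup-origin idea: try the dynamic origin within a time budget,
// and fall back to a static copy when it can't keep up. URLs are hypothetical.
async function fetchWithStaticFallback(path: string, budgetMs = 1_000): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    const dynamic = await fetch(`https://origin.example.internal${path}`, {
      signal: controller.signal,
    });
    if (dynamic.ok) return dynamic;
  } catch {
    // Timed out or errored: fall through to the static copy.
  } finally {
    clearTimeout(timer);
  }
  // Static, CDN-cacheable version of the page, published ahead of the event.
  return fetch(`https://static-backup.example.internal${path}`);
}
```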

Participant 4: You mentioned, coming back to this elasticity question, that you guys are all on AWS, and this is great, but it does come with a very hefty price tag. How do you land on this decision of build versus buy when it comes to the cloud?

Matthew Clark: You’re right. We do make heavy use of AWS. We do quite a lot of cost optimization. We have a build first, optimize later philosophy, because it’s very hard at first to know just how much something’s going to cost. It’s very hard to predict. We’ll do the optimization afterwards. We have found that serverless costs on balance don’t work out more expensive than just having virtual machines. Because although the cost per millisecond or second or whatever is more, you’re not paying for loads of compute that’s just sitting there at 10% CPU load, waiting for extra traffic to come.

Ian Arundale: The answer to how we test is we essentially run a mock election before election night. We’ll scale up all of our systems and basically do a huge load test. Not just of the audience-facing services, we’ll scale up the whole environment, all of the backend APIs. We do it across web, mobile, and TV. We do the whole shebang. Hopefully whatever happens on the night, we’ve already practiced a few weeks in advance.

 
