Transcript
Kevin Bowman: This is the London that you’ll hopefully recognize. Think about this for a moment. What would it take for all of the lights in that picture to go out? The UK grid powers all of those lights, and it supplies, on average, 30 gigawatts of power. Average over the whole of last year. Around 15% of that is used by London. Averaging over the last year, 40% of the grid is powered by renewable energy sources. If you take those away, London would go dark. The biggest renewable, wind, is highly unpredictable, second by second. Why doesn’t London go dark hundreds of times each day whenever the wind drops?
Kevin Bowman: I’m Kevin. I work at Kraken Technology, part of Octopus Energy, in a section which makes technology to prevent exactly that kind of scenario from happening. We can’t control the wind, and we can’t control clouds crossing in front of the sun, so we need some supporting technology to make sure the grid stays stable and keeps energy flowing to where it’s needed. This has to be reliable, and it has to be scalable. Having enough energy on the grid, not too little, and also, interestingly, not too much, is really not something to be taken lightly. However, we’re a fairly young company, and my bit of Kraken is still quite small, not what you’d think of as a traditional massive energy conglomerate. We need to stand on the shoulders of giants and avoid reinventing things just for the sake of reinventing them.
For us, that means using cloud technologies, but not only that, we use serverless managed services in the cloud and focus on what value we’re adding on top of those services that we’re using. In this session, I’ll go through a quick introduction to energy markets and the business challenges that we’re working with. I’ll talk about how we address those challenges architecturally and how each architectural bit addresses one of those bits of the energy market challenge, and also how we make best use of what’s available with our little bit of magic on top to play our part in keeping the grid online.
How Do We Fix Renewables Like Wind and Solar?
Let’s start with, how do we fix renewables like wind and solar? Wind speed gusts up and down. You never know exactly what’s going to happen, which might be similar to the problem people have when a flood of traffic hits a website after a big marketing message goes out. We have challenges where wind goes up and down all the time. We can’t predict to the millisecond when the wind will drop or when a gust will happen, so what we actually need to do is somehow smooth out those kinds of peaks and troughs. Battery storage is crucial to this. It’s advanced in leaps and bounds over the last 20 years, but batteries alone are not enough to solve this problem. How does a battery, for example, know when it should be charging or discharging? Or sometimes, importantly, neither charging nor discharging, just holding idle because it’s going to be needed in some amount of time. The grid is AC, alternating current. The voltage goes up and down in the UK 50 times a second. That’s fundamentally because of how transformers work, how steam-driven power stations work.
One hundred years ago or more, that’s all there was. That’s how the grid worked. That’s how it was invented. That’s how it still works today. Batteries and solar farms are DC, direct current. They have a very flat-line voltage or at least not a controlled 50 hertz up and down. To convert between these, there’s a thing in the middle called an inverter, which also acts as a rectifier. That is what does the conversion between AC and DC. Because of this, you can’t just somehow hook a battery onto the grid and hope that it’ll act as some magical reservoir all on its own. There has to be some kind of external system which is telling that inverter when to take energy out of the battery and when to put it back in. The first thing we need in our architectural picture is some kind of control system capable of telling the inverter what direction energy should be flowing between the battery and the grid at any given time.
On an instantaneous basis, the grid cares about power, typically measured in watts, or megawatts in this case. Interesting contrast to Jinsong’s talk where he was talking about embedded devices which are dealing in microwatts. This stuff is megawatts, a trillion times bigger, I think, if my math is right. This measurement is why you’ll often see new battery sites reported as being something like 50 megawatts or 100 megawatts. The most recent one to go online was 200 megawatts up in Scotland. We think of batteries as being about energy storage, which is measured in watt-hours, or megawatt-hours at this scale. It’s a unit of capacity. Think of that as the answer to the question, how long could this battery keep supplying 10 megawatts for?
If it’s, for example, a 75-megawatt-hour battery, the answer to that would be 7.5 hours. We need to tell the battery and the inverter what to do now, but we also need to be planning ahead to know what we can do in the next few hours, and that means we need to know how much energy the battery has remaining as well as how much it’s giving out now. We need to know the energy. We need to know the power. We also need to know how much can be added to it if it’s running down. We need some kind of battery telemetry in our architectural picture. We need regular reports of various metrics about the state of charge, how much energy is in the battery, how much power is flowing in and out of the battery. Bear in mind that’s not always the same as what we’ve told it to do. There can be other environmental conditions which might be changing what the battery is actually doing compared with what we told it to do. We’ll talk about that a little bit more.
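To make the power-versus-energy distinction concrete, here is a tiny illustrative Python snippet (not production code): it simply divides capacity in megawatt-hours by output in megawatts to get how long a battery can sustain that output.

```python
# Illustration only: capacity (MWh) divided by output (MW) gives the number
# of hours the battery can sustain that output.

def hours_at_power(capacity_mwh: float, power_mw: float) -> float:
    """How long a battery of `capacity_mwh` can keep supplying `power_mw`."""
    return capacity_mwh / power_mw

print(hours_at_power(75, 10))    # 7.5 hours, as in the example above
print(hours_at_power(400, 200))  # Blackhillock: 2 hours at full power
```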
At this point, we’ve got a system which can theoretically control and monitor a battery. That’s enough for a human to manually manage a battery based on what they think the grid is doing, but really, we want some kind of autopilot in the middle. We want humans to be able to state their goals or their intent and then let some magical algorithm figure out the low-level details about what actually needs to be happening with the battery. To understand this a bit more, you’ve got to understand a little bit about the energy market. Energy is traded just like any other commodity, so the goals of this kind of autopilot are generally either make the most money if you’re buying and selling energy or keep the lights on if you’re like a regulator or if you’re operating an energy system, or if you’re an end consumer, you really want the lights to stay on.
The Energy Markets
On the face of it, energy markets are a very traditional supply and demand market. In the UK, each day is split into 48 different half-hour slots, and each of these is called a settlement window. Typically, up to about two years in advance, various electricity generators will offer some energy into an auction and energy companies like Octopus, for example, will bid to buy as much energy as they think they’re going to need in that 30-minute window whenever in the future that is. They’ll offer to buy that from various generators.
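As an aside, working out which settlement window a given moment falls into is simple arithmetic. Here is a minimal sketch in Python, assuming a plain day of 48 half-hour periods starting at midnight; real UK settlement-period numbering also has to handle clock-change days, which this ignores.

```python
from datetime import datetime

def settlement_period(ts: datetime) -> int:
    """Return the 1-48 settlement period a timestamp falls into, assuming the
    day is split into 48 half-hour windows starting at 00:00 local time."""
    minutes_since_midnight = ts.hour * 60 + ts.minute
    return minutes_since_midnight // 30 + 1

print(settlement_period(datetime(2025, 1, 30, 9, 0)))  # 19: the 09:00-09:30 window
```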
From a week in advance up to the settlement window itself, there’s some additional trading happening where everyone readjusts their buy and sell amounts based on updated predictions and whatever energy they actually manage to buy and sell in advance. That’s all pretty traditional commodities trading. When the settlement window actually arrives, then typically everything changes and the grid operator, called NESO in the UK, comes into play. No one can know to the millisecond how much energy is going to be needed within any given settlement window.
For example, what if a soccer match is cancelled at the last minute so no one’s opening their beer fridges and suddenly causing a big draw on the grid? What if Eurovision overruns and everyone waits another 20 minutes before going to the kitchen and switching their kettle on? That could take that into another settlement window, into a different period of time, or the instantaneous draw could be at a different point to where you thought it was going to be. To cope with these scenarios, NESO operates a whole series of standby markets. They have a responsibility to keep some amount of capacity on standby, currently typically around 1 to 2 gigawatts, to be called on at little or no notice in case of spikes of demand, and also to have available some places to send energy in case there’s too much flooding onto the grid. Either of these could be batteries or it could be other places like connections to overseas grids. The UK grid has connections to a whole bunch of different grids outside of us, like Norway, or France, or Iceland, or Ireland. This whole mechanism is generally known as the balancing mechanism.
There’s a different kind of energy market, which is for much faster responses to the in-the-moment ups and downs. Load on the grid impacts the AC frequency. It should be exactly 50 hertz in the UK, and again, it’s based on those kinds of spinning steam-driven generators like coal and gas, all being perfectly synchronized and spinning at exactly 50 hertz. When the voltage gets to the top of that waveform, every spinning turbine in the country is at the same point in its rotating cycle, and they should all be rotating in synchrony with each other.
Too much load, and the generators typically struggle to keep up. It’s effectively like backpressure back to the generators, and that means that the frequency starts to drop, and that’s a problem. NESO has a responsibility to keep the grid frequency within 1% of 50 hertz, and how close we are to the edges of that magical 49.5 hertz to 50.5 hertz window is used as a signal around the whole grid of how urgently we need these reserves to kick in. This manifests as a range of different in-the-moment energy markets into which you can, in advance, offer your battery as a standby energy source. NESO will typically pay different amounts depending on how fast you can respond to that grid frequency changing, and then basically charge that back to whichever generator, domestic supplier, or consumer caused the imbalance.
One of them, by definition, predicted wrongly how much they would use or supply within that window, and NESO effectively charges back to whoever got it wrong the cost of operating these markets. These markets are known as the reserve services in comparison to the balancing mechanism, which is the planned in advance mechanism I was talking about earlier.
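To give a feel for how that frequency window acts as a signal, here is a purely illustrative Python sketch of a droop-style response that ramps a battery towards its contracted power as the frequency deviates from 50 hertz. The deadband and ramp values are made up; the real NESO dynamic services each define their own precise response curves.

```python
def frequency_response_mw(freq_hz: float, contracted_mw: float,
                          deadband_hz: float = 0.015, full_at_hz: float = 0.5) -> float:
    """Illustrative droop-style response: discharge when frequency is low,
    charge when it is high, ramping to full contracted power at +/-0.5 Hz.
    (Real grid services define their own curves; this is only a sketch.)"""
    deviation = 50.0 - freq_hz            # positive when the frequency is low
    if abs(deviation) <= deadband_hz:
        return 0.0                        # inside the deadband: do nothing
    fraction = max(-1.0, min(1.0, deviation / full_at_hz))
    return fraction * contracted_mw       # positive = discharge, negative = charge

print(frequency_response_mw(49.6, 50))    # low frequency -> discharge 40 MW
```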
A recent example of when these reserve services were needed was actually earlier this year. An incident happened on the 30th of January. This is a map of the transmission system, which is the super high voltage part of the UK grid, typically 275,000 or 400,000 volts. The UK grid has various connections to overseas grids. This one coming off the bottom is one of the connections over to France. This is called IFA2. That connection tripped unexpectedly, and at the time it was supplying almost exactly 1 gigawatt into the UK grid. When that tripped, we lost a gigawatt unexpectedly off the UK grid. The grid frequency at that point dropped dangerously low. The top graph here shows the frequency, and you can see exactly when the IFA2 trip happened, conveniently almost exactly at 9 a.m.
On the bottom graph, you can see how much power various batteries on the grid were supplying at any given time. You can see when the trip happened, a whole bunch of batteries suddenly, almost instantaneously, jumped up to full power, regardless of what they were doing beforehand. The frequency on the grid was then protected from dropping below that 49.5 hertz ultimate danger zone. Then over the following 10 minutes or so, various other balancing markets and dynamic responses kicked in to bring the grid back up to a comfortable level. What you can also see in this graph, interestingly, is that just before that 9 a.m. trip, the frequency was already starting to drop, so there was clearly more load on the grid than was expected, and the regular balancing mechanism was starting to bring some more batteries online to try and balance the grid generally. Nothing like as much as when the trip actually happened, though.
The Trading Interface, and Markets Connector
All of this is to explain the choices that battery owners are making. They don’t really want to be choosing exactly when a battery is charging or discharging, but they actually want to be trading their battery, their asset, into the various energy markets and then letting us use that information to decide whether to charge or discharge that battery at any given moment. To do this, we need a few more things in our puzzle.
Firstly, we need a trading interface so that the battery owners can tell us their intent. They can tell us basically things like what they’re actually paying for energy in each of the up-and-coming settlement windows. We also need what we call a markets connector. In the UK, this is our connection out to NESO, and this is so we can tell NESO exactly what happened in any given time, typically used for market settlement, and also so NESO can tell us that a battery which was waiting to balance the grid is actually now needed. That all works great for in-the-moment decision-making, but we still don’t have that autopilot piece. We still don’t have the piece that lets us plan ahead, something to replace that pink box in the middle.
For example, if we know that we’re going to need some energy in a battery in a few hours’ time so that we can sell it perhaps at a high price or at a high carbon time for the grid, and we know that it’s going to take 45 minutes to charge the battery by that amount, we should be able to pick the cheapest 45 minutes between now and then to charge the battery. We might also have other constraints which limit or guide us as to when the best time to do that is. The battery, for example, might be traded on other markets over the next few hours, so we need to make sure we can still meet those commitments while also putting the relevant amount of energy into the battery. Or the battery might have, for example, a warranty-based limit on how many times it can be charged and discharged each day. Or there might be some maintenance window planned ahead to take into account.
All of these kinds of goals and constraints get packaged up, crafted, and sent into what we call the optimizer, which is basically a kind of linear optimizer, a traditional SAT-style constraint solver. We craft the constraints: we know what constraints we have over the coming period of time. We ask it for a goal, and the goal might be, how do you make the most money from this? How do you use the least amount of carbon in grid generation terms? We throw all of this into the optimizer. It crunches it all together, and it makes a recommendation on what to do.
There may be several options which are possible, or there may be no solution at all, but the optimizer will tell us within a certain confidence interval what the best thing to do minute by minute is over the next 24 hours. We run the optimizer independently for each battery because they might, for example, be traded into different markets. There might be other differing constraints between them. We rerun the optimizer every time something changes. For example, a circuit breaker might flip, or energy prices might change, or the human might come onto the trading interface and tell us that something about their understanding of the world has changed, so we need to rerun the optimizer. Failing all of that, we rerun it every five minutes anyway, just in case. We run it for a 24-hour window, so we’re always populating 24 hours ahead on a moving window basis.
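To make the optimizer idea concrete, here is a heavily simplified sketch of the “charge in the cheapest slots” problem expressed as a linear program, using scipy as a stand-in solver. The prices, slot limits, and energy target are invented, and the real optimizer handles far more constraints (market commitments, cycle limits, maintenance windows) than this.

```python
# Minimal sketch: pick how much to charge in each upcoming slot so that the
# required energy is bought at the lowest total cost, without exceeding the
# per-slot charge-rate limit. Prices and limits here are made up.
import numpy as np
from scipy.optimize import linprog

prices = np.array([90, 40, 35, 60, 120, 80])  # £/MWh for the next six half-hour slots
max_mwh_per_slot = 25.0                       # charge-rate limit per half hour
energy_needed_mwh = 60.0                      # energy the battery must gain overall

result = linprog(
    c=prices,                                 # minimise total cost of energy bought
    A_eq=np.ones((1, len(prices))),           # all slots together must deliver...
    b_eq=[energy_needed_mwh],                 # ...exactly the required energy
    bounds=[(0, max_mwh_per_slot)] * len(prices),
    method="highs",
)
print(result.x)  # MWh to charge per slot: the cheapest slots get filled first
```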
Infra – AWS
Where do we do all of this? I mentioned earlier that we like to build on top of things which already exist, and for us that means being cloud native. Crucially, we don’t use the cloud as just another data center, but we take advantage of the many different managed native services that are available inside of a cloud environment. AWS are much better at running plain infrastructure than we are, and they invest huge amounts into things like databases and compute platforms. That investment is not just into functionality, but it’s also into the non-functional elements, things like the reliability of services, the security, the data integrity. These are all crucial when it comes to things that can impact the lives of tens of millions of people if they go wrong. What’s really important, though, is to remember that nothing’s ever perfect, and AWS don’t guarantee 100% reliability or perfect security in any of their services, but they tell you the boundaries in which they operate. They are very clear about what you should expect, and that expectation is not perfection.
They tell you, for example, a certain level of reliability, a certain shared responsibility model in their various non-functionals. They also tell you when failures happen, which they will, they tell you how you can best contain the blast radius of the failure by using their various fault isolation boundaries. These could be things like availability zones, it could be regions, it could be different AWS accounts for different services. This information is crucial to us designing our systems, knowing that the layers underneath the systems we’re deploying will not be perfect, but the more we know about the imperfections and the expected failure modes, the more we can account for them and plan ahead for when things go wrong.
If we were just running all of this ourselves in a data center somewhere, even if we were running it on EC2 VMs, we would be spending a huge amount of time and effort on managing things which could already be managed for us, and there’s no need for us to be the ones solving problems into which AWS have probably already invested vastly more than we ever would.
What kinds of things do we use in AWS? We are very aware of Conway’s Law. Conway’s Law describes how a company’s org structure and their architecture are intrinsically linked through the communication channels which are created. Using this knowledge, all of the components that I just mentioned in that high-level architecture earlier, these are all owned by different teams within our company, and they’re all deployed as separate microservices. This is just an example of one of those microservices. Where possible, they are coupled asynchronously, represented here by AWS EventBridge, just a typical queue, any kind of queue, it can do broadcast or it can do traditional queuing stuff. It’s a great way to decouple different services, but sometimes there has to be a synchronous interaction between different microservices, and at that point, each of our services calls the other service via their own front door, shown here with an API gateway.
All of those synchronous interactions typically are traditional HTTPS APIs going through an API gateway. By having this very clear boundary of our services, our teams can be confident that there’s no external system surprisingly fiddling around inside the databases inside this microservice, which could mess up some of that crucial logic. Virtually all of our interactions with the world are event-based, whether that’s through telemetry that’s coming in from batteries, or through market prices changing, or customers changing some parameters. Being very event-based lends itself incredibly well to then using serverless compute, here shown with AWS Lambda.
Obviously, even in serverless, there’s always a server somewhere underneath, but by using many Lambdas, we can keep our teams focused on the actual business logic that needs to be inside of those Lambdas, and we can purely write the code that needs to respond to any given event, and avoid having lots of boilerplate or event loop code that we have to write and maintain ourselves. AWS have many people making all of this stuff run smoothly. Typically, our customers won’t buy from us because we’re amazing at managing VMs. That’s literally not the business that we’re in. Really, we shouldn’t be investing our time and effort into that side of things. We rely heavily on AWS Lambda in all of this.
If you add in a bunch of other AWS components, then at a really simple level, this is generally the stuff that makes up each of those microservices. To remind you, there’s a common edge after which we route the request to an API gateway owned by the relevant team. That invokes that microservice. That will be run by one or more Lambdas, responding to all those various kinds of API calls. They’ll then, behind the scenes, use something like DynamoDB or S3 for persistence. Again, DynamoDB is a managed serverless database. We’re not managing Postgres or MySQL ourselves. AWS are better at doing that for us. We use DynamoDB as an abstraction layer on top of database infrastructure that AWS manage for us. There’s also EventBridge there for asynchronous interactions across the estate.
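As a rough illustration of that per-microservice stack, here is a hypothetical Lambda handler, invoked via API gateway, that persists a change to DynamoDB and publishes an event to EventBridge with boto3. The table name, event source, detail type, and payload shape are all made up for the example.

```python
# Hypothetical sketch of the pattern described above: API Gateway -> Lambda ->
# DynamoDB for persistence, plus an EventBridge event so other microservices
# can react asynchronously. All names are invented for illustration.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
events = boto3.client("events")

def handler(event, context):
    body = json.loads(event["body"])                 # API Gateway proxy payload
    table = dynamodb.Table("battery-settings")       # hypothetical table name
    table.put_item(Item={
        "battery_id": body["battery_id"],
        "settings": json.dumps(body),                # stored as a JSON string
    })
    events.put_events(Entries=[{
        "Source": "example.battery-settings",        # hypothetical event source
        "DetailType": "BatterySettingsChanged",
        "Detail": json.dumps(body),
        "EventBusName": "default",
    }])
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```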
To dive into an actual pattern that we use, we have a model for how we respond to things changing. Again, perhaps prices are changing or a battery might be responding to a grid frequency drop or something like that. We split the logic in all of our components into two separate Lambdas. These have clearly defined, different responsibilities. One side is the trigger side and one side is the action side. These will typically be joined by a queue. The trigger side here is watching for the things that might change. All it’s doing is deciding which batteries or which assets probably need updating. All it does is put those batteries onto a queue, but without any other data. Crucially, it’s not saying what has to change about those batteries. It’s just got some stimulus and decided which of those batteries has to have something done to it.
The other Lambda then processes that queue and calls out to various other services to get the up-to-date, real data it needs to decide what the state of that battery should be. Then it basically recalculates the instructions needed for each battery based on what we know about the world at that point. This separation allows us to scale each side of that problem independently, and also to manage things like timeouts on other APIs and guaranteed delivery, all the things that you need to worry about when dealing with a serverless, queue-based, asynchronous world. We make heavy use of dead letter queues, for example, to reprocess data as we need to. If a message has been taken off a queue but not processed in time, or if it’s been sitting on the queue for too long, it ends up on a dead letter queue. Then we’ll have a separate Lambda doing something relevant with those messages that have timed out or somehow not been picked up, which will typically need some different processing.
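A skeletal version of that trigger/action split might look something like the following, with hypothetical helper functions and queue names. The trigger Lambda only decides which batteries need attention; the action Lambda, fed by the SQS queue, fetches fresh data and works out what to actually do.

```python
# Sketch of the trigger/action split. The trigger Lambda decides WHICH
# batteries need attention and enqueues only their IDs; the action Lambda,
# invoked by SQS, looks up fresh data and decides WHAT to do. The queue URL
# and the helper functions marked "hypothetical" are invented for illustration.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/batteries-to-update"  # placeholder

def trigger_handler(event, context):
    """Invoked when something changes (prices, a breaker flipping, and so on)."""
    for battery_id in affected_batteries(event):      # hypothetical helper
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"battery_id": battery_id}),
        )

def action_handler(event, context):
    """Invoked by SQS with a batch of battery IDs to recompute."""
    for record in event["Records"]:
        battery_id = json.loads(record["body"])["battery_id"]
        state = fetch_latest_state(battery_id)         # hypothetical telemetry lookup
        send_instruction(battery_id, compute_setpoint(state))  # hypothetical downstream calls
```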
To dive a little bit deeper still into one of these services to see how it actually works, let’s go back to that overall simple architecture diagram and highlight this component here. This is what we call the markets connector. We have a different one for each country, for each geography that we work in. In the UK that means how we talk to NESO, being the grid operator. If we zoom into that box, internally it looks a little bit like this. Don’t worry about all the detail in there.
Crucially, the thing to notice is that the triggers are all these things down at the bottom. In this case, the triggers for this bit of the component, are mostly all time-based. There are triggers which cause some kind of thing to happen every day to figure out what has to happen the next day, every half hour for every new settlement window. There’s also a listener in there for signals coming from another microservice which might affect this one. Each of those triggers has a dedicated Lambda for figuring out whether something needs to change, but crucially not what that thing is. They just put messages onto a queue.
Then, once the message is on that queue, they’re picked up by various other action-specific Lambdas. These are the ones that will query various other data sources to produce some update instruction and then send it in a way which is idempotent, meaning we can resend an instruction multiple times and the same result will always happen. An example in our case is that instead of sending an instruction like, for this battery, make it discharge by another 20 megawatts, or increasing the discharge by 20 megawatts, we don’t do that. Instead, we would say, make the discharge of this battery be 50 megawatts. It doesn’t matter how many times we send that message, the result of that will always be the same. That then gets sent to the system over here, which is the bit that actually communicates with the grid. Note there’s a few different queues over here because messages are flowing in both directions, to and from NESO. There’s also some built in handling for things like fault recovery and dead letter queue processing that I was talking about earlier.
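That idempotency point is easy to show with a toy example: a delta-style instruction compounds every time it is redelivered, whereas an absolute setpoint gives the same result no matter how many times it is applied. The message shapes here are invented for illustration.

```python
# Toy demonstration of why we send absolute setpoints rather than deltas.
delta_instruction = {"battery": "BAT-001", "change_discharge_mw": 20}   # NOT idempotent
absolute_instruction = {"battery": "BAT-001", "discharge_mw": 50}       # idempotent

def apply(current_mw: float, instruction: dict) -> float:
    if "discharge_mw" in instruction:
        return instruction["discharge_mw"]             # absolute: set the value
    return current_mw + instruction["change_discharge_mw"]  # delta: adjust the value

state = 30.0
for _ in range(3):                    # simulate the same message arriving three times
    state = apply(state, delta_instruction)
print(state)                          # 90.0 -- each redelivery compounds the change

state = 30.0
for _ in range(3):
    state = apply(state, absolute_instruction)
print(state)                          # 50.0 -- redelivery is harmless
```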
There’s also some comms between this microservice and other components in our system, but that’s always via their APIs, their async queues, their very defined edge is how all of our microservices communicate with each other, again like I was talking about earlier. This architecture also makes a lot of other things easier. For example, testing or scaling the individual components. From a testing perspective, each Lambda has a very clear set of possible inputs, and sanitizes those inputs to make sure that everything is as expected. We also know what the output from each individual Lambda should be for any given set of inputs, so repeatable testing is also quite easy to do.
For scaling, all of the Lambdas are independent of each other and are generally very well decoupled. Even the ones which are synchronously calling other systems will usually just be looking up some data. It’ll usually be pretty fast. There’ll be a Lambda on the other end which can very quickly respond. Typically, we can throw in some simple load testing at all of this and we’ll know pretty well what the whole system can handle and how much memory and resources each individual Lambda needs to be given.
Telemetry
One of the systems that looks a little different from the others, and which I’ll also deep dive into, is this one: the telemetry component. This is handling, in aggregate, many thousands of data points each second, and that data is used by a variety of other systems. If we have a closer look at that, this component takes streaming input from each individual battery, which, for us, comes in over MQTT, a lightweight publish-subscribe messaging protocol. We can also take metrics arbitrarily from other systems over an API interface, as described earlier. There’s a huge volume of data coming into this component, but we don’t mind whether the individual data packets have a single metric, a single reading, or loads of different metrics in a single data packet, or even a batch of readings for the same metric over different times. It doesn’t matter, they all just get put onto a queue and are then processed by various dedicated Lambda functions. These will validate the data, they’ll sanitize it. They’ll split it into known discrete chunks, regardless of how many things were in that incoming data packet, and they’ll store it for later use.
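A hedged sketch of that normalization step might look like this: whatever shape the incoming packet takes, it gets flattened into uniform per-metric records. The packet format shown here is invented, not the real wire format.

```python
# Sketch: flatten heterogeneous telemetry packets (one reading, many metrics,
# or a batch of readings over time) into uniform records ready for storage.
# The packet shapes are illustrative only.
def normalise(packet: dict) -> list[dict]:
    records = []
    asset = packet["asset_id"]
    for reading in packet["readings"]:
        # A reading either carries a single ts/value pair or a batch of them.
        values = reading.get("values", [{"ts": reading["ts"], "value": reading["value"]}])
        for v in values:
            records.append({"asset": asset, "metric": reading["metric"],
                            "ts": v["ts"], "value": float(v["value"])})
    return records

packet = {"asset_id": "BAT-001", "readings": [
    {"metric": "state_of_charge_pct", "ts": "2025-01-30T09:00:00Z", "value": 62},
    {"metric": "power_mw", "values": [
        {"ts": "2025-01-30T09:00:00Z", "value": 48.9},
        {"ts": "2025-01-30T09:00:01Z", "value": 50.1}]},
]}
print(normalise(packet))   # four flat (asset, metric, ts, value) records
```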
Speaking of storage, it all has to be stored somewhere, but there are many different components using this data, and they all have different use cases. Since the performance of those other components depends on this one, we actually store the data in several different places, depending on how it’s going to be used. A lot of the time, other components only want to know the latest value of any given metric. For example, what is the percentage state of charge of this battery now, or how many megawatts is this battery over here doing at the moment? We just write the latest value to Redis, put a simple API in front of it, and then anything else can query the latest value of that data. Redis is a fantastic in-memory store, primarily for key-value data, which is exactly what we have, and it’s also reasonably good at holding well-structured values that can be looked up by key. We don’t want the other components querying that Redis directly, though. We always want there to be an API or a stream at the edge of all of our services.
Instead of just exposing Redis, we’ve got an API which is powered by Lambda, which is itself querying that Redis. Another use case for this data is that something might want a range of metric data over a period of time. All of this data is lots of different time series. We put it into a suitable database, shown here as InfluxDB, which is a great time series database. It’s optimized for looking up and processing data based specifically on a timestamp or a range of timestamps. Again, with it being exposed via a well-defined API, we can switch out what that datastore is as necessary without needing to coordinate changes with the other teams managing the components that call this telemetry system. There’s also a long-term storage requirement, perhaps for archival or for some data lake style processing. Since we’re in AWS, S3, shown here, is a great place to just land that data, and it can be used to populate other things in the future as the need arises.
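For the latest-value path, the pattern is roughly the following sketch, using the redis-py client with an invented key scheme and a localhost endpoint standing in for the real cache; the Lambda handler is the simple API sitting in front of Redis.

```python
# Sketch of the "latest value in Redis behind a Lambda API" idea. Key scheme,
# route shape, and the localhost endpoint are assumptions for illustration.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def store_latest(asset: str, metric: str, ts: str, value: float) -> None:
    """Called by the ingestion side: overwrite the latest reading for a metric."""
    r.set(f"latest:{asset}:{metric}", json.dumps({"ts": ts, "value": value}))

def latest_handler(event, context):
    """Lambda behind the API, e.g. GET /assets/{asset}/metrics/{metric}/latest."""
    asset = event["pathParameters"]["asset"]
    metric = event["pathParameters"]["metric"]
    cached = r.get(f"latest:{asset}:{metric}")
    if cached is None:
        return {"statusCode": 404, "body": ""}
    return {"statusCode": 200, "body": cached}
```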
Then the final use case for this data is to expose it on a stream which can be used by other microservices. Note, this is not the raw input. This has been sanitized. It’s been normalized, filtered, split into known sizes. It means that anything consuming this doesn’t have to do all of those steps itself. It can get a very clean feed of the data.
The Communication Flow between Battery and Grid
If we look back at the high-level architecture, the optimizer here only wants the latest reading for any relevant battery that it’s optimizing, so it can get that super-fast from the API that was backed by Redis. There’s one extra bit on this whole picture that I’ve not really mentioned yet. I’ve talked about how we make decisions in our cloud environment. Somehow those decisions need to actually get to the battery for the electrons to flow in and out and for the battery to be effective on the grid. There’s a couple of ways to do this, but the one we most commonly use for these big grid-scale batteries is to put a custom piece of hardware physically onto a battery site capable of controlling several batteries, but we control that hardware. This hardware has a few things in it.
The first thing it needs is some local comms to the battery itself to tell it what to actually be doing. Generally, this is over some serial connection, Modbus being an example of that. The hardware also needs an internet connection, typically either a direct wire down to the site or a 4G modem inside the hardware box. We also need a frequency monitor in this box so that we can respond instantly, within milliseconds, if the grid frequency drops or goes up. The grid frequency is the signal telling us we need to change what the battery is doing, so we need to be monitoring it locally. Note, that response has to be faster than what we could do by sending that data up to AWS, processing it, and sending it back down again. It also has a redundant pair of industrial PCs in there, which is basically where the logic is that keeps all of this stuff running locally. These PCs are connected up to our cloud environment using MQTT.
Again, we prefer managed services, and AWS IoT is a managed MQTT server, so we use that at the receiving end. This gives us an easy way to push the next 24 hours of instructions down to the PCs, where they’re cached and operated on locally, and also to stream that telemetry back to us so we always know what’s going on. That’s a box on a diagram. In real life, these boxes look more like this. It’s effectively a box with a bunch of boxes and wires inside of it, but in there is everything I just mentioned, so the pair of PCs is these two things up in the top right. Those are just traditional Linux PCs running a custom application that we write. There’s an internet connection down here in the corner. In this case, it’s a wired connection, so we have a wire out to this battery site. There’s a pair of frequency meters here which are constantly measuring that frequency from the grid.
Then down here is the interface out to the batteries themselves. One of these boxes can control multiple grid-scale batteries, and like I say, we preload 24 hours of instructions onto them so that they can operate independently from our cloud environment in case there are any problems. This gives us breathing room in some of our non-functional requirements, because if our cloud environment went away completely, the on-site hardware would still run things for the next 24 hours while we recover our cloud operation. Obviously, we couldn’t make changes to what those instructions are, but that battery site would still keep running on its own as we’d planned at that time.
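On the cloud side, pushing a day’s schedule down over AWS IoT’s MQTT broker can be as simple as the following hedged sketch using boto3’s iot-data client; the topic name and payload shape are invented, and in reality the on-site PCs are subscribed to that topic and cache the schedule locally.

```python
# Sketch: publish the next 24 hours of setpoints to a per-site MQTT topic via
# AWS IoT Core. Topic name, payload shape, and region are illustrative only.
import json
import boto3

iot_data = boto3.client("iot-data", region_name="eu-west-2")

schedule = [
    {"start": "2025-01-30T09:00:00Z", "minutes": 30, "setpoint_mw": 50.0},
    {"start": "2025-01-30T09:30:00Z", "minutes": 30, "setpoint_mw": 0.0},
    # ...and so on, covering the full 24-hour moving window
]

iot_data.publish(
    topic="sites/SITE-001/schedule",        # hypothetical per-site topic
    qos=1,                                  # at-least-once delivery to the broker
    payload=json.dumps(schedule).encode(),
)
```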
EVs and Making the Grid Greener
Loosely, that’s how we enable renewable energy sources to be part of the grid without blacking out London every time a cloud covers the sun or the wind drops. It turns out there’s another way we can use this setup to help the grid be as green as possible and also to allow home consumers to get cheaper electricity, everyone wins. We’ve built all of this control and optimization to make the best use of grid-scale batteries, but in the UK today, there are typically around half a million electric vehicles or EVs which are also placing demands on the grid. Each one of these EVs is typically just a battery as far as the grid is concerned, and although they’re nothing like as powerful as the big grid-scale batteries, why can’t we group together a few thousand of these at a time and treat them as if they are a grid-scale battery? Effectively, that’s what we do, basically turning this picture from earlier into this picture. There’s a few differences in how we do this. Home EVs can’t generally discharge back to the grid, although that’s coming.
Individually, they can basically either charge or pause, but we can turn up or turn down the effective amount of charging, for example, by taking, say, 500 of these EVs and asking, can you delay when you’re going to start charging by another 30 minutes? Or by taking a few thousand of these EVs and asking, can you charge earlier in the day, because we know it’s a lower carbon-intensity time for the grid? There’s also a high likelihood that an EV might not be plugged in when we want it to be, or it might already be fully charged, so we need to allow for quite a wide tolerance in our expectations compared with how we manage grid-scale batteries.
To cope with these facts, we split the optimizer into two different parts. We split all of the EVs into a number of different fleets, typically of a few thousand each, and then we run the optimizer as a first pass on those aggregate fleets of EVs, effectively modeling that fleet as a single industrial battery. Once we have an idea of what that fleet should be doing, we then run a second pass of the optimizer to decide for each individual EV what it should be doing so that the fleet can generally meet its goals. That second pass optimizer for the relevant fleet gets rerun every time a car is plugged in or unplugged, and its primary goal is to meet the customer’s request about how much a car should be charged to by what time of day.
Then, as a secondary goal, it’s trying to meet the needs of the fleet so that the fleet can meet its goals that we’ve defined or figured out in advance. Another difference is that we can’t use local hardware to control the car’s charging in the same way as we have that box on a grid battery site. Instead, we call APIs provided by each manufacturer, which are already set up to tell those individual EVs what to do. This means that we can offer cheaper electricity to customers in return for them letting us decide exactly when their EV or home battery is charging. It allows us to buy more electricity at the cheaper rates or, more importantly, at the lower carbon intensity times during the day or night. Some of you here today actually might be taking advantage of this. If any of you have heard of the Octopus intelligent tariffs, like Intelligent Octopus Go, this is fundamentally how some of the brains behind that system actually works to figure out when your EV should actually be charging.
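To illustrate the two-pass idea, here is a toy allocation of a fleet-level charging target across individual plugged-in EVs, most urgent customer deadline first. It is a deliberate simplification: the real second pass is the same kind of constraint optimizer described earlier, not this greedy loop, and the numbers are made up.

```python
# Toy second-pass allocation: share a fleet-level charging target across the
# cars that are actually plugged in, most urgent deadlines first.
from dataclasses import dataclass

@dataclass
class Ev:
    ev_id: str
    plugged_in: bool
    needed_kwh: float         # energy still required to hit the customer's target
    max_charge_kw: float
    hours_to_deadline: float

def allocate_slot(fleet_target_kw: float, evs: list[Ev]) -> dict[str, float]:
    """Split one slot's fleet charging target across individual EVs."""
    available = [e for e in evs if e.plugged_in and e.needed_kwh > 0]
    available.sort(key=lambda e: e.hours_to_deadline)   # most urgent first
    plan, remaining = {}, fleet_target_kw
    for ev in available:
        kw = min(ev.max_charge_kw, remaining)
        plan[ev.ev_id] = kw
        remaining -= kw
    return plan

fleet = [Ev("EV-1", True, 12, 7.4, 6), Ev("EV-2", True, 30, 11, 3), Ev("EV-3", False, 20, 7.4, 8)]
print(allocate_slot(15.0, fleet))   # EV-2 (nearest deadline) gets 11 kW, EV-1 the remaining 4 kW
```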
Key Takeaways
In summary, three takeaways from this. Don’t reinvent things, particularly if you don’t need to. You can build on top of what’s already there, but if you do that, make sure you really understand the limits and the operating characteristics of what you’re building on top of. Non-functional requirements are super important. This stuff has to be reliable, it has to be secure, and it has to be scalable. If you’re not thinking about those kinds of things from the start, then you’re going to have a bad time at some point through your journey. Then, finally, the green energy revolution is happening and there’s loads of really interesting tech involved in making it successful. Everybody from end users all the way through to industries can play their part in making sure that our children have a future on a planet which we haven’t broken.
Blackhillock Industrial Battery Site
This is the newest industrial battery site in Europe to go online. It’s a site in Scotland, somewhere in between Aberdeen and Inverness, called Blackhillock. It’s currently a 200-megawatt battery site with 400 megawatt-hours of storage, to be expanded next year. That’s what it typically looks like. Each of these white boxes is roughly the size of a shipping container. Each of those contains the equivalent of something like 150,000 of what you’d think of as a traditional AA-size lithium battery. That’s a 200-megawatt industrial battery.
Questions and Answers
Participant 1: Along the way, were there any big incidents, hard lessons learned that got you to that level of reliability?
Kevin Bowman: Yes. I think what’s crucial with those, though, is like I was saying, to think about the isolation boundaries. Never assume that everything’s going to work. I always say that incidents are great. I’m maybe different from the herd in saying this, but incidents in any company are a way of telling you what problems you already have. If your arm is bleeding and you don’t know it’s bleeding, that’s dangerous; an incident is the company equivalent of being told that your architecture has some kind of injury and that you should be doing something about it. As long as your architecture contains incidents well enough, they’re a really good signal, a really good feedback mechanism as to where in your architecture you should be fixing things.
Participant 2: The box that you’re putting on grid-scale batteries, are those only for grid infrastructure that Octopus own, are responsible for?
Kevin Bowman: No. I work for Kraken. This is the Kraken logo, which looks very similar to the Octopus logo, slightly different. Kraken is like the technology bit within Octopus and is a company in its own right. It’s specifically set up to be reasonably operationally independent from the rest of Octopus so that we can provide these services both to things which Octopus owns, but also to other companies. The ultimate mission of the whole Octopus group is to put a big green dent in the universe. We can’t do that just by allowing Octopus to use this green technology. We have to also allow other energy suppliers, everyone else in the energy industry, to use cool green tech. Actually, we as Kraken don’t own any of these batteries and we have customers outside of Octopus who are using this.
Participant 2: Would there be multiple suppliers with these boxes attached to a single site?
Kevin Bowman: Industrial battery sites have a really complicated operating model. There can be one company who owns the battery, another company who’s effectively leasing the battery, another one who’s operating the battery, and then a company like ours running a SaaS platform to actually do the optimization and things on top of that.
Participant 2: You described breaking EVs into fleets. Do you rebalance the fleets based on known predictable patterns of charging or do you not have that information available?
Kevin Bowman: We do. If anybody’s on Intelligent Octopus, you’ll be familiar with this. Each individual EV owner tells us, I want my car charged to this amount by this time of day. It could be like, I need an 80% charge by 7 a.m., because that’s when I drive to work. Individual by individual, we know what that is. That goes into the optimizer. That information goes into both the fleet optimizer as a general rule, and then into the individual optimizer because that’s the primary goal of the individual stage of the optimizer.
Participant 2: Do you rebalance what the fleets are based on how they’re known to be used and charged?
Kevin Bowman: Yes. This is an area that we’re probably going to go into next is some kind of more intelligent rebalancing of all the fleets themselves. In reality, those fleets, once they’re set, are not changed very much at the moment, but this is an evolving system.
Participant 3: How do you feel about using so much AWS services? Aren’t you afraid about vendor lock-in? Then, if so, what’s your exit strategy?
Kevin Bowman: It’s a conversation that happens quite a lot. Yes, we rely on a lot of AWS services. We’ve consciously decided to rely on the managed services, like say Lambda, DynamoDB. There’s pros and cons in every decision, in every architecture, in every business. We get a lot of benefits of that in terms of, like I say, not having to invest in a lot of the skills to make the equivalent of that functionality ourselves.
Obviously, the tradeoff is that we’re locked into those services. I would highlight, though, the microservice architecture, where each individual component within our overall system has a very well-defined interface. If we needed to, if we wanted to lift a component out as a whole and do something else with it, we have those well-defined boundaries, and it’s more a case of just telling our overall config system that the control API is now over here instead of over there. We have options. It would cost a fair amount of time and effort to enact one of them, but as a young, small, but rapidly growing company, all part of Kraken, we’re very much enjoying the benefits of those managed services. We pretty much just invest in the kind of USP that we have.
Participant 4: With all of that complexity around who owns the battery and so on, do you have problems with governance, making sure you’re meeting all the legal requirements of part-owning a thing? How do you manage that across all those little services?
Kevin Bowman: When I was talking about the grid-scale ones, this markets connector down here is a different implementation per territory, per geography, and that’s largely for regulatory reasons. Every country, and sometimes subparts of a country, has a different way of regulating their energy system, and we need to integrate with that. I was talking earlier about the UK market being split into 30-minute settlement windows, so there are 48 in each given day. The UK was reasonably advanced in not just saying the market is for a whole day; it’s actually split into subsections. I’m sure there was some thinking about it, but the UK chose 30 minutes. Other countries have learned from this and have chosen 15-minute windows, or 5-minute windows in some cases. The whole system can’t assume, for example, that there’s a 30-minute settlement window, but the markets connector needs to be very aware of how that works. We’ve got a little bit of an abstraction point there.