Transcript
Hanisch: My talk is "From a Lambda-Lith to an Event-Driven Architecture", so let's take a look at what we are actually talking about. First, I'm giving you a little introduction. Then we are going to explore what a Lambda-Lith actually is, introducing the term itself. Then, we are going to reflect on how we can maybe do a little bit better. This is the journey we at Siemens took over these past three years: where we started building our systems and how we build them today. To showcase that, I brought three different projects, and of course, in the end, we are going to wrap it up.
Who am I? I'm Leo, a solutions architect with Siemens, involved with AWS and mostly into serverless stuff. That's also why we are going to talk mostly about serverless architectures.
The Lambda-Lith
When it comes to architecture, as solutions architects, we always aim for the optimal solution: it's supposed to cost nothing, and it's supposed to scale high. Back then, this is what our architecture looked like. When you get started with AWS, when you get started with serverless architectures, this is probably the first thing you are going to see. We have an Amazon API Gateway, AWS Lambda, and Amazon DynamoDB, and they are great serverless services, all good, but there's no Lambda-Lith yet. What is a Lambda-Lith? To discover the Lambda-Lith, we actually have to zoom in a bit on the Lambda function.
A Lambda-Lith is a word combination of Lambda, because AWS Lambda is our compute platform, and monolith, because we are building monoliths. We would do something like this. We have our favorite Node.js framework, in our case NestJS, and there you just define your REST API and your routes; in this case, we have a POST route for orders, because we want to create new orders, and then you do all your stuff in code. That's how we traditionally build servers, and it's all good. This time we say, we don't want to deploy it on-premises, we just put the whole thing in a Lambda function, and there you get the Lambda-Lith.
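To make that concrete, here is a minimal sketch of such a Lambda-Lith, assuming a NestJS app whose root module is `AppModule` and the widely used `@codegenie/serverless-express` adapter (both illustrative choices, not necessarily what the Siemens team used): the entire REST server, every route included, sits behind a single Lambda handler.

```typescript
// Sketch: one Lambda handler fronting a whole NestJS REST server.
// Assumes AppModule exists and @codegenie/serverless-express is installed.
import { NestFactory } from '@nestjs/core';
import serverlessExpress from '@codegenie/serverless-express';
import type { Handler } from 'aws-lambda';
import { AppModule } from './app.module';

let cachedHandler: Handler | undefined;

export const handler: Handler = async (event, context, callback) => {
  if (!cachedHandler) {
    // Bootstrapping the full framework happens here, on the cold start,
    // which is exactly why a Lambda-Lith's cold starts tend to take longer.
    const app = await NestFactory.create(AppModule);
    await app.init();
    cachedHandler = serverlessExpress({
      app: app.getHttpAdapter().getInstance(),
    });
  }
  // Warm invocations reuse the cached server: no cold start.
  return cachedHandler(event, context, callback);
};
```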
Each of those solutions comes with tradeoffs, and the Lambda-Lith has advantages. When you're working with AWS Lambda, you soon get to experience cold starts: on an initial invocation, AWS actually has to download your source code files before it's able to execute them, and that time is referred to as a cold start. On every subsequent invocation of the same Lambda function, you don't have that cold start anymore, because the source code files are already there. Of course, it's also easy to get started, because maybe you already built that server for an on-premises use case, and now you want to leverage all the advantages of the cloud. With very little glue code, you can actually manage to migrate your existing applications to the cloud. Of course, every solution ships with advantages and disadvantages.
One disadvantage is that even though you have fewer cold starts, because everything lives in a single Lambda function, the cold starts you do have tend to take longer. Why is that? You're bundling everything into one bundle; that bundle is bigger, because everything is in there, therefore it takes longer. Also, with this type of architecture, a REST API in front of a Lambda function and a database, we only have synchronous invocation. Of course, that's sufficient for many use cases. There's nothing wrong with that at all.
Of course, as things grow, as those projects and applications evolve and new requirements keep coming in, you might hit a limit. In the case of AWS Lambda, there is actually a hard limit you cannot avoid: the deployment bundle you're uploading has a maximum size of 50 megabytes zipped and 250 megabytes unzipped. If you hit that limit, you're really in trouble, and you need to rearchitect your solution to somehow circumvent it. Of course, there's also the big monolith disadvantage: you have a single bottleneck. Even on AWS, with its tremendous scaling capabilities, if you introduce a bug that prevents your server from starting, it will break that Lambda function, and it will break your application.
With that in mind, one more thing: AWS has its own take on the Lambda-Lith, so that's nothing I came up with. When we zoom in on the very top left, we actually see it's listed as an anti-pattern. That raises the question: are we doing everything wrong, and are we supposed to do better? There's one quote at the very bottom I really like, because when I read it, it really resonated: it definitely increases the cognitive burden for developers. At least that was the case for us. Because if you go back to the Lambda-Lith itself, imagine this: we have the product service here, and we don't have any error handling. What is supposed to happen if we fail to create an order? Are we rolling back? Should we unreserve the product again? None of those things are handled. If you imagine those few lines of code wrapped in tremendous try-catch blocks, that's what causes the cognitive burden. With that, we have successfully introduced the Lambda-Lith.
As we will do with the Lambda-Lith and all the subsequent projects I'm going to bring today, we are going to think about why we used it in the first place and what we learned by applying it. In our case, we used it because we just did not know any better. Three years ago, that was mostly the start of our cloud journey. We had it deployed very fast and it worked. It scales, it works, and you're happy.
Again, you have to keep those tradeoffs in mind. Like the re-platforming case: if you already have a REST server running somewhere, it's an easy way to migrate to the cloud. What have we learned? The whole cold start thing. You want to know the implications it ships with, so it's a tradeoff: do I want multiple faster cold starts, or do I want fewer cold starts and accept that a single cold start might take longer to spin up? Of course, as we just said, it is, or can become, a cognitive burden. That's not a verdict on all monoliths, but speaking of how it was for us: we had that single bottleneck. That's, of course, something to avoid.
Project 1 – Use Case (Siemens COIN)
That actually brings us to the first project I want to share with you. We call it COIN. The scope of the COIN project is to inform fellow Siemens employees about the stock options and equity plans they are eligible for. Once a year, as a Siemens employee, you can subscribe to one of those plans, and then you're done. It's mostly an informational web app, but you also allocate your options through that web app. We had learned: the Lambda-Lith is ok, but maybe we can do a little bit better. Let's take a look at how we built it. It looks very similar to what we already had. We only swapped out DynamoDB, Amazon's NoSQL database, for an Aurora instance. Aurora is the relational database; it comes in different flavors, and we used PostgreSQL.
To actually see the difference, we have to zoom in, because now we would again have an API gateway endpoint, a Lambda function, and reader and writer instances from Aurora, and we would have multiple instances of those. We would start calling them microservices. Now we have an absolutely fine architecture. We eliminated our single bottleneck, the Lambda function, so we deploy it to production on a Friday, go home, and are satisfied. That's what we did. Turns out, on Monday, you go back to the office, and then you realize we don't just have a few clients: HR is actually running email campaigns on an annual basis, informing all 300,000 eligible Siemens employees, "Dear Siemens employees, now please go to COIN and allocate your stock options". What's going to happen?
We have a lot of clients, causing a lot of traffic on our API gateways, and that's fine. We're using serverless technologies; that's why we use them in the first place, because we are expecting that peak workload. API Gateway, I think, has a quota of about 10,000 requests a second, so that's ok. Our Lambda functions also spin up multiple instances. That's ok. But each Lambda instance, spinning up on invocation to serve the high number of requests, is attempting to connect to the database, and we still have a single writer and a single reader. Of course, what's going to happen? We are trying to open too many connections, we overload the database, and we fail to serve our requests. What we accomplished: we got rid of our compute bottleneck, but we successfully introduced a database bottleneck. How can we circumvent that?
The first thing you actually want to do is use the so-called RDS Proxy. What it's meant for is taking care of those many connections; this component is the AWS solution for connection pooling in front of your database instances. Still, we had many requests, and it was a very read-heavy application. We could then also swap out the single reader and introduce so-called Aurora autoscaling for readers. What it does is let you track a certain metric, for instance, database CPU load, and then, based on that metric, add additional readers to your database cluster. With that, we eventually managed to serve all requests and survive the annual Siemens email campaign, and all Siemens employees were happy that they could allocate their options. That's the first application.
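As a sketch of what that looks like in infrastructure code, here is a hedged AWS CDK (TypeScript) version, assumed to live inside a CDK Stack with an existing VPC; construct names and capacity numbers are illustrative, not the actual COIN configuration.

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';
import * as appscaling from 'aws-cdk-lib/aws-applicationautoscaling';

declare const vpc: ec2.IVpc; // assumed existing VPC

const cluster = new rds.DatabaseCluster(this, 'CoinCluster', {
  engine: rds.DatabaseClusterEngine.auroraPostgres({
    version: rds.AuroraPostgresEngineVersion.VER_15_3,
  }),
  writer: rds.ClusterInstance.provisioned('writer'),
  readers: [rds.ClusterInstance.provisioned('reader1')],
  vpc,
});

// RDS Proxy pools and reuses connections, so a burst of Lambda
// invocations no longer exhausts the cluster's connection limit.
cluster.addProxy('CoinProxy', {
  vpc,
  secrets: [cluster.secret!],
});

// Application Auto Scaling tracks average reader CPU and adds
// reader instances to the cluster when the metric rises.
const readers = new appscaling.ScalableTarget(this, 'ReaderScaling', {
  serviceNamespace: appscaling.ServiceNamespace.RDS,
  scalableDimension: 'rds:cluster:ReadReplicaCount',
  resourceId: `cluster:${cluster.clusterIdentifier}`,
  minCapacity: 1,
  maxCapacity: 8,
});
readers.scaleToTrackMetric('ReaderCpuTracking', {
  targetValue: 70, // illustrative CPU target in percent
  predefinedMetric:
    appscaling.PredefinedMetric.RDS_READER_AVERAGE_CPU_UTILIZATION,
});
```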
Again, why would you use it? It's the re-platforming thing. We still have smaller Lambda-Liths, so it's still somewhat the same: we still only have synchronous execution, but we already managed to decouple parts on the compute side. What have we learned? Don't let a single database become your bottleneck. It sounds obvious when I talk about it now, but it was not obvious back then. With the small Lambda-Liths, it's the same thing: you have to trade off the cold starts. I would not say it's a problem, but you should know the implications it ships with.
Of course, a single database is a bottleneck. If you have one, and it's AWS Aurora, make sure you use an RDS Proxy if you have many connections, and if that's not sufficient, introduce autoscaling. That's why we go to a cloud provider in the first place: we want them to do the scaling for us.
Project 2 – Use Case (Siemens DVM – Digital Visit Management)
That leads us to the second project I want to share with you. After we eventually managed to deliver COIN, we got a new project called DVM, which is the abbreviation for Digital Visit Management. What do you do with it? At Siemens, we have a lot of business partners we want to get involved with, get in touch with, and do business with, and we want to invite them because we want to show what Siemens is capable of doing. We have new facilities, new factories, Industry 4.0; we have those showcases.
For instance, say I want to invite all the InfoQ attendees to a Siemens site, and then I'm going to show you what we can all do. This tool allows me to plan that. I can schedule meetings. I can schedule tours through our factories, and you get QR codes so that you get through the gates. As you can imagine, I should not be able to just schedule a meeting or such a tour for 100 people; there should be an approval process. Probably my manager wants to know what Leo is planning for this event, for this conference. That means a lot of approvals, a lot of states, a lot of management needed. That's actually a screenshot of the first version of a state machine we had for all those approvals. You can even imagine: each of those small boxes unfolds into its own state machine. It's a lot to consider.
You have different factories, different interfaces all over the place, and you somehow want to manage it: we needed to build an approval workflow. The good thing is, AWS has us covered when it comes to approval workflows, namely with AWS Step Functions. I refer to Step Functions as the workflow engine AWS provides for us. What you can do is define a workflow. In this example, we have the start and the end, and we have a single action: we want to emit an event to Amazon EventBridge, the event bus. Usually, when you just normally invoke such a Step Function, it runs through all the steps you've defined, writing something to the database, making an additional API call, putting an event on EventBridge, and in the end, it succeeds and the execution is done.
However, there's this wait-for-callback mechanism I want to introduce in a bit more depth. We can link that Step Function to an event bus, so whenever an event occurs, the Step Function starts. Usually, we emit an event, again on the event bus, and we end. With the wait-for-callback mechanism enabled, we still emit that event, but we pause the Step Function: we don't finish it just yet. The event that was emitted includes a token, and now we're waiting. What you usually do is subscribe another listener, in this case a Lambda function, to that event, including the token, and have that Lambda function save the token to a database, because now we are waiting for an approval.
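In CDK terms, the pause-and-wait step could look roughly like this sketch, assumed to live inside a CDK Stack with an existing `eventBus`; the source and detail-type strings are made-up placeholders. The key parts are the wait-for-task-token integration pattern and embedding the task token in the emitted event.

```typescript
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as events from 'aws-cdk-lib/aws-events';

declare const eventBus: events.IEventBus; // assumed existing bus

// Emit an ApprovalRequested event, then pause the execution until
// someone calls back with the task token carried inside the event.
const requestApproval = new tasks.EventBridgePutEvents(this, 'RequestApproval', {
  entries: [{
    eventBus,
    source: 'dvm.bookings',            // illustrative source
    detailType: 'ApprovalRequested',   // illustrative detail type
    detail: sfn.TaskInput.fromObject({
      bookingId: sfn.JsonPath.stringAt('$.bookingId'),
      taskToken: sfn.JsonPath.taskToken, // the token the listener must store
    }),
  }],
  integrationPattern: sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
});

new sfn.StateMachine(this, 'ApprovalWorkflow', {
  definitionBody: sfn.DefinitionBody.fromChainable(requestApproval),
});
```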
Is Leo allowed to invite all these people to the next Siemens site? Yes or no. We don't know. Of course, when did you last ask your manager for an approval, maybe for a vacation, or to come here to this conference? It can take some time. The good thing about this type of workflow is, while you're waiting for the approval, you're billed almost nothing in infrastructure costs. The only thing you're billed for is a little bit of storage, because you store the token in a database, but you have no compute costs, because the Step Function is paused. EventBridge is the serverless event bus, so if there are no events, there are no costs. The same goes for the Lambda function.
Then, when I'm approved to show you the next Siemens site, we again have an event on the event bus, a Lambda function listening to it, reading the token from the database, and finally submitting the token back to the Step Functions API. Only now, once the approval was granted, does the Step Function continue, and in this case, it finishes. Of course, now you could do any subsequent steps, like sending out emails or a confirmation of the approval; those would be the usual follow-up steps you want to do.
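The listener that resumes the workflow is then only a few lines. Here is a sketch using the AWS SDK v3, where `loadTokenFromDatabase` is a hypothetical helper that retrieves the token saved when the workflow paused.

```typescript
import { SFNClient, SendTaskSuccessCommand } from '@aws-sdk/client-sfn';

const sfnClient = new SFNClient({});

declare function loadTokenFromDatabase(id: string): Promise<string>; // hypothetical lookup

// Triggered by the "approval granted" event: fetch the stored token
// and hand it back to Step Functions so the paused workflow continues.
export const handler = async (event: { bookingId: string }) => {
  const taskToken = await loadTokenFromDatabase(event.bookingId);
  await sfnClient.send(new SendTaskSuccessCommand({
    taskToken,
    output: JSON.stringify({ approved: true }),
  }));
  // A rejection would call SendTaskFailureCommand instead.
};
```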
With that in mind, we can actually take a look at how we built our architecture. Again, we want to do microservices, and now we've learned: don't let a single database become a bottleneck, so we moved the databases into the microservices, and each microservice has its own database. Also, just to visualize it, a microservice is not only a Lambda function and an API gateway; it can have multiple components, like queues, additional listeners, and so on. Then, of course, we have EventBridge for our choreography, so now we can actually communicate in an asynchronous fashion, and for the counterpart, orchestration, we're using Step Functions. In our case, we put that Step Function in a scope we called the orchestration service, but we'll see how that turned out. Now, again, we can think about why we used it. Why did we choose this type of architecture?
First of all, asynchronous communication: everything is asynchronous, so we want to be able to implement those requirements. Also, this whole event-driven mindset closely aligned with how we actually think, and with how we received the requirements. It usually was: when a booking is approved, please send out an email to the following stakeholders; when the catering is scheduled, we need to feed that back to some other channel. Those requirements were naturally defined as: when something happens, please do something else.
For us, it turned out that was actually quite nice, and it aligned with this event-driven approach. We managed to decouple our systems, so if a microservice fails now, only that single feature or microservice fails and not our whole application. That's a huge benefit, of course. Using Step Functions to implement the orchestration was also a huge benefit.
What have we learned? We need to plan for observability, because now, suddenly, we have dozens of microservices, dozens of Lambda functions listening, and Step Functions and other components asynchronously interacting with each other, and we want to keep track of what's going on. That's a big lesson, because maybe you're submitting some event to some microservice over here, but some subsequent microservice horribly fails, and you want to know that that event actually caused the downstream failure. That's a big lesson learned. Because in the end, it blew up, and we had no idea why. You want to actually trace what's going on in your system.
Also, it's again one big Step Function we introduced with this orchestration service, and that's not a good thing; similar to how the initial Lambda monolith was only an ok-ish idea, we want to break things up and put them in different scopes. It was also hard for us from a mental-model perspective, because in this application we're orchestrating not only bookings, we're orchestrating factory onboarding, users, and many other things. Then we were like, ok, where should we put this piece? We're doing orchestration, so it should go in the orchestration service, but actually it's heavily related to a user, so maybe it should go in the scope of a user service. I would not do that again, creating a dedicated orchestration service, because at some point it did not align well with our mental model.
The last one is actually a tricky one. Within AWS, especially with these serverless event-processing services, namely EventBridge, the event bus, but also SQS, where we can queue events and process them, they all have a so-called delivery policy. When you look it up, the delivery guarantee for most services is at least once. That doesn't mean zero times; mostly it's one time, but it can also be two times.
If you remember the Step Function part I walked you through, that was triggered by an event emitted on the event bus. Now what happens if you have two events? Two Step Functions start, trying to do the same thing, and that caused side effects you cannot imagine. It's hard to describe, it's hard to grasp, but it's a mess. You don't want to debug those kinds of side effects. We'll see with the final architecture how you can actually deal with duplicated events.
Project 3 – Use Case (Siemens MDLA – My Digital Lab Assistant)
That leads us to the last architecture I brought with me. It's called MDLA, an abbreviation for My Digital Lab Assistant. We built it for our Siemens Healthineers colleagues. What problem are we trying to solve? In the end, it's a customer portal. If you're a customer of Siemens Healthineers, you buy your favorite medical device, but then, of course, different things can fail: it does not get shipped, it's damaged, it has the wrong color, it's broken, whatever. In that case, you usually contact your Healthineers contact. To streamline this in a more structured manner, we built MDLA. When I said those duplicated events can become a mess, there's actually a concept that allows us to take care of that, namely idempotency.
The definition I've looked up here is that idempotency is the capacity of an application to identify repeated events and prevent duplicated, inconsistent, or lost data. We learned that duplicate events can happen, and that's not good. We want to know when they arise, and then we want to treat them properly and handle them. That's what idempotency is about. I've brought an example. Let's say we have a Lambda function, and we're trying to insert a new record, a new row, into our database. Now, when we have duplicate events, we would insert two records, and that's obviously not a good thing. What you do instead of directly inserting that event is to somehow identify it first.
Either you have your own ID that uniquely identifies an event, or you could think of hashing the whole event to get a unique identifier. Then, before you put it in the database, you actually want to look up that event ID in a so-called control table; in this case, it could be some cheap DynamoDB table. Then, you have two cases. Either the event does not exist in the control table, so we know we have not processed that event yet, so we are safe to put it in the original database, and then finally we also have to update the control table. In the other case, we get a hit in our control table, and then we know we have already processed that event. What should we do? Most of the time, you don't want to do anything; you just drop that event, because you know you already did the work.
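Here is a sketch of that check-and-write in TypeScript with the AWS SDK v3. One note: the talk describes a lookup followed by an update as two steps; a DynamoDB conditional put collapses both into a single atomic call, which also avoids the race where two duplicates pass the lookup at the same time. Table and function names are illustrative.

```typescript
import {
  DynamoDBClient,
  PutItemCommand,
  ConditionalCheckFailedException,
} from '@aws-sdk/client-dynamodb';

const ddb = new DynamoDBClient({});

declare function insertRecord(payload: unknown): Promise<void>; // hypothetical business write

export async function handleEvent(event: { id: string; payload: unknown }) {
  try {
    // Atomically claim the event id in the control table: the write
    // succeeds only if this id has never been recorded before.
    await ddb.send(new PutItemCommand({
      TableName: 'control-table', // illustrative table name
      Item: { eventId: { S: event.id } },
      ConditionExpression: 'attribute_not_exists(eventId)',
    }));
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return; // duplicate event: already processed, drop it silently
    }
    throw err; // real infrastructure error: let it surface
  }
  // First delivery of this event: safe to do the actual insert.
  await insertRecord(event.payload);
}
```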
With that in mind, we can actually take a look at how we built MDLA. Lessons learned: we don't want a dedicated orchestration service, so we put Step Functions into each microservice where needed. We would again have multiple microservices, and microservices are allowed to communicate with each other. We would usually have client credentials for each microservice, so we have authorization in place. Again, we have EventBridge for asynchronous communication. What I did not mention is that it's fairly tedious for clients to connect to all of our microservices. We have dozens, or even hundreds, of microservices, and we don't want the clients to connect to each microservice and aggregate data across microservices; that's very tedious.
Instead, what we introduced in this project is a so-called gateway service. It's again an API gateway and a Lambda function. This time we deploy it in a private subnet, because then we can create so-called interface endpoints and assign those downstream microservices so-called resource policies. With that, we switched out the client credentials flow and exchanged it for AWS IAM, the AWS authorization service. That's why we choose a cloud provider: we want to pass on the responsibility, because we don't want to keep track of our secrets and so on. That's why we chose to implement it this way.
With that, we unveiled the last architecture, but again, why did we choose it? All the previous benefits: we entered this rabbit hole of event-driven architecture, so of course all those advantages apply. In this case, we really benefited from those private API gateways, because we switched out our custom authorization mechanism and leveraged what AWS already provides for us. And of course the whole client connection story with the gateway service. Lessons learned: I cannot stress it enough, it's again observability. You want to start tracing. X-Ray is the AWS service to do so, but of course you can also use OpenTelemetry. Again, those private APIs were both a benefit we wanted to use and something we discovered along the way, so it was also a lesson learned for us.
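For the tracing part, enabling X-Ray on the serverless pieces is mostly a one-liner per resource. A CDK sketch, assumed to live inside a Stack; names are illustrative, and `someChain` stands for whatever workflow definition you have.

```typescript
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

declare const someChain: sfn.IChainable; // assumed existing workflow definition

// Active tracing: Lambda sends invocation segments to X-Ray, so a
// request can be followed across asynchronous hops.
new lambda.Function(this, 'OrdersFn', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('dist'),
  tracing: lambda.Tracing.ACTIVE,
});

// Step Functions executions then show up in the same trace graph.
new sfn.StateMachine(this, 'Workflow', {
  definitionBody: sfn.DefinitionBody.fromChainable(someChain),
  tracingEnabled: true,
});
```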
With the Step Functions, that's actually an interesting one, because you can version your Step Functions and assign different aliases to them. It's a very neat feature, because it allows you to publish new versions, new adapted workflows, without breaking anything old. That ensures some backward compatibility. Imagine you have your approval workflow running, you're waiting for approval, but now you're changing the overall process and deploying a new Step Function. How would you do that? You still have to wait until the old workflows succeed. With versions and aliases, you can actually run multiple versions of your Step Functions simultaneously.
Summary
We started with a fairly simple serverless architecture on AWS: an API gateway, a single Lambda function with all our code inside, and a database. Then we switched it up and traded that compute bottleneck for a database bottleneck. Then we discovered Step Functions to implement approval workflows, which allowed us to build a more complex system with a lot of orchestration going on, such as the Digital Visit Management. Then came the final architecture with MDLA, where we implemented our handlers in an idempotent way, checking whether we had processed an event already or not. With that, we eventually built more resilient architectures. That was the journey we've taken.
Questions and Answers
Participant 1: I had a question about the best of the best architecture that you built. You introduced the Lambda function that is going to be taking in all the requests coming from the users. Did you make your architecture synchronous at the end of the day? Is your Lambda function waiting on the process to complete? Is this your new bottleneck?
Hanisch: Yes, it is. First of all, why did we do that? We had already learned that a single Lambda bottleneck is a bad thing, and we want to avoid it. In this case, we are not doing the heavy business logic in there; we are aggregating across multiple services. Of course, there's the risk of having that single bottleneck, but in our case, we accepted that risk, because by then we had gotten good at building small Lambda-Liths, so we built a small Lambda-Lith, obviously. How would we do things today? There are different solutions to tackle that.
What we are also investigating is, for instance, using AppSync. That's the managed GraphQL service; it allows you to build GraphQL APIs, and it's also a fully serverless service managed by AWS. Then you would get rid of your bottleneck. Obviously, yes, very good catch. Again, a bottleneck; in this case, we accepted that risk. It's an intended bottleneck.
Participant 2: One question more towards the application side. One issue that I always found with serverless architectures is unit testing, because the whole application is interdependent, with components calling each other, and to properly test it, you need to have it in the cloud. How did you solve the unit testing issues?
Hanisch: All the application layers, like all our controllers, all our Lambda function code, we would unit test.
Participant 2: For example, Lambda is going to call EventBridge, do you have some mocks for that?
Hanisch: Yes. I would love to say that we are doing it all in a fully automated way, that we build up whole test environments within our CI/CD environment, do our end-to-end tests, and then tear it all down. Maybe at some point, but we don't do that right now. What we do instead is, at least locally, we use LocalStack.
That's a framework, or tooling, that allows you to emulate an AWS environment, so you can start your AWS services locally. Within our CI/CD pipeline, we run unit tests, we do linting, and we do some integration tests where we mock databases. For DynamoDB there exists a Docker image, and of course also for PostgreSQL and those relational databases, so we spin those up. For EventBridge, or those other managed or proprietary AWS services, we just mock them away in our tests. They're not full end-to-end tests, but they at least cover the application layer down to the database layer.
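For the proprietary services, one way to mock them away in unit tests, assuming the JS SDK v3 and the `aws-sdk-client-mock` package (an illustrative choice, not necessarily the team's exact tooling; `createOrder` is a hypothetical handler under test):

```typescript
import { mockClient } from 'aws-sdk-client-mock';
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';
import { createOrder } from './create-order'; // hypothetical handler under test

// All EventBridgeClient instances now resolve against this mock,
// so no real AWS calls leave the test process.
const eventBridgeMock = mockClient(EventBridgeClient);

beforeEach(() => eventBridgeMock.reset());

it('publishes an OrderCreated event', async () => {
  eventBridgeMock.on(PutEventsCommand).resolves({ FailedEntryCount: 0 });

  await createOrder({ id: '42' });

  // Assert the handler emitted exactly one event.
  expect(eventBridgeMock.commandCalls(PutEventsCommand)).toHaveLength(1);
});
```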
Participant 3: We don't use AWS in our case, but generally we have huge-volume APIs that go to the database for data, and normally this data is very large. We queue the requests to the database. What if the containers receive the request and are willing to accept it, but then, when the data is processed in the database and the results come back to the container, the container is having a problem? There is a queuing mechanism when the requests go to the database, but what about the reverse? Is there a way to also queue the response that comes from the database?
Hanisch: We are sending a request to the database, and you're asking whether we can queue the response? Definitely there is a way to do that; you have to build it. When you send the request to the database from your Lambda function, or from your compute environment, you get a result, and then you can just save that result somewhere. I don't see anything that would restrict you from doing so.
Participant 4: Let's say you have a big monolith and you want to split it up. How small should a small piece be? Should it be just one REST endpoint or one entire domain?
Hanisch: We ask ourselves that question a lot, because we have a lot of monoliths we're trying to refactor. Of course, it depends. There is no hard rule that says if your Lambda function's code bundle size exceeds a certain threshold, you're doing it wrong. However, there are methods like domain-driven design that allow you to define what suitable scopes for your microservices are, and then you will naturally see where to put which part. There are methods to get started with those event-driven architectures.
Participant 4: If you have a Lambda which only contains one REST endpoint, do you really then still use an entire REST framework like Flask?
Hanisch: Obviously, you don't need that. Going back to the question we had in the beginning, where we have a Lambda-Lith and that single point of failure: what you also can do is have a dedicated Lambda function per REST endpoint. Your get-orders endpoint invokes one dedicated Lambda function, and your post-orders endpoint invokes another Lambda function. With that, you would also eliminate that bottleneck.
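A CDK sketch of that split, assuming `getOrdersFn` and `createOrderFn` are two separately bundled Lambda functions:

```typescript
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as lambda from 'aws-cdk-lib/aws-lambda';

declare const getOrdersFn: lambda.IFunction;   // assumed existing function
declare const createOrderFn: lambda.IFunction; // assumed existing function

// One dedicated function per route instead of one Lambda-Lith:
// a bug that breaks createOrderFn no longer takes GET /orders down.
const api = new apigateway.RestApi(this, 'OrdersApi');
const orders = api.root.addResource('orders');
orders.addMethod('GET', new apigateway.LambdaIntegration(getOrdersFn));
orders.addMethod('POST', new apigateway.LambdaIntegration(createOrderFn));
```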
Participant 5: Across your three different versions of the application, you always stuck with Lambda functions. Have you ever evaluated going for something like Kubernetes? At least at the very beginning, you had this one fat Lambda, and with Kubernetes, you could have run that with autoscaling and gotten similar functionality. Why did you stick with Lambda?
Hanisch: Because we have absolutely no idea about Kubernetes. Lambda also has its limits, such as certain payload limits you can pass through. If we hit those limits, either we can refactor our application so that it fits, or we would opt for a provisioned solution such as AWS ECS, the container service; then we would go more into an image-based approach. Kubernetes was never an option for us.
Participant 6: My question is about your second project. You mentioned having multiple microservices, each with its own database. With this kind of solution, should we weigh the infrastructure costs against the benefits of this kind of architecture? And should the data be written synchronously to those databases?
Hanisch: I'd like to answer your first question, about having a single database versus multiple ones. Definitely, multiple databases are going to increase your bill. Again, it's a tradeoff. You can also build such a microservice environment with a single database.
Then, again, it's up to you whether you accept the risk of having that single bottleneck, that single point of failure. As we did with this architecture: we also have a single bottleneck, but we accept the risk. It's more about awareness and making conscious decisions. When I talked in the beginning about the optimal architecture: for all of your questions, there is no globally optimal architecture. What we consider good is an optimum only in the sense of a local optimum. There's no globally optimal solution. It's always a decision you have to take.
Participant 6: The second question was about writing data to those databases. Should this be synchronous, or should you just know where you wrote the specific data and, while reading, extract the data from that specific database?
Hanisch: There's no general recommendation. If you, for instance, rely on transactions when you're writing, then of course you want to do it in a synchronous fashion. If some inserts take ages, then why not offload them to a downstream process? It depends.
Participant 7: You said that Lambda processes an event at least once. In case your Lambda has multiple transactions, do you need to implement a rollback mechanism outside of the Lambda, or how would you advise managing that?
Hanisch: You're asking where to handle those duplicated events, whether we want to do it in the Lambda? Personally, I would recommend building all your handlers in a resilient and idempotent way, because you never know. Generally speaking, you want to protect your system. If you consider this whole AWS cloud your system, you only want correct data entering it, but with these serverless delivery guarantees, you cannot avoid duplicates. There's also tooling for implementing your handlers in an idempotent way.
In the case of AWS, or with Lambda functions, you can use the Lambda Powertools. This is a very lightweight package that does it all for you; you only have to pass in your control table, and then it's a small wrapper, and you're safe. There's tooling out there. Generally speaking, I'd recommend you build it all in an idempotent way, because it's just more resilient, and it's very little overhead to implement.
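In the TypeScript flavor of Powertools for AWS Lambda, that wrapper looks roughly like this sketch; the table name and `createOrder` are illustrative, not the talk's actual code.

```typescript
import { makeIdempotent } from '@aws-lambda-powertools/idempotency';
import { DynamoDBPersistenceLayer } from '@aws-lambda-powertools/idempotency/dynamodb';

declare function createOrder(orderId: string): Promise<void>; // hypothetical business logic

// The persistence layer is the "control table": Powertools hashes the
// payload, records it here, and short-circuits repeated deliveries.
const persistenceStore = new DynamoDBPersistenceLayer({
  tableName: 'idempotency-control-table', // illustrative table name
});

export const handler = makeIdempotent(
  async (event: { orderId: string }) => {
    await createOrder(event.orderId); // runs once per unique payload
  },
  { persistenceStore },
);
```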
Participant 6: In terms of migrating existing systems to serverless architectures, is there a way to migrate just part of them? Let's say you take your GraphQL and put an API gateway in front of it, or you take one domain and create a Lambda function for that domain. How do you migrate existing systems to a serverless architecture?
Hanisch: We also have a lot of cases where we have to do that. We have on-prem applications running somewhere deep down in the Siemens intranet, and we want to elevate them to the cloud. Of course, if you have a complex system, you can't do it all at once; you should probably have some roadmap. It can be easier to migrate the whole monolith to the cloud, and then, within the cloud, you refactor, try to identify domains and service boundaries, and then migrate.
Of course, this is also a step-by-step process. What we usually do is set up a connection from our on-prem data center to the cloud environment; there are different options for doing that with probably any cloud provider. Then you just allow them to communicate, and you can move part by part to the cloud.
Participant 8: Modeling with state machines and events can be a bit tricky. How do you handle it when events come out of order? If the events arrive out of order, not in the order you expect, suppose you model a physical system with a state machine.
Hanisch: In our use cases, we don't rely on event ordering. We just don't have the requirement to process all events in order, so that does not apply, at least for us. What you could do, again, is lean on AWS: there are different tools for enforcing or keeping the ordering of your events, for instance, using a certain type of queue, a first-in, first-out queue, and things like that. That way you could ensure ordered processing of your events.
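As a sketch, publishing to an SQS FIFO queue preserves order per message group; the queue URL and field names here are illustrative.

```typescript
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

declare const fifoQueueUrl: string; // hypothetical *.fifo queue URL

export async function publishOrdered(event: { id: string; bookingId: string }) {
  await sqs.send(new SendMessageCommand({
    QueueUrl: fifoQueueUrl,
    MessageBody: JSON.stringify(event),
    // FIFO queues deliver messages of the same group strictly in order.
    MessageGroupId: event.bookingId,
    // The deduplication id also filters exact duplicates within the
    // 5-minute deduplication window.
    MessageDeduplicationId: event.id,
  }));
}
```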
Participant 8: Is EventBridge guaranteed to deliver the events in the order they were published?
Hanisch: I think it does best-effort ordering, again with the same at-least-once semantics. If you rely on strict ordering, this might not be sufficient for your use case.