Building Distributed Event-Driven Architectures Across Multi-Cloud Boundaries

News Room | Published 8 October 2025

Transcript

Teena Idnani: How many of you are already working on multi-cloud? It seems like quite a few of us are already working on multi-cloud. Let’s see what the Flexera 2024 State of the Cloud Report says. It says that 89% of organizations today are working on multi-cloud. Only 10% of organizations are working with a single cloud provider. A mere 1% of organizations are on a single private cloud. What does this tell us? The message is loud and clear: multi-cloud is no longer a choice, it’s a reality.

If you look deeper into the 89% of organizations that have embraced multi-cloud, you will see that 73% of them are utilizing hybrid cloud architectures, that is, combining your on-premise systems with your different cloud providers. Why is that so? That’s because the majority of these organizations are already on-premise, and now they are migrating and modernizing their workloads to different cloud providers.

To understand that further, what we are going to do is take the example of a completely fictional bank, let’s call it FinBank. It’s a 100-year-old, traditional bank, and it’s on a similar journey, that is, a journey to multi-cloud. This bank has built its platform on-premise over multiple decades, and the platform has evolved through multiple re-architectures along the way.

The bank has now decided to adopt multi-cloud. This decision didn’t happen overnight. It’s not like they got up one fine day and said, let’s do something different, let’s go multi-cloud. No, the reason for the bank to adopt multi-cloud stems from the need to modernize its infrastructure and architecture in order to compete with all these FinTech startups, along with complying with the stringent regulations it is subject to. To understand what exactly FinBank was dealing with, let’s take a quick look at what the high-level architecture of this bank looks like. It all starts with finalizing the core capabilities of the bank, like your core banking, credit decisioning, cards.

Then you wrap them up into microservices, aggregate the microservices, and expose them to your different customer channels using an API gateway. In this particular example, the bank we are talking about has built its on-premise infrastructure using event-driven architectures. A lot of these microservices emit events and logs, which are then sent to your platform services, from where the different operations teams pick up these events and messages and use them to do the health monitoring of your application.

Then you also have a lot of other interconnected components, which are talking to your different customer engagement channels. You have analytics and business intelligence. You have third parties, government bodies, regulators. I think what I’m trying to show here is that this isn’t just a high-level architecture diagram. This is a full living ecosystem with hundreds of interconnected components. For a second, just imagine moving these different interconnected components across multi-cloud. That is the complexity I’m talking about. In fact, let me ask another question. How many of you would like to take this architecture to multi-cloud? I can see that not many people are fond of taking this to multi-cloud, but I’m really sorry. You know why? Because you have to.

Context – Migrating to Multi-Cloud

As a part of this session, I’m going to consider all of you as the engineers of this completely fictional FinBank, and we will be taking part of this architecture to multi-cloud. I am Teena Idnani. I’m a Senior Solutions Architect at Microsoft. Today, you and I are going to take part of this architecture to multi-cloud.

You are the engineers of this FinBank, and you need to migrate this to multi-cloud. How do you go about it? A few decisions to start with. You’re going to start small, and you’re going to scale gradually, which means you’re not going to take everything to the cloud at once. You’re going to identify a few independent workloads that you can take to different cloud providers and evaluate them.

First things first, core banking stays on-premises because of the complexity of migrating it, and you really don’t want to touch core banking to start with. You start looking at other components. You take a look at the risk management component. You take a look at your advanced analytics and business intelligence component. You look at what different providers you can take them to. Should you take them to Azure? Should you take them to AWS or any other cloud provider? The evaluation is not easy, but what you’ve decided is, for example, that you’re going to take the risk management component to AWS, leveraging its security services and the global certifications that they have.

Then you decide that the advanced analytics and the business intelligence component is something that you’re going to take to Azure, maybe because of their data and AI capabilities. Then you also decide that you would like to probably centralize your DevOps services in Azure as well to give a unified pipeline deployment experience across all your different components. You get started with this. You start the migration, and you start seeing challenges. You start seeing challenges like, how do you make these distributed systems talk to each other effectively? Then, you start seeing issues. Going further down to what exact issues you start seeing, you start seeing latency. When your events are traversing between the different cloud providers, you start seeing those latency spikes.

Then you feel the need to build resilience in your distributed application, because now you need to have your failover mechanisms run seamlessly across all the different cloud providers. Then it becomes more important for you to take care of the event ordering and consistency, because, again, your events are traversing across your cloud boundaries. Different cloud providers may have a different timeline of processing the events or receiving the events, so that can jeopardize the event ordering that you’re talking about.

Then you end up seeing duplicate events. Again, because your events are crossing network boundaries, sometimes retries happen, and you might end up processing the same event twice, so you can end up with duplicate events, and you need to take care of that. Then observability, absolutely. You need to have common logging and distributed tracing available across your distributed application, so that if you take one transaction, you are actually able to see the full journey of that transaction across different components, across different clouds.

Challenges with Event-Driven Distributed Architectures

What I’ve done is I’ve categorized these challenges into certain categories, and we are going to be looking at each of these individual problems. I’m going to try and share at least one practical implementation or a solution that you can implement in your application. We are going to be looking at latency, building resilience, event ordering and consistency, and duplicate events. Then, towards the end, I’m going to touch upon some other additional considerations, which also become very important when you’re looking holistically in this multi-cloud distributed application, event-driven application context.

1. Latency

Let’s first start with latency. One thing that’s really important when we talk about multi-cloud latency is connectivity. Whether you’re talking about connecting on-premise to Azure, or on-premise to AWS, or even AWS to Azure, and vice versa, one thing that you really need is reliable, low-latency links between these components. Then, there are certain considerations that become important when you’re working on these multi-cloud, distributed applications. Things like, one, your mission-critical applications cannot be dependent on the public internet.

Second, if your application loses connectivity, you would still want to be able to access your data. Different cloud providers have different mechanisms to handle the connectivity between the different components. For example, Azure has ExpressRoute, and AWS has Direct Connect. Basically, they provide you with reliable, dedicated links that help you bypass the public internet and give you high-performing, low-latency connectivity between these different components. Again, it all depends on what exactly your application architecture is, as well as what your organizational requirements are. I think what I’m trying to show here is that networking is key when you’re talking about multi-cloud environments. Even though networking is crucial, just having networking is not enough. You also need to ensure that you are taking care of these latency considerations at your code level.

To show you that, I’m going to show you all a code snippet here, which talks about an implementation that does not take these latency considerations into account, and then we are going to fix it and implement it in a better way. Here is the code snippet. It’s a very typical transaction service implementation. Notice how your operations are structured. We initialize the Kafka producer. After that, we create the transaction event that we want to publish, and we publish it.
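The snippet itself is not reproduced in this transcript. A minimal sketch of what such a naive publisher might look like, assuming Python with the confluent-kafka client and a made-up `transactions` topic and broker address, is:

```python
# Naive transaction publisher: no compression, batching, timeout, or
# partitioning considerations for traffic that crosses cloud boundaries.
import json
import uuid
from datetime import datetime, timezone

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "onprem-kafka:9092"})

def publish_transaction(account_id: str, amount: float) -> None:
    event = {
        "transactionId": str(uuid.uuid4()),
        "accountId": account_id,
        "amount": amount,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Publish with all defaults and wait for the send to go out.
    producer.produce("transactions", value=json.dumps(event).encode("utf-8"))
    producer.flush()
```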

Then, at your subscriber end, let’s say this is your risk management service, which is hosted in AWS. You receive the event, and you do your normal risk processing. Similarly, at the Azure end, you receive the event, and you do the normal processing. You can see here, this is a very simple implementation. It does not have any of the specific considerations you need to cater for latency between clouds. How can you fix it? There are a few optimizations that you can do. Looking at the code snippet again, you are in your producer configuration for the publisher. What you can do is look at compression options to reduce your bandwidth usage between clouds. When your events are crossing cloud boundaries, every byte counts, so looking at such options can definitely help you save some latency.

In this particular example, I’ve used Snappy as the compression option, because I think it’s a good balance between your CPU utilization, your disk and network usage, and your compression ratio. Take a look at your specific application requirements. Check: is your application ok to spend some extra CPU cycles in order to save disk and network bandwidth? Make decisions accordingly. Another thing that we can do is use batch optimization for cross-cloud transfers. What I mean by this is that using larger batch sizes with a small linger delay can really improve efficiency. The approach becomes very valuable when you’re doing hundreds of thousands of transactions.

In fact, to talk about a real-life scenario, when we were doing data synchronization across multi-cloud in one of my previous projects, by using this optimized batch size, with a linger delay of around 50 milliseconds, we actually saw a reduction in end-to-end latency of about 40% to 60%. That is even though each operation had to wait for those 50 milliseconds before being transmitted. You can see these gains when you’re doing things in batches.

Then, next, you can add extended timeout values, specifically calibrated for cross-cloud communication. I’m sure all of us have dealt with default timeout settings. My personal opinion is that we should never go with default settings; it can be very dangerous to use them in your production environment. It’s not easy to choose the right timeout setting for your environment, even for an on-premise or single cloud provider environment. Imagine doing it across multi-cloud; it becomes much more important, and it’s not an easy task. Just like any decision in architecture, it comes with tradeoffs. Similarly, picking these timeout values comes with its own tradeoffs. For example, if you pick too long a timeout value, then it’s not going to be very useful anyway, and you will also end up consuming more resources and increasing latency.

On the other hand, if you pick too short a value, it might mean that you end up retrying your operation a bit too early, without giving your original request a fair chance to complete, which means you are inadvertently increasing the load on your underlying system, and we know how that goes. You end up seeing cascading failures, which have the potential to bring your entire application down. Long story short, choose your timeout values carefully, spend time on them, and ensure that you have control over these values.

Then, you can also make use of account-based partitioning for consistent routing. This helps because, with account-based partitioning, all the transactions related to a particular account go via a particular defined route. That can help in ways like, if you have caching implemented, you can avoid some expensive database lookups and use the cache to get the data that you need. These are the considerations that you should be looking at when you are thinking about latency in your multi-cloud distributed environments. All of these settings differ based on what your specific application requirements are and what your message size and throughput requirements are. The key here is to recognize that you do need these deliberate optimizations at your code level in order to take care of the latency considerations in multi-cloud scenarios.
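To make those four optimizations concrete, here is a sketch of how they might look on the producer side, again assuming the confluent-kafka Python client; the broker address, topic name, and the specific values are illustrative, not recommendations:

```python
import json

from confluent_kafka import Producer

# Producer tuned for cross-cloud publishing. Calibrate every value against
# your own workload; these numbers are only examples.
producer = Producer({
    "bootstrap.servers": "onprem-kafka:9092",
    # Compression to cut bandwidth between clouds; snappy balances CPU cost
    # against compression ratio.
    "compression.type": "snappy",
    # Batch optimization: larger batches with a small linger delay.
    "batch.size": 65536,           # bytes per batch
    "linger.ms": 50,               # wait up to 50 ms to fill a batch
    # Timeouts calibrated for cross-cloud round trips instead of defaults.
    "request.timeout.ms": 30000,
    "delivery.timeout.ms": 120000,
    "retries": 5,
})

def publish_transaction(event: dict) -> None:
    # Account-based partitioning: keying by account ID keeps all events for
    # one account on the same partition, giving a consistent route and
    # preserving per-account ordering.
    producer.produce(
        "transactions",
        key=event["accountId"].encode("utf-8"),
        value=json.dumps(event).encode("utf-8"),
    )
    producer.poll(0)  # serve delivery callbacks without blocking
```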

2. Building Resilience

That is about latency. Let’s jump to building resilience. When you talk about resiliency, one thing that I do want to focus on is that resilience extends far beyond immediate availability. When I talk to people about resilience, what I usually see is that, as engineers, we all put our efforts into how we handle the failures that occur during an outage or when a service component is down. What is equally important is, how do you recover from those failures after the outage is over? I feel like this is sometimes an overlooked aspect when we talk about building resilience. Let’s examine with an example what happens during and after an outage in the scenario that we’re talking about. Let’s see. We have our risk component running on AWS. What happens if that particular component or that AWS service is down? If you do not have resilience built into your application, your event is not going to be retried and you’re going to lose the event.

Similarly, if you talk about Azure, if it does not have resilience built in, the component will not receive the transaction; because there are no retries, no circuit breakers, no replay mechanisms, you’ve lost the event forever. The issues that you see here basically manifest in two stages. One, during an outage. During an outage, your services will keep on hammering these failed dependencies because there are no circuit breakers in place, and that can actually worsen your outage. If you do not have the right retry strategies, then even your transient failures will end up becoming permanent failures, because you’ve not retried them, and had you retried, they might have succeeded. That is during an outage.

After the outage is over, if you do not have this resilience baked in, then because you do not have any event replay, you cannot restore the events that you’ve lost, even after the outage is over. You’ve lost them forever. It will also lead to reconciliation issues. You’re going to have differences between the events that sit in the different cloud providers, and you’ll not be able to reconcile them.

How do you think you can go about fixing such issues? One way that I want to call out is adding resilience by design, which is by bringing an event store in between. This event store can basically be anything. It can be your Outbox Pattern, or it can be a persistent event store whose only purpose is to collect these events and replay them when required. In fact, message brokers such as Kafka have inherent configuration settings that allow you to retain all the events published through them, no matter whether they were processed successfully or not. It’s all about having a store where you persist these events so that later, when you want to replay them, you have them available. That’s one way of doing it.

Then you have other resilience mechanisms that you can add, which I’m going to show. Let’s again look, through a code snippet example, at how we can fix it. Let me walk you through it. The first thing that we do here is, like I said, add an event store. Before you publish your event, you make sure you persist it somewhere: go put it in your event store. Then you can add a comprehensive resilience policy when you are publishing these events. While you could invest considerable time in building your own resilience framework, such products already exist. Polly is a great resilience library that gives you these functionalities out of the box, so just make use of them.

In fact, a lot of our cloud services today, they and their client SDKs also have these built-in resilient implementations that you can directly use. See what is already available before you go and implement your own.
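The code for this is not included in the transcript either. Below is a minimal sketch of the first two steps, persisting the event to a store before publishing and wrapping the publish in a retry policy. Since Polly is a .NET library, the sketch stands in Python’s tenacity library for it; the `EventStore` interface and all names are hypothetical. A circuit breaker could be layered on in the same spot to stop hammering a failed broker during an extended outage.

```python
import json

from confluent_kafka import KafkaException, Producer
from tenacity import retry, stop_after_attempt, wait_exponential

producer = Producer({"bootstrap.servers": "onprem-kafka:9092"})

class EventStore:
    """Hypothetical persistent store, e.g. an outbox table in a database."""
    def save(self, event: dict) -> None: ...
    def mark_published(self, event_id: str) -> None: ...

event_store = EventStore()

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=0.5, max=10))
def publish_with_resilience(event: dict) -> None:
    errors = []

    def on_delivery(err, msg):
        if err is not None:
            errors.append(err)

    producer.produce(
        "transactions",
        key=event["accountId"].encode("utf-8"),
        value=json.dumps(event).encode("utf-8"),
        on_delivery=on_delivery,
    )
    # Wait for the broker to acknowledge the write; a failed delivery
    # raises, which triggers the retry policy above.
    producer.flush(10)
    if errors:
        raise KafkaException(errors[0])

def handle_transaction(event: dict) -> None:
    event_store.save(event)                          # persist before publishing
    publish_with_resilience(event)                   # retried with backoff
    event_store.mark_published(event["transactionId"])
```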

Finally, you can also add an extra step to verify your delivery confirmation. This is basically to check whether your messages are indeed persisted to Kafka before you consider your transaction as complete. It is an additional check that you can do, and the good thing is that these patterns combine to create a system that allows you to handle both transient and extended failures across cloud boundaries. That is how you can ensure that your system is resilient during an outage. Let’s also talk about what happens after an outage is over: how do you ensure that you’re able to reconcile your events or reprocess the failed events? For that, it’s good to build an event replay implementation. Again, it’s a pretty straightforward implementation. The event store that you had from the start, you query it to pull all your unprocessed transactions. After that, you try and publish them again.

If they’re successful, great, you mark them as published. If not, you update the retry count so that they can be retried the next time, up to whatever limit you want. Overall, the combination of event stores, resilience policies, and these systematic event replay capabilities creates a distributed system that not only survives failures, but also helps you recover automatically, which is a very critical requirement for your multi-cloud architectures.
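A sketch of such a replay job, assuming the same hypothetical event store exposes a way to query events that were never confirmed as published, might look like this:

```python
# Periodic replay of events that were persisted but never confirmed as
# published, e.g. because a broker or a downstream cloud was unavailable.
# fetch_unpublished, move_to_dead_letter, increment_retry and the record
# layout are all hypothetical.
MAX_RETRIES = 10

def replay_unpublished_events(event_store, publish) -> None:
    for record in event_store.fetch_unpublished(limit=500):
        if record["retry_count"] >= MAX_RETRIES:
            event_store.move_to_dead_letter(record["event_id"])
            continue
        try:
            publish(record["event"])                          # try again
            event_store.mark_published(record["event_id"])    # success
        except Exception:
            event_store.increment_retry(record["event_id"])   # next run
```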

3. Event Ordering and Consistency

That was about resiliency. Let’s jump to the next challenge that we’re going to look at, which is event ordering and consistency. For that, I’m going to show you all an example. We’ve been talking about on-premise publishing some transactions. We have your applications in AWS and Azure, which are receiving them and processing them further. What can go wrong here? Let’s imagine your on-premise system. It sends a create transaction event. It sends it to both, to Azure as well as AWS. Because of the network latency, Azure receives it first, processes it first, all good. AWS is still waiting for the event to arrive so that it can process it.

While AWS is waiting for the event, Azure does some processing on the event that it received, and then it sends the result to AWS. Chances are that AWS receives message 2, the message from Azure, earlier than message 1, the one from on-prem, and it goes ahead and processes it. What could go wrong with this kind of example? The things that could go wrong are that your transactions can arrive out of order in your risk management as well as in your Azure analytics; the example was just showing one such case.

Then, what that could mean is that your fraud checks may complete after a transaction approval. Maybe the fraud checks were supposed to flag it, but you ended up approving the transaction because you performed your fraud checks later. Also, there’s no sequence enforcement. We are processing as we are receiving the events, and that can have its own issues. Ultimately, you can see inconsistent data handling, which specifically in your financial scenarios can lead to your incorrect regulatory reporting, which we know has its own challenges. It becomes really important then specifically in your distributed multi-cloud event-driven architectures that you need to consider how you handle your event ordering and consistency scenarios.

For that, what are you going to do? There are a couple of things that you can do. At the publisher level, whichever publisher is creating your transaction, make sure that each event gets a strictly increasing sequence number. Your publisher has the responsibility of ensuring that any event that goes out of it has a strictly increasing sequence number. Then, if you are already using account-based partitioning, that also helps, again, for the reason I mentioned earlier: you have related transactions going through the same route, so that can also give you inherent message or event ordering internally.

Then, at the subscriber level, which is very important, you need to do the verification. Each subscriber needs to do sequence verification. It needs to ensure that events are processed in the sequence that it expects them in. It also needs to have deferred event processing: in the example that we saw, if AWS receives the message from Azure earlier than it receives the one from on-premises, then it needs to defer the processing of message 2, so that it waits to receive message 1, completes that, and after that processes message 2. That’s what your subscriber can do.
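A sketch of the subscriber side, assuming the publisher stamps each event with a strictly increasing per-account sequence number (the field names are made up), could look like this: out-of-order events are parked and drained once the gap is filled.

```python
from collections import defaultdict

# Per-account sequence verification with deferred processing.
last_processed = defaultdict(int)   # accountId -> last sequence handled
deferred = defaultdict(dict)        # accountId -> {sequence: parked event}

def handle_event(event: dict, process) -> None:
    account, seq = event["accountId"], event["sequenceNumber"]
    expected = last_processed[account] + 1

    if seq < expected:
        return                           # already handled; drop the stale copy
    if seq > expected:
        deferred[account][seq] = event   # out of order; defer until the gap fills
        return

    process(event)                       # in order; handle it now
    last_processed[account] = seq

    # Drain any deferred events that are now next in line.
    while last_processed[account] + 1 in deferred[account]:
        nxt = deferred[account].pop(last_processed[account] + 1)
        process(nxt)
        last_processed[account] += 1
```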

Then, touching upon consistency as well. Consistency in distributed systems becomes really important. Again, I think it depends on your specific application requirements: what are your consistency expectations? Do you want it to be strongly consistent, or are you ok with eventual consistency? Different components may have different consistency requirements. I’m just going to show one consistency mechanism, to show how that helps. This is your standard consistency pattern. You have your user, which is writing to its primary datastore; let’s call it your event store. All the writes go to your event store, but then the event store very quickly, asynchronously updates a read store from which the user can do the other reads.

Then, to give you that scalability and high availability, your read store can also be replicating to several other replicas. Then, the components which do not require very strong consistency can read from those and move forward. This is a very simple consistency example that I want to show here. I think the key things that I want to highlight over here are two. One, consistency isn’t binary. It’s not like you have consistency or you don’t; consistency is basically a spectrum. For example, Azure Cosmos DB has five consistency levels: strong consistency, bounded staleness, session consistency, consistent prefix, and eventual consistency. It’s a wide spectrum. Two, it depends on your application requirements and how you want to handle them in a distributed setup.
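The write/read split described above can be sketched in a few lines. This is a deliberately abstract illustration of the pattern, not of any particular database; in practice the read store would be a replicated service and the replication would happen over the network.

```python
import queue
import threading

# All writes go to the primary (event) store and are projected
# asynchronously into a read store that other components query.
primary_store: dict = {}
read_store: dict = {}
_replication_queue: queue.Queue = queue.Queue()

def write(key: str, value) -> None:
    primary_store[key] = value             # consistent write path
    _replication_queue.put((key, value))   # replicate asynchronously

def _replicator() -> None:
    while True:
        key, value = _replication_queue.get()
        read_store[key] = value            # read store lags briefly

threading.Thread(target=_replicator, daemon=True).start()

def read(key: str):
    return read_store.get(key)             # eventually consistent read
```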

4. Duplicate Events

That was about event ordering and consistency. Let’s talk about our next issue, which is handling duplicate events. What can go wrong? What can happen if your applications are publishing the same event twice or processing the same event twice? Let’s see it again with an example. This is a code snippet which does not take care of handling duplicate events. It’s simple: if there is a network failure or a retry happening behind the scenes, your on-prem system may end up sending the same transaction twice.

Then, on your service which is hosted on AWS, the risk processing service, it’s ok. The only thing that can happen is that it may end up processing the same transaction twice with respect to risk, which means more resources, but that’s still ok. When you come to Azure, the one which is logging your financial transactions, if the Azure component ends up processing it twice, it basically means that you might have duplicate transactions logged, and that can cause inaccurate analytics reports. It can also lead to inaccurate regulatory reporting. You can see what challenges can arise if you’ve not taken care of handling duplicate events in your implementation. The way I like to present a solution for handling duplicates is basically as a four-level defense mechanism: at every stage, how do you ensure that you’re not letting duplicate events through? You start with your publisher.

At your publisher, the piece of code which is creating or generating the event needs to ensure that it is generating a unique event. One way to do it is to use the CloudEvents schema, for example. Let me take a minute here to talk about the CloudEvents schema. It’s an open specification for describing events, and it can be used across your different cloud providers. It provides a common schema. It has fields like ID and source, which you can use whenever you’re creating your event and wrapping it up in the CloudEvents schema. You can then use these fields, like ID and source, to ensure that whichever event goes out from your publisher is always a unique event, not a duplicate.
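A sketch of wrapping a transaction in a CloudEvents envelope, using a plain dictionary rather than the official SDK; the field names follow the CloudEvents specification, while the source URI and event type are made-up examples:

```python
import json
import uuid
from datetime import datetime, timezone

def to_cloud_event(transaction: dict) -> bytes:
    """Wrap a transaction in a CloudEvents 1.0 envelope. The id and source
    fields together identify the event uniquely, which downstream consumers
    can use for de-duplication."""
    envelope = {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),                           # unique per event
        "source": "/finbank/onprem/transaction-service",   # made-up source URI
        "type": "com.finbank.transaction.created",         # made-up event type
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": transaction,
    }
    return json.dumps(envelope).encode("utf-8")
```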

Then the second place where you can add your duplication check is your producer configuration, if you’re using a message broker which is doing the publishing. For example, Kafka has a producer configuration setting for idempotence (enable.idempotence), which, if you set it, ensures that retries across the network are not going to produce duplicate events. This is basically your first line of defense for duplicate events across the network.
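On the producer side this is a one-line configuration change; with the confluent-kafka Python client, for example:

```python
from confluent_kafka import Producer

# enable.idempotence makes the broker de-duplicate producer retries, so a
# retried send does not write the same record twice. It implicitly requires
# acks=all and a bounded number of in-flight requests.
producer = Producer({
    "bootstrap.servers": "onprem-kafka:9092",
    "enable.idempotence": True,
})
```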

Then, at your subscriber level, you can handle the duplicate events again in your handler implementation. Whenever you receive the event, you first do a check against your processed table, maybe a table where you’re already storing all your transactions. You do a quick check in that table. If your transaction, your event, already exists there, it means it’s a duplicate, so you ignore it. If it doesn’t exist, then you move forward with your processing, and you also add a row to that transaction log table to catch future duplicates. That’s the third place where you can put it. The last place is to make sure that your event handler implementation itself is idempotent, which means that if you rerun the same implementation for the same event, it should not leave the system in a different state. I call these the four layers of defense that you can add in your implementations, and they become really important when you’re talking about distributed architectures across multi-cloud.
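A sketch of the third and fourth layers on the subscriber side, using sqlite3 purely as a stand-in for whatever transaction-log store the service already has; the table and function names are hypothetical:

```python
import sqlite3

db = sqlite3.connect("processed_events.db")
db.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")

def handle_event(event_id: str, event: dict) -> None:
    # Layer 3: check the processed table before doing any work.
    seen = db.execute(
        "SELECT 1 FROM processed_events WHERE event_id = ?", (event_id,)
    ).fetchone()
    if seen:
        return                              # duplicate; ignore it

    process_transaction(event)              # Layer 4: the handler is idempotent

    # Record the event so any future duplicate is recognized.
    db.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
        (event_id,),
    )
    db.commit()

def process_transaction(event: dict) -> None:
    # Idempotent handling: e.g. an upsert keyed on the transaction ID leaves
    # the system in the same state even if it runs twice for the same event.
    ...
```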

5. Other Considerations

With that, we have covered our four major challenges. Now let’s look into some additional considerations that become very important when you are talking about multi-cloud. The reason I did not cover them earlier is that doing them in detail would probably take half a day, so let’s just touch upon them. Let’s talk about security and compliance across multi-cloud environments. I think all of us are already aware how important these security and compliance requirements are, irrespective of whether you’re working multi-cloud or not. What makes it more important for multi-cloud is the fact that your attack surface increases when you go multi-cloud, which means there are more areas for you to be cautious about, to ensure that you don’t have security loopholes and that you stay compliant. That’s why I think security and compliance becomes a very important consideration when we talk about multi-cloud.

Then, schema evolution and compatibility. I think all of you who have already worked with event-driven architectures or event-driven implementations will be aware that you do end up evolving your schema. Some change happens in your schema, and it could be a breaking change, which you then need to take care of. Again, bringing back the CloudEvents schema that I talked about, it has some good versioning fields that you can use to ensure that you are backward and forward compatible when you’re releasing new changes to your schema. I think it becomes a very important consideration when you’re working on such implementations.

Then, observability, logging, and distributed tracing. Absolutely. All of us, I think, understand that we need to log the full event journey across all clouds. That means you need to have good observability and distributed tracing across them. There are cloud-native platforms available that are designed to handle the complexities of these multi-cloud environments. Make sure you dive into them and identify the right platform for doing observability across your different distributed components.

Then, cloud-native versus cloud-agnostic design considerations. This is one of the debates that a lot of teams end up in when they’re talking about multi-cloud. The reason you’re using multi-cloud is that you want to reap the benefits of multi-cloud. You want to use services which are highly performant, but then you also need to be cautious about portability. I think it needs a good balance between which parts you should build the cloud-native way and which parts need a cloud-agnostic approach. This varies from project to project and application to application, but these are good considerations to have. These are good conversations, in fact, to have within your project teams, your architect teams, your engineering teams, to discuss how you would like to implement it.

If you take care of these considerations in your implementations and your designs, I think what you end up getting is a happy bank. Not just a happy bank; it also gives satisfaction to us when we are writing code that can withstand these failures. None of us wants that 3 a.m. call, waking up because the system is down. That becomes even more complicated and complex to manage when you’re talking about multi-cloud. Give yourself peace of mind. Make sure you take care of these considerations well in advance for a good night’s sleep.

Actionable Insights

I am going to leave you with some actionable insights that have guided me in my multi-cloud implementations, and I hope they guide you as well in yours. A few things. Design for failure. Assume that components are going to fail, and design accordingly. We say that anything that can fail is going to fail, potentially at the worst possible time. Make sure you cater for it: you design for failure. Embrace event stores. I think these patterns naturally address many distributed system challenges.

If you’re doing event-driven integrations, then make sure you embrace them. Prioritize regular reviews and optimizations. Absolutely. What works today may not work tomorrow. It’s very important that you continuously review the architecture that you have, review your implementation, look for optimizations, look for refactoring opportunities. Observability, absolutely. You cannot fix what you can’t see. It’s very important to have that in place. Start small, scale gradually. As I talked about at the start, imagine if you were to take all those interconnected components, the high-level architecture that we saw in the beginning, to multi-cloud in one go? That would be really daunting.

The best thing is, start small, scale gradually. Move your independent workloads first, gain confidence, and then expand to other workloads. Invest in a robust event backbone. Absolutely. We’re talking about event-driven implementations; it is good to have a reliable messaging backbone. Then, team education. This is very important. Distributed systems require specialized knowledge. Remember, with great power comes great responsibility. You need to have your teams investing in continuous upskilling. Our cloud providers are expanding at such a high scale and innovating at such a high pace. In order for you to stay up to date with them, and not just one of them, because we’re talking about multi-cloud here, invest in your team education. I like to remember these with acronyms. I came up with this one because we’re talking about a financial bank here: DEPOSIT(S). If you take care of these considerations, what you get is the S for success.

Questions and Answers

Participant 1: I’m wondering about the difference between idempotent consumers and detecting duplicate events on the consumer side. Are they the same thing, or is there more to the idempotent consumer?

Teena Idnani: What you’re trying to ask is whether idempotency is the same as the check at your subscriber’s end? I think idempotency, in itself, means that if you repeat anything, it should not have a negative impact on what the first transaction did. It’s basically the ability to repeat things without impacting anything. It can mean different things in different scenarios. For a publisher, if you have an idempotent publisher, it means that the same events will not be published twice. When you talk about a consumer, an idempotent consumer means that even if it receives the same transaction twice, it’s not going to process it twice. Or even if it processes it twice, it’s not going to leave the system in a different state than what the earlier processing would have left it in.

Participant 2: I see a lot of Kafka in your examples. Did you consider other technologies than Kafka or other tools you can use besides Kafka? Everybody wants to use Kafka, but nobody wants to maintain it.

Teena Idnani: What you’re saying is, the example that I took here was using Kafka as the Pub/Sub mechanism; are there other technologies that you can use?

It depends what kind of distributed applications we’re talking about. It depends where the majority of your workloads are. In this example, the majority of the workloads were on-premise. Basically, you’re looking at an event streaming technology that can help you, in a cloud-agnostic way, push messages to your different environments. There are different cloud services that you can use, depending on your situation. For example, in Azure, you have Event Hubs that you can use for this. It’s again a tradeoff. It depends where the majority of your workloads are. It depends what architecture your organization is already using. We do have different event streaming technologies that you can use to publish your messages.

Participant 3: My question is mainly around assigning unique IDs to events and also ensuring the order on the consumer side. When you have high traffic systems, I believe that might have an impact on performance. What strategies can be applied in order to avoid that kind of situation?

Teena Idnani: What you’re saying is that when we are doing these uniqueness and ordering checks in our application, in a high-traffic scenario, it might impact performance.

Participant 3: Also, when you check the order on the consumer side, obviously, you might need to check whether you’ve already processed an event with a higher order or whether you’ve missed an event with a lower ID, for instance.

Teena Idnani: I think one thing that you can do here is, again, make use of caching capabilities if you can, or keep the database that you’re checking as close as possible to where you’re querying from. Ultimately, this is a tradeoff that you need to make. If you want event ordering, it is going to impact your performance; it’s not the case that you’ll get both of them together. You can make use of things like caching to see if that can help you and provide some benefit, rather than doing expensive database lookups.

Participant 4: In terms of proving, how do you prove consistency? What frameworks would you suggest to help prove it across multi-cloud as well?

Teena Idnani: Frameworks to prove consistency. Are you talking about, for example, session consistency, where your user basically gets what they’ve written in their session? Are we asking how we prove such things?

Participant 4: Yes, rather than just trusting that it would be consistent, how do you prove that the work you’ve done, the code you’ve written, actually provides the consistency throughout?

Teena Idnani: It comes down to the database that you’re using, which is providing those consistency levels. If you’ve decided that for this particular piece of implementation you’re using session consistency, then you can do a quick test and you should be able to see the expected results: if you have made that write in a transaction and you yourself are reading it in the same session, you should be able to see it, and if you’re reading from a different session, you might not see it immediately. I think all of these are ultimately forms of eventual consistency.

With session consistency, you will absolutely get it if you’re reading in the same session, but if you’re reading in a separate session, then it might come after a few seconds. I am not aware of any frameworks that you can use to prove it, beyond trusting that the services which offer those consistency levels are actually providing them, and then validating that in your implementation.

Participant 5: Regarding latency, you mentioned the ExpressRoute service offered by Azure and also PrivateLink in AWS, which don’t necessarily need the message to go via the public internet. Sometimes events are sent from on-premise to one of the cloud services, and then from there they have to go to another cloud service. Is there any service, or anything in the pipeline, through which we can transmit the message from one cloud to another cloud in a more secure way, not necessarily via the internet?

Teena Idnani: How do we send messages from one cloud provider to another cloud provider or maybe a chain of cloud providers? Treat them as different components. You have on-premise, you have one cloud provider, you have another cloud provider, make sure that you are doing the connectivity between those individual components correctly. That’s how you can do it.

 
