Scaling Cloud and Distributed Applications: Lessons and Strategies From chase.com, #1 Banking Portal in the US

News Room | Published 17 December 2025 (last updated 9:19 AM)

Transcript

Durai Arasan: I run the architecture team and management for chase.com in the U.S. Before that, I was chief architect for E*TRADE, another financial institution, focused on trading.

How many of you have seen a situation like this? Typically, what happens is that we plan for maybe 2x, 3x load, but when you put things onto the internet, you don't have any control over who is coming in, when they're going to come, or how the system is going to be used, because that's how the internet is. Any event can potentially trigger it. It could be good for your business, or it could be bad actors coming in and trying to steal things. Both happen. If it's bad actors coming in and you have controls in place, you can block them. But what if it's actually real customer demand, something that happened in the market conditions? We are all seeing a volatile market, things are going crazy. We are impacted, and we need to really tackle a situation like this, because the customers want to complete their financial transactions.

In that situation, what happens? How do you really tackle it? You have not planned enough, so how will you meet this demand? What can break when things go wrong? Many things. Your network device can go bad. Your load balancer can go bad. Your application, your database connections: many things can break, and they can all break at once as well. When it comes to scaling, you have to really think about how you make sure you can tackle situations like these. That is what we are going to be sharing: some of the goals we have and the strategies that address them. I'm going to focus on three goals and then talk about some strategies addressing those goals.

Then, at the end, I will wrap it up with how we achieved this as part of the chase.com cloud migration, which is huge. If you are into managing large-scale systems, these are going to be very valuable lessons. I've done this for many years, with Chase, and before that with other financial companies. That is the experience I'm planning to share here, so that you can all utilize it in your own companies.

Chase.com, Overview

Chase.com is part of JPMorganChase. It's currently focused on the U.S., and we are here in the UK as well with a digital bank. I met some folks and they were all giving very positive views about the Chase UK bank. I'm going to be sharing more from the U.S. experience, where we have 84 million-plus customers, a number that's growing regularly. Of those, 67 million are active customers, meaning they're digitally active. We do have quite a number of branches people can walk into. They use our web and mobile applications, and they use the payment systems, which you use regularly. In terms of the volume, you can understand how this is built and utilized on a day-in, day-out basis. Roughly around a million customers log in on a daily basis, whether on web or mobile. That means we need to make sure this is always available. That is the promise to our customers. We want to make sure we provide services, always on. That's how we've kept it the number one portal in the U.S.

What are Key Goals?

Let's look at the goals. What goals did we have as we went to the cloud? There are many others, but for this talk, I want to focus on three key things. First, scaling: not just scaling, but scaling efficiently. We'll get into the details of what that means. Then being highly resilient. That is very important, especially when you're a financial institution. Then performance. Does anybody like a slow site? People don't like that; they move away to other things. You want to be performant, and how do we really achieve that?

1. Scaling Efficiently

Let's start with the first one, scaling efficiently. When it comes to scaling, people talk about vertical scaling, putting more power into the systems, or horizontal scaling, elastic scaling, and all that. When it comes to being efficient, you really need to look at your customer pattern: how are they using it? How can you be predictive about what they are doing? Then being adaptive, leveraging elastic scaling. Then, traffic shaping. Traffic shaping is one nice way of understanding where the functionality is most useful for customers, day in, day out. What are they using on a daily basis? You divide this up into different categories and then focus on that, so you can scale your critical applications. We'll get into more details later.

Then there's capacity management overall. You can't just throw in more servers and expect everything to work. There are tradeoffs with cost, and we will talk about cost management as well. When it comes to scaling efficiently, the traffic patterns are important. As you can see in this graph, you look at the average traffic. On a regular basis, you have certain traffic that comes into your site, and you're probably managing it; then typically you may have other predictable patterns. What does that mean? There are some events that may be happening on a regular basis. Whenever there is a paycheck coming in, customers want to log in and check whether the money is in the bank. That's one of our patterns. In your business, you may have a slightly different pattern.

On a daily basis, maybe they're coming and checking something. Then there are seasonal peaks. Different events happen throughout the year, and so you can plan for them. These are things you know, which you can predict. Then you have unexpected events that you have no control over. Like I said, this is the internet after all. You can have bad actors coming in, hitting you. DDoS is one of the things people do all the time, and being a popular bank, that's one other thing we get. We attract these bad actors coming in, hitting things. They hit not just 10x, but even higher. They are able to utilize the same cloud we are all using; they have distributed compute available. Then your load can spike, and it can attack your applications. You need to tackle them. You need to block them.

At the same time, can you provide the same SLA to your customers who are coming in to genuinely execute their transactions? These are the things you have to tackle when you are looking at a big, large-scale system. How do you make sure you address that?

To do that, you need to take care of sizing things. Does size matter? Of course, yes. You need to right-size it. You need to follow the patterns we spoke about, so you can plan ahead. Because when you talk about elastic scaling, people do think, I have this scaling, I configure it, and it's going to magically work. It can work to some level, but what happens while your scaling is occurring? Your application is starting up, it's booting up. It needs to connect to some services. It needs to connect to your databases. Each one takes time. By the time your instances are up and running, in a ready state, you've lost a few minutes, and those few minutes could be vital for your customers. If large volumes are there and everything is booting up, then you introduce contention. Don't simply rely on elastic scaling; think about the overall picture, the patterns, and other things. Then, reserve compute. Having compute reserved is also important. One, it can be available for you whenever you need it.

Two, if there are contentions happening across other companies, because you're using the shared pool of services available, that could be a problem. Then, cost savings: when you have reserved compute, that can really benefit you in terms of cost as well. As we talk about cost, does cost matter? Does everybody have infinite money? The answer is no. That means we need to make sure we balance the cost, the scaling, the customer usage, everything. There is a process, FinOps; I think most people are probably familiar with it. We apply it, and sometimes people do it only once in a while. No, you need to do it on a regular basis, monthly or weekly. You need to look at what's there, because it can potentially cause a lot of damage financially.

In some cases, if you have specific business needs, you may have other ways of doing it. You may keep extra capacity. If your performance is super important because you're running critical trading applications and you want to absorb volatility, you can do that. It depends on the use cases. You may apply different logic, and cost may play a different role in how it works with scaling.
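
As a back-of-the-envelope illustration of the boot-up delay discussed above: if an instance takes a few minutes to reach a ready state, the autoscaler has to fire before utilization hits its real limit, not at it. A minimal sketch, with all figures hypothetical:

```python
# Sketch: pick a scale-out threshold that accounts for instance warm-up time.
# All figures are hypothetical, not chase.com's actual numbers.

def scale_out_threshold(max_utilization: float,
                        traffic_growth_per_min: float,
                        warmup_minutes: float,
                        safety_margin: float = 0.05) -> float:
    """Return the utilization level at which scaling must start so that
    new instances are ready *before* utilization hits max_utilization."""
    # Utilization that will accrue while the new instance boots and
    # connects to databases and services.
    growth_during_warmup = traffic_growth_per_min * warmup_minutes
    return max_utilization - growth_during_warmup - safety_margin

# Example: cap at 90% utilization, traffic grows 5% per minute,
# instances take 3 minutes to become ready.
threshold = scale_out_threshold(0.90, 0.05, 3.0)
print(f"scale out at {threshold:.0%} utilization")
```

The point of the sketch is that the longer the warm-up, the earlier (and more conservatively) the scale-out trigger has to fire, which is also why reserved, pre-warmed compute helps.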

Again, scaling isn't just about adding servers. People think, I can throw a server at it and solve the problem. No. The other way to look at it is: when you are scaling, does the application really work well? Is the application scaling because of genuine customer demand, or is it because some upstream services are getting queued up and slowing down the system, so your threads are waiting on them and not able to execute?

That's going to introduce pressure on the CPU. It's going to introduce pressure on the memory. That can trigger your elastic scaling to say, I need more capacity now, even though it's not demand that's growing; it's the other services backing up. This is one use case where you really have to think about how to design for failure in a way that ties into scaling. You can introduce a circuit breaker and make sure there is a limited time it waits for a response, whether that response is a success or a failure. It's important that you don't rely on just adding more servers.
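
A minimal circuit-breaker sketch along those lines, with thresholds and cool-downs purely illustrative: bound the cost of a backed-up upstream so threads fail fast to a fallback instead of piling up and triggering false scale-outs.

```python
# Sketch of a circuit breaker: after repeated failures, skip the slow
# dependency entirely for a cool-down period and serve a fallback.
# Thresholds and timings are illustrative, not production values.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, avoid the dependency until the cool-down elapses,
        # then allow one trial call (half-open).
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None  # half-open: try the real call again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result

# Usage: wrap the upstream call; return cached/degraded data on failure.
breaker = CircuitBreaker()

def flaky():
    raise TimeoutError("upstream queue backed up")

for _ in range(5):
    print(breaker.call(flaky, fallback=lambda: "degraded response"))
```

Once the circuit is open, requests return immediately with the degraded response, so CPU, memory, and thread pressure stop feeding false signals into the autoscaler.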

2. Highly Resilient

Let's look at resiliency. Another key point of our overall goal is being prepared for a failing system, because anything will fail, and you need to make sure you prepare for it. Detecting early is very important. Then being ready to fail over. If things happen, do you want to do a failover? The answer is mostly yes. In some cases, probably no, and we have some answers; we'll talk about those scenarios. There are some famous quotes about this, like Werner Vogels' "Everything fails, all the time", and "Whatever can fail, will fail". You need to think about that.

For any of those components you're using, you need to make sure they work well when situations like these happen. How do we prepare for it? Not everything can be 100% available, like we said earlier. The way we look at it is in four categories. We divide our infrastructure components into these four categories, and for the ones identified as supercritical, in the top section, we want to make sure they are 100% available. Think of DNS. You may have a best-architected site running really well, but if your DNS fails, nobody can reach you. You need to think about what the critical components in your architecture are, and make sure those are at 100%.

Then you go down to the next layer. That's, let's say, your application. We call it the manageable category, in the sense that if the application fails, you can have a failover, and things continue to work fine. That means you can give four nines of availability in that case, because you can withstand some failures. Four nines means you can afford to fail for about 52 minutes in a year. That could be a very good thing if you're able to build it, because you can't make everything 100% available. Then it comes down to the tolerable category. You can put certain components into the tolerable category. Say you have some token service you're using: you retrieve the token and are able to cache that data for an hour, for a day, whatever time suits your need.

Then, during that period, if the service goes down and is not available, you can continue to work with the data that's available in cache. That means it's tolerable within a certain period of time; if everything is out for days, that's not the situation we're talking about. An outage of a few hours could be tolerable. The last one is the acceptable category: there are certain things you can accept to lose, maybe some logs or other things. If you think about it this way and look at your architecture, you can focus on what's critical and what other targets you can set. The impact severity can define your resiliency target.
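
The four-nines figure quoted above comes from straightforward arithmetic; a quick check of the downtime budget per tier:

```python
# Quick check of the availability arithmetic: downtime budget per year
# for each tier of "nines".

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for nines, a in [("three nines", 0.999),
                 ("four nines", 0.9999),
                 ("five nines", 0.99999)]:
    print(f"{nines} ({a:.3%}): {downtime_budget_minutes(a):.1f} min/year")
# Four nines works out to roughly 52.6 minutes a year, matching the talk.
```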

3. Performance

Now let's look at performance. Like I said, not all applications are the same. You can use points of presence, and we'll talk more about that, to provide a better experience to your customers, because nobody likes lag on websites, especially on mobile. Does speed matter? I asked this question to ChatGPT, because a talk without AI in it these days is probably not a good thing; I wanted to put something in. ChatGPT said, yes, speed builds trust. People really want a better, faster experience. Even Google uses speed as a factor when doing ranking. Especially on mobile, where the network can be constrained, performance is even more important.

The last thing is infrastructure. If your customer is spending less time on your infrastructure to achieve the same goal, it costs you less. Speed is very important. With all those goals, what were we able to achieve? I wanted to give you a headline at the beginning: we were able to slash latency by 71% by applying all these different strategies, from when we started to where we are now. Those are the architecture strategies I want to share with you so that you can achieve the same in your own business.

The Power of 5: Key Strategies

There are five focus areas we're going to talk about. Multi-region: I think all day you're hearing about multi-cloud and all that, so we'll talk about multi-region as well. High performance. Automation, and how automation really helps. Observability with self-healing: without self-healing, having observability is not ideal; it doesn't give you the best result. The last one, we wrap it up with robust security. Without security, there is no bank. We need to make sure we address that part.

Multi-region

Why multi-region? Multi-region really helps you create isolation. You can create segmentation as part of your functional separation, and that allows you to withstand region failures, zone failures, or network failures.

Then you can contain your blast radius. What does that mean? When you're talking about 84 million customers, you can restrict the impact so that when one zone is having failures, only a small percentage of customers are affected, not the entire 84 million population. That is what we want to focus on when there is an issue, because failures are going to happen, and you need to contain the blast radius. When it comes to multi-region, what are the focus areas? What complexities come into it?

You need to really look at DNS management, because when you have different regions, there are going to be load balancers in each one of them. You need to manage how those load balancers work. How do you do traffic management between them? Within a region, when you have multiple zones, what can you do? Then, do you want to replicate the data, cross-region replication? Those are things you need to think about, applying what makes sense, as you handle multi-region.

Let's look at a scenario with zonal failures. We have the load balancer sending traffic to two zones in a single region. Each application says, I'm good, healthy, and the zones are looking healthy, so the traffic keeps coming into each of those zones. What if one of those applications is connecting to the backend systems, egressing out, and experiencing a problem in one of the zones, but not the other?

The traffic is going to keep coming, and if you do not detect the problem, you're going to have an issue. The application, yes, has its own readiness probe and liveness probe, but if you don't include the dependent systems and feed them back into the health check, the load balancer is going to keep sending traffic, and your application is going to fail. Either you can have the readiness and liveness probes report that message back to the load balancer, or you can have proxies reroute to the other zones to tackle that. Both internal and external failures need to be managed pretty effectively; that helps address downtime for any application.
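
As a sketch of the dependency-aware health check described above (the check names here are illustrative, not an actual Chase API): fold the dependencies an instance needs into one Boolean the load balancer can act on.

```python
# Sketch: a readiness check that folds dependency health (DB, cache,
# egress to backend systems) into the answer the load balancer sees,
# so a zone with a broken egress path stops receiving traffic.
from typing import Callable, Dict

def safe(check: Callable[[], bool]) -> bool:
    # A probe must never crash; a failing check just means "not ready".
    try:
        return check()
    except Exception:
        return False

def ready(dependency_checks: Dict[str, Callable[[], bool]]) -> bool:
    """One Boolean for the load balancer: ready only if the app AND every
    dependency it needs to serve traffic are healthy."""
    return all(safe(check) for check in dependency_checks.values())

checks = {
    "database": lambda: True,
    "cache": lambda: True,
    "backend_egress": lambda: False,  # simulating the failing zone
}
print("ready" if ready(checks) else "not ready -> LB drains this zone")
```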

Then, regional failures. Now you're putting all that together, and if you have multiple regions, you want to make sure there are pulse checks happening at the regional level as well. Here you can really think about whether you want these failures to trigger a complete failover to another region, or whether you can afford to run degraded services. When I say degraded services, people may raise eyebrows: really, do you want to do that? That depends on how you segment your application. Are critical services failing? Then you may want to do a failover. If it's not that critical, you can continue to keep that application running, because you don't want to cause any further impact.

Any time you do a failover, that is going to create thundering herds and other effects, so you don't want traffic shifting that creates more problems. You need to really look into that aspect of it. The health check criteria really determine how this is going to work, including the failure and success thresholds of that application health check.
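
Those failure and success thresholds translate directly into detection and recovery times; a quick sketch with hypothetical probe settings:

```python
# Sketch: how probe interval and thresholds determine how fast an
# unhealthy instance is detected, and how fast a recovered one returns.
# The settings below are illustrative.

def detection_seconds(probe_interval_s: float, failure_threshold: int) -> float:
    """Worst-case time for a health check to mark an instance unhealthy:
    it must fail `failure_threshold` consecutive probes."""
    return probe_interval_s * failure_threshold

def recovery_seconds(probe_interval_s: float, success_threshold: int) -> float:
    """Time for a recovered instance to be marked healthy again."""
    return probe_interval_s * success_threshold

# e.g. probe every 10s, 3 consecutive failures to fail, 2 successes to recover
print(detection_seconds(10.0, 3), recovery_seconds(10.0, 2))  # 30.0 20.0
```

Tight thresholds fail over faster but risk flapping on transient blips; loose ones are stable but leave a bad zone taking traffic longer. That tradeoff is the "criteria" decision the talk refers to.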

What challenges may you see when it comes to multi-region? When you're building multi-region, replication comes at the top. How do I replicate data across regions and get data consistency? That is important to think about. You can tie that back into sharding. In our case, the customers are distributed all over the country, and the data centers are only in a few places. Sharding customers so they can be served closer to where they are may be a good way to address that problem.

Then you can potentially avoid replication altogether, which simplifies the architecture. Managing state: do you want the state to be distributed, or do you want to keep it in a region? That's another optimization, because replicating everything everywhere, with a customer's one call going to one region and the next call going to a different region, is not ideal. You want to manage it so you have a sticky session for that particular session, and then, if there is a failover, they can switch. Otherwise, they stay in the session. These are things you can apply so that you can handle failovers in a very effective way.

High Performance

Now let's look at performance. Obviously, high performance is important. The way I see it, good performance is like a strong dial tone. When you pick up a phone, if you don't hear a dial tone, what do you think? People will bang on the phone to see, do I get the dial tone? That's not the experience you want. You want to be free from lag. The same thing applies to websites and mobile applications. When you open the app, if things are still spinning, you don't want that experience. People are going to kill that app and move on to something else. To achieve some of this, edge computing is one easy way. You can really maximize the use of edge computing.

If you look at any modern website these days, all fancy, nice UI and all that, it's very heavy on content. The content can be offloaded. You can push it out to PoP locations, very close to the customer, and only handle the data part at the origin, focusing on the critical services: your login, your accounts, payments, and things like that. Focusing on that can really achieve higher performance. That brings us to what I call traffic shaping.

If you really look at it this way, you make your applications focus on: what is critical traffic? What is high-value traffic? What is the rest of the traffic? Critical traffic is what you can't have a business without. In our case, people come day in, day out, log in, check the balance, do payments. These are super critical, life-affecting things people need on a daily basis. You cannot make these services unavailable. High-value traffic is another set of services; you can bundle them. It depends on your business how you categorize them. Make sure the resources allocated to those critical services are always on, even if the other traffic is sometimes not fully operational, or degraded, which may be acceptable when there is stress. In a normal situation, yes, everything works really great. Under stress, you want to be able to define the priority and make the routing happen.
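
The tiering above can be sketched as a small admission rule; the endpoint names and stress levels here are invented for illustration:

```python
# Sketch of traffic shaping under stress: classify requests into tiers
# and shed only the lower tiers when capacity is constrained.

CRITICAL = {"login", "balance", "payments"}   # always on
HIGH_VALUE = {"transfers", "statements"}      # keep if at all possible

def admit(endpoint: str, stress_level: int) -> bool:
    """stress_level: 0 = normal, 1 = degraded, 2 = severe."""
    if endpoint in CRITICAL:
        return True                # never shed the business-critical tier
    if endpoint in HIGH_VALUE:
        return stress_level < 2    # shed only under severe stress
    return stress_level == 0       # best-effort traffic sheds first

for ep in ("payments", "transfers", "marketing_banner"):
    print(ep, admit(ep, stress_level=1))
```

Under normal load everything is admitted; as stress rises, best-effort traffic drops first and the critical tier keeps its resources, which is the routing priority the talk describes.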

When it comes to content delivery, like I said, we are here in London. Our servers are, let's say, in New York. If I had to go from here to New York for every one of the assets I'm downloading, there is a physical barrier: the network latency from here to New York. What if that same content is available closer to London, at a PoP location? Now it's cached, and I get it in a few milliseconds compared to maybe 100 milliseconds going all the way to the origin server. The same applies to security. If there are bad actors coming in, you don't want them hitting the origin server. Instead, they are all stopped right at the edge, and the malicious traffic gets blocked. It also saves cost: all the bandwidth and network you'd otherwise be hitting doesn't have to be incurred.

Then the other thing, when it comes to high performance: many folks really don't pay attention to last-mile connectivity. It ties back into content delivery as well. I wonder about this whenever I see a picture like this: how is this happening? People may think this is very cost effective. Probably, yes, if it goes to one location. But what if this person is delivering to 12 different places, with multiple stops along the way? When you make a call from one service to another, internet operations are very much like that: there are multiple network hops happening. Instead, if you are using edge computing and content delivery systems, from your home to one of the edge locations there's probably one hop.

After that, the network is well-optimized, because providers operate an optimized network that performs much better than the typical internet going ISP to ISP. It's important to pay attention to how performance works in the last mile. Another place to optimize is mobile. You have a mobile application; in our case, the majority of our customers use mobile. You have free space as part of your application, so you can cache many things in it, including name resolutions and configuration settings, and prefetch some content into the app. That boosts your performance.

Automation

The next strategy I want to focus on is automation. I think everybody is probably doing some part of it. In our case, the entire pipeline, every step of the way, is automated. That gives a lot of benefit for your deployment and for provisioning your infrastructure. Then we have a concept called repave, which cleans the environment, removes all the stale resources, and keeps the environment pretty safe. We'll talk a little more about it. Then the health check, tying health checks back into actions. Then overall traffic management. Every one of those major areas, we automate. That really helps.

If you're putting a person in charge of managing a cloud environment, that is going to lead to a lot of outages, because you need to get the person on the call and they need to be ready to take action. Or maybe they need to have a meeting, and by the time you're through, you've wasted time and your customers are not interested in continuing with your site. Automation helps. We'll talk a little more about the health check and how we have integrated it into the systems. Then about architecture. It's all about architecture; I'm an architect. When we talk to other people about architecture, people think that architecture lives only in PowerPoint. How do you take the architecture from the PowerPoint into automation? You create an opinionated architecture. People build their applications using it, and those applications can be deployed automatically using the manifest you built. You give a template, and people fill in that manifest.

Then you feed that into an automated script with Terraform-like infrastructure building, and it can be delivered. What's the advantage? You hire a talented application team, and they focus on the business functions. They build the business function and use these automated tools to deploy their applications. That way, they're not spending time figuring out how Terraform works, or Kubernetes, or why something is not working. They are focusing on the business function, which is what matters to your business.
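
A minimal sketch of the manifest idea, with field names and defaults invented for illustration: teams supply a few values, and tooling merges them over opinionated defaults before handing off to a Terraform-style pipeline.

```python
# Sketch of "opinionated architecture": application teams fill in a
# small manifest; tooling validates it and expands it with the
# platform's defaults. Field names and defaults are hypothetical.

TEMPLATE_DEFAULTS = {"min_instances": 2, "max_instances": 10, "zones": 2}
REQUIRED = {"app_name", "tier"}

def render_manifest(team_input: dict) -> dict:
    """Merge a team's manifest over the opinionated defaults, rejecting
    manifests that omit required fields."""
    missing = REQUIRED - team_input.keys()
    if missing:
        raise ValueError(f"manifest missing fields: {sorted(missing)}")
    # The rendered manifest would then feed a Terraform-style pipeline.
    return {**TEMPLATE_DEFAULTS, **team_input}

m = render_manifest({"app_name": "payments-api", "tier": "critical"})
print(m)
```

The design choice is that teams only declare what is specific to their application; scaling, zone placement, and other infrastructure decisions come from the template, so they stay consistent across the organization.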

Repave is one other thing we do very effectively: repaving the infrastructure. It's like being reborn every sprint. Every sprint, you have automated ways of cleaning up the instances that are running. How does it help? It really helps with the security aspect. It eliminates drift. Any drift, any latest patch you want to push, including fixes for zero-day vulnerabilities, it all comes in. In the old way, systems or infrastructure would run applications for much longer, and that creates stale resources, sometimes degrades performance, and introduces security problems.

By doing this, you recreate the environment every week or two weeks, whatever interval you define, and it happens automatically. You take the traffic out of the running system in a very easy way, then recreate the environment by rebuilding and relaunching it. That gives stability. No more manual changes; anything like that gets cleared. It makes the whole system more efficient. This is a model we use across JPMC; many people we talk to, within Chase and outside, think it's really hard to achieve. Once you put it into practice, it's really going to help your applications a lot.

Now let's look at failover when you introduce automation: automated failover with graceful degradation. What does that mean? When you do a failover, you look at the existing sessions, the customers already there whose requests are being processed: what happens to them, versus a new session that's coming in, how do you reroute that? You also want to avoid failover loops. If things are failing in one place, does it fail over to other places where they are also failing? You don't want that loop, so you want to avoid it. Some latency and other effects could be acceptable in some cases. If there are non-critical service failures, you may continue to stay where you are. If not, you can push traffic to other zones or regions, whatever backup is available to you.

Observability and Self-Healing

Now let's look at observability. There are many events in the cloud: there are a lot of different components, and everything emits events, system events, infrastructure events, and your own app can emit them too. You can observe everything, and you need to make sure you act on it automatically. That is where automation really comes in to help, making sure it ties into your observability. You can write serverless functions that get triggered automatically when an event comes in, and they can do a regional switch. You can apply all these criteria, and it can do that.

If there is a database problem, another function kicks in and does the database switch. Or if you're doing maintenance and you want to block a certain region or certain VPCs, you could have a serverless function do that as well. These are some examples of automated actions you can build in, making sure they tie back into observability. Having eyes on the glass is not the ideal way. You may have that as an added advantage on top of these actions, but it is not the only way to operate. Self-healing with observability is the way to achieve this. Let's look at how you really check health, and where. When it comes to health checking, you have to do it at different levels. Let's start from the application.

The application will probably have a very complex way of determining, am I healthy or not, because it may be doing many things: whether the application itself is running fine, whether the connectivity to the databases, cache, or other systems is healthy. You can have complex criteria management within the health checker, but when it returns, you cannot have complexity. You cannot have different variables saying this is happening or that is happening. It should be as simple as a Boolean: am I healthy or not healthy? That is the model. Within the application, you have a health check, and that propagates up to the zonal level.

Then the zonal level looks across all the instances. Then you push that up to the VPC level: is the whole VPC healthy or not? Then that goes into the overall global router. At every level, you can automatically see whether things are healthy or not, and make easier decisions with simply a Boolean flag. That is how you achieve self-healing by utilizing this health check.
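
The Boolean roll-up can be sketched in a few lines; the topology below is illustrative:

```python
# Sketch of the Boolean health hierarchy: instance health rolls up to
# zone, zone to VPC, VPC to the global router. Each level answers only
# "healthy or not", so routing decisions stay simple.

def zone_healthy(instances: list) -> bool:
    # A zone stays healthy while at least one instance can serve traffic.
    return any(instances)

def vpc_healthy(zones: dict) -> bool:
    # A VPC is healthy while at least one of its zones is.
    return any(zone_healthy(i) for i in zones.values())

topology = {
    "vpc-east": {"zone-a": [True, True], "zone-b": [False, False]},
    "vpc-west": {"zone-c": [True, False]},
}
for vpc, zones in topology.items():
    print(vpc, "healthy" if vpc_healthy(zones) else "unhealthy")
```

The aggregation policy (here, `any`) is itself a design choice: a stricter roll-up might require a minimum fraction of healthy instances before calling a zone healthy.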

Let's look at an example. You get an alert that nodes are not available. Your criterion is: capacity is compromised. Now I may want to move away from the VPC, because maybe there is a provider issue and the nodes are no longer available, so I have to move the traffic; I can't stay here. Or say application alerts are coming in that latency is a problem. In this case, performance is compromised. You can determine whether to continue with degraded services, or decide that there is a business demand and an SLA to meet, which means you may have to move. You can decide based on your business criteria. In this case, we decide to continue with the degraded services, meaning things are slower, rather than going into another zone, because the same issue may potentially exist there as well. I'm going to stay put.

The other one is gray failure. It's no man's land: it's not deterministic whether things are really failing, because requests still connect. Maybe it's a network-related failure, because the network is often harder to pin down exactly. There, the business function is compromised. Should we reroute to a healthy zone? Yes, that may be an option. You can apply these different ways of taking action, tied back into the observability.
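The three scenarios above form a small decision table: capacity compromised means leave the VPC, performance compromised means stay and run degraded (unless the SLA is at risk), gray failure on a business function means reroute. A minimal sketch, with illustrative criteria names rather than real policies:

```python
# A minimal sketch of the decision table described above. Criteria names and
# actions are illustrative; real policies would come from business SLAs.

POLICY = {
    "capacity_compromised": "reroute",           # nodes unavailable: leave the VPC
    "performance_compromised": "degrade",        # latency alert: stay, run degraded
    "business_function_compromised": "reroute",  # gray failure: move to a healthy zone
}

def decide(criterion: str, sla_at_risk: bool = False) -> str:
    action = POLICY.get(criterion, "observe")
    # Business override: degraded service is acceptable only while the SLA holds.
    if action == "degrade" and sla_at_risk:
        return "reroute"
    return action
```

Encoding the policy as data rather than scattered conditionals makes it easy to review with the business and change without touching the automation itself.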

Robust Security

Now let's look at the security aspect. Security thrives in layers. We moved away from network zones into a zero-trust model where every layer is important, starting all the way from the client. Your mobile device or web browser could be compromised with malware or other things. You need to look all the way from the client to the perimeter, then the internal network, your container, your application, making sure authentication and authorization are taken care of, and ultimately the data, with encryption and privacy controls. Each layer fortifies the next. We need to keep a holistic picture of security across the different layers.

How Did We Do It? (Migration of chase.com)

We looked at the goals and the strategies and how they tie together, and now we can see how we applied them, so that you can do very similar things in your own company. We'll look at the migration of chase.com.

The first thing is the cultural shift. It's very important to have that mindset, because the cloud operates differently from old on-prem systems: the cloud provider updates, network policies change, browsers change, a lot of things change. Look at the Well-Architected Framework and the other things we talked about; there is a long list of aspects you can look into, and principles you can apply to your applications.

"You build it, you own it, you deploy it": that model gives the responsibility to the application team. Humans forget. We all forget things, and sometimes we get lazy about certain steps, but machines do not. That's where automation really comes in handy. When it comes to testing and verifying things, people use Chaos Monkey and the like, but that is a reactive way of doing it; you just go and break something. Failure mode effect analysis helps you do predictive analysis: you try to predict what will go wrong, and you systematically analyze each component. You can do both, but we prefer the predictive approach because we can test every layer of the application, making sure we analyze and develop strategies to mitigate failures.

One other thing we have developed within Chase is called TrueCD. This is Chase's way of doing CI/CD: a 12-step automated process. We have written up a lot of the details in a blog post; there's a link, and if you're interested, check out our blog that walks through how we achieve it. Think of it like every time you take a flight: there is a safety check before take-off. This should be like that. No compromises on going through this process.

The other important thing: any time you go from on-prem to the cloud, or you're already on one cloud and going to multi-cloud, what is going to be impacted? Your application. The application has a lot of logic, and if you're constantly making changes, you're introducing change, and that can create side effects for your business. You don't want that. An abstraction layer really helps to minimize the changes. It still allows you to use the best-in-class components available on one cloud, multiple clouds, on-prem, or a combination. We built our own abstraction layer that insulated us, gave us the ability to run on-prem and in the cloud, and also allows us to go to additional clouds. If you are interested, Dapr is a good open-source framework for this; it's a pretty good way of achieving it, and it supports multi-cloud.
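The abstraction-layer idea can be sketched as an interface that application code depends on, with provider implementations behind a factory. This is the general pattern, not Chase's actual layer (Dapr offers a production-grade version of it); the interface and provider names are illustrative.

```python
# Sketch of a thin abstraction layer over cloud services, so application code
# never calls a provider SDK directly. Names are illustrative.

from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """The only storage interface application code is allowed to see."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """On-prem / test implementation; a cloud one would wrap S3, GCS, etc."""
    def __init__(self):
        self._data = {}
    def put(self, key: str, data: bytes) -> None:
        self._data[key] = data
    def get(self, key: str) -> bytes:
        return self._data[key]

def make_store(provider: str) -> ObjectStore:
    # Swapping clouds means adding an implementation here,
    # not touching business logic.
    if provider == "memory":
        return InMemoryStore()
    raise NotImplementedError(provider)
```

Because business logic only ever sees `ObjectStore`, moving from on-prem to one cloud, or from one cloud to several, is a configuration change rather than an application rewrite.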

Then, moving customer traffic. Again, with large applications, it's not easy to build, deploy, and move customers in one go. It takes time, so you do it in multiple steps. You can prove the system out and see that things are working, perhaps start the testing with your internal customers, and then let the application bake in. Sometimes people are in a hurry to jump in, but some of these issues and patterns, remember the patterns we talked about, may not show up for some period of time. You need to allow the application to run and allow for optimizations. Then, if you have a large portfolio of functionality, you may not have time to complete all features at once; you can break it up into different application sets. Throughout this phase, you move a small percentage of customers at a time, and eventually you complete your migration. It's a good way of scaling out even the migration itself.
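One common way to move a small percentage of customers at a time is stable hashing, so the same customer always lands on the same side while the dial is turned up. A sketch under assumptions (the bucket count, field names, and the "cloud" vs "on_prem" labels are all illustrative, not the mechanism Chase describes in detail here):

```python
# Sketch of percentage-based customer migration using stable hashing.
# Deterministic: the same customer_id always maps to the same bucket.

import hashlib

def migrated(customer_id: str, rollout_percent: int) -> bool:
    """Assign a customer to the new platform if their bucket is below the dial."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return bucket < rollout_percent

def route_customer(customer_id: str, rollout_percent: int) -> str:
    return "cloud" if migrated(customer_id, rollout_percent) else "on_prem"
```

Raising `rollout_percent` from 1 to 5 to 50 to 100 moves cohorts over without ever bouncing a customer back and forth between platforms.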

Result

What is the result? We did all these things; what did we end up with? We were able to save significant cost from what we achieved. Performance-wise, we got to the top of the chart. This is the Dynatrace public report comparing us with other U.S. banks. If you are running a site with under 1-second performance, that's the best. Under 2 seconds is still good. If it takes longer, people are not going to give a lot of value to your business.

Summary and Resources

What are you going to take away here? There are tradeoffs. Consider the cost and performance implications without compromising other things. For example, if you're running multi-region, should you replicate the cache? Look into that aspect: keeping it in one region versus multiple regions. Operational complexity matters, because the cloud has many components and a lot of things are used in your architecture. Reduce the complexity. Reduce the manual effort of watching the application; automation is the key. Then, contain the blast radius. Any site is going to have an issue, and some components are going to fail. When they fail, does it impact all of your customers, or a small subset? That is important to focus on. Make sure it's action-oriented: observability tied into automated actions. It's very important.

Finally, the customer focus. We all do business for our customers. Think of the dial-tone experience: when you pick up the phone, you want to hear the dial tone. Same thing for your application: when customers open your mobile application, they want to see the result. Finally, scale smart, stay reliable. If you want to learn more about what we do at Chase, check out https://next-at-chase.medium.com/. There are a lot of good articles and blogs we have written. We share what we are doing within the company so that the broader community benefits.

Questions and Answers

Rettori: I felt from your presentation that achieving the success you achieved was about realizing efficiency in every single layer. You strived for efficiency in every possible place; it feels like optimizing for everything. How does the architecture team behave to continuously build the knowledge, so that you can always optimize for everything? Because I heard optimization at many layers. I want to learn a little bit about the culture, and about building a team of architects that can think and do that.

Durai Arasan: I think the first thing as an architect is: don't stay with just the design, putting it in PowerPoint. Get into the real thing. That's one thing we do. Go to the engineers, and not just the engineers: go to the production incidents. When an incident happens and you go into it, you learn a lot. That is what we do. We don't just sit as architects and do drawings. We get into the actual situation, analyze the incidents, try to debug, and learn. That shows you where your design is failing. You may design good systems, and somebody else does the actual engineering work, but if they make mistakes and you're not seeing it, that's not good. You need to complete the loop. We talk to product folks first. We convert that into the design. We work with the engineers, the engineers translate that into actual applications, and then we stay connected with them, including in production. That's the way we learn, and then we apply it to improve our architecture.

Rettori: What do you look for in an architect, when you’re hiring an architect to work with you?

Durai Arasan: One, having an engineering mindset. Are they able to really code? Not every architect; at a certain level you may not be coding anymore, but you need to have been hands-on and be able to do it, and to stay updated. There are a number of other things we look at, but the person really needs that engineering mindset, to be able to come in, work as an architect, and talk to the technology.

The other thing we look at, especially in a company like Chase where there are different layers of discussion: is the architect able to talk to the engineer, talk to the manager, and maybe even go and talk at the executive level? It's like a ladder of abstraction, that's how I see it. Can you climb up and down the ladder, communicate, interact, and convince? Because for a solution like what we discussed, you need to be able to make an impression and get buy-in. Remember, we talked about the cultural shift. That means if the organization is not ready for cloud, as used to be the case earlier, an architect needs to go build that case and be able to convince people. You need to have the talent, and utilize that talent, to achieve something like what we did.

Participant: I have a question about the repaving. I'd like some more details, because you said that you redeploy everything every sprint. I'd like more detail on how you do that in practice: what exactly is "everything" in this case? Also, are you going from the same state to the same state when you're repaving, or does it include updates to your system at the same time?

Durai Arasan: It is multiple things. There are automated scripts that go and check the lifecycle of the running instances. There is a time validity, and when it expires, the script takes the instance off the route, stopping any new requests from going in, and lets the existing requests drain. Then it shuts down, cleans up the node, and new instances get created. When the new instances get created, you can potentially push a new image.

For example, if there's a zero-day vulnerability or some other security patch that needs to be applied, you can push the changes, or you can just recreate the instance. We talked about stale resources and things like that; they get cleaned up. You have the option of taking a completely new image; that can be part of the policy. If you just want new instances, that can be part of the policy too. The policy determines what action to take. These are all automated. It watches all the running instances and acts accordingly. Not everything gets killed on the same day at the same time; there is a cycle it goes through. It's a very clean process, with zero impact to the customer. That is our goal: we first take the traffic off, so that customers are not impacted when the repave happens.
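The repave cycle just described (drain, wait, terminate, recreate from a fresh image, driven by an age policy) can be sketched as follows. The step names, the policy fields, and the 14-day window are hypothetical; real repaving would be driven by platform policy, not a script like this.

```python
# Illustrative sketch of the repave cycle: drain traffic, wait for in-flight
# requests, terminate, recreate. Names and the age window are assumptions.

def repave(instance: dict, policy: dict) -> list:
    """Return the ordered steps the automation would take for one instance."""
    steps = []
    steps.append(f"drain: remove {instance['id']} from the route")
    steps.append("wait: let in-flight requests complete")
    steps.append(f"terminate: shut down and clean up {instance['id']}")
    if policy.get("refresh_image", True):
        steps.append("recreate: launch replacement from the latest patched image")
    else:
        steps.append("recreate: launch replacement from the current image")
    return steps

def due_for_repave(age_days: int, max_age_days: int = 14) -> bool:
    # e.g. every sprint: any instance older than the window gets repaved.
    return age_days >= max_age_days
```

Draining before terminating is what gives the zero-customer-impact property: no instance dies while it still holds live requests.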

 
