Using an in-memory side cache to improve response times and alleviate the load on a database is one of the most common design patterns in web applications. This design scales well and has low setup and maintenance costs. A side cache can be added to an existing web application backend to speed up the loading of all types of data, from database records to larger chunks like complete JSON or HTML responses. By setting an expiration time, “lazy caching” (or “lazy loading”) can be implemented, meeting users’ tolerance for stale data without the complexities of cache invalidation. Many excellent open-source solutions are available both on-premises and in the cloud, with Memcached and Redis being the most popular. A caching scheme like this has been used in every project I have worked on.
As system load increases, challenges inevitably emerge. Imagine your web application processing millions of requests daily. During peak traffic hours, users experience significant delays and occasional errors. To mitigate these issues, you expand your caching infrastructure by adding additional instances, expecting the increased cache size to resolve the bottlenecks. However, the performance issues persist, and your application continues to struggle under the load. This scenario, which I encountered repeatedly in the projects I worked on, necessitates a careful re-evaluation of our approach to caching under high demand.
Large technology companies often address such challenges with sophisticated software and intricate architectural changes. However, these solutions are not always accessible or cost-effective for smaller teams or organizations. Techniques such as implementing middleware proxies for request batching or adopting customized communication protocols may not align with your system’s constraints. This article presents a practical method that my colleagues and I successfully employed to address similar challenges in large-scale projects.
Latency spikes at top percentiles
One frequent problem we encountered was high response times at the top percentiles, such as p99, while the mean and median looked great. These spikes were caused by database reads whenever a cache key was missing.
I created a simulation to represent a typical workload for a web application we used to run (see the Simulation section below for more details). This model application handles 100 requests per second for a single type of abstract key. The keys’ popularity follows a Pareto distribution, with two-thirds of user requests concentrated among the first 1,000 keys out of a total of 1 million keys.
The mean database read time is about 50 times higher than the mean cache read time. The cache expiration time is 20 minutes.
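For illustration, here is a rough sketch of how such a key popularity pattern can be sampled in plain Python. The Pareto shape parameter and the resampling of out-of-range ranks are my own illustrative choices, not the exact setup used in the original simulation:

import math
import random

NUM_KEYS = 1_000_000
TOP_KEYS = 1_000
# shape chosen so that roughly two-thirds of the raw Pareto mass falls on
# ranks <= 1,000; resampling out-of-range ranks pushes the share a bit higher
PARETO_SHAPE = math.log(3) / math.log(TOP_KEYS)  # ~0.16

def sample_key():
    # paretovariate() returns heavy-tailed values >= 1, so low key ranks
    # are requested far more often than high ones
    while True:
        rank = int(random.paretovariate(PARETO_SHAPE))
        if rank <= NUM_KEYS:
            return rank - 1  # 0-based key index

samples = [sample_key() for _ in range(100_000)]
top_share = sum(k < TOP_KEYS for k in samples) / len(samples)
print(f"share of requests hitting the top {TOP_KEYS:,} keys: {top_share:.0%}")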
The p99 response times for these top 1,000 keys are extremely high (top left chart), while the mean and median response times stay within an acceptable range. The long tail of less popular keys isn’t considered here: their hit ratio is significantly lower, which makes caching much less effective for them in the first place.
But are high values at the 99th percentile for popular keys really that bad? After all, it means only about 1% of key reads are slow. Though it depends on your application and its usage patterns, this is generally unfavourable. Consider the following real-world scenarios:
- A single user’s request fetches hundreds of keys, a typical case for social and news feeds. Even though only about 1 in 100 key reads is slow, almost every user request becomes slow because of it;
- Hundreds of users request the same key at the same moment, a common pattern for any popular site. When this key expires, all these requests go straight to the database, multiplying its load and slowing down the whole backend.
I want to emphasize that this happens not to “some requests” but to the most popular ones, which represent the core of your users’ demand.
Setting key expiration times to infinity could solve the problem of these latency spikes. Unfortunately, it also rules out lazy caching and requires explicit cache invalidation mechanisms. Is it possible to eliminate the issue of key expiration while still serving data that is acceptably up to date?
Dynamically extending key expiration times
The desire to serve fresh data while maintaining low response times and database load is inherently contradictory. The ideal solution would be to know in advance whether a key retrieved from the cache is about to expire and, if so, update it, preferably in the background. Here is an example solution in Python-like pseudocode (cache and database stand in for your cache and database clients):
from random import random

ALPHA = 0.05  # refresh probability (the α discussed below)

def read_key(key):
    cache_response = cache.gets(key)  # gets() also returns a CAS token
    if cache_response is not None:
        if random() < ALPHA:
            # run the refresh in the background so it does not delay the response
            run_async(refresh_key, key, cache_response.cas_token)
        # returning the cached value to the user
        return cache_response.value
    # on a cache miss, read from the database and populate the cache as usual

def refresh_key(key, cas_token):
    db_response = database.get(key)
    if db_response is not None:
        # the CAS token ensures we don't overwrite a newer value (see below)
        cache.cas(key, db_response.value, cas_token)
We read a key from the cache and, if successful, with a small probability write its value back, which also extends its expiration time. The refresh involves reading the key from the database to pick up any changes.
To avoid race conditions, CAS (check-and-set) must be used. CAS ensures that the cache update happens only if the data hasn’t been modified by another process since it was last read, maintaining consistency.
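Memcached exposes this directly: gets returns a CAS token, and cas writes only if that token still matches. Redis has no CAS variant of SET, but its WATCH/MULTI/EXEC transactions provide the same check-and-set semantics via optimistic locking. Below is a minimal sketch using redis-py; the client setup, the function name, and the 20-minute TTL are assumptions for illustration:

import redis

r = redis.Redis()

def cas_refresh(key, fresh_value, ttl_seconds=20 * 60):
    # optimistic locking: the transaction fails if the watched key is
    # modified by another client between WATCH and EXEC
    with r.pipeline() as pipe:
        try:
            pipe.watch(key)
            pipe.multi()
            pipe.set(key, fresh_value, ex=ttl_seconds)  # also resets the TTL
            pipe.execute()
            return True
        except redis.WatchError:
            # another process updated the key first; skip this refresh
            return False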
The key update should be done asynchronously to avoid delaying user responses. Even if that isn’t possible, it may still be beneficial to update the key synchronously, at the cost of blocking that one user’s request.
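One simple way to implement the run_async() helper from the pseudocode above, assuming a threaded backend, is a small fire-and-forget worker pool (the pool size here is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

# a few background workers are enough, since refreshes are rare and short-lived
refresh_pool = ThreadPoolExecutor(max_workers=4)

def run_async(fn, *args, **kwargs):
    # fire-and-forget: the user response is returned without waiting for fn
    refresh_pool.submit(fn, *args, **kwargs)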
The optimal probability α depends on your traffic pattern; values between 1% and 10% are generally a good choice. Higher values increase database load without a noticeable benefit for popular keys.
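To get a feel for the numbers: with the simulated 100 requests per second, α = 5%, and an assumed 95% cache hit ratio, background refreshes add roughly 0.05 × 0.95 × 100 ≈ 5 extra database reads per second, comparable to the ~5 reads per second already caused by cache misses.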
In systems where keys are not frequently updated, expiration times are longer, and request rates are high, an additional refinement can be applied: refresh keys with probability α, but only if their remaining expiration time is below a threshold t.
This technique requires the ability to read a key’s remaining expiration time. Redis provides the TTL command for this purpose, while Memcached does not; this can be worked around by storing the expiration time alongside the value, for example packed into JSON.
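As a sketch of that combined check for Redis (the probability and the threshold are illustrative values, and the client setup is assumed):

import random
import redis

r = redis.Redis()

ALPHA = 0.05         # refresh probability α, illustrative
TTL_THRESHOLD = 120  # only refresh keys with less than 2 minutes left, illustrative

def should_refresh(key):
    remaining = r.ttl(key)  # seconds until expiration; -1 = no TTL, -2 = missing key
    return 0 <= remaining < TTL_THRESHOLD and random.random() < ALPHA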
Simulation
Modifying the caching behavior of a production system is risky, to say the least, so it’s safer to apply changes gradually. However, due to the unpredictable nature of incoming traffic, distinguishing small improvements from noise requires complex data analysis. Development and staging environments proved not much better in my case, as their traffic is far lower.
To tackle this, I used discrete-event simulation for ad-hoc experiments. This approach let me safely test any system behavior I wanted to investigate, much faster than running the same experiments in a development environment.
While composing this article, I consolidated the numerous code snippets used in these experiments into a single tool and made it available in a GitHub repository: https://github.com/whisk/cachestudy.
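To give a flavor of what such a simulation can look like, here is a heavily simplified, self-contained sketch (not the actual code from the repository): keys expire after 20 minutes, a cache hit is modeled as 1 ms and a miss as 50 ms to reflect the 50x ratio above (the absolute latencies and the 4-hour duration are arbitrary), and the p99 latency of requests for the popular keys is measured at the end:

import random
from statistics import quantiles

NUM_KEYS, TOP_KEYS = 1_000_000, 1_000
PARETO_SHAPE = 0.16            # roughly the shape from the workload sketch above
TTL = 20 * 60                  # 20-minute expiration, in simulated seconds
HIT_MS, MISS_MS = 1.0, 50.0    # illustrative latencies with a 50x ratio
RPS, DURATION = 100, 4 * 3600  # 100 requests per second for 4 simulated hours

expires_at = {}                # key -> expiration timestamp
latencies = []                 # response times of requests for the top keys

for tick in range(RPS * DURATION):
    now = tick / RPS                                   # simulated time, seconds
    key = min(int(random.paretovariate(PARETO_SHAPE)), NUM_KEYS) - 1
    if expires_at.get(key, 0.0) > now:                 # cache hit
        latency = HIT_MS
    else:                                              # miss: hit the database
        latency = MISS_MS
        expires_at[key] = now + TTL                    # repopulate the cache
    if key < TOP_KEYS:
        latencies.append(latency)

p99 = quantiles(latencies, n=100)[98]  # 99th percentile
print(f"p99 for the top {TOP_KEYS:,} keys: {p99:.1f} ms")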
Conclusion
The expiration time extension technique described in this article shows that improving web application performance doesn’t always require expensive tools or complex changes. By addressing common caching challenges, developers can significantly boost efficiency and handle high traffic with simple, cost-effective solutions.
However, these improvements must be carefully tested in production to ensure they work as intended and don’t create new problems. Simulations and pre-production environments often lack the complexity and scale of real-world traffic; only by testing under actual conditions can you confirm that your optimizations provide tangible gains.
Misc
Cover image generated by DALL·E.