Less than a month ago, our team came upon a challenge that sounded deceptively simple: build a dashboard that monitored transactions in real time for a financial platform we maintained. In principle, it was straightforward – event data would be written to DynamoDB, and OpenSearch would handle the real-time analytics.
There was one simple expectation: every event recorded in DynamoDB should be available for analysis in OpenSearch instantly. No delays. No needless waiting.
And this is where we completely miscalculated.
When “Real-Time” Wasn’t Really Real-Time
I vividly remember our CTO’s reaction during the first demo. “Why is there such a lag?” he asked, pointing at the dashboard, which was displaying metrics from almost 5 seconds earlier.
We had committed to sub-second latency. What we delivered was a system that, during traffic surges, fell 3-5 seconds behind – sometimes worse. In financial monitoring, that kind of lag might as well be hours. It is the difference between being proactive and reactive, between catching a system failure immediately and finding out from customer complaints.
Some drastic changes were required – and fast.
Our First Attempt: The Traditional Batch Approach
Like many teams before us, we leaned on familiar territory:
- Schedule an AWS Lambda job for execution every 5 seconds.
- Make it retrieve new records stored in DynamoDB.
- Collect these updates and group them into batches.
- Push the batches to OpenSearch for indexing.
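For a sense of what this looked like, here is a rough sketch of that polling Lambda – simplified and illustrative rather than our exact code; the timestamp attribute and scan-based lookup stand in for the real query logic:
import time
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("RealTimeMetrics")  # illustrative table name

def poll_and_index(event, context):
    # Look back over the last polling window (5 seconds) for "new" items.
    cutoff = int(time.time()) - 5

    # A scan with a filter is simple but expensive; it illustrates why
    # this approach struggled as traffic grew.
    response = table.scan(FilterExpression=Attr("timestamp").gte(cutoff))
    items = response.get("Items", [])

    if items:
        # Group the new items into a single batch and bulk-index them
        # into OpenSearch (bulk call omitted for brevity).
        print(f"Batching {len(items)} items for OpenSearch")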
It worked. Kind of. The problem is that “working” and “working well” turned out to be two very different things.
It fell apart in three predictable ways:
- Built-in delay: New data sat in DynamoDB for up to 5 seconds, waiting for the next batch to pick it up.
- Indexing latency: Bulk-indexing large batches into OpenSearch slowed queries more and more.
- Reliability issues: A crash mid-batch meant every update in that batch was lost.
In one particularly painful incident, our system missed several critical error events because they were swallowed by a batch failure. By the time the issue was diagnosed, thousands of transactions had already gone through the system.
“For the love of god, this isn’t sustainable,” our CTO said after reading yet another incident report. “We need to fundamentally change the system and stream these updates live instead of batching them.”
He was absolutely right.
The Solution: Live Streaming Updates As They Occur
Late one night, deep in the AWS documentation, the solution hit me: DynamoDB Streams.
What if we could capture and process every single change made to the DynamoDB table the second it happened instead of having to pull updates in batches on a schedule?
This completely changed our approach for the better:
- Set up DynamoDB Streams to capture every insert, update, and delete on the table
- Add an AWS Lambda function that processes those changes the moment they arrive
- Apply some light processing to the data and push the updates straight to OpenSearch
In my initial tests, the results were incredible. Latency dropped from 3-5 seconds to under 500ms. I will never forget the Slack message I sent to the team at 3 AM: “I think we’ve cracked it.”
Making it Work in Practice
This was not a class assignment or a design exercise where a proof-of-concept pipeline would do; it had to work in production. Over one of many sleepless, coffee-fueled nights, we broke the problem down into three actionable steps. The first was getting notified about changes in DynamoDB.
Getting Notified About Changes in DynamoDB
Challenge number one: how do we even know when something changes in DynamoDB? After some Googling, I discovered we needed to enable DynamoDB Streams. As it turned out, it was just one CLI command away – albeit a painful one for me at first.
aws dynamodb update-table \
  --table-name RealTimeMetrics \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_IMAGE
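To sanity-check that the stream really came online, you can read the table description back; a quick boto3 snippet (illustrative, using the same table name as above):
import boto3

dynamodb = boto3.client("dynamodb")

# Confirm the stream specification took effect and grab the stream ARN,
# which is needed later to hook the stream up to Lambda.
table = dynamodb.describe_table(TableName="RealTimeMetrics")["Table"]
print(table["StreamSpecification"])
print(table["LatestStreamArn"])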
I still recall telling my coworker at 11 PM how excited I was when I got this to work: “It’s working! The table is streaming all changes it goes through!”
Creating a Change Event Listener
And now, with streams enabled, we needed something to catch those events. So we built a Lambda function I decided to call “The Watchdog.” It waits for DynamoDB to notify it of a change and springs into action the moment one arrives.
import json
import boto3
import requests

OPENSEARCH_URL = "https://your-opensearch-domain.com"
INDEX_NAME = "metrics"
HEADERS = {"Content-Type": "application/json"}

def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        if record["eventName"] in ["INSERT", "MODIFY"]:
            new_data = record["dynamodb"]["NewImage"]
            doc_id = new_data["id"]["S"]  # Primary key

            # Convert DynamoDB attribute format to plain JSON
            doc_body = {
                "id": doc_id,
                "timestamp": new_data["timestamp"]["N"],
                "metric": new_data["metric"]["S"],
                "value": float(new_data["value"]["N"]),
            }

            # Send the update to OpenSearch
            response = requests.put(
                f"{OPENSEARCH_URL}/{INDEX_NAME}/_doc/{doc_id}",
                headers=HEADERS,
                json=doc_body,
            )
            print(f"Indexed {doc_id}: {response.status_code}")

    return {"statusCode": 200, "body": json.dumps("Processed Successfully")}
The code looks simple now, but it was not so simple to write – it took three attempts to get it right. Our first version kept timing out and crashing because we were mishandling the format of the records DynamoDB sends in the stream.
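One piece of wiring not shown above is subscribing the Lambda function to the table’s stream, which is done with an event source mapping. Here is a minimal sketch with boto3 – the function name and stream ARN are placeholders, not our real values:
import boto3

lambda_client = boto3.client("lambda")

# Subscribe the "Watchdog" function to the table's stream so it is
# invoked with batches of change records as they happen.
lambda_client.create_event_source_mapping(
    FunctionName="watchdog-indexer",  # placeholder function name
    EventSourceArn="arn:aws:dynamodb:...:table/RealTimeMetrics/stream/...",  # placeholder ARN
    StartingPosition="LATEST",  # only process changes from now on
    BatchSize=100,              # up to 100 stream records per invocation
)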
Teaching OpenSearch to Keep Up
The last issue was the most difficult one to solve and caught us off guard. Even though we were pushing updates to OpenSearch immediately, they were not searchable in real time. It turns out OpenSearch only makes newly indexed documents visible to search after a periodic index refresh.
“This makes no sense,” my teammate moaned. “We are sending real-time data and it does not show in real-time!”
curl -X PUT "https://your-opensearch-domain.com/metrics/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "index": {
      "refresh_interval": "500ms",
      "number_of_replicas": 1
    }
  }'
After a bit of research and trial and error, we found the setting we needed: refresh_interval. This one change made a huge difference – it instructs OpenSearch to make new data searchable within half a second instead of waiting for its default refresh cycle. I nearly leapt out of my seat when I saw the first event show up on our dashboard milliseconds after it was written to DynamoDB.
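If you want to verify that the change took, the setting can be read back with a GET on the same _settings endpoint; a quick sanity check from Python, assuming the same placeholder domain URL and index name as in the Lambda above:
import requests

OPENSEARCH_URL = "https://your-opensearch-domain.com"
INDEX_NAME = "metrics"

# Read the index settings back and confirm the new refresh interval.
response = requests.get(f"{OPENSEARCH_URL}/{INDEX_NAME}/_settings")
index_settings = response.json()[INDEX_NAME]["settings"]["index"]
print(index_settings.get("refresh_interval"))  # expected: "500ms"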
The Results: From Seconds to Milliseconds
The first week with the new system in production taught us a great deal. The dashboard was no longer a lagging, out-of-sync view of reality but a genuinely real-time service.
We achieved:
- Average latency of < 500ms (was 3-5 seconds)
- No more batch delays – propagation of changes was instantaneous
- Zero indexing bottlenecks – smaller, more frequent updates were more efficient
- Improved overall system resilience – no more ‘all or nothing’ batch failures
When we shared the updated dashboard with leadership, they noticed the change immediately. “This is what we needed from the beginning,” our CTO said.
Further Scaling: Managing Traffic Peaks
As good as the new approach was for normal traffic, it ran into trouble during extreme usage spikes. During end-of-day reconciliation periods, the event rate would jump from hundreds to thousands per second.
To mitigate this, we added Amazon Kinesis Firehose as a buffer. Instead of the Lambda function sending every single update directly to OpenSearch, we changed it to stream the data out through Firehose:
firehose_client = boto3.client("firehose")

firehose_client.put_record(
    DeliveryStreamName="MetricsStream",
    Record={"Data": json.dumps(doc_body)}
)
Firehose took care of delivery to OpenSearch, scaling automatically in response to throughput requirements without compromising the pipeline’s real-time characteristics.
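One further refinement worth considering for the very heaviest spikes is batching everything from a single Lambda invocation into one Firehose call. Here is a minimal sketch using put_record_batch – the doc_bodies list (documents accumulated in the record loop shown earlier) and the chunk size are illustrative, not our exact production code:
import json
import boto3

firehose_client = boto3.client("firehose")

def flush_to_firehose(doc_bodies):
    # PutRecordBatch accepts at most 500 records per call, so send the
    # accumulated documents in chunks rather than one call per record.
    for i in range(0, len(doc_bodies), 500):
        chunk = doc_bodies[i:i + 500]
        firehose_client.put_record_batch(
            DeliveryStreamName="MetricsStream",
            Records=[{"Data": json.dumps(doc) + "\n"} for doc in chunk],
        )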
Lessons Learned: The Quest for Speed Goes On
We learned that with real-time data systems, driving down latency is a never-ending battle. Even with the lag now under 500ms, we are still pushing further:
- Running transformation steps in OpenSearch pipelines instead of in Lambda
- Using Amazon ElastiCache for faster retrieval of frequent queries
- Considering edge computing for users spread around the globe
In financial monitoring, every microsecond counts. As one of our senior engineers puts it, “In monitoring, you’re either ahead of a problem or behind it. There’s no in-between.”
What Have You Tried?
Have you tackled real-time data problems of your own? What did you do that made a difference? I’m eager to hear about your adventures on the cutting edge of AWS.