Every day, around 0.4 zettabytes of data (equivalent to 402.74 million terabytes, or 402.74 billion gigabytes) is generated, which is an astonishing figure. This data comes from many different sources and falls into many categories across organizations. Organizations can use it to understand customer behavior and usage patterns, and to derive insights that help them serve their customers better.
Real-time analytics helps organizations derive insights from data as it is generated, enabling them to act in time. The ability to make decisions from live analytics is essential for businesses across many industries that want to avoid missed opportunities or significant losses.
Building an analytics platform is not an easy task because of the complexity of handling heterogeneous data, multiple data sources, and large volumes. AWS provides multiple services that help businesses process and visualize the huge amounts of data they produce. Imagine being able to track devices in real time and respond to changes as they happen. AWS makes it possible to analyze data streams and turn raw information into actionable insights without managing the infrastructure ourselves. Whether it’s monitoring IoT devices, tracking application performance, processing data streams, or visualizing insights, AWS offers the services needed.
This article walks through building such solutions with AWS tools, from data ingestion to insightful dashboards. We will explore the architecture, practical use cases, and best practices using services such as Amazon Kinesis, S3, Athena, Timestream, OpenSearch, and more.
Key Elements of the Pipeline
A scalable and reliable analytics platform usually consists of multiple components that work together to hide the internal complexity from the platform’s end users. At a high level, the pipeline contains the following components:
- Data Sources: The data originates from multiple systems or applications.
- Data Ingestion: Capture and stream the real-time data from sources into the pipeline.
- Data Processing: Real-time transformation and aggregation of data.
- Data Storage: A store optimized for low-latency, high-throughput writes and fast queries.
- Data Visualization: Real-time reports and insights for decision-making.
Each of these components plays a critical role in the pipeline. However, in some use cases, a few of them may be absent or merged into a single solution.
Building a Real-Time Analytics Pipeline on AWS
This section illustrates the steps to set up an analytics platform on AWS, covering each of the components described above.
- Data Sources
The data can originate from different sources, such as user interactions on a website, sensor data from IoT devices, events generated by user actions, and user posts on social media platforms. The sources can feed data into the pipeline through APIs or other integrations.
- Data Ingestion
Data ingestion is the vital first step in transforming raw data into actionable insights. A few AWS tools that can be used for data ingestion are:
  - Kinesis Data Streams
Amazon Kinesis is a powerful service for collecting, processing, and analyzing streaming data. It consists of multiple components, including Kinesis Data Streams, the core service for ingesting large volumes of data from various sources. It integrates with other AWS services, which provides flexibility.
For instance, to create an analytics platform using Kinesis and S3, Kinesis Data Streams can be configured as the data ingestion tool. The CloudFormation template for this example is located here.
The following CloudFormation snippet shows how to configure a Kinesis data stream:
    KinesisDataStream:
      Type: AWS::Kinesis::Stream
      Properties:
        Name: !Ref KinesisStreamName
        RetentionPeriodHours: 72
        StreamModeDetails:
          StreamMode: ON_DEMAND
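To make the ingestion step concrete, here is a minimal Python sketch of how a producer might prepare an event for the stream. The helper `build_put_record_args`, the stream name `analytics-stream`, and the `device_id` field are assumptions made for illustration; only the `StreamName`, `Data`, and `PartitionKey` parameters come from the Kinesis `PutRecord` API.

```python
import json
from typing import Any


def build_put_record_args(
    stream_name: str, event: dict[str, Any], key_field: str = "device_id"
) -> dict[str, Any]:
    """Build keyword arguments for the Kinesis PutRecord API (hypothetical helper)."""
    # Newline-terminated JSON keeps records separable if they are later
    # concatenated into a single object downstream.
    payload = (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")
    return {
        "StreamName": stream_name,
        "Data": payload,
        # Records that share a partition key go to the same shard, in order.
        "PartitionKey": str(event[key_field]),
    }


args = build_put_record_args(
    "analytics-stream", {"device_id": "sensor-42", "temperature": 21.5}
)
# With boto3 installed and credentials configured, the record could be sent with:
#   boto3.client("kinesis").put_record(**args)
```

Keeping the payload construction separate from the network call makes the producer easy to test without an AWS account.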
  - Managed Streaming for Apache Kafka (MSK)
In general, Kinesis is designed to be easier and faster to set up, with no upfront costs. For a team that already has experience setting up and running Apache Kafka, Amazon MSK could be a good fit.
  - Amazon SQS (Simple Queue Service)
SQS is a highly scalable and reliable queue service that can be used as a buffer for data ingestion. Its integration with Lambda simplifies data flow through the pipeline.
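As a rough illustration of the SQS and Lambda integration, the sketch below shows a Lambda handler that consumes a batch of SQS messages. It assumes the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled; `process_message` and the `event_type` field are hypothetical stand-ins for real processing logic.

```python
import json


def process_message(body: dict) -> None:
    """Hypothetical per-message step; raises on events that cannot be processed."""
    if "event_type" not in body:
        raise ValueError("missing event_type")


def handler(event, context):
    """Lambda entry point for a batch of SQS messages."""
    failures = []
    for record in event["Records"]:
        try:
            # The SQS message body arrives as a string; here it is assumed to be JSON.
            process_message(json.loads(record["body"]))
        except (ValueError, KeyError):
            # Report only the failed message so that SQS can redeliver it,
            # while the rest of the batch is treated as processed.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


sample_event = {
    "Records": [
        {"messageId": "m-1", "body": '{"event_type": "click"}'},
        {"messageId": "m-2", "body": "not json"},
    ]
}
result = handler(sample_event, None)
```

Without partial batch responses, a single bad message would cause the whole batch to be retried, so reporting individual failures is usually the gentler choice for a buffer in front of the pipeline.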
- Data Processing
The data or events received from the sources may not be in the correct format to be pushed to the storage platform. The ability to process and transform data efficiently is crucial both for performance and for deriving insights.
AWS Lambda can be used to create scalable, reliable serverless solutions that provide data processing capabilities. Lambda functions can be invoked by various events, such as data being added to a Kinesis stream or an SQS queue, and can transform data from its raw format into the required format. This event-driven architecture makes Lambda a good choice for real-time data processing.
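A transformation function triggered by a Kinesis stream might look like the following sketch. The input fields `deviceId` and `tempF`, and the `transform` logic itself, are invented for illustration; the event shape (`Records[].kinesis.data`, base64-encoded) is what Lambda passes for Kinesis event sources.

```python
import base64
import json


def transform(raw: dict) -> dict:
    """Illustrative transformation: normalize field names and convert units."""
    return {
        "device_id": raw["deviceId"],
        "temperature_c": round((raw["tempF"] - 32) * 5 / 9, 2),
    }


def handler(event, context):
    """Lambda entry point for a batch of Kinesis records."""
    results = []
    for record in event["Records"]:
        # Kinesis payloads reach Lambda base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        results.append(transform(json.loads(payload)))
    # Returned only to make the sketch easy to inspect; a real function
    # would write the transformed records to a downstream store.
    return results


raw_event = json.dumps({"deviceId": "sensor-42", "tempF": 212}).encode("utf-8")
sample_event = {
    "Records": [{"kinesis": {"data": base64.b64encode(raw_event).decode("ascii")}}]
}
transformed = handler(sample_event, None)
```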
Kinesis Data Firehose allows for the delivery of real-time data streams to targets such as S3 and Redshift. It supports real-time data producers integrated with Kinesis Data Streams as well as custom applications that can directly write to Kinesis Data Firehose.
For the sample platform, the following snippet shows how to create a Kinesis Data Firehose delivery stream:
    FirehoseDeliveryStream:
      Type: AWS::KinesisFirehose::DeliveryStream
      Properties:
        DeliveryStreamName: !Sub ${KinesisStreamName}-firehose
        DeliveryStreamType: KinesisStreamAsSource
        KinesisStreamSourceConfiguration:
          KinesisStreamARN: !GetAtt KinesisDataStream.Arn
          RoleARN: !GetAtt FirehoseRole.Arn
        ExtendedS3DestinationConfiguration:
          BucketARN: !GetAtt S3Bucket.Arn
          RoleARN: !GetAtt FirehoseRole.Arn
          Prefix: raw-data/
          CompressionFormat: UNCOMPRESSED
          CloudWatchLoggingOptions:
            Enabled: true
            LogGroupName: !Sub "/aws/kinesisfirehose/${KinesisStreamName}"
            LogStreamName: "error-logs"
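When looking for the delivered objects, it helps to know where Firehose puts them. The sketch below assumes Firehose's default behavior of appending a UTC `YYYY/MM/dd/HH/` path to a prefix that contains no custom expressions, such as `raw-data/` above; the helper name `firehose_hour_prefix` is hypothetical, and the result should be checked against the actual bucket layout.

```python
from datetime import datetime, timedelta, timezone


def firehose_hour_prefix(base_prefix: str, when: datetime) -> str:
    """Return the S3 key prefix expected for objects delivered in a given hour.

    `when` should be timezone-aware; it is converted to UTC because the
    default Firehose time prefix is assumed to be in UTC.
    """
    utc = when.astimezone(timezone.utc)
    return base_prefix + utc.strftime("%Y/%m/%d/%H") + "/"


prefix = firehose_hour_prefix(
    "raw-data/", datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
)
```

A prefix like this can then be used to list only the objects for the hour of interest instead of scanning the whole bucket.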
- Data Storage
AWS provides a suite of services that are well suited to storing vast amounts of data while allowing quick access and analysis. The following tools are highly scalable, durable, and cost-effective options in the AWS ecosystem for storing high volumes of data.
  - S3 (Simple Storage Service)
Data ingested through Kinesis can be easily stored in S3 buckets for further processing and analysis. S3 can hold data in various formats, including CSV, JSON, Parquet, and ORC, providing flexibility in data processing. Once data is stored in S3, it can be queried for insights using tools such as Athena, a SQL query service for data stored in S3.
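As an example of querying the stored data, the following sketch builds the arguments for Athena's `StartQueryExecution` API. It assumes a table has already been defined over the S3 data; the database `analytics_db`, the table `raw_events`, its `event_type` column, and the results bucket are all hypothetical names.

```python
import re
from typing import Any


def build_athena_request(
    database: str, table: str, output_location: str
) -> dict[str, Any]:
    """Build keyword arguments for Athena's StartQueryExecution API."""
    # Accept only plain identifiers so the table name cannot inject SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"invalid table name: {table!r}")
    query = (
        f'SELECT event_type, COUNT(*) AS events FROM "{table}" '
        "GROUP BY event_type ORDER BY events DESC LIMIT 10"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    "analytics_db", "raw_events", "s3://example-athena-results/"
)
# With boto3 installed and credentials configured, the query could be started with:
#   boto3.client("athena").start_query_execution(**request)
```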
  - Redshift
For organizations requiring more robust data warehousing capabilities, Amazon Redshift is an option. It is a petabyte-scale data warehouse that can run complex queries across large datasets.
  - Timestream
Timestream is well suited to building analytics based on metrics or events over time. In my previous article, I explored a complete example using Timestream to build an analytics platform.
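To give a sense of what a time-series write looks like, here is a small sketch that shapes one measurement into a record for the Timestream `WriteRecords` API. The helper name, the `device_id` dimension, and the database and table names in the comment are assumptions for illustration.

```python
from typing import Any


def build_timestream_record(
    device_id: str, metric: str, value: float, timestamp_ms: int
) -> dict[str, Any]:
    """Shape one measurement as a record for the Timestream WriteRecords API."""
    return {
        # Dimensions identify the source of the measurement.
        "Dimensions": [{"Name": "device_id", "Value": device_id}],
        "MeasureName": metric,
        # Measure values and timestamps are passed as strings.
        "MeasureValue": str(value),
        "MeasureValueType": "DOUBLE",
        "Time": str(timestamp_ms),
        "TimeUnit": "MILLISECONDS",
    }


record = build_timestream_record("sensor-42", "temperature", 21.5, 1700000000000)
# With boto3 installed and credentials configured, the record could be written with:
#   boto3.client("timestream-write").write_records(
#       DatabaseName="analytics_db", TableName="device_metrics", Records=[record]
#   )
```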
  - OpenSearch
Amazon OpenSearch Service is a managed service based on the open-source OpenSearch project. It supports search, log analytics, application monitoring, and other functionality. OpenSearch Ingestion is a tool for filtering, transforming, and ingesting data into OpenSearch from other sources.
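For applications that index documents directly rather than through OpenSearch Ingestion, the `_bulk` API accepts newline-delimited JSON. The sketch below builds such a request body; the index name `app-logs` and the `event_id` field are hypothetical, and sending the body (with request signing) is left out.

```python
import json


def build_bulk_body(index: str, docs: list[dict], id_field: str = "event_id") -> str:
    """Build a newline-delimited JSON body for the OpenSearch _bulk API."""
    lines = []
    for doc in docs:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body("app-logs", [{"event_id": "e-1", "level": "ERROR"}])
```

Using a stable document ID, as here, means that replaying the same events overwrites existing documents instead of creating duplicates.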
- Visualizations and Insights
Visualizations are important for making effective use of the actionable insights derived from the processed data.
Amazon QuickSight is a powerful business intelligence service that allows users to create dashboards and visualizations from the data. Its integration with many other AWS services benefits the whole pipeline: it can connect to data stored in S3, Redshift, Athena, Timestream, and many other data sources. Users can create interactive dashboards by importing datasets directly from the data sources. QuickSight also provides machine learning capabilities that help users uncover trends and patterns in the data.
Amazon Managed Grafana is a fully managed AWS service that provides data visualization, monitoring, and analytics capabilities. Users can create interactive dashboards and visualizations that draw on multiple data sources. It integrates with several AWS services, including Timestream.
Conclusion
A real-time analytics platform allows organizations to understand their customers better and serve them according to their needs. As businesses scale their operations, leveraging AWS tools and capabilities will help them make data-driven decisions. Such platforms not only help businesses stay ahead of the competition but also support continuous improvement and the creation of innovative products.