Key Takeaways
- Message brokers can be broadly categorized as either stream-based or queue-based, each offering unique strengths and trade-offs.
- Messages in a stream are managed using offsets, allowing consumers to efficiently commit large batches in a single network call and replay messages by rewinding the offset. In contrast, queues have limited batching support and typically do not allow message replay, as messages are removed once consumed.
- Streams rely on rigid physical partitions for scaling, which creates challenges in handling poison pills and limits their ability to dynamically auto-scale consumers with fluctuating traffic. Queues, such as Amazon SQS and FIFO SQS, use low-cardinality logical partitions (that are ordered), enabling seamless auto-scaling and effective isolation of poison pills.
- Streams are ideal for data replication scenarios because they enable efficient batching and are generally less susceptible to poison pills.
- When batch replication is not required, queues like Amazon SQS or FIFO SQS are often the better choice, as they support auto-scaling, isolate poison pills, and provide FIFO ordering when needed.
- Combining streams and queues allows organizations to standardize on a single stream solution for producing messages while giving consumers the flexibility to either consume directly from the stream or route messages to a queue based on the messaging pattern.
Messaging solutions play a vital role in modern distributed systems. They enable reliable communication, support asynchronous processing, and provide loose coupling between components. Additionally, they improve application availability and help protect systems from traffic spikes. The available options range from stream-based to queue-based services, each offering unique strengths and trade-offs.
In my experience working with various engineering teams, selecting a message broker is generally not approached with a clear methodology. Decisions are often influenced by trends, personal preference, or the ease of access to a particular technology, rather than by the specific needs of the application. However, selecting the right broker should focus on aligning its key characteristics with the application’s requirements – this is the central focus of this article.
We will examine two of the most popular messaging solutions: Apache Kafka (stream-based) and Amazon SQS (queue-based), which are also the main message brokers we use at EarnIn. By discussing how their characteristics align (or don’t) with common messaging patterns, this article aims to provide insights that will help you make more informed decisions. With this understanding, you’ll be better equipped to evaluate other messaging scenarios and brokers, ultimately choosing the one that best suits your application’s needs.
Message Brokers
In this section, we will examine popular message brokers and compare their key characteristics. By understanding these differences, we can evaluate which brokers are best suited for common messaging patterns in modern applications. While this article does not provide an in-depth description of each broker, readers unfamiliar with these technologies are encouraged to refer to their official documentation for more detailed information.
Amazon SQS (Simple Queue Service)
Amazon SQS is a fully managed message queue service that simplifies communication between decoupled components in distributed systems. It ensures reliable message delivery while abstracting complexities such as infrastructure management, scalability, and error handling. Below are some of the key properties of Amazon SQS.
Message Lifecycle Management: In SQS, the message lifecycle is managed either individually or in small batches of up to 10 messages. Each message can be received, processed, deleted, or even delayed based on the application’s needs. Typically, an application receives a message, processes it, and then deletes it from the queue, which ensures that messages are reliably processed.
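To make this lifecycle concrete, here is a minimal sketch of the receive-process-delete loop using boto3; the queue URL and the process handler are hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical queue

# Receive up to 10 messages (the per-call maximum), using long polling.
resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
)

for msg in resp.get("Messages", []):
    process(msg["Body"])  # application-specific handler (assumed)
    # Deleting the message marks it as processed; if the consumer crashes
    # before this call, SQS redelivers the message after the visibility timeout.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```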
Best-effort Ordering: Standard SQS queues generally deliver messages in the order they were sent but do not guarantee strict ordering, particularly during retries or parallel consumption. This allows for higher throughput when strict message order isn’t necessary. For use cases that require strict ordering, FIFO SQS (First-In-First-Out) ensures that messages are processed in the exact order they were sent within a message group (more on FIFO SQS below).
Built-in Dead Letter Queue (DLQ): SQS includes built-in support for Dead Letter Queues (DLQs), which help isolate unprocessable messages.
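A DLQ is attached to a queue through its redrive policy. Below is a minimal boto3 sketch, assuming a DLQ with the ARN shown already exists:

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical ARN of an existing dead letter queue.
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"

# After 5 failed receives, SQS automatically moves the message to the DLQ,
# so it can no longer block or pollute the main queue.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```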
Write and Read Throughput: SQS supports effectively unlimited read and write throughput, which makes it well-suited for high-volume applications where the ability to handle large message traffic efficiently is essential.
Autoscaling Consumers: SQS supports auto-scaling compute resources (such as AWS Lambda, EC2, or ECS services) based on the number of messages in the queue (see official documentation). Consumers can dynamically scale to handle increased traffic and scale back down when the load decreases. This auto-scaling capability ensures that applications can process varying workloads without manual intervention, which is invaluable for managing unpredictable traffic patterns.
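As a rough illustration, a scaling controller can key off the queue’s backlog; the queue URL and the one-worker-per-100-messages policy below are assumptions made for the sketch:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

# ApproximateNumberOfMessages is the backlog metric that scaling policies
# (CloudWatch alarms, or a controller like this one) typically key off.
attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages"],
)
backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# Naive policy: one worker per 100 queued messages, capped at 50.
desired_workers = min(max(backlog // 100, 1), 50)
print(f"Scaling consumer fleet to {desired_workers} workers")
```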
Pub-Sub Support: SQS does not natively support pub-sub, as it is designed for point-to-point messaging where each message is consumed by a single receiver. However, you can achieve a pub-sub architecture by integrating SQS with Amazon Simple Notification Service (SNS). SNS allows messages to be published to a topic, which can then fan out to multiple SQS queues subscribed to that topic. This enables multiple consumers to receive and process the same message independently, effectively implementing a pub-sub system using AWS services.
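A minimal fan-out sketch with boto3; the topic and queue ARNs are hypothetical, and each subscribed queue also needs an access policy allowing SNS to send to it (omitted here):

```python
import boto3

sns = boto3.client("sns")

topic_arn = "arn:aws:sns:us-east-1:123456789012:order-events"  # hypothetical
queue_arns = [
    "arn:aws:sqs:us-east-1:123456789012:billing-queue",   # hypothetical
    "arn:aws:sqs:us-east-1:123456789012:shipping-queue",  # hypothetical
]

# Each subscribed queue receives its own copy of every published message.
for queue_arn in queue_arns:
    sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# A single publish fans out to all subscribed queues.
sns.publish(TopicArn=topic_arn, Message='{"orderId": "1234"}')
```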
Amazon FIFO SQS
FIFO SQS extends the capabilities of Standard SQS by guaranteeing strict message ordering within logical partitions called message groups. It is ideal for workflows that require the sequential processing of related events, such as user-specific notifications, financial transactions, or any scenario where maintaining the exact order of messages is crucial. Below are some of the key properties of FIFO SQS.
Message Grouping as Logical Partitions: In FIFO SQS, each message has a MessageGroupId, which is used to define logical partitions within the queue. A message group allows messages that share the same MessageGroupId to be processed sequentially. This ensures that the order of messages within a particular group is strictly maintained, while messages belonging to different message groups can be processed in parallel by different consumers. For example, imagine a scenario where each user’s messages need to be processed in order (e.g., a sequence of notifications or actions triggered by a user).
By assigning each user a unique MessageGroupId, SQS ensures that all messages related to a specific user are processed sequentially, regardless of when the messages are added to the queue. Messages from other users (with different MessageGroupIds) can be processed in parallel, maintaining efficient throughput without affecting the order for any individual user. This is a major benefit for FIFO SQS in comparison to standard SQS or stream based message brokers such as Apache Kafka and Amazon Kinesis.
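For illustration, a short boto3 sketch of the per-user grouping described above; the queue URL is hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/user-events.fifo"  # hypothetical

def send_user_event(user_id: str, event_id: str, body: str) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=body,
        # All of a user's messages share a message group, so they are
        # delivered in order; different users are processed in parallel.
        MessageGroupId=user_id,
        # Required unless content-based deduplication is enabled on the queue.
        MessageDeduplicationId=event_id,
    )
```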
Dead Letter Queue (DLQ): FIFO SQS provides built-in support for Dead Letter Queues (DLQs), but their use requires careful consideration as they can disrupt the strict ordering of messages within a message group. For example, if two messages – message1 and message2 – belong to the same MessageGroupId (e.g., groupA), and message1 fails and is moved to the DLQ, message2 could still be successfully processed. This breaks the intended message order within the group, defeating the primary purpose of FIFO processing.
Poison Pills Isolation: When a DLQ is not used, FIFO SQS will continue retrying the delivery of a failed message indefinitely. While this ensures strict message ordering, it can also create a bottleneck, blocking the processing of all subsequent messages within the same message group until the failed message is successfully processed or deleted.
Messages that repeatedly fail to process are known as poison pills. In some messaging systems, poison pills can block an entire queue or shard, preventing any subsequent messages from being processed. However, in FIFO SQS, the impact is limited to the specific message group (logical partition) the message belongs to. This isolation significantly mitigates broader failures, provided message groups are thoughtfully designed.
To minimize disruption, it’s crucial to choose the MessageGroupId in a way that keeps logical partitions small while ensuring that ordered messages remain within the same partition. For example, in a multi-user application, using a user ID as the MessageGroupId ensures that failures only affect that specific user’s messages. Similarly, in an e-commerce application, using an order ID as the MessageGroupId ensures that a failed order message does not impact orders from other customers.
To illustrate the impact of this isolation, consider a poison pill scenario:
- Without isolation (or with only shard-level isolation), a poison pill could block all orders in an entire region (e.g., all Amazon.com orders in a country).
- With FIFO SQS isolation, only a single user’s order would be affected, while others continue processing as expected.
Thus, poison pill isolation is a highly impactful feature of FIFO SQS, significantly improving fault tolerance in distributed messaging systems.
Throughput: FIFO SQS has a default throughput limit of 300 API calls per second (up to 3,000 messages per second with batching). By enabling high-throughput mode, this can be increased to 9,000 messages per second. Achieving this high throughput requires careful design of message groups to ensure sufficient parallelism.
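High-throughput mode is enabled through two queue attributes that scope deduplication and the throughput limit to individual message groups; a minimal boto3 sketch (queue URL assumed):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/user-events.fifo"  # hypothetical

sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "DeduplicationScope": "messageGroup",
        "FifoThroughputLimit": "perMessageGroupId",
    },
)
```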
Autoscaling Consumers: Similar to Standard SQS, FIFO SQS supports auto-scaling compute resources based on the number of messages in the queue. While FIFO SQS scalability is not truly unlimited, it is influenced by the number of message groups (logical partitions), which can be designed to be very high (e.g. a message group per user).
Pub-Sub Support: Just like with Standard SQS, pub-sub can be achieved by pairing FIFO SQS with SNS, which offers support for FIFO topics.
Apache Kafka
Apache Kafka is an open-source, distributed streaming platform designed for real-time event streaming and high-throughput applications. Unlike traditional message queues like SQS, Kafka operates as a stream-based platform where messages are consumed based on offsets. In Kafka, consumers track their progress by moving their offset forward (or backward for replay), allowing multiple messages to be committed at once. This offset-based approach is a key distinction between Kafka and traditional message queues, where each message is processed and acknowledged independently. Below are some of Kafka’s key properties.
Physical Partitions (shards): Kafka topics are divided into physical partitions (also known as shards) at the time of topic creation. Each partition maintains its own offset and manages message ordering independently. While partitions can be added, this may disrupt ordering and requires careful handling. On the other hand, reducing partitions is even more complex and generally avoided, as it affects data distribution and consumer load balancing. Because partitioning affects scalability and performance, it should be carefully planned from the start.
Pub-Sub Support: Kafka supports a publish-subscribe model natively. This allows multiple consumer groups to independently process the same topic, enabling different applications or services to consume the same data without interfering with each other. Each consumer group gets its own view of the topic, allowing for flexible scaling of both producers and consumers.
High Throughput and Batch Processing: Kafka is optimized for high-throughput use cases, enabling the efficient processing of large volumes of data. Consumers can process large batches of messages, minimizing the number of reads and writes to Kafka. For instance, a consumer can process up to 10,000 messages, save them to a database in a single operation, and then commit the offset in one step, significantly reducing overhead. This is a key differentiator of streams from queues where messages are managed individually or in small batches.
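A sketch of this batch-oriented loop using the confluent-kafka Python client; the broker address, topic name, and bulk_insert helper are assumptions:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "state-replicator",         # hypothetical consumer group
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,            # commit manually, once per batch
})
consumer.subscribe(["user-state"])          # hypothetical topic

while True:
    # Pull up to 10,000 messages in one call.
    batch = consumer.consume(num_messages=10_000, timeout=1.0)
    records = [m.value() for m in batch if m.error() is None]
    if records:
        bulk_insert(records)  # single database round trip (assumed helper)
        # One synchronous commit acknowledges the entire batch.
        consumer.commit(asynchronous=False)
```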
Replay Capability: Kafka retains messages for a configurable retention period (default is 7 days), allowing consumers to rewind and replay messages. This is particularly useful for debugging, reprocessing historical data, or recovering from application errors. Consumers can process data at their own pace and retry messages if necessary, making Kafka an excellent choice for use cases that require durability and fault tolerance.
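Replay amounts to resolving the offsets that correspond to a timestamp and pointing the consumer at them. A sketch with confluent-kafka that rewinds a hypothetical topic by 24 hours:

```python
import time

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "replay-example",           # hypothetical consumer group
})

topic = "user-state"  # hypothetical topic
ts_ms = int((time.time() - 24 * 3600) * 1000)  # 24 hours ago, in milliseconds

# Look up, for every partition, the earliest offset at or after the timestamp.
metadata = consumer.list_topics(topic)
wanted = [TopicPartition(topic, p, ts_ms)
          for p in metadata.topics[topic].partitions]
offsets = consumer.offsets_for_times(wanted, timeout=10.0)

# Consumption now resumes from the rewound offsets.
consumer.assign(offsets)
```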
Handling Poison Pills: In Kafka, poison pills can block the entire physical partition they reside in, delaying the processing of all subsequent messages within that partition. This can have serious consequences on an application. For example, in an e-commerce application where each region’s orders are processed through a dedicated Kafka shard, a single poison pill could block all orders for that region, leading to significant business disruptions. This limitation highlights a key drawback of strict physical partitioning compared to logical partitioning available in queues such as FIFO SQS, where failures are isolated within smaller message groups rather than affecting an entire shard.
If strict ordering is not required, using a Dead Letter Queue can help mitigate the impact by isolating poison pills, preventing them from blocking further message processing.
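One hedged way to implement this in the consumer itself: catch the processing failure, park the message in a DLQ topic, and commit so the partition keeps moving (the broker address, topic names, and handle function are assumptions):

```python
from confluent_kafka import Consumer, Producer

BROKERS = "localhost:9092"  # assumed broker address

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "orders",          # hypothetical consumer group
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": BROKERS})
consumer.subscribe(["orders"])     # hypothetical topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        handle(msg.value())  # application handler (assumed)
    except Exception:
        # Park the poison pill in a DLQ topic so the partition keeps moving.
        # Note: this deliberately gives up ordering for the failed key.
        producer.produce("orders-dlq", key=msg.key(), value=msg.value())
        producer.flush()
    # Committing past the message unblocks subsequent messages either way.
    consumer.commit(message=msg)
```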
Autoscaling Limitations: Kafka’s scaling is constrained by its partition model, where each shard (partition) maintains strict ordering and can be processed by only one compute node at a time. This means that adding more compute nodes than the number of partitions does not improve throughput, as the extra nodes will remain idle. As a result, Kafka does not pair well with auto-scaling consumers, since the number of active consumers is effectively limited by the number of partitions. This makes Kafka less flexible in dynamic scaling scenarios compared to messaging systems like FIFO SQS, where logical partitioning allows for more granular consumer scaling.
Comparison of Messaging Brokers
| Feature | Standard SQS | FIFO SQS | Apache Kafka |
|---|---|---|---|
| Message Retention | Up to 14 days | Up to 14 days | Configurable (default: 7 days) |
| Pub-Sub Support | Via SNS | Via SNS (FIFO topics) | Native via consumer groups |
| Message Ordering | Best-effort ordering | Guaranteed within a message group | Guaranteed within a physical partition (shard) |
| Batch Processing | Batches of up to 10 messages | Batches of up to 10 messages | Efficient large-batch commits |
| Write Throughput | Effectively unlimited | 300 msg/s by default; up to 9,000 msg/s in high-throughput mode | Scalable via physical partitions (millions of messages/second achievable) |
| Read Throughput | Effectively unlimited | 300 msg/s by default; up to 9,000 msg/s in high-throughput mode | Scalable via physical partitions (millions of messages/second achievable) |
| DLQ Support | Built-in | Built-in, but can disrupt ordering | Via connectors or custom code; can disrupt ordering within a physical partition |
| Poison Pill Isolation | Isolated to individual messages | Isolated to message groups | Can block an entire physical partition |
| Replay Capability | Not supported | Not supported | Supported via offset rewinding |
| Autoscaling Consumers | Effectively unlimited | Limited by the number of message groups (nearly unlimited in practice) | Limited by the number of physical partitions (shards) |
Messaging Patterns and Their Influence on Broker Selection
In distributed systems, messaging patterns define how services communicate and process information. Each pattern comes with unique requirements, such as ordering, scalability, error handling, or parallelism, which guide the selection of an appropriate message broker. This discussion focuses on three common messaging patterns: Command Pattern, Event-Carried State Transfer (ECST), and Event Notification Pattern, and examines how their characteristics align with the capabilities of popular brokers like Amazon SQS and Apache Kafka. This framework can also be applied to evaluate other messaging patterns and determine the best-fit message broker for specific use cases.
The Command Pattern
The Command Pattern is a design approach where requests or actions are encapsulated as standalone command objects. These commands are sent to a message broker for asynchronous processing, allowing the sender to continue operating without waiting for a response.
This pattern enhances reliability, as commands can be persisted and retried upon failure. It also improves the availability of the producer, enabling it to operate even when consumers are unavailable. Additionally, it helps protect consumers from traffic spikes, as they can process commands at their own pace.
Since command processing often involves complex business logic, database operations, and API calls, successful implementation requires reliability, parallel processing, auto-scaling, and effective handling of poison pills.
Key Characteristics
Multiple Sources, Single Destination: A command can be produced by one or more services but is typically consumed by a single service. Each command is usually processed only once, with multiple consumer nodes competing for commands. As a result, pub-sub support is unnecessary for commands.
High Throughput: Commands may be generated at a high rate by multiple producers, requiring the selected message broker to support high throughput with low latency. This ensures that producing commands does not become a bottleneck for upstream services.
Autoscaling Consumers: On the consumer side, command processing often involves time-consuming tasks such as database writes and external API calls. To prevent contention, parallel processing of commands is essential. The selected message broker should enable consumers to retrieve commands in parallel and process them independently, without being constrained by a small number of parallel workstreams (such as physical partitions). This allows for horizontal scaling to handle fluctuations in command throughput, ensuring the system can meet peak demands by adding consumers and scale back during low activity periods to optimize resource usage.
Risk of Poison Pills: Command processing often involves complex workflows and network calls, increasing the likelihood of failures that can result in poison pills. To mitigate this, the message broker must support high cardinality poison pill isolation, ensuring that failed messages affect only a small subset of commands rather than disrupting the entire system. By isolating poison pills within distinct message groups or partitions, the system can maintain reliability and continue processing unaffected commands efficiently.
Broker Alignment
Given the requirements for parallel consumption, autoscaling, and poison pill isolation, Kafka is not well-suited for processing commands. As previously discussed, Kafka’s rigid number of physical partitions cannot be scaled dynamically. Furthermore, a poison pill can block an entire physical partition, potentially disrupting a large number of the application’s users.
If ordering is not a requirement, standard SQS is an excellent choice for consuming and processing commands. It supports parallel consumption with unlimited throughput, dynamic scaling, and the ability to isolate poison pills using a Dead Letter Queue (DLQ).
For scenarios where ordering is required and can be distributed across multiple logical partitions, FIFO SQS is the ideal solution. By strategically selecting the message group ID to create numerous small logical partitions, the system can achieve near-unlimited parallelism and throughput. Moreover, any poison pill will only affect a single logical partition (e.g., one user of the application), ensuring that its impact is isolated and minimal.
Event-carried State Transfer (ECST)
The Event-Carried State Transfer (ECST) pattern is a design approach used in distributed systems to enable data replication and decentralized processing. In this pattern, events act as the primary mechanism for transferring state changes between services or systems. Each event includes all the necessary information (state) required for other components to update their local state without relying on synchronous calls to the originating service.
By decoupling services and reducing the need for real-time communication, ECST enhances system resilience, allowing components to operate independently even when parts of the system are temporarily unavailable. Additionally, ECST alleviates the load on the source system by replicating data to where it is needed. Services can rely on their local state copies rather than making repeated API calls to the source. This pattern is particularly useful in event-driven architectures and scenarios where eventual consistency is acceptable.
Key Characteristics
Single Source, Multiple Destinations: In ECST, events are published by the owner of the state and consumed by multiple domains or services interested in replicating the state. This requires a message broker that supports the publish-subscribe (pub-sub) pattern.
Low Likelihood of Poison Pills: Since ECST involves minimal business logic and typically avoids API calls to other services, the risk of poison pills is negligible. As a result, the use of a Dead Letter Queue (DLQ) is generally unnecessary in this pattern.
Batch Processing: As a data-replication pattern, ECST benefits significantly from batch processing. Replicating data in large batches improves performance and reduces costs, especially when the target database supports bulk inserts in a single operation. A message broker that supports efficient large-batch commits, combined with a database optimized for batching, can dramatically enhance application performance.
Strict Ordering: Strict message ordering is often essential in ECST to ensure that the state of a domain entity is replicated in the correct sequence. This prevents older versions of an entity from overwriting newer ones. Ordering is particularly critical when events carry deltas (e.g., “set property X”), as out-of-order events cannot simply be discarded. A message broker that supports strict ordering can greatly simplify event consumption and ensure data integrity.
Broker Alignment
Given the requirements for pub-sub, strict ordering, and batch processing, along with the low likelihood of poison pills, Apache Kafka is a great fit for the ECST pattern.
Kafka allows consumers to process large batches of messages and commit offsets in a single operation. For example, 10,000 events can be processed, written to the database in a single batch (assuming the database supports it), and committed with one network call, making Kafka significantly more efficient than Amazon SQS in such scenarios. Furthermore, the minimal risk of poison pills eliminates the need for DLQs, simplifying error handling. In addition to its batching capabilities, Kafka’s partitioning mechanism enables increased throughput by distributing events across multiple shards.
However, if the target database does not support batching, writing data to the database may become the bottleneck, rendering Kafka’s batch-commit advantage less relevant. For such scenarios, funneling messages from Kafka into FIFO SQS or using FIFO SNS/SQS without Kafka can be more effective. As discussed earlier, FIFO SQS allows for fine-grained logical partitions, enabling parallel processing while maintaining message order. This design supports dynamic scaling by increasing the number of consumer nodes to handle traffic spikes, ensuring efficient processing even under heavy workloads.
Event Notification Pattern
The Event Notification Pattern enables services to notify other services of significant events occurring within a system. Notifications are lightweight and typically include just enough information (e.g., an identifier) to describe the event. To process a notification, consumers often need to fetch additional details from the source (and/or other services) by making API calls. Furthermore, consumers may need to make database updates, create commands or publish notifications for other systems to consume. This pattern promotes loose coupling and real-time responsiveness in distributed architectures. However, given the potential complexity of processing notifications (e.g. API calls, database updates and publishing events), scalability and robust error handling are essential considerations.
Key Characteristics
The characteristics of the Event Notification Pattern overlap significantly with those of the Command Pattern, especially when processing notifications involves complex and time-consuming tasks. In these scenarios, implementing the pattern requires support for parallel consumption, autoscaling consumers, and isolation of poison pills to ensure reliable and efficient processing. Moreover, the Event Notification Pattern requires pub-sub support to facilitate one-to-many distribution of events.
In other cases, processing a notification involves simpler workflows, such as updating a database or publishing events to downstream systems. There, the characteristics of this pattern align more closely with those of the ECST pattern.
It should also be noted that different consumers of the same notification may process it differently: one consumer may need to apply complex processing, while another performs very simple tasks that are unlikely to ever fail.
Broker Alignment
When the characteristics of a notification consumer align with those of consuming commands, SQS (or FIFO SQS) is the obvious choice. However, if a consumer only needs to perform simple database updates, consuming notifications from Kafka may be more efficient, thanks to Kafka’s ability to process notifications in large batches and commit them in a single operation.
The challenge with notifications is that it’s not always possible to predict consumption patterns in advance, which makes it difficult to choose between SNS and Kafka when producing notifications.
To gain more flexibility, at EarnIn we decided to use Kafka as the sole broker for publishing notifications. If a consumer requires SQS properties for consumption, it can funnel messages from Kafka to SQS using Amazon EventBridge. If a consumer doesn’t require SQS properties, it can consume directly from Kafka and benefit from its efficient batching capabilities. Moreover, using Kafka instead of SNS for publishing notifications also gives consumers the ability to leverage Kafka’s replay capability, even when messages are funneled to SQS for consumption.
Furthermore, given that Kafka is also a good fit for the ECST pattern and that the Command Pattern doesn’t require pub-sub, we had no remaining reason to use SNS. This allowed us to standardize on Kafka as our sole pub-sub broker, which significantly simplifies our workflows. In fact, with all events flowing through Kafka, we were able to build tooling that replicates Kafka events to a data lake, which can be leveraged for debugging, analytics, replay/backfilling, and more.
Conclusion
Selecting the right message broker for your application requires understanding the characteristics of the available options and the messaging pattern you are using. Key factors to consider include traffic patterns, auto-scaling capabilities, tolerance to poison pills, batch processing needs, and ordering requirements.
While this article focused on Amazon SQS and Apache Kafka, the broader decision often comes down to choosing between a queue and a stream. However, it is also possible to leverage the strengths of both by combining them.
Standardizing on a single broker for producing events allows your company to focus on building tooling, replication, and observability for one system, reducing maintenance costs. Consumers can then route messages to the appropriate broker for consumption using services like EventBridge, ensuring flexibility while maintaining operational efficiency.