Does SQS Really Deliver Messages "At Least Once"? - How At-Least-Once Delivery Works and Its Pitfalls
Explore why SQS Standard queues produce duplicates under At-Least-Once delivery, how visibility timeout works internally, the differences from FIFO queue's Exactly-Once processing, and why idempotent design is essential.
What Is At-Least-Once Delivery?
SQS Standard queues guarantee at-least-once delivery: every message sent is delivered to a consumer at least once, but the same message may occasionally be delivered two or more times. Why do duplicates occur? SQS stores each message redundantly across multiple servers. If one of the servers holding a copy is unavailable when a consumer receives or deletes the message, that copy is not deleted, and a later receive request can return the message again. Separately, if a consumer has received a message but not yet deleted it when the visibility timeout expires, the message can be handed to another consumer. In a distributed system, network latency and synchronization delays between servers make this kind of duplication fundamentally unavoidable. AWS does not publish a duplicate rate; its documentation says only that duplicates occur on rare occasions. Anecdotally, environments processing thousands of messages per second have reported a few to several dozen duplicates per day.
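The receive, process, delete cycle described above can be sketched as follows. This is a minimal illustration, not a production consumer: `drain_once`, `handler`, and `queue_url` are hypothetical names, and `sqs` is assumed to be a boto3 SQS client (for example `boto3.client("sqs")`) passed in by the caller.

```python
def drain_once(sqs, queue_url, handler):
    """Receive one batch, process each message, then delete it.

    `sqs` is assumed to be a boto3 SQS client supplied by the caller.
    Returns the number of messages handled.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling
    )
    handled = 0
    # "Messages" is absent from the response when the queue is empty.
    for message in response.get("Messages", []):
        # Duplicate window: from here until delete_message succeeds,
        # the same message can still be delivered again.
        handler(message["Body"])
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        handled += 1
    return handled
```

If `handler` raises, the message is not deleted and reappears after the visibility timeout, which is the retry behavior at-least-once delivery relies on.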
Visibility Timeout - How Messages Become "Invisible"
SQS's visibility timeout is a mechanism that reduces duplicate processing. When a consumer receives a message, that message becomes "invisible" to other consumers for a specified period. This period is the visibility timeout, which defaults to 30 seconds. If the consumer finishes processing and deletes the message via the DeleteMessage API within the visibility timeout, the message is removed from the queue for good. If the message is not deleted within the visibility timeout, it becomes visible again and may be delivered to another consumer.
Here lies the pitfall: if processing takes longer than 30 seconds, the visibility timeout expires, the message becomes visible again, and another consumer picks it up while the first is still working, resulting in duplicate processing. There are two countermeasures. First, set the visibility timeout significantly longer than the expected processing time. Second, extend the visibility timeout during processing using the ChangeMessageVisibility API (the heartbeat pattern). When Lambda is triggered by SQS, Lambda does not adjust the queue's visibility timeout for you; AWS recommends setting it to at least six times the function's timeout so that retried batches have time to complete.
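One way to implement the heartbeat pattern is a background thread that keeps calling ChangeMessageVisibility while the handler runs. The sketch below is illustrative only: `process_with_heartbeat`, `handler`, and the default `interval` and `visibility_timeout` values are assumptions, and `sqs` is assumed to be a boto3 SQS client passed in by the caller.

```python
import threading


def process_with_heartbeat(sqs, queue_url, message, handler,
                           visibility_timeout=30, interval=10):
    """Run handler while a background thread keeps the message invisible."""
    stop = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout and True once stop is set,
        # so this loop fires every `interval` seconds until processing ends.
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
                # The new timeout counts from the moment of this call.
                VisibilityTimeout=visibility_timeout,
            )

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        handler(message["Body"])
    finally:
        stop.set()
        worker.join()
    # Reached only when the handler succeeded; on failure the message
    # is left in the queue and becomes visible again for a retry.
    sqs.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
    )
```

The interval should be comfortably shorter than the visibility timeout so that a delayed heartbeat does not let the message slip back into view.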
Exactly-Once Processing with FIFO Queues
FIFO (First-In-First-Out) queues, introduced in 2016, provide message ordering guarantees and deduplication. When you specify a MessageDeduplicationId on send, SQS accepts any later message with the same ID within the 5-minute deduplication window but does not deliver it. This eliminates duplicate sends from the producer side, such as retries caused by network timeouts. Duplicate processing on the consumer side is mitigated by the ordering guarantee tied to MessageGroupId: later messages with the same MessageGroupId are not delivered until the earlier message is deleted or its visibility timeout expires, which prevents parallel consumption within a group. A consumer that overruns the visibility timeout can still receive the same message again, so the timeout must be sized correctly here too.
FIFO queues also have throughput limits: by default 300 messages per second without batching, and 3,000 messages per second with batches of 10. High-throughput mode raises these limits considerably (the source figure here is 30,000 messages per second; the actual quota varies by region and has been raised over time, so check the current SQS quotas), but this is still lower than the nearly unlimited throughput of Standard queues. For workloads that do not require ordering guarantees or deduplication, Standard queues are more advantageous in terms of throughput and cost.
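As a rough sketch of the producer side, the function below sends to a FIFO queue with both IDs set explicitly. Deriving the deduplication ID from a hash of the order ID and body is one possible choice, not something SQS requires; `send_order_event`, `order_id`, and `queue_url` are hypothetical names, and `sqs` is assumed to be a boto3 SQS client.

```python
import hashlib


def send_order_event(sqs, queue_url, order_id, body):
    """Send one message to a FIFO queue with explicit dedup and group IDs."""
    # The same order and body always hash to the same ID, so a retried
    # send within the deduplication window is not delivered a second time.
    dedup_id = hashlib.sha256(f"{order_id}:{body}".encode("utf-8")).hexdigest()
    return sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageDeduplicationId=dedup_id,  # 64 hex chars, within the 128-char limit
        MessageGroupId=order_id,          # events for one order stay in order
    )
```

Using the order ID as the MessageGroupId keeps events for one order strictly ordered while letting different orders be consumed in parallel.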
Idempotent Design - Architecture That Assumes Duplicates
When using SQS Standard queues, ensuring idempotency on the consumer side is essential. Idempotency means that performing the same operation multiple times produces the same result as performing it once. For example, "set User A's balance to 1,000 yen" is idempotent, but "add 1,000 yen to User A's balance" is not: executing the latter twice adds 2,000 yen. The most common pattern for achieving idempotency is recording processed message IDs in DynamoDB. When a message is received, first check whether the MessageId exists in DynamoDB. If it does, skip processing; if not, execute the processing and record the MessageId. Using DynamoDB's conditional writes (ConditionExpression), the check and the record can be performed as a single atomic operation. Note that the MessageId only identifies duplicate deliveries of one sent message; if the producer retries a send, the retry gets a new MessageId, so a business key carried in the message body is a more robust idempotency key in that case. The Powertools for AWS Lambda library provides a decorator that makes this pattern straightforward to implement. Adding the @idempotent decorator to a function incorporates DynamoDB-based idempotency checks automatically.
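The conditional-write pattern can be sketched as below. This is a simplified illustration under several assumptions: the table has a string partition key named `message_id` (a hypothetical schema), `dynamodb` is a boto3 low-level DynamoDB client, and `process_once` and `handler` are made-up names. It claims the ID before processing and releases the claim if the handler fails; a production version would also need expiry and in-progress handling, which this sketch omits.

```python
def process_once(dynamodb, table_name, message, handler):
    """Process a message only if its ID has not been recorded yet.

    Returns True if the handler ran, False if the message was a duplicate.
    """
    key = {"message_id": {"S": message["MessageId"]}}
    try:
        # Atomic check-and-record: the write succeeds only when no item
        # with this message_id exists yet.
        dynamodb.put_item(
            TableName=table_name,
            Item=key,
            ConditionExpression="attribute_not_exists(message_id)",
        )
    except dynamodb.exceptions.ConditionalCheckFailedException:
        return False  # already claimed: duplicate delivery, skip it

    try:
        handler(message["Body"])
    except Exception:
        # Release the claim so a later redelivery can retry the work.
        dynamodb.delete_item(TableName=table_name, Key=key)
        raise
    return True
```

With this guard in place, even a non-idempotent handler such as "add 1,000 yen to the balance" takes effect once per MessageId, however many times the message is delivered.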
Dead-Letter Queues - Where Unprocessable Messages Go
When message processing fails repeatedly, the message keeps being redelivered until its retention period expires, wasting consumer resources. A dead-letter queue (DLQ) moves messages that have failed processing a specified number of times to a separate queue. Setting maxReceiveCount to 3 means that once a message has been received three times without being deleted, SQS moves it to the DLQ instead of delivering it a fourth time. Messages in the DLQ can be manually inspected to investigate the cause, then either resent to the original queue after fixing the issue or discarded. An often-overlooked aspect of DLQ design is the message retention period of the DLQ itself. SQS's default message retention period is 4 days, with a maximum of 14 days. The DLQ's own retention setting applies to the messages it holds, so even at the maximum they are automatically deleted if not addressed within 14 days. For Standard queues the clock also starts at the original enqueue time rather than when the message arrives in the DLQ, so time already spent in the source queue counts against the DLQ's retention period; set the DLQ's retention longer than the source queue's. For critical messages, set up a CloudWatch alarm on the DLQ to receive immediate notifications when messages arrive. The DLQ redrive feature, added in 2021, lets you move DLQ messages back to the original queue directly from the console.
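The DLQ settings discussed above map onto two queue attributes, RedrivePolicy and MessageRetentionPeriod. The sketch below shows one way to apply them; `attach_dlq`, `redrive_policy`, and the URL and ARN parameters are hypothetical names, and `sqs` is assumed to be a boto3 SQS client. Queue attribute values are passed as strings.

```python
import json

FOURTEEN_DAYS = 14 * 24 * 60 * 60  # 1,209,600 seconds, the SQS maximum


def redrive_policy(dlq_arn, max_receive_count=3):
    """Build the JSON string SQS expects for the RedrivePolicy attribute."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    })


def attach_dlq(sqs, source_queue_url, dlq_url, dlq_arn, max_receive_count=3):
    """Give the DLQ the longest retention, then point the source queue at it."""
    sqs.set_queue_attributes(
        QueueUrl=dlq_url,
        Attributes={"MessageRetentionPeriod": str(FOURTEEN_DAYS)},
    )
    sqs.set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes={"RedrivePolicy": redrive_policy(dlq_arn, max_receive_count)},
    )
```

Setting the DLQ's retention before attaching it means no failed message ever lands in a DLQ that still has the 4-day default.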