How AWS API Throttling Works - The Token Bucket Algorithm and the Truth Behind 429 Errors
Learn how AWS API rate limiting is implemented using the token bucket algorithm, understand the concept of burst capacity, explore differences in throttling limits across services, and discover practical strategies to avoid throttling.
Why AWS Applies Rate Limits to Every API
Every AWS API has per-account, per-region rate limits (throttling). Requests that exceed these limits receive HTTP 429 (Too Many Requests) or 503 (Service Unavailable) errors. Rate limiting serves two purposes. First, it ensures fairness in a multi-tenant environment. If one account issues a massive volume of API calls, it can degrade performance for other accounts sharing the same infrastructure; rate limits are guardrails against this "noisy neighbor" problem. Second, it protects customers themselves. An application bug can cause an infinite loop that calls APIs tens of thousands of times per second, and without rate limits such runaway behavior would lead to enormous bills. Rate limits function as a safety net for catching unintended runaway behavior early. Each service's rate limits can be checked in Service Quotas, and many limits can be raised through quota increase requests.
How the Token Bucket Algorithm Works
AWS API throttling is implemented using the token bucket algorithm. The algorithm replenishes tokens (permits) into a bucket (container) at a constant rate, and each API request consumes one token. When the bucket is empty, requests are rejected. Here's a concrete example. Suppose the EC2 DescribeInstances API has a rate limit of 100 requests per second with a burst capacity of 200. The bucket is replenished with 100 tokens per second, and the maximum bucket capacity is 200 tokens. When the bucket is full (200 tokens), you can send up to 200 requests instantaneously (a burst); after that, requests are processed at the steady rate of 100 per second. Burst capacity is a buffer that absorbs short-term spikes, which matters most in patterns where many APIs are called simultaneously at application startup. Once the burst is exhausted, you are limited to the steady-state rate (100 requests per second).
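The mechanics above can be sketched in a few lines of Python. This is a minimal illustration of the algorithm, not AWS's actual implementation (which is not public); the rate and capacity values mirror the hypothetical DescribeInstances example.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: a sketch of the algorithm
    described above, not AWS's internal implementation."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens replenished per second (e.g. 100)
        self.capacity = capacity      # burst capacity (e.g. 200)
        self.tokens = capacity        # bucket starts full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Replenish at a constant rate, capped at the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1          # each request consumes one token
            return True
        return False                  # empty bucket -> throttled (429)

bucket = TokenBucket(rate=100, capacity=200)
# A tight loop of 300 instantaneous requests: roughly the first 200
# (the burst capacity) succeed, and the rest are throttled.
accepted = sum(bucket.allow() for _ in range(300))
```

Running the loop over a full second instead would admit roughly 100 additional requests, which is exactly the steady-state rate.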
Throttling Granularity Varies by Service
Throttling granularity varies significantly across services. EC2 APIs have individual rate limits set per API action. DescribeInstances and RunInstances are managed in separate buckets, so throttling on DescribeInstances does not affect RunInstances. DynamoDB throttling, on the other hand, is applied at the table level. Requests exceeding a table's provisioned capacity (RCU/WCU) are throttled. This is a data access throughput limit, different from API-level throttling. Lambda's concurrent execution limit is also a form of throttling. Function invocations exceeding the account's default concurrent execution limit of 1,000 are throttled with 429 errors. API Gateway has an account-level rate limit of 10,000 requests per second (default), with additional throttling settings configurable per API, per stage, and per method. This multi-layered throttling ensures that concentrated access to a specific API endpoint does not affect other endpoints.
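The per-action isolation described for EC2 can be illustrated with two independent buckets. This is a toy sketch: the action names are real EC2 APIs, but the token counts are made up.

```python
# Each EC2 API action gets its own bucket, so exhausting one
# does not affect the other. Token counts here are illustrative.
buckets = {
    "DescribeInstances": {"tokens": 0},   # already exhausted by heavy polling
    "RunInstances":      {"tokens": 5},   # independent bucket, still has room
}

def call(action: str) -> int:
    """Return the HTTP status a request against this action would get."""
    b = buckets[action]
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return 200    # request accepted
    return 429        # throttled

print(call("DescribeInstances"))  # 429: this action's bucket is empty
print(call("RunInstances"))       # 200: separate bucket, unaffected
```

Contrast this with DynamoDB, where the "bucket" is effectively per table (RCU/WCU), or Lambda, where it is the account-wide concurrency pool.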
Exponential Backoff and Jitter - The Retry Strategy SDKs Handle Automatically
The correct response to a throttling error (429) is to retry with a combination of exponential backoff and jitter. Exponential backoff is a strategy that increases retry intervals exponentially: 1 second, 2 seconds, 4 seconds, 8 seconds, and so on. This gradually reduces request pressure on the throttled service. Jitter adds random variation to retry intervals. With exponential backoff alone, multiple clients throttled at the same moment would all retry at the same time, triggering throttling again in a "thundering herd" problem. Adding jitter spreads the retries out. AWS SDKs implement this retry strategy automatically. The AWS SDK for JavaScript v3 defaults to a maximum of 3 attempts with exponential backoff and full jitter; boto3 (Python) implements a similar strategy, configurable through its retry modes ("legacy", "standard", and "adaptive"). If you call APIs directly without an SDK, you need to implement this retry logic yourself.
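Full jitter can be expressed in a few lines: the backoff ceiling doubles each attempt, and the actual sleep is drawn uniformly from zero up to that ceiling. The base delay and cap below are illustrative defaults, not the SDKs' exact values.

```python
import random

def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 20.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)).

    Sketches the strategy AWS SDKs apply on 429s; `base` and `cap`
    are illustrative, not the SDKs' actual defaults.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))

# The ceiling grows 1s, 2s, 4s, 8s, ... while the actual sleep is
# randomized, so simultaneously-throttled clients don't retry in lockstep.
for attempt in range(4):
    ceiling = min(20.0, 2.0 ** attempt)
    print(f"attempt {attempt}: sleep somewhere in [0, {ceiling:.0f}s), "
          f"e.g. {full_jitter_delay(attempt):.2f}s")
```

With boto3, you normally don't write this yourself: `botocore.config.Config(retries={"max_attempts": 10, "mode": "standard"})` configures the built-in retry behavior instead.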
Design Patterns to Proactively Avoid Throttling
Rather than retrying after throttling occurs, the ideal approach is to design systems that prevent throttling in the first place. The first pattern is reducing API calls. Instead of calling EC2's DescribeInstances every second to monitor instance state, you can use EventBridge events (EC2 Instance State-change Notification) to receive notifications only when state changes occur. Shifting from polling to event-driven architecture dramatically reduces API call volume. The second pattern is leveraging caching. Information that doesn't change frequently (account settings, region lists, etc.) can be cached locally to reduce API calls. The third pattern is using batch APIs. DynamoDB's BatchGetItem can retrieve up to 100 items in a single API call; compared to calling GetItem 100 times individually, this reduces the API call count by 99%. S3's ListObjectsV2 can likewise return up to 1,000 objects per request via the MaxKeys parameter. To go deeper on API design and throttling strategies, specialized books on the topic can be helpful.
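The batching pattern reduces to chunking keys into groups that fit the API's per-request limit. The sketch below uses DynamoDB's 100-item BatchGetItem limit; the table name "MyTable" and the key shape are hypothetical placeholders.

```python
def chunks(items: list, size: int = 100):
    """Split a key list into batches of at most `size` items
    (DynamoDB's BatchGetItem accepts up to 100 items per request)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 hypothetical keys in DynamoDB's low-level attribute-value format.
keys = [{"pk": {"S": f"user#{n}"}} for n in range(250)]
batches = list(chunks(keys))
print(len(batches))  # 3 API calls instead of 250 individual GetItem calls

# Each batch would then be passed to boto3's real batch_get_item method:
#   dynamodb.batch_get_item(RequestItems={"MyTable": {"Keys": batch}})
# ("MyTable" is a placeholder; also check UnprocessedKeys in the response,
# since BatchGetItem may return a partial result under load.)
```

Note that BatchGetItem still consumes the same read capacity as the equivalent individual reads; what it saves is request overhead and API-level call count.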