How Many Seconds Does an RDS Failover Take - Multi-AZ Switchover Mechanism and DNS Propagation Internals
This article breaks down the 60-120 seconds of an RDS Multi-AZ failover into its component phases - failure detection, DNS record update, and connection re-establishment - and explains practical techniques to reduce failover time.
Multi-AZ Basic Architecture - Primary and Standby
In an RDS Multi-AZ deployment, the primary instance and the standby instance are placed in different Availability Zones. Writes to the primary are replicated to the standby synchronously: the primary commits a transaction only after confirming that the data has reached the standby, so committed data is not lost during failover. The standby does not accept application requests during normal operation, so it cannot be used to offload reads; read replicas serve that purpose. The standby's only role is to stay ready for immediate promotion when the primary fails. Applications connect to the database through the endpoint (CNAME) that RDS provides. This endpoint is a DNS record that normally resolves to the primary instance's IP address. During failover, the record is updated to point to the standby instance's IP address.
The Three Phases of Failover
RDS failover consists of three phases. Phase 1 is failure detection. It takes several seconds to tens of seconds for the RDS monitoring system to detect a primary instance failure. Detection time varies by failure type. Instance crashes are detected within seconds, but partial network failures (increased packet loss) take longer to detect. Phase 2 is DNS record update. The standby instance is promoted to primary, and the RDS endpoint's DNS record is updated to the standby's IP address. The DNS record TTL (Time To Live) is typically set to 5 seconds, but if the application or operating system caches DNS, it may continue connecting to the old IP address. Phase 3 is connection re-establishment. This is the time for the application to establish connections to the new primary. When using a connection pool, old connections in the pool are invalid and new connections must be created.
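The effect of client-side DNS caching in Phase 2 can be illustrated with a small simulation. This is a minimal sketch, not how any particular resolver is implemented: the `TtlDnsCache` class, the hostname `mydb.example.internal`, and both IP addresses are made up for illustration, and the clock is injected so the example runs without waiting.

```python
import time
from typing import Callable, Dict, List, Tuple


class TtlDnsCache:
    """Toy client-side DNS cache that re-resolves a name once its TTL expires."""

    def __init__(self, resolve: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve      # function that performs the real lookup
        self._ttl = ttl_seconds      # how long a cached answer stays valid
        self._clock = clock          # injectable so tests can control time
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, hostname: str) -> str:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]         # still fresh: may be the OLD primary's IP
        ip = self._resolve(hostname)
        self._entries[hostname] = (ip, now + self._ttl)
        return ip


def demo() -> List[str]:
    # Hypothetical endpoint and addresses, for illustration only.
    host = "mydb.example.internal"
    records = {host: "10.0.1.10"}    # primary's IP before failover
    now = [0.0]
    cache = TtlDnsCache(lambda h: records[h], ttl_seconds=5, clock=lambda: now[0])

    seen = [cache.lookup(host)]      # t=0s: resolves and caches the primary
    records[host] = "10.0.2.20"      # failover: record now points at the standby
    now[0] = 3.0
    seen.append(cache.lookup(host))  # t=3s: cache still returns the old IP
    now[0] = 6.0
    seen.append(cache.lookup(host))  # t=6s: TTL expired, new IP is picked up
    return seen


if __name__ == "__main__":
    print(demo())
```

Passing `ttl_seconds=float("inf")` models a client that caches forever: the lookup at t=6s would still return the old address, which is the failure mode a short TTL is meant to avoid.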
Breaking Down the 60-120 Seconds
According to AWS official documentation, RDS Multi-AZ failover time is "typically 60-120 seconds." A rough breakdown of where that time goes: failure detection takes 5-30 seconds, standby promotion takes 10-30 seconds, DNS record update and propagation takes 5-30 seconds, and application connection re-establishment takes 5-30 seconds. (Promotion and the DNS update together make up Phase 2 above.) The total varies because it depends on the type of failure and the state of the database. When there are many uncommitted transactions, recovery processing (rollback) is needed during standby promotion, which extends the time. If a failover occurs during a DDL operation (ALTER TABLE) on a large table, recovery can take several minutes. Aurora failover is faster than RDS failover. Because Aurora's storage layer is shared, no data recovery is needed during promotion. Aurora failover time is typically under 30 seconds, and Aurora Serverless v2 may reduce it further.
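Adding up the per-phase estimates makes the range easier to reason about. The sketch below only sums the figures quoted above; the names `PHASES` and `total_range` are illustrative.

```python
from typing import Dict, Tuple

# Per-phase estimates in seconds (low, high), taken from the breakdown above.
PHASES: Dict[str, Tuple[int, int]] = {
    "failure detection": (5, 30),
    "standby promotion": (10, 30),
    "DNS update and propagation": (5, 30),
    "connection re-establishment": (5, 30),
}


def total_range(phases: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low and high estimates across all phases."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(PHASES)
    print(f"best case: {low}s, worst case: {high}s")
```

The upper bounds sum to 120 seconds, matching the top of the documented range. The lower bounds sum to only 25 seconds, which suggests that the per-phase figures are rough estimates and that every phase rarely hits its best case in the same failover.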
Techniques to Reduce Failover Time
There are practical techniques to minimize the impact of failover. First, make sure the client honors the DNS TTL. Depending on the JVM version and whether a security manager is installed, Java may cache DNS lookups indefinitely (networkaddress.cache.ttl=-1). With that setting, the application keeps connecting to the old IP address after failover. Set the JVM's DNS cache TTL explicitly to a short value (e.g., 5-30 seconds). Second, use RDS Proxy. RDS Proxy sits between the application and the database and automatically redirects connections to the new primary during failover. Because the application connects to the RDS Proxy endpoint, it does not have to wait for DNS propagation. AWS reports that using RDS Proxy can reduce failover downtime by up to 66%. Third, implement retry logic on the application side: retry connection errors during failover with exponential backoff, so the first retry comes after 1 second, the next after 2 seconds, then 4 seconds, and so on.
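The retry schedule described above (1 second, 2 seconds, 4 seconds, and so on) can be sketched as a small helper. This is an illustration under assumptions, not a driver feature: `retry_with_backoff` and its parameters are hypothetical names, and the exception types to retry on depend on the database driver in use, so the defaults here are only Python's built-in `ConnectionError` and `TimeoutError`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on connection-type errors with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            # Delays grow 1s, 2s, 4s, ... and are capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_connect() -> str:
        # Stand-in for opening a database connection during a failover.
        calls["n"] += 1
        if calls["n"] <= 3:
            raise ConnectionError("primary unavailable")
        return "connected"

    print(retry_with_backoff(flaky_connect, sleep=lambda s: print(f"waiting {s}s")))
```

With the defaults, the waits total 31 seconds across five retries (1 + 2 + 4 + 8 + 16). Raise `max_attempts` if the retries need to span a full 60-120 second failover. Production code often adds random jitter to each delay so that many clients do not reconnect at the same instant; it is left out here to match the schedule in the text.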
Testing Failover - Be Prepared Before Production
Failover should never be experienced for the first time in production. RDS lets you trigger a failover manually: in the console, check "Reboot with failover" in the reboot options, or use the CLI command aws rds reboot-db-instance with the --force-failover option. There are four items to verify during a failover test. First, whether the application automatically reconnects to the new primary. Second, whether retry logic works correctly when requests fail during failover. Third, the actual time it took for failover to complete. Fourth, whether data integrity is maintained after failover. Setting up an RDS event notification subscription allows you to receive SNS notifications for failover start and completion. The API call behind a manual failover is also recorded in CloudTrail, and the failover events themselves appear in the RDS event log, which helps with post-incident analysis.
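For the third item, the observed outage can be timed from the application's point of view by probing the database at a fixed interval while the failover runs. The helper below is a minimal sketch: `measure_downtime` is a hypothetical name, and `probe` is assumed to be a caller-supplied function that returns True when a trivial query (for example `SELECT 1` through your driver) succeeds and False when it fails.

```python
import time
from typing import Callable, Optional


def measure_downtime(
    probe: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from the first failed probe to the next successful one.

    Returns None if no failure is observed before the timeout, and raises
    TimeoutError if an outage starts but does not end within the timeout.
    """
    start = clock()
    outage_started: Optional[float] = None
    while clock() - start < timeout:
        ok = probe()
        now = clock()
        if not ok and outage_started is None:
            outage_started = now           # first failure: outage begins
        elif ok and outage_started is not None:
            return now - outage_started    # first success after the outage
        sleep(interval)
    if outage_started is not None:
        raise TimeoutError("database did not recover within the timeout")
    return None


if __name__ == "__main__":
    # Simulated probe results: two successes, three failures, then recovery.
    results = iter([True, True, False, False, False, True])
    fake_now = [0.0]

    def fake_sleep(seconds: float) -> None:
        fake_now[0] += seconds             # advance simulated time instead of waiting

    downtime = measure_downtime(lambda: next(results), timeout=60,
                                clock=lambda: fake_now[0], sleep=fake_sleep)
    print(f"observed downtime: {downtime}s")
```

The measurement is only as precise as the probe interval, and it reflects what this one client saw, including its own DNS caching and connection handling. A real probe should open a fresh connection each time and use a short connect timeout so that a hung attempt does not inflate the result.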