How AWS Keeps Time Internally - Amazon Time Sync Service and Leap Second Smearing Design
Learn how Amazon Time Sync Service works, how GPS and atomic clocks provide high-precision time sources, the design decision to absorb leap seconds through smearing, and why time synchronization matters in distributed systems.
Why Time Synchronization Matters in the Cloud
In distributed systems, accurate time synchronization is more important than you might think. CloudTrail logs, CloudWatch metrics, DynamoDB conditional writes, S3 object versioning, TLS certificate expiration validation - nearly every AWS service depends on accurate time. When clocks drift, log timelines fall out of order, which makes incident investigation difficult. TLS validity checks can misfire, so that valid certificates are rejected as expired or not yet valid. Kerberos authentication (used by Active Directory) rejects requests when the time difference between client and server exceeds the allowed skew, which is 5 minutes by default. In distributed databases, clock drift directly affects data consistency. Google's Spanner database uses atomic clocks and GPS behind its TrueTime API precisely because it relies on bounded clock uncertainty to guarantee the consistency of distributed transactions. AWS addresses the same challenge with its own solution: Amazon Time Sync Service.
How Amazon Time Sync Service Works
Amazon Time Sync Service is a high-precision time source deployed in each AWS Region. EC2 instances reach its NTP (Network Time Protocol) server at the link-local address 169.254.169.123. Like the metadata service at 169.254.169.254, this link-local address works independently of VPC network configuration (no internet gateway or security group rule is needed) and is available immediately after instance launch. The time sources behind Time Sync Service are GPS receivers and atomic clocks (rubidium or cesium) deployed in each Region. GPS satellites carry atomic clocks and broadcast time information with nanosecond-level precision. AWS cross-references GPS time with its local atomic clocks, so accurate time is maintained even if the GPS signal is temporarily lost. Because the path from an EC2 instance to Time Sync Service stays inside the AWS network, latency is on the order of microseconds. Public NTP servers reached over the internet (such as pool.ntp.org) typically deliver millisecond-level accuracy, so Time Sync Service is orders of magnitude more precise.
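To make the NTP exchange concrete, here is a minimal SNTP client sketch in Python. It is an illustration only, not how chrony or AWS implement synchronization: it sends one request and reads the server's transmit timestamp, ignoring round-trip delay. The function names (`build_request`, `parse_transmit_time`, `query_time`) are hypothetical, and `query_time` only works from an environment that can reach the link-local address, such as an EC2 instance.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2_208_988_800
# Link-local address of Amazon Time Sync Service, as given in the text.
TIME_SYNC_ADDR = "169.254.169.123"


def build_request() -> bytes:
    """Build a 48-byte SNTP client request.

    First byte: LI=0, VN=4, Mode=3 (client) -> 0b00_100_011 = 0x23.
    All remaining fields are left as zero.
    """
    return b"\x23" + 47 * b"\x00"


def parse_transmit_time(packet: bytes) -> float:
    """Return the server transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP packet is shorter than 48 bytes")
    # Transmit timestamp: 32-bit seconds + 32-bit fraction at bytes 40..47.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_time(host: str = TIME_SYNC_ADDR, timeout: float = 1.0) -> float:
    """Send one SNTP request over UDP port 123 and return the server time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (host, 123))
        packet, _ = sock.recvfrom(512)
    return parse_transmit_time(packet)
```

On an instance, `query_time()` returns the server's time as a Unix timestamp. A real NTP client such as chrony does much more (multiple samples, delay compensation, clock slewing), so this sketch is for understanding the protocol rather than for setting the clock.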
Leap Second Smearing - Avoiding 23:59:60
A leap second is a mechanism that inserts one second into UTC (Coordinated Universal Time) to compensate for variations in Earth's rotation speed. When a leap second is inserted, the normally nonexistent time 23:59:60 appears after 23:59:59. Many software systems do not expect this value, and it has caused multiple large-scale outages in the past. During the 2012 leap second, a Linux kernel bug caused outages at Reddit, Mozilla, Yelp, and other services. Amazon Time Sync Service handles leap seconds through "smearing." Instead of inserting 23:59:60, it spreads the extra second evenly across a 24-hour window, from 12 hours before the leap second to 12 hours after it. Each second in that window is approximately 11.6 microseconds longer than usual (1 second divided by 86,400 seconds), a difference that is negligible for most applications. Google's public NTP servers also smear leap seconds, but smearing methods and windows (linear, cosine, and so on) can differ between providers, and smeared time also differs from non-smeared time during the window, so mixing NTP sources with different leap second handling can cause time inconsistencies. Using only Time Sync Service is recommended in AWS environments.
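The arithmetic behind the 11.6 microsecond figure can be checked with a short sketch. It assumes a linear smear of one second over a 24-hour window, as described above; the function names are hypothetical and this is not AWS's implementation.

```python
SMEAR_WINDOW_S = 24 * 60 * 60  # 24-hour window centered on the leap second
LEAP_S = 1.0                   # one inserted second


def per_second_stretch_us(window_s: float = SMEAR_WINDOW_S,
                          leap_s: float = LEAP_S) -> float:
    """How much longer each smeared second is, in microseconds."""
    return leap_s / window_s * 1e6


def smear_offset_s(elapsed_s: float,
                   window_s: float = SMEAR_WINDOW_S,
                   leap_s: float = LEAP_S) -> float:
    """Portion of the leap second already absorbed, elapsed_s seconds
    after the start of the smear window (linear smear)."""
    clamped = min(max(elapsed_s, 0.0), window_s)
    return leap_s * clamped / window_s


if __name__ == "__main__":
    print(f"stretch per second: {per_second_stretch_us():.3f} us")
    print(f"absorbed at the midpoint: {smear_offset_s(12 * 3600)} s")
```

At the midpoint of the window, which is the moment the leap second would otherwise have been inserted, half of the second has been absorbed. A clock that smears and a clock that does not therefore disagree by up to half a second during the window.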
ClockBound - Visualizing Time Uncertainty
ClockBound, open-sourced by AWS in 2021, is a daemon that provides the "uncertainty range" of the current time. Time synchronized via NTP always contains some error due to network latency and clock drift. ClockBound calculates the upper and lower bounds of this error, providing information like "the true current time is within X plus or minus Y microseconds." This information can be used for transaction ordering in distributed databases. If the uncertainty ranges of two event timestamps overlap, you cannot determine which event occurred first. If they do not overlap, the order can be determined. The concept is similar to Google Spanner's TrueTime API, but ClockBound is open source and can also be used outside AWS environments. AWS managed services like DynamoDB and Aurora are internally designed to account for this kind of time uncertainty, but when you build your own distributed systems, leveraging ClockBound can prevent data inconsistencies caused by time.
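The ordering rule can be sketched as an interval comparison. The code below does not use the ClockBound API; `TimeBound` and `order` are hypothetical names, and the bounds are assumed to have been obtained from some source that reports a timestamp together with an error bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeBound:
    """Interval that is assumed to contain the true time of an event."""
    earliest: int  # microseconds
    latest: int    # microseconds

    @classmethod
    def from_estimate(cls, timestamp_us: int, error_us: int) -> "TimeBound":
        # "X plus or minus Y microseconds" becomes [X - Y, X + Y].
        return cls(timestamp_us - error_us, timestamp_us + error_us)


def order(a: TimeBound, b: TimeBound) -> str:
    """Decide event order only when the intervals do not overlap."""
    if a.latest < b.earliest:
        return "a_first"
    if b.latest < a.earliest:
        return "b_first"
    return "unknown"  # overlapping intervals: order cannot be determined


if __name__ == "__main__":
    a = TimeBound.from_estimate(1_000, 50)
    b = TimeBound.from_estimate(1_200, 50)
    c = TimeBound.from_estimate(1_080, 50)
    print(order(a, b))  # intervals are disjoint
    print(order(a, c))  # intervals overlap
```

A system that needs a definite order can wait until the uncertainty has passed before committing, which is the approach Spanner takes with TrueTime. The smaller the error bound, the shorter that wait.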
Real Problems Caused by Time Synchronization Failures
Time synchronization problems are difficult to diagnose because the symptoms are subtle and rarely point at the clock as the root cause. Here are some common problem patterns. First, TLS certificate validation failures. When an instance's clock drifts into the future, still-valid certificates are judged as "expired." When it drifts into the past, newly issued certificates are judged as "not yet valid." If HTTPS connections suddenly start failing, clock drift may be the cause. Second, AWS API authentication failures. SigV4 signatures include a timestamp, and requests are rejected when the difference between the request timestamp and the AWS server's time exceeds the allowed skew, typically 5 minutes. Third, disordered log timelines. When logs from multiple instances are aggregated, entries from instances with clock drift do not sort in true chronological order, which makes incident investigation difficult. As a countermeasure, configure chrony (an NTP client) on all EC2 instances and use Amazon Time Sync Service (169.254.169.123) as the sole time source. This is configured by default on Amazon Linux 2 and later.
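The first two failure patterns come down to simple comparisons against the local clock, which the following sketch illustrates. The function names are hypothetical, the 5-minute tolerance is the value quoted above, and real TLS libraries and AWS endpoints perform these checks internally in their own way.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # tolerance quoted in the text


def cert_status(now: datetime, not_before: datetime,
                not_after: datetime) -> str:
    """Classify a certificate's validity window against the local clock."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"


def within_skew(request_time: datetime, server_time: datetime,
                max_skew: timedelta = MAX_SKEW) -> bool:
    """True if a signed request's timestamp is close enough to server time."""
    return abs(server_time - request_time) <= max_skew


if __name__ == "__main__":
    not_before = datetime(2024, 1, 1, tzinfo=timezone.utc)
    not_after = datetime(2025, 1, 1, tzinfo=timezone.utc)
    true_now = datetime(2024, 12, 31, 12, 0, tzinfo=timezone.utc)

    # A clock running one day fast pushes a valid certificate past expiry.
    fast_clock = true_now + timedelta(days=1)
    print(cert_status(true_now, not_before, not_after))
    print(cert_status(fast_clock, not_before, not_after))

    # A clock running six minutes slow falls outside the signing tolerance.
    slow_clock = true_now - timedelta(minutes=6)
    print(within_skew(slow_clock, true_now))
```

In both cases the certificate and the credentials are fine, and only the local clock is wrong, which is why these failures are easy to misdiagnose.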