Why Auto Scaling Scales Out Fast but Scales In Cautiously - The Design Intent Behind Asymmetric Decision Logic
This article explains why EC2 Auto Scaling executes scale-out immediately while applying a cooldown period for scale-in, the flapping prevention mechanism, and the internal logic of target tracking scaling.
The Asymmetry Between Scale-Out and Scale-In
In EC2 Auto Scaling's default settings, the scale-out (adding instances) cooldown period is 0 seconds (immediate execution), while the scale-in (removing instances) cooldown period is 300 seconds (5 minutes). This asymmetric configuration has a clear design intent. A delayed scale-out directly impacts users: if traffic surges but instances are not added, response times degrade, and in the worst case the service goes down. Scale-out should therefore execute as quickly as possible. Scaling in too quickly, on the other hand, causes flapping - the frequent repetition of scaling out and scaling in. Traffic dips temporarily, instances are removed, then traffic rises again and instances must be added back. Since instance startup takes several minutes, flapping degrades performance and increases cost at the same time.
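A toy simulation (hypothetical numbers, not AWS behavior) illustrates how a scale-in cooldown absorbs short traffic dips that would otherwise cause flapping:

```python
# Illustrative sketch: count scaling actions for a traffic trace that
# briefly dips, with and without a scale-in cooldown. Toy model only:
# desired capacity = load, one decision per minute.

def scaling_actions(load_per_minute, scale_in_cooldown_min):
    """Return how many scale-out/scale-in actions the toy group takes."""
    capacity = load_per_minute[0]
    last_scale_in = -scale_in_cooldown_min  # allow scale-in at t=0
    actions = 0
    for t, load in enumerate(load_per_minute):
        if load > capacity:                    # scale-out: always immediate
            capacity = load
            actions += 1
        elif load < capacity and t - last_scale_in >= scale_in_cooldown_min:
            capacity = load                    # scale-in: only after cooldown
            last_scale_in = t
            actions += 1
    return actions

# Traffic that dips for two minutes, twice, then recovers each time.
trace = [10, 10, 6, 6, 10, 10, 6, 6, 10, 10]

print(scaling_actions(trace, scale_in_cooldown_min=0))   # → 4 (flapping)
print(scaling_actions(trace, scale_in_cooldown_min=10))  # → 2 (dips absorbed)
```

With no cooldown, every dip triggers a remove/re-add pair; with a generous cooldown, the group scales in once and rides out the later dips.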
How Cooldown Periods Work
A cooldown period is the time after a scaling action during which subsequent scaling actions are suppressed. During a scale-out cooldown, additional scale-outs are suppressed, but scale-in can still execute; conversely, during a scale-in cooldown, additional scale-ins are suppressed, but scale-out can still execute. This design guarantees that the situation "scale-out is needed but blocked by a scale-in cooldown" never occurs. The optimal cooldown value depends on workload characteristics. If it takes 3 minutes from EC2 instance launch to passing the ELB health check, the scale-out cooldown should be set to at least 3 minutes: with a shorter cooldown, the system concludes capacity is "still not enough" before the new instances have begun serving traffic, and scales out excessively. When using target tracking scaling policies, cooldown periods are managed automatically, so manual configuration is unnecessary.
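The direction-specific suppression described above can be sketched as a small tracker (illustrative Python, not the AWS implementation):

```python
# Sketch of the cooldown rule: a cooldown only suppresses further actions
# in the SAME direction; the opposite direction remains available.

class CooldownTracker:
    def __init__(self, scale_out_cooldown=0, scale_in_cooldown=300):
        self.cooldowns = {"out": scale_out_cooldown, "in": scale_in_cooldown}
        self.last_action = {"out": None, "in": None}

    def allowed(self, direction, now):
        """Is a scaling action in this direction permitted at time `now`?"""
        last = self.last_action[direction]
        return last is None or now - last >= self.cooldowns[direction]

    def record(self, direction, now):
        self.last_action[direction] = now

tracker = CooldownTracker()        # defaults: out=0s, in=300s
tracker.record("in", now=0)        # a scale-in just happened
print(tracker.allowed("in", now=60))   # → False: still in scale-in cooldown
print(tracker.allowed("out", now=60))  # → True: scale-out is never blocked by it
```

This is exactly why "scale-out blocked by a scale-in cooldown" cannot happen: the two directions keep independent timers.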
Internal Logic of Target Tracking Scaling
Target tracking scaling is the most commonly recommended scaling policy; it automatically adjusts instance count to keep a specified metric at a target value. For example, if you set the CPU utilization target to 50%, Auto Scaling adds or removes instances to hold CPU utilization near 50%. Internally, target tracking operates with an algorithm similar to a PID controller (proportional-integral-derivative control): it calculates the required number of instances from the deviation between the current metric value and the target value, and the larger the deviation, the more instances are added or removed at once. A key characteristic of target tracking is that it internally creates separate CloudWatch alarms for scale-out and scale-in. The scale-out alarm fires when the metric exceeds the threshold for 3 consecutive 1-minute data points (a 3-minute evaluation period), while the scale-in alarm fires when the metric stays below the threshold for 15 consecutive 1-minute data points (a 15-minute evaluation period). This asymmetric evaluation period is what achieves "fast scale-out, cautious scale-in."
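The proportional step can be approximated as follows - a sketch of the commonly cited capacity formula, new capacity ≈ ceil(current capacity × current metric ÷ target); the real service additionally accounts for instance warm-up, cooldowns, and min/max capacity limits:

```python
import math

def desired_capacity(current_capacity, current_metric, target):
    """Proportional estimate used by target tracking (approximation):
    capacity scales with the ratio of the observed metric to the target."""
    return max(1, math.ceil(current_capacity * current_metric / target))

# 10 instances at 75% CPU with a 50% target → roughly 15 instances needed.
print(desired_capacity(10, 75.0, 50.0))  # → 15
# 10 instances at 30% CPU → 6 instances would hold roughly 50%.
print(desired_capacity(10, 30.0, 50.0))  # → 6
```

Note how a larger deviation (75% vs. 55%) yields a proportionally larger capacity change in a single step, matching the behavior described above.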
Instance Selection Logic During Scale-In
When scale-in executes, which instance gets terminated is determined by the default termination policy. The default logic has 3 stages. First, it selects the AZ with the most instances, maintaining instance count balance across AZs. Second, within that AZ, it selects the instance using the oldest launch configuration or launch template, prioritizing removal of instances with older configurations and facilitating migration to newer ones. Third, if multiple instances share the same launch configuration, it selects the instance closest to the next billing hour. Since EC2 introduced per-second billing, this criterion has little practical significance, but the logic remains; if multiple instances still tie after all three stages, one is chosen at random. Custom termination policies are also available: you can choose policies such as NewestInstance (terminate the newest instance), OldestInstance (terminate the oldest instance), and ClosestToNextInstanceHour (terminate the instance closest to the next billing hour).
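The three stages can be sketched as successive filters (hypothetical `Instance` fields for illustration; the real service works from launch template metadata and breaks remaining ties at random):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    az: str
    template_version: int      # lower = older launch template/configuration
    seconds_to_next_hour: int  # proximity to the next billing hour

def pick_instance_to_terminate(instances):
    # Stage 1: target the AZ with the most instances (rebalance AZs).
    az_counts = {}
    for i in instances:
        az_counts[i.az] = az_counts.get(i.az, 0) + 1
    busiest_az = max(az_counts, key=az_counts.get)
    candidates = [i for i in instances if i.az == busiest_az]
    # Stage 2: prefer the oldest launch template/configuration.
    oldest = min(i.template_version for i in candidates)
    candidates = [i for i in candidates if i.template_version == oldest]
    # Stage 3: pick the instance closest to the next billing hour.
    return min(candidates, key=lambda i: i.seconds_to_next_hour)

fleet = [
    Instance("i-a", "ap-northeast-1a", 2, 1200),
    Instance("i-b", "ap-northeast-1a", 1, 3000),
    Instance("i-c", "ap-northeast-1a", 1, 600),
    Instance("i-d", "ap-northeast-1c", 1, 100),
]
print(pick_instance_to_terminate(fleet).instance_id)  # → "i-c"
```

Here `i-d` survives despite being closest to its billing hour because its AZ has fewer instances - stage 1 dominates the later stages.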
Predictive Scaling - Forecasting the Future from Past Patterns
Predictive scaling, introduced for EC2 Auto Scaling in 2021, uses machine learning to analyze up to the past 14 days of traffic patterns, forecast future traffic, and provision instances in advance. For example, if traffic surges every morning at 9 AM, predictive scaling begins adding instances around 8:50 AM to prepare for the spike. With purely reactive scaling (adding instances only after traffic increases), the several minutes required for instance startup and ELB registration cause performance degradation during the initial surge; predictive scaling bridges this gap. Predictive scaling is recommended for use alongside target tracking scaling, with a clear division of responsibilities: predictive scaling pre-provisions the expected baseline, while target tracking handles unexpected fluctuations. The accuracy of predictive scaling depends on the regularity of traffic patterns: workloads that repeat the same pattern daily achieve high accuracy, while irregular traffic may yield inaccurate forecasts. For a systematic treatment of scaling design patterns, specialized books on the subject are a helpful reference.
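The "provision ahead of the forecast" idea can be sketched as a simple schedule shift - a toy model, not the actual ML forecaster; the 10-minute buffer plays the role of the policy's configurable lead time:

```python
# Toy sketch: given a forecast of (minute_of_day, required_capacity) pairs,
# shift each capacity change earlier by a buffer so instances are already
# running and registered with the ELB before the predicted spike.

def scheduled_actions(forecast, buffer_minutes=10):
    """Return (start_minute, capacity) pairs shifted earlier by the buffer."""
    return [(max(0, minute - buffer_minutes), capacity)
            for minute, capacity in forecast]

# Forecast: 20 instances needed at 09:00 (minute 540 of the day).
print(scheduled_actions([(540, 20)]))  # → [(530, 20)], i.e. start at 08:50
```

This mirrors the 8:50 AM example above: capacity is requested early enough that instance startup and ELB registration finish before the 9 AM spike arrives.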