AWS Batch

A fully managed service that automatically schedules hundreds of thousands of batch processing jobs and runs them on optimal computing resources

Overview

AWS Batch is a fully managed service that automates the planning, scheduling, and execution of batch computing workloads. When you submit jobs to a job queue, Batch automatically provisions the necessary computing resources (EC2 instances or Fargate), resolves job dependencies, and executes them in the optimal order. It is suited for workloads requiring massive parallel computation, such as genomic analysis, financial risk calculations, video encoding, and machine learning preprocessing. Integration with Spot Instances enables significant cost reduction.

EC2 and Fargate Compute Environments

AWS Batch compute environments define the infrastructure that runs your jobs. EC2-based environments let you specify instance types, minimum/maximum vCPUs, and whether to use Spot Instances. You can configure allocation strategies - BEST_FIT_PROGRESSIVE selects the lowest-cost instance types that fit the job requirements, while SPOT_CAPACITY_OPTIMIZED picks from Spot pools with the most spare capacity to minimize interruptions. Fargate-based environments require no instance management - you simply specify the vCPU and memory needed per job, and Batch handles provisioning and scaling automatically. The selection criteria in practice are straightforward: choose EC2 when you need GPUs, specific instance types, or high-performance computing with multi-node parallel jobs; choose Fargate for everything else. When using Spot Instances with EC2 environments, Batch automatically detects Spot interruptions and reschedules jobs onto other instances, enabling up to 90% cost savings for interruption-tolerant workloads.
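The Spot-backed EC2 configuration described above can be sketched as the request payload you would pass to boto3's `batch` client via `create_compute_environment(**kwargs)`. This is a minimal sketch, not a complete production setup: the environment name, subnet ID, and instance role are placeholders, and networking/security details are omitted.

```python
# Sketch: kwargs for boto3.client("batch").create_compute_environment(**kwargs)
# building a managed, Spot-backed EC2 compute environment.
# "subnet-EXAMPLE" and "ecsInstanceRole-EXAMPLE" are placeholders.

def spot_compute_environment(name: str, max_vcpus: int) -> dict:
    """Build the request payload for an EC2 Spot compute environment."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": "SPOT",  # run on Spot Instances for cost savings
            # Pick from Spot pools with the most spare capacity
            # to minimize interruptions:
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,                 # scale to zero when the queue is empty
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],  # let Batch choose instance sizes
            "subnets": ["subnet-EXAMPLE"],              # placeholder
            "instanceRole": "ecsInstanceRole-EXAMPLE",  # placeholder
        },
    }

kwargs = spot_compute_environment("genomics-spot-ce", max_vcpus=256)
# boto3.client("batch").create_compute_environment(**kwargs)
```

A Fargate environment would instead set `"type": "FARGATE"` under `computeResources` and omit instance types and the instance role, since there are no instances to manage.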

Job Queues and Dependency Control

Job queues are the central scheduling mechanism in AWS Batch. Each queue is associated with one or more compute environments and has a priority value - when multiple queues share compute environments, higher-priority queues get resources first. A common pattern is to create separate queues for urgent jobs (high priority, On-Demand instances) and cost-optimized jobs (low priority, Spot Instances). Job dependencies let you define execution order: a job can depend on the successful completion of one or more predecessor jobs, enabling DAG (Directed Acyclic Graph) style workflows. Array jobs submit up to thousands of identical jobs with different parameters (each identified by its array index), which is ideal for embarrassingly parallel workloads like processing thousands of files or running parameter sweeps. Job definitions allow you to configure retry strategies with configurable maximum retry counts and evaluate-on-exit rules that retry only on specific exit codes or status reasons, preventing wasted compute on non-transient failures.
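The array-job, dependency, and retry features above can be sketched as the payloads you would pass to the boto3 `batch` client's `submit_job(**kwargs)`. This is a hedged sketch: the queue name (`spot-queue`), job definition names (`process-file`, `summarize`), and job names are placeholders, not names from the original text.

```python
# Sketch: request payloads for boto3.client("batch").submit_job(**kwargs).
# Queue and job-definition names below are placeholders.

def array_job(name: str, size: int) -> dict:
    """An array job: `size` copies of the same job, each container
    seeing its own AWS_BATCH_JOB_ARRAY_INDEX environment variable."""
    return {
        "jobName": name,
        "jobQueue": "spot-queue",         # placeholder queue
        "jobDefinition": "process-file",  # placeholder job definition
        "arrayProperties": {"size": size},
        "retryStrategy": {
            "attempts": 3,
            # Retry Spot reclamation / host failures, but exit immediately
            # on any other (likely non-transient) failure:
            "evaluateOnExit": [
                {"onStatusReason": "Host EC2*", "action": "RETRY"},
                {"onReason": "*", "action": "EXIT"},
            ],
        },
    }

def summary_job(name: str, parent_job_id: str) -> dict:
    """A downstream job that starts only after the whole parent
    array job (every child index) completes successfully."""
    return {
        "jobName": name,
        "jobQueue": "spot-queue",
        "jobDefinition": "summarize",     # placeholder job definition
        "dependsOn": [{"jobId": parent_job_id}],
    }

fan_out = array_job("process-files", size=1000)
fan_in = summary_job("summarize-results", parent_job_id="example-job-id")
```

Submitting `fan_out` and then `fan_in` (using the job ID returned by the first `submit_job` call) yields a simple fan-out/fan-in DAG without any external orchestrator.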

Step Functions Integration and Retry Strategies

While Batch's built-in job dependencies handle simple sequential and fan-out patterns, integrating with Step Functions unlocks more complex workflow orchestration. A common architecture detects files arriving in S3 via EventBridge, submits Batch jobs through a Step Functions state machine, monitors job completion, and triggers downstream processing or notifications. Step Functions' native Batch integration (the SubmitJob action with .sync suffix) waits for job completion and captures the exit status, eliminating the need for polling loops. For error handling, combine Batch-level retry strategies (automatic retries on transient failures) with Step Functions-level Catch and Retry blocks for workflow-level recovery - for example, retrying a failed job with increased memory or falling back to an alternative processing path. Multi-node parallel jobs distribute a single job across multiple EC2 nodes and support HPC workloads using MPI (Message Passing Interface), but these are not available in Fargate environments, so EC2 compute environments must be selected for HPC use cases.
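The orchestration pattern above can be sketched as an Amazon States Language definition, built here as a Python dict so it stays self-contained. It uses the native `submitJob.sync` integration plus workflow-level Retry and Catch blocks; the state names, ARNs, and the fallback path are illustrative placeholders, not values from the original text.

```python
import json

# Sketch: a Step Functions state machine that submits a Batch job and
# waits for completion (.sync), retries transient task failures, and
# falls back to an alternative path on unrecoverable errors.
# All ARNs and state names are placeholders.

definition = {
    "StartAt": "RunBatchJob",
    "States": {
        "RunBatchJob": {
            "Type": "Task",
            # .sync = wait for the job to finish and capture its status:
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "transcode",
                "JobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/spot-queue",          # placeholder
                "JobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/transcode", # placeholder
            },
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "Next": "FallbackPath"}
            ],
            "Next": "Done",
        },
        # Placeholder for an alternative processing path, e.g. resubmitting
        # the job with more memory or routing to a notification:
        "FallbackPath": {"Type": "Pass", "End": True},
        "Done": {"Type": "Succeed"},
    },
}

# json.dumps(definition) is what you would pass as the `definition`
# argument to Step Functions' CreateStateMachine API.
asl_json = json.dumps(definition)
```

The division of labor is deliberate: Batch-level `retryStrategy` handles per-job transient failures cheaply, while the state machine's Retry/Catch handles workflow-level recovery decisions that a single job cannot make about itself.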
