Building an HPC Environment with AWS ParallelCluster - Automated Slurm Cluster Provisioning and Scaling

Automatically provision HPC clusters with CloudFormation and manage jobs with the Slurm scheduler. Also covers cost optimization with Spot Instances.

Overview of ParallelCluster

ParallelCluster is an open-source HPC cluster management tool provided by AWS. You define cluster configuration (instance types, node count, storage, networking) in a YAML configuration file and automatically provision it as a CloudFormation stack with the pcluster create-cluster command. The Slurm job scheduler is configured by default, allowing you to use existing Slurm job scripts as-is. It is used for large-scale parallel computing workloads including computational fluid dynamics (CFD), molecular dynamics, genome analysis, weather simulation, and financial risk calculation.
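As a concrete illustration, a minimal ParallelCluster 3.x definition might look like the following sketch (the region, subnet ID, key pair name, and instance types are placeholders to adapt to your environment):

```yaml
# cluster-config.yaml -- minimal ParallelCluster 3.x definition (sketch)
Region: us-east-1
Image:
  Os: alinux2                            # Amazon Linux 2 base AMI
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: my-keypair                  # placeholder
Scheduling:
  Scheduler: slurm                       # Slurm is the default scheduler
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: cr-compute
          InstanceType: c5.18xlarge
          MinCount: 0                    # no compute nodes while the queue is empty
          MaxCount: 16
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
```

The CloudFormation stack is then created with `pcluster create-cluster --cluster-name my-cluster --cluster-configuration cluster-config.yaml`, and existing Slurm job scripts can be submitted with `sbatch` on the head node as usual.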

Auto Scaling and Cost Optimization

ParallelCluster's auto scaling works in conjunction with Slurm's job queue. When jobs are submitted, compute nodes launch automatically; when nodes remain idle for a configurable period after jobs complete, they terminate automatically. During periods with no jobs, the compute fleet scales to zero, and only the head node incurs charges. Spot Instances can reduce HPC workload costs by up to 90% compared with On-Demand pricing. Specifying multiple instance types per compute resource, with the queue's capacity type set to Spot, improves the chance of obtaining Spot capacity. For checkpoint-capable applications, you can configure jobs to be requeued automatically when a Spot interruption occurs.
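These behaviors map to a few keys in the cluster configuration. The fragment below is a sketch (queue and resource names are illustrative): `ScaledownIdletime` controls how long idle nodes are kept, `MinCount: 0` lets the fleet scale to zero, and `CapacityType: SPOT` with an `Instances` list (ParallelCluster 3.3 and later) lets the scheduler draw from several Spot capacity pools.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10        # minutes a node may sit idle before termination
  SlurmQueues:
    - Name: spot-queue
      CapacityType: SPOT         # bill compute nodes at Spot prices
      ComputeResources:
        - Name: cr-flexible
          Instances:             # multiple types -> more Spot pools available
            - InstanceType: c5.18xlarge
            - InstanceType: c5n.18xlarge
            - InstanceType: m5.24xlarge
          MinCount: 0            # scale to zero when the queue is empty
          MaxCount: 32
```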

EFA and Shared Storage

EFA (Elastic Fabric Adapter) is a high-speed network interface for HPC workloads that bypasses the operating system kernel to deliver low-latency, high-throughput inter-node communication (up to 100 Gbps or more, depending on the instance type). It is effective for MPI (Message Passing Interface) based parallel computing, where inter-node data exchange often becomes the bottleneck. Simply enabling EFA in the ParallelCluster configuration automatically provisions EFA interfaces on compute nodes. For shared storage, you can choose among FSx for Lustre, EFS, and EBS. FSx for Lustre provides throughput of up to hundreds of GB/s, making it well suited to parallel reads of large datasets. Its S3 integration can automatically import data from an S3 bucket into the Lustre file system and export computation results back to S3.
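Both features are configuration-level switches. The fragment below sketches an EFA-enabled queue backed by an S3-linked FSx for Lustre file system (the subnet ID and bucket paths are placeholders; the instance type must be one that supports EFA, such as c5n.18xlarge):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: mpi-queue
      ComputeResources:
        - Name: cr-efa
          InstanceType: c5n.18xlarge     # EFA-capable instance type
          MinCount: 0
          MaxCount: 8
          Efa:
            Enabled: true                # attach an EFA interface to each node
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
        PlacementGroup:
          Enabled: true                  # keep MPI ranks physically close
SharedStorage:
  - MountDir: /fsx
    Name: fsx-scratch
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200              # GiB
      ImportPath: s3://my-bucket/input   # placeholder bucket
      ExportPath: s3://my-bucket/output  # placeholder bucket
```

The cluster placement group keeps MPI ranks on nearby hardware, which EFA benefits from; the `ImportPath`/`ExportPath` pair wires the Lustre file system to S3 for input staging and result export as described above.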

Summary

ParallelCluster is a tool that automatically provisions Slurm-based HPC clusters on AWS. Auto scaling provides resource management aligned with job demand, Spot Instances reduce costs, and EFA delivers high-speed inter-node communication. It is ideal for migrating from on-premises HPC clusters or handling burst computing demands.