Running Apache Spark on Amazon EMR - Cluster Design and Cost Optimization for Big Data
Learn how to build Spark clusters on EMR, decide when EMR Serverless is the better fit, and cut costs with spot instances.
EMR Overview
EMR is a service that runs over 20 big data frameworks including Apache Spark, Hive, Presto, and Flink on managed clusters. Clusters consist of master nodes, core nodes, and task nodes, with EMR automatically managing provisioning, configuration, and patching. EMR Serverless, which became generally available in 2022, completely eliminates cluster management. You simply submit Spark jobs and the necessary resources are automatically provisioned.
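To make "submitting a Spark job" more concrete, here is a minimal sketch in Python that assembles a request for the EMR Serverless StartJobRun API. The application ID, role ARN, and S3 paths are placeholders, and the helper name build_job_run_request is made up for this illustration. Treat it as a starting point under those assumptions, not a reference implementation.

```python
def build_job_run_request(application_id, execution_role_arn, entry_point,
                          args=(), spark_params=None):
    """Assemble keyword arguments for an EMR Serverless StartJobRun call.

    Every identifier passed in is a placeholder supplied by the caller.
    """
    spark_submit = {
        "entryPoint": entry_point,
        "entryPointArguments": list(args),
    }
    if spark_params:
        # sparkSubmitParameters is one string of spark-submit flags.
        spark_submit["sparkSubmitParameters"] = " ".join(
            f"--conf {key}={value}" for key, value in spark_params.items()
        )
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {"sparkSubmit": spark_submit},
    }


request = build_job_run_request(
    application_id="00example1234",  # hypothetical application ID
    execution_role_arn="arn:aws:iam::123456789012:role/example-emr-serverless-role",
    entry_point="s3://example-bucket/jobs/etl.py",  # hypothetical script path
    args=["--date", "2024-01-01"],
    spark_params={"spark.executor.memory": "4g", "spark.executor.cores": "2"},
)

# With boto3 installed and AWS credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr-serverless").start_job_run(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect or log before anything is actually submitted.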
Cost Optimization
Cost optimization in EMR hinges on instance configuration design. Use on-demand instances for master and core nodes to ensure stability, and spot instances for task nodes. Task nodes do not store HDFS data, so a spot interruption does not lose stored data; Spark reruns the interrupted tasks on the remaining nodes. Specify multiple instance types with instance fleets to improve spot availability. EMR Serverless is well-suited for intermittent jobs, because charges accrue only while jobs are consuming resources.
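The layout described above (on-demand master and core nodes, spot task nodes drawn from several instance types) can be sketched as an InstanceFleets definition in the shape accepted by boto3's run_job_flow. The instance types, capacities, weights, and the 10-minute spot timeout are illustrative assumptions, and build_instance_fleets is a hypothetical helper name.

```python
def build_instance_fleets(task_instance_types, task_spot_units,
                          core_on_demand_units=2):
    """Return an InstanceFleets list: on-demand master/core, spot task.

    task_instance_types maps instance type -> weighted capacity units.
    """
    return [
        {
            "Name": "master",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
        },
        {
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": core_on_demand_units,
            "InstanceTypeConfigs": [
                {"InstanceType": "m5.xlarge", "WeightedCapacity": 1}
            ],
        },
        {
            "Name": "task",
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": task_spot_units,
            # Several instance types widen the spot pools EMR can draw from.
            "InstanceTypeConfigs": [
                {"InstanceType": itype, "WeightedCapacity": weight}
                for itype, weight in task_instance_types.items()
            ],
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 10,  # assumed wait for spot capacity
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                    "AllocationStrategy": "capacity-optimized",
                }
            },
        },
    ]


fleets = build_instance_fleets(
    task_instance_types={"m5.xlarge": 1, "m5.2xlarge": 2, "r5.xlarge": 1},
    task_spot_units=8,
)

# Possible use with boto3 (other required arguments omitted here):
#   boto3.client("emr").run_job_flow(..., Instances={"InstanceFleets": fleets})
```

Only the task fleet requests spot capacity in this sketch, so an interruption never touches the nodes that hold HDFS data.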
Spark Tuning and Data Formats
Spark job performance depends heavily on partition count, executor settings, and data format selection. Adjust spark.sql.shuffle.partitions from its default of 200 to match the volume of shuffled data: too few partitions produce oversized tasks that spill to disk, while too many add scheduling overhead. Parquet format uses columnar storage that reads only the required columns, dramatically reducing scan volume and processing time compared to CSV. Adaptive Query Execution (AQE), which is enabled by default since Spark 3.2, automatically optimizes shuffle partition counts and join strategies based on runtime statistics. Table formats like Delta Lake and Apache Iceberg enable ACID transactions, time travel, and schema evolution on data lakes.
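One common way to pick a shuffle partition count is to divide the expected shuffle volume by a target partition size. The sketch below does that arithmetic and bundles the result with the AQE settings mentioned above. The 128 MiB target is a rule-of-thumb assumption, not a value from this article, and both helper names are hypothetical.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB):
    """Estimate a partition count so each partition is near the target size."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


def build_spark_conf(shuffle_bytes):
    """Return Spark SQL settings as strings, ready to pass to a session builder."""
    return {
        "spark.sql.shuffle.partitions": str(suggest_shuffle_partitions(shuffle_bytes)),
        # AQE can still coalesce small partitions and split skewed ones at runtime.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
    }


conf = build_spark_conf(100 * GIB)  # assumes roughly 100 GiB of shuffled data

# Possible use with PySpark:
#   builder = SparkSession.builder
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

With AQE's partition coalescing turned on, a static estimate like this mostly serves as an upper bound, since Spark can merge small partitions after each shuffle.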
EMR Cluster Cost Management
Instance fleets let you specify multiple instance types to maximize spot availability. Within each fleet, set separate target capacities for on-demand and spot instances so that on-demand secures a baseline and spot supplies the additional capacity. EMR Serverless requires no cluster management and charges only for the resources jobs consume while they run, making it ideal for sporadic batch processing. Apply Savings Plans to long-running clusters to take advantage of commitment discounts. Monitor YARN resource utilization with CloudWatch metrics such as YARNMemoryAvailablePercentage to detect and right-size over-provisioned clusters.
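As a rough illustration of the right-sizing check, the sketch below flags a cluster as over-provisioned when the median share of free YARN memory stays above a threshold. It assumes the samples were already fetched from CloudWatch (for example, the YARNMemoryAvailablePercentage metric). The 50% threshold, the minimum sample count, and the function name are arbitrary choices for this example.

```python
from statistics import median


def is_over_provisioned(available_memory_pct, threshold_pct=50.0, min_samples=12):
    """Return True when median free YARN memory exceeds the threshold.

    available_memory_pct: percentages (0-100) of YARN memory still available,
    one value per monitoring interval.
    """
    if len(available_memory_pct) < min_samples:
        # Too little history to judge; avoid shrinking the cluster on a blip.
        return False
    return median(available_memory_pct) > threshold_pct


# Hypothetical samples: a mostly idle cluster and a busy one.
idle_cluster = [72, 75, 80, 78, 69, 74, 77, 81, 79, 73, 76, 70]
busy_cluster = [12, 18, 25, 9, 14, 30, 22, 11, 16, 19, 27, 13]

print(is_over_provisioned(idle_cluster))
print(is_over_provisioned(busy_cluster))
```

Using the median rather than the mean keeps a single quiet or busy interval from deciding the outcome.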
Summary
EMR is a service for running big data frameworks in a managed environment. Adaptive Query Execution automatically optimizes Spark job performance, while Parquet format and proper partition design improve query efficiency. Instance fleets maximize spot availability, and EMR Serverless enhances cost efficiency for sporadic batch processing.