Running Apache Spark on Amazon EMR - Cluster Design and Cost Optimization for Big Data

Learn how to build Spark clusters with EMR, choose between EMR on EC2, EMR Serverless, and EMR on EKS, and optimize costs with spot instances.

About 6 min read. Last updated: 2026-05-08

EMR Overview

Amazon EMR is a managed service that runs more than 20 big data frameworks, including Apache Spark, Hive, Presto, and Flink. A cluster consists of a master node, core nodes, and task nodes, and EMR handles provisioning, configuration, and patching. EMR Serverless, generally available since 2022, removes cluster management entirely: you submit Spark jobs and the required resources are provisioned automatically. Separating compute from storage by keeping data in S3 is the standard architecture. Because nothing depends on HDFS, clusters can be terminated and recreated freely, which keeps lifecycle management flexible.

Cost Optimization

Cost optimization in EMR hinges on instance configuration design. Use on-demand instances for master and core nodes to ensure stability, and spot instances for task nodes. Because task nodes do not store HDFS data, a spot interruption does not lose persisted data, although tasks running on the reclaimed node, and any shuffle output it held, must be recomputed. Specify multiple instance types with instance fleets to improve spot availability. EMR Serverless is well-suited for intermittent jobs, with charges incurred only during job execution. EMR on EC2 pricing is the EC2 instance cost plus an EMR management fee (approximately 25% of the EC2 cost), so applying Savings Plans to the EC2 portion is effective for long-running clusters.
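
The effect of moving task nodes to spot can be estimated with simple arithmetic. The sketch below is illustrative only: every price and the 70% spot discount are hypothetical placeholders rather than real AWS rates, and it assumes the EMR fee is computed from the on-demand price, so it is not reduced when the instance runs on spot.

```python
from dataclasses import dataclass

# Approximate EMR fee relative to the on-demand EC2 price (figure taken from the article).
EMR_FEE_RATIO = 0.25


@dataclass
class NodeGroup:
    count: int
    on_demand_price: float      # USD per instance-hour (hypothetical value)
    spot_discount: float = 0.0  # 0.0 = on-demand, 0.7 = 70% below on-demand

    def hourly_cost(self) -> float:
        ec2 = self.on_demand_price * (1.0 - self.spot_discount)
        # Assumption: the EMR fee does not shrink with the spot discount.
        emr_fee = self.on_demand_price * EMR_FEE_RATIO
        return self.count * (ec2 + emr_fee)


def cluster_hourly_cost(groups) -> float:
    """Sum the hourly cost of every node group in the cluster."""
    return sum(group.hourly_cost() for group in groups)


# 1 master, 2 core, 10 task nodes; prices are placeholders.
all_on_demand = [NodeGroup(1, 0.20), NodeGroup(2, 0.40), NodeGroup(10, 0.40)]
spot_tasks = [NodeGroup(1, 0.20), NodeGroup(2, 0.40),
              NodeGroup(10, 0.40, spot_discount=0.7)]

print(f"all on-demand: {cluster_hourly_cost(all_on_demand):.2f} USD/h")
print(f"spot task nodes: {cluster_hourly_cost(spot_tasks):.2f} USD/h")
```

Substituting current prices for your region and instance types gives a quick first estimate before committing to a fleet design.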

Spark Tuning and Data Formats

Spark job performance depends heavily on partition count, executor settings, and data format selection. Adjust spark.sql.shuffle.partitions, which defaults to 200, to a value that matches the volume of data being shuffled. Parquet uses columnar storage, so Spark reads only the required columns, which dramatically reduces scan volume and processing time compared to CSV. Adaptive Query Execution (AQE), enabled by default since Spark 3.2, automatically optimizes shuffle partition counts and join strategies based on runtime statistics. Table formats like Delta Lake and Apache Iceberg enable ACID transactions, time travel, and schema evolution on data lakes. From Spark 3.0 onward, Dynamic Partition Pruning (DPP) automatically optimizes joins with partitioned tables, eliminating unnecessary partition reads.
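
As a rough illustration of sizing shuffle partitions to data volume, the sketch below derives a partition count from an estimated shuffle size and assembles the related settings as spark-submit arguments. The 128 MiB target per partition is a common rule of thumb assumed here, not a figure from this article, and the helper names are hypothetical. The configuration keys are standard Spark properties.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB):
    """Return a partition count that keeps each shuffle partition near the target size."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


def build_spark_conf(shuffle_bytes):
    """Collect the tuning properties discussed above into a dict of strings."""
    return {
        "spark.sql.shuffle.partitions": str(suggest_shuffle_partitions(shuffle_bytes)),
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    }


def to_submit_args(conf):
    """Flatten the dict into repeated '--conf key=value' arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = build_spark_conf(50 * GIB)  # assume roughly 50 GiB is shuffled
print(" ".join(to_submit_args(conf)))
```

With AQE enabled, the configured value acts as a starting point that Spark can coalesce at runtime, so erring slightly high is usually the safer direction.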

EMR Cluster Cost Management

Instance fleets let you specify multiple instance types to maximize spot availability. Set target capacities for on-demand and spot within each fleet, securing a baseline with on-demand and supplementing it with spot. EMR Serverless requires no cluster management and charges only for job execution time, making it ideal for sporadic batch processing. Apply Savings Plans to long-running clusters to take advantage of commitment discounts. Monitor YARN resource utilization with CloudWatch metrics to detect and right-size over-provisioned clusters. Enabling Managed Scaling automatically resizes core and task capacity within the limits you set, based on YARN metrics, adding resources only during peaks and shrinking during idle periods.
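
The on-demand and spot mix and the scaling limits can be expressed as plain request structures. The sketch below only builds dictionaries and makes no AWS calls. The field names follow my understanding of the EMR RunJobFlow instance fleet and managed scaling policy shapes and should be checked against the current API reference. The instance types, capacities, and helper names are hypothetical examples.

```python
def task_fleet(instance_types, on_demand_units, spot_units):
    """Build a TASK instance fleet that spreads spot requests across several types."""
    return {
        "Name": "task-fleet",
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_units,  # on-demand baseline
        "TargetSpotCapacity": spot_units,           # additional capacity from spot
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": 1} for itype in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }


def managed_scaling_policy(min_units, max_units, max_on_demand_units):
    """Build managed scaling limits, rejecting inconsistent bounds early."""
    if not 0 < min_units <= max_units:
        raise ValueError("expected 0 < min_units <= max_units")
    if max_on_demand_units > max_units:
        raise ValueError("max_on_demand_units cannot exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            "MaximumOnDemandCapacityUnits": max_on_demand_units,
        }
    }


fleet = task_fleet(["m5.xlarge", "m5a.xlarge", "m4.xlarge"], on_demand_units=2, spot_units=8)
policy = managed_scaling_policy(min_units=3, max_units=20, max_on_demand_units=5)
print(fleet["TargetSpotCapacity"], policy["ComputeLimits"]["MaximumCapacityUnits"])
```

Capping on-demand units below the overall maximum means any growth beyond the baseline comes from spot, which matches the mix described above.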

Choosing Between EMR on EC2, EMR Serverless, and EMR on EKS

EMR offers three deployment options, selected based on workload characteristics. EMR on EC2 is the traditional cluster mode, suited for fine-grained Spark configuration customization, library installation, and workloads requiring HDFS. EMR Serverless requires no cluster provisioning and is ideal for "submit and wait" workloads like daily batches or sporadic ETL jobs. It has startup overhead of tens of seconds, making it unsuitable for low-latency interactive queries. EMR on EKS runs Spark jobs on Kubernetes clusters, chosen when teams already operate EKS and want to unify infrastructure or share resources with other container workloads. Cost-wise, EMR Serverless tends to have lower per-GB costs for short-duration jobs, while EC2 clusters achieve lower costs for long-duration jobs through spot instances combined with Savings Plans.
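
The selection criteria above can be condensed into a small decision helper. This is a deliberately simplified sketch: the attribute names and the order of the checks are my own assumptions, and a real decision would also weigh cost, team skills, and existing tooling.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_hdfs: bool = False
    needs_custom_cluster_config: bool = False  # fine-grained Spark config, custom libraries
    latency_sensitive: bool = False            # interactive, low-latency queries
    already_runs_eks: bool = False             # team already operates EKS
    long_running: bool = False                 # cluster is busy most of the day


def suggest_deployment(workload: Workload) -> str:
    """Return the deployment option that the criteria above point to."""
    if workload.needs_hdfs or workload.needs_custom_cluster_config:
        return "EMR on EC2"
    if workload.already_runs_eks:
        return "EMR on EKS"
    if workload.latency_sensitive or workload.long_running:
        return "EMR on EC2"
    return "EMR Serverless"


print(suggest_deployment(Workload()))                       # sporadic batch job
print(suggest_deployment(Workload(already_runs_eks=True)))  # shared Kubernetes platform
print(suggest_deployment(Workload(needs_hdfs=True)))        # HDFS-dependent workload
```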

Design Pitfalls and Operational Considerations

There are several common issues when running Spark on EMR. First, the Small File Problem (massive accumulation of small files) increases metadata processing overhead and inflates S3 request costs. Control output file counts with Spark's coalesce or repartition, or periodically merge small files using Iceberg's compaction feature. Second, to soften the impact of spot instance interruptions, enable spark.speculation so that slow-running tasks are speculatively re-executed on other executors. Tasks lost to an interruption are retried by Spark, and speculation keeps stragglers from holding up the rest of the stage. Third, data skew is a frequent issue: when data concentrates on specific partition keys, shuffles cause memory exhaustion or processing time imbalance. Address this with AQE's skew join optimization or salting techniques (appending random suffixes to keys for distribution). Finally, regarding S3 access through EMR's EMRFS, S3 switched to strong consistency in December 2020, so historical consistency concerns no longer apply.
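
The salting technique can be illustrated without a cluster. The sketch below simulates hash partitioning in plain Python with a deterministic CRC32 hash. It is not how Spark assigns partitions internally, and the key names, row counts, and salt count are made up for the demonstration.

```python
import random
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Assign a key to a partition with a deterministic hash (stand-in for Spark's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many rows land in each partition."""
    return Counter(partition_of(key, num_partitions) for key in keys)


def salt(key, num_salts, rng):
    """Append a random suffix so one hot key becomes num_salts distinct keys."""
    return f"{key}_{rng.randrange(num_salts)}"


rng = random.Random(42)  # fixed seed so the run is repeatable
# 9,000 rows share one hot key; 1,000 rows have unique keys.
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

plain = partition_loads(keys, num_partitions=8)
salted = partition_loads([salt(k, 16, rng) for k in keys], num_partitions=8)

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
```

In a real join, the other side must be replicated once per salt value so that every salted key still finds its match, which is the cost paid for the more even distribution.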

Summary

EMR is a service for running big data frameworks in a managed environment. Adaptive Query Execution automatically optimizes Spark job performance, while Parquet format and proper partition design improve query efficiency. Instance fleets maximize spot availability, and EMR Serverless enhances cost efficiency for sporadic batch processing. Selecting among the three deployment options (EC2, Serverless, EKS) based on workload characteristics, and proactively addressing operational pitfalls like the Small File Problem and data skew through design, are key to stable large-scale data processing.
