Running Spark Jobs Serverlessly with Amazon EMR Serverless - Big Data Processing Without Cluster Management

Learn about running Spark and Hive jobs with EMR Serverless, job run design, and cost optimization strategies.

EMR Serverless Overview

Amazon EMR Serverless is a big data processing service that runs Spark and Hive jobs without a cluster to manage. Workers scale automatically up to the application's configured maximum capacity and the account's service quota on concurrent vCPUs (400 vCPU by default). Whereas EMR on EC2 requires you to choose instance types, node counts, and Auto Scaling settings, EMR Serverless provisions resources automatically when you submit a job.

Job Runs and Cost Optimization

You create an application and then submit Spark or Hive job runs to it. Each job run sets the vCPU and memory of its workers, an application-level maximum capacity caps total usage, and you are charged only for the resources actually used. Pre-initialized workers keep a pool of workers warm in the application, reducing job startup time to seconds. You can run Spark SQL queries against S3 data lakes, referencing table definitions from the Glue Data Catalog. Combined with the Iceberg table format, you can achieve ACID transactions and time travel queries.
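As a rough illustration of the job run described above, the following Python sketch builds the request for the StartJobRun API and, optionally, sends it with boto3. The application ID, role ARN, bucket, script path, and worker sizes are hypothetical placeholders, and the Glue Data Catalog setting is one common way to point Spark SQL at Glue, not necessarily how any particular environment is configured.

```python
def build_spark_job_run(application_id, role_arn, script_uri, args=(),
                        executor_cores=4, executor_memory="16g",
                        driver_cores=2, driver_memory="8g",
                        timeout_minutes=60):
    """Return keyword arguments for the EMR Serverless StartJobRun API."""
    spark_conf = {
        # Per-job sizing of driver and executors (assumed example values).
        "spark.driver.cores": str(driver_cores),
        "spark.driver.memory": driver_memory,
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": executor_memory,
        # Use the Glue Data Catalog as the metastore for Spark SQL.
        "spark.hadoop.hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore."
            "AWSGlueDataCatalogHiveClientFactory",
    }
    params = " ".join(f"--conf {key}={value}" for key, value in spark_conf.items())
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        # Stop the job automatically if it runs longer than this.
        "executionTimeoutMinutes": timeout_minutes,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "entryPointArguments": list(args),
                "sparkSubmitParameters": params,
            }
        },
    }


def submit(request, region="us-east-1"):
    """Send the request; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("emr-serverless", region_name=region)
    return client.start_job_run(**request)["jobRunId"]


# Hypothetical identifiers, for illustration only.
request = build_spark_job_run(
    application_id="example-application-id",
    role_arn="arn:aws:iam::123456789012:role/example-emr-serverless-role",
    script_uri="s3://example-bucket/scripts/etl.py",
    args=["2024-01-01"],
)
```

Keeping the request builder separate from the API call makes the sizing and timeout choices easy to review or unit test before anything is submitted.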

Application Design and Hive Integration

You create an EMR Serverless application by selecting either the Spark or the Hive runtime. For Spark applications, you place PySpark scripts in S3 and execute them via job runs. For Hive applications, you write ETL processing in HiveQL scripts and use the Glue Data Catalog as the metastore. Configuring pre-initialized workers avoids cold starts at job launch, so jobs can begin within seconds. You can specify vCPU and memory separately for the driver and the executors of each job run, which lets you tailor resource allocation to the job's characteristics. Storing the S3 data as partitioned Parquet lets queries read only the columns and partitions they need, which improves query performance.
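The application settings discussed above might look like the following sketch, which builds a CreateApplication request with pre-initialized workers, a capacity cap, and automatic stop when idle. The application name, release label, worker counts, and sizes are assumptions chosen for illustration; adjust them to the workload.

```python
def build_spark_application(name, release_label, warm_executors=2,
                            max_vcpu=100, max_memory_gb=400,
                            idle_timeout_minutes=15):
    """Return keyword arguments for the EMR Serverless CreateApplication API."""
    return {
        "name": name,
        "releaseLabel": release_label,
        "type": "SPARK",
        # Pre-initialized workers: kept warm while the application is started.
        "initialCapacity": {
            "DRIVER": {
                "workerCount": 1,
                "workerConfiguration": {"cpu": "2vCPU", "memory": "8GB"},
            },
            "EXECUTOR": {
                "workerCount": warm_executors,
                "workerConfiguration": {"cpu": "4vCPU", "memory": "16GB"},
            },
        },
        # Upper bound on what all job runs in the application can use together.
        "maximumCapacity": {"cpu": f"{max_vcpu}vCPU", "memory": f"{max_memory_gb}GB"},
        # Stop the application after it has been idle for the given time.
        "autoStopConfiguration": {
            "enabled": True,
            "idleTimeoutMinutes": idle_timeout_minutes,
        },
    }


def create(request, region="us-east-1"):
    """Send the request; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("emr-serverless", region_name=region)
    return client.create_application(**request)["applicationId"]


# Hypothetical name and an example release label.
app_request = build_spark_application("example-etl-app", "emr-6.15.0")
```

For a Hive application the same structure applies with type "HIVE" and the worker keys "HIVE_DRIVER" and "TEZ_TASK" in place of "DRIVER" and "EXECUTOR".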

EMR Serverless Pricing

EMR Serverless uses pay-per-use pricing based on vCPU-hours and memory GB-hours. vCPU costs approximately 0.052 USD per vCPU-hour, and memory costs approximately 0.0057 USD per GB-hour; exact rates vary by region. Since there are no charges when jobs are not running, cost efficiency improves significantly over EMR on EC2 for sporadic batch processing. Pre-initialized workers incur charges even when idle, so enable or disable them based on job frequency. Set resource limits to cap the cost of runaway jobs, and use timeouts so that stuck jobs terminate automatically. As a rule of thumb, the break-even point with EMR on EC2 tends to favor Serverless when cluster utilization falls below approximately 30%.
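To make the pricing arithmetic concrete, here is a small Python sketch that uses the approximate rates quoted above. The worker sizes, counts, and durations are made-up example values, and the estimate ignores storage charges and billing granularity.

```python
VCPU_HOUR_USD = 0.052    # approximate rate per vCPU-hour from the text
GB_HOUR_USD = 0.0057     # approximate rate per memory GB-hour from the text


def worker_cost(vcpu, memory_gb, hours, count=1):
    """Estimated cost in USD of `count` identical workers running for `hours`."""
    hourly = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return count * hours * hourly


# Example job: 1 driver and 10 executors, each 4 vCPU / 16 GB, for 30 minutes.
job_cost = worker_cost(4, 16, 0.5) + worker_cost(4, 16, 0.5, count=10)

# Example idle pool: 3 pre-initialized workers of the same size kept for 24 hours.
idle_pool_cost = worker_cost(4, 16, 24, count=3)

print(f"job run: {job_cost:.2f} USD")                 # job run: 1.65 USD
print(f"idle pool per day: {idle_pool_cost:.2f} USD")  # idle pool per day: 21.54 USD
```

In this example a day of idle pre-initialized workers costs about as much as 13 of the half-hour job runs, which is why the warm pool is worth keeping only when jobs run frequently.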

Summary

EMR Serverless runs Spark and Hive jobs without cluster management. Pay-per-use pricing eliminates charges while no jobs are running, and pre-initialized workers avoid cold starts in exchange for idle charges. Using the Glue Data Catalog as a metastore enables efficient ETL processing on Parquet data in S3. As a rough guide, it is more cost-effective than EMR on EC2 in environments where cluster utilization falls below approximately 30%.