Amazon EMR Serverless

A serverless big data processing service that runs Apache Spark and Hive jobs without cluster management, charging only for the vCPU and memory time consumed during processing

Overview

Amazon EMR Serverless is a serverless big data processing service that runs Apache Spark and Hive jobs. It eliminates the need for cluster provisioning, scaling, and patching: you simply submit jobs, and the required resources are provisioned automatically. After processing completes, resources are released, and you are charged only for the vCPU-hours and memory GB-hours actually consumed. The Amazon EMR runtime for Apache Spark delivers up to 3.5x faster performance than open-source Spark at no additional cost.

Application and Job Execution Lifecycle

In EMR Serverless, you first create an application. An application is a container that defines a Spark or Hive runtime environment and specifies an EMR release version (emr-6.x or emr-7.x). Applications move through the states CREATED, STARTED, and STOPPED, and accept jobs only while STARTED. Jobs are submitted via the StartJobRun API, specifying the S3 path to a JAR file or Python script along with the entry point for Spark jobs. Once a job is submitted, EMR Serverless automatically provisions workers and begins processing. Worker startup typically takes 15-30 seconds, but configuring pre-initialized workers reduces this to near zero. After the job completes, workers are released automatically. With auto-stop enabled, the application itself transitions to STOPPED after a configurable idle period, eliminating standby costs entirely.
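The create-start-submit lifecycle above can be sketched with boto3. The application name, release label, role ARN, and S3 paths below are hypothetical placeholders, not values from the source:

```python
# Sketch of the EMR Serverless application and job lifecycle with boto3.
# All names (bucket, role ARN, script path) are hypothetical placeholders.

def build_job_run_request(application_id: str, role_arn: str,
                          script_uri: str, log_uri: str) -> dict:
    """Assemble StartJobRun parameters for a Spark job: the S3 entry
    point, execution role, and an S3 log destination."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "sparkSubmitParameters": "--conf spark.executor.cores=4",
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": log_uri}
            }
        },
    }

if __name__ == "__main__":
    import boto3
    emr = boto3.client("emr-serverless")
    # Create a Spark application (state: CREATED), then start it;
    # jobs are accepted only while the application is STARTED.
    app = emr.create_application(
        name="nightly-etl", releaseLabel="emr-7.1.0", type="SPARK")
    emr.start_application(applicationId=app["applicationId"])
    request = build_job_run_request(
        app["applicationId"],
        "arn:aws:iam::123456789012:role/EMRServerlessJobRole",
        "s3://my-bucket/scripts/etl.py",
        "s3://my-bucket/logs/",
    )
    run = emr.start_job_run(**request)
```

Keeping the request builder separate from the API calls makes the parameter shape easy to inspect and unit-test without AWS credentials.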

Worker Configuration and Cost Control

EMR Serverless workers are defined by vCPU and memory combinations that can be customized to match job characteristics. The defaults allocate 4 vCPU / 16 GB to both Spark drivers and executors, but memory-intensive workloads can be tuned to configurations such as 4 vCPU / 30 GB. Setting a maximum worker count caps resource consumption per job. The key to cost control is the application-level resource ceiling: the maximumCapacity setting specifies vCPU and memory limits, ensuring that total resources across all concurrent jobs never exceed them. Pricing is approximately $0.052 USD per vCPU-hour and $0.0057 USD per memory GB-hour (rates vary by region). While the per-unit cost is higher than EMR on EC2 on-demand pricing, the absence of idle-time cost often results in lower total costs for batch processing and ad-hoc analytics. Job execution logs are automatically written to S3, and the Spark UI can be reconstructed from these logs for post-hoc analysis.
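Because billing is a pure function of consumed vCPU-hours and GB-hours, a job's cost can be estimated up front. A minimal sketch using the approximate rates quoted above (illustrative constants only; actual rates vary by region):

```python
# Back-of-envelope cost estimate for an EMR Serverless job, using the
# approximate per-unit rates quoted in the text. Rates vary by region;
# treat these constants as illustrative, not authoritative.

VCPU_HOUR_USD = 0.052   # per vCPU-hour (approximate)
GB_HOUR_USD = 0.0057    # per memory GB-hour (approximate)

def estimate_job_cost(workers: int, vcpu_per_worker: int,
                      gb_per_worker: int, hours: float) -> float:
    """Total charge = vCPU-hours * vCPU rate + GB-hours * memory rate."""
    vcpu_hours = workers * vcpu_per_worker * hours
    gb_hours = workers * gb_per_worker * hours
    return vcpu_hours * VCPU_HOUR_USD + gb_hours * GB_HOUR_USD

# 10 default-sized workers (4 vCPU / 16 GB) running for one hour:
# 40 vCPU-h * 0.052 + 160 GB-h * 0.0057 = 2.08 + 0.912 = 2.992 USD
print(round(estimate_job_cost(10, 4, 16, 1.0), 3))  # 2.992
```

Running the same arithmetic against an always-on cluster's hourly rate is a quick way to check whether a workload's duty cycle favors the serverless model.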

S3 Data Lake Integration Patterns

The typical EMR Serverless architecture uses S3 as a data lake with Spark jobs performing ETL processing. Input data resides in S3 as Parquet, ORC, CSV, or JSON files, with Glue Data Catalog serving as the metastore. Spark SQL can directly reference Glue Data Catalog tables, with partition pruning and predicate pushdown applied automatically. Apache Iceberg table read/write support enables building robust data pipelines with ACID transactions, schema evolution, and time-travel queries. A widely adopted production pattern combines EMR Serverless with Step Functions to orchestrate daily ETL pipelines. A Step Functions task state calls the StartJobRun API, waits for job completion, then proceeds to the next step. Adding an EventBridge Scheduler for daily triggers completes a fully serverless scheduled batch processing pipeline.
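The submit-then-wait step that a Step Functions task state automates can be approximated in plain boto3 by polling GetJobRun until the run reaches a terminal state. The application and job run IDs below are hypothetical:

```python
# Sketch of the submit-and-wait pattern that a Step Functions task
# state automates for EMR Serverless. IDs are hypothetical placeholders.
import time

# Terminal states in the EMR Serverless JobRun lifecycle.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}

def is_terminal(state: str) -> bool:
    """True once the job run can no longer change state."""
    return state in TERMINAL_STATES

def wait_for_job(client, application_id: str, job_run_id: str,
                 poll_seconds: float = 30.0) -> str:
    """Poll GetJobRun until the run reaches a terminal state and
    return that state. `client` is a boto3 emr-serverless client."""
    while True:
        run = client.get_job_run(applicationId=application_id,
                                 jobRunId=job_run_id)
        state = run["jobRun"]["state"]
        if is_terminal(state):
            return state
        time.sleep(poll_seconds)

# Typical usage (requires AWS credentials):
#   import boto3
#   emr = boto3.client("emr-serverless")
#   final = wait_for_job(emr, "app-id", "job-run-id")
```

In production the same wait is better expressed declaratively in the Step Functions state machine itself, which retries and fans out to the next step without a long-running poller process.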
