Amazon EMR Serverless (Specialized, since 2022)

A serverless service for running Spark and Hive jobs without cluster management

What It Does

Amazon EMR Serverless lets you run Apache Spark and Hive jobs without managing clusters. You submit a job and the service provisions the resources it needs automatically. You are charged only for the execution time and resources the job uses, and there is no cluster sizing or scaling to configure.

Use Cases

It is used for periodic ETL batch processing, ad-hoc queries on S3 data lakes, and running Spark jobs as part of data pipelines - anywhere you want big data processing without cluster management overhead.

Everyday Analogy

Think of it like a taxi. While EMR (cluster version) is like buying and maintaining your own car, EMR Serverless is like hailing a taxi that takes you to your destination. No vehicle management needed - you just pay for the ride.

What Is EMR Serverless?

Amazon EMR Serverless is a service for running big data processing in a serverless manner. With EMR on EC2, you need to decide on instance types and node counts, but with EMR Serverless, resources are allocated automatically when you submit a job. Resources are released after the job completes, so you do not pay for idle capacity unless you choose to keep pre-initialized workers running.

Applications and Job Runs

In EMR Serverless, you first create an application and select a Spark or Hive runtime. When you submit a job run to the application, the necessary vCPU and memory are automatically provisioned. Configuring pre-initialized workers keeps a pool of workers ready in advance, reducing job startup time to seconds.
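As a rough illustration of the application step, the sketch below builds the arguments for the boto3 `emr-serverless` client's `create_application` call, optionally with a warm pool of pre-initialized Spark workers. The application name, the `emr-6.6.0` release label, the worker sizes, and the helper function names are assumptions made for this example, not values from this page. The `DRIVER` / `EXECUTOR` keys follow the Spark examples in the AWS documentation; a Hive application would need different worker type names, which this sketch does not cover.

```python
from typing import Any, Dict


def build_create_application_request(
    name: str,
    runtime: str = "SPARK",            # "SPARK" or "HIVE"
    release_label: str = "emr-6.6.0",  # assumed; use a release available in your Region
    warm_executors: int = 0,           # 0 means no pre-initialized workers
) -> Dict[str, Any]:
    """Build keyword arguments for the emr-serverless create_application call."""
    if runtime not in ("SPARK", "HIVE"):
        raise ValueError("runtime must be 'SPARK' or 'HIVE'")

    request: Dict[str, Any] = {
        "name": name,
        "type": runtime,
        "releaseLabel": release_label,
    }

    if warm_executors > 0:
        if runtime != "SPARK":
            raise ValueError("this sketch only configures a warm pool for Spark")
        # Assumed worker size; any vCPU/memory combination the service accepts works.
        worker = {"cpu": "2vCPU", "memory": "4GB"}
        request["initialCapacity"] = {
            "DRIVER": {"workerCount": 1, "workerConfiguration": worker},
            "EXECUTOR": {"workerCount": warm_executors, "workerConfiguration": worker},
        }
    return request


def create_application(request: Dict[str, Any]) -> str:
    """Send the request to AWS and return the new application ID."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("emr-serverless")
    response = client.create_application(**request)
    return response["applicationId"]
```

Keeping the request builder separate from the network call makes it possible to inspect or unit-test the configuration before anything is created. Note that a warm pool trades a faster start for workers that are billed while they wait.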

Getting Started

Select 'Create serverless application' in the EMR console and specify the runtime (Spark / Hive). Once the application is created, submit a job run that points to scripts and data on S3. Integrating with the AWS Glue Data Catalog lets your jobs share table metadata and query those tables.
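The same job submission can also be done programmatically. The following is a minimal sketch, assuming the boto3 `emr-serverless` client and its `start_job_run` call. The helper function names are hypothetical, and any application ID, role ARN, or S3 path you pass in is your own. The `--conf` line is the Spark setting AWS documents for using the Glue Data Catalog as the metastore; whether you need to set it explicitly depends on your release and defaults.

```python
from typing import Any, Dict, List, Optional

# Spark setting that points the Hive metastore client at the Glue Data Catalog.
GLUE_CATALOG_CONF = (
    "--conf spark.hadoop.hive.metastore.client.factory.class="
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_spark_job_run_request(
    application_id: str,
    execution_role_arn: str,
    script_s3_uri: str,
    arguments: Optional[List[str]] = None,
    use_glue_catalog: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for the emr-serverless start_job_run call."""
    if not script_s3_uri.startswith("s3://"):
        raise ValueError("the job script must be an s3:// URI")

    spark_submit: Dict[str, Any] = {
        "entryPoint": script_s3_uri,
        "entryPointArguments": list(arguments or []),
    }
    if use_glue_catalog:
        spark_submit["sparkSubmitParameters"] = GLUE_CATALOG_CONF

    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {"sparkSubmit": spark_submit},
    }


def submit_job_run(request: Dict[str, Any]) -> str:
    """Submit the job run to AWS and return its job run ID."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("emr-serverless")
    response = client.start_job_run(**request)
    return response["jobRunId"]
```

The execution role passed here is the IAM role the job assumes at run time, so it needs read access to the script and data on S3 and, if the catalog is used, to the relevant Glue tables.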

Things to Watch Out For

  • Pay-per-use pricing based on execution time and resource usage makes it cost-effective for short batch jobs
  • For long-running interactive workloads, EMR on EC2 may be more cost-effective
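The trade-off in the two points above can be made concrete with a small back-of-the-envelope calculation. The sketch below assumes a simplified pricing model in which serverless cost scales with vCPU-hours and GB-hours, and an always-on cluster has a flat hourly cost. All rates in the usage example are hypothetical placeholders, not AWS prices, and the model ignores other charges such as storage.

```python
def serverless_job_cost(
    vcpus: float,
    memory_gb: float,
    runtime_hours: float,
    vcpu_hour_rate: float,
    gb_hour_rate: float,
) -> float:
    """Cost of one job under the simplified per-use model."""
    hourly = vcpus * vcpu_hour_rate + memory_gb * gb_hour_rate
    return runtime_hours * hourly


def always_on_cluster_cost(cluster_hourly_rate: float, period_hours: float) -> float:
    """Cost of a cluster that stays up for the whole period, busy or idle."""
    return cluster_hourly_rate * period_hours


def break_even_busy_hours(
    vcpus: float,
    memory_gb: float,
    vcpu_hour_rate: float,
    gb_hour_rate: float,
    cluster_hourly_rate: float,
    period_hours: float,
) -> float:
    """Busy hours per period at which both options cost the same.

    Below this figure the pay-per-use model is cheaper; above it, the
    always-on cluster is cheaper (under the assumptions of this sketch).
    """
    serverless_hourly = vcpus * vcpu_hour_rate + memory_gb * gb_hour_rate
    return always_on_cluster_cost(cluster_hourly_rate, period_hours) / serverless_hourly


if __name__ == "__main__":
    # Hypothetical rates and sizes, chosen only to illustrate the arithmetic.
    hours = break_even_busy_hours(
        vcpus=16,
        memory_gb=64,
        vcpu_hour_rate=0.05,
        gb_hour_rate=0.005,
        cluster_hourly_rate=0.5,
        period_hours=24,
    )
    print(f"Break-even at about {hours:.1f} busy hours per day")
```

Substituting the current published rates for your Region and your actual job sizes gives a first estimate of where a workload sits relative to the break-even point.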