Amazon EMR Serverless (Specialized, since 2022)
A serverless service for running Spark and Hive jobs without cluster management
What It Does
Amazon EMR Serverless lets you run Apache Spark and Apache Hive jobs without managing clusters. You submit a job and the service provisions the resources it needs automatically. You are charged for the vCPU, memory, and storage that workers consume while they run, not for a fixed cluster. No cluster sizing or scaling configuration is needed.
Use Cases
Typical uses are periodic ETL batch processing, ad-hoc queries on S3 data lakes, and Spark jobs that run as one step of a data pipeline. It fits anywhere you want big data processing without cluster management overhead.
Everyday Analogy
Think of it like a taxi. EMR on EC2 (the cluster version) is like buying and maintaining your own car, while EMR Serverless is like hailing a taxi that takes you to your destination. There is no vehicle to manage, and you pay only for the ride.
What Is EMR Serverless?
Amazon EMR Serverless is a service for running big data processing in a serverless manner. With EMR on EC2, you must choose instance types and node counts yourself, but with EMR Serverless, resources are allocated automatically when you submit a job. Resources are released after the job completes, so by default you do not pay for idle capacity.
Applications and Job Runs
In EMR Serverless, you first create an application and select a Spark or Hive runtime. When you submit a job run to the application, the necessary vCPU and memory are automatically provisioned. Configuring pre-initialized workers pools workers in advance, reducing job startup time to seconds.
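The application-then-job-run flow can be sketched in Python as follows. This is a minimal illustration, not an official recipe: the helper names are hypothetical, the `client` argument is assumed to be a boto3 `emr-serverless` client (or anything exposing the same `create_application` and `start_job_run` methods), and the release label, role ARN, and S3 paths are placeholders. It also assumes the application keeps the default auto-start behavior, so no explicit start call is made.

```python
from typing import Any, Dict, List


def create_application_request(name: str, release_label: str,
                               runtime: str = "SPARK") -> Dict[str, Any]:
    """Build keyword arguments for creating an application.

    `runtime` selects the engine ("SPARK" or "HIVE").
    """
    return {"name": name, "releaseLabel": release_label, "type": runtime}


def start_job_run_request(application_id: str, execution_role_arn: str,
                          entry_point: str,
                          arguments: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for submitting a Spark job run."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,  # script stored on S3
                "entryPointArguments": list(arguments),
            }
        },
    }


def create_and_submit(client: Any, name: str, release_label: str,
                      execution_role_arn: str, entry_point: str,
                      arguments: List[str]) -> str:
    """Create an application, submit one job run, and return the job run ID."""
    app = client.create_application(
        **create_application_request(name, release_label)
    )
    application_id = app["applicationId"]
    # Assumption: auto-start is left at its default, so the application
    # starts when the first job run is submitted.
    run = client.start_job_run(
        **start_job_run_request(application_id, execution_role_arn,
                                entry_point, arguments)
    )
    return run["jobRunId"]
```

With real credentials, the client would typically come from `boto3.client("emr-serverless")`. Keeping the request builders separate from the call makes the payloads easy to inspect or unit test without touching AWS.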
Getting Started
Select 'Create serverless application' in the EMR console and specify the runtime (Spark / Hive). Once the application is created, submit a job run that points to a script and data stored on S3. Integrating with the AWS Glue Data Catalog lets your jobs share table metadata and query those tables directly.
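As a rough sketch of what a job run pointing at S3 might look like, the Python helper below builds a Spark job driver from a script location and input/output paths, and optionally adds the Spark setting commonly used to point the Hive metastore client at the Glue Data Catalog. The function name, the argument order passed to the script, and all bucket paths are assumptions made for illustration. Depending on the release, the Glue setting may already be the default.

```python
from typing import Any, Dict

# Metastore client factory class used to back Spark's Hive catalog with
# the AWS Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def spark_job_driver(script_uri: str, input_uri: str, output_uri: str,
                     use_glue_catalog: bool = True) -> Dict[str, Any]:
    """Build a `jobDriver` payload for a Spark job whose script and data
    live on S3 (hypothetical helper)."""
    for uri in (script_uri, input_uri, output_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")

    spark_submit: Dict[str, Any] = {
        "entryPoint": script_uri,
        # Assumption: the script reads its input and output locations
        # from its first two command-line arguments.
        "entryPointArguments": [input_uri, output_uri],
    }
    if use_glue_catalog:
        spark_submit["sparkSubmitParameters"] = (
            "--conf spark.hadoop.hive.metastore.client.factory.class="
            + GLUE_FACTORY
        )
    return {"sparkSubmit": spark_submit}
```

The returned dictionary is meant to be passed as the `jobDriver` argument when starting a job run. Validating the URIs up front is a design choice that surfaces typos before any resources are provisioned.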
Things to Watch Out For
- Pay-per-use pricing based on execution time and resource usage makes it cost-effective for short batch jobs
- For long-running interactive workloads, EMR on EC2 may be more cost-effective
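The trade-off in the two points above can be made concrete with a little arithmetic. The Python sketch below compares pay-per-use cost with an always-on cluster and computes the number of busy hours at which the two are equal. Every rate in the usage example is a made-up placeholder, not actual AWS pricing, and the model ignores storage charges, billing granularity, and any pre-initialized capacity.

```python
def serverless_hourly_cost(vcpus: float, memory_gb: float,
                           vcpu_hour_rate: float,
                           gb_hour_rate: float) -> float:
    """Cost of one hour of job execution under pay-per-use pricing."""
    return vcpus * vcpu_hour_rate + memory_gb * gb_hour_rate


def serverless_cost(busy_hours: float, hourly_cost: float) -> float:
    """Pay-per-use: you pay only for the hours jobs actually run."""
    return busy_hours * hourly_cost


def always_on_cost(provisioned_hours: float,
                   cluster_hourly_rate: float) -> float:
    """Always-on cluster: you pay for every provisioned hour, busy or idle."""
    return provisioned_hours * cluster_hourly_rate


def break_even_busy_hours(provisioned_hours: float,
                          cluster_hourly_rate: float,
                          hourly_cost: float) -> float:
    """Busy hours at which pay-per-use cost equals the always-on cost."""
    return always_on_cost(provisioned_hours, cluster_hourly_rate) / hourly_cost


# Hypothetical rates, for illustration only (NOT real prices).
job_rate = serverless_hourly_cost(vcpus=16, memory_gb=64,
                                  vcpu_hour_rate=0.05, gb_hour_rate=0.005)
threshold = break_even_busy_hours(provisioned_hours=24,
                                  cluster_hourly_rate=0.70,
                                  hourly_cost=job_rate)
print(f"Serverless is cheaper below about {threshold:.1f} busy hours per day")
```

Under this simplified model, workloads that run fewer hours per day than the threshold favor pay-per-use, while workloads that keep resources busy most of the day favor a provisioned cluster. Substitute current prices for your region before drawing conclusions.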