Amazon EMR (since 2009)
A managed service for running big data frameworks like Apache Spark, Hive, and Presto
What It Does
Amazon EMR (Elastic MapReduce) runs open-source big data frameworks like Apache Spark, Hive, Presto, and HBase on managed clusters. It processes petabyte-scale data by automatically provisioning clusters of EC2 instances. AWS manages cluster setup, configuration, and tuning.
Use Cases
It is used for large-scale log analysis, ETL (extract, transform, load) processing, machine learning data preprocessing, interactive queries on data lakes, genome analysis, and other big data workloads.
Everyday Analogy
Think of it like a large factory production line. To process massive amounts of raw materials (data), it automatically lines up and runs the necessary number of machines (nodes). After processing is complete, the machines are put away, so you only pay for what you used.
What Is EMR?
Amazon EMR is a managed cluster service for big data processing. You can launch a cluster and run jobs in minutes without the hassle of installing and configuring frameworks like Apache Spark or Hive yourself. A common architecture uses S3 as a data lake with compute and storage separated.
Cluster Configuration
EMR clusters consist of three node types: primary, core, and task. The primary node manages the cluster and coordinates jobs. Core nodes handle data storage and processing. Task nodes handle processing only and can use Spot Instances to reduce costs. Auto Scaling can automatically adjust node counts based on load.
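The three node roles above map directly to the InstanceGroups parameter of the EMR API. The sketch below (instance types, counts, and group names are illustrative placeholders) builds the structure you would pass to boto3's `emr.run_job_flow()`, with the task group marked as Spot:

```python
# Sketch: EMR instance-group layout mirroring the three node types.
# Instance types and counts here are illustrative, not recommendations.
def build_instance_groups(core_count=2, task_count=4):
    """Return an InstanceGroups list for emr.run_job_flow()."""
    return [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",   # the API still calls the primary node MASTER
            "InstanceType": "m5.xlarge",
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",     # stores HDFS data and runs tasks
            "InstanceType": "m5.xlarge",
            "InstanceCount": core_count,
        },
        {
            "Name": "Task",
            "InstanceRole": "TASK",     # compute only, so safe to run on Spot
            "InstanceType": "m5.xlarge",
            "InstanceCount": task_count,
            "Market": "SPOT",           # interruptible but significantly cheaper
        },
    ]
```

Because task nodes hold no HDFS data, losing a Spot task node degrades throughput without losing data, which is why Spot is applied only to that group here.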
EMR on EKS and EMR Serverless
In addition to EC2-based clusters, EMR offers EMR on EKS for running Spark jobs on Kubernetes, and EMR Serverless, which eliminates cluster management entirely. EMR on EKS suits teams integrating big data processing into an existing Kubernetes environment. EMR Serverless automatically provisions resources when you submit a job, offering the simplest experience.
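With EMR Serverless, a job submission is just an application ID, an execution role, and a job driver. The sketch below assembles the keyword arguments for boto3's `emr-serverless` `start_job_run()` call; the application ID, role ARN, and S3 script path are hypothetical placeholders:

```python
# Sketch: parameters for an EMR Serverless Spark job submission.
# app_id, role_arn, and script_uri are placeholders you would replace.
def build_serverless_job(app_id, role_arn, script_uri):
    """Return kwargs for emr_serverless.start_job_run()."""
    return {
        "applicationId": app_id,            # pre-created EMR Serverless application
        "executionRoleArn": role_arn,       # IAM role the job assumes
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,   # e.g. a PySpark script on S3
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }
```

Note there is no instance type or node count anywhere in the request: capacity is provisioned per job, which is the core of the "simplest experience" claim above.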
Getting Started
Select 'Create cluster' in the EMR console and specify applications (Spark, Hive, etc.), instance types, and node counts. The basic workflow is submitting Spark jobs against data in S3 and outputting results to S3. Setting clusters to auto-terminate after processing prevents unnecessary costs.
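The S3-in, S3-out workflow described above is typically expressed as a "step" submitted to the cluster. The sketch below builds the step structure for boto3's `emr.add_job_flow_steps()`; the script and bucket paths are hypothetical:

```python
# Sketch: a Spark step for emr.add_job_flow_steps(), reading from and
# writing to S3. All S3 URIs below are illustrative placeholders.
def build_spark_step(script_uri, input_uri, output_uri):
    """Return a single Steps entry that runs spark-submit on the cluster."""
    return {
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",  # fail fast instead of idling
        "HadoopJarStep": {
            "Jar": "command-runner.jar",         # EMR's built-in command wrapper
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,                      # PySpark script on S3
                input_uri,                       # job argument: input prefix
                output_uri,                      # job argument: output prefix
            ],
        },
    }
```

Setting `ActionOnFailure` to `TERMINATE_CLUSTER` pairs naturally with the auto-terminate advice: a failed job should not leave an idle cluster billing by the hour.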
Things to Watch Out For
- Leaving clusters running incurs ongoing instance charges, so auto-termination after job completion is recommended
- Using Spot Instances for task nodes can significantly reduce costs, but job design must account for interruptions
- For simple use cases, EMR Serverless may be more cost-effective since it eliminates cluster management
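The first caution above can be enforced in the cluster definition itself. A minimal sketch, assuming a boto3 `run_job_flow()` call, of the two settings that keep a forgotten cluster from running indefinitely:

```python
# Sketch: cost-guard settings merged into emr.run_job_flow() kwargs.
# The idle timeout value is an example, not a recommendation.
def cost_guard_settings(idle_seconds=3600):
    """Return run_job_flow kwargs that force the cluster to shut itself down."""
    return {
        # Terminate if the cluster sits idle (no running steps) this long.
        "AutoTerminationPolicy": {"IdleTimeout": idle_seconds},
        "Instances": {
            # Terminate as soon as the last submitted step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }
```

`KeepJobFlowAliveWhenNoSteps: False` is the stricter of the two: the cluster exists only for the lifetime of its step queue, which fits batch ETL; the idle-timeout policy is the gentler option for interactive clusters.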