Amazon EMR

A managed big data platform that runs open-source frameworks like Apache Spark, Hive, and Presto to process and analyze petabyte-scale data

Overview

Amazon EMR (Elastic MapReduce) is a fully managed big data processing service that runs open-source frameworks including Apache Spark, Hadoop, Hive, Presto, and HBase. It offers three deployment options: EMR on EC2, the traditional model of building clusters on EC2 instances; EMR on EKS, for running on existing EKS clusters; and EMR Serverless, for serverless execution. By leveraging S3 as a data lake and separating compute from storage, you can terminate clusters when processing is complete to minimize costs.
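To make the compute/storage separation concrete, here is a minimal boto3 sketch of a transient EMR-on-EC2 cluster that runs a Spark step from S3 and terminates itself when the step completes. The bucket path, release label, instance types, and IAM role names are placeholders for your own environment (the roles shown are the EMR defaults, which must already exist in the account):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Transient cluster: runs the submitted step, then shuts down on its own,
# so you only pay for compute while the job is active.
response = emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-7.1.0",                 # placeholder release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the last step
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # placeholder path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```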

Three Deployment Models and Selection Criteria

EMR on EC2 is the most flexible option, offering fine-grained control over instance types, cluster configuration, and bootstrap actions. It's ideal when you need specific Spark or Hive versions, custom AMIs, or local storage via HDFS. EMR on EKS runs Spark jobs on existing EKS clusters, leveraging Kubernetes resource management and multi-tenancy capabilities. It suits organizations where multiple teams share a single cluster while isolating jobs through namespaces. EMR Serverless, which became generally available in 2022, is the newest option and eliminates all cluster management. You simply submit jobs and the required resources are automatically provisioned and released upon completion. For batch ETL and ad-hoc analytics that don't require always-on clusters, it delivers the best cost efficiency. As a rule of thumb: choose EC2 when you need fine-tuned configuration, EKS when you need Kubernetes integration, and Serverless when you want to minimize operational overhead.
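The Serverless workflow is the simplest of the three to automate. The sketch below submits a Spark job with boto3; the application ID, execution role ARN, and S3 entry point are placeholders, and it assumes an EMR Serverless Spark application has already been created:

```python
import boto3

serverless = boto3.client("emr-serverless", region_name="us-east-1")

# Submit a Spark job to an existing EMR Serverless application.
# Capacity is provisioned on demand and released when the run finishes.
run = serverless.start_job_run(
    applicationId="00fabcdexample",  # placeholder: pre-created Spark application
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",      # placeholder path
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print(run["jobRunId"])
```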

Spark Job Optimization and Cost Control

When running Spark on EMR, instance configuration and partition design have the greatest impact on performance and cost. The standard approach is to assign memory-optimized r-series instances to driver nodes and cost-optimized m-series instances to worker nodes. Using Spot Instances for worker nodes can reduce costs by 60-90% compared to On-Demand, but you need to specify multiple instance types via Instance Fleets to mitigate interruption risk. Spark's shuffle partition count (spark.sql.shuffle.partitions) should be tuned from the default of 200 based on data volume; targeting 128-256 MB per partition strikes a good balance between memory efficiency and task parallelism. For reads from S3, it is critical to use columnar formats like Parquet or ORC and to partition data by date or category so that partition pruning can take effect, as in the sketch below. EMR Runtime for Apache Spark delivers up to 3.5x faster performance than open-source Spark at no additional cost.
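The shuffle sizing rule translates into a simple calculation. The following PySpark sketch sets the partition count from an estimated shuffle volume and reads a date-partitioned Parquet dataset so that only one partition is scanned; the bucket layout, data-volume estimate, and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuned-etl").getOrCreate()

# Size shuffle partitions toward the 128-256 MB sweet spot instead of the
# default of 200, from a rough estimate of the shuffled data volume.
shuffle_bytes = 512 * 1024**3            # ~512 GB, estimated from input size
target_bytes = 200 * 1024**2             # ~200 MB per partition
spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_bytes // target_bytes))

# Columnar format + partition layout lets Spark prune whole S3 prefixes:
# only the dt=2024-06-01 partition under events/ is actually read.
events = (
    spark.read.parquet("s3://my-bucket/events/")   # partitioned by dt=YYYY-MM-DD
    .where("dt = '2024-06-01'")
)
events.groupBy("category").count().show()
```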

Data Lake Architecture and When to Choose Glue Instead

EMR and Glue can both execute Spark-based ETL processing, but they serve different roles. Glue is serverless with low operational overhead and integrates the Glue Data Catalog for metadata management and crawlers for automatic schema detection. It's well-suited for routine ETL pipelines and use cases where Data Catalog integration is important. EMR, on the other hand, supports diverse frameworks beyond Spark - including Hive, Presto, HBase, and Flink - and offers greater flexibility in cluster configuration, making it better for complex analytical processing and interactive queries. In practice, a common pattern is to handle daily routine ETL with Glue while using EMR for ad-hoc large-scale analytics and machine learning preprocessing. For building a data lake, the standard architecture is to store raw data in S3, detect schemas with Glue crawlers and register them in the Data Catalog, then query from EMR or Athena. Using table formats like Apache Iceberg or Delta Lake on EMR enables ACID transactions and time-travel queries.
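To close the loop on the table-format point, here is a hedged PySpark sketch of Apache Iceberg on EMR using the Glue Data Catalog as the metastore. The catalog name ("glue"), warehouse path, and table names are assumptions, and the time-travel SQL syntax requires a recent Spark/Iceberg combination; EMR bundles the Iceberg libraries, so no extra packages are needed there:

```python
from pyspark.sql import SparkSession

# Spark session wired to an Iceberg catalog backed by the Glue Data Catalog.
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS glue.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.analytics.orders (
        order_id BIGINT, status STRING
    ) USING iceberg
""")

# Stage some incoming rows, then apply an ACID upsert with MERGE INTO.
spark.createDataFrame([(1, "shipped")], ["order_id", "status"]) \
    .createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO glue.analytics.orders t
    USING updates u ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET t.status = u.status
    WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (u.order_id, u.status)
""")

# Time travel: query the table as of an earlier timestamp (Spark 3.3+ syntax).
spark.sql(
    "SELECT * FROM glue.analytics.orders TIMESTAMP AS OF '2024-06-01 00:00:00'"
).show()
```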
