Big Data Processing with Amazon EMR - Spark and Hive Execution Environments

Learn how to run Spark jobs and Hive queries on EMR clusters, choose between EMR on EC2, EMR Serverless, and EMR on EKS deployment options, and optimize costs with managed scaling.

EMR Cluster Configuration

An EMR cluster can comprise up to hundreds of nodes across three roles: a master (primary) node for cluster management and the YARN ResourceManager, core nodes for HDFS data storage and computation, and task nodes for computation only. Scaling down core nodes risks data loss because they hold HDFS blocks, while task nodes hold no data and can be scaled freely. When S3 is the primary storage, limit core-node HDFS to temporary intermediate data and reduce costs by running task nodes on spot instances. EC2 instance fleets let you specify multiple instance types per fleet, which improves the chance of acquiring spot capacity.
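The node-role split above can be sketched as an instance-fleets request body. This is a minimal illustration, not a complete RunJobFlow call; the instance types and capacity numbers are hypothetical placeholders.

```python
# Sketch of the InstanceFleets section of an EMR RunJobFlow request.
# Core nodes stay on-demand (they hold HDFS); the task fleet uses spot
# across several instance types to improve spot availability.
# All types and capacities below are illustrative, not recommendations.

def build_instance_fleets():
    """Return the InstanceFleets list for a RunJobFlow request."""
    return [
        {
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "m6g.xlarge"}],
        },
        {
            # Core nodes hold HDFS data, so keep them on-demand.
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": 2,
            "InstanceTypeConfigs": [{"InstanceType": "m6g.xlarge"}],
        },
        {
            # Task nodes hold no HDFS data; run them on spot, and list
            # multiple types so the fleet can fill from whichever pool
            # has capacity.
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": 8,
            "InstanceTypeConfigs": [
                {"InstanceType": t}
                for t in ("m6g.xlarge", "m5.xlarge", "r6g.xlarge")
            ],
        },
    ]
```

In a real deployment this list would be passed as the `InstanceFleets` parameter of `boto3` `emr.run_job_flow`.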

Running Spark and Hive

With Spark on EMR, you submit jobs using the spark-submit command or the EMR Steps API. EMRFS is EMR's implementation of the Hadoop file-system interface over S3, optimized for reading and writing S3 objects; its older "consistent view" feature is no longer needed now that S3 provides strong read-after-write consistency. Enabling Spark's dynamic resource allocation automatically adjusts the number of executors based on job load. With Hive on EMR, you can configure the AWS Glue Data Catalog as an external metastore, sharing table definitions with Athena and Redshift Spectrum. EMR Serverless eliminates cluster provisioning entirely: you specify resources at the application level and run jobs directly.
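A step submitted through the EMR Steps API might look like the following sketch. The step name and S3 URI are placeholders; the `--conf` flags enable dynamic resource allocation as described above.

```python
# Sketch of one Steps entry for boto3 emr.add_job_flow_steps.
# On EMR, command-runner.jar is the standard way to invoke spark-submit
# as a step. The job name and script URI are hypothetical.

def build_spark_step(script_uri="s3://my-bucket/jobs/etl.py"):
    """Return a step definition that runs spark-submit on the cluster."""
    return {
        "Name": "daily-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                # Let executor count track job load.
                "--conf", "spark.dynamicAllocation.enabled=true",
                # External shuffle service so executors can be removed
                # without losing shuffle data.
                "--conf", "spark.shuffle.service.enabled=true",
                script_uri,
            ],
        },
    }
```

The returned dict would be passed in the `Steps` list of `emr.add_job_flow_steps` (or `RunJobFlow` for a transient cluster).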

EMR on EKS and Managed Scaling

EMR on EKS runs Spark jobs on existing EKS clusters, reusing Kubernetes resource management and scheduling. You create a virtual cluster mapped to an EKS namespace and submit jobs via the StartJobRun API. On EMR on EC2, managed scaling automatically adds and removes core and task nodes based on job load: you set minimum and maximum capacity, and scaling decisions are driven by YARN metrics such as memory utilization. EMR Studio is a browser-based IDE that connects Jupyter notebooks to EMR clusters for interactive analysis.
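The min/max bounds described above can be expressed as a managed scaling policy. This is a sketch of the request body for `boto3` `emr.put_managed_scaling_policy`; the capacity numbers are hypothetical.

```python
# Sketch of an EMR managed scaling policy. Scaling stays within the
# min/max bounds; capping on-demand and core capacity steers growth
# beyond those caps onto spot task nodes. All unit counts are
# illustrative placeholders.

def build_managed_scaling_policy(min_units=2, max_units=20):
    """Return a ManagedScalingPolicy for put_managed_scaling_policy."""
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Growth beyond these caps must come from spot task nodes.
            "MaximumOnDemandCapacityUnits": 4,
            "MaximumCoreCapacityUnits": 4,
        }
    }
```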

EMR Cost Optimization

EMR costs consist of EC2 instance charges plus an EMR service charge (roughly 25% on top of EC2 pricing for many instance types). The recommended configuration uses spot instances for task nodes while securing HDFS data safety with on-demand core nodes. With an EMRFS architecture using S3 as primary storage, you can minimize core-node HDFS and limit the impact of spot interruptions. Transient clusters that launch only for job execution and auto-terminate on completion eliminate idle-time costs. Graviton instances (m6g, r6g) are roughly 20% cheaper than equivalent x86 instances and are well suited to Spark job execution.
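The transient-cluster pattern can be sketched as a RunJobFlow request. The cluster name, release label, and roles below are placeholders; the key setting is `KeepJobFlowAliveWhenNoSteps: False`, which makes the cluster terminate once its steps finish.

```python
# Sketch of a transient-cluster RunJobFlow request body. The cluster
# runs its submitted steps and then terminates itself, so no idle time
# is billed. Names, release label, and IAM roles are illustrative.

def build_transient_cluster_request():
    """Return keyword arguments for boto3 emr.run_job_flow."""
    return {
        "Name": "nightly-batch",
        "ReleaseLabel": "emr-7.1.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m6g.xlarge",
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                {"InstanceRole": "CORE", "InstanceType": "m6g.xlarge",
                 "InstanceCount": 2, "Market": "ON_DEMAND"},
            ],
            # Auto-terminate when all steps complete -> no idle cost.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

In practice the job's steps would be included in the same request (a `Steps` list), so the cluster launches, runs the work, and shuts down in one call.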

Summary

EMR provides a managed execution environment for big data frameworks such as Spark and Hive. An EMRFS architecture with S3 as primary storage mitigates spot instance interruption risks, while managed scaling enables automatic node count adjustment based on job load. Integration with existing Kubernetes environments is also possible through EMR on EKS.