Amazon MWAA

A fully managed Apache Airflow service that orchestrates data pipelines using DAGs

Overview

Amazon MWAA (Managed Workflows for Apache Airflow) is a fully managed workflow orchestration service built on Apache Airflow. It eliminates the need to manage Airflow's scheduler, worker, and web server infrastructure, letting you focus on DAG (Directed Acyclic Graph) development and data pipeline operations. In addition to native integration with AWS services like Glue, EMR, Lambda, ECS, Redshift, and Athena, it supports third-party service integration through Airflow's provider packages.

Environment Classes and Worker Scaling

MWAA environments are available in five classes: mw1.small, mw1.medium, mw1.large, mw1.xlarge, and mw1.2xlarge. The class determines the CPU and memory allocated to the scheduler and web server, and directly affects how many DAGs and tasks can run concurrently. For small-scale pipelines (fewer than 50 DAGs), mw1.small is sufficient, but running hundreds of DAGs in parallel requires mw1.large or above. Workers autoscale: you set minimum and maximum worker counts, and the environment scales out and in based on the number of queued tasks. The allowed worker count ranges from 1 to 25, which keeps queue wait times low even during bursts of task submissions. However, worker scale-out has a lead time of several minutes, so for predictable large batch jobs, raising the minimum worker count in advance is an effective practice. The environment class can be changed with an update operation, but the update incurs tens of minutes of downtime, so schedule it within a maintenance window.
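The pre-scaling practice above can be sketched with boto3's MWAA `update_environment` API. Everything here is a hypothetical helper: `validate_worker_range`, `raise_worker_floor`, and the environment name are illustrative, not part of any official tooling.

```python
def validate_worker_range(min_workers: int, max_workers: int) -> None:
    """MWAA allows between 1 and 25 workers; min must not exceed max."""
    if not (1 <= min_workers <= max_workers <= 25):
        raise ValueError("worker counts must satisfy 1 <= min <= max <= 25")


def raise_worker_floor(env_name: str, min_workers: int, max_workers: int = 25) -> None:
    """Raise the worker floor ahead of a known batch window (hypothetical helper)."""
    import boto3  # imported lazily so the validation helper stays dependency-free

    validate_worker_range(min_workers, max_workers)
    boto3.client("mwaa").update_environment(
        Name=env_name, MinWorkers=min_workers, MaxWorkers=max_workers
    )
```

Calling something like `raise_worker_floor("my-mwaa-env", 10)` shortly before a large batch run avoids paying the multi-minute scale-out lead time inside the batch window itself.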

DAG Deployment and S3 Sync Mechanics

MWAA reads DAG files from a designated prefix in an S3 bucket. Python files uploaded there are synced to the scheduler at roughly 30-second intervals, then parsed and registered as DAGs automatically. Updating a DAG is likewise just an S3 upload, so a deployment pipeline only needs to copy files to S3. A practical workflow runs DAG syntax checks and unit tests in CodePipeline and CodeBuild, then deploys to S3 once the tests pass. Placing a requirements.txt in S3 makes MWAA run pip install during an environment update, installing additional Python packages. Because requirements.txt changes require an environment update, they do not take effect immediately the way DAG file updates do. When DAGs use external libraries, pin their versions in requirements.txt and verify functionality after the environment update completes. A startup.sh script lets you define custom initialization steps that run when the environment starts.
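The S3 deployment step above can be sketched as follows. `dag_s3_key` and `deploy_dag` are hypothetical helpers, and the bucket and prefix names a caller would pass are placeholders; the only real API used is the standard S3 `upload_file` call.

```python
from pathlib import Path


def dag_s3_key(dags_prefix: str, local_path: str) -> str:
    """Map a local DAG file to its key under the environment's DAGs prefix."""
    return f"{dags_prefix.rstrip('/')}/{Path(local_path).name}"


def deploy_dag(bucket: str, dags_prefix: str, local_path: str) -> str:
    """Upload a DAG file; the scheduler picks it up on the next sync (~30 s)."""
    import boto3  # imported lazily so the key helper stays dependency-free

    key = dag_s3_key(dags_prefix, local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

In a CodeBuild deploy stage, this would run only after syntax checks and unit tests have passed.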

Plugin Management and VPC Network Design

MWAA custom plugins are uploaded to S3 as plugins.zip. Plugins can include custom Airflow operators, sensors, hooks, and web UI extensions. Packaging shared database connection logic or custom notification handling as plugins enables reuse across multiple DAGs. Plugin updates require an environment update, so as with requirements.txt, apply them during a maintenance window. MWAA environments must be created inside a VPC and require at least two private subnets. The web server supports two access modes: public, which exposes the Airflow UI over the internet, and private, which restricts access to within the VPC. Worker communication with AWS services (S3, Glue, Redshift, etc.) requires VPC endpoints or a NAT gateway. From a cost optimization perspective, configuring VPC endpoints for frequently accessed services such as S3 and CloudWatch Logs reduces NAT gateway data processing charges - an effective design choice.
