AWS Glue

A serverless ETL service that automates data extraction, transformation, and loading

What It Does

AWS Glue is a serverless service that automates the process of extracting data from various sources, transforming it into analysis-ready formats, and loading it into data warehouses or data lakes (ETL). No server setup or management is needed, and its data catalog feature automatically discovers and manages data locations and schemas. It includes an Apache Spark-based processing engine for fast parallel execution of large-scale data transformations.

Use Cases

Typical use cases include:

  • Converting raw data accumulated in S3 and integrating it into data lakes
  • Building data integration pipelines from multiple databases and applications
  • Cleansing and normalizing data before analysis with Redshift or Athena
  • Converting log data to Parquet format to optimize query costs

In short, Glue is used wherever data preprocessing and integration are needed.

Everyday Analogy

Think of it like a food processing factory. Vegetables and fruits (raw data) arriving from farms (data sources) come in all shapes and sizes. The processing factory (Glue) automatically washes, cuts, and packages them so they're ready to be neatly arranged on supermarket shelves (data warehouse). You only pay for the time the factory equipment is in use, with zero maintenance costs when idle.

What Is Glue?

AWS Glue is a serverless data integration service launched in 2017. ETL stands for Extract, Transform, Load - the process of collecting data scattered across different systems, formatting it for analysis, and storing it. Traditional ETL tools required significant effort for server setup, operations, and job scheduling, but Glue handles all of this, letting developers focus on data transformation logic.

Data Catalog

One of Glue's core features is the Data Catalog. Crawlers automatically scan data sources like S3, RDS, and Redshift, registering table definitions and schema information in the catalog. This catalog is also referenced by other AWS analytics services like Athena, Redshift Spectrum, and EMR, functioning as a centralized metadata store for data locations and structures. When data is added or changed, crawlers detect the differences and automatically update the catalog.
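To make the catalog concrete, here is a minimal sketch of the kind of table entry a crawler registers. The database, table, and bucket names are hypothetical, and real entries carry many more fields (partition keys, SerDe info, classification):

```python
# Simplified sketch of a Data Catalog table entry as a crawler might
# register it (hypothetical names; real entries have more fields).
catalog_table = {
    "DatabaseName": "sales_db",
    "Name": "orders",
    "StorageDescriptor": {
        "Location": "s3://example-bucket/raw/orders/",
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "amount", "Type": "double"},
            {"Name": "order_date", "Type": "date"},
        ],
    },
}

def column_names(table: dict) -> list[str]:
    """Return the column names recorded in a catalog table entry."""
    return [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
```

Athena, Redshift Spectrum, and EMR query this same metadata, which is why one crawler run can make a dataset visible to all of them.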

How ETL Jobs Work

Glue ETL jobs run on a distributed processing engine based on Apache Spark. You can write job scripts in Python (PySpark) or Scala, or use Glue Studio's visual editor to build ETL pipelines with drag-and-drop, without writing code. Computing resources are automatically allocated when a job starts and released when it completes. Pricing is based on DPU (Data Processing Unit) usage time.
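The transform step of such a job can be sketched in plain Python. This is only a toy version of what Glue transforms like Filter and ApplyMapping do over a DynamicFrame at Spark scale; the field names are hypothetical:

```python
def transform(records: list[dict]) -> list[dict]:
    """Toy ETL transform: drop incomplete rows, rename and cast fields.
    In a real Glue job the equivalent logic runs as Spark transforms
    (e.g. Filter and ApplyMapping) over a DynamicFrame."""
    out = []
    for r in records:
        if r.get("amount") is None:  # cleansing: skip incomplete rows
            continue
        out.append({
            "order_id": str(r["id"]),          # rename id -> order_id
            "amount_usd": float(r["amount"]),  # normalize amount to float
        })
    return out

raw = [{"id": 1, "amount": "19.99"}, {"id": 2, "amount": None}]
clean = transform(raw)  # the second record is filtered out
```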

Job Scheduling and Workflows

Glue includes a built-in scheduler for managing job execution schedules and a workflow feature for chaining multiple jobs. You can set up periodic job execution with cron expressions or define dependencies so the next job runs only when the previous one succeeds. Using triggers, you can build event-driven pipelines that automatically start ETL jobs when files arrive in S3.
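A scheduled trigger can be defined through the Glue API. The sketch below builds the parameter dict in the shape expected by boto3's `glue.create_trigger`; the trigger and job names are hypothetical, and Glue's cron syntax follows the `cron(min hour dom month dow year)` form:

```python
def scheduled_trigger(name: str, job_name: str, cron: str) -> dict:
    """Build parameters for a scheduled Glue trigger (sketch of the
    shape expected by boto3's glue.create_trigger)."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        # Glue cron expressions: cron(minutes hours day-of-month month day-of-week year)
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

# Run a (hypothetical) nightly_etl job every day at 02:15 UTC
params = scheduled_trigger("nightly", "nightly_etl", "15 2 * * ? *")
# boto3.client("glue").create_trigger(**params)  # actual call needs AWS credentials
```

A conditional trigger (`"Type": "CONDITIONAL"` with a `Predicate` on another job's state) is how the success-dependency chaining described above is expressed.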

Getting Started

To get started, create a crawler in the Glue console, specify a data source like an S3 bucket, and run it. Once the crawler detects data and registers tables in the catalog, create a visual ETL job in Glue Studio. Place source (input), transform (filtering, column changes), and target (output destination) as nodes and connect them to complete your ETL pipeline. Verify results with a test run, and if everything looks good, set up a schedule for periodic execution.
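The crawler-creation step can also be scripted. This sketch builds the parameter dict in the shape expected by boto3's `glue.create_crawler`; the crawler name, role ARN, database, and S3 path are all hypothetical:

```python
def s3_crawler(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Parameters for a crawler over an S3 path (sketch of the shape
    expected by boto3's glue.create_crawler)."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role Glue assumes to read the data
        "DatabaseName": database,  # catalog database to register tables in
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

params = s3_crawler(
    "orders-crawler",
    "arn:aws:iam::123456789012:role/GlueRole",  # placeholder account/role
    "sales_db",
    "s3://example-bucket/raw/orders/",
)
# glue = boto3.client("glue"); glue.create_crawler(**params)
# glue.start_crawler(Name="orders-crawler")  # both calls need AWS credentials
```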

Things to Watch Out For

  • Pricing is based on DPU usage time, so costs increase with large data transformations. Use small datasets for development and testing
  • Running crawlers too frequently increases costs. Set appropriate schedules based on data update frequency
  • Glue Studio's visual editor enables no-code ETL, but for complex transformation logic, writing PySpark scripts directly is more appropriate
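Since DPU-time drives the bill, a rough cost estimate is easy to sketch. The rate below (0.44 USD per DPU-hour) is an assumed on-demand figure; check current regional pricing, and note that Glue also applies a per-job minimum billing duration:

```python
def glue_job_cost(dpus: int, seconds: float, rate_per_dpu_hour: float = 0.44) -> float:
    """Rough ETL job cost: DPUs x runtime (hours) x hourly rate.
    0.44 USD/DPU-hour is an assumed rate; actual pricing varies by
    region and Glue version, with a minimum billing duration per job."""
    return dpus * (seconds / 3600) * rate_per_dpu_hour

# Example: 10 DPUs running for 15 minutes
cost = glue_job_cost(10, 15 * 60)
```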