AWS Glue
A serverless, scalable ETL service that provides integrated metadata management through Data Catalog and an Apache Spark-based job execution platform.
Overview
AWS Glue is a fully managed service that runs data extraction, transformation, and loading (ETL) in a serverless manner. Glue Data Catalog centrally manages metadata for data lakes and data warehouses, serving as a shared schema definition referenced by query engines such as Athena, Redshift Spectrum, and EMR. Crawlers automatically scan data sources to infer and register schemas, eliminating the need for manual table definitions. The job execution platform offers two types - Apache Spark and Python Shell - covering a wide range of workloads from large-scale distributed processing to lightweight script execution.
How Data Catalog and Crawlers Transform Metadata Management
At the core of Glue is the Data Catalog. It aggregates metadata from scattered data sources - Parquet files on S3, RDS tables, Redshift schemas - into a single location and exposes it through a Hive Metastore-compatible interface. When you write a query in Athena, the table specified in the FROM clause is essentially a Data Catalog entry. Crawlers periodically scan data sources, automatically infer column names, data types, and partition structures, and register them in the Catalog. Inference accuracy is not perfect, however - the first row of a CSV may be misidentified as data rather than a header, and types can fluctuate for columns containing a mix of numbers and strings. In practice, the standard approach is to visually verify the inference results after the initial Crawler run and manually correct types as needed. Custom Classifiers can be defined to handle proprietary file formats. On the Azure side, this kind of metadata management is handled by Microsoft Purview (Azure Data Factory covers the orchestration side), but the deep integration that lets Athena and Redshift Spectrum directly reference the Glue Data Catalog is a strength unique to AWS.
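The manual type correction described above can be scripted against the Glue API. The sketch below, using boto3's `get_table`/`update_table` calls, patches a mis-inferred column type in a cataloged table; the database, table, and column names are hypothetical, and the set of TableInput fields retained is a simplification.

```python
# Hedged sketch: correct a column type that a Crawler mis-inferred
# (e.g., a numeric column registered as "string"). Names are illustrative.

def fix_column_type(table_input: dict, column: str, new_type: str) -> dict:
    """Return a copy of a Glue TableInput with one column's type rewritten."""
    import copy
    fixed = copy.deepcopy(table_input)
    for col in fixed["StorageDescriptor"]["Columns"]:
        if col["Name"] == column:
            col["Type"] = new_type
    return fixed

def apply_fix(database: str, table: str, column: str, new_type: str) -> None:
    """Fetch the table definition from the Data Catalog, fix it, write it back."""
    import boto3  # requires AWS credentials when actually run
    glue = boto3.client("glue")
    current = glue.get_table(DatabaseName=database, Name=table)["Table"]
    # UpdateTable accepts only TableInput fields; drop read-only metadata
    # such as CreateTime. This field list is a simplified subset.
    table_input = {
        k: v for k, v in current.items()
        if k in ("Name", "Description", "StorageDescriptor",
                 "PartitionKeys", "TableType", "Parameters")
    }
    glue.update_table(DatabaseName=database,
                      TableInput=fix_column_type(table_input, column, new_type))
```

Running the fix through `update_table` rather than editing in the console makes the correction repeatable - useful when a re-run of the Crawler would otherwise revert the type (in which case the Crawler's schema-change policy should also be set to ignore updates).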
Spark Jobs and Python Shell - Key Considerations for DPU Sizing
Glue jobs come in two types: Apache Spark-based ETL jobs and Python Shell jobs that run on a single node. Spark jobs are suited to distributed processing of data ranging from hundreds of gigabytes to terabytes, with cluster size controlled by the number of allocated DPUs (Data Processing Units). One DPU is equivalent to 4 vCPUs and 16 GB of memory, and Spark jobs on Standard workers start from a minimum of 2 DPUs. Adding DPUs speeds up processing, but billing is DPU count multiplied by execution time, so over-allocation translates directly into wasted cost. In practice, the efficient approach is to start with 2 DPUs and scale up gradually while monitoring CloudWatch metrics for executor memory utilization and shuffle spill. Python Shell jobs, by contrast, run on 1 DPU or 0.0625 DPU and are suited to tasks like API calls and lightweight file transformations where Spark's overhead is unnecessary.
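The DPU-count-times-runtime billing model above can be made concrete with a small cost estimator. The rate and minimum billing duration below are assumptions to be checked against current pricing - roughly $0.44 per DPU-hour with a 1-minute minimum is typical for Glue Spark jobs (Glue 2.0 and later bill per second).

```python
# Hedged sketch: estimate the cost of a single Glue job run.
# RATE_PER_DPU_HOUR and MIN_BILLED_SECONDS are assumptions (verify against
# current AWS Glue pricing for your region and Glue version).

RATE_PER_DPU_HOUR = 0.44   # assumed us-east-1 rate for Spark jobs
MIN_BILLED_SECONDS = 60    # assumed per-second billing with a 1-minute minimum

def glue_job_cost(dpus: float, runtime_seconds: float) -> float:
    """Estimated USD cost: DPUs x billed hours x hourly rate."""
    billed = max(runtime_seconds, MIN_BILLED_SECONDS)
    return dpus * (billed / 3600) * RATE_PER_DPU_HOUR
```

The estimator also makes the scaling trade-off explicit: doubling DPUs only lowers cost if runtime drops by more than half, which is exactly why watching executor memory and shuffle spill before scaling up pays off.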
Job Bookmarks and Data Pipeline Idempotency
One of the trickiest problems in ETL pipelines is duplicate processing - processing the same data twice. Glue's Job Bookmark mechanism solves this by recording how far the previous job execution processed data, so the next run targets only unprocessed records. For S3 sources, it records file paths and timestamps; for JDBC sources, it records the maximum primary key value, enabling incremental processing. However, Job Bookmarks work effectively only when data is appended. When existing files are overwritten or partitions are fully replaced, Bookmarks alone cannot guarantee consistency. In such cases, you combine approaches like moving processed files to a different S3 prefix or recording processed keys in DynamoDB to manage idempotency yourself. Glue Workflows let you chain multiple crawlers and jobs with dependency relationships, and combining them with Step Functions enables building complex data pipelines with error handling and conditional branching.
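The bookmark idea for S3 sources - record a high-water mark and process only newer data - can be sketched in pure Python. This is an illustration of the pattern, not Glue's actual implementation; the function names and the file-path-to-timestamp mapping are hypothetical.

```python
# Hedged sketch of the job-bookmark pattern for an append-only S3 source:
# keep the newest last-modified timestamp processed so far, and select only
# files newer than it on the next run. Illustrative, not Glue internals.

from typing import Optional

def files_to_process(files: dict[str, float], bookmark: Optional[float]) -> list[str]:
    """Return paths whose last-modified timestamp is newer than the bookmark."""
    return sorted(p for p, ts in files.items() if bookmark is None or ts > bookmark)

def advance_bookmark(files: dict[str, float], processed: list[str],
                     bookmark: Optional[float]) -> Optional[float]:
    """Move the bookmark to the newest timestamp among the processed files."""
    if not processed:
        return bookmark
    return max(files[p] for p in processed)

# The append-only caveat from the text is visible here: if an existing file is
# overwritten in place, its path already sits below the bookmark and is skipped,
# which is why overwrite-heavy pipelines need an external idempotency store.
```

In an actual Glue Spark script, this behavior is enabled by passing `--job-bookmark-option job-bookmark-enable` and tagging each source with a `transformation_ctx`, with the bookmark state committed via `job.commit()` at the end of the run.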