AWS Lake Formation

A service that centralizes building, managing, and securing data lakes, with fine-grained column-level and row-level access control for data in Amazon S3.

Overview

AWS Lake Formation provides a single place to build an S3-based data lake, ingest data into it, register that data in a catalog, and control who can access it. It is integrated with the Glue Data Catalog and offers data source ingestion (blueprints), automatic schema detection, access control at the table, column, and row level, and cross-account data sharing. When the data lake is queried from Athena, Redshift Spectrum, or EMR, access permissions are evaluated through Lake Formation's unified permission model rather than being defined separately for each service.

Fine-Grained Access Control Beyond IAM Policies

Managing data lake access control solely through IAM policies on S3 and Glue Data Catalog resource policies becomes unmanageable as the number of tables and users grows, because every new combination of user and dataset tends to require another policy statement. Lake Formation provides its own permission model that grants access at the database, table, column, and row levels, using grant and revoke operations similar to SQL GRANT/REVOKE. Column-level filtering lets you expose all columns of a table to User A while showing User B only the non-PII columns of the same table. Row-level security lets you define data filters for rules such as "sales department users can see only their own department's data." Cell-level security combines column and row filtering in a single data filter for the most granular control, so that a user sees only selected columns and only for the rows that match the filter expression. These permissions are enforced consistently whether queries come from Athena, Redshift Spectrum, or EMR, which prevents incidents where the visible data differs depending on the access path.
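As a rough illustration of the column-level and row-level controls described above, the sketch below builds the request payloads that the boto3 `lakeformation` client accepts for `grant_permissions` and `create_data_cells_filter`. The helper function names, the role ARN, the account ID, and the database, table, and column names are all hypothetical and chosen only for this example. The code constructs plain dictionaries and makes no AWS calls.

```python
from typing import Any


def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: list[str]) -> dict[str, Any]:
    """Build kwargs for a SELECT grant restricted to the listed columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


def cell_filter_request(catalog_id: str, database: str, table: str,
                        filter_name: str, row_expression: str,
                        columns: list[str]) -> dict[str, Any]:
    """Build kwargs for a data filter combining a row predicate and a column list."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            "RowFilter": {"FilterExpression": row_expression},
            "ColumnNames": columns,
        }
    }


# Hypothetical example values: an analyst role that may read only non-PII columns,
# and a filter that limits rows to the sales department.
grant = column_grant_request(
    "arn:aws:iam::111122223333:role/analyst",
    "sales_db", "customers", ["customer_id", "region"],
)
cell_filter = cell_filter_request(
    "111122223333", "sales_db", "customers",
    "sales_dept_only", "department = 'sales'", ["customer_id", "region"],
)

# With boto3 installed and credentials configured, these would be sent as:
#   client = boto3.client("lakeformation")
#   client.grant_permissions(**grant)
#   client.create_data_cells_filter(**cell_filter)
```

Keeping payload construction separate from the API call makes the intended permissions easy to review or unit test before anything is applied to a real catalog.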

Data Lake Construction Workflow and Blueprints

The standard workflow for building a data lake with Lake Formation consists of four steps: registering the data lake location, connecting data sources, ingesting data via blueprints, and configuring permissions. First, register an S3 bucket (or a prefix within it) as the data lake location so that Lake Formation, rather than each individual user, holds the permissions to the underlying objects. Next, configure connections to the data sources, for example Amazon RDS or on-premises relational databases reachable over JDBC. Blueprints are templates for common ingestion patterns. For databases there are two types, database snapshot (full load) and incremental database; log file blueprints are also available for sources such as AWS CloudTrail and load balancer logs. When a blueprint runs, it generates a Glue workflow of crawlers and ETL jobs behind the scenes, which extract data from the source, store it on S3 in a format such as Parquet, and register it as tables in the Glue Data Catalog. This greatly reduces initial setup effort compared with building Glue jobs by hand. For complex transformations or custom logic, however, writing Glue jobs directly offers more flexibility than blueprints.
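The first step of the workflow above, registering the data lake location, can be sketched as follows. This is a minimal example that only builds the keyword arguments accepted by the boto3 `lakeformation` client's `register_resource` call; the helper name `register_location_request`, the bucket name, and the prefix are hypothetical. It assumes the service-linked role is used unless a custom role ARN is supplied.

```python
from typing import Any, Optional


def register_location_request(bucket: str, prefix: str = "",
                              role_arn: Optional[str] = None) -> dict[str, Any]:
    """Build kwargs for registering an S3 location with Lake Formation."""
    resource_arn = f"arn:aws:s3:::{bucket}"
    cleaned_prefix = prefix.strip("/")
    if cleaned_prefix:
        resource_arn += "/" + cleaned_prefix

    request: dict[str, Any] = {"ResourceArn": resource_arn}
    if role_arn:
        # A custom IAM role that Lake Formation assumes to access the location.
        request["RoleArn"] = role_arn
    else:
        # Assumption for this sketch: fall back to the service-linked role.
        request["UseServiceLinkedRole"] = True
    return request


# Hypothetical bucket and prefix.
location = register_location_request("example-data-lake", "raw/")

# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("lakeformation").register_resource(**location)
```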

Cross-Account Data Sharing and Governed Tables

Lake Formation's cross-account data sharing grants table-level and column-level access to other accounts, either within AWS Organizations or external to it. Traditional cross-account access via S3 bucket policies cannot enforce controls based on logical data structures (tables, columns), whereas Lake Formation enables granular sharing such as "Account B can read only columns A, B, and C of Table X." It integrates with AWS RAM (Resource Access Manager), which standardizes the sharing invitation and acceptance workflow. Governed Tables are a special table type managed by Lake Formation that supports ACID transactions, automatic data compaction, and time-travel queries. Even when multiple ETL jobs write to the same table simultaneously, transactions isolate the writes so that readers never see partially written data. Note that AWS announced the end of support for governed tables as of December 31, 2024, and points users toward open table formats instead, so check the current documentation before designing around them. Lake Formation's permission model can also be applied to tables in the Apache Iceberg format, and this integration continues to expand.
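The "Account B can read only columns A, B, and C of Table X" example can be expressed as a grant whose principal is the consumer account's ID. The sketch below builds such a payload for the boto3 `lakeformation` client's `grant_permissions` call. The helper name, both account IDs, and the database, table, and column names are hypothetical; the code only constructs and validates a dictionary and makes no AWS calls.

```python
from typing import Any


def _check_account_id(account_id: str) -> str:
    """AWS account IDs are 12-digit strings; reject anything else early."""
    if not (len(account_id) == 12 and account_id.isascii() and account_id.isdigit()):
        raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    return account_id


def cross_account_grant_request(owner_account_id: str, consumer_account_id: str,
                                database: str, table: str,
                                columns: list[str]) -> dict[str, Any]:
    """Build kwargs that share selected columns of a table with another account."""
    owner = _check_account_id(owner_account_id)
    consumer = _check_account_id(consumer_account_id)
    return {
        "CatalogId": owner,
        # For cross-account grants the principal identifier is the account ID.
        "Principal": {"DataLakePrincipalIdentifier": consumer},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": owner,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical accounts: 111122223333 owns Table X, 444455556666 is "Account B".
share = cross_account_grant_request(
    "111122223333", "444455556666", "sales_db", "table_x", ["a", "b", "c"],
)

# With boto3 installed and credentials configured in the owner account:
#   boto3.client("lakeformation").grant_permissions(**share)
```

After a grant like this, the consumer account would still need to accept the resulting AWS RAM share (when the accounts are not in the same organization) and grant access to its own users.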
