Building a Data Lake with Amazon S3 and Lake Formation - Design Patterns and Governance

Explore data lake design patterns using S3 as the storage foundation and Lake Formation for fine-grained access control. This article also covers ETL pipelines and cost optimization.

Data Lake Design Patterns

A data lake uses S3 as its storage foundation, offering 99.999999999% (eleven nines) durability, and moves data progressively from raw ingestion to an analysis-ready state. A three-zone layout is common: the landing (raw) zone stores ingested data as-is, the staging (processed) zone applies type conversions and cleansing via Glue jobs, and the curated zone holds analysis-ready data in Parquet format with business logic applied. The S3 prefix design adopts a year/month/day partition structure so that Athena queries can use partition pruning to reduce scan volume.
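The prefix layout above can be sketched as a small key-building helper. This is a minimal sketch assuming Hive-style `key=value` partition names (which Athena and Glue recognize automatically); the zone and dataset names are placeholders, not fixed conventions.

```python
from datetime import date

def partitioned_key(zone: str, dataset: str, d: date, filename: str) -> str:
    """Build an S3 object key using a year/month/day partition layout.

    Hive-style key=value directories let Athena prune partitions by date,
    scanning only the prefixes a query's WHERE clause actually touches.
    """
    return (
        f"{zone}/{dataset}/"
        f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/{filename}"
    )

key = partitioned_key("curated", "orders", date(2024, 5, 1), "part-0000.parquet")
print(key)
# curated/orders/year=2024/month=05/day=01/part-0000.parquet
```

With this layout, a query filtered on `year = '2024' AND month = '05'` reads only the matching prefixes instead of the whole dataset.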

Governance with Lake Formation

Lake Formation is a service that centrally manages data lake access control. Previously, S3 bucket policies, IAM policies, and Glue Data Catalog resource policies had to be configured separately; Lake Formation replaces this with GRANT/REVOKE-based permission management at the database, table, column, and row levels. With Tag-Based Access Control (LF-TBAC), you can assign classification tags to data and grant access based on those tags rather than on individual resources. Cross-account sharing allows you to grant table-level access to other accounts within Organizations, enabling a data mesh architecture.
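A column-level GRANT can be expressed as the request payload for boto3's `lakeformation.grant_permissions` API. This is a sketch only: the account ID, role name, database, table, and column names below are placeholder assumptions, and the code builds the payload without calling AWS.

```python
def build_column_grant(principal_arn: str, database: str, table: str,
                       columns: list[str]) -> dict:
    """Build a grant_permissions payload giving SELECT on specific columns.

    TableWithColumns restricts the grant to the listed columns, which is
    how Lake Formation implements column-level access control.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

grant = build_column_grant(
    "arn:aws:iam::123456789012:role/analyst",  # placeholder principal
    "sales_db", "orders", ["order_id", "amount"],
)
# Applying it would be:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

The same payload shape with `"Permissions"` removed is what `revoke_permissions` takes to undo the grant.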

ETL Pipeline Design

Data lake ETL pipelines are built with Glue jobs. Glue Crawlers detect schemas from raw data in the landing zone and register them in the Data Catalog. Glue jobs perform type conversions, missing-value handling, and deduplication, writing the results in Parquet format to the curated zone; set partition keys (for example, date and region) to optimize Athena query performance. Glue Workflows define dependencies between multiple jobs, controlling the order in which crawlers, ETL jobs, and data quality checks run. EventBridge triggers the pipeline automatically when data arrives in S3, enabling near-real-time updates.
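The event-driven trigger can be sketched as the EventBridge event pattern that matches S3 "Object Created" events under the landing-zone prefix. The bucket name and prefix here are placeholders, and this assumes EventBridge notifications are enabled on the bucket (`NotificationConfiguration` with `EventBridgeConfiguration`); the code only builds the pattern, with the actual rule creation shown in a comment.

```python
import json

def landing_event_pattern(bucket: str, prefix: str) -> str:
    """Build an EventBridge event pattern for objects created under a prefix.

    S3 emits "Object Created" events to EventBridge when notifications are
    enabled; the object-key prefix filter narrows matches to the landing zone.
    """
    return json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    })

pattern = landing_event_pattern("my-datalake", "landing/")  # placeholder names
# Passed as EventPattern when creating the rule, e.g.:
#   boto3.client("events").put_rule(Name="on-landing-arrival",
#                                   EventPattern=pattern)
```

The rule's target would then be the Glue workflow (via an EventBridge trigger) or a starter Lambda, which is what makes the pipeline near-real-time rather than schedule-driven.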

Data Lake Cost Optimization

Optimize data lake costs by leveraging S3 storage classes. Store raw data in the landing zone in S3 Standard and set a lifecycle rule to transition to S3 Intelligent-Tiering after 30 days. Keep curated data in Standard since it is frequently queried, and transition archive zone data to Glacier Instant Retrieval. Athena query costs can be dramatically reduced with Parquet format and proper partition design: in some cases scan volume drops by over 90% compared to CSV. Set appropriate DPU counts for Glue jobs to avoid over-provisioning. Use S3 Storage Lens to visualize cost breakdowns per bucket, and periodically review buckets for data that can be deleted or moved to a cheaper storage class.
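The lifecycle rules described above can be sketched as the payload for S3's `put_bucket_lifecycle_configuration` API. The prefixes and the 90-day archive threshold are illustrative assumptions; only the 30-day Intelligent-Tiering transition comes from the text. The code builds the configuration without calling AWS.

```python
# Lifecycle configuration sketch: landing-zone data moves to
# Intelligent-Tiering after 30 days; archive-zone data moves to
# Glacier Instant Retrieval (storage class "GLACIER_IR").
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "landing-to-intelligent-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "landing/"},   # placeholder prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
            ],
        },
        {
            "ID": "archive-to-glacier-ir",
            "Status": "Enabled",
            "Filter": {"Prefix": "archive/"},   # placeholder prefix
            "Transitions": [
                # 90 days is an example threshold, not from the article
                {"Days": 90, "StorageClass": "GLACIER_IR"},
            ],
        },
    ]
}
# Applying it would be:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-datalake", LifecycleConfiguration=LIFECYCLE_CONFIG)
```

No rule covers the curated prefix, which keeps that zone in S3 Standard as the article recommends for frequently queried data.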

Summary

A data lake built on S3 and Lake Formation improves data quality progressively through a three-zone design and ensures governance through fine-grained access control. Automatic schema detection with Glue Crawlers and adoption of a columnar format deliver both operational efficiency and query performance.