Amazon Athena

A serverless analytics service that lets you run standard SQL queries against data stored in S3, enabling petabyte-scale data analysis with pay-per-query pricing and no infrastructure management

Overview

Amazon Athena is a serverless interactive query service that runs standard SQL queries against data stored in S3, using a query engine based on Trino (formerly Presto). There is no need to provision databases or clusters: you define a schema over your S3 data and start querying. It supports a wide range of data formats, including CSV, JSON, Parquet, ORC, and Avro, and integrates with Glue Data Catalog for schema management. Pricing is based on the amount of data each query scans, at $5 per TB (rates can vary by region). Columnar formats such as Parquet or ORC can dramatically reduce costs because only the columns a query references are scanned.
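To make the pay-per-scan pricing concrete, the following is a rough sketch of the arithmetic. It assumes the $5 per TB figure quoted above, treats 1 TB as 1024^4 bytes, and models a columnar format very crudely by assuming every column is the same size with no compression. The function names are illustrative and not part of any AWS SDK, and the sketch ignores any per-query minimums or rounding that real billing applies.

```python
# Rough cost model for scan-based pricing (illustrative only).
# PRICE_PER_TB_USD is the figure quoted in the text; real rates may differ.
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the approximate cost in USD of scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def columnar_bytes_scanned(total_bytes: int,
                           columns_read: int,
                           total_columns: int) -> int:
    """Crude estimate of bytes read from a columnar file.

    Assumes all columns are equally sized and uncompressed, which real
    Parquet/ORC files are not; treat the result as an order-of-magnitude guide.
    """
    if not 0 < columns_read <= total_columns:
        raise ValueError("columns_read must be between 1 and total_columns")
    return total_bytes * columns_read // total_columns


# Full scan of a 1 TB row-oriented dataset versus reading 2 of 20 columns.
full_scan_cost = estimate_query_cost(BYTES_PER_TB)
two_column_cost = estimate_query_cost(
    columnar_bytes_scanned(BYTES_PER_TB, columns_read=2, total_columns=20)
)
```

Under these assumptions the full scan costs $5.00 while the two-column read costs about $0.50. In practice, compression in columnar formats usually shrinks the scanned volume further.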

How Serverless Analytics Works

Each time you run a query, Athena automatically provisions the necessary compute resources and releases them when the query completes. Unlike a provisioned Redshift cluster, which is billed while it is running whether or not it is used, Athena incurs no query charges when no queries are running (S3 storage is still billed separately). The query engine is based on Trino (formerly Presto) and uses distributed processing to scan large datasets at high speed. How your data is stored is critical to Athena's performance. CSV and JSON are row-oriented formats, so the entire file must be scanned even when a query needs only a few columns. Parquet and ORC are columnar formats, so Athena reads only the required columns, dramatically reducing the amount of data scanned. Partitioning your data (for example, with a year/month/day directory structure) also lets Athena scan only the partitions that match your WHERE clause, improving both cost and performance.
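As an illustration of the year/month/day layout, the sketch below builds S3 prefixes for a date range. It assumes Hive-style `key=value` directory names, which is one common convention and not the only layout Athena can work with. The bucket name and the helper names are hypothetical.

```python
from datetime import date, timedelta


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style year/month/day prefix under `base` for one day."""
    return (
        f"{base.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def prefixes_for_range(base: str, start: date, end: date) -> list[str]:
    """List the prefixes a query restricted to [start, end] would need to read.

    This mimics partition pruning: days outside the range are never listed,
    so their files would not be scanned.
    """
    if end < start:
        raise ValueError("end must not be before start")
    day_count = (end - start).days + 1
    return [partition_prefix(base, start + timedelta(days=i))
            for i in range(day_count)]


# Hypothetical bucket; a four-day window touches exactly four prefixes.
needed = prefixes_for_range(
    "s3://example-bucket/logs", date(2024, 1, 30), date(2024, 2, 2)
)
```

A query whose WHERE clause restricts `year`, `month`, and `day` to that window would only need the files under those four prefixes, however many years of data sit beside them.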

Analyzing CloudTrail Logs and VPC Flow Logs

One of the most common use cases for Athena is AWS log analysis. By delivering CloudTrail logs to S3 and querying them with Athena, you can extract a specific user's API call history, the access patterns for a particular resource, or a list of API calls that resulted in errors, all with standard SQL. VPC Flow Logs can likewise be delivered to S3 and analyzed with Athena, letting you identify traffic patterns from specific IP addresses, list rejected traffic, and pinpoint high-volume data transfer destinations. ALB access logs, S3 access logs, and CloudFront access logs can be analyzed the same way. For these log analysis workloads, converting the data to Parquet and partitioning it beforehand can, in favorable cases, reduce query costs by over 90%.
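A query for one user's failed API calls might look like the sketch below, which only builds the SQL text. The table name `cloudtrail_logs` is hypothetical, and the column names (`eventtime`, `eventsource`, `eventname`, `errorcode`, `useridentity.username`) are assumed to follow the commonly used CloudTrail table definition; adjust them to match your own schema.

```python
import re

# Conservative pattern for a bare table identifier (assumption: no quoting).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def failed_calls_query(table: str, user_name: str, limit: int = 100) -> str:
    """Build SQL listing API calls by `user_name` that returned an error.

    Column names are assumptions about the CloudTrail table schema.
    """
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    # Escape single quotes by doubling them, as in standard SQL literals.
    escaped_user = user_name.replace("'", "''")
    return (
        "SELECT eventtime, eventsource, eventname, errorcode\n"
        f"FROM {table}\n"
        f"WHERE useridentity.username = '{escaped_user}'\n"
        "  AND errorcode IS NOT NULL\n"
        "ORDER BY eventtime DESC\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical table and user name.
sql = failed_calls_query("cloudtrail_logs", "alice", limit=50)
```

If the table is partitioned by date, adding the partition columns to the WHERE clause keeps the scan limited to the period of interest.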

Practical Usage Patterns

Athena is well suited to ad-hoc analysis on a data lake. With S3 as the data lake, you aggregate data from various sources (application logs, IoT data, business data) and analyze it across sources with Athena. A common pipeline uses Glue ETL jobs to convert data to Parquet, Glue Data Catalog to manage schemas, and Athena to run the queries. Integrating with QuickSight lets you visualize Athena query results as dashboards. The closest Azure counterpart is the serverless SQL pool in Azure Synapse Analytics, which similarly lets you run SQL queries against data in Azure Blob Storage.
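For running such queries programmatically, the following sketch assembles the request parameters for the `start_query_execution` operation of the boto3 Athena client. The database name, results location, and helper names are hypothetical. The `submit` function is shown for context only: it needs boto3 installed and AWS credentials configured, and it is never called here.

```python
def build_query_request(sql: str, database: str, output_location: str) -> dict:
    """Assemble keyword arguments for the Athena start_query_execution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        # Database in the Glue Data Catalog that the query runs against.
        "QueryExecutionContext": {"Database": database},
        # S3 location where Athena writes the query results.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit(request: dict) -> str:
    """Send the request and return the query execution ID (requires AWS access)."""
    import boto3  # third-party; imported lazily so the sketch loads without it

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical database and results bucket.
request = build_query_request(
    "SELECT COUNT(*) FROM app_logs WHERE year = '2024'",
    database="datalake",
    output_location="s3://example-bucket/athena-results/",
)
```

Keeping the request builder separate from the call that needs credentials makes the parameters easy to inspect and test without touching AWS.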
