Amazon Redshift (Popular, since 2012)
A cloud data warehouse service for high-speed analytics on petabyte-scale data
What It Does
Amazon Redshift is a fully managed cloud data warehouse service for fast aggregation and analysis of large datasets. It runs standard SQL queries against petabyte-scale data and is often cited as delivering up to 10x better performance than traditional on-premises data warehouses, though the actual gain depends on the workload. Its columnar storage and massively parallel processing (MPP) architecture let aggregation queries scan billions of rows in seconds. Redshift Spectrum also lets you query data in Amazon S3 directly, without loading it first.
Use Cases
Aggregating and reporting on sales and customer data, building dashboards with BI tools, analyzing marketing campaign effectiveness, long-term log storage and trend analysis, and building cross-organizational analytics platforms that integrate multiple data sources.
Everyday Analogy
Think of a massive warehouse. In a regular shelf system (row-oriented database), all information about one product is stored together. Redshift's columnar storage is like having a shelf just for "prices" and another just for "categories" - grouping the same type of information together. When you need "the total of all product prices," you only check the price shelf, so you get your answer dramatically faster without walking through the entire warehouse.
What Is Redshift?
Amazon Redshift is AWS's cloud data warehouse service, announced in 2012 and generally available since early 2013. A data warehouse is a specialized database that collects data from various enterprise systems into one place for analysis and reporting. Redshift combines columnar storage with massively parallel processing (MPP) to handle aggregation queries across billions of rows at high speed. Its SQL dialect is based on PostgreSQL, so most existing BI tools and SQL clients can connect to it, although some PostgreSQL features are not supported.
Key Features
At the core of Redshift is columnar storage. While row-oriented databases store data record by record, columnar storage stores data column by column. Aggregation queries read only the columns they need, which dramatically reduces I/O. In addition, the values within a single column tend to be similar, so they compress efficiently and lower storage costs. With Redshift Serverless, you can skip cluster management entirely and pay for the compute capacity your workloads consume (measured in Redshift Processing Units, or RPUs) rather than for an always-on cluster.
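To make the I/O difference concrete, here is a minimal Python sketch that models the two layouts in memory. It is a toy illustration, not how Redshift lays out blocks on disk; the sample table and the `values_read` counters are invented for the example.

```python
# Toy model of row-oriented vs. columnar layouts (illustrative only;
# Redshift's real storage is block-based and compressed).

rows = [
    {"product_id": 1, "category": "books", "price": 12.5},
    {"product_id": 2, "category": "games", "price": 40.0},
    {"product_id": 3, "category": "books", "price": 7.5},
]

# Columnar layout: one list per column, built from the same records.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_price_row_store(table):
    """Sum prices by scanning whole records; returns (total, values_read)."""
    total, values_read = 0.0, 0
    for record in table:
        values_read += len(record)  # every field of the record is touched
        total += record["price"]
    return total, values_read


def sum_price_column_store(cols):
    """Sum prices by scanning only the price column; returns (total, values_read)."""
    prices = cols["price"]
    return sum(prices), len(prices)


row_total, row_reads = sum_price_row_store(rows)
col_total, col_reads = sum_price_column_store(columns)
print(f"row store:    total={row_total}, values read={row_reads}")
print(f"column store: total={col_total}, values read={col_reads}")
```

Both functions return the same total, but the columnar version touches one value per row instead of every field. On a wide table with dozens of columns, that gap is what makes aggregation cheap.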
Redshift Spectrum and S3 Integration
Redshift Spectrum lets you run SQL queries directly against data on S3 from your Redshift cluster. You can analyze massive amounts of data in your S3 data lake without loading it into Redshift. You can even JOIN Redshift tables with S3 data, enabling a tiered data management strategy with hot data in Redshift and cold data in S3. Schema management integrates with the AWS Glue Data Catalog.
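As a rough sketch of what this looks like in practice, the Python snippet below builds the two SQL statements involved: one that maps a Glue Data Catalog database to an external schema, and one that joins an S3-backed table with a local Redshift table. Every name here (the `spectrum` schema, the `datalake` Glue database, the IAM role ARN, and the `clickstream` and `customers` tables) is a hypothetical placeholder, and the strings are built without any input sanitization.

```python
# Hypothetical names throughout: schema, Glue database, IAM role ARN,
# and both tables are placeholders, not values from this article.


def external_schema_sql(schema: str, glue_database: str, iam_role_arn: str) -> str:
    """Build a statement exposing a Glue Data Catalog database as an external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


# Join cold data on S3 (spectrum.clickstream) with a hot local table (customers).
JOIN_QUERY = """\
SELECT c.event_date, COUNT(*) AS clicks
FROM spectrum.clickstream AS c
JOIN customers AS cu ON cu.customer_id = c.customer_id
WHERE cu.segment = 'premium'
GROUP BY c.event_date;
"""

schema_sql = external_schema_sql(
    schema="spectrum",
    glue_database="datalake",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
print(schema_sql)
print(JOIN_QUERY)
```

The external table itself would be defined in the Glue Data Catalog (or with `CREATE EXTERNAL TABLE`); once the schema exists, it can be referenced in queries like any other table.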
Performance Optimization
Redshift can choose distribution and sort keys automatically, and you can also set them yourself. A distribution key determines which node each row is stored on; when two tables share the same distribution key, matching rows land on the same node, which minimizes inter-node data transfer during JOIN operations. Sort keys improve the efficiency of range scans and filtering because blocks outside the requested range can be skipped. Materialized views let you pre-compute and store the results of frequently run complex queries, reducing response times. Concurrency Scaling automatically adds transient cluster capacity when concurrent query volume increases, maintaining performance.
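The effect of a shared distribution key can be illustrated with a small simulation. The sketch below hashes rows onto a hypothetical four-node cluster and then joins each node's local slices without moving any rows. The CRC32 hash, the node count, and the `orders` and `customers` tables are assumptions made for the example; Redshift's internal hashing is not exposed and is not claimed to work this way.

```python
import zlib
from collections import defaultdict

NODE_COUNT = 4  # hypothetical cluster size


def node_for(key, node_count=NODE_COUNT):
    """Map a distribution-key value to a node with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def distribute(rows, dist_key):
    """Group rows by the node their distribution-key value hashes to."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for(row[dist_key])].append(row)
    return placement


orders = [{"order_id": i, "customer_id": i % 5} for i in range(20)]
customers = [{"customer_id": i, "name": f"customer-{i}"} for i in range(5)]

# Both tables use customer_id as the distribution key.
orders_by_node = distribute(orders, "customer_id")
customers_by_node = distribute(customers, "customer_id")

# Each node joins only its own slices; no row crosses a node boundary.
joined = []
for node, local_orders in orders_by_node.items():
    local_customers = {c["customer_id"]: c for c in customers_by_node[node]}
    for order in local_orders:
        customer = local_customers[order["customer_id"]]
        joined.append((order["order_id"], customer["name"]))

print(f"joined {len(joined)} of {len(orders)} orders using node-local data only")
```

If `orders` were distributed on `order_id` instead, an order and its customer would usually sit on different nodes, and one side would have to be shipped across the network before the join could run.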
Getting Started
The easiest way to start with Redshift is Redshift Serverless. In the Redshift console, select "Serverless," create a namespace and workgroup, and you'll have a query environment ready in minutes. Use Query Editor v2 to run SQL directly in the browser. Start by loading sample data from S3 with the COPY command and try some aggregation queries. For production use, building ETL pipelines with AWS Glue or Step Functions is a common approach.
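For readers who prefer scripting to the console, the following sketch builds a COPY statement and submits it through the Redshift Data API with boto3. The table name, bucket path, IAM role ARN, workgroup, and database are hypothetical placeholders, and the CSV options assume a file with a single header row. `run_on_serverless` needs boto3 and valid AWS credentials, so it is defined here but not called.

```python
def build_copy_sql(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a COPY statement for CSV files with one header row (no input sanitization)."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


def run_on_serverless(sql: str, workgroup: str, database: str) -> str:
    """Submit a statement via the Redshift Data API and return its statement id."""
    import boto3  # imported lazily so build_copy_sql works without AWS dependencies

    client = boto3.client("redshift-data")
    response = client.execute_statement(
        WorkgroupName=workgroup,
        Database=database,
        Sql=sql,
    )
    return response["Id"]


# All values below are placeholders for illustration.
copy_sql = build_copy_sql(
    table="sales",
    s3_uri="s3://example-bucket/sample/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(copy_sql)
# statement_id = run_on_serverless(copy_sql, workgroup="example-workgroup", database="dev")
```

The Data API call is asynchronous: it returns a statement id immediately, and the load's status can then be polled with the client's `describe_statement` method.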
Things to Watch Out For
- Redshift Serverless is easy to start with, but provisioned clusters may be more cost-effective for consistently heavy query workloads
- Use the COPY command for loading data from S3 - avoid row-by-row INSERT statements. COPY is optimized for parallel loading
- Due to columnar storage characteristics, Redshift is not suited for OLTP workloads with frequent single-row updates/deletes. Consider RDS or DynamoDB for OLTP
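The Serverless versus provisioned trade-off in the first point comes down to simple arithmetic, sketched below. The prices are made-up placeholders, not AWS list prices, and the model is deliberately crude: it assumes the provisioned cluster runs around the clock at an on-demand rate and ignores storage, reserved pricing, and Serverless minimum billing increments.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

# Placeholder prices for illustration only; check current AWS pricing for real figures.
PLACEHOLDER_PRICE_PER_RPU_HOUR = 0.50
PLACEHOLDER_PRICE_PER_NODE_HOUR = 1.00


def serverless_monthly_cost(active_hours, rpus, price_per_rpu_hour):
    """Serverless compute is billed only while workloads are running."""
    return active_hours * rpus * price_per_rpu_hour


def provisioned_monthly_cost(nodes, price_per_node_hour):
    """A provisioned cluster is assumed to run (and bill) all month."""
    return nodes * price_per_node_hour * HOURS_PER_MONTH


def break_even_active_hours(rpus, price_per_rpu_hour, nodes, price_per_node_hour):
    """Active hours per month at which both options cost the same."""
    hourly_serverless = rpus * price_per_rpu_hour
    return provisioned_monthly_cost(nodes, price_per_node_hour) / hourly_serverless


hours = break_even_active_hours(
    rpus=8,
    price_per_rpu_hour=PLACEHOLDER_PRICE_PER_RPU_HOUR,
    nodes=2,
    price_per_node_hour=PLACEHOLDER_PRICE_PER_NODE_HOUR,
)
print(f"break-even at about {hours:.0f} active hours per month")
```

Below the break-even point the Serverless option is cheaper in this model; above it, the always-on cluster wins. Plugging in real prices and your own query-hour estimate gives a first approximation before a more careful cost analysis.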