Amazon Redshift
A fully managed cloud data warehouse service that uses columnar storage and massively parallel processing to analyze petabyte-scale data at high speed
Overview
Amazon Redshift is a fully managed cloud data warehouse purpose-built for large-scale analytics. Columnar storage, massively parallel processing (MPP), automatic compression, and zone maps let it run complex analytical queries quickly against petabyte-scale data. Redshift Serverless lets you start analyzing data without provisioning or managing clusters, and bills for the compute capacity you actually use. Redshift Spectrum queries data in place in S3 without loading it into Redshift, so a single SQL layer can span your data lake and data warehouse. AQUA (Advanced Query Accelerator) adds compute capability at the storage layer, which AWS says can accelerate certain query patterns by up to 10x.
How Columnar Storage and MPP Work
Redshift uses an MPP (Massively Parallel Processing) architecture composed of a leader node and compute nodes. The leader node parses queries, builds the execution plan, and returns the combined result to the client, while the compute nodes store the data and execute their share of the plan in parallel. Columnar storage reads only the columns an analytical query needs, dramatically reducing I/O compared to row-oriented databases. Data is compressed automatically, with a compression encoding chosen for each column based on the characteristics of its data. Azure Synapse Analytics also uses an MPP architecture, but Redshift uses a SQL dialect based on PostgreSQL whereas Synapse uses T-SQL (SQL Server compatible), so the choice often depends on your existing skill set.
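The I/O saving from columnar storage can be illustrated with a toy model. The sketch below is not how Redshift lays out blocks on disk; the table, column names, and the "values read" counter are all hypothetical, and it only counts how many stored values a SUM over one column would have to touch under each layout.

```python
# Toy model: count how many stored values must be read to compute SUM(amount)
# under a row-oriented layout versus a column-oriented layout.
# The table and its column names are hypothetical.
rows = [
    {"order_id": 1, "customer_id": 10, "region": "east", "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "region": "west", "amount": 40.0},
    {"order_id": 3, "customer_id": 10, "region": "east", "amount": 15.0},
    {"order_id": 4, "customer_id": 12, "region": "north", "amount": 20.0},
]


def sum_row_store(table, column):
    """Row layout: fields of a row are stored together, so all are read."""
    values_read = 0
    total = 0.0
    for row in table:
        values_read += len(row)  # the whole row is fetched
        total += row[column]
    return total, values_read


def to_column_store(table):
    """Pivot the rows into one contiguous list per column."""
    return {name: [row[name] for row in table] for name in table[0]}


def sum_column_store(columns, column):
    """Column layout: only the requested column is read."""
    data = columns[column]
    return sum(data), len(data)


row_total, row_reads = sum_row_store(rows, "amount")
col_total, col_reads = sum_column_store(to_column_store(rows), "amount")
print(row_total, row_reads)  # same total, 16 values read
print(col_total, col_reads)  # same total, 4 values read
```

Both layouts return the same total, but the column layout touches one quarter of the values here because the table has four columns. On wide analytical tables the ratio grows accordingly, and storing similar values together is also what makes per-column compression effective.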
Query Optimization with Distribution Keys and Sort Keys
Redshift query performance depends heavily on table design. Setting the distribution key (DISTKEY) to a column frequently used in JOINs places rows with matching key values on the same node, so the join can run locally and inter-node data movement (redistribution) is minimized. A well-chosen sort key (SORTKEY) stores rows in sorted order, which lets zone maps (per-block minimum and maximum values) skip blocks that cannot match a WHERE clause filter and significantly reduces the volume of data scanned. RA3 node types allow compute and storage to scale independently, and Redshift Managed Storage automatically offloads infrequently accessed data to S3.
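Both effects can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Redshift's implementation: the slice count, the modulo function standing in for the real hash distribution, the block size, and the sample tables are all hypothetical.

```python
NUM_SLICES = 4  # hypothetical number of slices


def distribute(rows, key, num_slices=NUM_SLICES):
    """Place each row on a slice chosen from its distribution key.
    Modulo is a stand-in for the real hash function."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[row[key] % num_slices].append(row)
    return slices


def local_join(order_slices, customer_slices):
    """Join slice by slice; no row ever leaves its slice."""
    joined = []
    for o_slice, c_slice in zip(order_slices, customer_slices):
        names = {c["customer_id"]: c["name"] for c in c_slice}
        for o in o_slice:
            joined.append((names[o["customer_id"]], o["amount"]))
    return sorted(joined)


orders = [{"customer_id": c, "amount": a}
          for c, a in [(1, 10), (2, 20), (5, 30), (6, 40)]]
customers = [{"customer_id": c, "name": n}
             for c, n in [(1, "Ann"), (2, "Bob"), (5, "Cy"), (6, "Di")]]
# Same key on both tables, so matching rows land on the same slice.
joined = local_join(distribute(orders, "customer_id"),
                    distribute(customers, "customer_id"))


def build_zone_map(values, block_size):
    """Split values into blocks and record (min, max, block) for each."""
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    return [(min(b), max(b), b) for b in blocks]


def scan_range(zone_map, low, high):
    """Read only blocks whose [min, max] range overlaps [low, high]."""
    hits, blocks_read = [], 0
    for block_min, block_max, block in zone_map:
        if block_max < low or block_min > high:
            continue  # skipped using the zone map alone
        blocks_read += 1
        hits.extend(v for v in block if low <= v <= high)
    return hits, blocks_read


sorted_days = list(range(1, 101))                      # stored in sort-key order
scattered_days = [(i * 37) % 100 + 1 for i in range(100)]  # same values, unsorted

sorted_hits, sorted_blocks = scan_range(build_zone_map(sorted_days, 10), 42, 47)
scattered_hits, scattered_blocks = scan_range(
    build_zone_map(scattered_days, 10), 42, 47)
print(joined)
print(sorted_blocks, scattered_blocks)
```

With the data stored in sort-key order, the filter for days 42 to 47 reads a single block. With the same values scattered, most blocks span a wide range, so the zone map can rule out far fewer of them even though the result is identical.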
Data Lake Integration with Spectrum and Serverless
Redshift Spectrum queries data stored in S3 (Parquet, ORC, CSV, JSON) in place through external tables, with no load step. A single query can JOIN local Redshift tables with S3 external tables, which supports a cost-efficient split with hot data in Redshift and cold data in S3. Redshift Serverless bills compute in RPUs (Redshift Processing Units) and incurs no compute charges while no queries are running, so it can be significantly cheaper than a provisioned cluster for intermittent analytical workloads. Enabling Concurrency Scaling automatically adds cluster capacity when concurrent query volume rises, so queries spend less time waiting in queues. AQUA (Advanced Query Accelerator) pushes filtering and aggregation down to the storage layer, which AWS says can accelerate large-scale scans by up to 10x.
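The arithmetic behind RPU-based billing can be sketched as follows. The price per RPU-hour, the RPU count, and the workload below are hypothetical placeholders, and the 60-second minimum per active period is a parameter that should be checked against current AWS pricing; storage is billed separately and is not modeled.

```python
def serverless_compute_cost(active_seconds, rpus, price_per_rpu_hour,
                            min_billed_seconds=60):
    """Estimate compute cost for a list of active periods (in seconds).

    Each period is billed per second, rounded up to the assumed minimum.
    Idle time between periods contributes nothing.
    """
    billed_seconds = sum(max(s, min_billed_seconds) for s in active_seconds)
    rpu_hours = rpus * billed_seconds / 3600
    return rpu_hours * price_per_rpu_hour


# Hypothetical workload: 30 bursts of 120 seconds each on 8 RPUs,
# at an illustrative (not actual) price of 0.40 per RPU-hour.
bursts = [120] * 30
intermittent = serverless_compute_cost(bursts, rpus=8, price_per_rpu_hour=0.40)

# The same capacity kept busy for a full 24 hours, for comparison.
always_on = serverless_compute_cost([24 * 3600], rpus=8, price_per_rpu_hour=0.40)

print(round(intermittent, 2), round(always_on, 2))
```

In this example the bursts add up to one hour of activity, so the intermittent workload is charged for 8 RPU-hours rather than the 192 RPU-hours that the same capacity would accrue if it ran all day. That gap is why usage-based billing tends to favor intermittent workloads, while steady, heavily used workloads may still be better served by a provisioned cluster.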