Redshift — Concept
What it is
Amazon Redshift = a managed petabyte-scale data warehouse (OLAP). Columnar storage, massively parallel processing (MPP), SQL on TB–PB of historical data.
Why it exists
RDS/Aurora are OLTP (transactional, row-based) and can't efficiently scan billions of rows. Redshift is OLAP: columnar storage, compression, parallel scans — for BI dashboards, reporting, analytics over historical data.
Architecture
- Cluster = 1 leader node + many compute nodes.
- Leader parses SQL, plans, aggregates results.
- Compute nodes hold partitions ("slices") and run in parallel.
- Node types: RA3 (managed storage, separate compute/storage), DC2 (legacy SSD).
Redshift Serverless
- No cluster sizing — scale automatically.
- Pay per RPU-hour.
- Best for variable / unpredictable analytics workloads.
Spectrum
- Query data directly in S3 without loading.
- Schema in Glue Data Catalog.
- Good for data lake federation; cheap because compute is separate from S3 storage.
Concurrency
- Concurrency Scaling = adds extra clusters on demand for read peaks; you get free credits.
- Workload Management (WLM) = queues with priorities; auto-WLM available.
Loading & integrations
COPYfrom S3 (best practice, parallel).- Federated Query: Postgres / RDS / Aurora live tables.
- Redshift Streaming Ingestion from Kinesis / MSK.
- AWS DMS for migration from other DBs.
Backups
- Automated snapshots to S3 (1-day default, configurable up to 35).
- Cross-region copy for DR.
Security
- VPC, KMS encryption, IAM auth, audit logs to S3.
- Enhanced VPC routing for COPY/UNLOAD over VPC.
When to use vs alternatives
| Use ... | Instead of ... | When ... |
|---|---|---|
| Redshift | Aurora | TB–PB analytical reporting, complex aggregations |
| Athena | Redshift | Ad-hoc queries on S3, low/no setup, pay per query |
| EMR / Spark | Redshift | Heavy custom transformations, ML pipelines |
| OpenSearch | Redshift | Log search and full-text |
| DynamoDB | Redshift | OLTP, not analytics |
| Redshift Serverless | Provisioned | Variable / unknown load, less ops |
| Redshift Spectrum | COPY into Redshift | Data lake; want to query S3 directly |
Common exam scenarios
- "BI dashboards over 50 TB of historical data" → Redshift.
- "Ad-hoc SQL on data already in S3 / no infra" → Athena.
- "Query S3 + warehouse together" → Redshift Spectrum.
- "Streaming ingestion of clickstream into warehouse" → Kinesis → Redshift Streaming Ingestion.
- "Variable analytics workload, no ops" → Redshift Serverless.
- "Federated SQL over Aurora + Redshift without ETL" → Federated Query.
Exam tip
Athena = serverless SQL over S3, ad-hoc, cheap, no infra. Redshift = persistent warehouse, faster on big joins/aggregations at scale, you load data in. EMR = full Hadoop/Spark stack for custom big-data pipelines.