Data Analytics — Concept (Athena, Glue, EMR, OpenSearch, QuickSight)
A bundle of services that surround the data lake / warehouse story. The exam mostly tests when to pick each.
Athena
- Serverless SQL over data in S3 (CSV, JSON, ORC, Parquet, Avro).
- Pay per TB scanned ($5 / TB roughly).
- Uses Glue Data Catalog for schema.
- Athena Federated Query → SQL across RDS, DynamoDB, Redshift, etc.
- Best practices: columnar (Parquet/ORC), compress (snappy/zstd), partition (year/month/day), use projections.
- No infra to manage — great for ad-hoc analytics, log queries (CloudFront logs, VPC Flow Logs).
Glue
- Managed ETL service.
- Glue Data Catalog = central metadata (used by Athena, Redshift Spectrum, EMR, Lake Formation).
- Crawlers auto-discover schemas in S3 / JDBC.
- Glue Jobs = Spark- or Python-based ETL (Glue Studio = visual editor).
- Glue DataBrew = no-code data wrangling.
- Glue Streaming ETL for Kinesis / MSK.
- Lake Formation uses Glue under the hood for access control.
EMR (Elastic MapReduce)
- Managed Hadoop / Spark / Hive / Presto / HBase / Flink clusters.
- For heavy custom big-data jobs, ML pipelines, large transformations.
- Run on EC2 (with Spot for cost), or EMR Serverless, or EMR on EKS.
- Best for teams that already use Spark/Hive.
OpenSearch (formerly Elasticsearch)
- Managed search & analytics on JSON documents.
- Use cases: log analytics, full-text search, security analytics (SIEM-like).
- OpenSearch Serverless scales without sizing.
- Kibana / OpenSearch Dashboards for visualization.
- Integrates with Kinesis Firehose → OpenSearch and CloudWatch Logs subscription.
QuickSight
- Managed BI / dashboards with SPICE in-memory engine.
- Connects to S3 / Athena / Redshift / RDS / SaaS sources.
- Per-user / per-session pricing.
- QuickSight ML Insights for anomaly detection.
Lake Formation
- Layer above Glue + S3 to centralize fine-grained data lake permissions (table, column, row level) for analytics services.
Pick-the-service cheatsheet
| You want | Use |
|---|---|
| Ad-hoc SQL on S3, serverless | Athena |
| Petabyte BI warehouse | Redshift |
| Heavy Spark / Hadoop jobs | EMR |
| ETL with managed Spark, schema discovery | Glue |
| Log search / SIEM / full-text | OpenSearch |
| Dashboards & BI on top | QuickSight |
| Cross-source SQL federation | Athena Federated Query or Redshift Federated Query |
| Data lake fine-grained access control | Lake Formation |
Common exam scenarios
- "Query S3 logs ad-hoc, low ops, low cost" → Athena with Parquet + partitioning.
- "Build ETL pipeline from S3 → Redshift, no infra" → Glue.
- "Need to run a 12-hour Spark job" → EMR (Spot for cost).
- "Search through millions of logs by free text" → OpenSearch.
- "Self-service dashboards for finance team" → QuickSight.
- "Sync clickstream → near-real-time analytics dashboards" → Kinesis → Firehose → S3/Redshift → QuickSight (or OpenSearch).
Exam tip
Athena vs Redshift: ad-hoc & cheap = Athena; complex aggregations + BI tools constantly = Redshift. Glue is the glue (catalog + ETL); EMR is for custom Spark/Hadoop.