Data Analytics — Concept (Athena, Glue, EMR, OpenSearch, QuickSight)

A bundle of services that surround the data lake / warehouse story. The exam mostly tests when to pick each.

Athena

Serverless SQL over data in S3 (CSV, JSON, ORC, Parquet, Avro).
Pay per TB scanned (roughly $5 per TB).
Uses Glue Data Catalog for schema.
Athena Federated Query → SQL across RDS, DynamoDB, Redshift, etc.
Best practices: use columnar formats (Parquet/ORC), compress (Snappy/zstd), partition (year/month/day), and use partition projection.
No infra to manage — great for ad-hoc analytics, log queries (CloudFront logs, VPC Flow Logs).
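
To make the pay-per-scan model concrete, here is a minimal sketch of how the best practices above shrink a bill. It assumes the rough $5 per TB figure quoted above and treats 1 TB as 1024^4 bytes; the function names, the 25% compression ratio, and the column and partition counts are hypothetical values chosen for illustration, not AWS numbers.

```python
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0     # assumption: the rough figure from these notes


def athena_scan_cost(bytes_scanned: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimated cost in USD of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def reduced_scan_bytes(
    raw_bytes: float,
    compression_ratio: float,   # compressed size / raw size
    columns_read: int,
    columns_total: int,
    partitions_read: int,
    partitions_total: int,
) -> float:
    """Rough bytes scanned after columnar storage, compression and partition pruning.

    Assumes data is spread evenly across columns and partitions.
    """
    column_fraction = columns_read / columns_total
    partition_fraction = partitions_read / partitions_total
    return raw_bytes * compression_ratio * column_fraction * partition_fraction


# Hypothetical 1 TB of raw CSV logs, queried in full.
full_cost = athena_scan_cost(1 * BYTES_PER_TB)

# Same data as compressed Parquet, reading 2 of 20 columns for 1 of 30 days.
optimized_bytes = reduced_scan_bytes(1 * BYTES_PER_TB, 0.25, 2, 20, 1, 30)
optimized_cost = athena_scan_cost(optimized_bytes)

print(f"full scan: ${full_cost:.2f}, optimized: ${optimized_cost:.4f}")
```

The point for the exam is the shape of the arithmetic: each best practice multiplies the scanned bytes by a fraction below one, so they compound.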

Glue

Managed ETL service.
Glue Data Catalog = central metadata (used by Athena, Redshift Spectrum, EMR, Lake Formation).
Crawlers auto-discover schemas in S3 / JDBC.
Glue Jobs = Spark- or Python-based ETL (Glue Studio = visual editor).
Glue DataBrew = no-code data wrangling.
Glue Streaming ETL for Kinesis / MSK.
Lake Formation builds on the Glue Data Catalog and adds fine-grained access control on top.
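
As a rough illustration of what "crawlers auto-discover schemas" means, the sketch below infers column types from a few sample records. This is a toy in plain Python, not Glue's actual classifier logic; the function names, the type labels, and the widening rules are assumptions made for the example.

```python
from typing import Any


def infer_type(value: Any) -> str:
    """Map a Python value to a catalog-style type name (labels are illustrative)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(records: list[dict[str, Any]]) -> dict[str, str]:
    """Merge per-record types into one schema, widening on conflict."""
    schema: dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue  # nulls carry no type information
            seen = infer_type(value)
            previous = schema.get(column)
            if previous is None or previous == seen:
                schema[column] = seen
            elif {previous, seen} == {"bigint", "double"}:
                schema[column] = "double"   # integers widen to double
            else:
                schema[column] = "string"   # anything else falls back to string
    return schema


# Hypothetical sample records, as a crawler might read them from JSON files.
samples = [
    {"id": 1, "price": 2, "in_stock": True},
    {"id": 2, "price": 2.5, "note": "clearance"},
]
print(infer_schema(samples))
```

A real crawler writes the resulting table definition into the Glue Data Catalog, which is what Athena, Redshift Spectrum, and EMR then read.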

EMR (Elastic MapReduce)

Managed Hadoop / Spark / Hive / Presto / HBase / Flink clusters.
For heavy custom big-data jobs, ML pipelines, large transformations.
Run on EC2 (with Spot for cost), or EMR Serverless, or EMR on EKS.
Best for teams that already use Spark/Hive.

OpenSearch (formerly Amazon Elasticsearch Service)

Managed search & analytics on JSON documents.
Use cases: log analytics, full-text search, security analytics (SIEM-like).
OpenSearch Serverless scales automatically, with no cluster sizing.
OpenSearch Dashboards (the successor to Kibana) for visualization.
Ingest via Kinesis Data Firehose or a CloudWatch Logs subscription filter.
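
For the full-text search use case, a request body in the OpenSearch query DSL might look like the sketch below. It only builds and prints the JSON; the `message` field name and the `build_log_search` helper are hypothetical, and sending the body to a domain's `_search` endpoint (with authentication) is left out.

```python
import json


def build_log_search(text: str, field: str = "message", size: int = 10) -> dict:
    """Build a `match` query body for free-text search over one field.

    `field` is an assumed name for wherever the log line is stored.
    """
    return {
        "size": size,  # maximum number of hits to return
        "query": {
            "match": {
                field: {"query": text}
            }
        },
    }


body = build_log_search("connection timeout", size=5)
print(json.dumps(body, indent=2))
```

A `match` query analyzes the search text and scores documents by relevance, which is the capability that separates OpenSearch from running SQL `LIKE` filters in Athena.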

QuickSight

Managed BI / dashboards with SPICE in-memory engine.
Connects to S3 / Athena / Redshift / RDS / SaaS sources.
Per-user / per-session pricing.
QuickSight ML Insights for anomaly detection.

Lake Formation

Layer above Glue + S3 to centralize fine-grained data lake permissions (table, column, row level) for analytics services.

Pick-the-service cheatsheet

You want	Use
Ad-hoc SQL on S3, serverless	Athena
Petabyte BI warehouse	Redshift
Heavy Spark / Hadoop jobs	EMR
ETL with managed Spark, schema discovery	Glue
Log search / SIEM / full-text	OpenSearch
Dashboards & BI on top	QuickSight
Cross-source SQL federation	Athena Federated Query or Redshift Federated Query
Data lake fine-grained access control	Lake Formation

Common exam scenarios

"Query S3 logs ad-hoc, low ops, low cost" → Athena with Parquet + partitioning.
"Build ETL pipeline from S3 → Redshift, no infra" → Glue.
"Need to run a 12-hour Spark job" → EMR (Spot for cost).
"Search through millions of logs by free text" → OpenSearch.
"Self-service dashboards for finance team" → QuickSight.
"Stream clickstream into near-real-time analytics dashboards" → Kinesis → Firehose → S3/Redshift → QuickSight (or OpenSearch).
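
The first scenario (Athena with Parquet + partitioning) can be sketched as a Hive-style S3 layout plus a query that filters on the partition columns so only one day of data is scanned. The bucket, table, and `status` column are hypothetical, and the partition columns are assumed to be strings.

```python
from datetime import date


def partition_prefix(bucket: str, table: str, day: date) -> str:
    """Hive-style S3 prefix for one day's partition (year=/month=/day=)."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
    )


def daily_status_query(table: str, day: date) -> str:
    """SQL that restricts the scan to one partition via the partition columns."""
    return (
        f"SELECT status, COUNT(*) AS hits FROM {table} "
        f"WHERE year = '{day.year}' "
        f"AND month = '{day.month:02d}' "
        f"AND day = '{day.day:02d}' "
        f"GROUP BY status"
    )


day = date(2024, 3, 5)
print(partition_prefix("example-log-bucket", "access_logs", day))
print(daily_status_query("access_logs", day))
```

Without the `WHERE` clause on the partition columns, Athena would scan every partition, and the per-TB charge would apply to the whole table.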

Exam tip

Athena vs Redshift: ad-hoc and cheap = Athena; complex aggregations with BI tools querying constantly = Redshift. Glue is the glue (catalog + ETL); EMR is for custom Spark/Hadoop.
