☁︎SAA-C03

Data Analytics

Data Analytics — Concept (Athena, Glue, EMR, OpenSearch, QuickSight)

A bundle of services that surround the data lake / warehouse story. The exam mostly tests when to pick each.

Athena

  • Serverless SQL over data in S3 (CSV, JSON, ORC, Parquet, Avro).
  • Pay per TB scanned ($5 / TB roughly).
  • Uses Glue Data Catalog for schema.
  • Athena Federated Query → SQL across RDS, DynamoDB, Redshift, etc.
  • Best practices: columnar (Parquet/ORC), compress (snappy/zstd), partition (year/month/day), use projections.
  • No infra to manage — great for ad-hoc analytics, log queries (CloudFront logs, VPC Flow Logs).

Glue

  • Managed ETL service.
  • Glue Data Catalog = central metadata (used by Athena, Redshift Spectrum, EMR, Lake Formation).
  • Crawlers auto-discover schemas in S3 / JDBC.
  • Glue Jobs = Spark- or Python-based ETL (Glue Studio = visual editor).
  • Glue DataBrew = no-code data wrangling.
  • Glue Streaming ETL for Kinesis / MSK.
  • Lake Formation uses Glue under the hood for access control.

EMR (Elastic MapReduce)

  • Managed Hadoop / Spark / Hive / Presto / HBase / Flink clusters.
  • For heavy custom big-data jobs, ML pipelines, large transformations.
  • Run on EC2 (with Spot for cost), or EMR Serverless, or EMR on EKS.
  • Best for teams that already use Spark/Hive.

OpenSearch (formerly Elasticsearch)

  • Managed search & analytics on JSON documents.
  • Use cases: log analytics, full-text search, security analytics (SIEM-like).
  • OpenSearch Serverless scales without sizing.
  • Kibana / OpenSearch Dashboards for visualization.
  • Integrates with Kinesis Firehose → OpenSearch and CloudWatch Logs subscription.

QuickSight

  • Managed BI / dashboards with SPICE in-memory engine.
  • Connects to S3 / Athena / Redshift / RDS / SaaS sources.
  • Per-user / per-session pricing.
  • QuickSight ML Insights for anomaly detection.

Lake Formation

  • Layer above Glue + S3 to centralize fine-grained data lake permissions (table, column, row level) for analytics services.

Pick-the-service cheatsheet

You wantUse
Ad-hoc SQL on S3, serverlessAthena
Petabyte BI warehouseRedshift
Heavy Spark / Hadoop jobsEMR
ETL with managed Spark, schema discoveryGlue
Log search / SIEM / full-textOpenSearch
Dashboards & BI on topQuickSight
Cross-source SQL federationAthena Federated Query or Redshift Federated Query
Data lake fine-grained access controlLake Formation

Common exam scenarios

  1. "Query S3 logs ad-hoc, low ops, low cost"Athena with Parquet + partitioning.
  2. "Build ETL pipeline from S3 → Redshift, no infra"Glue.
  3. "Need to run a 12-hour Spark job"EMR (Spot for cost).
  4. "Search through millions of logs by free text"OpenSearch.
  5. "Self-service dashboards for finance team"QuickSight.
  6. "Sync clickstream → near-real-time analytics dashboards" → Kinesis → Firehose → S3/Redshift → QuickSight (or OpenSearch).

Exam tip

Athena vs Redshift: ad-hoc & cheap = Athena; complex aggregations + BI tools constantly = Redshift. Glue is the glue (catalog + ETL); EMR is for custom Spark/Hadoop.

References