This quiz covers performance optimization in S3 and Redshift, streaming architectures, and handling schema changes.
How do you optimize S3 performance for high request rates (thousands of PUT/GET per second)?
S3 scales automatically, but spreading objects across distinct key prefixes lets it partition the request load horizontally; each prefix supports roughly 3,500 write and 5,500 read requests per second.
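For illustration, a minimal sketch of spreading keys across distinct prefixes; the `shard-NN/` naming and the 16-way split are assumptions you would tune, not an AWS requirement.

```python
import hashlib

def prefixed_key(base_key: str, num_prefixes: int = 16) -> str:
    """Spread objects across hashed prefixes so S3 can partition request load.

    Each distinct prefix supports roughly 3,500 write and 5,500 read
    requests per second, so 16 prefixes raise the ceiling ~16x.
    """
    # A short, stable hash of the key picks one of N prefixes.
    shard = int(hashlib.md5(base_key.encode()).hexdigest(), 16) % num_prefixes
    return f"shard-{shard:02d}/{base_key}"

print(prefixed_key("logs/2024/05/01/app-42.json"))
# prints something like "shard-07/logs/2024/05/01/app-42.json"
```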
What is the "Small File Problem" in Hadoop/Spark/Athena?
When a dataset consists of thousands of tiny files, query engines spend more time listing files and opening connections than actually reading data. Solution: compact small files into larger ones (e.g., 128 MB or more).
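A minimal PySpark compaction sketch; the bucket paths and the 64-partition target are assumptions you would size to your data volume.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Hypothetical locations for the raw (many tiny JSON files) and compacted zones.
raw_path = "s3://my-data-lake/raw/events/"
compact_path = "s3://my-data-lake/compacted/events/"

df = spark.read.json(raw_path)

# Repartition so each output file lands near the ~128 MB sweet spot,
# then rewrite as Parquet in far fewer, larger files.
(df.repartition(64)
   .write.mode("overwrite")
   .parquet(compact_path))
```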
When should you choose Amazon EMR over AWS Glue?
EMR gives you "root" access to the cluster, ideal for specific Hadoop/Spark tuning or custom binaries.
What is "Partitioning" in the context of Athena/S3?
Partitioning organizes data into key=value prefixes (e.g., year=2024/month=05/) so queries that filter on those columns read only the matching prefixes instead of scanning the full table, dramatically reducing cost and improving speed.
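As a sketch (table, database, and bucket names are hypothetical): the table is partitioned on year/month, and a query filtering on those columns scans only the matching prefixes.

```python
import boto3

athena = boto3.client("athena")

# Hive-style partition folders: s3://my-data-lake/weblogs/year=2024/month=05/...
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
    request_id string,
    status int
)
PARTITIONED BY (year string, month string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/weblogs/'
"""

# Filtering on the partition columns prunes the scan to the matching prefixes only.
statements = [
    ddl,
    "MSCK REPAIR TABLE weblogs",   # register the existing partition folders
    "SELECT status, count(*) FROM weblogs WHERE year = '2024' AND month = '05' GROUP BY status",
]

# Each call is asynchronous; in a real pipeline you would poll
# get_query_execution for completion before submitting the next statement.
for sql in statements:
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
```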
Which distribution style in Redshift optimizes for joins between two large tables?
KEY distribution (DISTKEY on the join column): colocating rows with matching keys on the same node minimizes network shuffle during the join operation.
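A sketch using the Redshift Data API (cluster, database, and user names are hypothetical): both large tables are distributed on the join column, so matching customer_id rows land on the same slice.

```python
import boto3

rsd = boto3.client("redshift-data")

# Both tables share DISTKEY(customer_id); a join on customer_id then
# needs no cross-node redistribution.
statements = [
    """CREATE TABLE orders (
           order_id    BIGINT,
           customer_id BIGINT
       )
       DISTSTYLE KEY
       DISTKEY (customer_id)""",
    """CREATE TABLE payments (
           payment_id  BIGINT,
           customer_id BIGINT
       )
       DISTSTYLE KEY
       DISTKEY (customer_id)""",
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )
```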
How do you handle schema evolution (e.g., adding a new column) in a Parquet-based data lake?
Self-describing formats like Parquet (columnar) and Avro (row-based) carry their schema with the data and handle adding or removing columns gracefully; readers fill missing columns with nulls.
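A short PySpark illustration (the path and column name are hypothetical): older Parquet files without the new column and newer files with it are read together, and the missing values come back as nulls.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Older files lack the new "coupon_code" column; newer files include it.
# mergeSchema reconciles the two layouts into one DataFrame schema.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3://my-data-lake/orders/"))

df.printSchema()   # coupon_code appears, null for rows from the older files
```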
What is the main difference between Amazon QuickSight and Tableau on EC2?
QuickSight is native to AWS and is billed per session/user with no server administration overhead, whereas Tableau on EC2 requires you to license, patch, and scale the server yourself.
How do you securely connect QuickSight to a private RDS instance?
Create a QuickSight VPC connection, which places an elastic network interface inside your VPC so QuickSight can route traffic to the RDS instance's private IP without exposing it to the internet.
Which service is used to orchestrate complex data workflows involving dependencies (e.g., Lambda -> Glue -> Redshift)?
Step Functions provides a state machine to manage retries, parallel branches, and error handling for critical pipelines.
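A minimal state machine sketch; the Lambda, Glue, and IAM names/ARNs are placeholders. The `.sync` Glue integration makes Step Functions wait for the job to finish before moving on.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "PreprocessWithLambda",
    "States": {
        "PreprocessWithLambda": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "clean-and-partition"},
            "Next": "LoadRedshift",
        },
        "LoadRedshift": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-redshift-copy",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="etl-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEtlRole",
)
```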
What is the role of the "Sort Key" in Redshift?
Zone maps allow Redshift to skip blocks that don't fall within the requested Sort Key range, speeding up queries.
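For illustration (cluster and database names are hypothetical), a time-series table sorted on event_time plus a range query that benefits from block skipping.

```python
import boto3

rsd = boto3.client("redshift-data")

# SORTKEY(event_time) keeps blocks ordered by time, so zone maps let
# Redshift skip every block outside the requested day.
statements = [
    """CREATE TABLE events (
           event_id   BIGINT,
           event_time TIMESTAMP,
           payload    VARCHAR(1024)
       )
       SORTKEY (event_time)""",
    """SELECT count(*)
       FROM events
       WHERE event_time >= '2024-05-01' AND event_time < '2024-05-02'""",
]

# execute_statement is asynchronous; in practice, poll describe_statement
# for completion before submitting the dependent query.
for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )
```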
If you need to query logs in S3 but only care about records with "ERROR", how can you avoid scanning the whole file?
S3 Select allows you to retrieve only a subset of data from an object by using simple SQL expressions.
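A boto3 sketch assuming the log is gzipped JSON Lines with a `level` field (the bucket, key, and field name are assumptions): only the matching rows leave S3.

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-log-bucket",
    Key="app/2024/05/01/app.log.gz",
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s WHERE s.level = 'ERROR'",
    InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; only "Records" events carry data.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
```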
What is the key difference between "Stream Processing" and "Batch Processing"?
Stream processing is for low-latency insights; Batch is for comprehensive, high-volume analysis.
How can you ensure PII data is not stored in your clean data lake?
Proactive masking/hashing during the ETL phase is the best practice for data privacy.
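A small masking sketch: a keyed hash keeps the value joinable without being reversible. The pepper value and field names are placeholders.

```python
import hashlib
import hmac

# Placeholder secret: in practice, pull this from Secrets Manager or KMS.
PEPPER = b"replace-with-a-managed-secret"

def mask_email(email: str) -> str:
    """Replace a raw email with a keyed hash so records can still be
    joined on it, but the original value cannot be read back."""
    digest = hmac.new(PEPPER, email.lower().encode(), hashlib.sha256).hexdigest()
    return f"user-{digest[:16]}"

record = {"email": "jane.doe@example.com", "order_total": 42.50}
record["email"] = mask_email(record["email"])
print(record)   # the PII is gone before the record lands in the clean zone
```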
Which Redshift feature allows you to manage concurrent query execution queues?
Workload Management (WLM) allows you to define queues (e.g., "ETL", "Dashboard") and assign memory/concurrency limits to each so one workload doesn't starve the other.
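A sketch of a manual WLM configuration applied through a parameter group; the queue names, concurrency, and memory splits are assumptions you would tune.

```python
import json
import boto3

redshift = boto3.client("redshift")

# Two manual queues: ETL gets more memory per query, dashboards get more slots.
wlm_config = [
    {"query_group": ["etl"], "query_concurrency": 2, "memory_percent_to_use": 60},
    {"query_group": ["dashboard"], "query_concurrency": 10, "memory_percent_to_use": 30},
    {"query_concurrency": 5},  # default queue takes the remaining memory
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-wlm",   # hypothetical parameter group
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```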
What is the benefit of "Columnar Storage" (like Parquet) over Row-based (like CSV)?
For analytics where you often select only 3-4 columns out of 50, columnar storage is vastly more efficient.
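A quick pandas illustration (path and column names are hypothetical): the Parquet reader fetches only the requested columns, while a CSV reader would have to parse every row in full.

```python
import pandas as pd

# Reading straight from S3 needs the s3fs package; a local path works the same way.
needed = ["order_id", "customer_id", "order_total"]

df = pd.read_parquet("s3://my-data-lake/orders/2024/", columns=needed)
print(df.head())
```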
How do you monitor the "lag" in a Kinesis Data Stream consumer?
The GetRecords.IteratorAgeMilliseconds CloudWatch metric (Iterator Age) tells you how far behind (in time) your consumer application is from the tip of the stream.
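A CloudWatch sketch (the stream name is hypothetical) that pulls the maximum iterator age over the last hour; a value that keeps climbing means the consumer is falling behind.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "clickstream"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```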
Which service would you use to catalog metadata from an on-premise JDBC database?
Glue Crawlers can connect to JDBC targets to extract schema information.
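A sketch of a JDBC crawler (the connection, role, database, and path filter are placeholders); the Glue Connection object holds the JDBC URL and credentials.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="onprem-jdbc-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="onprem_metadata",              # catalog database to write schemas into
    Targets={
        "JdbcTargets": [{
            "ConnectionName": "onprem-oracle",   # Glue Connection with JDBC URL + credentials
            "Path": "SALES/%",                   # schema/table filter in the source database
        }]
    },
)

glue.start_crawler(Name="onprem-jdbc-crawler")
```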
What is a common use case for DynamoDB in a data engineering pipeline?
DynamoDB provides fast, predictable read/write performance for state tracking or looking up individual records during processing.
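A small state-tracking sketch (table and attribute names are hypothetical) that lets a pipeline skip files it has already processed.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("pipeline-file-state")   # partition key: file_key

def mark_processed(file_key: str, status: str) -> None:
    """Record where a file is in the pipeline so reruns can skip it."""
    table.put_item(Item={"file_key": file_key, "status": status})

def already_processed(file_key: str) -> bool:
    resp = table.get_item(Key={"file_key": file_key})
    return resp.get("Item", {}).get("status") == "DONE"

mark_processed("raw/2024/05/01/events.json", "DONE")
print(already_processed("raw/2024/05/01/events.json"))   # True
```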
How does Kinesis Data Firehose handle data transformation before loading to S3?
Firehose supports inline Lambda transformation for simple modifications (like parsing logs) before delivery.
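A minimal transformation Lambda in the shape Firehose expects back (recordId, result, base64 data); the enrichment field is just an example.

```python
import base64
import json

def lambda_handler(event, context):
    """Parse each Firehose record, add a field, and return it re-encoded."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "web"                 # example enrichment
        transformed = json.dumps(payload) + "\n"  # newline-delimited for S3
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                       # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode()).decode(),
        })
    return {"records": output}
```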
What is the purpose of "Lifecycle Policies" in S3 for a Data Lake?
Data lakes grow indefinitely; lifecycle policies automatically move aging objects to cheaper storage classes so you don't pay "Standard" prices for data from three years ago.
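A lifecycle-rule sketch for the raw zone; the bucket name, prefix, and day thresholds are assumptions you would set from your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "age-out-raw-zone",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 1095},   # delete after ~3 years
        }]
    },
)
```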