AWS Data Engineer - Advanced Quiz
This quiz challenges your ability to design complex data pipelines, secure PII, and optimize high-scale analytical workloads.
Explain the benefit of the AWS "Lake House" architecture.
The pattern removes data silos: Redshift can query S3 (via Spectrum) and operational RDS databases (via Federated Query) in a unified manner, without first moving the data.
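A minimal sketch of the Federated Query half of that pattern, issued through the Redshift Data API via boto3; every name, endpoint, and ARN below is a hypothetical placeholder:

```python
# Sketch: register an RDS PostgreSQL database as an external schema in
# Redshift (Federated Query) so it can be joined against local tables.
import boto3

sql = """
CREATE EXTERNAL SCHEMA rds_sales
FROM POSTGRES
DATABASE 'salesdb' SCHEMA 'public'
URI 'sales-db.example.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:sales-db-creds';
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    Database="dev",
    DbUser="admin",
    Sql=sql,
)
```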
How do you implement Change Data Capture (CDC) from an on-premises Oracle database to an S3 Data Lake?
AWS DMS reads the source database's transaction logs (Oracle redo logs) to capture and replicate changes to S3 in near real-time.
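A hedged sketch of such a task with boto3, assuming the Oracle source endpoint, S3 target endpoint, and replication instance already exist; all ARNs and schema names are placeholders:

```python
# Sketch: a DMS task that does an initial full load, then replays ongoing
# redo-log changes into S3.
import json
import boto3

dms = boto3.client("dms")
dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-s3-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:oracle-src",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:s3-tgt",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:inst-1",
    MigrationType="full-load-and-cdc",  # initial load, then ongoing CDC
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "all-sales-tables",
            "object-locator": {"schema-name": "SALES", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```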
You need to deduplicate a high-velocity stream of 1 million events per second with minimal latency. Which probabilistic data structure is most efficient?
Bloom filters offer O(1) membership checks with a very small memory footprint, accepting a tiny false-positive rate in exchange for massive speed.
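A minimal pure-Python sketch of the idea; in practice the bit-array size m and hash count k would be derived from the expected item count and the target false-positive rate:

```python
# Minimal Bloom filter: k hash positions over a fixed bit array.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k_hashes=7):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Derive k positions from two base hashes (double-hashing trick).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("event-123")
assert bf.might_contain("event-123")  # no false negatives once added
print(bf.might_contain("event-999"))  # almost always False (tiny FP rate)
```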
What is the most secure way to grant Redshift access to S3 data for Spectrum queries?
Attach an IAM Role to the cluster; Redshift assumes the role to access the external catalog and S3 data on your behalf, avoiding long-lived access keys.
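For illustration, a sketch that registers a Spectrum external schema backed by the Glue Data Catalog, using the attached role; cluster, database, and role names are placeholders:

```python
# Sketch: create a Spectrum external schema via the Redshift Data API.
import boto3

sql = """
CREATE EXTERNAL SCHEMA spectrum_logs
FROM DATA CATALOG
DATABASE 'logs_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=sql,
)
```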
How can you provide column-level access control for sensitive PII data in your Data Lake?
Lake Formation allows you to define granular permissions (hide "SSN" column) for different users accessing the same Glue table.
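A sketch of that grant with boto3, hiding a hypothetical ssn column from an analyst role; database, table, and role names are placeholders:

```python
# Sketch: grant SELECT on every column except "ssn" to an analyst role.
import boto3

lf = boto3.client("lakeformation")
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "customers_db",
            "Name": "customers",
            "ColumnWildcard": {"ExcludedColumnNames": ["ssn"]},  # hide PII column
        }
    },
    Permissions=["SELECT"],
)
```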
You are designing a real-time dashboard. Aggregations must be calculated every minute. Which tool is best for the processing layer?
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) can process streaming data with windowed aggregations (e.g., a one-minute "Tumbling Window") in real time.
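In the managed service you would express this as Flink SQL; the plain-Python sketch below only illustrates tumbling-window semantics, i.e., each event lands in exactly one fixed, non-overlapping bucket:

```python
# Illustration only: tumbling-window counts in plain Python.
from collections import defaultdict

WINDOW_SECONDS = 60

def tumbling_counts(events):
    """events: iterable of (epoch_seconds, key). Returns per-window counts."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % WINDOW_SECONDS)  # floor to the minute
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "page_view"), (59, "page_view"), (61, "page_view")]
print(tumbling_counts(events))
# {(0, 'page_view'): 2, (60, 'page_view'): 1}
```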
How do you optimize Redshift performance for a table heavily used in joins with another large table?
Define the same DISTKEY on both tables. Colocating join keys on the same node eliminates the network overhead of shuffling data between nodes during the join.
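A sketch of the DDL, run through the Redshift Data API; table, column, and cluster names are hypothetical:

```python
# Sketch: rows sharing a customer_id land on the same slice as the matching
# dimension rows, so the join needs no network shuffle.
import boto3

sql = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (customer_id);
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster", Database="dev", DbUser="admin", Sql=sql
)
```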
What is the "Vacuum" operation in Redshift and why is it critical?
Deleted rows in Redshift are only marked for deletion. Vacuum actually frees the disk space and re-sorts data for optimal scanning.
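For example (cluster and table names are placeholders; SORT ONLY and DELETE ONLY variants exist for lighter maintenance):

```python
# Sketch: reclaim space and re-sort a table after heavy deletes/updates.
import boto3

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql="VACUUM FULL orders;",
)
```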
Which file format supports "Predicate Pushdown" in Athena?
Parquet stores min/max statistics for each column block, allowing Athena/Spectrum to skip entire blocks that don't match the query filter.
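A small pyarrow demonstration of the same mechanism Athena relies on; the file and column names are invented:

```python
# Sketch: row groups whose min/max statistics cannot satisfy the filter are
# skipped entirely, which is predicate pushdown in miniature.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"event_date": ["2024-01-01", "2024-06-01"], "clicks": [10, 20]})
pq.write_table(table, "events.parquet", row_group_size=1)  # one row per row group

# Only row groups overlapping the predicate's range are actually read.
filtered = pq.read_table("events.parquet", filters=[("event_date", ">=", "2024-06-01")])
print(filtered.to_pydict())  # {'event_date': ['2024-06-01'], 'clicks': [20]}
```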
How do you securely share a Glue Data Catalog with another AWS account?
Resource policies allow you to grant cross-account permissions to the metadata store without duplicating data.
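A hedged boto3 sketch of such a policy; both account IDs and the region are placeholders:

```python
# Sketch: allow account 999999999999 to read this account's Glue Data Catalog.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::999999999999:root"},
        "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
        "Resource": "arn:aws:glue:us-east-1:123456789012:*",
    }],
}

boto3.client("glue").put_resource_policy(PolicyInJson=json.dumps(policy))
```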
Your EMR cluster is running slowly due to "skewed data" (one key has 90% of the data). How do you handle this?
Data skew causes one node to work while others wait. Salting breaks the large key into smaller sub-keys to balance the load.
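A PySpark sketch of salting, assuming a fact table skewed on a column named key; the S3 paths and bucket count are placeholders:

```python
# Sketch: spread the hot key across N sub-keys, replicate the dimension side
# with the same N salts, then join on (key, salt) to balance the load.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting").getOrCreate()
N = 10  # number of salt buckets; tune to the severity of the skew

facts = spark.read.parquet("s3://bucket/facts/")  # skewed on "key"
dims = spark.read.parquet("s3://bucket/dims/")

salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
salted_dims = dims.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt")  # replicate each dim row N times
)

joined = salted_facts.join(salted_dims, on=["key", "salt"])
```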
What is "Backpressure" in a streaming pipeline?
Backpressure is what happens when a consumer cannot keep up with the producer's rate. Handling it (e.g., throttling the source, scaling the consumer) is critical to prevent unbounded lag and system collapse.
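A toy pure-Python illustration: a bounded queue blocks the producer when the consumer lags, which is backpressure in miniature:

```python
# Toy illustration: the queue's maxsize is what creates backpressure; an
# unbounded buffer would just grow until the process ran out of memory.
import queue
import threading
import time

buffer = queue.Queue(maxsize=100)

def producer():
    for i in range(1000):
        buffer.put(i)  # blocks (throttles the source) when the buffer is full

def consumer():
    while True:
        buffer.get()
        time.sleep(0.01)  # simulate a slow downstream sink
        buffer.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
buffer.join()
```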
How do you implement "Exactly-Once" processing semantics in Kinesis?
Kinesis guarantees "At Least Once" delivery by default; achieving "Exactly Once" requires application-level idempotency or advanced frameworks (e.g., Flink with transactional sinks).
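One common application-level approach, sketched with a hypothetical DynamoDB table whose partition key is the record's sequence number:

```python
# Sketch: turn at-least-once delivery into effectively-once processing by
# making the write idempotent. Kinesis sequence numbers are unique per record,
# so a conditional put silently rejects replays.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed_events")  # hypothetical table

def process_record(record):
    try:
        table.put_item(
            Item={"seq": record["sequenceNumber"], "payload": record["data"]},
            ConditionExpression="attribute_not_exists(seq)",  # reject duplicates
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise  # a duplicate is fine; anything else is a real error
```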
Which option minimizes the cost of storing petabytes of historical logs that effectively never need to be read unless there is a legal audit?
S3 Glacier Deep Archive is the lowest-cost storage class, with retrieval times of 12 to 48 hours, which is acceptable for a rare legal audit.
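A lifecycle-rule sketch with boto3; the bucket name, prefix, and 30-day threshold are placeholders:

```python
# Sketch: transition objects under logs/ to Deep Archive after 30 days.
import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="corp-audit-logs",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "logs-to-deep-archive",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
        }]
    },
)
```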
How can you speed up a complex Glue ETL job that is running out of memory (OOM)?
Glue lets you select a larger "Worker Type" (e.g., G.2X) to allocate more memory and CPU to each executor.
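A sketch of that change with boto3; the job name, role, script location, and worker count are placeholders:

```python
# Sketch: move a failing job to memory-optimized G.2X workers
# (8 vCPU / 32 GB each, double the memory of G.1X).
import boto3

boto3.client("glue").update_job(
    JobName="nightly-etl",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://bucket/scripts/etl.py"},
        "GlueVersion": "4.0",
        "WorkerType": "G.2X",
        "NumberOfWorkers": 20,
    },
)
```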
What is a "Materialized View" in Redshift?
A materialized view physically stores the precomputed result of its query and can be refreshed as base tables change. Materialized views are ideal for speeding up dashboards that run the same complex aggregation query repeatedly.
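A sketch of one, issued through the Redshift Data API; table, column, and cluster names are hypothetical:

```python
# Sketch: precompute a dashboard aggregation; AUTO REFRESH keeps it current.
import boto3

sql = """
CREATE MATERIALIZED VIEW daily_revenue
AUTO REFRESH YES
AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster", Database="dev", DbUser="admin", Sql=sql
)
```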
How do you integrate on-premise Active Directory users with Amazon QuickSight?
Federate identity via SAML 2.0 (e.g., AD FS or IAM Identity Center), or connect QuickSight Enterprise edition to Active Directory through AWS Directory Service, so users log in with their corporate credentials.
What mechanism allows Kinesis Data Firehose to convert JSON data to Parquet before writing to S3?
Firehose has native record-format conversion (JSON to Parquet/ORC) driven by a Glue table schema, which is more efficient than transforming records with Lambda.
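An abbreviated sketch of the relevant configuration block; all names and ARNs are placeholders, and the output schema is read from a Glue table:

```python
# Sketch: a delivery stream that deserializes JSON and writes Parquet to S3.
import boto3

boto3.client("firehose").create_delivery_stream(
    DeliveryStreamName="json-to-parquet",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
        "BucketARN": "arn:aws:s3:::data-lake-raw",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "DatabaseName": "logs_db",
                "TableName": "events",
                "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
            },
        },
    },
)
```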
You need to list billions of objects in an S3 bucket daily for auditing. ListObjects API is too slow and expensive. What is the solution?
S3 Inventory provides a flat file listing of your objects, which you can then query with Athena essentially for free (compared to API costs).
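A boto3 sketch of enabling a daily Parquet inventory that an Athena table can then be pointed at; bucket names and the field list are placeholders:

```python
# Sketch: emit a daily Parquet inventory of the bucket to a reporting bucket.
import boto3

boto3.client("s3").put_bucket_inventory_configuration(
    Bucket="corp-data-lake",
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::inventory-reports",
                "Format": "Parquet",
                "Prefix": "corp-data-lake",
            }
        },
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass"],
    },
)
```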
Which scenario warrants using Redshift RA3 nodes (Managed Storage)?
RA3 nodes decouple storage from compute: data lives in S3-backed Redshift Managed Storage, so you can hold petabytes without adding compute nodes just for disk capacity.