Required Knowledge To Pass AWS Certified Data Analytics Specialty Exam
This article provides details on the AWS Data Analytics services knowledge required to pass the AWS Certified Data Analytics – Specialty certification.
Disclaimer: All the views and opinions expressed in the blog belong solely to the author and not necessarily to the author's employer or any other group or individual. This article is not a promotion for any course or training platform. The sole objective of this article is to help the AWS community to successfully pass this difficult exam. Also, this article is based on my exam experience, which may differ from any other individual's exam experience.
I passed the AWS Certified Data Analytics – Specialty exam in October 2022. With this article, I would like to share my experience and the preparations I took to pass this certification exam. I don't want to share the details that you can get from the AWS certification page; rather, I would share the topics that you would need to know to pass the exam and the type of questions that you can expect during the exam.
The Courses and Practice Exams That I Took for This Exam
You can follow any course and take any practice exam that covers all the topics in the AWS Certification Exam Guide. Below is just the list of courses and practice exams that I took during my preparation. But feel free to use any other courses.
- I took the AWS Certified Data Analytics Specialty 2022 - Hands On! by Stephane Maarek and Frank Kane on Udemy.
- I took several practice exams. The explanations cleared a lot of doubts. Below is the list of practice exams that I took. You may not need to take all of them, but I highly recommend taking at least one before the exam and assessing your knowledge.
- Practice Exams | AWS Certified Data Analytics Specialty by Stephane Maarek and Abhishek Singh
- AWS Certified Data Analytics Specialty Practice Exams by Jon Bonso
- AWS Certified Data Analytics - Specialty (DAS-C01) Certification Preparation for AWS by Stephen Cole (I completed only the knowledge checks and the exam prep parts)
- I also completed the free, self-paced digital training provided by AWS Skill Builder. The AWS content provides a good comparison between services.
- I prepared personal notes from the courses and the practice exams and referred to them often. I highly recommend keeping simple notes on the details you are less familiar with, so that you can review them anytime, especially right before the exam.
Required AWS Service Knowledge To Pass the Exam
In general, you need to know the different services mentioned in the exam guide at a high level, their use cases, and their differences. You will be asked to choose a service based on the requirement in the question. Know the producers and consumers for each service, and be sure to look for keywords such as "cost-effective," "less management," "guaranteed delivery," "highly available," etc. In the sections below, I have consolidated the important services based on my exam experience.
Domain 1: Collection
Kinesis Family
- You need to know Kinesis Data Streams (KDS) and Kinesis Data Firehose (KDF) well. Kinesis appears in many questions and answer options. Understand the inner workings, security, scaling, and the different producers and consumers for both. Know what Enhanced Fan-Out is for KDS. The key features of KDS are real-time delivery, retention, replay, and guaranteed ordering (within a shard). KDF, in contrast, is near real-time (minimum 60-second delay), is serverless, can transform data, can convert formats, and batches records, but it offers no retention.
- Have a high-level understanding of Kinesis Data Analytics (KDA) and know that it can use RANDOM_CUT_FOREST for anomaly detection on streaming data.
- Expect questions on the CloudWatch Logs subscription using KDS, KDF, and Lambda.
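To make the KDS scaling discussion concrete, here is a minimal sketch of a shard estimate for a provisioned-mode stream. It assumes the commonly documented per-shard limits (1 MB/s or 1,000 records/s for writes, 2 MB/s for shared reads); the function name `required_shards` and its parameters are hypothetical, made up for this illustration.

```python
import math

# Per-shard limits for a provisioned Kinesis Data Stream (assumed values;
# check the current service quotas before relying on them).
WRITE_MB_PER_SEC = 1.0         # ingest limit per shard
WRITE_RECORDS_PER_SEC = 1000   # ingest record-count limit per shard
READ_MB_PER_SEC = 2.0          # read limit per shard, shared by standard consumers


def required_shards(records_per_sec, avg_record_kb, consumers=1, enhanced_fan_out=False):
    """Estimate how many shards a workload needs (hypothetical helper)."""
    ingest_mb = records_per_sec * avg_record_kb / 1024
    by_write_bytes = math.ceil(ingest_mb / WRITE_MB_PER_SEC)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    # Standard consumers share the 2 MB/s per shard; with Enhanced Fan-Out
    # every consumer gets its own 2 MB/s per shard.
    readers = 1 if enhanced_fan_out else consumers
    by_read = math.ceil(ingest_mb * readers / READ_MB_PER_SEC)
    return max(1, by_write_bytes, by_write_count, by_read)


print(required_shards(5000, 1))                                      # one consumer
print(required_shards(5000, 1, consumers=3))                         # three shared consumers
print(required_shards(5000, 1, consumers=3, enhanced_fan_out=True))  # three EFO consumers
```

With several consumers on one stream, the shared read limit can dominate the estimate, which is the scenario where Enhanced Fan-Out tends to be the expected answer.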
Other AWS Collection services
- You need to know SQS (Simple Queue Service), MSK (Managed Streaming for Apache Kafka, high level), Database Migration Service (DMS, high level), SNS (Simple Notification Service, high level), and AWS Snow Family (high level) among other collection services. Know the differences and use cases for KDS vs. KDF vs. SQS vs. MSK (you will be asked to pick the correct collection service based on the question scenario).
Domain 2: Storage and Data Management
Expect most of the questions in this section to be on S3 (you need to know S3, VPC, DynamoDB, and EC2 to pass any AWS exam) and Redshift. You need to know both of these services very well to score high in this section. Have a good understanding of operational data stores (RDS, DynamoDB, ElastiCache, Neptune), which typically have the following characteristics: data stored in a row-based format, smaller compute size, low latency, high throughput, high concurrency, and high change velocity. Analytic data stores (S3, Redshift), by contrast, commonly store data in a columnar format, hold large, partitioned datasets, use large compute sizes, regularly perform complex joins and aggregations, rely on bulk loading, and have low change velocity.
S3
- You need to know S3 well. Understand the S3 storage classes, how Lifecycle rules can move objects between storage classes to save cost, replication, versioning, and S3 security (bucket policy, different encryption mechanisms).
- You need to know S3 Select (you can retrieve only a portion of the data, a faster and cheaper option) and Glacier Select (you can query archived uncompressed CSV files, the easiest, fastest, management-free option). Also, know that by using the Range HTTP header in an S3 GET Object request, you can fetch a byte range from an object rather than retrieve the whole object.
- Expect questions on S3 Select, Glacier Select, S3 security, S3 Storage classes, and Lifecycle rules.
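As a rough illustration of the Lifecycle rules and byte-range fetches mentioned above, the sketch below builds a rule in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts, plus a `Range` header value. The helper names, the `logs/` prefix, and the day thresholds are assumptions chosen for the example, not recommendations.

```python
def lifecycle_rule(prefix, ia_days=30, glacier_days=90, expire_days=365):
    """Build one Lifecycle rule dict (hypothetical helper, illustrative defaults)."""
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Move objects to cheaper storage classes as they age.
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        # Delete objects once they are no longer needed.
        "Expiration": {"Days": expire_days},
    }


def byte_range(start, end):
    """Value for the HTTP Range header, using inclusive byte offsets."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return f"bytes={start}-{end}"


config = {"Rules": [lifecycle_rule("logs/")]}
print(config["Rules"][0]["Transitions"])
print(byte_range(0, 1023))
```

With boto3 installed and credentials configured, `config` could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`, and the range string as the `Range` argument of `get_object`; the bucket and key would be your own.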
DynamoDB
- You need to know the basics of DynamoDB well (LSI, GSI, RCU, WCU, Streams, DAX). Know that DynamoDB can be a source for a Glue crawler, and Apache Hive in EMR can query and join multiple DynamoDB tables. Know that each KCL application must use its own DynamoDB table.
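Capacity-unit arithmetic is easy to practice with a few lines of code. The sketch below assumes the standard provisioned-mode rules (one RCU covers one strongly consistent read per second of up to 4 KB, eventually consistent reads cost half, and one WCU covers one write per second of up to 1 KB); the function names are hypothetical.

```python
import math


def read_capacity_units(reads_per_sec, item_kb, strongly_consistent=True):
    """RCUs needed: item size is rounded up to the next 4 KB per read."""
    units_per_read = math.ceil(item_kb / 4)
    total = reads_per_sec * units_per_read
    # Eventually consistent reads consume half as much capacity.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(writes_per_sec, item_kb):
    """WCUs needed: item size is rounded up to the next 1 KB per write."""
    return writes_per_sec * math.ceil(item_kb)


print(read_capacity_units(10, 6))                             # 20
print(read_capacity_units(10, 6, strongly_consistent=False))  # 10
print(write_capacity_units(10, 2.5))                          # 30
```

Transactional reads and writes, which cost double, are left out of this sketch.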
ElastiCache
- Have an overview of ElastiCache. Understand the two engine options (Redis and Memcached) and the use cases (caching, chat/messaging, gaming leaderboards, geospatial, session store, etc.). Know that in-memory data stores like Redis or Memcached can be used to store transient data for fast retrieval.
Redshift
- Expect a lot of questions on Redshift. You need to know Redshift architecture, best practices, different node types, Redshift Spectrum, distribution styles, replication and backup (including copying snapshots to another region), Redshift cluster scaling, importing and exporting data to and from a Redshift cluster, Redshift Workload Management (WLM), Concurrency Scaling, Short Query Acceleration (SQA), Elastic resize vs. Classic resize, the VACUUM command (recovers space from deleted rows), AQUA (Advanced Query Accelerator), Redshift security (integration with HSM, audit logging, encryption, etc.), and Redshift Serverless. Know that Redshift performs much better than Athena for complex analytical queries. Also know how Redshift integrates with QuickSight (especially when the two are in different regions) and how cluster high availability works (especially because a Redshift cluster resides within a single AZ).
Domain 3: Processing
This section focuses on EMR (Elastic MapReduce) and Glue. You should already know Lambda (know the Lambda integration with S3 events, Kinesis Data Firehose, Kinesis Data Analytics, CloudWatch Logs subscriptions, SQS, and SNS).
EMR
- Expect a lot of questions on EMR. You need to know EMR architecture, best practices, different node types, the different Hadoop tools supported in EMR, HDFS vs. EMRFS, EMR Automatic Scaling vs. EMR Managed Scaling, EMR Serverless (high level), EMR security (at-rest encryption for EMRFS and local disks, in-transit encryption, cluster audit logs), S3DistCp (copies data between S3 and HDFS), high-availability configuration, and bootstrapping an EMR cluster. Know that Spot instances are a great choice for Task nodes.
Glue
- Expect a lot of questions on Glue as well. You need to know Glue Crawler, Glue ETL, Glue Catalog (including Classifiers and the difference from Hive Metastore), and Job bookmarks. Understand the different systems Glue can connect to, the out-of-the-box transformation functions available with Glue ETL, and that a Glue crawler can be scheduled at intervals of no less than 5 minutes.
- Understand the difference between Glue and EMR (when to use which). Glue is the serverless Spark platform, primarily used for batch-oriented ETL workloads with zero maintenance and operational overhead. AWS Glue would cost less than long-standing infrastructure such as an EMR cluster. EMR, on the other hand, provides lower-level access to the Hadoop environment and greater flexibility in using tools beyond Spark.
Domain 4: Analysis and Visualization
This section focuses on Athena and QuickSight with a little bit of Amazon Elasticsearch Service (Amazon ES, now OpenSearch). The exam expects you to know the use cases for choosing the correct service.
Athena
- Understand what Athena is, what it's used for, the supported file formats, and Athena security. Expect questions on Athena Workgroups, cross-region queries using Athena, per-query limits, per-workgroup limits, and query cost reduction (e.g., columnar data formats, partitioning, compression, etc.). Consider Athena for management-free, interactive, ad-hoc queries on data sitting in a data lake (e.g., S3).
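To see why columnar formats, partitioning, and compression reduce Athena cost, recall that Athena bills by the amount of data scanned. The sketch below is a back-of-the-envelope estimate only: the price of 5 USD per TB is an assumption (pricing varies by region and over time), as are the function name `athena_query_cost` and the example data sizes. It follows the published billing rule of rounding up to the nearest megabyte with a 10 MB minimum per query.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def athena_query_cost(bytes_scanned, usd_per_tb=5.0, min_mb=10):
    """Estimate the cost of one query (hypothetical helper, assumed price)."""
    # Each query is billed for at least `min_mb` megabytes...
    billed = max(bytes_scanned, min_mb * MB)
    # ...and the scanned bytes are rounded up to the nearest megabyte.
    billed = math.ceil(billed / MB) * MB
    return billed / TB * usd_per_tb


full_scan = athena_query_cost(TB)     # e.g. scanning 1 TB of raw CSV
pruned = athena_query_cost(TB // 8)   # e.g. columnar + partition pruning reads 1/8
print(full_scan, pruned)
```

The same query becomes cheaper purely because fewer bytes are read, which is the reasoning behind most cost-reduction questions on Athena.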
QuickSight
- Know the data sources and file formats supported by QuickSight. Expect questions on QuickSight visual types (which one to choose based on the scenario), QuickSight security (MFA, VPC connectivity, row-level security, column-level security, authentication, especially using Active Directory), connectivity with a Redshift cluster in a different region, and the difference between the QuickSight Standard and Enterprise editions.
Domain 5: Security
I have mostly covered the security part while covering the services in the above sections. The exam expects you to know basic security services in the AWS platform, e.g., IAM, VPC, VPC Endpoints, KMS, Federation, CloudTrail, and HSM (Redshift cluster integration). Expect encryption at rest, network access, and authentication questions on S3, QuickSight, Redshift, EMR, and Athena.