Microsoft Certified: Azure Data Engineer Associate Study Guide

Design a Partition Strategy for Efficiency and Performance – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 08/03/2024

Earlier sections on designing a partition strategy discussed both efficiency and performance. An efficient query is one that makes good use of its execution time: it does not wait on data shuffling or scan irrelevant data. The most efficient query would be one […]

Design a Partition Strategy – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 07/23/2024

A core objective of your data analytics solution is to have queries return results within an acceptable amount of time. If the dataset on which a query executes is huge, then you might experience unacceptable latency. In this context, “dataset” refers to a single database table or file. As the volume of data increases, you […]
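The idea of keeping a query away from irrelevant data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `readings` root folder, the year/month/day layout, and the helper names `partition_path` and `prune_partitions` are hypothetical and do not come from the original post.

```python
from datetime import date


def partition_path(root: str, day: date) -> str:
    """Build a year/month/day folder path for one day's data (hypothetical layout)."""
    return f"{root}/{day.year:04d}/{day.month:02d}/{day.day:02d}"


def prune_partitions(paths: list[str], root: str, year: int, month: int) -> list[str]:
    """Keep only the partitions that belong to the requested year and month."""
    prefix = f"{root}/{year:04d}/{month:02d}/"
    return [p for p in paths if p.startswith(prefix)]


# One partition for the first day of each month of 2024 (made-up example data).
all_paths = [partition_path("readings", date(2024, m, 1)) for m in range(1, 13)]

# A query that only needs July touches 1 of the 12 partitions.
july = prune_partitions(all_paths, "readings", 2024, 7)
print(july)
```

The smaller the share of partitions a query has to read, the less data it scans, which is the main reason to align the partition key with the most common filter.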

Design a Folder Structure That Represents the Levels of Data Transformation – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 06/23/2024

Once data is ingested and initially stored in what is commonly referred to as a data landing zone (DLZ), the data will flow through the other Big Data stages. More in-depth detail about the Big Data transformation stage is covered in Part III, “Develop Data Processing.” For now, only the file structure to support data […]

Design for Efficient Querying – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 05/28/2024

You can take numerous steps to optimize the performance and manageability of your files contained on ADLS. The following actions can improve query efficiency: Use this information as a basis for the design of your storage structure. File Size, Type, and Quantity The more data contained within a file, the larger it is and the […]

Recommended File Types for Storage – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 04/07/2024

Chapter 2 introduced the numerous file types and their use cases. If you need a refresher, go back to Chapter 2 to review. The following file formats are used most when working in the Big Data context: This code loads an existing session into a DataFrame and then creates a new DataFrame to contain the […]

Partitioning – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 03/26/2024

As discussed in Chapter 2, partitioning is a way to logically structure data. The closer together the queried data is physically stored, the faster the query returns results. What you learned in Chapter 2 related to PolyBase and CTAS, where you added a PARTITION argument to the WITH clause; therefore, the data was allocated properly across […]

Design a Distribution Strategy – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 02/11/2024

When running your Big Data workloads using Azure Synapse Analytics dedicated SQL pools, how you distribute your data deserves careful consideration. To summarize, distribution is concerned with the way data is loaded onto the numerous nodes (aka compute machines) running your data analytics queries. When you execute a query, the platform chooses a […]
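To make the idea of spreading rows across distributions more concrete, here is a minimal Python sketch of two placement schemes, hash and round-robin. It is an illustration only: a dedicated SQL pool does spread each table over 60 distributions, but the CRC32 hash, the `customer-N` keys, and the function names below are assumptions made for this example and are not the platform's actual algorithm.

```python
import zlib
from collections import Counter

DISTRIBUTIONS = 60  # a dedicated SQL pool spreads each table over 60 distributions


def hash_distribution(key: str) -> int:
    """Place a row by hashing its distribution key (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % DISTRIBUTIONS


def round_robin_distribution(row_number: int) -> int:
    """Place rows evenly in turn, ignoring their content."""
    return row_number % DISTRIBUTIONS


# 1,000 made-up rows that share 100 distinct customer keys.
keys = [f"customer-{i % 100}" for i in range(1000)]

hash_placement = [hash_distribution(k) for k in keys]
rr_placement = [round_robin_distribution(i) for i in range(len(keys))]

# Round-robin is even by construction; hash placement depends on the key values.
rr_counts = Counter(rr_placement)
print(max(rr_counts.values()) - min(rr_counts.values()))
```

With hash placement, rows that share a key always land on the same distribution, which helps joins and aggregations on that key avoid data movement. Round-robin gives an even spread but no such co-location.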

Design a Partition Strategy for Files – Data Sources and Ingestion

Jas Moore Updated on 08/03/2024 01/13/2024

Having an intuitive directory structure for the ingestion of data is a prerequisite to implementing the partitioning strategy. You may not know how the received files will be formatted in all scenarios; therefore, analysis and preliminary transformation are often required before any major action like partitioning happens. A directory structure similar to the following is […]

Create an Azure Data Lake Storage Container – Data Sources and Ingestion-3

Jas Moore Updated on 08/03/2024 12/23/2023

Some Azure products generate costs even when not actively used, whereas others do not. An empty ADLS container does not incur any costs, but one that consumes space does. You should remove resources that are no longer being used. Make sure to perform due diligence when provisioning Azure products, as you will be required to […]

Create an Azure Data Lake Storage Container – Data Sources and Ingestion-2

Jas Moore Updated on 08/03/2024 11/02/2023

The following options are available on the Advanced tab: Beginning with the selections you made during the provisioning of ADLS in Exercise 3.1, start with Enable Hierarchical Namespaces. If you do not select this, instead of getting an ADLS container, you get a general‐purpose v2‐based blob container. As discussed in Chapter 1, blob containers are […]