Design a Partition Strategy for Efficiency and Performance – Data Sources and Ingestion

Both efficiency and performance were discussed in earlier sections pertaining to the design of a partition strategy. An efficient query is one in which the time required to execute it is well used. That means the query should not wait on data shuffling or scan irrelevant data. The most efficient query would be one […]

Design a Partition Strategy – Data Sources and Ingestion

A core objective of your data analytics solution is to have queries return results within an acceptable amount of time. If the dataset on which a query executes is huge, then you might experience unacceptable latency. In this context, “dataset” refers to a single database table or file. As the volume of data increases, you […]

Design a Folder Structure That Represents the Levels of Data Transformation – Data Sources and Ingestion

Once data is ingested and initially stored in what is commonly referred to as a data landing zone (DLZ), the data flows through the other Big Data stages. More in-depth detail about the Big Data transformation stage is covered in Part III, “Develop Data Processing.” For now, only the file structure to support data […]

Design for Efficient Querying – Data Sources and Ingestion

You can take numerous steps to optimize the performance and manageability of your files contained on ADLS. The following actions can improve query efficiency: […] Use this information as a basis for the design of your storage structure.

File Size, Type, and Quantity

The more data contained within a file, the larger it is and the […]

Recommended File Types for Storage – Data Sources and Ingestion

Chapter 2 introduced the numerous file types and their use cases. If you need a refresher, review Chapter 2. The following file formats are used most when working in the Big Data context: […] This code loads an existing session into a DataFrame and then creates a new DataFrame to contain the […]

Partitioning – Data Sources and Ingestion

As discussed in Chapter 2, partitioning is a way to logically structure data. The closer together the queried data is physically stored, the faster the query returns results. What you learned in Chapter 2 related to PolyBase and CTAS, where you added a PARTITION argument to the WITH clause; therefore, the data was allocated properly across […]

Design a Distribution Strategy – Data Sources and Ingestion

When running your Big Data workloads using Azure Synapse Analytics dedicated SQL pools, how you distribute your data deserves careful consideration. To summarize, distribution is concerned with the way data is loaded onto the numerous nodes (aka compute machines) running your data analytics queries. When you execute a query, the platform chooses a […]

Design a Partition Strategy for Files – Data Sources and Ingestion

Having an intuitive directory structure for the ingestion of data is a prerequisite to implementing the partitioning strategy. You may not know how the received files will be formatted in all scenarios; therefore, analysis and preliminary transformation are often required before any major actions like partitioning happen. A directory structure similar to the following is […]