
Partitioning – Data Sources and Ingestion

As discussed in Chapter 2, partitioning is a way to logically structure data. The closer together queried data physically resides, the faster the query returns results. What you learned in Chapter 2 related to PolyBase and CTAS, where you added a PARTITION argument to the WITH clause so that the data was allocated properly across the node pools. In this context, however, the focus is on files and directory structures on ADLS. Just as with the partitions you learned about previously, partitions in this context are created by the directory structures themselves. Instead of a query being constrained to a node that contains all the related data, the query is focused on a directory on ADLS. See the section “Design a Partition Strategy” for more information.
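To make the directory-as-partition idea concrete, here is a minimal sketch of building a date-partitioned path of the kind commonly used on ADLS. The root name, source name, and the year/month/day layout are illustrative assumptions, not a structure mandated by the book.

```python
from datetime import date
from pathlib import Path

def partition_path(root: str, source: str, dt: date) -> Path:
    """Build a date-partitioned directory path, e.g. root/source/year=2024/month=06/day=15.

    A query scoped to one such directory touches only that slice of the data.
    """
    return (
        Path(root)
        / source
        / f"year={dt.year}"
        / f"month={dt.month:02d}"
        / f"day={dt.day:02d}"
    )

# "datalake" and "brainwaves" are hypothetical names for illustration.
p = partition_path("datalake", "brainwaves", date(2024, 6, 15))
print(p.as_posix())  # datalake/brainwaves/year=2024/month=06/day=15
```

A query that needs only June 15, 2024 can then be pointed at that single directory instead of scanning every file under the source.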

Design for Read Operations

It is important to maximize the performance of your Azure data analytics solution for both read and write operations. If you ever must choose, lean toward optimizing read operations. Although you will be ingesting and transforming a lot of data, most of the activity will be reading the transformed data in search of data insights. A primary objective is to make reading the data as performant as possible for your users.

Design for Data Pruning

Consider the word pruning to mean the same as elimination or deletion. Removing unnecessary data from your data lake, from a table, or from a file, or removing an entire file, will improve query performance. The less data there is to query, the less time the query requires. Managing data will also prevent your data lake from becoming a data swamp. Removing unnecessary data means that what remains is relevant and valuable, which is not the case in a data swamp. A data swamp contains large amounts of duplicated and/or irrelevant data, which can negatively influence performance, insights, and costs. Storing excessive, unneeded, or irrelevant data will increase costs, since you pay for storage space. Pruning data is therefore not only a performance‐related activity but also a cost‐reduction and data‐integrity maneuver.

Files generally have some associated metadata that includes a creation and/or update timestamp. You can use this to determine if the content within the file is still valid. Data storage duration can depend on the kind of data. For example, if the data is application or performance logging, it probably doesn’t make sense to keep that for years and years. You might need to save data for some extended timeframe due to governance and compliance reasons, but application logs are not as important as, for example, financial or customer order history. Having your files in the aforementioned directory structure would also be helpful for identifying the age and relevance of your data.

Removing data from a file itself typically requires a parameter that identifies the state of the row or column, perhaps a timestamp. You could query files based on that timestamp, write new files that match the query result, and then purge or archive the old files. Finally, there is also a concept called time to live (TTL) that can be applied to data. This setting is mostly related to datastores like SQL Server, Azure Cosmos DB, and MongoDB, but it is worth knowing that it is available in those scenarios. TTL is a metadata setting that determines how long the associated data is considered relevant.
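The query-then-rewrite approach can be sketched as follows: read a file, keep only the rows whose timestamp column is still within the retention window, and write the survivors to a new file (after which the old file would be purged or archived). The CSV layout and the column name "modified" are assumptions for illustration.

```python
import csv
from datetime import datetime

def prune_rows(src: str, dst: str, cutoff: datetime) -> int:
    """Copy rows from src to dst, keeping only those whose 'modified'
    timestamp is at or after cutoff. Returns the number of rows kept."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        kept = 0
        for row in reader:
            # Rows older than the cutoff are pruned; the rest are rewritten.
            if datetime.fromisoformat(row["modified"]) >= cutoff:
                writer.writerow(row)
                kept += 1
    return kept
```

After verifying the new file, the original would be deleted or moved to cheaper archive storage, completing the prune.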
