Design a Distribution Strategy – Data Sources and Ingestion

When running Big Data workloads on Azure Synapse Analytics dedicated SQL pools, how you distribute your data deserves careful consideration. In short, distribution determines how data is spread across the many compute nodes that run your data analytics queries. When you execute a query, the platform chooses a node to perform the execution. That node may or may not locally contain the data required to complete the query. If it does, you can expect a faster query time than if the node must pull the data from another location before executing the query. There are three distribution types in this context:

  • Round‐robin
  • Hash
  • Replicated

Round‐robin is the default distribution type. Data is spread randomly across the 60 underlying distributions, which are in turn mapped to the compute nodes (up to 60 of them). In a scenario where you need to frequently load or refresh data, round‐robin is a good choice. A hash distribution type spreads data across the distributions based on an identified key. For example, if you are analyzing brain waves, you can place all readings with ELECTRODE_ID = 1 onto a specific distribution. All queries that request this selection of data are routed to the node that contains it. Consider that there are five electrodes. Adding the snippet DISTRIBUTION = HASH([ELECTRODE_ID]) to the CTAS statement, as in the example that follows, results in each electrode's readings being stored together on the distribution that its hash value maps to. Use hash distribution if your dataset is large enough to warrant having dedicated SQL pool compute respond to those kinds of queries. A replicated distribution type copies all the data in the identified table to every provisioned node that is running and responding to queries. This distribution type is best for small datasets (less than 2 GB).
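The following CTAS sketch shows how a hash distribution is declared. The table and column names (BRAINWAVES, BRAINWAVES_STAGING, ELECTRODE_ID, and so on) are illustrative assumptions rather than a prescribed schema.

-- Minimal CTAS sketch for a hash-distributed table in a dedicated SQL pool.
-- Table and column names are illustrative; adjust them to your own schema.
CREATE TABLE [dbo].[BRAINWAVES]
WITH
(
    DISTRIBUTION = HASH([ELECTRODE_ID]),   -- co-locate each electrode's rows
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT [READING_ID], [ELECTRODE_ID], [FREQUENCY], [READING_VALUE]
FROM   [dbo].[BRAINWAVES_STAGING];

Changing the DISTRIBUTION clause to ROUND_ROBIN or REPLICATE produces the other two distribution types; the rest of the statement stays the same.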

Design a Data Archiving Solution

Make sure to differentiate between purging (deleting) and archiving data. The former means you completely remove the data from any existing datastore within your control. Purging is becoming more important from the perspective of security, privacy, governance, and compliance. There is simply no longer any reason to persist collected data permanently; it should be completely deleted at some point. (See Part IV, “Secure, Monitor, and Optimize Data Storage and Data Processing,” for more information.) Archiving data means moving it out of the production environment and into a less frequently accessed datastore because it is no longer relevant or needed. Changes in business objectives, the age of the data, changes in regulations, or changes in customer behavior can make the data collected over a given period unnecessary. When that happens, you might not want to delete the data immediately but rather keep it for some time before making that decision. In all cases, however, you would want to remove the irrelevant data from the tables or directories so that it no longer influences your data analytics and business insight gathering activities. Doing so also reduces storage costs.

Azure storage accounts offer lifecycle management policies. A lifecycle management policy enables you to automatically move data between the different blob storage access tiers, such as Hot, Cool, and Archive. You can configure a rule to move data from Hot to Cool or Archive, or from Cool to Archive, or even to delete the data. The decision point for when the access tier changes can be based on days from creation date or time since last modification. You can also manually change the access tier from the Azure portal, by selecting the file and clicking the Change Tier button, as shown in Figure 3.10.

FIGURE 3.10 Changing the access tier of files in an ADLS container
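For context, a lifecycle management policy is expressed as a set of JSON rules on the storage account. The following is a minimal sketch, with an assumed rule name and illustrative day thresholds, that moves blobs to Cool after 30 days without modification, to Archive after 90 days, and deletes them after a year.

{
  "rules": [
    {
      "enabled": true,
      "name": "age-out-raw-data",
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete": { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}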

If your data is stored in folders named for the date and time ({yyyy}/{mm}/{dd}/{hh}), you could write an automated batch process to remove data from certain folders based on age. When working with data stored in tables, the approach is a bit different. Files have metadata, such as created and updated dates, whereas rows and columns do not expose that metadata as readily. If you need to implement archiving on tables, consider adding a column that identifies when the data was last accessed, updated, and/or originally inserted. You can use those column values as the basis for identifying relevant and irrelevant data. If the data is deemed irrelevant, due to age or access, it can either be moved to a different schema or deleted, as shown in the sketch that follows. You could make such a data archiving process part of a pipeline operation, which you will create in the next chapter. You might also consider creating a copy of the table and appending the word archive to its name.
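As a sketch of what such an archiving step could look like in T-SQL, the following statements assume an archive schema already exists and that a LAST_ACCESSED_DATETIME column is maintained on the table; all object names are illustrative.

-- Copy rows that have not been accessed for two years into an archive table,
-- then remove them from the production table. Object names are illustrative.
CREATE TABLE [archive].[BRAINWAVES]
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT *
FROM   [dbo].[BRAINWAVES]
WHERE  [LAST_ACCESSED_DATETIME] < DATEADD(YEAR, -2, GETDATE());

DELETE FROM [dbo].[BRAINWAVES]
WHERE  [LAST_ACCESSED_DATETIME] < DATEADD(YEAR, -2, GETDATE());

Scheduling these statements as activities in a pipeline, as described in the next chapter, keeps the production table lean without permanently losing the data.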
