
Design a Partition Strategy – Data Sources and Ingestion

A core objective of your data analytics solution is to have queries return results within an acceptable amount of time. If the dataset on which a query executes is huge, you might experience unacceptable latency. In this context, “dataset” refers to a single database table or file. As the volume of data increases, you must find a way to make queries more performant, and that way is partitioning. Consider, for example, a single file that is 60 GB. When you run a query against that file, its entire contents must be parsed. Latency would likely be lower if the 60 GB file were broken into two 30 GB files instead. The same goes for a data table with 60,000,000 rows: a query against a 1,000,000‐row partition will perform faster than one against the full 60,000,000‐row table. This, of course, assumes you can effectively and logically split the data into multiple files or partitions. To summarize, the more data you query at once, the longer the query takes.
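
To make this concrete, the following is a minimal PySpark sketch of splitting one large file into smaller, logically partitioned files so that a query scans only the partitions it needs. The storage paths and the READING_DATE column are hypothetical placeholders, not part of the text above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").getOrCreate()

# Read the single large file (imagine it is roughly 60 GB).
df = spark.read.option("header", "true").csv(
    "abfss://data@myaccount.dfs.core.windows.net/raw/readings.csv"
)

# Write the data back out partitioned by a logical column (here, a date column),
# so a query that filters on that column only scans the matching folders.
(df.write
   .mode("overwrite")
   .partitionBy("READING_DATE")
   .parquet("abfss://data@myaccount.dfs.core.windows.net/curated/readings/"))

# A downstream query that filters on the partition column now reads only a
# fraction of the original file instead of parsing all of it.
subset = spark.read.parquet(
    "abfss://data@myaccount.dfs.core.windows.net/curated/readings/"
).filter("READING_DATE = '2022-04-01'")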

Your partitioning strategy should also consider how the data is distributed across your SQL pool or Spark pool nodes. Partitions themselves can become too big for the selected node size, which has limits on CPU and memory. If you have only a single user or a limited number of scheduled parallel pipeline executions, then a small Spark pool node with 32 GB of memory may be enough. However, if you run numerous jobs in parallel on big datasets, that amount of memory will not suffice; you would need either to optimize the distribution configuration or to choose a larger Spark pool node size. Note that data retrieved by a query is loaded into memory, so if you were to perform a SELECT * on a 60 GB file, your Spark pool node would run out of memory and throw an exception. Table 3.9 includes more information about Spark pool node sizes. The point is that you need to structure your files and tables so that your datasets fit into memory, while also considering the possibility of queries running in parallel on the same node.
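
As an illustration of keeping a query’s working set within node memory, here is a hedged PySpark sketch that avoids the SELECT *–style pattern by pruning columns, filtering on the partition column, and aggregating on the executors. The dataset path and the column names (ELECTRODE, VALUE, READING_DATE) are assumptions carried over from the previous sketch.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memory-aware-query").getOrCreate()

# Hypothetical partitioned dataset from the previous sketch.
readings = spark.read.parquet(
    "abfss://data@myaccount.dfs.core.windows.net/curated/readings/"
)

# Risky on a small node: pulling every row of a ~60 GB dataset to the driver
# is the kind of operation that exhausts memory and throws an exception.
# all_rows = readings.collect()

# Safer: prune columns, filter on the partition column, and aggregate so that
# only a small result set comes back.
summary = (readings
           .select("ELECTRODE", "VALUE", "READING_DATE")
           .filter("READING_DATE = '2022-04-01'")
           .groupBy("ELECTRODE")
           .avg("VALUE"))

summary.show()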

When partitioning, it is important to know about the law of 60, which applies to Azure Synapse Analytics dedicated SQL pools. Figure 3.11 illustrates how the chosen performance level determines the maximum number of compute nodes the pool can scale out to. A performance level of DW1000c scales out to a maximum of two nodes with 30 distributions on each node, for a total of 60 distributions. The largest performance level is currently DW30000c, which can scale out to 60 nodes with a single distribution on each node, again totaling 60 distributions. Each time the platform determines that you need an additional node, the distributions assigned to your tables are redistributed across the nodes in the ratio bound to the selected performance level.
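
The following small Python sketch simply restates the law-of-60 arithmetic for the two performance levels mentioned above; the node counts come from the text, and other levels follow the same nodes × distributions-per-node = 60 pattern.

# The "law of 60": a dedicated SQL pool always has exactly 60 distributions,
# spread evenly across however many compute nodes the performance level provides.
TOTAL_DISTRIBUTIONS = 60

nodes_per_performance_level = {
    "DW1000c": 2,    # 2 nodes x 30 distributions each
    "DW30000c": 60,  # 60 nodes x 1 distribution each
}

for level, nodes in nodes_per_performance_level.items():
    per_node = TOTAL_DISTRIBUTIONS // nodes
    print(f"{level}: {nodes} node(s) x {per_node} distribution(s) = "
          f"{nodes * per_node} total distributions")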

FIGURE 3.11 Configuring a new dedicated SQL pool

If your table has 60,000,000 rows and two nodes, then 30,000,000 rows will be placed on each node, assuming you are using the default distribution method, round‐robin. Each distribution should contain more than 1,000,000 rows to benefit from compressed columnstore indexing. This is where the law of 60 comes into play again: because there are always 60 distributions, your table needs at least 60,000,000 rows for every distribution to meet the 1,000,000‐row threshold for columnstore compression.
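
A quick Python check of that arithmetic, assuming the fixed 60 distributions and the 1,000,000-rows-per-distribution guideline described above, shows why a 60,000,000-row table just meets the threshold while a smaller one does not.

# Round-robin spreads rows evenly across all 60 distributions, so the rows per
# distribution (not per node) determine whether columnstore compression pays off.
TOTAL_DISTRIBUTIONS = 60
MIN_ROWS_FOR_COLUMNSTORE = 1_000_000  # guideline per distribution

def rows_per_distribution(table_rows: int) -> int:
    return table_rows // TOTAL_DISTRIBUTIONS

for table_rows in (60_000_000, 30_000_000):
    per_dist = rows_per_distribution(table_rows)
    ok = per_dist >= MIN_ROWS_FOR_COLUMNSTORE
    print(f"{table_rows:,} rows -> {per_dist:,} per distribution "
          f"({'meets' if ok else 'below'} the 1,000,000-row guideline)")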
