Both efficiency and performance were discussed in earlier sections pertaining to the design of a partition strategy. An efficient query is one that makes good use of its execution time: it should not be waiting on data shuffling or scanning irrelevant data. The most efficient query is therefore one that runs against a table or file containing only the data the query requires. Maintaining a dedicated table or file for every query is not practical at scale, but in principle it represents the ideal, because it means the data has been partitioned to its most optimal point. Recall the earlier example in which the brain wave readings were partitioned on ELECTRODE. If you need all the data for a given electrode, that is a very efficient partition. If, however, you want only the THETA frequency for a given electrode, the query must still scan the entire electrode partition and discard every frequency not equal to THETA.
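The following is a minimal PySpark sketch of that tradeoff; the path and the ELECTRODE and FREQUENCY column names are illustrative assumptions, not exact names from the brainjammer dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Readings written with partitionBy('ELECTRODE'), so each electrode has
# its own folder of Parquet files (path is illustrative).
df = spark.read.parquet('/brainjammer/readings')

# Efficient: the predicate matches the partition column, so Spark reads
# only the ELECTRODE=AF3 folder (partition pruning).
af3 = df.filter(df.ELECTRODE == 'AF3')

# Less efficient: FREQUENCY is not a partition column, so every file in
# the AF3 folder is scanned and non-THETA rows are discarded.
af3_theta = df.filter((df.ELECTRODE == 'AF3') & (df.FREQUENCY == 'THETA'))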
The amount and size of the data parsed when a query runs have a significant impact on performance. Decreasing that size via partitioning has benefits as well as detriments. The benefit is that the data being queried is smaller and therefore handled more efficiently; the files are smaller because each table contains fewer rows. However, if partitioning leaves a table's partitions with too few rows for the clustered columnstore index to compress effectively, you will experience more latency. Too many partitions, in other words, too many files, often results in an increase in latency as well, so again, you need to find the optimal balance for your scenario. On a final note, the compute power you provision for your data analytics can overcome many performance-related problems. If you have funding that allows you to provision larger machines, that might be your best, easiest, and quickest option, because making changes to an existing partition design also has a cost and carries risk, which could be catastrophic.
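As a back-of-the-envelope illustration of that threshold, the following Python sketch assumes the documented figures of 60 distributions per dedicated SQL pool table and roughly 1 million (1,048,576) rows per optimally compressed columnstore row group; the helper function is purely illustrative:

DISTRIBUTIONS = 60              # a dedicated SQL pool spreads every table across 60 distributions
ROWS_PER_ROW_GROUP = 1_048_576  # rows at which a columnstore row group compresses optimally

def min_rows_for_columnstore(partitions):
    # Rows needed so each partition, on each distribution, can fill one row group.
    return partitions * DISTRIBUTIONS * ROWS_PER_ROW_GROUP

print(min_rows_for_columnstore(8))  # 503,316,480 -- roughly half a billion rows

In other words, a table with eight partitions would need on the order of half a billion rows before its clustered columnstore index compresses optimally; with far fewer rows than that, fewer partitions are the better choice.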
Design a Partition Strategy for Azure Synapse Analytics
The reason you partition your data is to improve the efficiency and performance of loading, querying, and deleting it. As previously mentioned, partitioning improves those activities because it reduces the size of the data by grouping similar data together. The smaller the dataset you perform an action on, the faster the action completes, all else being equal. You also know that Azure Synapse Analytics features two pool types: SQL and Spark. In the SQL pool context, it is most common to approach partitioning from a table perspective. The following code snippet is an example of creating a partitioned table:
CREATE TABLE [dbo].[READING]
(
    [READING_ID]   INT NOT NULL IDENTITY(1,1),
    [SCENARIO_ID]  INT NOT NULL,
    [ELECTRODE_ID] INT NOT NULL,
    [FREQUENCY_ID] INT NOT NULL,
    [VALUE]        DECIMAL(7,3) NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([SCENARIO_ID]),
    PARTITION ([SCENARIO_ID] RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8))
);
This code snippet creates a table distributed by hash on SCENARIO_ID and partitioned on the brainjammer scenario identifiers (for example, PlayingGuitar, Meditation, and so on). Although you can create persisted tables on a Spark pool, working with and querying Parquet files is the most common approach and the one most likely to be covered on the exam. As shown in Figure 3.12, you can use the partitionBy() method to partition your files.
df.write.partitionBy('SCENARIO').mode('overwrite') \
    .parquet('/EMEA/brainjammer/out/2022/01/12/15/SCENARIO.parquet')
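To see the effect of partitionBy(), you can read the output back and filter on the partition column. The following is a minimal sketch, assuming a SparkSession named spark, the same output path, and Meditation as one of the scenario values mentioned earlier:

# Spark restores the SCENARIO column from the SCENARIO=<value> folder
# names created by partitionBy() and can prune those folders when a
# query filters on that column.
out = spark.read.parquet('/EMEA/brainjammer/out/2022/01/12/15/SCENARIO.parquet')
meditation = out.filter(out.SCENARIO == 'Meditation')
meditation.show(5)

Because SCENARIO is a partition column, the Meditation filter is satisfied by reading only the files under the SCENARIO=Meditation folder rather than scanning every scenario.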