Design a Partition Strategy for Files – Data Sources and Ingestion

Having an intuitive directory structure for the ingestion of data is a precursor to implementing the partitioning strategy. You may not know how the received files will be formatted in all scenarios; therefore, analysis and preliminary transformation are often required before any major action like partitioning happens. A directory structure similar to the following is optimal:

{location}/{subject}/{direction}/{yyyy}/{mm}/{dd}/{hh}/*
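For example, a minimal sketch of building such a path at ingestion time might look like the following; the location and subject values are illustrative, and the time components come from the ingestion timestamp:

from datetime import datetime, timezone

# Hypothetical location and subject values; adjust to your own layout.
now = datetime.now(timezone.utc)
ingest_path = '/{loc}/{subj}/in/{t:%Y/%m/%d/%H}/'.format(
    loc='EMEA', subj='brainjammer', t=now)
print(ingest_path)  # e.g., /EMEA/brainjammer/in/2022/01/12/15/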

where a {direction} value of in contains raw data that can then be partitioned using the partitionBy() method. Once the partitioning is completed, the files can be placed in the out directory. The following code snippets partition the data based on columns within the DataFrame:

df.write.partitionBy('SCENARIO').mode('overwrite') \
    .parquet('/EMEA/brainjammer/out/2022/01/12/15/SCENARIO.parquet')
df.write.partitionBy('ELECTRODE').mode('overwrite') \
    .parquet('/EMEA/brainjammer/out/2022/01/12/15/ELECTRODE.parquet')
df.write.partitionBy('FREQUENCY').mode('overwrite') \
    .parquet('/EMEA/brainjammer/out/2022/01/12/15/FREQUENCY.parquet')
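These snippets assume that df already contains the ingested raw data. A minimal sketch of loading it, assuming the files in the in directory are CSVs with a header row (the file format and session name are assumptions; adjust the reader to match your source):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('brainjammer').getOrCreate()

# Read the raw files from the in directory into a single DataFrame.
df = spark.read.option('header', True).csv('/EMEA/brainjammer/in/2022/01/12/15/')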

Figure 3.12 shows the results. When the data was loaded into the DataFrame, there was a single large file. As you can infer from Figure 3.12, the data is now split into many smaller files.

FIGURE 3.12 Partitioning files

The following snippet performs a query on the partitioned data and displays 10 rows:

data = spark.read \
    .parquet('/EMEA/brainjammer/out/2022/01/12/15/ELECTRODE.parquet/ELECTRODE=AF3')
data.show(10)
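Alternatively, you can read the root of the partitioned dataset and filter on the partition column; Spark then prunes the directories that do not match. A minimal sketch, assuming the same output path as above:

data = spark.read.parquet('/EMEA/brainjammer/out/2022/01/12/15/ELECTRODE.parquet')
# Spark prunes to the ELECTRODE=AF3 directory, so only those files are read.
data.where(data.ELECTRODE == 'AF3').show(10)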

Consider the following recommendations:

  • The fewer the files that need parsing, the faster the query completes; you can control how many files are written (see the sketch following this list).
  • Partitioning large data files improves performance by limiting the amount of data searched.
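The number of files written per partition value can be controlled by repartitioning the DataFrame before the write. A minimal sketch, assuming one file per SCENARIO value is the target (the right file count depends on your data volume):

# Repartition by the partition column so each SCENARIO value lands in one file.
df.repartition('SCENARIO') \
    .write.partitionBy('SCENARIO').mode('overwrite') \
    .parquet('/EMEA/brainjammer/out/2022/01/12/15/SCENARIO.parquet')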

Design a Partition Strategy for Analytical Workloads

Before you can design a partition strategy for an analytical workload, you need a description of what an analytical workload is. In your job or school, your workload is probably very high. What do you do to manage your own workload? One method is to make a list of all the work tasks and prioritize them. Then you can identify the amount of time required to complete each task and add them all together, which gives you a point in the future at which all the work will be completed. Is that timeframe acceptable? If not, you need to look a bit further into the details of each task and perhaps find a way to optimize the required actions so that they are completed more quickly.

The same principle applies to an analytical workload. Your data analytics solution, which runs in a pipeline, performs specific tasks. If it does not perform fast enough or needs to be optimized for cost, consider whether any of its tasks would benefit from creating new partitions or optimizing (aka tuning) existing ones. Data can change in format, relevance, and volume, so reviewing existing partitions on a regular basis is beneficial. The following are questions to consider and recommendations for designing analytical workloads. Each has been discussed in previous sections or chapters.

  • Which analytical stack or pool will you use?
  • Is your data from relational, semi-structured, or nonstructured sources?
  • If data storage is file‐based
    • Use Parquet files and partition them.
    • Determine the optimal balance between file size and number of files for your given set of requirements (a measurement sketch follows this list).
  • If data storage is table‐based
    • Keep the law of 60 in scope (a dedicated SQL pool distributes table data across 60 distributions).
    • Confirm that the table distribution type is still the most valid.
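To support the file size versus file count decision, it helps to measure what a partition currently contains. A minimal sketch that tallies the file count and total size per partition folder, assuming the storage is reachable as a local or mounted path (the path is illustrative):

from pathlib import Path

# Hypothetical mounted path to one partitioned dataset.
root = Path('/EMEA/brainjammer/out/2022/01/12/15/SCENARIO.parquet')

# Report the number of Parquet files and their combined size per partition.
for part in sorted(p for p in root.iterdir() if p.is_dir()):
    files = list(part.glob('*.parquet'))
    total_mb = sum(f.stat().st_size for f in files) / (1024 * 1024)
    print(f'{part.name}: {len(files)} files, {total_mb:.1f} MB')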
