You can take numerous steps to optimize the performance and manageability of your files contained on ADLS. The following actions can improve query efficiency:
- File size, type, and quantity
- Directory structure
- Partitioning
- Designing for read operations
Use this information as a basis for the design of your storage structure.
File Size, Type, and Quantity
The more data contained within a file, the larger it is and the longer it takes to parse it. The most efficient size of file has a lot to do with the platform on which you will perform your analytics, i.e., SQL or Spark pools. Some characteristics impact both:
- Many small files can negatively impact performance.
- File should be at least 4 MB due to ADLS transaction costs. Read and write procedures are charged in 4 MB increments. If the operation takes place on a file 4 KB in size, you still get charged for 4 MB.
Other characteristics are pool specific:
- SQL pool
- A file size between 100 MB and 10 GB is optimal.
- Convert large CSV or JSON files to Parque.
- When using OPENROWSET, have equally sized files.
- Spark pool
- A file size between 256 MB and 100 GB is optimal.
- Use Parquet files as often as possible.
The snippet shown earlier that converts a CSV file to Parquet might be useful for managing file size. Instead of loading only a single file into a DataFrame, load 10 or 20 files, and then write all that data to a single Parquet file. That single file would then be the size of all the others combined.
Directory Structure
The structuring of your files into directories has an impact on both manageability and performance. Figure 3.6 illustrates the folder structure for the brainjammer data. The directory is structured in a way that makes it relatively intuitive where data is located. If you want CSV, JSON, or Parquet files, or if you want MetalMusic brain wave sessions in JSON format, they are easy to find. This means that anyone can find the desired files without documentation or training, making the directory structure manageable and well designed.
FIGUER 3.6 The brainjammer directory structure
From a performance perspective, note that the file names contain the MODE (EEG or POW) and the SCENARIO. Notice, also, that the JSON‐formatted files do not contain the SCENARIO. Assume that all JSON data files, instead of being stored into a well‐designed directory structure, are placed into a single folder. You would then need to query all the files to find which were related to a specific SCENARIO. That would not be as performant as querying a smaller group of files that are grouped logically together. An alternative and very common directory structure for managing files contains dates. Consider the following directory template:
{location}/{subject}/{direction}/{yyyy}/{mm}/{dd}/{hh}/*
Here are a few examples of how this might look:
EMEA/brainjammer/in/2022/01/07/09/
NA/brainjammer/out/2022/01/07/10/
The {location} identifies the physical location of the data, followed by an identifier or {subject} of what the data is. The {direction} folder can help identify the state of the data. For example, {in} might symbolize that the data is raw and needs to be read and processed, whereas {out} means the data has been processed at least once. The folders that identify the file ingestion, {yyyy}/{mm}/{dd}/{hh}, give you a well‐designed directory structure. Being able to query data in a constrained context, such as year, month, day, and hour, improves performance when compared to parsing all files for a given year and month.