Once data is ingested and initially stored in what is commonly referred to as a data landing zone (DLZ), it flows through the other Big Data stages. More in‐depth detail about the Big Data transformation stage is covered in Part III, “Develop Data Processing.” For now, only the file structure that supports data transformation is important. The data is stored in the Raw File zone immediately after ingestion. Data in the Raw File zone is not yet in a state suitable for data analysis and insights gathering, nor does it have an enforced schema. Data in a raw state might not even be queryable. Table 3.4 describes the data landing zones.
TABLE 3.4 Data landing zones
Zone name | Description |
Raw File, raw, bronze | Data is in its original state after initial ingestion. |
Cleansed Data, enriched, silver | Data is in a queryable state. |
Business Data, workspace, gold | Data is ready for analytics and insights gathering. |
You might notice that there are numerous names for each zone. There is no universal industry guideline for zone names, so several naming schemes are in common use. The point is to give each zone a name that identifies the state of the data it contains. Figure 3.7 shows an example of how raw files might end up in an ADLS container. This data could be pushed by an authorized client or pulled from a data source using an Azure Function, for example.
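Because several names can refer to the same zone, it can help to map every alias to one canonical directory. The following is a minimal sketch, not a prescribed layout: the directory names `raw-files`, `cleansed-data`, and `business-data` are taken from the figure captions in this section, while the `zone_path` helper and the sub-path segments passed to it are hypothetical names introduced only for illustration.

```python
from pathlib import PurePosixPath

# One directory per zone; names follow the figure captions in this section.
ZONE_DIRECTORIES = {
    "raw": "raw-files",
    "cleansed": "cleansed-data",
    "business": "business-data",
}

# Alternative zone names from Table 3.4, each mapped to one canonical key.
ZONE_ALIASES = {
    "raw file": "raw", "raw": "raw", "bronze": "raw",
    "cleansed data": "cleansed", "enriched": "cleansed", "silver": "cleansed",
    "business data": "business", "workspace": "business", "gold": "business",
}


def zone_path(zone_name: str, *parts: str) -> str:
    """Return the container-relative path for a zone, whichever alias is used.

    Hypothetical helper: raises KeyError for a zone name not in Table 3.4.
    """
    canonical = ZONE_ALIASES[zone_name.strip().lower()]
    return str(PurePosixPath(ZONE_DIRECTORIES[canonical], *parts))


# "Bronze" and "Raw File" resolve to the same directory.
print(zone_path("Bronze", "EEG", "session1.json"))
print(zone_path("Raw File", "EEG", "session1.json"))
```

Whichever naming scheme a team prefers, resolving it to a single directory per zone keeps the state of the data visible in the path itself.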
FIGURE 3.7 The brainjammer raw‐files directory
Note in Figure 3.7 the variety of file formats in which brainjammer sessions and modes (i.e., EEG and POW) are captured and stored. In addition to the variety of file formats, notice the directory structure and names. Simply by looking at the directory structure, you can draw some conclusions about the state of the data within it. The next iteration of transformation might look something like Figure 3.8.
FIGURE 3.8 The brainjammer cleansed‐data directory
Notice that there are fewer files. It is safe to conclude that a process or procedure took place that analyzed all the files in the raw‐files directory. The procedure likely grouped and sorted the data, by file format type, into single files. This can be a very complicated activity, functionally speaking; this step requires not only technical knowledge but also significant experience with the data being transformed. It is outside the scope of this book to attempt a conversion from EEG brain waves to POW brain waves. To make that conversion, you would need an in‐depth understanding of the device that captured the brain waves as well as of standard brain functions. That is why, in this example, those files were not merged. Once the process completes, the transformed files are stored in the cleansed‐data directory and are now considered query‐ready. The final step in the workflow (aka data flow) would be to get the data files into the optimal form for performing data analytics and business insights gathering (Figure 3.9).
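The grouping step described above can be sketched as a simple planning function. This is only an illustration under stated assumptions, not the procedure actually used for the brainjammer data: it assumes a hypothetical layout of `raw-files/<MODE>/<file>`, the output file names are invented, and it only decides which raw files would be combined, without reading or transforming their contents. EEG and POW files are kept in separate groups, as in the text.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_cleansed_files(raw_paths):
    """Group raw file paths by (mode, file format), one merged target per group.

    Assumes the hypothetical layout raw-files/<MODE>/<file>; the target names
    under cleansed-data/ are illustrative only.
    """
    groups = defaultdict(list)
    for raw in raw_paths:
        path = PurePosixPath(raw)
        mode = path.parts[1]                       # e.g. "EEG" or "POW"
        file_format = path.suffix.lstrip(".").lower()
        groups[(mode, file_format)].append(raw)

    plan = {}
    for (mode, file_format), members in sorted(groups.items()):
        target = f"cleansed-data/{mode}/brainjammer-{mode.lower()}.{file_format}"
        plan[target] = sorted(members)             # sorted for a stable merge order
    return plan


raw_files = [
    "raw-files/EEG/s2.csv",
    "raw-files/EEG/s1.csv",
    "raw-files/POW/s1.json",
    "raw-files/EEG/s1.json",
]
for target, sources in plan_cleansed_files(raw_files).items():
    print(target, "<-", sources)
```

A real cleansing step would also validate, deduplicate, and normalize the records while merging, which is where experience with the data becomes essential.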
FIGURE 3.9 The brainjammer business‐data directory
Notice again that there are fewer files in the business‐data directory; in this case, there is a single file per brain wave mode. Keep in mind that the number of files depends on the file type and the analytics stack on which the analysis will be performed. Because these files are in Parquet format, they will most likely be analyzed using a Spark pool, for which the recommended file size is between 256 MB and 100 GB. You would create as many or as few files as work best in your scenario. Data in this DLZ is considered ready for reporting. A final point is that some large enterprise implementations place each DLZ into a different ADLS container or even into different Azure storage accounts. This would be done for better isolation or redundancy, to manage growth, to better align with team roles and responsibilities, or for compliance reasons. Just keep in mind that your design is not confined to a single ADLS container in a single location.
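The file-size guidance above can be turned into a rough estimate of how many files to write for a given dataset. The sketch below is a back-of-the-envelope calculation, not an Azure or Spark API: `plan_file_count` and its default 1 GB target are hypothetical, and the only values taken from the text are the 256 MB and 100 GB bounds.

```python
import math

MIN_FILE_BYTES = 256 * 1024**2   # 256 MB lower bound quoted in the text
MAX_FILE_BYTES = 100 * 1024**3   # 100 GB upper bound quoted in the text


def plan_file_count(total_bytes: int, target_file_bytes: int = 1024**3) -> int:
    """Estimate how many files to write so each is close to the target size.

    The target is clamped to the 256 MB to 100 GB range; the 1 GB default is
    an arbitrary assumption for illustration.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    target = min(max(target_file_bytes, MIN_FILE_BYTES), MAX_FILE_BYTES)
    # A dataset smaller than the target still needs one file.
    return max(1, math.ceil(total_bytes / target))


# A 10 GB dataset with the default 1 GB target.
print(plan_file_count(10 * 1024**3))
```

A dataset smaller than the lower bound still yields a single file, which matches the single file per brain wave mode shown in Figure 3.9.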