The training and enrichment of data typically happen by making improvements to data quality or by invoking Azure Machine Learning models, which can later be consumed by Azure Cognitive Services. The invocation can take place within a pipeline or be triggered manually. Azure Machine Learning models can be used to predict future outcomes based on historical trends identified in the data. Azure Cognitive Services can be used to analyze audio files, images, and videos, which might result in findings that can be logged as data to be used later in queries. There is more about this phase of the Big Data pipeline in Chapter 5, “Transform, Manage, and Prepare Data.”
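To make the invocation concrete, the following is a minimal sketch of calling a deployed Azure Machine Learning real-time endpoint from a pipeline step. The endpoint URI, API key, and payload shape shown here are assumptions; they depend entirely on your deployment and your model's entry script.

```python
import json


def build_scoring_request(records, api_key):
    """Build the headers and JSON body for an Azure ML real-time endpoint call.

    The {"data": [...]} payload shape matches a common scoring-script
    convention, but your model's entry script may expect a different schema.
    """
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    body = json.dumps({"data": records})
    return headers, body


# Sending the request requires the `requests` package and a live endpoint:
# import requests
# scoring_uri = "https://<endpoint>.<region>.inference.ml.azure.com/score"  # hypothetical
# headers, body = build_scoring_request([[5.1, 3.5, 1.4, 0.2]], "<api-key>")
# response = requests.post(scoring_uri, data=body, headers=headers)
# predictions = response.json()
```

The same request could be issued from an Azure Data Factory Web activity inside a pipeline, which is what "within a pipeline" refers to above.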
Store
Where is the data stored? Rule number one is that it needs to be in a secure place. It also needs to comply with regional governmental requirements. If you run a local business in the European Union, you might not want your data to be stored physically in another region or even another country. How long are you obligated to store data, and what is the maximum length of time you can legally store social media data? Throughout the pipeline you will need to store data temporarily as it passes through the numerous stages, and then you will need a final location to store the output. Therefore, you have at least three places to keep your data secure and available: its original source location, any staging or temporary location, and its final location. The final location is where authorized clients will consume the data and make business decisions from it. Azure Data Lake Storage (ADLS) is where a lot of your storage will happen. Azure Synapse Analytics, Azure Data Factory, and other analytics products provide in‐memory or temporary table stores for housing data that is being transformed. The final location for data can be a SQL data warehouse (aka dedicated SQL pool), Azure SQL database, Azure Cosmos DB, or any other secure and/or compliant location.
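The three locations can be sketched as follows, with a local file system standing in for cloud storage. The ADLS URI in the comment is illustrative only; in a real pipeline each hop would be a secured, regionally compliant store.

```python
import shutil
from pathlib import Path


def stage_and_finalize(source_file: Path, staging_dir: Path, final_dir: Path) -> Path:
    """Move data from its source through a staging area to its final location.

    In a real pipeline these would be secured cloud containers, e.g. ADLS
    paths such as abfss://raw@<account>.dfs.core.windows.net/..., and each
    hop would preserve access controls and regional placement.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    final_dir.mkdir(parents=True, exist_ok=True)

    staged = staging_dir / source_file.name
    shutil.copy2(source_file, staged)            # 1) land in the temporary store
    # ... transformation would run here against the staged copy ...

    final = final_dir / source_file.name
    shutil.move(str(staged), str(final))         # 2) publish to the final location
    return final
```

The point of the sketch is the shape of the flow: the source copy is never modified, the staging copy is transient, and only the final location is exposed to consumers.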
Model, Serve
This is the final stage, and it’s where the fruits of your labor are realized. Look back at Figure 2.29 and you will see that individuals or partners can access the data via a data share. Additionally, the data can be stored in Azure SQL, which Power BI can connect to for additional modeling and manipulation. Or the data can be sent to, or exposed via, an API and accessed by authorized client applications.
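The authorization gate such an API performs before handing curated data to a client application can be sketched as a plain function. This is deliberately simplified; a real service would sit behind a web framework and validate tokens issued by Azure Active Directory rather than comparing against a shared secret.

```python
def handle_results_request(headers, curated, token):
    """Authorize and serve curated results, as a serving API endpoint would.

    Returns an (HTTP status code, payload) pair. Only callers presenting
    the expected bearer token receive the data.
    """
    supplied = headers.get("Authorization", "")
    if supplied != f"Bearer {token}":
        return 401, {"error": "unauthorized"}
    return 200, curated
```

Whatever framework hosts the endpoint, the contract is the same: unauthorized computer applications get a 401, and authorized ones get the modeled data.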
ETL, ELT, ELTL
Both extract, transform, and load (ETL) and extract, load, and transform (ELT) are processes that take raw data from a data source and place it into a data target. Those processes overlap the Big Data stages described earlier; they do, however, influence the order in which the stages occur. Both ETL and ELT are terms that describe the approach you are taking to implement your data analytics solution. The ETL process is best applied to relational data sources, whereas ELT is better suited to handling unstructured data. ELT performs faster because it does not need to manage the constraints found in a relational data model. You get faster data transformation, but the data can be jumbled, and the process might fail at some point for numerous reasons, such as an unexpected data type. If speed is critical and datasets are large and frequently updated, then ELT is the best option. ETL is most useful when pulling data from numerous sources that need to be pooled and stored in a single, queryable location. The extraction comes first in either case. If datasets are large, unstructured, and frequently changing, you load the data and then transform it. Otherwise, if the data is structured and changes less frequently, then after extraction you transform the data and then load it into a location for consumption.
There is a newer concept making some headway called extract, load, transform, and load (ELTL), which is likely self‐explanatory. Based on the first three letters, you know the scenario consists of large, unstructured datasets that change frequently. The addition of a load at the end indicates that, after transformation, the data can ultimately be stored, analyzed, and consumed in the same manner as relational data.
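The three orderings can be sketched with stand-in functions. The sample records, and the lists standing in for the lake and the warehouse, are purely illustrative; the point is where the transform step sits relative to the loads.

```python
def extract(source):
    """Pull raw records from the source (here, an in-memory stand-in)."""
    return list(source)


def transform(records):
    """Apply cleansing and typing; here, normalize names and coerce amounts."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in records
    ]


def load(records, target):
    """Write records into the target store (a list stands in for a table)."""
    target.extend(records)
    return target


raw = [{"name": "  alice ", "amount": "10"}, {"name": "BOB", "amount": "2.5"}]

# ETL: transform before loading -- suits structured, slowly changing data.
warehouse = load(transform(extract(raw)), [])

# ELT: load the raw data first, then transform inside the target --
# suits large, frequently updated, unstructured data.
lake = load(extract(raw), [])
curated = transform(lake)

# ELTL: after the in-lake transform, load the result into a relational
# store so it can be consumed like any other structured data.
final_store = load(curated, [])
```

Note that the ELTL path ends with the same clean, typed records as the ETL path; the difference is that the raw data landed in the lake first, before any constraints were applied.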