Design a Partition Strategy for Efficiency and Performance – Data Sources and Ingestion

Both efficiency and performance were discussed in earlier sections pertaining to the design of a partition strategy. An efficient query is one in which the time required to execute it is well used. That means the query should not be waiting on data shuffling or querying irrelevant data. The most efficient query would be one […]
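One way to avoid querying irrelevant data is to select only the partitions a query needs before reading any files. The following is a minimal sketch of that idea in plain Python. The `year=/month=/day=` layout and the helper names `partition_path` and `prune` are assumptions made for illustration, not taken from the original text.

```python
from datetime import date

# Hypothetical partition layout: one directory per day,
# for example year=2022/month=03/day=01.
def partition_path(d: date) -> str:
    """Build the (assumed) directory name for a given day."""
    return f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}"

def prune(partitions: list[str], wanted: date) -> list[str]:
    """Keep only the partitions whose path matches the requested day."""
    target = partition_path(wanted)
    return [p for p in partitions if p == target]

# Five daily partitions exist, but the query only needs one of them.
partitions = [partition_path(date(2022, 3, day)) for day in range(1, 6)]
selected = prune(partitions, date(2022, 3, 2))
print(selected)
```

Only the selected directory would then be read, so the four other partitions never cost any query time.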

Design for Efficient Querying – Data Sources and Ingestion

You can take numerous steps to optimize the performance and manageability of your files contained on ADLS. The following actions can improve query efficiency: Use this information as a basis for the design of your storage structure.

File Size, Type, and Quantity

The more data contained within a file, the larger it is and the […]

Recommended File Types for Storage – Data Sources and Ingestion

Chapter 2 introduced the numerous file types and their use cases. If you need a refresher, go back to Chapter 2 to review. The following file formats are used most when working in the Big Data context: This code loads an existing session into a DataFrame and then creates a new DataFrame to contain the […]

Design a Partition Strategy for Files – Data Sources and Ingestion

Having an intuitive directory structure for the ingestion of data is a prerequisite for implementing the partitioning strategy. You may not know how the received files will be formatted in all scenarios; therefore, analysis and preliminary transformation are often required before any major actions like partitioning happen. A directory structure similar to the following is […]

Create an Azure Data Lake Storage Container – Data Sources and Ingestion-3

Some Azure products generate costs even when not actively used, whereas others do not. An empty ADLS container does not incur any costs, but one that consumes space does. You should remove resources that are no longer being used. Make sure to perform due diligence when provisioning Azure products, as you will be required to […]

SQL Server Integration Services – CREATE DATABASE dbName; GO

Introduced in Chapter 1, SQL Server Integration Services (SSIS) is useful for pulling data from numerous datastores, transforming the data, and storing it in a central datastore for analysis. In this model, ingestion is initiated by pulling data from existing sources, in contrast to data producers pushing data into the pipeline.

Bulk Copy Program […]

Upload Data to an ADLS Container – Data Sources and Ingestion

FIGURE 3.3 The Upload folder in Azure Storage Explorer

Refer to Table 3.1 and try to determine what kind of ingestion this was. Did you use the correct processing service for the ingestion type? Yes, you did. What you just performed was an ad hoc upload. One of the recommended tools is noted as Azure […]

CONVERT and CAST – CREATE DATABASE dbName; GO

CONVERT and CAST are essentially the same: both change the data type of a value, and there is no meaningful difference in their performance. CAST is the ANSI-standard form, whereas CONVERT is specific to T-SQL and accepts an optional style argument that controls formatting. As long as you understand that both of these SQL functions are used to change the data type of data stored in a table, you have this one […]
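To see a type conversion in action, here is a minimal sketch that runs CAST through Python's built-in `sqlite3` module. SQLite is used only so the example is self-contained; it supports CAST but not the T-SQL CONVERT function, and the literal values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CAST a text value to an integer so that arithmetic works on it.
as_int = conn.execute("SELECT CAST('42' AS INTEGER) + 1").fetchone()[0]

# CAST an integer to text, for example before concatenating it with a label.
as_text = conn.execute("SELECT CAST(7 AS TEXT)").fetchone()[0]

print(as_int, repr(as_text))
conn.close()
```

In T-SQL the first statement could equally be written with CONVERT, with the target data type passed as the first argument.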

JOIN – CREATE DATABASE dbName; GO

This is a relational structure‐oriented concept that has to do with querying data that exists in two or more tables using a single query. It is possible to use JOINs on NoSQL data, but the nature of nonstructured or semi‐structured data means it won’t be a very performant experience. If the JOIN on non‐ or unstructured […]
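The idea of querying two tables with a single query can be sketched with an INNER JOIN, run here through Python's built-in `sqlite3` module. The `ELECTRODE` and `READING` tables, their columns, and the sample rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and data: readings reference electrodes by ID.
conn.executescript("""
CREATE TABLE ELECTRODE (ELECTRODE_ID INTEGER PRIMARY KEY, ELECTRODE TEXT);
CREATE TABLE READING (READING_ID INTEGER PRIMARY KEY,
                      ELECTRODE_ID INTEGER,
                      READING_VALUE REAL);
INSERT INTO ELECTRODE VALUES (1, 'AF3'), (2, 'AF4');
INSERT INTO READING VALUES (10, 1, 4.2), (11, 1, 5.1), (12, 2, 3.3), (13, 99, 0.0);
""")

# The JOIN condition pairs each reading with its electrode.
# Reading 13 has no matching electrode, so an INNER JOIN drops it.
joined = conn.execute("""
    SELECT e.ELECTRODE, r.READING_VALUE
    FROM READING AS r
    INNER JOIN ELECTRODE AS e ON e.ELECTRODE_ID = r.ELECTRODE_ID
    ORDER BY r.READING_ID
""").fetchall()

print(joined)
conn.close()
```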

CARTESIAN JOIN – CREATE DATABASE dbName; GO

A Cartesian JOIN, or a CROSS JOIN, renders a Cartesian product, which is a record set of two or more joined tables. The following snippet is an example of a CROSS JOIN:

SELECT [ELECTRODE], [FREQUENCY]
FROM [ELECTRODE], [FREQUENCY]
ORDER BY ELECTRODE

Notice that there is no JOIN condition, which results in each row in the ELECTRODE table […]
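The effect of omitting the JOIN condition can be checked with a small sketch that uses Python's built-in `sqlite3` module. The table and column names follow the snippet above, while the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample data: three electrodes and two frequencies.
conn.execute("CREATE TABLE ELECTRODE (ELECTRODE TEXT)")
conn.execute("CREATE TABLE FREQUENCY (FREQUENCY TEXT)")
conn.executemany("INSERT INTO ELECTRODE VALUES (?)", [("AF3",), ("AF4",), ("T7",)])
conn.executemany("INSERT INTO FREQUENCY VALUES (?)", [("ALPHA",), ("BETA",)])

# No JOIN condition: every electrode row is paired with every frequency row.
rows = conn.execute("""
    SELECT e.ELECTRODE, f.FREQUENCY
    FROM ELECTRODE AS e
    CROSS JOIN FREQUENCY AS f
    ORDER BY e.ELECTRODE, f.FREQUENCY
""").fetchall()

print(len(rows))  # 3 electrodes x 2 frequencies
for row in rows:
    print(row)
conn.close()
```

The row count of the result is the product of the two table sizes, which is why CROSS JOINs on large tables grow expensive quickly.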