Menu

CARTESIAN JOIN – CREATE DATABASE dbName; GO

A Cartesian JOIN, or a CROSS JOIN, renders a Cartesian product, which is a record set of two or more joined tables. The following snippet is an example of a CROSS JOIN:

SELECT [ELECTRODE], [FREQUENCY]FROM [ELECTRODE], [FREQUENCY]ORDER BY ELECTRODE

Notice that there is no JOIN condition, which results in each row in the ELECTRODE table being paired with each row in the FREQUENCY table. The output resembles the following:

+———–+———–+
| ELECTRODE | FREQUENCY |
+———–+———–+
| AF3       | THETA     |
| AF3       | ALPHA     |
| …         | …         |
| Pz        | BETA_L    |
| …         | …         |
| T8        | BETA_H    |
| T8        | GAMMA     |
+———–+———–+

The result is a table containing all electrodes with their associated frequencies.

INCOMPATIBLE JOIN

This concept is in reference to performance bottlenecks that occur with distributed tables on Azure Synapse Analytics. Remember from earlier where you learned about distributed tables and data shuffling. You learned that when tables are created you need to decide the distribution type for a table. Distribution types can be ROUND‐ROBIN, HASH, or REPLICATED, which determine how the data is placed onto the different compute nodes. You learned that shuffling happens when a query running on nodeA needs some data that is stored on nodeB, which requires the data to be moved to nodeA. The reason for that shuffle is an incompatible JOIN. Consider the following EXPLAIN query:

EXPLAINSELECT SESSION.SESSION_DATETIME, SCENARIO.SCENARIOFROM   SESSIONINNER REDISTRIBUTE JOIN SCENARIO ON SESSION.SCENARIO_ID = SCENARIO.SCENARIO_ID

In this code snippet, SCENARIO_ID is the distribution key for the SESSION table but not for the SCENARIO table. Adding the REDISTRIBUTE hint in the query forces a shuffle, which prevents latent query behaviors. Explaining to the SQL optimizer that this is an incompatible JOIN results in the data being shuffled in the most efficient way possible. The same behavior can be experienced when executing aggregate functions, which is referred to as incompatible aggregations. The difference as you can infer is that this happens when aggregate functions are used that cause the shuffle instead of commands.

MEDIAN

There is no MEDIAN function in T‐SQL, but using a rather complex query makes it possible. You have the luxury of using the PERCENTILE function in SparkSQL, which is one of numerous options. The median is different from the average in that the values at the highest and lowest spectrum of data values are removed. Those large values can have a dramatic effect on the rendered value. Consider the following SparkSQL query and result:

%%sql
SELECT AVG(_c1) AS Average, PERCENTILE(_c1, 0.5) AS Median FROM BETA_H

+———+———
| Average | Median |
+———+——–+
| 1.55598 | 1.3375 |
+———+——–+

The data that this query was run against was the brainwave value for all electrodes, the BETA_H frequency, and the ClassicalMusic scenario.

OPENROWSET

Use this function when you want to perform some ad hoc queries from a remote data source. OPENROWSET can perform SELECT, INSERT, UPDATE, and DELETE statements. The following snippet illustrates how to use this command:

SELECT Scenario, detailsFROM OPENROWSET(‘CosmosDb’,’Account=csharpguitar;Database=brainjammer;Key=sMxCN…xyQ==’,sessions)WITH (Scenario varchar(max) ‘$.Session.Scenario’,details varchar(max) ‘$.Session’) AS Sessions

Notice that the first parameter is a reference to the provider. A provider is the driver, or the coded assembly, that contains the code to make a connection to the targeted data source. In this case it is a reference to the Azure Cosmos DB you created earlier in Exercise 2.2. The value, CosmosDb, is an alias to an Azure Synapse Linked Service, which is configured with a connection string. (You have not created that connection yet, but you will in the following chapters.) Notice that the values for Account and Database match the example names provided in Exercise 2.2 as well. The third parameter, sessions, is the container ID. The WITH clause is similar to the CROSS APPLY command in that it provides the reference to an object outside the scope of the SQL query but uses it in the final rendered result set. This is a very useful command, especially when you’re working with JSON files.

Leave a Reply

Your email address will not be published. Required fields are marked *