There was a lot covered in this chapter, starting with the description of the first data storage device up to real‐time data analysis and intelligence gathering. You learned about the different file formats, like JSON, CSV, and Parquet. You learned about the different ways in which data can be stored, like structured, semi‐structured, and nonstructured. You learned about the many types of data, such as characters and numbers, and data storage concepts like sharding, indexes, and partitions. Data shuffling, data skew, distributed tables, schemas, and table views should all have a place in your mind now.
There are numerous methods for interfacing with data on the Azure platform. When using an Apache Spark pool, you can use PySpark and DataFrames to manipulate and store data. SparkSQL is a familiar syntax for those who are more comfortable with T‐SQL or other SQL variants. You learned about the DBCC and DDL capabilities and a lot about some mid‐level complex DML commands, clauses, and functions. Syntax such as COPY INTO, CROSS APPLY, OVER, INTERSECT, OPENJSON, and WITH should be tightly bound into your data analysis repertoire. And finally, you learned some introductory concepts and theory regarding Big Data processing stages, analytical types, and layers. Ingest, transform, train, store, and serve are the primary Big Data stages. You should have a good understanding of which Azure products are used for each of those stages.
Exam Essentials
Know how data is structured. Relational databases, document databases, and blob storage are ways to store the three primary data structure types. Azure SQL is useful for structured, relational data. Azure Cosmos DB is useful for semi‐structured document data, and blob storage is useful for nonstructured data. Remember that ADLS is built on top of blob storage and is most useful for data analytic solutions.
Describe the various kinds of distributed tables. Distributed tables in Azure Synapse Analytics come in three varieties: round‐robin, hash, and replicated. Round‐robin is the default and means that data is randomly placed across the nodes in your cluster. Hash means that the data is placed across the nodes based on a key that groups similar kinds of data together. Replicated means that all the data is placed onto all the nodes. This type will consume the most amount of storage space.
Be able to name the table categories. Staging tables, fact tables, dimension tables, temporary tables, sink tables, and external tables are all necessary to run an enterprise‐level data analytics process.
Understand how to use PySpark, DataFrames, and SparkSQL. Although there is no coding exercise on the exam, you will need to review code snippets and determine which one is the most optimal. There might even be small syntactical mistakes that are hard to recognize. Perform the exercises in this book and read through the examples provided in this chapter so that you can recognize common coding and SQL syntax for running data analytics on Apache Spark.
Know how to use DBCC, DDL, and DML. An Azure data engineer needs to know SQL syntax. You must understand the basics as well as the mid‐level and advanced commands, clauses, and functions. You will need to know the less common, yet very powerful, SQL syntax for performing sophisticated analytics. SQL syntax is much more advanced than beginner.
Be able to define Big Data. Know the stages, the Azure products, and the analytics types (descriptive, predictive, and prescriptive). Although data layers overlap the stages, they are still used in many data analytics solutions, and therefore, you need to know them well.