Chapter 2 introduced numerous file types and their use cases; refer back to it if you need a refresher. The following file formats are used most often in a Big Data context:
- JavaScript Object Notation (JSON)
- Apache Parquet
- Optimized Row Columnar (ORC)
- Extensible Markup Language (XML)
- YAML Ain't Markup Language (YAML)
- Comma‐separated values (CSV)
- Apache Avro (AVRO)
This code loads an existing session into a DataFrame and then creates a new DataFrame to contain the old and new headers. The new headers replace the old ones in the original DataFrame named df, and the result is placed into another DataFrame named dfh. That DataFrame, dfh, is then written to the /user/trusted‐service‐user directory in the ADLS container referenced on the first line of code. Finally, the new Parquet file is loaded into a new DataFrame named dfp and queried. Figure 3.5 illustrates how the Parquet files are stored in the Storage browser blade in the Azure Portal.
FIGURE 3.5 Storing Parquet files in an ADLS container
Similar views of these files—and all files in your ADLS container—are possible from numerous sources. Azure Synapse Analytics has a feature to navigate through the ADLS content, as does Azure Storage Explorer. Note that Azure Storage Explorer has a handy Download feature.
Recommended File Types for Analytical Queries
Table 3.3 provides a refresher on when to use which file types.
TABLE 3.3 File type use cases
File type | Synapse pool type | Use case
JSON | SQL and Spark | Large complex datasets, using JavaScript |
Parquet | Spark | WORM operations with Hadoop or Spark |
ORC | Spark | Apache Hive, WORM operations |
XML | SQL and Spark | Data validation with content variety |
YAML | N/A | Primarily for configurations, not data |
CSV | SQL and Spark | Small datasets, simple, using Excel |
AVRO | Spark | Write-heavy I/O operations, optimal for batch processing
For clarity, use JSON when you or your team are already working heavily with JavaScript. That doesn't mean you shouldn't use JSON with C#, Python, or Java; it means JSON is most optimal with JavaScript. Remember that WORM stands for write once, read many. Also, consider that Parquet files are much more efficient than JSON and CSV files, from both a storage perspective and a performance perspective. Finally, Parquet and ORC files are in columnar format, making them optimal for read operations, whereas AVRO files are row‐based, making them optimal for write operations. All three file types—Parquet, ORC, and AVRO—are not human readable, as they are stored in binary form.