Chapter 2 introduced numerous file types and their use cases; refer back to it if you need a refresher. The following file formats are used most often in a Big Data context:
- JavaScript Object Notation (JSON)
- Apache Parquet
- Optimized Row Columnar (ORC)
- Extensible Markup Language (XML)
- YAML Ain't Markup Language (YAML)
- Comma‐separated values (CSV)
- Apache Avro (AVRO)
This code loads an existing session into a DataFrame and then creates a new DataFrame to contain the old and new headers. The new headers replace the old ones in the original DataFrame named df, and the result is placed into another DataFrame named dfh. That DataFrame, dfh, is then written to the /user/trusted‐service‐user directory in the ADLS container referenced on the first line of code. Finally, the new Parquet file is loaded into a new DataFrame named dfp and queried. Figure 3.5 illustrates how the Parquet files are stored in the Storage browser blade in the Azure Portal.
FIGURE 3.5 Storing Parquet files in an ADLS container
Similar views of these files—and all files in your ADLS container—are possible from numerous sources. Azure Synapse Analytics has a feature to navigate through the ADLS content, as does Azure Storage Explorer. Note that Azure Storage Explorer has a handy Download feature.
Recommended File Types for Analytical Queries
Table 3.3 provides a refresher on when to use which file types.
TABLE 3.3 File type use cases
File type | Synapse pool type | Use case
JSON | SQL and Spark | Large complex datasets, using JavaScript |
Parquet | Spark | WORM operations with Hadoop or Spark |
ORC | Spark | Apache Hive, WORM operations |
XML | SQL and Spark | Data validation with content variety |
YAML | N/A | Primarily for configurations, not data |
CSV | SQL and Spark | Small datasets, simple, using Excel |
AVRO | Spark | Write-heavy I/O operations, optimal for batch processing
For clarity, use JSON when you or your team are already working heavily with JavaScript. That doesn't mean you shouldn't use JSON with C#, Python, or Java; it means JSON is most optimal with JavaScript. Remember that WORM stands for write once, read many. Also, consider that Parquet files are much more efficient than JSON and CSV files, from both a storage perspective and a performance perspective. Finally, Parquet and ORC files are in columnar format, making them optimal for read operations, whereas AVRO files are row‐based, making them optimal for write operations. All three file types—Parquet, ORC, and AVRO—are not human readable, as they are stored in binary form.