Recommended File Types for Storage – Data Sources and Ingestion

Chapter 2 introduced numerous file types and their use cases. If you need a refresher, go back to Chapter 2 to review. The following file formats are used most when working in a Big Data context:

  • JavaScript Object Notation (JSON)
  • Apache Parquet
  • Optimized Row Columnar (ORC)
  • Extensible Markup Language (XML)
  • Yet Another Markup Language (YAML)
  • Comma‐separated values (CSV)
  • Apache Avro (AVRO)

This code loads an existing session into a DataFrame and then creates a new DataFrame containing the old and new headers. The new headers replace the old ones in the original DataFrame, named df, and the result is placed into another DataFrame named dfh. That DataFrame, dfh, is then written to the /user/trusted-service-user directory in the ADLS container referenced on the first line of code. Finally, the new Parquet file is loaded into a new DataFrame named dfp and queried. Figure 3.5 illustrates how the Parquet files appear in the Storage browser blade in the Azure Portal.

FIGURE 3.5 Storing Parquet files in an ADLS container

Similar views of these files—and all files in your ADLS container—are possible from numerous sources. Azure Synapse Analytics has a feature to navigate through the ADLS content, as does Azure Storage Explorer. Note that Azure Storage Explorer has a handy Download feature.

Recommended File Types for Analytical Queries

Table 3.3 provides a refresher on when to use which file types.

TABLE 3.3 File type use cases

File type   Synapse pool type   Use case
JSON        SQL and Spark       Large, complex datasets; using JavaScript
Parquet     Spark               WORM operations with Hadoop or Spark
ORC         Spark               Apache Hive; WORM operations
XML         SQL and Spark       Data validation with content variety
YAML        N/A                 Primarily for configurations, not data
CSV         SQL and Spark       Small, simple datasets; using Excel
AVRO        Spark               Write-heavy I/O operations; optimal for batch processing

For some clarity, use JSON when you or your team are already working heavily with JavaScript. That doesn't mean you shouldn't use JSON with C#, Python, or Java; it means JSON is most optimal with JavaScript. Remember that WORM stands for write once, read many. Also, consider that Parquet files are much more efficient than JSON and CSV files from both a storage and a performance perspective. Finally, Parquet and ORC files are in columnar format, making them optimal for read operations, whereas Avro files are row-based, making them optimal for write operations. All three file types (Parquet, ORC, and Avro) are stored in binary form and are not human readable.
