Setup on Azure Databricks#

This tutorial explains how to load the dataset into an Azure Spark cluster and interface with it using a Python notebook in Azure Databricks.

Preliminaries#

Make sure you have:

  • familiarized yourself with the Azure Databricks Getting Started Guide

  • uploaded the dataset in the Parquet format to Azure (the most efficient place to store it is an Azure Data Lake Storage Gen2 container).

  • created a Spark cluster in the Databricks interface and attached a Python notebook to it.

  • set the OAuth credentials in the Notebook so that your Parquet files are accessible from the notebook, as described here (a configuration sketch is given after this list).
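
The exact settings depend on how you authenticate; the following is a minimal sketch assuming OAuth with a service principal, where YOUR_ACCOUNT, client_id, client_secret and tenant_id are placeholders to substitute with your own values (the Azure Databricks documentation linked above remains the reference):

# Minimal sketch: OAuth (service principal) access to an ADLS Gen2 storage account.
# YOUR_ACCOUNT, client_id, client_secret and tenant_id are placeholders.
storage_account = "YOUR_ACCOUNT"
client_id = "<application-client-id>"
client_secret = "<client-secret>"
tenant_id = "<tenant-id>"

suffix = "{}.dfs.core.windows.net".format(storage_account)
spark.conf.set("fs.azure.account.auth.type." + suffix, "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type." + suffix,
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id." + suffix, client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret." + suffix, client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint." + suffix,
               "https://login.microsoftonline.com/{}/oauth2/token".format(tenant_id))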

To ensure that you have completed all the preliminary steps, run the following command in your Notebook:

dataset_path = 'abfss://YOUR_CONTAINER@YOUR_ACCOUNT.dfs.core.windows.net/PARQUET_FILES_PATH'
dbutils.fs.ls(dataset_path)

You should see an output like this:

[FileInfo(path='abfss://.../swh/content/', name='content/', size=0),
 FileInfo(path='abfss://.../swh/directory/', name='directory/', size=0),
 ...]

Loading the tables#

We need to load the Parquet tables as temporary views in Spark:

def register_table(table):
  abfss_path = dataset_path + '/' + table
  df = spark.read.parquet(abfss_path)
  print("Register the DataFrame as a SQL temporary view: {} (path: {})"
        .format(table, abfss_path))
  df.createOrReplaceTempView(table)

tables = [
  'content',
  'directory',
  'directory_entry',
  'origin',
  'origin_visit',
  'origin_visit_status',
  'release',
  'revision',
  'revision_history',
  'skipped_content',
  'snapshot',
  'snapshot_branch',
]

for table in tables:
  register_table(table)
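
To check that the views were registered, you can list the tables and views known to the Spark session; this verification step is not part of the dataset documentation and only uses the standard spark.catalog API:

# List the temporary views registered in the current Spark session
for t in spark.catalog.listTables():
  print(t.name, t.isTemporary)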

Running queries#

You can now execute PySpark methods on the tables:

df = spark.sql("select id from origin limit 10")
display(df)
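
The same query can also be expressed with the DataFrame API rather than a SQL string; this equivalent sketch only uses standard PySpark calls:

df = spark.table("origin").select("id").limit(10)
display(df)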

It is also possible to use the %sql magic command in the Notebook to directly preview SQL results:

%sql
select id from origin limit 10
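
Any SQL supported by Spark works the same way; as a simple illustration that assumes nothing beyond the tables registered above, the following cell counts the origins in the dataset:

%sql
select count(*) as nb_origins from origin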