Setup on Amazon Athena

The Software Heritage Graph Dataset is available as a public dataset in Amazon Athena. Athena uses Presto, a distributed SQL query engine, to automatically scale queries on large datasets.

The pricing of Athena depends on the amount of data scanned by each query, generally at a cost of $5 per TiB of data scanned. Full pricing details are available here.
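As a rough illustration of this pricing model, the sketch below converts a number of scanned bytes into an approximate cost. The function name and the default rate are assumptions taken from the figure quoted above; actual billing may differ, so check the official pricing page.

```python
TIB = 2 ** 40  # number of bytes in one tebibyte


def estimate_query_cost(bytes_scanned, usd_per_tib=5.0):
    """Return the approximate cost in USD of a query that scanned `bytes_scanned` bytes.

    The default rate is the "generally $5 per TiB" figure quoted in the text,
    not an authoritative price.
    """
    return bytes_scanned / TIB * usd_per_tib
```

For example, `estimate_query_cost(200 * 2**30)` (a query scanning 200 GiB) returns a little under one dollar.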

Note that because the Software Heritage Graph Dataset is available as a public dataset, you do not have to pay for the storage, only for the queries (except for the data you store on S3 yourself, like query results).

Loading the tables

AWS account

In order to use Amazon Athena, you will first need to create an AWS account and set up billing.

You will also need to create an output S3 bucket: this is where Athena will store your query results, so that you can retrieve and analyze them afterwards. To do that, go to the S3 console and create a new bucket.
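If you prefer to script this step rather than use the console, the bucket can also be created with boto3. This is a minimal sketch, not part of the provided tooling: the helper names are hypothetical, and it assumes boto3 is installed and AWS credentials are configured as described in the Setup section.

```python
def bucket_request(bucket_name, region="us-east-1"):
    """Build the keyword arguments for the S3 CreateBucket call."""
    request = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint;
    # every other region has to be named explicitly.
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_output_bucket(bucket_name, region="us-east-1"):
    """Create the bucket. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so bucket_request() works without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**bucket_request(bucket_name, region))
```

Calling `create_output_bucket("YOUR_OUTPUT_BUCKET")` would then create the bucket in us-east-1. Bucket names are globally unique, so the call fails if the name is already taken.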

Setup

Athena needs to be made aware of the location and the schema of the Parquet files available as a public dataset. Unfortunately, since Athena does not support queries that contain multiple commands, it is not as simple as pasting an installation script in the console. Instead, we provide a Python script that you can run locally on your machine. It communicates with Athena to create the tables automatically with the appropriate schema.
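To illustrate the constraint, the sketch below shows one way to feed a multi-statement script to a service that accepts a single statement per request. It is not the code of the provided script: the function names are hypothetical, and the splitting is deliberately naive.

```python
def split_statements(script):
    """Split a SQL script into individual statements.

    Naive on purpose: it splits on every semicolon, so it would break on
    semicolons that appear inside string literals or comments.
    """
    return [part.strip() for part in script.split(";") if part.strip()]


def run_script(script, submit):
    """Call submit(statement) once per statement, in order, and collect the results.

    `submit` is any callable taking one SQL string, for example a wrapper
    around an Athena client's start_query_execution method.
    """
    return [submit(statement) for statement in split_statements(script)]
```

In practice each statement should also be allowed to finish before the next one is submitted, since creating a table depends on its database already existing.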

To run this script, you will need to install a few dependencies on your machine:

  • For Ubuntu and Debian:

    sudo apt install python3 python3-boto3 awscli
    
  • For Archlinux:

    sudo pacman -S --needed python python-boto3 aws-cli
    

Once the dependencies are installed, run:

aws configure

This will ask for an AWS Access Key ID and an AWS Secret Access Key in order to give Python access to your AWS account. These keys can be generated at this address.

It will also ask for the region in which you want to run the queries. We recommend using us-east-1, since that’s where the public dataset is located.

Creating the tables

Download the Python script that will create the tables on your account:

wget https://annex.softwareheritage.org/public/dataset/graph/latest/athena/athena.py

Then run it with the options matching the dataset you want to load. For the full dataset:

python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/'

For the popular-4k teaser dataset:

python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular4k -l 's3://softwareheritage/teasers/popular-4k'

For the popular-3k-python teaser dataset:

python3 athena.py -o 's3://YOUR_OUTPUT_BUCKET/' -d popular3kpython -l 's3://softwareheritage/teasers/popular-3k-python'

To check that the tables have been successfully created in your account, you can open your Amazon Athena console. You should be able to select the database corresponding to your dataset, and see the tables:

[Figure: the tables listed in the Amazon Athena console (athena_tables.png)]

Running queries

From the console, once you have selected the database of your dataset, you can run SQL queries directly from the Query Editor.

For instance, try this query, which computes the most frequent file names in the archive:

SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 10;
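The same query can also be run outside the console. The sketch below submits it through an Athena client, waits for it to finish, and returns the rows as dictionaries. It assumes a boto3 Athena client is passed in; the function name, the polling interval and the database placeholder are illustrative choices, not part of the dataset tooling.

```python
import time

QUERY = """
SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 10
"""


def run_query(client, sql, database, output_location, poll_seconds=2.0):
    """Run one SQL statement on Athena and return its rows as a list of dicts."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)

    # Only the first page of results is read, which is enough for LIMIT 10.
    result = client.get_query_results(QueryExecutionId=query_id)
    rows = [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
    # For SELECT queries the first row holds the column names.
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]
```

With boto3 installed and credentials configured, `run_query(boto3.client("athena", region_name="us-east-1"), QUERY, "YOUR_DATABASE", "s3://YOUR_OUTPUT_BUCKET/")` would return up to ten dictionaries with the keys `name` and `cnt`, where `YOUR_DATABASE` stands for the database created by the script. All values come back as strings.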

Other examples are available in the preprint of our article: The Software Heritage Graph Dataset: Public software development under one roof.