Setup on Amazon Athena
The pricing of Athena depends on the amount of data scanned by each query, generally at a cost of $5 per TiB of data scanned. Full pricing details are available here.
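As a rough illustration of this pricing model, the sketch below estimates the cost of a query from the number of bytes it scans. The $5 per TiB rate is the figure quoted above, and the function name is hypothetical. Actual billing may differ (for instance through rounding rules), so treat the result as an estimate only.

```python
TIB = 2 ** 40  # bytes in one tebibyte
PRICE_PER_TIB_USD = 5.0  # rate quoted above; check the current pricing page


def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the approximate cost in USD of scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / TIB * PRICE_PER_TIB_USD


if __name__ == "__main__":
    # A query scanning 200 GiB costs a little under one dollar at this rate.
    print(f"${estimate_query_cost(200 * 2 ** 30):.2f}")
```

Because cost grows linearly with the data scanned, selecting only the columns you need is usually the simplest way to keep queries on Parquet files cheap.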
Note that because the Software Heritage Graph Dataset is available as a public dataset, you do not have to pay for the storage, only for the queries (except for the data you store on S3 yourself, like query results).
Loading the tables
In order to use Amazon Athena, you will first need to create an AWS account and set up billing.
You will also need to create an output S3 bucket: this is where Athena will store your query results, so that you can retrieve and analyze them afterwards. To do that, go to the S3 console and create a new bucket.
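If you prefer to create the bucket programmatically rather than through the console, something like the following sketch may work. It assumes boto3 is installed and AWS credentials are already configured; the helper name and the bucket name in the usage comment are hypothetical, and bucket names must be globally unique.

```python
def create_output_bucket(s3_client, bucket_name: str, region: str) -> None:
    """Create an S3 bucket intended to hold Athena query results.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    if region == "us-east-1":
        # us-east-1 is the default location, and the bucket
        # configuration is normally omitted for it.
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   create_output_bucket(
#       boto3.client("s3", region_name="us-east-1"),
#       "my-athena-results-example",  # hypothetical bucket name
#       "us-east-1",
#   )
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real account.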
Athena needs to be made aware of the location and the schema of the Parquet files available as a public dataset. Unfortunately, since Athena does not support queries that contain multiple commands, it is not as simple as pasting an installation script in the console. Instead, we provide a Python script that you can run locally on your machine; it communicates with Athena to create the tables automatically with the appropriate schema.
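To illustrate the one-command-per-query constraint, here is a minimal sketch of how a client could submit statements one at a time and wait for each to finish. It is not the provided script: the function names are hypothetical, and `athena` is assumed to behave like `boto3.client("athena")`, using its `start_query_execution` and `get_query_execution` calls.

```python
import time


def run_statement(athena, sql: str, output_location: str,
                  poll_seconds: float = 1.0) -> str:
    """Submit one SQL statement and block until it finishes.

    Returns the query execution ID on success.
    """
    response = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = response["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Query {query_id} ended in state {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


def run_statements(athena, statements, output_location: str,
                   poll_seconds: float = 1.0) -> list:
    """Run statements strictly one after another, one per request."""
    return [
        run_statement(athena, sql, output_location, poll_seconds)
        for sql in statements
    ]


# Example usage (requires boto3 and configured AWS credentials; the
# database and bucket names are hypothetical):
#   import boto3
#   run_statements(
#       boto3.client("athena", region_name="us-east-1"),
#       ["CREATE DATABASE IF NOT EXISTS example_db"],
#       "s3://my-athena-results-example/",
#   )
```

Waiting for each statement before sending the next matters when later statements depend on earlier ones, for example when a table is created in a database that must already exist.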
To run this script, you will need to install a few dependencies on your machine:
For Ubuntu and Debian:
sudo apt install python3 python3-boto3 awscli
For Arch Linux:
sudo pacman -S --needed python python-boto3 aws-cli
Once the dependencies are installed, run:
aws configure
This will ask for an AWS Access Key ID and an AWS Secret Access Key in order to give Python access to your AWS account. These keys can be generated at this address.
It will also ask for the region in which you want to run the queries. We recommend using us-east-1, since that’s where the public dataset is located.
Creating the tables
Download and run the Python script that will create the tables on your account:
To check that the tables have been successfully created in your account, you can open your Amazon Athena console. You should be able to select the database corresponding to your dataset and see the tables listed there.
From the console, once you have selected the database of your dataset, you can run SQL queries directly from the Query Editor.
Try for instance this query that computes the most frequent file names in the archive:
SELECT from_utf8(name, '?') AS name, COUNT(DISTINCT target) AS cnt
FROM directory_entry_file
GROUP BY name
ORDER BY cnt DESC
LIMIT 10;
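If you run such a query through boto3 instead of the console, the results come back as nested dictionaries. The sketch below converts one page of results into a list of rows keyed by column name. The helper name is hypothetical; it assumes the response shape of `boto3.client("athena").get_query_results(...)` and that, for a SELECT query, the first row of the first page holds the column names.

```python
def rows_from_results(results: dict) -> list:
    """Turn one page of Athena results into a list of {column: value} dicts.

    All values are returned as strings (or None for NULL cells), exactly
    as Athena reports them; convert types yourself where needed.
    """
    rows = results["ResultSet"]["Rows"]
    if not rows:
        return []
    # Assumption: the first row carries the column headers.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


# Example usage (requires boto3, credentials, and a finished query ID):
#   import boto3
#   athena = boto3.client("athena", region_name="us-east-1")
#   page = athena.get_query_results(QueryExecutionId=query_id)
#   for row in rows_from_results(page):
#       print(row["name"], row["cnt"])
```

For the query above, each row would expose the `name` and `cnt` columns. Larger result sets are paginated, so a complete client would also need to follow the `NextToken` field and skip the header only on the first page.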
Other examples are available in the preprint of our article: The Software Heritage Graph Dataset: Public software development under one roof.