Setup on a PostgreSQL instance

This tutorial will guide you through the steps required to setup the Software Heritage Graph Dataset in a PostgreSQL database.

PostgreSQL local setup

You need to have access to a running PostgreSQL instance to load the dataset. This section contains information on how to setup PostgreSQL for the first time.

If you already have a PostgreSQL server running on your machine, you can skip to the next section.

  • For Ubuntu and Debian:

    sudo apt install postgresql
    
  • For Archlinux:

    sudo pacman -S --needed postgresql
    sudo -u postgres initdb -D '/var/lib/postgres/data'
    sudo systemctl enable --now postgresql
    

Once PostgreSQL is running, you also need an user that will be able to create databases and run queries. The easiest way to achieve that is simply to create an account that has the same name as your username and that can create databases:

sudo -u postgres createuser --createdb $USER

Retrieving the dataset

You need to download the dataset in SQL format. Use the following command on your machine, after making sure that it has enough available space for the dataset you chose:

mkdir swhgd && cd swhgd
wget -c -q --show-progress -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/sql/
mkdir popular-4k && cd popular-4k
wget -c -q --show-progress -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-4k/sql/
mkdir popular-3k-python && cd popular-3k-python
wget -c -q --show-progress -A gz,sql -nd -r -np -nH https://annex.softwareheritage.org/public/dataset/graph/2019-01-28/popular-3k-python/sql/

Loading the dataset

Once you have retrieved the dataset of your choice, create a database that will contain it, and load the database:

createdb swhgd
psql swhgd < load.sql
createdb swhgd-popular-4k
psql swhgd-popular-4k < load.sql
createdb swhgd-popular-3k-python
psql swhgd-popular-3k-python < load.sql

You can now run SQL queries on your database. Run psql <database_name> to start an interactive PostgreSQL console.