Run your own Software Heritage#
This tutorial will guide from the basic step of obtaining the source code of the Software Heritage stack to running a local copy of it with which you can archive source code and browse it on the web. To that end, just follow the steps detailed below.
Warning
Running a Software Heritage instance on your machine can consume quite a bit of resources: if you play a bit too hard (e.g., if you try to list all GitHub repositories with the corresponding lister), you may fill your hard drive, and consume a lot of CPU, memory and network bandwidth.
Dependencies#
The easiest way to run a Software Heritage instance is to use Docker. Please ensure that you have a working recent installation first (including the Compose plugin).
Quick start#
First, retrieve Software Heritage development environment to get the Docker configuration:
~$ git clone https://gitlab.softwareheritage.org/swh/devel/docker.git swh-docker
~$ cd swh-docker
Note
If you intend to hack on Software Heritage source code and test your changes with docker, you should rather follow the instructions in section Checkout the source code to install the full Software Heritage development environment that includes Docker configuration.
Then, start containers:
~/swh-docker$ docker compose up -d
[...]
Creating docker_amqp_1 ... done
Creating docker_zookeeper_1 ... done
Creating docker_kafka_1 ... done
Creating docker_flower_1 ... done
Creating docker_swh-scheduler-db_1 ... done
[...]
This will build Docker images and run them. Check everything is running fine with:
~/swh-docker$ docker compose ps
Name Command State Ports
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
docker_amqp_1 docker-entrypoint.sh rabbi ... Up 15671/tcp, 0.0.0.0:5018->15672/tcp, 25672/tcp, 4369/tcp, 5671/tcp, 5672/tcp
docker_flower_1 flower --broker=amqp://gue ... Up 0.0.0.0:5555->5555/tcp
docker_kafka_1 start-kafka.sh Up 0.0.0.0:5092->5092/tcp
docker_swh-deposit-db_1 docker-entrypoint.sh postgres Up 5432/tcp
docker_swh-deposit_1 /entrypoint.sh Up 0.0.0.0:5006->5006/tcp
[...]
The startup of some containers may fail the first time for
dependency-related problems. If some containers failed to start, just
run the docker compose up -d
command again.
If a container really refuses to start properly, you can check why using
the docker compose logs
command. For example:
~/swh-docker$ docker compose logs swh-lister
Attaching to docker_swh-lister_1
[...]
swh-lister_1 | Processing /src/swh-scheduler
swh-lister_1 | Could not install packages due to an EnvironmentError: [('/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz', '/tmp/pip-req-build-pm7nsax3/.hypothesis/unicodedata/8.0.0/charmap.json.gz', "[Errno 13] Permission denied: '/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz'")]
swh-lister_1 |
Note
For details on the various Docker images and how to work with them, see the full Docker environment documentation.
Once all containers are running, you can use the web interface by opening
http://localhost:<nginx-port>/ in your web browser. <nginx-port>
is the
port on which nginx is exposed to the host. By default, it is randomly
attributed by docker. Use:
~/swh-docker$ docker compose port nginx 80
To find which port is actually used.
Note
Please read the “Exposed Ports” section of the README file in the swh-docker repository for more details and options on this topic.
At this point, the archive is empty and needs to be filled with some content. The simplest way to start loading software is to use the Save Code Now feature of the archive web interface:
http://localhost:<nginx-port>/browse/origin/save/
You can also use the command line interface to inject code. For example to retrieve projects hossted on the https://0xacab.org GitLab forge:
~/swh-docker$ docker compose exec swh-scheduler \
swh scheduler task add list-gitlab-full \
-p oneshot url=https://0xacab.org/api/v4
Created 1 tasks
Task 1
Next run: just now (2018-12-19 14:58:49+00:00)
Interval: 90 days, 0:00:00
Type: list-gitlab-full
Policy: oneshot
Args:
Keyword args:
url=https://0xacab.org/api/v4
This task will scrape the forge’s project list and register origins to the scheduler. This takes at most a couple of minutes.
Then, you must tell the scheduler to create loading tasks for these origins. For example, to create tasks for 100 of these origins:
~/swh-docker$ docker compose exec swh-scheduler \
swh scheduler origin schedule-next git 100
This will take a bit of time to complete.
To increase the speed at which git repositories are imported, you can
spawn more swh-loader-git
workers:
~/swh-docker$ docker compose exec swh-scheduler \
celery status
listers@50ac2185c6c9: OK
loader@b164f9055637: OK
indexer@33bc6067a5b8: OK
vault@c9fef1bbfdc1: OK
4 nodes online.
~/swh-docker$ docker compose exec swh-scheduler \
celery control pool_grow 3 -d loader@b164f9055637
-> loader@b164f9055637: OK
pool will grow
~/swh-docker$ docker compose exec swh-scheduler \
celery inspect -d loader@b164f9055637 stats | grep prefetch_count
"prefetch_count": 4
Now there are 4 workers ingesting git repositories. You can also
increase the number of swh-loader-git
containers:
~/swh-docker$ docker compose up -d --scale swh-loader=4
[...]
Creating docker_swh-loader_2 ... done
Creating docker_swh-loader_3 ... done
Creating docker_swh-loader_4 ... done
Updating the docker image#
All containers started by docker compose
are bound to a docker image
named swh/stack
including all the software components of Software
Heritage. When new versions of these components are released, the docker
image will not be automatically updated. In order to update all Software
Heritage components to their latest version, the docker image needs to
be explicitly rebuilt by issuing the following command from within the
docker
directory:
~/swh-docker$ docker build --no-cache -t swh/stack .
Monitor your local installation#
You can monitor your local installation by looking at:
http://localhost:<nginx-port>/rabbitmq to access the rabbitmq dashboard (guest/guest),
http://localhost:<nginx-port>/grafana to explore the platform’s metrics (admin/admin),
Shut down your local installation#
To shut down your SoftWare Heritage, just run:
~/swh-docker$ docker compose down
Hacking the archive#
If you want to hack the code of the Software Heritage Archive, a more involved setup is required described in the developer setup guide.