.. highlight:: bash
.. _getting-started:
Run your own Software Heritage
==============================
This tutorial will guide from the basic step of obtaining the source code of
the Software Heritage stack to running a local copy of it with which you can
archive source code and browse it on the web. To that end, just follow the
steps detailed below.
.. warning::
Running a Software Heritage instance on your machine can
consume quite a bit of resources: if you play a bit too hard (e.g., if
you try to list all GitHub repositories with the corresponding lister),
you may fill your hard drive, and consume a lot of CPU, memory and
network bandwidth.
Dependencies
------------
The easiest way to run a Software Heritage instance is to use Docker.
Please `ensure that you have a working recent installation first
`_ (including the
`Compose `_ plugin).
Quick start
-----------
First, retrieve Software Heritage development environment to get the
Docker configuration:
.. code-block:: console
~$ git clone https://gitlab.softwareheritage.org/swh/devel/docker.git swh-docker
~$ cd swh-docker
.. note::
If you intend to hack on Software Heritage source code and test your changes with docker,
you should rather follow the instructions in section :ref:`checkout-source-code` to
install the full Software Heritage development environment that includes Docker configuration.
Then, start containers:
.. code-block:: console
~/swh-docker$ docker compose up -d
[...]
Creating docker_amqp_1 ... done
Creating docker_zookeeper_1 ... done
Creating docker_kafka_1 ... done
Creating docker_flower_1 ... done
Creating docker_swh-scheduler-db_1 ... done
[...]
This will build Docker images and run them. Check everything is running
fine with:
.. code-block:: console
~/swh-docker$ docker compose ps
Name Command State Ports
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
docker_amqp_1 docker-entrypoint.sh rabbi ... Up 15671/tcp, 0.0.0.0:5018->15672/tcp, 25672/tcp, 4369/tcp, 5671/tcp, 5672/tcp
docker_flower_1 flower --broker=amqp://gue ... Up 0.0.0.0:5555->5555/tcp
docker_kafka_1 start-kafka.sh Up 0.0.0.0:5092->5092/tcp
docker_swh-deposit-db_1 docker-entrypoint.sh postgres Up 5432/tcp
docker_swh-deposit_1 /entrypoint.sh Up 0.0.0.0:5006->5006/tcp
[...]
The startup of some containers may fail the first time for
dependency-related problems. If some containers failed to start, just
run the ``docker compose up -d`` command again.
If a container really refuses to start properly, you can check why using
the ``docker compose logs`` command. For example:
.. code-block:: console
~/swh-docker$ docker compose logs swh-lister
Attaching to docker_swh-lister_1
[...]
swh-lister_1 | Processing /src/swh-scheduler
swh-lister_1 | Could not install packages due to an EnvironmentError: [('/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz', '/tmp/pip-req-build-pm7nsax3/.hypothesis/unicodedata/8.0.0/charmap.json.gz', "[Errno 13] Permission denied: '/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz'")]
swh-lister_1 |
.. note::
For details on the various Docker images and how to work with them,
see the full :ref:`docker-environment` documentation.
Once all containers are running, you can use the web interface by opening
http://localhost:/ in your web browser. ```` is the
port on which nginx is exposed to the host. By default, it is randomly
attributed by docker. Use:
.. code-block:: console
~/swh-docker$ docker compose port nginx 80
To find which port is actually used.
.. note::
Please read the "Exposed Ports" section of the README file in the
`swh-docker`_ repository for more details and options on this topic.
.. _`swh-docker`: https://gitlab.softwareheritage.org/swh/devel/docker.git
At this point, the archive is empty and needs to be filled with some
content. The simplest way to start loading software is to use the
*Save Code Now* feature of the archive web interface:
http://localhost:/browse/origin/save/
You can also use the command line interface to inject code. For
example to retrieve projects hossted on the https://0xacab.org GitLab forge:
.. code-block:: console
~/swh-docker$ docker compose exec swh-scheduler \
swh scheduler task add list-gitlab-full \
-p oneshot url=https://0xacab.org/api/v4
Created 1 tasks
Task 1
Next run: just now (2018-12-19 14:58:49+00:00)
Interval: 90 days, 0:00:00
Type: list-gitlab-full
Policy: oneshot
Args:
Keyword args:
url=https://0xacab.org/api/v4
This task will scrape the forge’s project list and register origins to the scheduler.
This takes at most a couple of minutes.
Then, you must tell the scheduler to create loading tasks for these origins.
For example, to create tasks for 100 of these origins:
.. code-block:: console
~/swh-docker$ docker compose exec swh-scheduler \
swh scheduler origin schedule-next git 100
This will take a bit of time to complete.
To increase the speed at which git repositories are imported, you can
spawn more ``swh-loader-git`` workers:
.. code-block:: console
~/swh-docker$ docker compose exec swh-scheduler \
celery status
listers@50ac2185c6c9: OK
loader@b164f9055637: OK
indexer@33bc6067a5b8: OK
vault@c9fef1bbfdc1: OK
4 nodes online.
~/swh-docker$ docker compose exec swh-scheduler \
celery control pool_grow 3 -d loader@b164f9055637
-> loader@b164f9055637: OK
pool will grow
~/swh-docker$ docker compose exec swh-scheduler \
celery inspect -d loader@b164f9055637 stats | grep prefetch_count
"prefetch_count": 4
Now there are 4 workers ingesting git repositories. You can also
increase the number of ``swh-loader-git`` containers:
.. code-block:: console
~/swh-docker$ docker compose up -d --scale swh-loader=4
[...]
Creating docker_swh-loader_2 ... done
Creating docker_swh-loader_3 ... done
Creating docker_swh-loader_4 ... done
Updating the docker image
-------------------------
All containers started by ``docker compose`` are bound to a docker image
named ``swh/stack`` including all the software components of Software
Heritage. When new versions of these components are released, the docker
image will not be automatically updated. In order to update all Software
Heritage components to their latest version, the docker image needs to
be explicitly rebuilt by issuing the following command from within the
``docker`` directory:
.. code-block:: console
~/swh-docker$ docker build --no-cache -t swh/stack .
Monitor your local installation
-------------------------------
You can monitor your local installation by looking at:
- http://localhost:/rabbitmq to access the rabbitmq dashboard (guest/guest),
- http://localhost:/grafana to explore the platform's metrics (admin/admin),
Shut down your local installation
---------------------------------
To shut down your SoftWare Heritage, just run:
.. code-block:: console
~/swh-docker$ docker compose down
Hacking the archive
-------------------
If you want to hack the code of the Software Heritage Archive, a more involved
setup is required described in the :ref:`developer setup
guide `.