Hacking on swh-indexer

This tutorial will guide you through hacking on swh-indexer. If you do not have a local copy of the Software Heritage archive, go to the getting started tutorial first.
Configuration files
You will need the following YAML configuration files to run the swh-indexer commands:
Orchestrator at ~/.config/swh/indexer/orchestrator.yml:

indexers:
  mimetype:
    check_presence: false
    batch_size: 100
Orchestrator-text at ~/.config/swh/indexer/orchestrator-text.yml:

indexers:
  fossology_license:
    batch_size: 10
    check_presence: false
Mimetype indexer at ~/.config/swh/indexer/mimetype.yml:

# storage to read sha1's metadata (path)
# storage:
#   cls: local
#   db: "service=swh-dev"
#   objstorage:
#     cls: pathslicing
#     root: /home/storage/swh-storage/
#     slicing: 0:1/1:5
storage:
  cls: remote
  url: http://localhost:5002/
indexer_storage:
  cls: remote
  args:
    url: http://localhost:5007/
# storage to read sha1's content
# adapt this to your needs
# locally: this needs to match your storage's setup
objstorage:
  cls: pathslicing
  slicing: 0:1/1:5
  root: /home/storage/swh-storage/
destination_task: swh.indexer.tasks.SWHOrchestratorTextContentsTask
rescheduling_task: swh.indexer.tasks.SWHContentMimetypeTask
Fossology indexer at ~/.config/swh/indexer/fossology_license.yml:

# storage to read sha1's metadata (path)
# storage:
#   cls: local
#   db: "service=swh-dev"
#   objstorage:
#     cls: pathslicing
#     root: /home/storage/swh-storage/
#     slicing: 0:1/1:5
storage:
  cls: remote
  url: http://localhost:5002/
indexer_storage:
  cls: remote
  args:
    url: http://localhost:5007/
# storage to read sha1's content
# adapt this to your needs
# locally: this needs to match your storage's setup
objstorage:
  cls: pathslicing
  slicing: 0:1/1:5
  root: /home/storage/swh-storage/
workdir: /tmp/swh/worker.indexer/license/
tools:
  name: 'nomos'
  version: '3.1.0rc2-31-ga2cbb8c'
  configuration:
    command_line: 'nomossa <filepath>'
Worker at ~/.config/swh/worker.yml:

task_broker: amqp://guest@localhost//
task_modules:
  - swh.loader.svn.tasks
  - swh.loader.tar.tasks
  - swh.loader.git.tasks
  - swh.storage.archiver.tasks
  - swh.indexer.tasks
  - swh.indexer.orchestrator
task_queues:
  - swh_loader_svn
  - swh_loader_tar
  - swh_reader_git_to_azure_archive
  - swh_storage_archive_worker_to_backend
  - swh_indexer_orchestrator_content_all
  - swh_indexer_orchestrator_content_text
  - swh_indexer_content_mimetype
  - swh_indexer_content_fossology_license
  - swh_loader_svn_mount_and_load
  - swh_loader_git_express
  - swh_loader_git_archive
  - swh_loader_svn_archive
task_soft_time_limit: 0
Database

swh-indexer uses a database to store the indexed content. The default database is expected to be called swh-indexer-dev.

Create or add swh-dev and swh-indexer-dev to the ~/.pg_service.conf and ~/.pgpass files, which are PostgreSQL's configuration files.
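As an illustration, assuming a local PostgreSQL on the default port and a hypothetical user swh (adapt the user, host, and password values to your setup), ~/.pg_service.conf could contain:

[swh-dev]
dbname=swh-dev
host=localhost
port=5432
user=swh

[swh-indexer-dev]
dbname=swh-indexer-dev
host=localhost
port=5432
user=swh

and ~/.pgpass (which must have 0600 permissions) the matching lines:

localhost:5432:swh-dev:swh:<password>
localhost:5432:swh-indexer-dev:swh:<password>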
Add data to local DB

From within the swh-environment, run the following command:
make rebuild-testdata
and fetch some real data to work with, using:
python3 -m swh.loader.git.updater --origin-url <github url>
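For example, with any public Git repository (the URL below is only an illustration):

python3 -m swh.loader.git.updater --origin-url https://github.com/SoftwareHeritage/swh-indexer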
Then you can list all content files using this script (save it as list-sha1.sh, it is used below):
#!/usr/bin/env bash
psql service=swh-dev -c "copy (select sha1 from content) to stdout" | sed -e 's/^\\x//g'
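To sanity-check the output (assuming you saved the script as list-sha1.sh and made it executable):

chmod +x list-sha1.sh
./list-sha1.sh | head -n 5

Each line should be a bare 40-character hex sha1, with the leading \x prefix stripped by the sed call.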
Run the indexers

Use the list of contents to feed the indexers with the following command:
./list-sha1.sh | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all
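For a first try, you can feed a smaller sample by capping the list with head (the producer invocation itself is unchanged):

./list-sha1.sh | head -n 1000 | python3 -m swh.indexer.producer --batch 100 --task-name orchestrator_all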
Activate the workers

To send messages to the different queues through RabbitMQ (which should already be installed as a dependency), run the following command in a dedicated terminal:
python3 -m celery worker --app=swh.scheduler.celery_backend.config.app \
--pool=prefork \
--concurrency=1 \
-Ofair \
--loglevel=info \
--without-gossip \
--without-mingle \
--without-heartbeat 2>&1
With this command, the Celery worker will consume messages from RabbitMQ according to the worker configuration file.
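Once the workers have processed a few batches, you can check that results are being stored; this assumes the default swh-indexer table names (content_mimetype, content_fossology_license):

psql service=swh-indexer-dev -c "select count(*) from content_mimetype;"
psql service=swh-indexer-dev -c "select count(*) from content_fossology_license;"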
Note: for the fossology_license indexer, you need the fossology-nomossa package, which is available in our public Debian repository.
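Assuming that Debian repository is already configured in your APT sources, install it the usual way:

sudo apt update
sudo apt install fossology-nomossa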