Graph Docker environment


$ git clone
$ cd swh-graph
$ docker build --tag swh-graph dockerfiles


Given a graph g specified by:

  • g.edges.csv.gz: gzip-compressed CSV file with one edge per line, as a “SRC_ID SPACE DST_ID” string, where the identifiers are the persistent identifiers of each node.
  • g.nodes.csv.gz: sorted list of the unique node identifiers appearing in the corresponding g.edges.csv.gz file, as a gzip-compressed CSV file with one persistent identifier per line.

you can start a container with the graph directory mounted as follows:

$ docker run -ti \
    --volume /PATH/TO/GRAPH/:/srv/softwareheritage/graph/data \
    --publish \
    swh-graph:latest \

Here, /PATH/TO/GRAPH is a directory containing the g.edges.csv.gz and g.nodes.csv.gz files. By default, the working directory on entering the container is /srv/softwareheritage/graph; all relative paths below are relative to that directory.
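
As an illustration of the two input files described above, the following sketch builds a toy g.edges.csv.gz and derives g.nodes.csv.gz from it. It is not part of swh-graph: the fake_swhid helper, the scratch directory and the identifiers are made up for the example, and byte-wise ordering (LC_ALL=C) is an assumption about the expected sort order.

```shell
# Toy example: build g.edges.csv.gz, then derive g.nodes.csv.gz from it.
set -eu

# Hypothetical helper: prints a placeholder identifier of the form
# swh:1:<type>:<40 digits>. These are not real archive objects.
fake_swhid() { printf 'swh:1:%s:%040d' "$1" "$2"; }

workdir=$(mktemp -d)   # scratch directory standing in for /PATH/TO/GRAPH
cd "$workdir"

# One edge per line: SRC_ID, a single space, DST_ID.
{
    printf '%s %s\n' "$(fake_swhid rev 2)" "$(fake_swhid dir 1)"
    printf '%s %s\n' "$(fake_swhid dir 1)" "$(fake_swhid cnt 3)"
} | gzip > g.edges.csv.gz

# Split each edge into its two endpoints, then sort and deduplicate.
# LC_ALL=C (byte-wise order) is an assumption, see the note above.
gzip -dc g.edges.csv.gz | tr ' ' '\n' | LC_ALL=C sort -u | gzip > g.nodes.csv.gz

# Show the resulting node list, one identifier per line.
gzip -dc g.nodes.csv.gz
```

The directory holding the two resulting files is what would be mounted at /srv/softwareheritage/graph/data with the --volume option.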

Graph compression

To compress the graph:

$ app/scripts/ --lib lib/ --input data/g

Warning: very large graphs may need a larger batch size for WebGraph internals. You can set it when running the compression script, for example with --batch-size 1000000000.

Node identifier mappings

To dump the mapping files (i.e., the files mapping between node ids and other information, in either .csv.gz or ad-hoc .map format):

$ java -cp app/swh-graph.jar \
    org.softwareheritage.graph.backend.Setup \
    data/g.nodes.csv.gz data/compressed/g

Graph server

To start the swh-graph server:

$ java -cp app/swh-graph.jar \
    org.softwareheritage.graph.App data/compressed/g

To specify the port on which the server will run, use the --port or -p flag (default is 5009).