Quickstart
This quick tutorial shows how to start the swh.graph service to query an existing compressed graph with the high-level HTTP API.
Dependencies
To run the swh.graph tool, you will need Python (>= 3.7), a Java JRE, Rust (>= 1.75), and zstd. On a Debian system:
$ sudo apt install build-essential libclang-dev python3 python3-venv default-jre zstd protobuf-compiler
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Rustup will ask you a few questions; you can pick the defaults. Alternatively, select “Customize installation” and install a minimal distribution if you are not planning to edit the code and want to save some disk space.
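Before moving on, it can be useful to confirm that the prerequisites are actually reachable from your shell. The following is a minimal sketch and not part of swh.graph: the helper names and the list of executable names (java, cargo, rustc, zstd, protoc) are assumptions derived from the packages installed above. It checks only for presence on the PATH and for the Python version, not for the Rust version.

```python
import shutil
import sys

# Executable names assumed to be provided by the apt packages and rustup above.
REQUIRED_TOOLS = ["java", "cargo", "rustc", "zstd", "protoc"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_recent_enough(version_info=sys.version_info, minimum=(3, 7)):
    """Compare the interpreter version with the minimum stated in this tutorial."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python >= 3.7 is required")
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"missing executable: {tool}")
```

If cargo or rustc is reported missing right after running rustup, opening a new shell so that the updated PATH is picked up is usually enough.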
Installing swh.graph
Install the swh_graph Rust package:
$ cargo install --git https://gitlab.softwareheritage.org/swh/devel/swh-graph.git --features grpc-server swh-graph
Or:
$ git clone https://gitlab.softwareheritage.org/swh/devel/swh-graph.git
$ cd swh-graph
$ cargo build --features grpc-server -p swh-graph
You now have a debug build of the gRPC server. (For a release build, use --release on the last command of each option and set the RUSTFLAGS="-C target-cpu=native" environment variable.)
Create a virtualenv and activate it:
$ python3 -m venv .venv
$ source .venv/bin/activate
Install the swh.graph Python package:
(venv) $ pip install swh.graph
[...]
(venv) $ swh graph --help
Usage: swh graph [OPTIONS] COMMAND [ARGS]...
  Software Heritage graph tools.
Options:
  -C, --config-file FILE  YAML configuration file
  -h, --help              Show this message and exit.
Commands:
  compress    Compress a graph using WebGraph
  download    Downloads a compressed SWH graph to the given target directory
  grpc-serve  start the graph GRPC service
  luigi       Calls Luigi with the given task and params, and...
  rpc-serve   run the graph RPC service
Alternatively, if you want to edit the swh-graph code, use these commands:
# get the code
$ git clone https://gitlab.softwareheritage.org/swh/devel/swh-graph.git
$ cd swh-graph
# build Rust backend (only if you need to modify the Rust code,
# or did not run `cargo install` above)
$ cargo build --release --features grpc-server -p swh-graph
# build Java backend (only if you need to load graphs created before 2024)
$ make java
# install Python package
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -e .
Retrieving a compressed graph
Software Heritage provides a list of off-the-shelf datasets that can be used for various research or prototyping purposes. Most of them are available in compressed representation, i.e., in a format suitable to be loaded and queried by the swh-graph library.
All the publicly available datasets are documented on this page: https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html
A good way of retrieving these datasets is to use the AWS S3 CLI, or the swh graph download command listed in the help output above.
Here is an example using the latter with the dataset 2021-03-23-popular-3k-python, which has a manageable size (~15 GiB including property data, with the compressed graph itself being less than 700 MiB):
(venv) $ swh graph download --name 2021-03-23-popular-3k-python 2021-03-23-popular-3k-python/compressed
You can also retrieve larger graphs, but note that these graphs are generally intended to be loaded fully in RAM and do not fit on ordinary desktop machines. The server we use in production to run the graph service has more than 700 GiB of RAM. These memory considerations are discussed in more detail in Memory & Performance tuning.
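As a rough illustration of the memory concern, the sketch below compares the on-disk size of a downloaded dataset directory with the physical RAM of the machine. It is a coarse heuristic under stated assumptions, not something swh.graph provides: the function names are hypothetical, the RAM query relies on sysconf names available on Linux, and the real memory footprint depends on which files are actually loaded.

```python
import os
from pathlib import Path


def dataset_size_bytes(directory):
    """Sum the sizes of all regular files below `directory`."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())


def physical_memory_bytes():
    """Total physical RAM; uses sysconf names that exist on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_memory(directory, memory_bytes=None):
    """Coarse check: is the on-disk dataset no larger than the given RAM budget?"""
    if memory_bytes is None:
        memory_bytes = physical_memory_bytes()
    return dataset_size_bytes(directory) <= memory_bytes
```

For example, fits_in_memory("2021-03-23-popular-3k-python/compressed") would report whether the directory used in the download command above is smaller than the machine's RAM.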
Note
For testing purposes, a synthetic test dataset with just a few dozen nodes is available in the swh-graph repository. Its basename is swh-graph/swh/graph/example_dataset/compressed/example.
API server
To start a swh.graph API server for a compressed graph dataset, use the rpc-serve command with the basename of the graph, which is the path prefix of all the graph files (e.g., with the basename compressed/graph, it will attempt to load the files located at compressed/graph.{graph,properties,offsets,...}).
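To make the notion of a basename concrete, here is a minimal sketch that lists the files sharing a given path prefix. The helper name files_for_basename is hypothetical and not part of swh.graph; it only illustrates how a basename such as compressed/graph maps to files named compressed/graph.<extension>.

```python
from pathlib import Path


def files_for_basename(basename):
    """List the files whose names start with `<basename>.`, sorted by name."""
    base = Path(basename)
    prefix = base.name + "."
    return sorted(
        p for p in base.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )


if __name__ == "__main__":
    # Assumes a dataset was downloaded with the basename used in this tutorial.
    if Path("compressed").is_dir():
        for path in files_for_basename("compressed/graph"):
            print(path)
```

Running it before starting the server is a quick way to see which files exist for the basename you are about to pass to rpc-serve.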
In our example:
(venv) $ swh graph rpc-serve -g compressed/graph
Started GRPC using dataset from swh/graph/example_dataset/compressed/example
['/home/dev/.cargo/bin/swh-graph-grpc-serve', '-vv', '--bind', '[::]:50867', 'compressed/graph']
INFO:swh.graph.grpc_server:Starting gRPC server: /home/dev/.cargo/bin/swh-graph-grpc-serve -vv --bind '[::]:50867' compressed/graph
2024-06-18T09:12:40+02:00 - INFO - Loading graph
2024-06-18T09:12:40+02:00 - INFO - Loading properties
2024-06-18T09:12:40+02:00 - INFO - Loading labels
2024-06-18T09:12:40+02:00 - INFO - Starting server
======== Running on http://0.0.0.0:5009 ========
(Press CTRL+C to quit)
If you get an error about a missing .cmph, .bin, .bits, or .ef file (typically for graphs created before 2024), you need to generate it with:
swh graph reindex compressed/graph
If instead you get an error about an invalid hash in a .ef file, it means your swh-graph expects a different version of the .ef files than the one you have locally.
You need to regenerate them for your version:
swh graph reindex --ef compressed/graph
Then try again.
From there you can use this endpoint to query the compressed graph, for example with httpie (sudo apt install httpie):
~/tmp$ http :5009/graph/leaves/swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Tue, 15 Sep 2020 08:35:19 GMT
Server: Python/3.8 aiohttp/3.6.2
Transfer-Encoding: chunked
swh:1:cnt:33af56e02dd970873d8058154bf016ec73b35dfb
swh:1:cnt:b03b4ffd7189ae5457d8e1c2ee0490b1938fd79f
swh:1:cnt:74d127c2186f7f0e8b14a27249247085c49d548a
swh:1:cnt:c0139aa8e79b338e865a438326629fa22fa8f472
[...]
swh:1:cnt:a6b60e797063fef707bbaa4f90cfb4a2cbbddd4a
swh:1:cnt:cc0a1deca559c1dd2240c08156d31cde1d8ed406
See the documentation of the API for more details on how to use the HTTP graph querying API.
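The same query can be issued from Python. The sketch below uses only the standard library and assumes the server from the example above is listening on port 5009 on the local machine; the helper names (leaves_url, parse_swhids, fetch_leaves) are hypothetical and not part of swh.graph. It validates the SWHID syntax, builds the leaves URL, and splits the text/plain response into one identifier per line.

```python
import re
from urllib.request import urlopen

# Core SWHID syntax: scheme, version 1, object type, 40 lowercase hex digits.
SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp|ori):[0-9a-f]{40}")


def leaves_url(swhid, base_url="http://localhost:5009"):
    """Build the URL of the `leaves` endpoint used in the httpie example."""
    if not SWHID_RE.fullmatch(swhid):
        raise ValueError(f"not a valid SWHID: {swhid!r}")
    return f"{base_url}/graph/leaves/{swhid}"


def parse_swhids(body):
    """Split a text/plain response body into one SWHID per non-empty line."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def fetch_leaves(swhid, base_url="http://localhost:5009"):
    """Query a running server; requires `swh graph rpc-serve` to be up."""
    with urlopen(leaves_url(swhid, base_url)) as response:
        return parse_swhids(response.read().decode("utf-8"))
```

For instance, fetch_leaves("swh:1:dir:432d1b21c1256f7408a07c577b6974bbdbcc1323") would return the list of content identifiers shown in the httpie output above, provided the server is running and the loaded graph contains that directory.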