Configuration#
The configuration for the Software Heritage Filesystem resides in the swh > fuse section of the shared YAML configuration file used by all Software Heritage tools, located by default at ~/.config/swh/global.yml (following the XDG Base Directory specification).
You can override this path on the command line via the -C/--config-file flag.
You can choose how swh-fuse will fetch content from the archive.
The default and simplest way is to query the SWH public API.
This method can be configured with the following block:
web-api:
  url: archive API URL (https://archive.softwareheritage.org/api/1/)
  auth-token: (optional, but recommended) authentication token used with the API URL
swh-fuse will also search for the following fields:
cache: a section that can contain:
  metadata: a dict configuring where to store the metadata cache. It can contain either an in-memory boolean entry set to true, or a path string entry pointing to the file.
  blob: a dict configuring where to store the blob cache, with the same entries as metadata. If the dict contains a bypass entry set to true, this cache is disabled entirely; this can be useful in the HPC setting (see below).
  direntry: how much memory the direntry cache may use, specified with a maxram entry (either as a percentage of available RAM, or as a size with one of the unit suffixes B, KB, MB, GB).
json-indent: number of spaces used to print JSON metadata files. Setting it to null disables indentation.
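As an illustration of the fields above, here is a minimal sketch of a non-default cache setup. It assumes you want the metadata cache held in memory, the blob cache kept on disk, a fixed-size direntry cache, and unindented JSON; the 512MB value is an arbitrary example, the blob path is only a placeholder, and all other keys are left out for brevity.

```yaml
swh:
  fuse:
    cache:
      metadata:
        # keep the metadata cache in memory instead of in a file
        in-memory: true
      blob:
        # placeholder path: point this at any writable location
        path: "/home/user/.cache/swh/fuse/blob.sqlite"
      direntry:
        # absolute size with a unit suffix instead of a percentage of RAM
        maxram: "512MB"
    # null disables indentation of JSON metadata files
    json-indent: null
```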
Example#
Here is a full ~/.config/swh/global.yml equivalent to the default configuration:
swh:
  fuse:
    cache:
      metadata:
        path: "/home/user/.cache/swh/fuse/metadata.sqlite"
      blob:
        path: "/home/user/.cache/swh/fuse/blob.sqlite"
      direntry:
        maxram: "10%"
    web-api:
      url: "https://archive.softwareheritage.org/api/1/"
      auth-token: "eyJhbGciOiJIUzI1NiIsInR5cCIgOiAiSldUIiwia2..."
    json-indent: 2
Logging#
The default logging level is set to INFO and can be configured with the SWH_LOG_LEVEL environment variable, or through the shared command line interface via the -l/--log-level flag.
$ swh --log-level swh.fuse:DEBUG fs mount swhfs/ -f
Monitoring#
swh-fuse sends statsd metrics to localhost:8125 by default.
This can be changed from environment variables, in particular STATSD_HOST and STATSD_PORT.
Expect the following metrics:
swh_fuse.graph_response_time: a timer measuring how long we are waiting for the graph backend
swh_fuse.storage_response_time: a timer measuring how long we are waiting for the storage backend
swh_fuse.objstorage_response_time: a timer measuring how long we are waiting for the objstorage (contents) backend
Those can also be aggregated to show the number of requests made to each backend.
Faster file system traversal with a local compressed graph#
In order to traverse the folder hierarchy much faster, connect to a compressed graph via its gRPC API.
To do so, install with the hpc dependency group:
$ pip install swh-fuse[hpc]
Then, this can be enabled with the following configuration section:
graph:
  grpc-url: URL to the graph’s gRPC server.
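For instance, the section could look like the sketch below once placed under swh > fuse. The host and port are placeholders for wherever your own gRPC server is listening, not defaults.

```yaml
swh:
  fuse:
    graph:
      # placeholder address: replace with your swh-graph gRPC endpoint
      grpc-url: localhost:50091
```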
If that server instance is only used for swh-fuse, then since version 6.7.2 of swh-graph you can start the gRPC server with the --direction=forward option, in which case you do not need any graph*transposed* files.
Note
If you don’t need to read revision and release information (which we usually put in meta.json), then you also do not need to download or store the whole compressed graph.
The following files are enough, halving the required storage:
graph.ef
graph.graph
graph-labelled.ef
graph-labelled.labeloffsets
graph-labelled.labels
graph-labelled.properties
graph.labels.fcl.bytearray
graph.labels.fcl.pointers
graph.labels.fcl.properties
graph.node2swhid.bin
graph.node2type.bin
graph.properties
graph.property.content.is_skipped.bits
graph.property.content.length.bin
graph.pthash
graph.pthash.order
Configuring files’ download#
What follows also requires the hpc dependency group:
$ pip install swh-fuse[hpc]
You can configure how swh-fuse will download files’ content with the following section:
content:
  storage: a usual storage configuration, like:
    cls: remote
    url: http://localhost:8080
  objstorage: a usual objstorage configuration, like:
    cls: remote
    url: http://localhost:8080
objstorage is optional, as the storage service may be able to provide files’ contents, but this will probably be slower.
When objstorage is provided, storage is called only to map SWHIDs to the corresponding sets of content hashes: you’ll probably want to set cls: digestmap, provided by the package swh.digestmap.
It has been developed for that case and will be the fastest back-end.
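A storage-only setup, where file contents are also fetched through the storage service, might be sketched as follows. The URL is the placeholder used above and should be replaced with your own storage endpoint.

```yaml
swh:
  fuse:
    content:
      # no objstorage entry: contents are requested from storage as well
      storage:
        cls: remote
        url: http://localhost:8080
```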
Sample configuration: Large-scale access on a dedicated HPC#
If you plan to use swh-fuse on a dedicated cluster containing an archive replica (as in the CodeCommons project), you can connect swh-fuse to a compressed graph and also to local storage and objstorage instances as follows.
In that case we can disable the cache entirely, to save memory on the mounting system.
swh:
  fuse:
    cache:
      metadata:
        in-memory: true
      blob:
        bypass: true
    graph:
      grpc-url: swh-graph-grpc.local:50091
    content:
      storage:
        cls: remote
        url: http://storage.local
        enable_requests_retry: true
      objstorage:
        cls: remote
        url: http://objstorage.local
        enable_requests_retry: true