Configuration
The configuration for the Software Heritage Filesystem resides in the
swh > fuse section of the shared YAML configuration
file used by all Software Heritage tools, located by default at
~/.config/swh/global.yml (following the XDG Base Directory specification).
You can override this path on the command line via the
-C/--config-file flag.
You can choose how swh-fuse will fetch content from the archive.
The default and simplest way is to query the SWH public API.
This method can be configured with the following block:
web-api:
    url: archive API URL (https://archive.softwareheritage.org/api/1/)
    auth-token: (optional, but recommended) authentication token used with the API URL
swh-fuse will also search for the following fields:
cache: a section that can contain:
    metadata: a dict configuring where to store the metadata cache. It can contain either an in-memory boolean entry, set to true, or a path string entry, pointing to the file.
    blob: a dict configuring where to store the blob cache, with the same entries as metadata. If the dict contains a bypass entry set to true, this cache will be disabled entirely, which can be useful in the HPC setting (see below).
    direntry: how much memory should be used by the direntry cache, specified using a maxram entry (either as a percentage of available RAM, or with disk storage unit suffixes: B, KB, MB, GB).
json-indent: number of spaces used to print JSON metadata files. Setting it to null disables indentation.
Example
Here is a full ~/.config/swh/global.yml equivalent to the default configuration:
swh:
  fuse:
    cache:
      metadata:
        path: "/home/user/.cache/swh/fuse/metadata.sqlite"
      blob:
        path: "/home/user/.cache/swh/fuse/blob.sqlite"
      direntry:
        maxram: "10%"
    web-api:
      url: "https://archive.softwareheritage.org/api/1/"
      auth-token: "eyJhbGciOiJIUzI1NiIsInR5cCIgOiAiSldUIiwia2..."
    json-indent: 2
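As a variation on the defaults, here is a minimal sketch of a configuration that keeps the metadata cache in memory and caps the direntry cache with an absolute size instead of a percentage. It only uses entries described in this section. The blob path and the 512MB value are illustrative placeholders rather than recommendations, and the exact spelling of the size (number immediately followed by the unit) is an assumption.

```yaml
swh:
  fuse:
    cache:
      metadata:
        in-memory: true                  # keep the metadata cache in RAM instead of a file
      blob:
        path: "/tmp/swhfs/blob.sqlite"   # hypothetical location
      direntry:
        maxram: "512MB"                  # absolute size, placeholder value
    web-api:
      url: "https://archive.softwareheritage.org/api/1/"
    json-indent: null                    # disable indentation of JSON metadata files
```

The auth-token entry is left out here because it is optional; add it back under web-api if you have one.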
Logging
The default logging level is set to INFO and can be configured with the
SWH_LOG_LEVEL environment variable, or through the
shared command line interface via the -l/--log-level
flag.
$ swh --log-level swh.fuse:DEBUG fs mount swhfs/ -f
Faster file system traversal with a local compressed graph
To traverse the folder hierarchy much faster,
you can connect swh-fuse to a compressed graph
via its gRPC API.
To do so, install with the hpc dependency group:
$ pip install swh-fuse[hpc]
Then, this can be enabled with the following configuration section:
graph:
    grpc-url: URL to the graph’s gRPC server.
If that server instance is used only for swh-fuse,
then since version 6.7.2 of swh-graph
you can start the gRPC server with the --direction=forward option,
in which case you do not need any graph*transposed* files.
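For illustration, a minimal sketch of the graph section in place within the shared configuration file. The host and port are placeholders for wherever your gRPC server listens.

```yaml
swh:
  fuse:
    graph:
      grpc-url: "localhost:50091"   # placeholder address of the swh-graph gRPC server
```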
Note
If you don’t need to read revision and release information (which we usually put in
meta.json),
then you also do not need to download or store the whole compressed graph.
The following files are enough, halving the required storage:
graph.ef
graph.graph
graph-labelled.ef
graph-labelled.labeloffsets
graph-labelled.labels
graph-labelled.properties
graph.labels.fcl.bytearray
graph.labels.fcl.pointers
graph.labels.fcl.properties
graph.node2swhid.bin
graph.node2type.bin
graph.properties
graph.property.content.is_skipped.bits
graph.property.content.length.bin
graph.pthash
graph.pthash.order
Configuring files’ content download
What follows requires the hpc dependency group:
$ pip install swh-fuse[hpc]
You can configure how swh-fuse will download files’ content with the following section:
content:
    storage: a usual storage configuration, like:
        cls: remote
        url: http://localhost:8080
    objstorage: a usual objstorage configuration, like:
        cls: remote
        url: http://localhost:8080
objstorage is optional,
as the storage service may be able to provide files’ contents,
but this will probably be slower.
When objstorage is provided,
storage is called only to map SWHIDs to the hashes of the corresponding contents,
so you’ll probably want to set cls: digestmap for storage.
That class is provided by the swh.digestmap package,
installed along with the hpc dependency group.
It has been developed for this use case and will be the fastest back-end.
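As a sketch, the content section sits alongside the other swh > fuse entries. The URLs below are placeholders for services assumed to be running locally on separate ports. The parameters specific to digestmap are not described here, so this example keeps cls: remote for both back-ends.

```yaml
swh:
  fuse:
    content:
      storage:
        cls: remote
        url: "http://localhost:5002"   # placeholder storage endpoint
      objstorage:                      # optional; omit to fetch contents through storage
        cls: remote
        url: "http://localhost:5003"   # placeholder objstorage endpoint
```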
Sample configuration: Large-scale access on a dedicated HPC
If you plan to use swh-fuse on a dedicated cluster containing an archive replica
(as in the CodeCommons project),
you can connect swh-fuse to a compressed graph and also to local
storage and objstorage
instances as follows.
In that case we can bypass the blob cache entirely (keeping only an in-memory metadata cache),
to save memory on the mounting system.
swh:
  fuse:
    cache:
      metadata:
        in-memory: true
      blob:
        bypass: true
    graph:
      grpc-url: swh-graph-grpc.local:50091
    content:
      storage:
        cls: remote
        url: http://storage.local
        enable_requests_retry: true
      objstorage:
        cls: remote
        url: http://objstorage.local
        enable_requests_retry: true
Monitoring
When using a compressed graph or content back-ends,
swh-fuse sends statsd metrics
to localhost:8125 by default.
This can be changed from environment variables,
in particular STATSD_HOST and STATSD_PORT.
Expect the following metrics:
swhfuse_waiting_graph: a timer measuring how long we are waiting for the graph backend
swhfuse_waiting_storage: a timer measuring how long we are waiting for the storage backend
swhfuse_waiting_objstorage: a timer measuring how long we are waiting for the objstorage (contents) backend
swhfuse_get_blob: a counter of calls to storage/objstorage
swhfuse_blob_not_in_storage: a counter of failed calls to storage (including objects not found)
swhfuse_blob_not_in_objstorage: a counter of failed calls to objstorage (including objects not found)