.. _swh-fuse-config:

Configuration
=============

The configuration for the Software Heritage Filesystem resides in the
``swh > fuse`` section of the shared `YAML `_ configuration file used by all
Software Heritage tools, located by default at ``~/.config/swh/global.yml``
(following the `XDG Base Directory `_ specification). You can override this
path on the :ref:`command line ` via the ``-C/--config-file`` flag.

You can choose how ``swh-fuse`` fetches content from the archive. The default
and simplest way is to query the SWH public API. This method is configured with
the following block:

- ``web-api``:

  - ``url``: archive API URL (:swh_web:`api/1/`)
  - ``auth-token``: (optional, but recommended) authentication token used with
    the API URL

``swh-fuse`` also reads the following fields:

- ``cache``: a section that can contain:

  - ``metadata``: a dict configuring where to store the metadata cache. It
    contains either an ``in-memory`` boolean entry set to ``true``, or a
    ``path`` string entry pointing to the cache file.
  - ``blob``: a dict configuring where to store the blob cache, with the same
    entries as ``metadata``. If the dict contains a ``bypass`` entry set to
    ``true``, this cache is disabled entirely, which can be useful in the HPC
    setting (see below).
  - ``direntry``: how much memory the direntry cache may use, specified with a
    ``maxram`` entry (either as a percentage of available RAM, or with disk
    storage unit suffixes: ``B``, ``KB``, ``MB``, ``GB``).

- ``json-indent``: number of spaces used to print JSON metadata files. Setting
  it to ``null`` disables indentation.

Example
-------

Here is a full ``~/.config/swh/global.yml`` equivalent to the default
configuration:

.. code:: yaml

    swh:
      fuse:
        cache:
          metadata:
            path: "/home/user/.cache/swh/fuse/metadata.sqlite"
          blob:
            path: "/home/user/.cache/swh/fuse/blob.sqlite"
          direntry:
            maxram: "10%"
        web-api:
          url: "https://archive.softwareheritage.org/api/1/"
          auth-token: "eyJhbGciOiJIUzI1NiIsInR5cCIgOiAiSldUIiwia2..."
        json-indent: 2

Logging
-------

The default logging level is ``INFO``. It can be configured with the
``SWH_LOG_LEVEL`` environment variable, or through the
:ref:`shared command line interface ` via the ``-l/--log-level`` flag.

.. code:: bash

    $ swh --log-level swh.fuse:DEBUG fs mount swhfs/ -f

.. _swh-fuse-config-graph:

Faster file system traversal with a local compressed graph
----------------------------------------------------------

To traverse the folder hierarchy much faster, connect to a
:ref:`compressed graph ` via its :ref:`gRPC API `. To do so, install with the
``hpc`` dependency group::

    $ pip install swh-fuse[hpc]

Then enable it with the following configuration section:

- ``graph``:

  - ``grpc-url``: URL to the graph's :ref:`gRPC server `.

If that server instance is only used by ``swh-fuse``, you can start the gRPC
server with the ``--direction=forward`` option (available since version 6.7.2
of ``swh-graph``), in which case you do not need any ``graph*transposed*``
files.

.. note::

    If you don't need to read revision and release information (that we
    usually put in ``meta.json``), then you also do not need to download/store
    the whole compressed graph. The following files are enough, halving the
    required storage:

    * graph.ef
    * graph.graph
    * graph-labelled.ef
    * graph-labelled.labeloffsets
    * graph-labelled.labels
    * graph-labelled.properties
    * graph.labels.fcl.bytearray
    * graph.labels.fcl.pointers
    * graph.labels.fcl.properties
    * graph.node2swhid.bin
    * graph.node2type.bin
    * graph.properties
    * graph.property.content.is_skipped.bits
    * graph.property.content.length.bin
    * graph.pthash
    * graph.pthash.order

.. _swh-fuse-config-teaser-graph-webapi:

Sample configuration: teaser graph + WebAPI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the following configuration, ``swh-fuse`` connects to a local graph gRPC
API when creating its folder structure. Files' contents are downloaded from our
Web API. This also switches to a volatile ``metadata`` cache, because the graph
can provide metadata quickly.

.. code:: yaml

    swh:
      fuse:
        cache:
          metadata:
            in-memory: true
          blob:
            path: "/path/to/cache/blob.sqlite"
        graph:
          grpc-url: localhost:50091
        web-api:
          auth-token: "yhbGcOiJI1z1NiIsInR5CIgOiAiSlduIiWia2..."

.. note::

    The way we encode symbolic links requires access to the contents storage
    (cf. :py:func:`swh.model.git_objects.directory_git_object`), so in that
    setting source tree traversals can still cause accesses to the Web API.

.. _swh-fuse-config-file-download:

Configuring files' content download
-----------------------------------

What follows requires the ``hpc`` dependency group::

    $ pip install swh-fuse[hpc]

You can configure how ``swh-fuse`` downloads files' contents with the following
section:

- ``content``:

  - ``storage``: a usual :ref:`storage ` configuration, like:

    - ``cls: remote``
    - ``url: http://localhost:8080``

  - ``objstorage``: a usual :ref:`objstorage ` configuration, like:

    - ``cls: remote``
    - ``url: http://localhost:8080``

``objstorage`` is optional, as the ``storage`` service may be able to provide
files' contents, but this will probably be slower. When ``objstorage`` is
provided, ``storage`` is only called to map SWHIDs to the contents' set of
hashes. In that case you will probably want to set ``cls: digestmap``. That
class is provided by the package :ref:`swh.digestmap `, installed along with
the HPC dependency group. It has been developed for that case and will be the
fastest back-end.

.. _swh-fuse-config-teaser-graph-s3:

Sample configuration: teaser graph + S3
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the following configuration, ``swh-fuse`` connects to a local graph gRPC
API when creating its folder structure. Files' contents are downloaded from our
S3 mirror (cf. :py:mod:`swh.objstorage.backends.http`) but cached locally to
speed up repeated access to the same files. This can be useful to test on your
own machine, using a :ref:`teaser dataset ` and its corresponding
:ref:`digestmap `. To ensure the digestmap implementation is available, invoke
``pip install swh-digestmap``.

.. code:: yaml

    swh:
      fuse:
        cache:
          metadata:
            in-memory: true
          blob:
            path: "/path/to/cache/blob.sqlite"
        graph:
          grpc-url: localhost:50091
        content:
          storage:
            cls: digestmap
            path: /home/user/graphs/digestmap-folder
          objstorage:
            cls: http
            url: https://softwareheritage.s3.amazonaws.com/content/
            compression: gzip
            retry:
              total: 3
              backoff_factor: 0.2
              status_forcelist:
                - 404
                - 500

.. _swh-fuse-config-hpc:

Sample configuration: Large-scale access on a dedicated HPC
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you plan to use ``swh-fuse`` on a dedicated cluster containing an archive
replica (as in the `CodeCommons `_ project), you can connect ``swh-fuse`` to a
compressed graph and also to local :ref:`storage ` and :ref:`objstorage `
instances as follows. In that case the blob cache can be disabled entirely and
the metadata cache kept in memory, to save resources on the mounting system.

.. code:: yaml

    swh:
      fuse:
        cache:
          metadata:
            in-memory: true
          blob:
            bypass: true
        graph:
          grpc-url: swh-graph-grpc.local:50091
        content:
          storage:
            cls: remote
            url: http://storage.local
            enable_requests_retry: true
          objstorage:
            cls: remote
            url: http://objstorage.local
            enable_requests_retry: true

Monitoring
----------

When using a compressed graph or content back-ends, ``swh-fuse`` sends
`statsd `_ metrics to ``localhost:8125`` by default. This can be changed
through `environment variables `_, in particular ``STATSD_HOST`` and
``STATSD_PORT``. Expect the following metrics:

* ``swhfuse_waiting_graph``: a timer measuring how long we wait for the graph
  backend
* ``swhfuse_waiting_storage``: a timer measuring how long we wait for the
  storage backend
* ``swhfuse_waiting_objstorage``: a timer measuring how long we wait for the
  objstorage (contents) backend
* ``swhfuse_get_blob``: a counter of calls to storage/objstorage
* ``swhfuse_blob_not_in_storage``: a counter of failed calls to storage
  (including objects not found)
* ``swhfuse_blob_not_in_objstorage``: a counter of failed calls to objstorage
  (including objects not found)
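
To check locally that these metrics are emitted, one option is to listen on the
statsd UDP port and decode the datagrams. The following is a minimal sketch and
not part of ``swh-fuse``: ``Metric``, ``parse_statsd_line`` and ``listen`` are
hypothetical helper names, and it assumes metrics arrive as plain-text statsd
lines of the form ``name:value|type``, optionally followed by a ``|@rate``
sample rate and ``|#key:value`` tags.

.. code:: python

    import socket
    from typing import NamedTuple


    class Metric(NamedTuple):
        name: str
        value: float
        kind: str  # "c" for counters, "ms" for timers
        tags: dict


    def parse_statsd_line(line: str) -> Metric:
        """Parse one statsd line: name:value|type[|@rate][|#key:val,...]."""
        name, _, rest = line.partition(":")
        fields = rest.split("|")
        if not name or len(fields) < 2:
            raise ValueError(f"not a statsd line: {line!r}")
        tags = {}
        # Optional fields: "#..." carries tags; "@..." (sample rate) is ignored.
        for extra in fields[2:]:
            if extra.startswith("#"):
                for tag in extra[1:].split(","):
                    key, _, val = tag.partition(":")
                    tags[key] = val
        return Metric(name, float(fields[0]), fields[1], tags)


    def listen(host="127.0.0.1", port=8125, max_packets=10):
        """Print swhfuse_* metrics found in the next max_packets datagrams."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.bind((host, port))
            for _ in range(max_packets):
                data, _addr = sock.recvfrom(65535)
                text = data.decode("utf-8", errors="replace")
                for line in text.splitlines():
                    # Assumes no extra prefix is prepended to metric names.
                    if line.startswith("swhfuse_"):
                        print(parse_statsd_line(line))

If no other statsd daemon is bound to the port, calling ``listen()`` from a
Python shell before mounting should print each ``swhfuse_*`` metric as it
arrives. Binding fails if the port is already in use, in which case the
existing daemon is the better place to inspect the metrics.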