swh.fuse.cache module

async swh.fuse.cache.db_connect(conf: Dict[str, Any]) → Connection

class swh.fuse.cache.FuseCache(cache_conf: Dict[str, Any])

Bases: object

SwhFS retrieves both metadata and file contents from the Software Heritage archive via the network. To obtain reasonable performance, several caches are used to minimize network transfers.

Caches are stored on disk in SQLite databases located at $XDG_CACHE_HOME/swh/fuse/.

All caches are persistent (i.e., they survive the restart of the SwhFS process) and global (i.e., they are shared by concurrent SwhFS processes).

We assume that no cache invalidation is necessary, due to intrinsic properties of the Software Heritage archive, such as integrity verification and the append-only nature of archive changes. To clean the caches, simply remove the corresponding files from disk.
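
For illustration only, a cache configuration might look like the sketch below; the keys ("path", "maxram") and paths are assumptions shown here to make the on-disk layout concrete, not values taken from this reference:

    # Illustrative cache configuration (keys and paths are assumptions).
    cache_conf = {
        "metadata": {"path": "/home/user/.cache/swh/fuse/metadata.sqlite"},
        "blob": {"path": "/home/user/.cache/swh/fuse/blob.sqlite"},
        "direntry": {"maxram": "10%"},  # in-RAM cache budget
    }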

async get_cached_swhids() → AsyncGenerator[CoreSWHID, None]

Yield all previously cached SWHIDs.

async get_cached_visits() → AsyncGenerator[str, None]

Yield all previously cached visit URLs.
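
A minimal sketch of driving these helpers, reusing the cache_conf sketched above and assuming FuseCache opens its underlying databases when entered as an async context manager (that lifecycle is not spelled out in this reference):

    import asyncio

    from swh.fuse.cache import FuseCache

    async def dump_cache_contents(cache_conf) -> None:
        # Assumed lifecycle: entering the context opens the SQLite databases.
        async with FuseCache(cache_conf) as cache:
            async for swhid in cache.get_cached_swhids():
                print("cached artifact:", swhid)
            async for url in cache.get_cached_visits():
                print("cached visit:", url)

    asyncio.run(dump_cache_contents(cache_conf))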

class swh.fuse.cache.AbstractCache(conf: Dict[str, Any], conn: Connection | None = None)

Bases: ABC

Abstract cache implementation to share common behavior between cache types

DB_SCHEMA: str = ''
conn: Connection
conf: Dict[str, Any]

class swh.fuse.cache.MetadataCache(conf: Dict[str, Any], conn: Connection | None = None)

Bases: AbstractCache

The metadata cache maps each artifact to the complete metadata of the referenced object. This is analogous to what is available in the archive/<SWHID>.json file (and is generally used as the data source for the content of those files). Artifacts are identified by their SWHIDs or, in the case of origin visits, by their URLs.

DB_SCHEMA: str =

    create table if not exists metadata_cache (
        swhid text not null primary key,
        metadata blob,
        date text
    );

    create table if not exists visits_cache (
        url text not null primary key,
        metadata blob,
        itime timestamp  -- insertion time
    );

async get(swhid: CoreSWHID, typify: bool = True) → Any
async get_visits(url_encoded: str) → List[Dict[str, Any]] | None
async set(swhid: CoreSWHID, metadata: Any) → None
async set_visits(url_encoded: str, visits: List[Dict[str, Any]]) → None
async remove(swhid: CoreSWHID) → None
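
As a sketch of exercising the API above directly; the metadata payload is made up, and the reading of typify (raw JSON-decoded form vs. typed model objects) is an assumption:

    from swh.model.swhids import CoreSWHID

    SWHID = CoreSWHID.from_string(
        "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
    )

    async def roundtrip(cache) -> None:
        # Store illustrative metadata for the artifact...
        await cache.set(SWHID, {"length": 42, "status": "visible"})
        # ...and read it back; typify=False is assumed to return the raw
        # JSON-decoded form rather than typed model objects.
        raw = await cache.get(SWHID, typify=False)
        assert raw["length"] == 42
        # Entries can be evicted individually, e.g. to force a re-fetch.
        await cache.remove(SWHID)
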
class swh.fuse.cache.BlobCache(conf: Dict[str, Any], conn: Connection | None = None)

Bases: AbstractCache

The blob cache maps SWHIDs of type cnt to the bytes of their archived content.

The blob cache entry for a given content object is populated, at the latest, the first time the object is read()-d. It might be populated earlier due to prefetching, e.g., when a directory pointing to the given content is listed for the first time.

DB_SCHEMA: str =

    create table if not exists blob_cache (
        swhid text not null primary key,
        blob blob
    );

async get(swhid: CoreSWHID) → bytes | None
async set(swhid: CoreSWHID, blob: bytes) → None
async remove(swhid: CoreSWHID) → None
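
A typical lookup-or-fetch pattern around this cache might read as follows; fetch_blob stands in for whatever archive client actually retrieves the bytes and is purely hypothetical:

    from swh.model.swhids import CoreSWHID

    async def read_content(cache, swhid: CoreSWHID) -> bytes:
        blob = await cache.get(swhid)
        if blob is None:
            # Cache miss: fetch from the archive (hypothetical helper),
            # then persist so later reads are served from SQLite.
            blob = await fetch_blob(swhid)
            await cache.set(swhid, blob)
        return blob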

class swh.fuse.cache.HistoryCache(conf: Dict[str, Any], conn: Connection | None = None)

Bases: AbstractCache

The history cache maps SWHIDs of type rev to the list of rev SWHIDs of all its revision ancestors, sorted in reverse topological order. Like the parents cache, the history cache is lazily populated and can be prefetched. To store ancestor lists efficiently, the history cache represents ancestors as graph edges (pairs of SWHID nodes), so that the cached history is shared amongst all revisions' parents.

DB_SCHEMA: str =

    create table if not exists history_graph (
        src text not null,
        dst text not null,
        unique(src, dst)
    );
    create index if not exists idx_history on history_graph(src);

HISTORY_REC_QUERY =

    with recursive
    dfs(node) AS (
        values(?)
        union
        select history_graph.dst
        from history_graph
        join dfs on history_graph.src = dfs.node
    )
    -- Do not keep the root node since it is not an ancestor
    select * from dfs limit -1 offset 1

async get(swhid: CoreSWHID) → List[CoreSWHID] | None
async get_with_date_prefix(swhid: CoreSWHID, date_prefix: str) → List[Tuple[CoreSWHID, str]]
async set(history: str) → None
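
To make the edge representation concrete, here is a sketch of populating and querying the cache; the edge-list input format for set() (one "src dst" pair per line) is an assumption:

    from swh.model.swhids import CoreSWHID

    REV = "swh:1:rev:" + "aa" * 20
    PARENT = "swh:1:rev:" + "bb" * 20

    async def demo(cache) -> None:
        # Assumed input format: whitespace-separated edges, one per line.
        await cache.set(f"{REV} {PARENT}\n")
        # get() walks history_graph with the recursive query above and
        # returns every ancestor reachable from the given revision.
        ancestors = await cache.get(CoreSWHID.from_string(REV))
        assert ancestors is not None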

class swh.fuse.cache.DirEntryCache(conf: Dict[str, Any])

Bases: object

The direntry cache maps inodes representing directories to the entries they contain. Each entry comes with its name as well as its file attributes (i.e., everything needed to perform a detailed directory listing).

Additional attributes of each directory entry should be looked up on an entry-by-entry basis, possibly hitting other caches.

The direntry cache for a given directory is populated, at the latest, when the content of the directory is listed. More aggressive prefetching might happen. For instance, when a directory is first opened, a recursive listing of it can be retrieved from the remote backend and used to recursively populate the direntry cache for all (transitive) sub-directories.
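
In the filesystem layer, usage presumably follows the same lookup-or-compute pattern; compute_entries below is a hypothetical stand-in for however a directory's children are actually produced:

    async def list_directory(direntry_cache, direntry):
        entries = direntry_cache.get(direntry)
        if entries is None:
            # Cache miss: build the listing (hypothetical helper), then
            # store it so subsequent readdir() calls are served from RAM.
            entries = [entry async for entry in compute_entries(direntry)]
            direntry_cache.set(direntry, entries)
        return entries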

class LRU(max_ram: int)

Bases: OrderedDict

max_ram: int
used_ram: int = 0
sizeof(value: Any) → int
get(direntry: FuseDirEntry) → List[FuseEntry] | None
set(direntry: FuseDirEntry, entries: List[FuseEntry]) → None
invalidate(direntry: FuseDirEntry) → None
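
The nested LRU is bounded by memory use rather than by entry count: used_ram tracks the estimated size of the stored values, and the least recently used entries are evicted once max_ram is exceeded. A self-contained sketch of that idea on top of OrderedDict (the class name and the sizeof heuristic are illustrative, not the actual implementation):

    import sys
    from collections import OrderedDict
    from typing import Any

    class SizedLRU(OrderedDict):
        """RAM-bounded LRU sketch mirroring DirEntryCache.LRU."""

        def __init__(self, max_ram: int):
            super().__init__()
            self.max_ram = max_ram
            self.used_ram = 0

        def sizeof(self, value: Any) -> int:
            # Crude size estimate; the real heuristic may differ.
            return sys.getsizeof(value)

        def __getitem__(self, key: Any) -> Any:
            value = super().__getitem__(key)
            self.move_to_end(key)  # mark as most recently used
            return value

        def __setitem__(self, key: Any, value: Any) -> None:
            if key in self:
                self.used_ram -= self.sizeof(super().__getitem__(key))
            super().__setitem__(key, value)
            self.move_to_end(key)
            self.used_ram += self.sizeof(value)
            # Evict least recently used entries until within budget,
            # always keeping the entry that was just inserted.
            while self.used_ram > self.max_ram and len(self) > 1:
                _, evicted = self.popitem(last=False)
                self.used_ram -= self.sizeof(evicted)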