swh.dataset.utils module

class swh.dataset.utils.ZSTFile(path: str, mode: str = 'r')

Bases: object

File-like wrapper around a ZST file. It runs the “zstd” command in a subprocess to compress data on write and decompress it on read.

read(*args)

write(buf)
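The subprocess approach described above can be sketched as follows. This is a minimal illustration, not the swh.dataset implementation: the class name `ZstdPipe`, the accepted modes, the context-manager protocol, and the exact `zstd` flags are all assumptions, and the sketch needs the `zstd` command-line tool on the PATH.

```python
import subprocess


class ZstdPipe:
    """Hypothetical stand-in for a ZST file wrapper (not the swh.dataset code)."""

    def __init__(self, path: str, mode: str = "r"):
        # Assumed set of modes: text or binary, read or write.
        if mode not in ("r", "rb", "w", "wb"):
            raise ValueError(f"unsupported mode: {mode!r}")
        self.path = path
        self.mode = mode
        self.process = None

    def __enter__(self):
        text = "b" not in self.mode
        if self.mode.startswith("w"):
            # Compress everything written to stdin into self.path.
            self.process = subprocess.Popen(
                ["zstd", "-q", "-f", "-o", self.path],
                stdin=subprocess.PIPE,
                text=text,
            )
        else:
            # Decompress self.path and expose the result on stdout.
            self.process = subprocess.Popen(
                ["zstd", "-q", "-d", "-c", self.path],
                stdout=subprocess.PIPE,
                text=text,
            )
        return self

    def __exit__(self, exc_type, exc_value, tb):
        # Closing stdin signals end of input so zstd can finish the file.
        for stream in (self.process.stdin, self.process.stdout):
            if stream is not None:
                stream.close()
        self.process.wait()

    def read(self, *args):
        return self.process.stdout.read(*args)

    def write(self, buf):
        self.process.stdin.write(buf)
```

With such a wrapper, `with ZstdPipe("out.zst", "w") as f: f.write("line\n")` would produce a compressed file, and opening the same path in mode `"r"` would read the text back.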
class swh.dataset.utils.SQLiteSet(db_path)

Bases: object

On-disk set of hashes that uses SQLite as its indexing backend. Used to deduplicate objects when processing large queues that contain duplicates.

add(v: bytes) → bool

Add an item to the set.

Parameters:

v – The value to add to the set.

Returns:

True if the value was added to the set, False if it was already present.
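One way to get this `add` contract from SQLite is to let a primary-key constraint reject duplicates. The sketch below is an illustration under assumptions, not the library's code: the class name `DiskSet`, the table name `items`, and the `close` method are hypothetical.

```python
import sqlite3


class DiskSet:
    """Hypothetical SQLite-backed set of byte strings (illustrative only)."""

    def __init__(self, db_path):
        self.db = sqlite3.connect(str(db_path))
        # The PRIMARY KEY constraint is what enforces uniqueness.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS items (val BLOB NOT NULL PRIMARY KEY)"
        )

    def add(self, v: bytes) -> bool:
        """Return True if v was inserted, False if it was already stored."""
        try:
            self.db.execute("INSERT INTO items (val) VALUES (?)", (v,))
        except sqlite3.IntegrityError:
            # A duplicate key violates the constraint: already present.
            return False
        return True

    def close(self):
        self.db.commit()
        self.db.close()
```

Passing `":memory:"` as `db_path` keeps the set in RAM, which is convenient for trying the sketch out; a real file path keeps memory use bounded for large inputs.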

class swh.dataset.utils.LevelDBSet(db_path)

Bases: object

On-disk set of hashes that uses LevelDB as its indexing backend. Used to deduplicate objects when processing large queues that contain duplicates.

add(v: bytes) → bool

Add an item to the set.

Parameters:

v – The value to add to the set.

Returns:

True if the value was added to the set, False if it was already present.
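Because `add` reports whether the value was new, deduplicating a stream needs only one call per item. The sketch below shows that pattern with hypothetical names: `unique` is an illustrative helper, and `InMemorySet` is a stand-in that follows the same `add` contract so the example runs without LevelDB or SQLite.

```python
from typing import Iterable, Iterator


def unique(hashes: Iterable[bytes], seen) -> Iterator[bytes]:
    """Yield each hash the first time it appears.

    `seen` is any object whose add(v) returns True only for new values.
    """
    for h in hashes:
        if seen.add(h):
            yield h


class InMemorySet:
    """Hypothetical stand-in with the same add() contract as the on-disk sets."""

    def __init__(self):
        self._items = set()

    def add(self, v: bytes) -> bool:
        if v in self._items:
            return False
        self._items.add(v)
        return True
```

For example, `list(unique([b"a", b"b", b"a"], InMemorySet()))` keeps only the first occurrence of each value and preserves input order.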

swh.dataset.utils.remove_pull_requests(snapshot)

Heuristic to filter out pull requests in snapshots: remove all branches that start with refs/ but do not start with refs/heads or refs/tags.
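The heuristic can be sketched as a filter over branch names. This is an illustration under assumptions, not the function's actual code: the helper name `drop_pull_request_branches` is hypothetical, branches are assumed to be a dict keyed by `bytes` names, and the sketch returns a new dict, whereas the real function may modify the snapshot in place.

```python
def drop_pull_request_branches(branches: dict) -> dict:
    """Return a copy of `branches` without refs/ entries other than heads/tags."""
    kept = {}
    for name, target in branches.items():
        is_ref = name.startswith(b"refs/")
        is_head_or_tag = name.startswith((b"refs/heads", b"refs/tags"))
        if is_ref and not is_head_or_tag:
            # e.g. refs/pull/1/head or refs/merge-requests/2/head
            continue
        kept[name] = target
    return kept
```

Under this rule a name such as `refs/pull/1/head` is dropped, while `HEAD`, `refs/heads/main`, and `refs/tags/v1` are kept.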