swh.storage package

Submodules

swh.storage.cli module

swh.storage.cli.main()[source]

swh.storage.common module

swh.storage.converters module

swh.storage.converters.author_to_db(author)[source]

Convert a swh-model author to its DB representation.

Parameters:author – a swh.model compatible author
Returns:a dictionary with three keys: fullname, name and email
Return type:dict
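
As a rough illustration of the two author conversions (a minimal sketch, not the actual implementation; the author values and the `_sketch` helper names are made up):

```python
def author_to_db_sketch(author):
    """Sketch: keep only the fullname, name and email keys for DB insertion."""
    if author is None:
        return {"fullname": None, "name": None, "email": None}
    return {k: author[k] for k in ("fullname", "name", "email")}

def db_to_author_sketch(id, fullname, name, email):
    """Sketch: rebuild a swh-model author dict from its DB columns."""
    if id is None:
        return None
    return {"id": id, "fullname": fullname, "name": name, "email": email}

author = {"fullname": b"Jane Doe <jane@example.com>",
          "name": b"Jane Doe", "email": b"jane@example.com"}
row = author_to_db_sketch(author)      # dict ready for the author table
back = db_to_author_sketch(1, **row)   # swh-model author, plus its DB id
```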
swh.storage.converters.db_to_author(id, fullname, name, email)[source]

Convert the DB representation of an author to a swh-model author.

Parameters:
  • id (long) – the author’s identifier
  • fullname (bytes) – the author’s fullname
  • name (bytes) – the author’s name
  • email (bytes) – the author’s email
Returns:

a dictionary with four keys: id, fullname, name and email, or None if the id is None

Return type:

dict

swh.storage.converters.git_headers_to_db(git_headers)[source]

Convert git headers to their database representation.

We convert the bytes to unicode by decoding them as UTF-8 and replacing invalid UTF-8 sequences with backslash escapes.
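
Python's built-in 'backslashreplace' error handler implements exactly this substitution, so the decoding rule can be demonstrated with plain stdlib code (the header value is made up):

```python
# \xff is not valid UTF-8 anywhere, so it is kept as a backslash escape;
# the surrounding ASCII decodes normally.
raw = b"gpgsig -----BEGIN\xff-----"
decoded = raw.decode("utf-8", "backslashreplace")
assert decoded == "gpgsig -----BEGIN\\xff-----"
```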

swh.storage.converters.db_to_git_headers(db_git_headers)[source]
swh.storage.converters.db_to_date(date, offset, neg_utc_offset)[source]

Convert the DB representation of a date to a swh-model compatible date.

Parameters:
  • date (datetime.datetime) – a date pulled out of the database
  • offset (int) – an integer number of minutes representing a UTC offset
  • neg_utc_offset (boolean) – whether the UTC offset is negative
Returns:

a dict with three keys:

  • timestamp: a timestamp from UTC
  • offset: the number of minutes since UTC
  • negative_utc: whether a null UTC offset is negative

Return type:

dict

swh.storage.converters.date_to_db(date_offset)[source]

Convert a swh-model date_offset to its DB representation.

Parameters:date_offset – a swh.model compatible date_offset
Returns:a dictionary with three keys:
  • timestamp: a date in ISO format
  • offset: the UTC offset in minutes
  • neg_utc_offset: a boolean indicating whether a null offset is negative or positive.
Return type:dict
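
A minimal round-trip sketch of the two date converters, assuming the swh.model date form ({'timestamp': {'seconds', 'microseconds'}, 'offset', 'negative_utc'}); this is illustrative, not the actual implementation:

```python
from datetime import datetime, timezone

def date_to_db_sketch(date_offset):
    """Sketch: turn a swh-model date into its three DB columns."""
    if date_offset is None:
        return {"timestamp": None, "offset": 0, "neg_utc_offset": None}
    ts = date_offset["timestamp"]
    dt = datetime.fromtimestamp(ts["seconds"], tz=timezone.utc)
    dt = dt.replace(microsecond=ts["microseconds"])
    return {
        "timestamp": dt.isoformat(),          # a date in ISO format
        "offset": date_offset["offset"],      # UTC offset in minutes
        "neg_utc_offset": date_offset["negative_utc"],
    }

def db_to_date_sketch(date, offset, neg_utc_offset):
    """Sketch: rebuild the swh-model date from the DB columns."""
    if date is None:
        return None
    dt = datetime.fromisoformat(date)
    return {
        "timestamp": {"seconds": int(dt.timestamp()),
                      "microseconds": dt.microsecond},
        "offset": offset,
        "negative_utc": neg_utc_offset,
    }

d = {"timestamp": {"seconds": 1565096932, "microseconds": 0},
     "offset": 120, "negative_utc": False}
row = date_to_db_sketch(d)
back = db_to_date_sketch(row["timestamp"], row["offset"],
                         row["neg_utc_offset"])
```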
swh.storage.converters.revision_to_db(revision)[source]

Convert a swh-model revision to its database representation.

swh.storage.converters.db_to_revision(db_revision)[source]

Convert a database representation of a revision to its swh-model representation.

swh.storage.converters.release_to_db(release)[source]

Convert a swh-model release to its database representation.

swh.storage.converters.db_to_release(db_release)[source]

Convert a database representation of a release to its swh-model representation.

swh.storage.db module

class swh.storage.db.Db(conn, pool=None)[source]

Bases: swh.core.db.BaseDb

Proxy to the SWH DB, with wrappers around stored procedures

mktemp_dir_entry(entry_type, cur=None)[source]
mktemp_revision(cur=None)[source]
mktemp_release(cur=None)[source]
mktemp_snapshot_branch(cur=None)[source]
register_listener(notify_queue, cur=None)[source]

Register a listener for NOTIFY queue notify_queue

listen_notifies(timeout)[source]

Listen to notifications for timeout seconds

content_add_from_temp(cur=None)[source]
directory_add_from_temp(cur=None)[source]
skipped_content_add_from_temp(cur=None)[source]
revision_add_from_temp(cur=None)[source]
release_add_from_temp(cur=None)[source]
content_update_from_temp(keys_to_update, cur=None)[source]
content_get_metadata_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status']
content_add_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status', 'ctime']
skipped_content_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'reason', 'status', 'origin']
content_get_metadata_from_sha1s(sha1s, cur=None)[source]
content_get_range(start, end, limit=None, cur=None)[source]

Retrieve contents within range [start, end].

content_hash_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256']
content_missing_from_list(contents, cur=None)[source]
content_missing_per_sha1(sha1s, cur=None)[source]
skipped_content_missing(contents, cur=None)[source]
snapshot_exists(snapshot_id, cur=None)[source]

Check whether a snapshot with the given id exists

snapshot_add(snapshot_id, cur=None)[source]

Add a snapshot from the temporary table

snapshot_count_cols = ['target_type', 'count']
snapshot_count_branches(snapshot_id, cur=None)[source]
snapshot_get_cols = ['snapshot_id', 'name', 'target', 'target_type']
snapshot_get_by_id(snapshot_id, branches_from=b'', branches_count=None, target_types=None, cur=None)[source]
snapshot_get_by_origin_visit(origin_id, visit_id, cur=None)[source]
content_find_cols = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'ctime', 'status']
content_find(sha1=None, sha1_git=None, sha256=None, blake2s256=None, cur=None)[source]

Find a content from any combination of the checksums sha1, sha1_git, sha256 and blake2s256.

Parameters:
  • sha1 – sha1 of the content
  • sha1_git – the sha1 of the content, computed the way git does
  • sha256 – sha256 of the content
  • blake2s256 – blake2s256 of the content
Returns:

The tuple (sha1, sha1_git, sha256, blake2s256) if found or None.
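
The four checksums content_find accepts can all be computed with the standard library; sha1_git is the hash git itself would assign the content, i.e. the sha1 of a "blob <length>\0" header followed by the raw data. An illustrative sketch (the data is made up):

```python
import hashlib

data = b"hello world\n"

# The four checksums used throughout swh.storage, as hex strings.
hashes = {
    "sha1": hashlib.sha1(data).hexdigest(),
    "sha1_git": hashlib.sha1(
        b"blob %d\x00%b" % (len(data), data)).hexdigest(),
    "sha256": hashlib.sha256(data).hexdigest(),
    "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
}
```

Any subset of these (as bytes, in the real API) can then be passed as keyword arguments to content_find.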

directory_missing_from_list(directories, cur=None)[source]
directory_ls_cols = ['dir_id', 'type', 'target', 'name', 'perms', 'status', 'sha1', 'sha1_git', 'sha256', 'length']
directory_walk_one(directory, cur=None)[source]
directory_walk(directory, cur=None)[source]
directory_entry_get_by_path(directory, paths, cur=None)[source]

Retrieve a directory entry by path.

revision_missing_from_list(revisions, cur=None)[source]
revision_add_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic']
revision_get_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic', 'author_id', 'committer_id', 'parents']
origin_visit_add(origin, ts, type, cur=None)[source]

Add a new origin_visit for origin origin at timestamp ts with status ‘ongoing’.

Parameters:
  • origin – origin concerned by the visit
  • ts – the date of the visit
  • type – type of loader for the visit
Returns:

The new visit index step for that origin

origin_visit_update(origin_id, visit_id, updates, cur=None)[source]

Update origin_visit’s status.

origin_visit_upsert(origin, visit, date, type, status, metadata, snapshot, cur=None)[source]
origin_visit_get_cols = ['origin', 'visit', 'date', 'type', 'status', 'metadata', 'snapshot']
origin_visit_get_all(origin_id, last_visit=None, limit=None, cur=None)[source]

Retrieve all visits for origin with id origin_id.

Parameters:origin_id – The occurrence’s origin
Yields:The occurrence’s history visits
origin_visit_get(origin_id, visit_id, cur=None)[source]

Retrieve information on visit visit_id of origin origin_id.

Parameters:
  • origin_id – the origin concerned
  • visit_id – The visit step for that origin
Returns:

The origin_visit information

origin_visit_find_by_date(origin, visit_date, cur=None)[source]
origin_visit_exists(origin_id, visit_id, cur=None)[source]

Check whether an origin visit with the given ids exists

origin_visit_get_latest(origin_id, allowed_statuses=None, require_snapshot=False, cur=None)[source]

Retrieve the most recent origin_visit of the given origin, with optional filters.

Parameters:
  • origin_id – the origin concerned
  • allowed_statuses – the visit statuses allowed for the returned visit
  • require_snapshot (bool) – If True, only a visit with a known snapshot will be returned.
Returns:

The origin_visit information, or None if no visit matches.

static mangle_query_key(key, main_table)[source]
revision_get_from_list(revisions, cur=None)[source]
revision_log(root_revisions, limit=None, cur=None)[source]
revision_shortlog_cols = ['id', 'parents']
revision_shortlog(root_revisions, limit=None, cur=None)[source]
release_missing_from_list(releases, cur=None)[source]
object_find_by_sha1_git_cols = ['sha1_git', 'type', 'id', 'object_id']
object_find_by_sha1_git(ids, cur=None)[source]
stat_counters(cur=None)[source]
fetch_history_cols = ['origin', 'date', 'status', 'result', 'stdout', 'stderr', 'duration']
create_fetch_history(fetch_history, cur=None)[source]

Create a fetch_history entry with the data in fetch_history

get_fetch_history(fetch_history_id, cur=None)[source]

Get a fetch_history entry with the given id

update_fetch_history(fetch_history, cur=None)[source]

Update the fetch_history entry from the data in fetch_history

origin_add(type, url, cur=None)[source]

Insert a new origin and return the new identifier.

origin_cols = ['id', 'type', 'url']
origin_get_with(origins, cur=None)[source]

Retrieve the origin id from its type and url if found.

origin_get(ids, cur=None)[source]

Retrieve the origin per its identifier.

origin_get_range(origin_from=1, origin_count=100, cur=None)[source]

Retrieve origin_count origins whose ids are greater than or equal to origin_from.

Origins are sorted by id before retrieving them.

Parameters:
  • origin_from (int) – the minimum id of origins to retrieve
  • origin_count (int) – the maximum number of origins to retrieve
_origin_query(url_pattern, count=False, offset=0, limit=50, regexp=False, with_visit=False, cur=None)[source]

Method factorizing query creation for searching and counting origins.

Search for origins whose urls contain a provided string pattern or match a provided regular expression. The search is performed in a case insensitive way.

Parameters:
  • url_pattern (str) – the string pattern to search for in origin urls
  • offset (int) – number of found origins to skip before returning results
  • limit (int) – the maximum number of found origins to return
  • regexp (bool) – if True, consider the provided pattern as a regular expression and return origins whose urls match it
  • with_visit (bool) – if True, filter out origins with no visit
origin_count(url_pattern, regexp=False, with_visit=False, cur=None)[source]

Count origins whose urls contain a provided string pattern or match a provided regular expression. The pattern search in origin urls is performed in a case insensitive way.

Parameters:
  • url_pattern (str) – the string pattern to search for in origin urls
  • regexp (bool) – if True, consider the provided pattern as a regular expression and return origins whose urls match it
  • with_visit (bool) – if True, filter out origins with no visit
person_cols = ['fullname', 'name', 'email']
person_get_cols = ['fullname', 'name', 'email', 'id']
person_get(ids, cur=None)[source]

Retrieve the persons identified by the list of ids.

release_add_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'name', 'comment', 'synthetic', 'author_fullname', 'author_name', 'author_email']
release_get_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'name', 'comment', 'synthetic', 'author_fullname', 'author_name', 'author_email', 'author_id']
release_get_from_list(releases, cur=None)[source]
origin_metadata_add(origin, ts, provider, tool, metadata, cur=None)[source]

Add an origin_metadata for the origin at ts with provider, tool and metadata.

Parameters:
  • origin (int) – the origin’s id for which the metadata is added
  • ts (datetime) – time when the metadata was found
  • provider (int) – the metadata provider identifier
  • tool (int) – the tool’s identifier used to extract metadata
  • metadata (jsonb) – the metadata retrieved at the time and location
Returns:

the origin_metadata unique id

Return type:

id (int)

origin_metadata_get_cols = ['origin_id', 'discovery_date', 'tool_id', 'metadata', 'provider_id', 'provider_name', 'provider_type', 'provider_url']
origin_metadata_get_by(origin_id, provider_type=None, cur=None)[source]

Retrieve all origin_metadata entries for one origin_id

tool_cols = ['id', 'name', 'version', 'configuration']
mktemp_tool(cur=None)[source]
tool_add_from_temp(cur=None)[source]
tool_get(name, version, configuration, cur=None)[source]
metadata_provider_cols = ['id', 'provider_name', 'provider_type', 'provider_url', 'metadata']
metadata_provider_add(provider_name, provider_type, provider_url, metadata, cur=None)[source]

Insert a new provider and return the new identifier.

metadata_provider_get(provider_id, cur=None)[source]
metadata_provider_get_by(provider_name, provider_url, cur=None)[source]
__module__ = 'swh.storage.db'

swh.storage.exc module

exception swh.storage.exc.StorageDBError[source]

Bases: Exception

Specific storage db error (connection, erroneous queries, etc…)

__str__()[source]

Return str(self).

__module__ = 'swh.storage.exc'
__weakref__

list of weak references to the object (if defined)

exception swh.storage.exc.StorageAPIError[source]

Bases: Exception

Specific internal storage API error (mainly connection issues)

__str__()[source]

Return str(self).

__module__ = 'swh.storage.exc'
__weakref__

list of weak references to the object (if defined)

swh.storage.in_memory module

swh.storage.in_memory.now()[source]
class swh.storage.in_memory.Storage(journal_writer=None)[source]

Bases: object

__init__(journal_writer=None)[source]

Initialize self. See help(type(self)) for accurate signature.

reset()[source]
check_config(*, check_write)[source]

Check that the storage is configured and ready to go.

_content_add(contents, with_data)[source]
content_add(content)[source]

Add content blobs to the storage

Parameters:content (iterable) –

iterable of dictionaries representing individual pieces of content to add. Each dictionary has the following keys:

  • data (bytes): the actual content
  • length (int): content length (default: -1)
  • one key for each checksum algorithm in swh.model.hashutil.DEFAULT_ALGORITHMS, mapped to the corresponding checksum
  • status (str): one of visible, hidden, absent
  • reason (str): if status = absent, the reason why
  • origin (int): if status = absent, the origin we saw the content in
Raises:HashCollision in case of collision
Returns:Summary dict with the following keys and associated values:
  • content:add: New contents added
  • content_bytes:add: Sum of the contents’ data length
  • skipped_content:add: New skipped contents (no data) added
Return type:dict
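
The shape of a single content dict accepted by content_add can be sketched with hashlib; the make_content helper and the data are illustrative, not part of the API:

```python
import hashlib

def make_content(data):
    """Sketch: build one content dict of the shape content_add expects."""
    return {
        "data": data,
        "length": len(data),
        "status": "visible",
        "sha1": hashlib.sha1(data).digest(),
        # sha1 computed a la git: hash of "blob <length>\0" + data
        "sha1_git": hashlib.sha1(
            b"blob %d\x00%b" % (len(data), data)).digest(),
        "sha256": hashlib.sha256(data).digest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).digest(),
    }

content = make_content(b"some file contents")
# storage.content_add([content]) would then return a summary dict
# counting new contents and skipped contents.
```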
content_add_metadata(content)[source]

Add content metadata to the storage (like content_add, but without inserting to the objstorage).

Parameters:content (iterable) –

iterable of dictionaries representing individual pieces of content to add. Each dictionary has the following keys:

  • length (int): content length (default: -1)
  • one key for each checksum algorithm in swh.model.hashutil.DEFAULT_ALGORITHMS, mapped to the corresponding checksum
  • status (str): one of visible, hidden, absent
  • reason (str): if status = absent, the reason why
  • origin (int): if status = absent, the origin we saw the content in
  • ctime (datetime): time of insertion in the archive
Raises:HashCollision in case of collision
Returns:Summary dict with the following keys and associated values:
  • content:add: New contents added
  • skipped_content:add: New skipped contents (no data) added
Return type:dict
content_get(content)[source]

Retrieve in bulk contents and their data.

This function may yield more blobs than provided sha1 identifiers, in case they collide.

Parameters:

content – iterables of sha1

Yields:

Dict[str, bytes] – streams of contents as dicts with their raw data:

  • sha1 (bytes): content id
  • data (bytes): content’s raw data
Raises:ValueError – in case too many contents are required (cf. BULK_BLOCK_CONTENT_LEN_MAX)
content_get_range(start, end, limit=1000, db=None, cur=None)[source]

Retrieve contents within range [start, end] bound by limit.

Note that this function may return more than one blob per hash. The limit is enforced with multiplicity (i.e. two blobs with the same hash will count twice toward the limit).

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • limit (int) – Limit result (default to 1000)
Returns:

  • contents [dict]: iterable of contents in between the range.
  • next (bytes): There remains content in the range starting from this next sha1

Return type:

a dict with keys
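
A caller can page through a full range by feeding next back in as the new start. A client-side sketch, with a stub storage standing in for a real one (both names are illustrative):

```python
def iter_contents(storage, start=b"\x00" * 20, end=b"\xff" * 20):
    """Sketch: page through content_get_range using the 'next' key."""
    while start is not None:
        result = storage.content_get_range(start, end, limit=1000)
        yield from result["contents"]
        start = result["next"]   # None once the range is exhausted

class FakeStorage:
    """Stand-in returning two pages, just to exercise the loop."""
    def __init__(self):
        self.pages = [
            {"contents": [{"sha1": b"\x01" * 20}], "next": b"\x02" * 20},
            {"contents": [{"sha1": b"\x02" * 20}], "next": None},
        ]
    def content_get_range(self, start, end, limit=1000):
        return self.pages.pop(0)

all_contents = list(iter_contents(FakeStorage()))
```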

content_get_metadata(content)[source]

Retrieve content metadata in bulk

Parameters:content – iterable of content identifiers (sha1)
Returns:an iterable with content metadata corresponding to the given ids
content_find(content)[source]
content_missing(content, key_hash='sha1')[source]

List content missing from storage

Parameters:
  • contents ([dict]) – iterable of dictionaries whose keys are either ‘length’ or an item of swh.model.hashutil.ALGORITHMS; mapped to the corresponding checksum (or length).
  • key_hash (str) – name of the column to use as hash id result (default: ‘sha1’)
Returns:

missing content ids (as per the key_hash column)

Return type:

iterable ([bytes])

content_missing_per_sha1(contents)[source]

List content missing from storage based only on sha1.

Parameters:contents – Iterable of sha1 to check for absence.
Returns:missing ids
Return type:iterable
Raises:TODO – an exception when we get a hash collision.
directory_add(directories)[source]

Add directories to the storage

Parameters:directories (iterable) –

iterable of dictionaries representing the individual directories to add. Each dict has the following keys:

  • id (sha1_git): the id of the directory to add
  • entries (list): list of dicts for each entry in the
    directory. Each dict has the following keys:
    • name (bytes)
    • type (one of ‘file’, ‘dir’, ‘rev’): type of the directory entry (file, directory, revision)
    • target (sha1_git): id of the object pointed at by the directory entry
    • perms (int): entry permissions
Returns:directory:add: Number of directories actually added
Return type:Summary dict of keys with associated count as values
directory_missing(directories)[source]

List directories missing from storage

Parameters:directories (iterable) – an iterable of directory ids
Yields:missing directory ids
_join_dentry_to_content(dentry)[source]
_directory_ls(directory_id, recursive, prefix=b'')[source]
directory_ls(directory, recursive=False)[source]

Get entries for one directory.

Parameters:
  • directory – the directory to list entries from.
  • recursive – if True, the listing recurses into subdirectories.
Returns:

List of entries for such directory.

If recursive=True, names in the path of a dir/file not at the root are concatenated with a slash (/).

directory_entry_get_by_path(directory, paths)[source]

Get the directory entry (either file or dir) from directory with path.

Parameters:
  • directory – sha1 of the top-level directory
  • paths – path to look up from the top-level directory, from left (top) to right (bottom).
Returns:

The corresponding directory entry if found, None otherwise.

_directory_entry_get_by_path(directory, paths, prefix)[source]
revision_add(revisions)[source]

Add revisions to the storage

Parameters:revisions (Iterable[dict]) –

iterable of dictionaries representing the individual revisions to add. Each dict has the following keys:

  • id (sha1_git): id of the revision to add
  • date (dict): date the revision was written
  • committer_date (dict): date the revision got added to the origin
  • type (one of ‘git’, ‘tar’): type of the revision added
  • directory (sha1_git): the directory the revision points at
  • message (bytes): the message associated with the revision
  • author (Dict[str, bytes]): dictionary with keys: name, fullname, email
  • committer (Dict[str, bytes]): dictionary with keys: name, fullname, email
  • metadata (jsonb): extra information as dictionary
  • synthetic (bool): whether the revision is synthetic (e.g. created from a tarball or a plain directory)
  • parents (list[sha1_git]): the parents of this revision

date dictionaries have the form defined in swh.model.

Returns:Summary dict of keys with associated count as values:
  • revision_added: New objects actually stored in db
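
For illustration, one revision dict of the shape revision_add expects; the 20-byte identifiers, author and date values shown here are made up, and the date follows the swh.model form described above:

```python
# Illustrative revision dict; ids are fake 20-byte values.
date = {
    "timestamp": {"seconds": 1565096932, "microseconds": 0},
    "offset": 120,            # +02:00, expressed in minutes
    "negative_utc": False,
}
author = {"fullname": b"Jane Doe <jane@example.com>",
          "name": b"Jane Doe", "email": b"jane@example.com"}
revision = {
    "id": b"\x01" * 20,              # sha1_git of the revision
    "date": date,
    "committer_date": date,
    "type": "git",
    "directory": b"\x02" * 20,       # root directory it points at
    "message": b"Initial commit",
    "author": author,
    "committer": author,
    "metadata": None,
    "synthetic": False,
    "parents": [],                   # list of parent sha1_gits
}
# storage.revision_add([revision]) would then return the summary dict.
```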
revision_missing(revisions)[source]

List revisions missing from storage

Parameters:revisions (iterable) – revision ids
Yields:missing revision ids
revision_get(revisions)[source]
_get_parent_revs(rev_id, seen, limit)[source]
revision_log(revisions, limit=None)[source]

Fetch revision entry from the given root revisions.

Parameters:
  • revisions – array of root revision to lookup
  • limit – limitation on the output result. Default to None.
Yields:

List of revision log from such revisions root.

revision_shortlog(revisions, limit=None)[source]

Fetch the shortlog for the given revisions

Parameters:
  • revisions – list of root revisions to lookup
  • limit – depth limitation for the output
Yields:

a list of (id, parents) tuples.

release_add(releases)[source]

Add releases to the storage

Parameters:releases (Iterable[dict]) –

iterable of dictionaries representing the individual releases to add. Each dict has the following keys:

  • id (sha1_git): id of the release to add
  • revision (sha1_git): id of the revision the release points to
  • date (dict): the date the release was made
  • name (bytes): the name of the release
  • comment (bytes): the comment associated with the release
  • author (Dict[str, bytes]): dictionary with keys: name, fullname, email

the date dictionary has the form defined in swh.model.

Returns:Summary dict of keys with associated count as values:
  • release:add: New objects actually stored in db
release_missing(releases)[source]

List releases missing from storage

Parameters:releases – an iterable of release ids
Returns:a list of missing release ids
release_get(releases)[source]

Given a list of sha1s, return the releases’ information

Parameters:releases – list of sha1s
Yields:dicts with the same keys as those given to release_add (or None if a release does not exist)
snapshot_add(snapshots, origin=None, visit=None)[source]

Add a snapshot to the storage

Parameters:snapshot ([dict]) –

the snapshots to add, containing the following keys:

  • id (bytes): id of the snapshot
  • branches (dict): branches the snapshot contains, mapping the branch name (bytes) to the branch target, itself a dict (or None if the branch points to an unknown object)
    • target_type (str): one of content, directory, revision, release, snapshot, alias
    • target (bytes): identifier of the target (currently a sha1_git for all object kinds, or the name of the target branch for aliases)
Raises:ValueError – if the origin’s or visit’s identifier does not exist.
Returns:Summary dict of keys with associated count as values:
  • snapshot_added: Count of objects actually stored in db
snapshot_get(snapshot_id)[source]

Get the content, possibly partial, of a snapshot with the given id

The branches of the snapshot are iterated in the lexicographical order of their names.

Warning

At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method snapshot_get_branches() should be used instead.

Parameters:snapshot_id (bytes) – identifier of the snapshot
Returns:
a dict with three keys:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than 1000 branches.
Return type:dict
snapshot_get_by_origin_visit(origin, visit)[source]

Get the content, possibly partial, of a snapshot for the given origin visit

The branches of the snapshot are iterated in the lexicographical order of their names.

Warning

At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method snapshot_get_branches() should be used instead.

Parameters:
  • origin (int) – the origin’s identifier
  • visit (int) – the visit’s identifier
Returns:

None if the snapshot does not exist;
a dict with three keys otherwise:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than 1000 branches.

Return type:

dict

snapshot_get_latest(origin, allowed_statuses=None)[source]

Get the content, possibly partial, of the latest snapshot for the given origin, optionally only from visits that have one of the given allowed_statuses

The branches of the snapshot are iterated in the lexicographical order of their names.

Warning

At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the methods origin_visit_get_latest() and snapshot_get_branches() should be used instead.

Parameters:
  • origin (Union[str,int]) – the origin’s URL or identifier
  • allowed_statuses (list of str) – list of visit statuses considered to find the latest snapshot for the origin. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
Returns:

a dict with three keys:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than 1000 branches.

Return type:

dict

snapshot_count_branches(snapshot_id, db=None, cur=None)[source]

Count the number of branches in the snapshot with the given id

Parameters:snapshot_id (bytes) – identifier of the snapshot
Returns:A dict whose keys are the target types of branches and values their corresponding amount
Return type:dict
snapshot_get_branches(snapshot_id, branches_from=b'', branches_count=1000, target_types=None)[source]

Get the content, possibly partial, of a snapshot with the given id

The branches of the snapshot are iterated in the lexicographical order of their names.

Parameters:
  • snapshot_id (bytes) – identifier of the snapshot
  • branches_from (bytes) – optional parameter used to skip branches whose name is less than it before returning them
  • branches_count (int) – optional parameter used to restrain the amount of returned branches
  • target_types (list) – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)
Returns:

None if the snapshot does not exist;
a dict with three keys otherwise:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than branches_count branches after branches_from included.

Return type:

dict
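
Walking the whole branch set means feeding next_branch back in as branches_from. A client-side sketch, with a stub storage standing in for a real one (both names are illustrative):

```python
def iter_branches(storage, snapshot_id, page_size=1000):
    """Sketch: walk all branches using the next_branch cursor."""
    branches_from = b""
    while branches_from is not None:
        snapshot = storage.snapshot_get_branches(
            snapshot_id, branches_from=branches_from,
            branches_count=page_size)
        yield from sorted(snapshot["branches"].items())
        branches_from = snapshot["next_branch"]

class FakeStorage:
    """Stand-in serving one three-branch snapshot, page by page."""
    def snapshot_get_branches(self, snapshot_id, branches_from=b"",
                              branches_count=1000, target_types=None):
        branches = {b"refs/heads/dev": None,
                    b"refs/heads/main": None,
                    b"refs/tags/v1.0": None}
        names = sorted(n for n in branches if n >= branches_from)
        page = names[:branches_count]
        nxt = names[branches_count] if len(names) > branches_count else None
        return {"id": snapshot_id,
                "branches": {n: branches[n] for n in page},
                "next_branch": nxt}

names = [n for n, _ in iter_branches(FakeStorage(), b"\x03" * 20,
                                     page_size=2)]
```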

object_find_by_sha1_git(ids, db=None, cur=None)[source]

Return the objects found with the given ids.

Parameters:ids – a generator of sha1_gits
Returns:a mapping from id to the list of objects found. Each object found is itself a dict with keys:
  • sha1_git: the input id
  • type: the type of object found
  • id: the id of the object found
  • object_id: the numeric id of the object found.
Return type:dict
origin_get(origins)[source]

Return origins, either all identified by their ids or all identified by tuples (type, url).

If the url is given and the type is omitted, one of the origins with that url is returned.

Parameters:origin

a list of dictionaries representing the individual origins to find. These dicts have either the key url (and optionally type):

  • type (FIXME: enum TBD): the origin type (‘git’, ‘wget’, …)
  • url (bytes): the url the origin points to

or the id:

  • id (int): the origin’s identifier
Returns:the origin dictionary with the keys:
  • id: origin’s id
  • type: origin’s type
  • url: origin’s url
Return type:dict
Raises:ValueError – if the keys match neither url (and optionally type) nor id.
origin_get_range(origin_from=1, origin_count=100)[source]

Retrieve origin_count origins whose ids are greater than or equal to origin_from.

Origins are sorted by id before retrieving them.

Parameters:
  • origin_from (int) – the minimum id of origins to retrieve
  • origin_count (int) – the maximum number of origins to retrieve
Yields:

dicts containing origin information as returned by swh.storage.in_memory.Storage.origin_get().

origin_search(url_pattern, offset=0, limit=50, regexp=False, with_visit=False, db=None, cur=None)[source]

Search for origins whose urls contain a provided string pattern or match a provided regular expression. The search is performed in a case insensitive way.

Parameters:
  • url_pattern (str) – the string pattern to search for in origin urls
  • offset (int) – number of found origins to skip before returning results
  • limit (int) – the maximum number of found origins to return
  • regexp (bool) – if True, consider the provided pattern as a regular expression and return origins whose urls match it
  • with_visit (bool) – if True, filter out origins with no visit
Returns:

An iterable of dict containing origin information as returned by swh.storage.storage.Storage.origin_get().

origin_count(url_pattern, regexp=False, with_visit=False, db=None, cur=None)[source]

Count origins whose urls contain a provided string pattern or match a provided regular expression. The pattern search in origin urls is performed in a case insensitive way.

Parameters:
  • url_pattern (str) – the string pattern to search for in origin urls
  • regexp (bool) – if True, consider the provided pattern as a regular expression and return origins whose urls match it
  • with_visit (bool) – if True, filter out origins with no visit
Returns:

The number of origins matching the search criterion.

Return type:

int

origin_add(origins)[source]

Add origins to the storage

Parameters:origins

list of dictionaries representing the individual origins, with the following keys:

  • type: the origin type (‘git’, ‘svn’, ‘deb’, …)
  • url (bytes): the url the origin points to
Returns:given origins as dict updated with their id
Return type:list
origin_add_one(origin)[source]

Add origin to the storage

Parameters:origin

dictionary representing the individual origin to add. This dict has the following keys:

  • type (FIXME: enum TBD): the origin type (‘git’, ‘wget’, …)
  • url (bytes): the url the origin points to
Returns:the id of the added origin, or of the identical one that already exists.
fetch_history_start(origin_id)[source]

Add an entry for origin origin_id in fetch_history. Returns the id of the added fetch_history entry

fetch_history_end(fetch_history_id, data)[source]

Close the fetch_history entry with id fetch_history_id, replacing its data with data.

fetch_history_get(fetch_history_id)[source]

Get the fetch_history entry with id fetch_history_id.

origin_visit_add(origin, date=None, type=None, *, ts=None)[source]

Add an origin_visit for the origin at date with status ‘ongoing’.

For backward compatibility, type is optional and defaults to the origin’s type.

Parameters:
  • origin (Union[int,str]) – visited origin’s identifier or URL
  • date – timestamp of such visit
  • type (str) – the type of loader used for the visit (hg, git, …)
Returns:

dictionary with keys origin and visit where:

  • origin: origin’s identifier
  • visit: the visit’s identifier for the new visit occurrence

Return type:

dict

origin_visit_update(origin, visit_id, status=None, metadata=None, snapshot=None)[source]

Update an origin_visit’s status.

Parameters:
  • origin (Union[int,str]) – visited origin’s identifier or URL
  • visit_id (int) – visit’s identifier
  • status – visit’s new status
  • metadata – data associated to the visit
  • snapshot (sha1_git) – identifier of the snapshot to add to the visit
Returns:

None

origin_visit_upsert(visits)[source]

Add origin_visits with specific ids and with all their data. If there is already an origin_visit with the same (origin_id, visit_id), it is updated instead of inserting a new one.

Parameters:visits

iterable of dicts with keys:

  • origin: visited origin id
  • visit: origin visit id
  • type: type of loader used for the visit
  • date: timestamp of such visit
  • status: visit’s new status
  • metadata: data associated to the visit
  • snapshot (sha1_git): identifier of the snapshot to add to the visit

origin_visit_get(origin, last_visit=None, limit=None)[source]

Retrieve all the origin’s visits’ information.

Parameters:
  • origin (int) – the origin’s identifier
  • last_visit (int) – visit’s id from which listing the next ones, default to None
  • limit (int) – maximum number of results to return, default to None
Yields:

List of visits.

origin_visit_find_by_date(origin, visit_date)[source]

Retrieves the origin visit whose date is closest to the provided timestamp. In case of a tie, the visit with largest id is selected.

Parameters:
  • origin (str) – The occurrence’s origin (URL).
  • visit_date (datetime) – target timestamp
Returns:

A visit.

origin_visit_get_by(origin, visit)[source]

Retrieve origin visit’s information.

Parameters:
  • origin (int) – the origin’s identifier
  • visit (int) – the visit’s identifier
Returns:The information on that particular (origin, visit) or None if it does not exist
origin_visit_get_latest(origin, allowed_statuses=None, require_snapshot=False)[source]

Get the latest origin visit for the given origin, optionally looking only for those with one of the given allowed_statuses or for those with a known snapshot.

Parameters:
  • origin (str) – the origin’s URL
  • allowed_statuses (list of str) – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
  • require_snapshot (bool) – If True, only a visit with a snapshot will be returned.
Returns:

a dict with the following keys:

  • origin: the URL of the origin
  • visit: origin visit id
  • type: type of loader used for the visit
  • date: timestamp of such visit
  • status: visit’s status
  • metadata: data associated to the visit
  • snapshot (Optional[sha1_git]): identifier of the snapshot associated to the visit


Return type:

dict
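The selection rule described above can be sketched in plain Python over a list of visit dicts; the visits below are made-up examples and `latest_visit` is a hypothetical stand-in for the method's filtering logic:

```python
import datetime

# Made-up visit dicts for one origin.
visits = [
    {"visit": 1, "date": datetime.datetime(2019, 1, 1), "status": "full",
     "snapshot": b"\x01" * 20},
    {"visit": 2, "date": datetime.datetime(2019, 2, 1), "status": "partial",
     "snapshot": None},
    {"visit": 3, "date": datetime.datetime(2019, 3, 1), "status": "ongoing",
     "snapshot": None},
]

def latest_visit(visits, allowed_statuses=None, require_snapshot=False):
    # Keep visits matching the optional filters, then pick the most recent one.
    candidates = [
        v for v in visits
        if (allowed_statuses is None or v["status"] in allowed_statuses)
        and (not require_snapshot or v["snapshot"] is not None)
    ]
    return max(candidates, key=lambda v: v["date"], default=None)

latest_visit(visits)                             # visit 3: most recent overall
latest_visit(visits, allowed_statuses=["full"])  # visit 1: only completed visits
latest_visit(visits, require_snapshot=True)      # visit 1: only visits with a snapshot
```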

person_get(person)[source]

Return the persons identified by their ids.

Parameters:person – array of ids.
Returns:The array of persons corresponding to the ids.
stat_counters()[source]

Compute statistics about the number of tuples in various tables.

Returns:a dictionary mapping textual labels (e.g., content) to integer values (e.g., the number of tuples in table content)
Return type:dict
refresh_stat_counters()[source]

Recomputes the statistics for stat_counters.

origin_metadata_add(origin_id, ts, provider, tool, metadata, db=None, cur=None)[source]

Add an origin_metadata for the origin at ts with provenance and metadata.

Parameters:
  • origin_id (int) – the origin’s id for which the metadata is added
  • ts (datetime) – timestamp of the found metadata
  • provider – id of the provider of metadata (ex:’hal’)
  • tool – id of the tool used to extract metadata
  • metadata (jsonb) – the metadata retrieved at the time and location
origin_metadata_get_by(origin_id, provider_type=None, db=None, cur=None)[source]

Retrieve list of all origin_metadata entries for the origin_id

Parameters:
  • origin_id (int) – the unique origin’s identifier
  • provider_type (str) – (optional) type of provider
Returns:

the origin_metadata dictionary with the keys:

  • origin_id (int): origin’s identifier
  • discovery_date (datetime): timestamp of discovery
  • tool_id (int): metadata’s extracting tool
  • metadata (jsonb)
  • provider_id (int): metadata’s provider
  • provider_name (str)
  • provider_type (str)
  • provider_url (str)

Return type:

list of dicts

tool_add(tools)[source]

Add new tools to the storage.

Parameters:tools (iterable of dict) –

Tool information to add to storage. Each tool is a dict with the following keys:

  • name (str): name of the tool
  • version (str): version of the tool
  • configuration (dict): configuration of the tool, must be json-encodable
Returns:All the tools inserted in storage (including the internal id). The order of the list is not guaranteed to match the order of the initial list.
Return type:dict
tool_get(tool)[source]

Retrieve tool information.

Parameters:tool (dict) – Tool information we want to retrieve from storage. The dicts have the same keys as those used in tool_add().
Returns:The full tool information if it exists (id included), None otherwise.
Return type:dict
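A hypothetical tool dict accepted by tool_add() and tool_get(); the name, version, and configuration values are made-up examples, and the configuration must be JSON-encodable:

```python
import json

# Hypothetical tool description; the values below are made-up examples.
tool = {
    "name": "swh-metadata-translator",
    "version": "0.0.1",
    "configuration": {"type": "local", "context": "npm"},
}
json.dumps(tool["configuration"])  # must not raise: configuration is JSON-encodable
# tools = storage.tool_add([tool])  # returned dicts also carry the internal id
```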
metadata_provider_add(provider_name, provider_type, provider_url, metadata)[source]

Add a metadata provider.

Parameters:
  • provider_name (str) – Its name
  • provider_type (str) – Its type
  • provider_url (str) – Its URL
  • metadata – JSON-encodable object
Returns:

an identifier of the provider

metadata_provider_get(provider_id, db=None, cur=None)[source]

Get a metadata provider

Parameters:provider_id – Its identifier, as given by metadata_provider_add.
Returns:
same as metadata_provider_add;
or None if it does not exist.
Return type:dict
metadata_provider_get_by(provider, db=None, cur=None)[source]

Get a metadata provider

Parameters:
  • provider_name – Its name
  • provider_url – Its URL
Returns:

same as metadata_provider_add;

or None if it does not exist.

Return type:

dict

_origin_id(origin)[source]
_person_add(person)[source]

Add a person in storage.

Note: Private method, do not use outside of this class.

Parameters:person – dictionary with keys fullname, name and email.
static _content_key(content)[source]

A stable key for a content

static _tool_key(tool)[source]
__module__ = 'swh.storage.in_memory'
__weakref__

list of weak references to the object (if defined)

static _metadata_provider_key(provider)[source]

swh.storage.journal_writer module

class swh.storage.journal_writer.InMemoryJournalWriter[source]

Bases: object

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

write_addition(object_type, object_)[source]
write_update(object_type, object_)
write_additions(object_type, objects)[source]
__module__ = 'swh.storage.journal_writer'
__weakref__

list of weak references to the object (if defined)

swh.storage.journal_writer.get_journal_writer(cls, args={})[source]

swh.storage.storage module

swh.storage.storage.EMPTY_SNAPSHOT_ID = b'\x1a\x88\x93\xe6\xa8oDN\x8b\xe8\xe7\xbd\xa6\xcb4\xfb\x175\xa0\x0e'

Identifier for the empty snapshot

class swh.storage.storage.Storage(db, objstorage, min_pool_conns=1, max_pool_conns=10, journal_writer=None)[source]

Bases: object

SWH storage proxy, encompassing DB and object storage

__init__(db, objstorage, min_pool_conns=1, max_pool_conns=10, journal_writer=None)[source]
Parameters:
  • db_conn – either a libpq connection string, or a psycopg2 connection
  • obj_root – path to the root of the object storage
get_db()[source]
put_db(db)[source]
check_config(*, check_write, db, cur)[source]

Check that the storage is configured and ready to go.

_content_unique_key(hash, db)[source]

Given a hash (tuple or dict), return a unique key from the aggregation of keys.

_filter_new_content(content, db, cur)[source]
_content_add_metadata(db, cur, content_with_data, content_without_data)[source]
content_add(content, db, cur)[source]

Add content blobs to the storage

Note: in case of DB errors, objects might have already been added to the object storage and will not be removed. Since addition to the object storage is idempotent, that should not be a problem.

Parameters:

content (iterable) –

iterable of dictionaries representing individual pieces of content to add. Each dictionary has the following keys:

  • data (bytes): the actual content
  • length (int): content length (default: -1)
  • one key for each checksum algorithm in swh.model.hashutil.ALGORITHMS, mapped to the corresponding checksum
  • status (str): one of visible, hidden, absent
  • reason (str): if status = absent, the reason why
  • origin (int): if status = absent, the origin we saw the content in

Raises:
In case of errors, nothing is stored in the db (though data may already be in the objstorage). The following exceptions can occur:

  • HashCollision in case of a hash collision
  • Any other exception raised by the db
Returns:

  • content:add: New contents added
  • content:add:bytes: Sum of the contents’ data length
  • skipped_content:add: New skipped contents (no data) added

Return type:

Summary dict with the above keys and their associated values
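A sketch of one content dict as described above, with checksums computed via hashlib; only part of swh.model.hashutil.ALGORITHMS is shown, and the sha1_git value hashes the git blob header followed by the data:

```python
import hashlib

data = b"some file contents"

# Hypothetical content entry; only some of the ALGORITHMS checksums are shown.
content = {
    "data": data,
    "length": len(data),
    "sha1": hashlib.sha1(data).digest(),
    "sha256": hashlib.sha256(data).digest(),
    # sha1_git hashes the git blob header followed by the data:
    "sha1_git": hashlib.sha1(b"blob %d\x00" % len(data) + data).digest(),
    "status": "visible",
}
# storage.content_add([content])  # storage: a configured Storage instance
```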

content_update(content, keys=[], db=None, cur=None)[source]

Update content blobs to the storage. Does nothing for unknown contents or skipped ones.

Parameters:
  • content (iterable) –

    iterable of dictionaries representing individual pieces of content to update. Each dictionary has the following keys:

    • data (bytes): the actual content
    • length (int): content length (default: -1)
    • one key for each checksum algorithm in swh.model.hashutil.ALGORITHMS, mapped to the corresponding checksum
    • status (str): one of visible, hidden, absent
  • keys (list) – List of keys (str) whose values needs an update, e.g., new hash column
content_add_metadata(content, db, cur)[source]

Add content metadata to the storage (like content_add, but without inserting to the objstorage).

Parameters:content (iterable) –

iterable of dictionaries representing individual pieces of content to add. Each dictionary has the following keys:

  • length (int): content length (default: -1)
  • one key for each checksum algorithm in swh.model.hashutil.ALGORITHMS, mapped to the corresponding checksum
  • status (str): one of visible, hidden, absent
  • reason (str): if status = absent, the reason why
  • origin (int): if status = absent, the origin we saw the content in
  • ctime (datetime): time of insertion in the archive
Returns:
  • content:add: New contents added
  • skipped_content:add: New skipped contents (no data) added
Return type:Summary dict with the above keys and associated values
content_get(content)[source]

Retrieve in bulk contents and their data.

This generator yields as many items as there are provided sha1 identifiers, but callers should not assume this will always hold.

It yields None for each object that was not found.

Parameters:

content – iterables of sha1

Yields:

Dict[str, bytes] – generates streams of contents as dicts with their raw data:
  • sha1 (bytes): content id
  • data (bytes): content’s raw data
Raises:

ValueError – in case too many contents are requested (cf. BULK_BLOCK_CONTENT_LEN_MAX)
content_get_range(start, end, limit=1000, db=None, cur=None)[source]

Retrieve contents within range [start, end] bound by limit.

Note that this function may return more than one blob per hash. The limit is enforced with multiplicity (i.e., two blobs with the same hash count twice toward the limit).

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • limit (int) – Limit result (default to 1000)
Returns:

  • contents [dict]: iterable of contents in between the range.
  • next (bytes): There remains content in the range starting from this next sha1

Return type:

a dict with keys
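The next cursor described above suggests the following pagination loop; content_get_range_stub is a hypothetical stand-in for the method, backed by a plain list of sha1s:

```python
# Seven fake sha1 identifiers, already sorted.
SHA1S = [bytes([i]) * 20 for i in range(7)]

def content_get_range_stub(start, end, limit=3):
    # Mimics the documented contract: contents in [start, end], plus the
    # next sha1 to resume from (or None when the range is exhausted).
    in_range = [s for s in SHA1S if start <= s <= end]
    page = in_range[:limit]
    next_id = in_range[limit] if len(in_range) > limit else None
    return {"contents": [{"sha1": s} for s in page], "next": next_id}

collected = []
cursor = b"\x00" * 20
while cursor is not None:
    result = content_get_range_stub(cursor, b"\xff" * 20)
    collected.extend(c["sha1"] for c in result["contents"])
    cursor = result["next"]
# collected now holds all seven identifiers, in order, without duplicates.
```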

content_get_metadata(content, db=None, cur=None)[source]

Retrieve content metadata in bulk

Parameters:content – iterable of content identifiers (sha1)
Returns:an iterable with content metadata corresponding to the given ids
content_missing(content, key_hash='sha1', db=None, cur=None)[source]

List content missing from storage

Parameters:
  • content ([dict]) – iterable of dictionaries whose keys are either ‘length’ or an item of swh.model.hashutil.ALGORITHMS; mapped to the corresponding checksum (or length).
  • key_hash (str) – name of the column to use as hash id result (default: ‘sha1’)
Returns:

missing content ids (as per the key_hash column)

Return type:

iterable ([bytes])

Raises:

TODO – an exception when we get a hash collision.

content_missing_per_sha1(contents, db=None, cur=None)[source]

List content missing from storage based only on sha1.

Parameters:contents – Iterable of sha1 to check for absence.
Returns:missing ids
Return type:iterable
Raises:TODO – an exception when we get a hash collision.
skipped_content_missing(contents, db=None, cur=None)[source]

List skipped_content missing from storage

Parameters:contents – iterable of dictionaries containing the data for each checksum algorithm.
Returns:missing signatures
Return type:iterable
content_find(content, db=None, cur=None)[source]

Find a content hash in db.

Parameters:content – a dictionary representing one content hash, mapping checksum algorithm names (see swh.model.hashutil.ALGORITHMS) to checksum values
Returns:a triplet (sha1, sha1_git, sha256) if the content exists, or None otherwise.
Raises:ValueError – in case the key of the dictionary is not sha1, sha1_git nor sha256.
directory_add(directories, db, cur)[source]

Add directories to the storage

Parameters:directories (iterable) –

iterable of dictionaries representing the individual directories to add. Each dict has the following keys:

  • id (sha1_git): the id of the directory to add
  • entries (list): list of dicts for each entry in the
    directory. Each dict has the following keys:
    • name (bytes)
    • type (one of ‘file’, ‘dir’, ‘rev’): type of the directory entry (file, directory, revision)
    • target (sha1_git): id of the object pointed at by the directory entry
    • perms (int): entry permissions
Returns:directory:add: Number of directories actually added
Return type:Summary dict of keys with associated count as values
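A hypothetical directory dict matching the keys above; perms uses git-style integer modes (0o100644 for a regular file), and the 20-byte ids are placeholders:

```python
# Hypothetical directory with a single file entry; the ids are placeholders,
# not real sha1_git values.
entry = {
    "name": b"README.md",
    "type": "file",
    "target": b"\x2a" * 20,  # sha1_git of the pointed-at content
    "perms": 0o100644,       # git-style mode for a regular file
}
directory = {"id": b"\x0f" * 20, "entries": [entry]}
# storage.directory_add([directory])
```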
directory_missing(directories, db=None, cur=None)[source]

List directories missing from storage

Parameters:directories (iterable) – an iterable of directory ids
Yields:missing directory ids
directory_ls(directory, recursive=False, db=None, cur=None)[source]

Get entries for one directory.

Parameters:
  • directory (-) – the directory to list entries from.
  • recursive (-) – if True, list entries recursively from this directory.
Returns:

List of entries for such directory.

If recursive=True, names in the path of a dir/file not at the root are concatenated with a slash (/).

directory_entry_get_by_path(directory, paths, db=None, cur=None)[source]

Get the directory entry (either file or dir) from directory with path.

Parameters:
  • directory (-) – sha1 of the top level directory
  • paths (-) – path to lookup from the top level directory. From left (top) to right (bottom).
Returns:

The corresponding directory entry if found, None otherwise.

revision_add(revisions, db, cur)[source]

Add revisions to the storage

Parameters:revisions (Iterable[dict]) –

iterable of dictionaries representing the individual revisions to add. Each dict has the following keys:

  • id (sha1_git): id of the revision to add
  • date (dict): date the revision was written
  • committer_date (dict): date the revision got added to the origin
  • type (one of ‘git’, ‘tar’): type of the revision added
  • directory (sha1_git): the directory the revision points at
  • message (bytes): the message associated with the revision
  • author (Dict[str, bytes]): dictionary with keys: name, fullname, email
  • committer (Dict[str, bytes]): dictionary with keys: name, fullname, email
  • metadata (jsonb): extra information as dictionary
  • synthetic (bool): revision’s nature (tarball or directory imports create synthetic revisions)
  • parents (list[sha1_git]): the parents of this revision

date dictionaries have the form defined in swh.model.

Returns:
  • revision:add: New objects actually stored in db
Return type:Summary dict of keys with associated count as values
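A hypothetical revision dict with the keys above; the date dicts follow the swh.model form shown in the converters (timestamp, offset in minutes, negative_utc), with the timestamp simplified to an integer here, and the 20-byte ids as placeholders:

```python
# Hypothetical revision; the 20-byte ids are placeholders.
author = {
    "name": b"Jane Doe",
    "email": b"jane@example.com",
    "fullname": b"Jane Doe <jane@example.com>",
}
revision = {
    "id": b"\x11" * 20,
    "date": {"timestamp": 1234567890, "offset": 120, "negative_utc": False},
    "committer_date": {"timestamp": 1234567890, "offset": 120, "negative_utc": False},
    "type": "git",
    "directory": b"\x0f" * 20,
    "message": b"Initial import",
    "author": author,
    "committer": author,
    "metadata": None,
    "synthetic": False,
    "parents": [],
}
# storage.revision_add([revision])
```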
revision_missing(revisions, db=None, cur=None)[source]

List revisions missing from storage

Parameters:revisions (iterable) – revision ids
Yields:missing revision ids
revision_get(revisions, db=None, cur=None)[source]

Get all revisions from storage

Parameters:revisions – an iterable of revision ids
Returns:
an iterable of revisions as dictionaries (or None if the
revision doesn’t exist)
Return type:iterable
revision_log(revisions, limit=None, db=None, cur=None)[source]

Fetch revision entry from the given root revisions.

Parameters:
  • revisions – array of root revision to lookup
  • limit – limitation on the output result. Default to None.
Yields:

List of revision log from such revisions root.

revision_shortlog(revisions, limit=None, db=None, cur=None)[source]

Fetch the shortlog for the given revisions

Parameters:
  • revisions – list of root revisions to lookup
  • limit – depth limitation for the output
Yields:

a list of (id, parents) tuples.

release_add(releases, db, cur)[source]

Add releases to the storage

Parameters:releases (Iterable[dict]) –

iterable of dictionaries representing the individual releases to add. Each dict has the following keys:

  • id (sha1_git): id of the release to add
  • revision (sha1_git): id of the revision the release points to
  • date (dict): the date the release was made
  • name (bytes): the name of the release
  • comment (bytes): the comment associated with the release
  • author (Dict[str, bytes]): dictionary with keys: name, fullname, email

the date dictionary has the form defined in swh.model.

Returns:
  • release:add: New objects actually stored in db
Return type:Summary dict of keys with associated count as values
release_missing(releases, db=None, cur=None)[source]

List releases missing from storage

Parameters:releases – an iterable of release ids
Returns:a list of missing release ids
release_get(releases, db=None, cur=None)[source]

Given a list of sha1, return the releases’s information

Parameters:releases – list of sha1s
Yields:dicts with the same keys as those given to release_add (or None if a release does not exist)
snapshot_add(snapshots, origin=None, visit=None, db=None, cur=None)[source]

Add snapshots to the storage.

Parameters:
  • snapshots ([dict]) –

    the snapshots to add, containing the following keys:

    • id (bytes): id of the snapshot
    • branches (dict): branches the snapshot contains, mapping the branch name (bytes) to the branch target, itself a dict (or None if the branch points to an unknown object)
      • target_type (str): one of content, directory, revision, release, snapshot, alias
      • target (bytes): identifier of the target (currently a sha1_git for all object kinds, or the name of the target branch for aliases)
  • origin (int) – legacy argument for backward compatibility
  • visit (int) – legacy argument for backward compatibility
Raises:

ValueError – if the origin or visit id does not exist.

Returns:

Summary dict of keys with associated count as values

snapshot:add: Count of objects actually stored in db
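A hypothetical snapshot dict for snapshot_add(), including an alias branch whose target is the name of another branch rather than an object id; the 20-byte ids are placeholders:

```python
# Hypothetical snapshot; the 20-byte ids are placeholders.
snapshot = {
    "id": b"\x99" * 20,
    "branches": {
        b"refs/heads/master": {"target_type": "revision", "target": b"\x11" * 20},
        # An alias branch targets another branch name, not an object id:
        b"HEAD": {"target_type": "alias", "target": b"refs/heads/master"},
        # A branch pointing to an unknown object is mapped to None:
        b"refs/heads/broken": None,
    },
}
# storage.snapshot_add([snapshot])
```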

snapshot_get(snapshot_id, db=None, cur=None)[source]

Get the content, possibly partial, of a snapshot with the given id

The branches of the snapshot are iterated in the lexicographical order of their names.

Warning

At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method snapshot_get_branches() should be used instead.

Parameters:snapshot_id (bytes) – identifier of the snapshot
Returns:
a dict with three keys:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than 1000 branches.
Return type:dict
snapshot_get_by_origin_visit(origin, visit, db=None, cur=None)[source]

Get the content, possibly partial, of a snapshot for the given origin visit

The branches of the snapshot are iterated in the lexicographical order of their names.

Warning

At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method snapshot_get_branches() should be used instead.

Parameters:
  • origin (int) – the origin identifier
  • visit (int) – the visit identifier
Returns:

None if the snapshot does not exist;
a dict with three keys otherwise:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than 1000 branches.

Return type:

dict

snapshot_get_latest(origin, allowed_statuses=None, db=None, cur=None)[source]

Get the content, possibly partial, of the latest snapshot for the given origin, optionally only from visits that have one of the given allowed_statuses

The branches of the snapshot are iterated in the lexicographical order of their names.

Warning

At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method snapshot_get_branches() should be used instead.

Parameters:
  • origin (Union[str,int]) – the origin’s URL or identifier
  • allowed_statuses (list of str) – list of visit statuses considered to find the latest snapshot for the visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
Returns:

a dict with three keys:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than 1000 branches.

Return type:

dict

snapshot_count_branches(snapshot_id, db=None, cur=None)[source]

Count the number of branches in the snapshot with the given id

Parameters:snapshot_id (bytes) – identifier of the snapshot
Returns:A dict whose keys are the target types of branches and values their corresponding amount
Return type:dict
snapshot_get_branches(snapshot_id, branches_from=b'', branches_count=1000, target_types=None, db=None, cur=None)[source]

Get the content, possibly partial, of a snapshot with the given id

The branches of the snapshot are iterated in the lexicographical order of their names.

Parameters:
  • snapshot_id (bytes) – identifier of the snapshot
  • branches_from (bytes) – optional parameter used to skip branches whose name is lesser than it before returning them
  • branches_count (int) – optional parameter used to restrain the amount of returned branches
  • target_types (list) – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)
Returns:

None if the snapshot does not exist;
a dict with three keys otherwise:
  • id: identifier of the snapshot
  • branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
  • next_branch: the name of the first branch not returned or None if the snapshot has less than branches_count branches after branches_from included.

Return type:

dict
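Using branches_from and next_branch, all branches can be fetched page by page; snapshot_get_branches_stub below is a hypothetical stand-in for the method, backed by a plain dict:

```python
# Ten fake branches mapped to placeholder revision targets.
ALL_BRANCHES = {
    b"refs/heads/%d" % i: {"target_type": "revision", "target": bytes([i]) * 20}
    for i in range(10)
}

def snapshot_get_branches_stub(branches_from=b"", branches_count=3):
    # Branches are iterated in lexicographical order; branches_from is the
    # first name (inclusive) of the requested page.
    names = sorted(n for n in ALL_BRANCHES if n >= branches_from)
    page = names[:branches_count]
    next_branch = names[branches_count] if len(names) > branches_count else None
    return {
        "id": b"\x01" * 20,
        "branches": {n: ALL_BRANCHES[n] for n in page},
        "next_branch": next_branch,
    }

branches = {}
cursor = b""
while cursor is not None:
    result = snapshot_get_branches_stub(branches_from=cursor)
    branches.update(result["branches"])
    cursor = result["next_branch"]
# branches now contains all ten entries.
```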

origin_visit_add(origin, date=None, type=None, db=None, cur=None, *, ts=None)[source]

Add an origin_visit for the origin at date with status ‘ongoing’.

For backward compatibility, type is optional and defaults to the origin’s type.

Parameters:
  • origin (Union[int,str]) – visited origin’s identifier or URL
  • date – timestamp of such visit
  • type (str) – the type of loader used for the visit (hg, git, …)
Returns:

dictionary with keys origin and visit where:

  • origin: origin identifier
  • visit: the visit identifier for the new visit occurrence

Return type:

dict

origin_visit_update(origin, visit_id, status=None, metadata=None, snapshot=None, db=None, cur=None)[source]

Update an origin_visit’s status.

Parameters:
  • origin (Union[int,str]) – visited origin’s identifier or URL
  • visit_id – Visit’s id
  • status – Visit’s new status
  • metadata – Data associated to the visit
  • snapshot (sha1_git) – identifier of the snapshot to add to the visit
Returns:

None

origin_visit_upsert(visits, db=None, cur=None)[source]

Add origin_visits with specific ids and with all their data. If there is already an origin_visit with the same (origin_id, visit_id), it is overwritten.

Parameters:visits

iterable of dicts with keys:

  • origin: visited origin id
  • visit: origin visit id
  • date: timestamp of such visit
  • status: visit’s new status
  • metadata: data associated to the visit
  • snapshot (sha1_git): identifier of the snapshot to add to the visit

origin_visit_get(origin, last_visit=None, limit=None, db=None, cur=None)[source]

Retrieve all the origin’s visits’ information.

Parameters:
  • origin (Union[int,str]) – The occurrence’s origin (identifier/URL).
  • last_visit – Starting point from which listing the next visits Default to None
  • limit (int) – Number of results to return from the last visit. Default to None
Yields:

List of visits.

origin_visit_find_by_date(origin, visit_date, db=None, cur=None)[source]

Retrieves the origin visit whose date is closest to the provided timestamp. In case of a tie, the visit with largest id is selected.

Parameters:
  • origin (str) – The occurrence’s origin (URL).
  • visit_date (datetime) – target timestamp
Returns:

A visit.

origin_visit_get_by(origin, visit, db=None, cur=None)[source]

Retrieve origin visit’s information.

Parameters:
  • origin – The occurrence’s origin (identifier).
  • visit – The visit’s identifier.
Returns:The information on that particular (origin, visit) or None if it does not exist
origin_visit_get_latest(origin, allowed_statuses=None, require_snapshot=False, db=None, cur=None)[source]

Get the latest origin visit for the given origin, optionally looking only for those with one of the given allowed_statuses or for those with a known snapshot.

Parameters:
  • origin (str) – the origin’s URL
  • allowed_statuses (list of str) – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
  • require_snapshot (bool) – If True, only a visit with a snapshot will be returned.
Returns:

a dict with the following keys:

  • origin: the URL of the origin
  • visit: origin visit id
  • type: type of loader used for the visit
  • date: timestamp of such visit
  • status: visit’s status
  • metadata: data associated to the visit
  • snapshot (Optional[sha1_git]): identifier of the snapshot associated to the visit


Return type:

dict

object_find_by_sha1_git(ids, db=None, cur=None)[source]

Return the objects found with the given ids.

Parameters:ids – a generator of sha1_gits
Returns:a mapping from id to the list of objects found. Each object found is itself a dict with keys:
  • sha1_git: the input id
  • type: the type of object found
  • id: the id of the object found
  • object_id: the numeric id of the object found.
Return type:dict
origin_keys = ['id', 'type', 'url']
origin_get(origins, db=None, cur=None)[source]

Return origins, either all identified by their ids or all identified by tuples (type, url).

If the url is given and the type is omitted, one of the origins with that url is returned.

Parameters:origins

a list of dictionaries representing the individual origins to find. These dicts have either the key url (and optionally type):

  • type (FIXME: enum TBD): the origin type (‘git’, ‘wget’, …)
  • url (bytes): the url the origin points to

or the id:

  • id: the origin id
Returns:the origin dictionary with the keys:
  • id: origin’s id
  • type: origin’s type
  • url: origin’s url
Return type:dict
Raises:ValueError – if the provided keys match neither (url and optionally type) nor id.
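The key handling documented above (either a url, optionally with type, or an id; anything else is rejected) can be sketched like this; the helper name is hypothetical.

```python
# Toy sketch of the documented origin-dict validation.
def origin_query_key(origin):
    if "id" in origin:
        return ("id", origin["id"])
    if "url" in origin:
        # type is optional when url is given.
        return ("url", origin.get("type"), origin["url"])
    raise ValueError("origin must have either an 'id' or a 'url' key")

assert origin_query_key({"id": 42}) == ("id", 42)
assert origin_query_key({"url": b"https://example.org"})[0] == "url"
```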
origin_get_range(origin_from=1, origin_count=100, db=None, cur=None)[source]

Retrieve origin_count origins whose ids are greater than or equal to origin_from.

Origins are sorted by id before retrieving them.

Parameters:
  • origin_from (int) – the minimum id of origins to retrieve
  • origin_count (int) – the maximum number of origins to retrieve
Yields:

dicts containing origin information as returned by swh.storage.storage.Storage.origin_get().
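A minimal sketch of the pagination semantics described above, over an in-memory list of origin dicts (the real method streams rows from the database):

```python
# Toy sketch: sort by id, keep ids >= origin_from, cap at origin_count.
def origin_range(origins, origin_from=1, origin_count=100):
    eligible = sorted((o for o in origins if o["id"] >= origin_from),
                      key=lambda o: o["id"])
    yield from eligible[:origin_count]

origins = [{"id": 3}, {"id": 1}, {"id": 2}]
assert [o["id"] for o in origin_range(origins, origin_from=2)] == [2, 3]
assert [o["id"] for o in origin_range(origins, origin_count=2)] == [1, 2]
```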

origin_search(url_pattern, offset=0, limit=50, regexp=False, with_visit=False, db=None, cur=None)[source]

Search for origins whose urls contain a provided string pattern or match a provided regular expression. The search is performed in a case-insensitive way.

Parameters:
  • url_pattern (str) – the string pattern to search for in origin urls
  • offset (int) – number of found origins to skip before returning results
  • limit (int) – the maximum number of found origins to return
  • regexp (bool) – if True, consider the provided pattern as a regular expression and return origins whose urls match it
  • with_visit (bool) – if True, filter out origins with no visit
Yields:

dicts containing origin information as returned by swh.storage.storage.Storage.origin_get().
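The search semantics documented above (case-insensitive substring match by default, regular-expression match when regexp=True, with offset/limit pagination) can be sketched as follows; the with_visit filter is omitted here, and the data is a hypothetical in-memory list.

```python
import re

# Toy sketch of the documented origin_search filtering and pagination.
def search_origins(origins, url_pattern, offset=0, limit=50, regexp=False):
    if regexp:
        pat = re.compile(url_pattern, re.IGNORECASE)
        hits = [o for o in origins if pat.search(o["url"])]
    else:
        needle = url_pattern.lower()
        hits = [o for o in origins if needle in o["url"].lower()]
    return hits[offset:offset + limit]

origins = [{"url": "https://GitHub.com/a"}, {"url": "https://gitlab.com/b"}]
assert len(search_origins(origins, "github")) == 1
assert len(search_origins(origins, r"git(hub|lab)", regexp=True)) == 2
```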

origin_count(url_pattern, regexp=False, with_visit=False, db=None, cur=None)[source]

Count origins whose urls contain a provided string pattern or match a provided regular expression. The pattern search in origin urls is performed in a case-insensitive way.

Parameters:
  • url_pattern (str) – the string pattern to search for in origin urls
  • regexp (bool) – if True, consider the provided pattern as a regular expression and return origins whose urls match it
  • with_visit (bool) – if True, filter out origins with no visit
Returns:

The number of origins matching the search criterion.

Return type:

int

person_get(person, db=None, cur=None)[source]

Return the persons identified by their ids.

Parameters:person – an array of ids.
Returns:The array of persons corresponding to the ids.
origin_add(origins, db=None, cur=None)[source]

Add origins to the storage

Parameters:origins

list of dictionaries representing the individual origins, with the following keys:

  • type: the origin type (‘git’, ‘svn’, ‘deb’, …)
  • url (bytes): the url the origin points to
Returns:the given origins, as dicts updated with their id
Return type:list
origin_add_one(origin, db=None, cur=None)[source]

Add origin to the storage

Parameters:origin

dictionary representing the individual origin to add. This dict has the following keys:

  • type (FIXME: enum TBD): the origin type (‘git’, ‘wget’, …)
  • url (bytes): the url the origin points to
Returns:the id of the added origin, or of the identical one that already exists.
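The idempotent insert described above ("the id of the added origin, or of the identical one that already exists") can be sketched with a toy in-memory store; the class and its internals are hypothetical stand-ins for the database-backed implementation.

```python
# Toy sketch of the documented idempotent origin insertion.
class ToyOriginStore:
    def __init__(self):
        self._by_key = {}

    def origin_add_one(self, origin):
        # Origins are identified by their (type, url) pair.
        key = (origin["type"], origin["url"])
        if key not in self._by_key:
            self._by_key[key] = len(self._by_key) + 1
        return self._by_key[key]

store = ToyOriginStore()
first = store.origin_add_one({"type": "git", "url": b"https://example.org/r"})
again = store.origin_add_one({"type": "git", "url": b"https://example.org/r"})
assert first == again == 1
```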
fetch_history_start(origin_id, db=None, cur=None)[source]

Add an entry for origin origin_id in fetch_history.

Returns:the id of the added fetch_history entry

fetch_history_end(fetch_history_id, data, db=None, cur=None)[source]

Close the fetch_history entry with id fetch_history_id, replacing its data with data.

fetch_history_get(fetch_history_id, db=None, cur=None)[source]

Get the fetch_history entry with id fetch_history_id.

stat_counters(db=None, cur=None)[source]

Compute statistics about the number of tuples in various tables.

Returns:a dictionary mapping textual labels (e.g., content) to integer values (e.g., the number of tuples in table content)
Return type:dict
refresh_stat_counters(db=None, cur=None)[source]

Recomputes the statistics for stat_counters.

origin_metadata_add(origin_id, ts, provider, tool, metadata, db=None, cur=None)[source]

Add an origin_metadata for the origin at ts with provenance and metadata.

Parameters:
  • origin_id (int) – the origin’s id for which the metadata is added
  • ts (datetime) – timestamp of the found metadata
  • provider (int) – the provider of metadata (e.g. ‘hal’)
  • tool (int) – tool used to extract metadata
  • metadata (jsonb) – the metadata retrieved at the time and location
Returns:

the origin_metadata unique id

Return type:

id (int)

origin_metadata_get_by(origin_id, provider_type=None, db=None, cur=None)[source]

Retrieve list of all origin_metadata entries for the origin_id

Parameters:
  • origin_id (int) – the unique origin identifier
  • provider_type (str) – (optional) type of provider
Returns:

the origin_metadata dictionary with the keys:

  • origin_id (int): origin’s id
  • discovery_date (datetime): timestamp of discovery
  • tool_id (int): metadata’s extracting tool
  • metadata (jsonb)
  • provider_id (int): metadata’s provider
  • provider_name (str)
  • provider_type (str)
  • provider_url (str)

Return type:

list of dicts

tool_add(tools, db=None, cur=None)[source]

Add new tools to the storage.

Parameters:tools (iterable of dict) –

Tool information to add to storage. Each tool is a dict with the following keys:

  • name (str): name of the tool
  • version (str): version of the tool
  • configuration (dict): configuration of the tool, must be json-encodable
Returns:All the tools inserted in storage (including their internal id). The order of the list is not guaranteed to match the order of the initial list.
Return type:list
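A sketch of the dedup-on-insert behaviour described above, keying tools by (name, version, configuration) in a toy in-memory store; the class and key scheme are hypothetical stand-ins.

```python
import json

# Toy sketch: re-adding a known tool returns it with its existing id.
class ToyToolStore:
    def __init__(self):
        self._tools = {}

    def tool_add(self, tools):
        out = []
        for tool in tools:
            # The configuration must be json-encodable, so a canonical
            # JSON dump can serve as part of the dedup key.
            key = (tool["name"], tool["version"],
                   json.dumps(tool["configuration"], sort_keys=True))
            if key not in self._tools:
                self._tools[key] = dict(tool, id=len(self._tools) + 1)
            out.append(self._tools[key])
        return out

store = ToyToolStore()
added = store.tool_add([{"name": "nomos", "version": "3.1", "configuration": {}}])
assert added[0]["id"] == 1
readded = store.tool_add([{"name": "nomos", "version": "3.1", "configuration": {}}])
assert readded[0]["id"] == 1
```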
tool_get(tool, db=None, cur=None)[source]

Retrieve tool information.

Parameters:tool (dict) – Tool information we want to retrieve from storage. The dicts have the same keys as those used in tool_add().
Returns:The full tool information if it exists (id included), None otherwise.
Return type:dict
metadata_provider_add(provider_name, provider_type, provider_url, metadata, db=None, cur=None)[source]

Add a metadata provider.

Parameters:
  • provider_name (str) – Its name
  • provider_type (str) – Its type (eg. ‘deposit-client’)
  • provider_url (str) – Its URL
  • metadata – JSON-encodable object
Returns:

an identifier of the provider

Return type:

int

metadata_provider_get(provider_id, db=None, cur=None)[source]

Get a metadata provider

Parameters:provider_id – Its identifier, as given by metadata_provider_add.
Returns:
same as metadata_provider_add;
or None if it does not exist.
Return type:dict

metadata_provider_get_by(provider, db=None, cur=None)[source]

Get a metadata provider

Parameters:provider (dict) –

A dictionary with the keys:

  • provider_name: Its name
  • provider_url: Its URL
Returns:
same as metadata_provider_add;
or None if it does not exist.
Return type:dict
diff_directories(from_dir, to_dir, track_renaming=False)[source]

Compute the list of file changes introduced between two arbitrary directories (insertion / deletion / modification / renaming of files).

Parameters:
  • from_dir (bytes) – identifier of the directory to compare from
  • to_dir (bytes) – identifier of the directory to compare to
  • track_renaming (bool) – whether or not to track files renaming
Returns:

A list of dict describing the introduced file changes (see swh.storage.algos.diff.diff_directories() for more details).
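The insertion / deletion / modification classification described above can be sketched over plain dicts mapping file paths to contents (the real algorithm walks directory trees by identifier, and renaming detection is omitted here); the function name is hypothetical.

```python
# Toy sketch of the documented change classification between two trees.
def diff_trees(from_tree, to_tree):
    changes = []
    for path in sorted(set(from_tree) | set(to_tree)):
        if path not in from_tree:
            changes.append({"type": "insert", "path": path})
        elif path not in to_tree:
            changes.append({"type": "delete", "path": path})
        elif from_tree[path] != to_tree[path]:
            changes.append({"type": "modify", "path": path})
        # Unchanged paths produce no entry.
    return changes

old = {"a.txt": b"1", "b.txt": b"2"}
new = {"a.txt": b"1", "b.txt": b"3", "c.txt": b"4"}
assert [c["type"] for c in diff_trees(old, new)] == ["modify", "insert"]
```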

diff_revisions(from_rev, to_rev, track_renaming=False)[source]

Compute the list of file changes introduced between two arbitrary revisions (insertion / deletion / modification / renaming of files).

Parameters:
  • from_rev (bytes) – identifier of the revision to compare from
  • to_rev (bytes) – identifier of the revision to compare to
  • track_renaming (bool) – whether or not to track files renaming
Returns:

A list of dict describing the introduced file changes (see swh.storage.algos.diff.diff_revisions() for more details).

diff_revision(revision, track_renaming=False)[source]

Compute the list of file changes introduced by a specific revision (insertion / deletion / modification / renaming of files) by comparing it against its first parent.

Parameters:
  • revision (bytes) – identifier of the revision from which to compute the list of files changes
  • track_renaming (bool) – whether or not to track files renaming
Returns:

A list of dict describing the introduced file changes (see swh.storage.algos.diff.diff_revision() for more details).

Module contents

exception swh.storage.HashCollision[source]

Bases: Exception


swh.storage.get_storage(cls, args)[source]

Get a storage object of class cls instantiated with arguments args.

Parameters:
  • cls (str) – the storage’s class, either ‘local’, ‘remote’, or ‘memory’
  • args (dict) – dictionary of arguments passed to the storage class constructor
Returns:

an instance of swh.storage.Storage (either local or remote)

Raises:ValueError – if passed an unknown storage class.
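The factory dispatch described above (map the cls string to a constructor, raise ValueError otherwise) can be sketched as follows; the function name, registry, and stand-in backend are hypothetical, not the real swh.storage wiring.

```python
# Toy sketch of the documented storage factory dispatch.
def get_storage_sketch(cls, args, registry):
    if cls not in registry:
        raise ValueError(f"Unknown storage class `{cls}`")
    # Instantiate the chosen backend with the provided arguments.
    return registry[cls](**args)

registry = {"memory": dict}  # stand-in backend constructor
store = get_storage_sketch("memory", {}, registry)
assert store == {}

raised = False
try:
    get_storage_sketch("bogus", {}, registry)
except ValueError:
    raised = True
assert raised
```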