swh.storage.db module

class swh.storage.db.Db(conn: psycopg2.extensions.connection, pool: Optional[psycopg2.pool.AbstractConnectionPool] = None)[source]

Bases: swh.core.db.BaseDb

Proxy to the SWH DB, with wrappers around stored procedures

mktemp_dir_entry(entry_type, cur=None)[source]
register_listener(notify_queue, cur=None)[source]

Register a listener for NOTIFY queue notify_queue


Listen to notifications for timeout seconds

content_update_from_temp(keys_to_update, cur=None)[source]
content_get_metadata_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status']
content_add_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status', 'ctime']
skipped_content_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'reason', 'status', 'origin']
content_get_metadata_from_sha1s(sha1s, cur=None)[source]
content_get_range(start, end, limit=None, cur=None)[source]

Retrieve contents within range [start, end].
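The inclusive bounds can be illustrated with a small in-memory sketch. It is not the method's implementation: `contents_in_range` and the sample keys are hypothetical, and it assumes results are ordered by the key being compared, which the description above does not state.

```python
from typing import List, Optional


def contents_in_range(keys: List[bytes], start: bytes, end: bytes,
                      limit: Optional[int] = None) -> List[bytes]:
    """Return the keys k with start <= k <= end (both bounds inclusive)."""
    # Assumption: results come back sorted by key.
    selected = sorted(k for k in keys if start <= k <= end)
    # limit=None means "no cap on the number of results".
    return selected if limit is None else selected[:limit]


keys = [b"\x01", b"\x05", b"\x09", b"\x0c"]
in_range = contents_in_range(keys, b"\x05", b"\x0c")
first_only = contents_in_range(keys, b"\x05", b"\x0c", limit=1)
```

Both `start` and `end` are part of the result when present, which is what the closed interval notation [start, end] indicates.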

content_hash_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256']
content_missing_from_list(contents, cur=None)[source]
content_missing_per_sha1(sha1s, cur=None)[source]
content_missing_per_sha1_git(contents, cur=None)[source]
skipped_content_missing(contents, cur=None)[source]
snapshot_exists(snapshot_id, cur=None)[source]

Check whether a snapshot with the given id exists

snapshot_missing_from_list(snapshots, cur=None)[source]
snapshot_add(snapshot_id, cur=None)[source]

Add a snapshot from the temporary table

snapshot_count_cols = ['target_type', 'count']
snapshot_count_branches(snapshot_id, cur=None)[source]
snapshot_get_cols = ['snapshot_id', 'name', 'target', 'target_type']
snapshot_get_by_id(snapshot_id, branches_from=b'', branches_count=None, target_types=None, cur=None)[source]
snapshot_get_by_origin_visit(origin_url, visit_id, cur=None)[source]
content_find_cols = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'ctime', 'status']
content_find(sha1=None, sha1_git=None, sha256=None, blake2s256=None, cur=None)[source]

Find a content by any combination of the following checksums: sha1, sha1_git, sha256 or blake2s256.

  • sha1 – sha1 content

  • sha1_git – the sha1 of the content, computed the way git does

  • sha256 – sha256 content

  • blake2s256 – blake2s256 content


The tuple (sha1, sha1_git, sha256, blake2s256) if found, or None otherwise.
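One way to picture a lookup on "a combination of checksums" is a query builder that adds one condition per checksum actually supplied. The sketch below is an illustration only: `build_content_find_query`, the `content` table name, and the choice to raise `ValueError` when no checksum is given are all assumptions, not taken from the module.

```python
from typing import List, Optional, Tuple

HASH_KEYS = ["sha1", "sha1_git", "sha256", "blake2s256"]


def build_content_find_query(
    sha1: Optional[bytes] = None,
    sha1_git: Optional[bytes] = None,
    sha256: Optional[bytes] = None,
    blake2s256: Optional[bytes] = None,
) -> Tuple[str, List[bytes]]:
    """Build a parameterized query filtering on the checksums provided."""
    given = {"sha1": sha1, "sha1_git": sha1_git,
             "sha256": sha256, "blake2s256": blake2s256}
    conditions = []
    params = []
    for key in HASH_KEYS:
        if given[key] is not None:
            # One "column = %s" condition per supplied checksum.
            conditions.append(f"{key} = %s")
            params.append(given[key])
    if not conditions:
        # Assumption: a lookup with no checksum at all is rejected.
        raise ValueError("at least one checksum is required")
    # Assumption: the table is named "content".
    query = ("SELECT " + ", ".join(HASH_KEYS)
             + " FROM content WHERE " + " AND ".join(conditions))
    return query, params


query, params = build_content_find_query(sha1=b"\x01", sha256=b"\x02")
```

The values are passed separately from the query text so that a driver such as psycopg2 can bind them safely.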

directory_missing_from_list(directories, cur=None)[source]
directory_ls_cols = ['dir_id', 'type', 'target', 'name', 'perms', 'status', 'sha1', 'sha1_git', 'sha256', 'length']
directory_walk_one(directory, cur=None)[source]
directory_walk(directory, cur=None)[source]
directory_entry_get_by_path(directory, paths, cur=None)[source]

Retrieve a directory entry by path.

revision_missing_from_list(revisions, cur=None)[source]
revision_add_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic', 'extra_headers']
revision_get_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic', 'extra_headers', 'parents']
origin_visit_add(origin, ts, type, cur=None)[source]

Add a new origin_visit for origin origin at timestamp ts.

  • origin – origin concerned by the visit

  • ts – the date of the visit

  • type – type of loader for the visit


The index of the new visit for that origin

origin_visit_status_cols = ['origin', 'visit', 'date', 'status', 'snapshot', 'metadata']
origin_visit_status_add(visit_status: swh.model.model.OriginVisitStatus, cur=None) → None[source]

Add a new origin visit status

origin_visit_add_with_id(origin_visit: swh.model.model.OriginVisit, cur=None) → None[source]

Insert an origin visit whose id is already set

origin_visit_get_cols = ['origin', 'visit', 'date', 'type', 'status', 'metadata', 'snapshot']
origin_visit_select_cols = ['o.url AS origin', 'ov.visit', 'ov.date', 'ov.type AS type', 'ovs.status', 'ovs.metadata', 'ovs.snapshot']
origin_visit_status_select_cols = ['o.url AS origin', 'ovs.visit', 'ovs.date', 'ovs.status', 'ovs.snapshot', 'ovs.metadata']
origin_visit_status_get_latest(origin_url: str, visit: int, allowed_statuses: Optional[List[str]] = None, require_snapshot: bool = False, cur=None) → Optional[Dict[str, Any]][source]

Given an origin visit id, return its latest origin_visit_status

origin_visit_get_all(origin_id, last_visit=None, order='asc', limit=None, cur=None)[source]

Retrieve all visits for origin with id origin_id.


origin_id – The occurrence’s origin


The visits for that origin

origin_visit_get(origin_id, visit_id, cur=None)[source]

Retrieve information on visit visit_id of origin origin_id.

  • origin_id – the origin concerned

  • visit_id – The visit step for that origin


The origin_visit information

origin_visit_find_by_date(origin, visit_date, cur=None)[source]
origin_visit_exists(origin_id, visit_id, cur=None)[source]

Check whether an origin visit with the given ids exists

origin_visit_get_latest(origin_id: str, type: Optional[str], allowed_statuses: Optional[Iterable[str]], require_snapshot: bool, cur=None)[source]

Retrieve the most recent origin_visit of the given origin, with optional filters.

  • origin_id – the origin concerned

  • type – Optional visit type to filter on

  • allowed_statuses – the visit statuses allowed for the returned visit

  • require_snapshot (bool) – If True, only a visit with a known snapshot will be returned.


The origin_visit information, or None if no visit matches.
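The way the optional filters combine can be sketched over an in-memory list of visits. This is a hedged illustration, not the stored query: `latest_visit` and the sample records are hypothetical, and "most recent" is assumed to mean the greatest `date`.

```python
from datetime import date
from typing import Any, Dict, Iterable, List, Optional


def latest_visit(
    visits: List[Dict[str, Any]],
    type: Optional[str] = None,
    allowed_statuses: Optional[Iterable[str]] = None,
    require_snapshot: bool = False,
) -> Optional[Dict[str, Any]]:
    """Return the most recent visit passing every active filter, or None."""
    statuses = None if allowed_statuses is None else set(allowed_statuses)
    candidates = [
        v for v in visits
        if (type is None or v["type"] == type)
        and (statuses is None or v["status"] in statuses)
        and (not require_snapshot or v["snapshot"] is not None)
    ]
    # Assumption: recency is decided by the visit date.
    return max(candidates, key=lambda v: v["date"], default=None)


visits = [
    {"visit": 1, "date": date(2020, 1, 1), "type": "git",
     "status": "full", "snapshot": b"s1"},
    {"visit": 2, "date": date(2020, 2, 1), "type": "git",
     "status": "partial", "snapshot": None},
    {"visit": 3, "date": date(2020, 3, 1), "type": "hg",
     "status": "full", "snapshot": b"s3"},
]
latest_any = latest_visit(visits)
latest_git_full = latest_visit(visits, type="git",
                               allowed_statuses=["full"],
                               require_snapshot=True)
no_match = latest_visit(visits, type="svn")
```

A filter left at `None` (or `False` for `require_snapshot`) is simply not applied, so the unfiltered call returns the newest visit overall.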

origin_visit_get_random(type, cur=None)[source]

Randomly select one origin visit with status full from the last 3 months

static mangle_query_key(key, main_table)[source]
revision_get_from_list(revisions, cur=None)[source]
revision_log(root_revisions, limit=None, cur=None)[source]
revision_shortlog_cols = ['id', 'parents']
revision_shortlog(root_revisions, limit=None, cur=None)[source]
release_missing_from_list(releases, cur=None)[source]
object_find_by_sha1_git_cols = ['sha1_git', 'type']
object_find_by_sha1_git(ids, cur=None)[source]
origin_add(url, cur=None)[source]

Insert a new origin and return the new identifier.

origin_cols = ['url']
origin_get_by_url(origins, cur=None)[source]

Retrieve origin (type, url) from urls if found.

origin_get_by_sha1(sha1s, cur=None)[source]

Retrieve origin urls from sha1s if found.

origin_id_get_by_url(origins, cur=None)[source]

Retrieve origin ids from urls if found.

origin_get_range_cols = ['id', 'url']
origin_get_range(origin_from=1, origin_count=100, cur=None)[source]

Retrieve origin_count origins whose ids are greater than or equal to origin_from.

Origins are sorted by id before retrieving them.

  • origin_from (int) – the minimum id of origins to retrieve

  • origin_count (int) – the maximum number of origins to retrieve
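The two parameters lend themselves to paging through origins by id. The following sketch works on an in-memory list; `origins_in_range`, the sample URLs, and the "last id plus one" paging convention are assumptions made for illustration.

```python
from typing import Any, Dict, List


def origins_in_range(origins: List[Dict[str, Any]],
                     origin_from: int = 1,
                     origin_count: int = 100) -> List[Dict[str, Any]]:
    """Return at most origin_count origins with id >= origin_from."""
    # Sort by id first, then keep the ids at or above the lower bound.
    ordered = sorted(origins, key=lambda o: o["id"])
    return [o for o in ordered if o["id"] >= origin_from][:origin_count]


origins = [{"id": i, "url": f"https://example.org/repo{i}"}
           for i in (3, 1, 2, 5)]

page = origins_in_range(origins, origin_from=2, origin_count=2)
# Assumed paging convention: resume just after the last id seen.
next_from = page[-1]["id"] + 1
next_page = origins_in_range(origins, origin_from=next_from, origin_count=2)
```

Because ids may have gaps, a page can hold fewer than `origin_count` entries even when more origins exist at higher ids only if the range has been exhausted; here the second page holds the single remaining origin.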

Search for origins whose urls contain a provided string pattern or match a provided regular expression. The search is case-insensitive.

  • url_pattern (str) – the string pattern to search for in origin urls

  • offset (int) – number of found origins to skip before returning results

  • limit (int) – the maximum number of found origins to return

  • regexp (bool) – if True, treat the provided pattern as a regular expression and return origins whose urls match it

  • with_visit (bool) – if True, filter out origins with no visit
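The search semantics described above can be mimicked in plain Python. This is a sketch under stated assumptions: `search_origins`, the `visits` field, and the default values for `offset` and `limit` are hypothetical, and a regular expression is assumed to match anywhere in the url rather than only at its start.

```python
import re
from typing import Any, Dict, List


def search_origins(origins: List[Dict[str, Any]], url_pattern: str,
                   offset: int = 0, limit: int = 50,
                   regexp: bool = False,
                   with_visit: bool = False) -> List[Dict[str, Any]]:
    """Case-insensitive substring or regexp search over origin urls."""
    if regexp:
        compiled = re.compile(url_pattern, re.IGNORECASE)

        def matches(url: str) -> bool:
            # Assumption: unanchored match, like re.search.
            return compiled.search(url) is not None
    else:
        needle = url_pattern.lower()

        def matches(url: str) -> bool:
            return needle in url.lower()

    found = [o for o in origins
             if matches(o["url"]) and (not with_visit or o["visits"] > 0)]
    # Skip `offset` matches, then return at most `limit` of them.
    return found[offset:offset + limit]


origins = [
    {"url": "https://github.com/Foo/bar", "visits": 2},
    {"url": "https://gitlab.com/foo/baz", "visits": 0},
    {"url": "https://example.org/qux", "visits": 1},
]
by_substring = search_origins(origins, "FOO")
by_regexp = search_origins(origins, r"^https://git(hub|lab)\.com/",
                           regexp=True, with_visit=True)
second_page = search_origins(origins, "foo", offset=1, limit=1)
```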

origin_count(url_pattern, regexp=False, with_visit=False, cur=None)[source]

Count origins whose urls contain a provided string pattern or match a provided regular expression. The pattern search in origin urls is case-insensitive.

  • url_pattern (str) – the string pattern to search for in origin urls

  • regexp (bool) – if True, treat the provided pattern as a regular expression and return origins whose urls match it

  • with_visit (bool) – if True, filter out origins with no visit
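On the SQL side, a count like this could plausibly rely on PostgreSQL's case-insensitive operators, `ILIKE` for substring patterns and `~*` for regular expressions. The builder below is only a sketch: `build_origin_count_query`, the `origin` and `origin_visit` table names, and the join column are assumptions, and the module's actual query may differ.

```python
from typing import List, Tuple


def build_origin_count_query(url_pattern: str, regexp: bool = False,
                             with_visit: bool = False) -> Tuple[str, List[str]]:
    """Build a parameterized count query for a case-insensitive url search."""
    if regexp:
        # ~* is PostgreSQL's case-insensitive regular expression match.
        condition = "o.url ~* %s"
        params = [url_pattern]
    else:
        # ILIKE is PostgreSQL's case-insensitive LIKE; the surrounding %
        # wildcards turn it into a substring search. Note that % and _
        # inside url_pattern are not escaped in this sketch.
        condition = "o.url ILIKE %s"
        params = ["%" + url_pattern + "%"]
    # Assumption: table names and the ov.origin = o.id relation.
    query = "SELECT count(*) FROM origin o WHERE " + condition
    if with_visit:
        query += (" AND EXISTS (SELECT 1 FROM origin_visit ov"
                  " WHERE ov.origin = o.id)")
    return query, params


substring_query, substring_params = build_origin_count_query("github")
regexp_query, regexp_params = build_origin_count_query(
    r"^https://", regexp=True, with_visit=True)
```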

release_add_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'name', 'comment', 'synthetic', 'author_fullname', 'author_name', 'author_email']
release_get_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'name', 'comment', 'synthetic', 'author_fullname', 'author_name', 'author_email']
release_get_from_list(releases, cur=None)[source]
object_metadata_get_cols = ['id', 'discovery_date', 'metadata_authority.type', 'metadata_authority.url', 'metadata_fetcher.id', 'metadata_fetcher.name', 'metadata_fetcher.version', 'origin', 'visit', 'snapshot', 'release', 'revision', 'path', 'directory', 'format', 'metadata']

List of columns of the object_metadata, metadata_authority, and metadata_fetcher tables, used when reading object metadata.

object_metadata_add(object_type: str, id: str, context: Dict[str, Union[str, bytes, int]], discovery_date: datetime.datetime, authority_id: int, fetcher_id: int, format: str, metadata: bytes, cur)[source]
object_metadata_get(object_type: str, id: str, authority_id: int, after_time: Optional[datetime.datetime], after_fetcher: Optional[int], limit: int, cur)[source]
metadata_fetcher_cols = ['name', 'version', 'metadata']
metadata_fetcher_add(name: str, version: str, metadata: bytes, cur=None) → None[source]
metadata_fetcher_get(name: str, version: str, cur=None)[source]
metadata_fetcher_get_id(name: str, version: str, cur=None) → Optional[int][source]
metadata_authority_cols = ['type', 'url', 'metadata']
metadata_authority_add(type: str, url: str, metadata: bytes, cur=None) → None[source]
metadata_authority_get(type: str, url: str, cur=None)[source]
metadata_authority_get_id(type: str, url: str, cur=None) → Optional[int][source]