swh.storage.postgresql.db module#
- class swh.storage.postgresql.db.QueryBuilder[source]#
Bases:
object
- add_pagination_clause(pagination_key: List[str], cursor: Any | None, direction: ListOrder | None, limit: int | None, separator: str = 'AND') None [source]#
Create and add a pagination clause to the query
- Parameters:
pagination_key – Pagination key to be used. Use list of strings to support alias fields
cursor – Pagination cursor as a query parameter
direction – Sort order
limit – Limit as a query parameter
separator – Separator to be used as the prefix for the clause
- class swh.storage.postgresql.db.Db(conn: connection, pool: AbstractConnectionPool | None = None)[source]#
Bases:
BaseDb
Proxy to the SWH DB, with wrappers around stored procedures
create a DB proxy
- Parameters:
conn – psycopg2 connection to the SWH DB
pool – psycopg2 pool of connections
- content_get_metadata_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status']#
- content_add_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status', 'ctime']#
- skipped_content_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'reason', 'status', 'origin']#
- content_get_range(start, end, limit=None, cur=None) Iterator[Tuple] [source]#
Retrieve contents within range [start, end].
- content_hash_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256']#
- content_find_cols = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'ctime', 'status']#
- content_find(sha1: bytes | None = None, sha1_git: bytes | None = None, sha256: bytes | None = None, blake2s256: bytes | None = None, cur=None)[source]#
Find the content optionally on a combination of the following checksums sha1, sha1_git, sha256 or blake2s256.
- Parameters:
sha1 – sha1 content
git_sha1 – the sha1 computed a la git sha1 of the content
sha256 – sha256 content
blake2s256 – blake2s256 content
- Returns:
The tuple (sha1, sha1_git, sha256, blake2s256) if found or None.
- skipped_content_find_cols = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status', 'reason', 'ctime']#
- skipped_content_find(sha1: bytes | None = None, sha1_git: bytes | None = None, sha256: bytes | None = None, blake2s256: bytes | None = None, cur=None) List[Tuple[Any]] [source]#
- directory_ls_cols = ['dir_id', 'type', 'target', 'name', 'perms', 'status', 'sha1', 'sha1_git', 'sha256', 'length']#
- directory_entry_get_by_path(directory, paths, cur=None)[source]#
Retrieve a directory entry by path.
- directory_get_entries_cols = ['type', 'target', 'name', 'perms']#
- directory_get_raw_manifest(directory_ids: List[bytes], cur=None) Iterable[Tuple[bytes, bytes]] [source]#
- revision_add_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'committer_date_offset_bytes', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic', 'extra_headers', 'raw_manifest']#
- revision_get_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'committer_date_offset_bytes', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic', 'extra_headers', 'raw_manifest', 'parents']#
- revision_shortlog_cols = ['id', 'parents']#
- extid_cols = ['extid', 'extid_version', 'extid_type', 'target', 'target_type']#
- extid_get_from_extid_list(extid_type: str, ids: List[bytes], version: int | None = None, cur=None)[source]#
- extid_get_from_swhid_list(target_type: str, ids: List[bytes], extid_version: int | None = None, extid_type: str | None = None, cur=None)[source]#
- release_add_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'name', 'comment', 'synthetic', 'raw_manifest', 'author_fullname', 'author_name', 'author_email']#
- release_get_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'name', 'comment', 'synthetic', 'raw_manifest', 'author_fullname', 'author_name', 'author_email']#
- snapshot_count_cols = ['target_type', 'count']#
- snapshot_get_cols = ['snapshot_id', 'name', 'target', 'target_type']#
- snapshot_get_by_id(snapshot_id, branches_from=b'', branches_count=None, target_types=None, branch_name_include_substring=None, branch_name_exclude_prefix=None, cur=None)[source]#
- origin_visit_add(origin, ts, type, cur=None)[source]#
Add a new origin_visit for origin origin at timestamp ts.
- Parameters:
origin – origin concerned by the visit
ts – the date of the visit
type – type of loader for the visit
- Returns:
The new visit index step for that origin
- origin_visit_status_cols = ['origin', 'visit', 'date', 'type', 'status', 'snapshot', 'metadata']#
- origin_visit_status_add(visit_status: OriginVisitStatus, cur=None) None [source]#
Add new origin visit status
- origin_visit_cols = ['origin', 'visit', 'date', 'type']#
- origin_visit_add_with_id(origin_visit: OriginVisit, cur=None) None [source]#
Insert origin visit when id are already set
- origin_visit_get_cols = ['origin', 'visit', 'date', 'type', 'status', 'metadata', 'snapshot']#
- origin_visit_select_cols = ['o.url AS origin', 'ov.visit', 'ov.date', 'ov.type AS type', 'ovs.status', 'ovs.snapshot', 'ovs.metadata']#
- origin_visit_status_select_cols = ['o.url AS origin', 'ovs.visit', 'ovs.date', 'ovs.type AS type', 'ovs.status', 'ovs.snapshot', 'ovs.metadata']#
- origin_visit_status_get_latest(origin_url: str, visit: int, allowed_statuses: List[str] | None = None, require_snapshot: bool = False, cur=None) Dict[str, Any] | None [source]#
Given an origin visit id, return its latest origin_visit_status
- origin_visit_status_get_range(origin: str, visit: int, date_from: datetime | None, order: ListOrder, limit: int, cur=None)[source]#
Retrieve visit_status rows for visit (origin, visit) in a paginated way.
- origin_visit_get_range(origin: str, visit_from: int, order: ListOrder, limit: int, cur=None)[source]#
- origin_visit_status_get_all_in_range(origin: str, allowed_statuses: List[str] | None, require_snapshot: bool, visit_from: int, visit_to: int, cur=None)[source]#
- origin_visit_get(origin_id, visit_id, cur=None)[source]#
Retrieve information on visit visit_id of origin origin_id.
- Parameters:
origin_id – the origin concerned
visit_id – The visit step for that origin
- Returns:
The origin_visit information
- origin_visit_exists(origin_id, visit_id, cur=None)[source]#
Check whether an origin visit with the given ids exists
- origin_visit_get_latest(origin_id: str, type: str | None, allowed_statuses: Iterable[str] | None, require_snapshot: bool, cur=None)[source]#
Retrieve the most recent origin_visit of the given origin, with optional filters.
- Parameters:
origin_id – the origin concerned
type – Optional visit type to filter on
allowed_statuses – the visit statuses allowed for the returned visit
require_snapshot (bool) – If True, only a visit with a known snapshot will be returned.
- Returns:
The origin_visit information, or None if no visit matches.
- origin_visit_get_random(type, cur=None)[source]#
Randomly select one origin visit that was full and in the last 3 months
- origin_cols = ['url']#
- origin_get_range_cols = ['id', 'url']#
- origin_get_range(origin_from: int = 1, origin_count: int = 100, cur=None)[source]#
Retrieve
origin_count
origins whose ids are greater or equal thanorigin_from
.Origins are sorted by id before retrieving them.
- Parameters:
origin_from – the minimum id of origins to retrieve
origin_count – the maximum number of origins to retrieve
- origin_search(url_pattern: str, offset: int = 0, limit: int = 50, regexp: bool = False, with_visit: bool = False, visit_types: List[str] | None = None, cur=None)[source]#
Search for origins whose urls contain a provided string pattern or match a provided regular expression. The search is performed in a case insensitive way.
- Parameters:
url_pattern – the string pattern to search for in origin urls
offset – number of found origins to skip before returning results
limit – the maximum number of found origins to return
regexp – if True, consider the provided pattern as a regular expression and returns origins whose urls match it
with_visit – if True, filter out origins with no visit
- origin_count(url_pattern, regexp=False, with_visit=False, cur=None)[source]#
Count origins whose urls contain a provided string pattern or match a provided regular expression. The pattern search in origin urls is performed in a case insensitive way.
- object_find_by_sha1_git_cols = ['sha1_git', 'type']#
- raw_extrinsic_metadata_get_cols = ['raw_extrinsic_metadata.target', 'raw_extrinsic_metadata.type', 'discovery_date', 'metadata_authority.type', 'metadata_authority.url', 'metadata_fetcher.id', 'metadata_fetcher.name', 'metadata_fetcher.version', 'origin', 'visit', 'snapshot', 'release', 'revision', 'path', 'directory', 'format', 'raw_extrinsic_metadata.metadata']#
List of columns of the raw_extrinsic_metadata, metadata_authority, and metadata_fetcher tables, used when reading object metadata.
- raw_extrinsic_metadata_add(id: bytes, type: str, target: str, discovery_date: datetime, authority_id: int, fetcher_id: int, format: str, metadata: bytes, origin: str | None, visit: int | None, snapshot: str | None, release: str | None, revision: str | None, path: bytes | None, directory: str | None, cur)[source]#
- raw_extrinsic_metadata_get(target: str, authority_id: int, after_time: datetime | None, after_fetcher: int | None, limit: int, cur)[source]#
- metadata_fetcher_cols = ['name', 'version']#
- metadata_authority_cols = ['type', 'url']#
- object_references_create_partition(year: int, week: int, cur=None) Tuple[date, date] [source]#
Create the partition of the object_references table for the given ISO
year
andweek
.
- object_references_drop_partition(year: int, week: int, cur=None) None [source]#
Delete the partition of the object_references table for the given ISO
year
andweek
.
- object_references_list_partitions(cur=None) List[ObjectReferencesPartition] [source]#
List existing partitions of the object_references table, ordered from oldest to the most recent.