swh.storage.api.client module
-
class swh.storage.api.client.RemoteStorage(url, api_exception=None, timeout=None, chunk_size=4096, reraise_exceptions=None, **kwargs)
Bases: swh.core.api.RPCClient
Proxy to a remote storage API
-
api_exception
alias of swh.storage.exc.StorageAPIError
-
backend_class
¶
-
reraise_exceptions
: ClassVar[List[Type[Exception]]] = [<class 'swh.storage.exc.StorageArgumentException'>]¶
-
extra_type_decoders
: Dict[str, Callable] = {'model': <function <lambda>>, 'model_enum': <function _decode_model_enum>, 'storage_enum': <function _decode_storage_enum>, 'swhid': <function parse_swhid>}¶
-
extra_type_encoders
: List[Tuple[type, str, Callable]] = [(<class 'swh.model.model.BaseModel'>, 'model', <function _encode_model_object>), (<class 'swh.model.identifiers.SWHID'>, 'swhid', <class 'str'>), (<enum 'MetadataTargetType'>, 'model_enum', <function _encode_enum>), (<enum 'MetadataAuthorityType'>, 'model_enum', <function _encode_enum>), (<enum 'ListOrder'>, 'storage_enum', <function _encode_enum>)]¶
-
raise_for_status
(response) → None[source]¶ check response HTTP status code and raise an exception if it denotes an error; do nothing otherwise
-
content_add
(content: Iterable[Union[swh.model.model.Content, Dict[str, Any]]])[source]¶
-
check_config
(*, check_write: bool) → bool¶ Check that the storage is configured and ready to go.
-
clear_buffers
(object_types: Sequence[str] = ()) → None¶ For backend storages (pg, storage, in-memory), this is a noop operation. For proxy storages (especially filter, buffer), this is an operation which cleans internal state.
-
content_add_metadata
(content: List[swh.model.model.Content]) → Dict¶ Add content metadata to the storage (like content_add, but without inserting to the objstorage).
- Parameters
content (iterable) –
iterable of dictionaries representing individual pieces of content to add. Each dictionary has the following keys:
length (int): content length (default: -1)
one key for each checksum algorithm in swh.model.hashutil.ALGORITHMS, mapped to the corresponding checksum
status (str): one of visible, hidden, absent
reason (str): if status = absent, the reason why
origin (int): if status = absent, the origin we saw the content in
ctime (datetime): time of insertion in the archive
- Returns
Summary dict with the following keys and associated values:
content:add: New contents added
skipped_content:add: New skipped contents (no data) added
-
content_find
(content: Dict[str, Any]) → List[swh.model.model.Content]¶ Find a content hash in db.
- Parameters
content – a dictionary representing one content hash, mapping checksum algorithm names (see swh.model.hashutil.ALGORITHMS) to checksum values
- Raises
ValueError – in case the key of the dictionary is not sha1, sha1_git nor sha256.
- Returns
an iterable of Content objects matching the search criteria if the content exists. Empty iterable otherwise.
-
content_get
(contents: List[bytes]) → List[Optional[swh.model.model.Content]]¶ Retrieve content metadata in bulk
- Parameters
contents – List of content identifiers
- Returns
List of Content model objects, with None in place of each content that does not exist.
-
content_get_data
(content: bytes) → Optional[bytes]¶ Given a content identifier, returns its associated data if any.
- Parameters
content – sha1 identifier
- Returns
raw content data (bytes)
-
content_get_partition
(partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[swh.model.model.Content, str]¶ Splits contents into nb_partitions, and returns one of these based on partition_id (which must be in [0, nb_partitions-1])
There is no guarantee on how the partitioning is done, or the result order.
- Parameters
partition_id – index of the partition to fetch
nb_partitions – total number of partitions to split into
page_token – opaque token used for pagination.
limit – maximum number of results to return (defaults to 1000)
- Returns
PagedResult of Content model objects within the partition. If next_page_token is None, there is no more data to retrieve.
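The token-based pagination described above can be sketched as a small loop. This is a minimal illustration, not the library's code: `FakePage` and `fake_content_get_partition` are hypothetical stand-ins for `PagedResult` and a real storage backend, and the only behaviour assumed is the one documented here (a page exposes its results and a `next_page_token` that is None on the last page).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for PagedResult: results plus a continuation token."""
    results: List[str] = field(default_factory=list)
    next_page_token: Optional[str] = None


def iter_partition(fetch_page: Callable[..., FakePage], partition_id: int,
                   nb_partitions: int, limit: int = 1000) -> Iterator[str]:
    """Yield every item of one partition, following tokens until None."""
    page_token = None
    while True:
        page = fetch_page(partition_id, nb_partitions,
                          page_token=page_token, limit=limit)
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break


def fake_content_get_partition(partition_id, nb_partitions,
                               page_token=None, limit=1000):
    """Toy backend (assumption): items are split by index modulo nb_partitions."""
    items = [f"content-{i}" for i in range(10)
             if i % nb_partitions == partition_id]
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    token = str(end) if end < len(items) else None
    return FakePage(results=items[start:end], next_page_token=token)


# Fetch partition 1 of 2, two items per page.
collected = list(iter_partition(fake_content_get_partition, 1, 2, limit=2))
```

With a real client, the same loop would presumably be driven by passing the storage object's `content_get_partition` method as `fetch_page`; the token should be treated as opaque, as the parameter description says.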
-
content_get_random
() → bytes¶ Finds a random content id.
- Returns
a sha1_git
-
content_missing
(contents: List[Dict[str, Any]], key_hash: str = 'sha1') → Iterable[bytes]¶ List content missing from storage
- Parameters
contents – iterable of dictionaries whose keys are either ‘length’ or an item of swh.model.hashutil.ALGORITHMS, mapped to the corresponding checksum (or length).
key_hash – name of the column to use as hash id result (default: ‘sha1’)
- Raises
StorageArgumentException – when key_hash is unknown.
TODO – an exception when we get a hash collision.
- Returns
iterable of missing content ids (as per the key_hash column)
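As an illustration of the dictionaries expected here, the sketch below builds one from raw bytes using only the standard library. The helper name `content_hashes` is hypothetical, and the set of keys (sha1, sha1_git, sha256) is an assumption based on the algorithms named in the content_find entry; the authoritative list is swh.model.hashutil.ALGORITHMS. The sha1_git value is computed the way Git hashes a blob (a `blob <length>\0` header followed by the data).

```python
import hashlib
from typing import Dict, Union


def content_hashes(data: bytes) -> Dict[str, Union[int, bytes]]:
    """Build a content dict with a length and one key per checksum algorithm."""
    git_header = b"blob %d\x00" % len(data)  # Git blob header
    return {
        "length": len(data),
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(git_header + data).digest(),
        "sha256": hashlib.sha256(data).digest(),
    }


# The empty content has well-known checksums, which makes it easy to verify.
empty = content_hashes(b"")
```

A list of such dictionaries is the shape described for the `contents` parameter; `key_hash` then selects which of these checksums identifies the missing entries in the result.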
-
content_missing_per_sha1
(contents: List[bytes]) → Iterable[bytes]¶ List content missing from storage based only on sha1.
- Parameters
contents – List of sha1 to check for absence.
- Raises
TODO – an exception when we get a hash collision.
- Returns
Iterable of missing content ids (sha1)
-
content_missing_per_sha1_git
(contents: List[bytes]) → Iterable[bytes]¶ List content missing from storage based only on sha1_git.
- Parameters
contents (List) – a list of content ids (sha1_git)
- Yields
missing contents sha1_git
-
content_update
(contents: List[Dict[str, Any]], keys: List[str] = []) → None¶ Update content blobs to the storage. Does nothing for unknown contents or skipped ones.
- Parameters
contents –
iterable of dictionaries representing individual pieces of content to update. Each dictionary has the following keys:
data (bytes): the actual content
length (int): content length (default: -1)
one key for each checksum algorithm in swh.model.hashutil.ALGORITHMS, mapped to the corresponding checksum
status (str): one of visible, hidden, absent
keys (list) – List of keys (str) whose values need an update, e.g., new hash column
-
directory_add
(directories: List[swh.model.model.Directory]) → Dict¶ Add directories to the storage
- Parameters
directories (iterable) –
iterable of dictionaries representing the individual directories to add. Each dict has the following keys:
id (sha1_git): the id of the directory to add
entries (list): list of dicts for each entry in the directory. Each dict has the following keys:
name (bytes)
type (one of ‘file’, ‘dir’, ‘rev’): type of the directory entry (file, directory, revision)
target (sha1_git): id of the object pointed at by the directory entry
perms (int): entry permissions
- Returns
directory:add: Number of directories actually added
- Return type
Summary dict of keys with associated count as values
-
directory_entry_get_by_path
(directory: bytes, paths: List[bytes]) → Optional[Dict[str, Any]]¶ Get the directory entry (either file or dir) from directory with path.
- Parameters
directory – directory id
paths – path to lookup from the top level directory. From left (top) to right (bottom).
- Returns
The corresponding directory entry as dict if found, None otherwise.
-
directory_get_random
() → bytes¶ Finds a random directory id.
- Returns
a sha1_git
-
directory_ls
(directory: bytes, recursive: bool = False) → Iterable[Dict[str, Any]]¶ List entries for one directory.
If recursive=True, names in the path of a dir/file not at the root are concatenated with a slash (/).
- Parameters
directory – the directory to list entries from.
recursive – if True, list entries recursively from this directory.
- Yields
directory entries for such directory.
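The effect of recursive=True on entry names can be sketched with a toy in-memory directory table. Everything here is hypothetical (the `DIRECTORIES` layout, the ids, and `toy_directory_ls` itself); it only illustrates the documented rule that names below the root are joined with a slash.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory layout: directory id -> list of entries.
DIRECTORIES: Dict[bytes, List[dict]] = {
    b"root": [
        {"name": b"README", "type": "file", "target": b"c1"},
        {"name": b"src", "type": "dir", "target": b"d1"},
    ],
    b"d1": [{"name": b"main.py", "type": "file", "target": b"c2"}],
}


def toy_directory_ls(directory: bytes, recursive: bool = False,
                     prefix: bytes = b"") -> Iterator[dict]:
    """Yield entries; when recursing, prefix names with the parent path."""
    for entry in DIRECTORIES.get(directory, []):
        listed = dict(entry, name=prefix + entry["name"])
        yield listed
        if recursive and entry["type"] == "dir":
            # Descend into the sub-directory, joining path parts with b"/".
            yield from toy_directory_ls(
                entry["target"], True, prefix=listed["name"] + b"/")


flat_names = [e["name"] for e in toy_directory_ls(b"root")]
deep_names = [e["name"] for e in toy_directory_ls(b"root", recursive=True)]
```

The order in which a real backend yields entries is not specified in this entry, so callers should not rely on the depth-first order used in the sketch.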
-
directory_missing
(directories: List[bytes]) → Iterable[bytes]¶ List directories missing from storage.
- Parameters
directories – list of directory ids
- Yields
missing directory ids
-
flush
(object_types: Sequence[str] = ()) → Dict[str, int]¶ For backend storages (pg, storage, in-memory), this is expected to be a noop operation. For proxy storages (especially buffer), this is expected to trigger actual writes to the backend.
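The difference between flush and clear_buffers on a buffering proxy can be sketched as follows. `ToyBufferProxy` is a hypothetical class written for this illustration, not the swh.storage buffer proxy; the shape of the summary dict it returns is also an assumption, modelled on the `<type>:add` keys used elsewhere on this page.

```python
from typing import Dict, List, Sequence


class ToyBufferProxy:
    """Hypothetical buffering proxy: keeps objects in memory until flushed."""

    def __init__(self, backend: Dict[str, List[str]]):
        self.backend = backend
        self.buffers: Dict[str, List[str]] = {"content": [], "directory": []}

    def add(self, object_type: str, obj: str) -> None:
        self.buffers[object_type].append(obj)

    def flush(self, object_types: Sequence[str] = ()) -> Dict[str, int]:
        """Write buffered objects to the backend and report how many."""
        summary = {}
        for object_type in object_types or list(self.buffers):
            pending = self.buffers[object_type]
            self.backend.setdefault(object_type, []).extend(pending)
            summary[f"{object_type}:add"] = len(pending)
            self.buffers[object_type] = []
        return summary

    def clear_buffers(self, object_types: Sequence[str] = ()) -> None:
        """Drop buffered objects without writing them anywhere."""
        for object_type in object_types or list(self.buffers):
            self.buffers[object_type] = []


backend: Dict[str, List[str]] = {}
proxy = ToyBufferProxy(backend)
proxy.add("content", "c1")
proxy.add("directory", "d1")
proxy.clear_buffers(["directory"])  # d1 is discarded, never written
summary = proxy.flush()             # c1 reaches the backend
```

On a backend storage both calls would be no-ops, which is why client code can call them unconditionally.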
-
Add new metadata authorities to the storage.
Their type and url together are unique identifiers of this authority; and metadata is an arbitrary dict of JSONable data with information about this authority, which must not be None (but may be empty).
- Parameters
authorities – iterable of MetadataAuthority to be inserted
-
Retrieve information about an authority
- Parameters
type – one of “deposit_client”, “forge”, or “registry”
url – unique URI identifying the authority
- Returns
a MetadataAuthority object (with a non-None metadata field) if it is known, else None.
-
metadata_fetcher_add
(fetchers: List[swh.model.model.MetadataFetcher]) → None¶ Add new metadata fetchers to the storage.
Their name and version together are unique identifiers of this fetcher; and metadata is an arbitrary dict of JSONable data with information about this fetcher, which must not be None (but may be empty).
- Parameters
fetchers – iterable of MetadataFetcher to be inserted
-
metadata_fetcher_get
(name: str, version: str) → Optional[swh.model.model.MetadataFetcher]¶ Retrieve information about a fetcher
- Parameters
name – the name of the fetcher
version – version of the fetcher
- Returns
a MetadataFetcher object (with a non-None metadata field) if it is known, else None.
-
object_find_by_sha1_git
(ids: List[bytes]) → Dict[bytes, List[Dict]]¶ Return the objects found with the given ids.
- Parameters
ids – a generator of sha1_gits
- Returns
A dict from id to the list of objects found for that id. Each object found is itself a dict with keys:
sha1_git: the input id
type: the type of object found
-
origin_add
(origins: List[swh.model.model.Origin]) → Dict[str, int]¶ Add origins to the storage
- Parameters
origins –
list of dictionaries representing the individual origins, with the following keys:
type: the origin type (‘git’, ‘svn’, ‘deb’, …)
url (bytes): the url the origin points to
- Returns
Summary dict of keys with associated count as values
origin:add: Count of objects actually stored in db
-
origin_count
(url_pattern: str, regexp: bool = False, with_visit: bool = False) → int¶ Count origins whose urls contain a provided string pattern or match a provided regular expression. The pattern search in origin urls is performed in a case insensitive way.
- Parameters
url_pattern (str) – the string pattern to search for in origin urls
regexp (bool) – if True, consider the provided pattern as a regular expression and return origins whose urls match it
with_visit (bool) – if True, filter out origins with no visit
- Returns
The number of origins matching the search criterion.
- Return type
int
-
origin_get
(origins: List[str]) → Iterable[Optional[swh.model.model.Origin]]¶ Return origins.
- Parameters
origins – a list of urls to find
- Returns
the list of associated existing origin model objects. The unknown origins will be returned as None at the same index as the input.
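Because unknown origins come back as None at the same index as the input, results can be paired with the requested URLs by position. The helper below is a hypothetical convenience for illustration; the result list is hard-coded rather than fetched, and plain strings stand in for Origin objects.

```python
from typing import Dict, List, Optional, Tuple


def split_known_unknown(
    urls: List[str], results: List[Optional[str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Pair each requested URL with its result, separating the misses."""
    known = {url: origin for url, origin in zip(urls, results)
             if origin is not None}
    unknown = [url for url, origin in zip(urls, results) if origin is None]
    return known, unknown


requested = ["https://example.org/a", "https://example.org/b"]
# Stand-in for the list an origin_get call would return for `requested`.
returned = ["origin-a", None]
known, unknown = split_known_unknown(requested, returned)
```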
-
origin_get_by_sha1
(sha1s: List[bytes]) → List[Optional[Dict[str, Any]]]¶ Return origins, identified by the sha1 of their URLs.
- Parameters
sha1s – a list of sha1s
- Returns
List of origin dicts, one per input sha1, with None in place of each sha1 that matches no known origin URL.
-
origin_list
(page_token: Optional[str] = None, limit: int = 100) → swh.core.api.classes.PagedResult[swh.model.model.Origin, str]¶ Returns the list of origins
- Parameters
page_token – opaque token used for pagination.
limit – the maximum number of results to return
- Returns
Page of Origin data model objects. If next_page_token is None, there is no more data to retrieve.
-
origin_search
(url_pattern: str, page_token: Optional[str] = None, limit: int = 50, regexp: bool = False, with_visit: bool = False) → swh.core.api.classes.PagedResult[swh.model.model.Origin, str]¶ Search for origins whose urls contain a provided string pattern or match a provided regular expression. The search is performed in a case insensitive way.
- Parameters
url_pattern – the string pattern to search for in origin urls
page_token – opaque token used for pagination
limit – the maximum number of found origins to return
regexp – if True, consider the provided pattern as a regular expression and return origins whose urls match it
with_visit – if True, filter out origins with no visit
- Returns
PagedResult of Origin
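The two matching modes can be sketched in a few lines. `toy_origin_match` is a hypothetical function, and using `re.search` (match anywhere in the URL) for the regexp mode is an assumption; this entry only states that the search is case insensitive and that the pattern is either a substring or a regular expression.

```python
import re


def toy_origin_match(url: str, url_pattern: str, regexp: bool = False) -> bool:
    """Case-insensitive substring test, or regular-expression test if asked."""
    if regexp:
        return re.search(url_pattern, url, flags=re.IGNORECASE) is not None
    return url_pattern.lower() in url.lower()


urls = ["https://GitHub.com/user/repo", "https://gitlab.com/user/repo"]
by_substring = [u for u in urls if toy_origin_match(u, "github")]
by_regexp = [u for u in urls if toy_origin_match(u, r"git(hub|lab)\.com", regexp=True)]
```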
-
origin_visit_add
(visits: List[swh.model.model.OriginVisit]) → Iterable[swh.model.model.OriginVisit]¶ Add visits to storage. If the visits have no id, they will be created and assigned one. The returned visits have their visit id set.
- Parameters
visits – List of OriginVisit objects to add
- Raises
StorageArgumentException – if some origin visits reference unknown origins
- Returns
List[OriginVisit] stored
-
origin_visit_find_by_date
(origin: str, visit_date: datetime.datetime) → Optional[swh.model.model.OriginVisit]¶ Retrieves the origin visit whose date is closest to the provided timestamp. In case of a tie, the visit with largest id is selected.
- Parameters
origin – origin (URL)
visit_date – expected visit date
- Returns
A visit if found, None otherwise
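The selection rule (closest date, ties broken by the largest visit id) can be expressed as a single `min` over a composite key. `ToyVisit` and `closest_visit` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ToyVisit:
    """Hypothetical stand-in for OriginVisit with just an id and a date."""
    visit: int
    date: datetime


def closest_visit(visits: List[ToyVisit],
                  visit_date: datetime) -> Optional[ToyVisit]:
    """Pick the visit nearest to visit_date; on a tie, prefer the largest id."""
    if not visits:
        return None
    # Sort key: smallest time distance first, then highest visit id.
    return min(visits, key=lambda v: (abs(v.date - visit_date), -v.visit))


utc = timezone.utc
visits = [
    ToyVisit(visit=1, date=datetime(2021, 1, 1, tzinfo=utc)),
    ToyVisit(visit=2, date=datetime(2021, 1, 3, tzinfo=utc)),
]
# 2021-01-02 is exactly one day from both visits, so the tie-break applies.
tied = closest_visit(visits, datetime(2021, 1, 2, tzinfo=utc))
```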
-
origin_visit_get
(origin: str, page_token: Optional[str] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) → swh.core.api.classes.PagedResult[swh.model.model.OriginVisit, str]¶ Retrieve page of OriginVisit information.
- Parameters
origin – The visited origin
page_token – opaque string used to get the next results of a search
order – Order on visit id fields to list origin visits (default to asc)
limit – Number of visits to return
- Raises
StorageArgumentException – if the order is wrong or the page_token is mistyped.
- Returns
Page of OriginVisit data model objects. If next_page_token is None, there is no more data to retrieve.
-
origin_visit_get_by
(origin: str, visit: int) → Optional[swh.model.model.OriginVisit]¶ Retrieve origin visit’s information.
- Parameters
origin – origin (URL)
visit – visit id
- Returns
The information on that particular OriginVisit or None if it does not exist
-
origin_visit_get_latest
(origin: str, type: Optional[str] = None, allowed_statuses: Optional[List[str]] = None, require_snapshot: bool = False) → Optional[swh.model.model.OriginVisit]¶ Get the latest origin visit for the given origin, optionally looking only for those with one of the given allowed_statuses or for those with a snapshot.
- Parameters
origin – origin URL
type – Optional visit type to filter on (e.g. git, tar, dsc, svn, hg, npm, pypi, ...)
allowed_statuses – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
require_snapshot – If True, only a visit with a snapshot will be returned.
- Raises
StorageArgumentException – if values for the allowed_statuses parameter are unknown
- Returns
OriginVisit matching the criteria if found, None otherwise. Note that as OriginVisit no longer holds a reference to the visit status or snapshot, you may want to use origin_visit_status_get_latest for that information.
-
origin_visit_status_add
(visit_statuses: List[swh.model.model.OriginVisitStatus]) → None¶ Add origin visit statuses.
If there is already a status for the same origin and visit id at the same date, the new one will be either dropped or will replace the existing one (it is unspecified which one of these two behaviors happens).
- Parameters
visit_statuses – origin visit statuses to add
- Raises
StorageArgumentException – if the origin of the visit status is unknown
-
origin_visit_status_get
(origin: str, visit: int, page_token: Optional[str] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) → swh.core.api.classes.PagedResult[swh.model.model.OriginVisitStatus, str]¶ Retrieve page of OriginVisitStatus information.
- Parameters
origin – The visited origin
visit – The visit identifier
page_token – opaque string used to get the next results of a search
order – Order on visit status objects to list (default to asc)
limit – Number of visit statuses to return
- Returns
Page of OriginVisitStatus data model objects. If next_page_token is None, there is no more data to retrieve.
-
origin_visit_status_get_latest
(origin_url: str, visit: int, allowed_statuses: Optional[List[str]] = None, require_snapshot: bool = False) → Optional[swh.model.model.OriginVisitStatus]¶ Get the latest origin visit status for the given origin visit, optionally looking only for those with one of the given allowed_statuses or with a snapshot.
- Parameters
origin_url – origin URL
allowed_statuses – list of visit statuses considered to find the latest visit. Possible values are {created, ongoing, partial, full}. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
require_snapshot – If True, only a visit with a snapshot will be returned.
- Raises
StorageArgumentException – if values for the allowed_statuses parameter are unknown
- Returns
The OriginVisitStatus matching the criteria
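The filtering implied by allowed_statuses and require_snapshot can be sketched on a list of toy status records. `ToyVisitStatus` and `toy_latest_status` are hypothetical, and `ValueError` stands in for StorageArgumentException so that the example needs no third-party import; the set of known statuses is the one listed in the parameter description.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

KNOWN_STATUSES = {"created", "ongoing", "partial", "full"}


@dataclass
class ToyVisitStatus:
    """Hypothetical stand-in for OriginVisitStatus."""
    date: datetime
    status: str
    snapshot: Optional[bytes] = None


def toy_latest_status(statuses: List[ToyVisitStatus],
                      allowed_statuses: Optional[List[str]] = None,
                      require_snapshot: bool = False) -> Optional[ToyVisitStatus]:
    """Return the most recent status that passes both optional filters."""
    if allowed_statuses is not None:
        unknown = set(allowed_statuses) - KNOWN_STATUSES
        if unknown:
            # Stand-in for StorageArgumentException.
            raise ValueError(f"Unknown allowed statuses: {sorted(unknown)}")
    candidates = [
        s for s in statuses
        if (allowed_statuses is None or s.status in allowed_statuses)
        and (not require_snapshot or s.snapshot is not None)
    ]
    return max(candidates, key=lambda s: s.date, default=None)


utc = timezone.utc
history = [
    ToyVisitStatus(datetime(2021, 1, 1, tzinfo=utc), "created"),
    ToyVisitStatus(datetime(2021, 1, 2, tzinfo=utc), "full", b"snap1"),
    ToyVisitStatus(datetime(2021, 1, 3, tzinfo=utc), "partial"),
]
latest_any = toy_latest_status(history)
latest_full = toy_latest_status(history, allowed_statuses=["full"])
latest_with_snapshot = toy_latest_status(history, require_snapshot=True)
```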
-
origin_visit_status_get_random
(type: str) → Optional[Tuple[swh.model.model.OriginVisit, swh.model.model.OriginVisitStatus]]¶ Randomly select one successful origin visit with <type> made in the last 3 months.
- Returns
One random tuple of (OriginVisit, OriginVisitStatus) matching the selection criteria
-
raw_extrinsic_metadata_add
(metadata: List[swh.model.model.RawExtrinsicMetadata]) → None¶ Add extrinsic metadata on objects (contents, directories, …).
The authority and fetcher must be known to the storage before using this endpoint.
If there is already metadata for the same object, authority, fetcher, and at the same date; the new one will be either dropped or will replace the existing one (it is unspecified which one of these two behaviors happens).
- Parameters
metadata – iterable of RawExtrinsicMetadata objects to be inserted.
-
raw_extrinsic_metadata_get
(type: swh.model.model.MetadataTargetType, target: Union[str, swh.model.identifiers.SWHID], authority: swh.model.model.MetadataAuthority, after: Optional[datetime.datetime] = None, page_token: Optional[bytes] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[swh.model.model.RawExtrinsicMetadata, str]¶ Retrieve list of all raw_extrinsic_metadata entries for the id
- Parameters
type – one of the values of swh.model.model.MetadataTargetType
target – an URL if type is ‘origin’, else a core SWHID
authority – a dict containing keys type and url.
after – minimum discovery_date for a result to be returned
page_token – opaque token, used to get the next page of results
limit – maximum number of results to be returned
- Returns
PagedResult of RawExtrinsicMetadata
-
release_add
(releases: List[swh.model.model.Release]) → Dict¶ Add releases to the storage
- Parameters
releases (List[dict]) –
iterable of dictionaries representing the individual releases to add. Each dict has the following keys:
id (sha1_git): id of the release to add
revision (sha1_git): id of the revision the release points to
date (dict): the date the release was made
name (bytes): the name of the release
comment (bytes): the comment associated with the release
author (Dict[str, bytes]): dictionary with keys: name, fullname, email
the date dictionary has the form defined in swh.model.
- Returns
Summary dict of keys with associated count as values
release:add: New objects actually stored in db
-
release_get
(releases: List[bytes]) → List[Optional[swh.model.model.Release]]¶ Given a list of sha1, return the releases’ information
- Parameters
releases – list of sha1s
- Returns
List of releases matching the identifiers or None if the release does not exist.
-
release_get_random
() → bytes¶ Finds a random release id.
- Returns
a sha1_git
-
release_missing
(releases: List[bytes]) → Iterable[bytes]¶ List missing release ids from storage
- Parameters
releases – release ids
- Yields
a list of missing release ids
-
revision_add
(revisions: List[swh.model.model.Revision]) → Dict¶ Add revisions to the storage
- Parameters
revisions (List[dict]) –
iterable of dictionaries representing the individual revisions to add. Each dict has the following keys:
id (sha1_git): id of the revision to add
date (dict): date the revision was written
committer_date (dict): date the revision got added to the origin
type (one of ‘git’, ‘tar’): type of the revision added
directory (sha1_git): the directory the revision points at
message (bytes): the message associated with the revision
author (Dict[str, bytes]): dictionary with keys: name, fullname, email
committer (Dict[str, bytes]): dictionary with keys: name, fullname, email
metadata (jsonb): extra information as dictionary
synthetic (bool): revision’s nature (tarball, directory creates synthetic revision)
parents (list[sha1_git]): the parents of this revision
date dictionaries have the form defined in swh.model.
- Returns
Summary dict of keys with associated count as values
revision:add: New objects actually stored in db
-
revision_get
(revision_ids: List[bytes]) → List[Optional[swh.model.model.Revision]]¶ Get revisions from storage
- Parameters
revision_ids – revision ids
- Returns
list of Revision objects, with None in place of each revision that does not exist
-
revision_get_random
() → bytes¶ Finds a random revision id.
- Returns
a sha1_git
-
revision_log
(revisions: List[bytes], limit: Optional[int] = None) → Iterable[Optional[Dict[str, Any]]]¶ Fetch revision entry from the given root revisions.
- Parameters
revisions – array of root revisions to lookup
limit – limitation on the output result. Default to None.
- Yields
revision entries log from the given root revisions
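Walking history from a set of root revisions can be sketched as a breadth-first traversal over parent links. This is an illustration only: `PARENTS`, `toy_revision_log`, the traversal order, and the reading of `limit` as a cap on the number of yielded entries are all assumptions, since this entry does not specify them.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical history: revision id -> tuple of parent ids (c -> b -> a).
PARENTS: Dict[bytes, Tuple[bytes, ...]] = {
    b"c": (b"b",),
    b"b": (b"a",),
    b"a": (),
}


def toy_revision_log(revisions: List[bytes],
                     limit: Optional[int] = None) -> Iterator[dict]:
    """Yield one entry per reachable revision, roots first, each at most once."""
    seen = set()
    queue = deque(revisions)
    emitted = 0
    while queue and (limit is None or emitted < limit):
        rev = queue.popleft()
        if rev in seen or rev not in PARENTS:
            continue  # already visited, or unknown to this toy store
        seen.add(rev)
        yield {"id": rev, "parents": PARENTS[rev]}
        emitted += 1
        queue.extend(PARENTS[rev])


full_log = [entry["id"] for entry in toy_revision_log([b"c"])]
short_log = [entry["id"] for entry in toy_revision_log([b"c"], limit=2)]
```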
-
revision_missing
(revisions: List[bytes]) → Iterable[bytes]¶ List revisions missing from storage
- Parameters
revisions – revision ids
- Yields
missing revision ids
-
revision_shortlog
(revisions: List[bytes], limit: Optional[int] = None) → Iterable[Optional[Tuple[bytes, Tuple[bytes, …]]]]¶ Fetch the shortlog for the given revisions
- Parameters
revisions – list of root revisions to lookup
limit – depth limitation for the output
- Yields
a list of (id, parents) tuples
-
skipped_content_add
(content: List[swh.model.model.SkippedContent]) → Dict¶ Add contents to the skipped_content list, which contains (partial) information about content missing from the archive.
- Parameters
content (iterable) –
iterable of dictionaries representing individual pieces of content to add. Each dictionary has the following keys:
length (Optional[int]): content length (default: -1)
one key for each checksum algorithm in swh.model.hashutil.ALGORITHMS, mapped to the corresponding checksum; each is optional
status (str): must be “absent”
reason (str): the reason why the content is absent
origin (int): if status = absent, the origin we saw the content in
- Raises
HashCollision – in case of collision
Any other exception raised by the backend
In case of errors, some content may have been stored in the DB and in the objstorage. Since additions to both are idempotent, that should not be a problem.
- Returns
Summary dict with the following key and associated values:
skipped_content:add: New skipped contents (no data) added
-
skipped_content_missing
(contents: List[Dict[str, Any]]) → Iterable[Dict[str, Any]]¶ List skipped contents missing from storage.
- Parameters
contents – iterable of dictionaries containing the data for each checksum algorithm.
- Returns
Iterable of missing skipped contents as dict
-
snapshot_add
(snapshots: List[swh.model.model.Snapshot]) → Dict¶ Add snapshots to the storage.
- Parameters
snapshots ([dict]) –
the snapshots to add, containing the following keys:
id (bytes): id of the snapshot
branches (dict): branches the snapshot contains, mapping the branch name (bytes) to the branch target, itself a dict (or None if the branch points to an unknown object)
target_type (str): one of content, directory, revision, release, snapshot, alias
target (bytes): identifier of the target (currently a sha1_git for all object kinds, or the name of the target branch for aliases)
- Raises
ValueError – if the origin or visit id does not exist.
- Returns
Summary dict of keys with associated count as values
snapshot:add: Count of objects actually stored in db
-
snapshot_count_branches
(snapshot_id: bytes) → Optional[Dict[Optional[str], int]]¶ Count the number of branches in the snapshot with the given id
- Parameters
snapshot_id – snapshot identifier
- Returns
A dict whose keys are the target types of branches and values their corresponding amount
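The shape of the returned dict can be illustrated by counting target types over a branches mapping like the one described for snapshot_add. The helper is hypothetical, and counting branches whose target is None under a None key is an assumption suggested by the Optional[str] key type in the signature.

```python
from collections import Counter
from typing import Dict, Optional


def toy_count_branches(
    branches: Dict[bytes, Optional[dict]]
) -> Dict[Optional[str], int]:
    """Count branches per target type; dangling branches are keyed by None."""
    return dict(Counter(
        None if target is None else target["target_type"]
        for target in branches.values()
    ))


branches = {
    b"HEAD": {"target_type": "alias", "target": b"refs/heads/main"},
    b"refs/heads/main": {"target_type": "revision", "target": b"r1"},
    b"refs/heads/dev": {"target_type": "revision", "target": b"r2"},
    b"refs/tags/broken": None,  # branch pointing to an unknown object
}
counts = toy_count_branches(branches)
```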
-
snapshot_get
(snapshot_id: bytes) → Optional[Dict[str, Any]]¶ Get the content, possibly partial, of a snapshot with the given id
The branches of the snapshot are iterated in the lexicographical order of their names.
Warning
At most 1000 branches contained in the snapshot will be returned for performance reasons. In order to browse the whole set of branches, the method snapshot_get_branches() should be used instead.
- Parameters
snapshot_id – snapshot identifier
- Returns
- a dict with three keys:
id: identifier of the snapshot
branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
next_branch: the name of the first branch not returned or None if the snapshot has fewer than 1000 branches.
- Return type
dict
-
snapshot_get_branches
(snapshot_id: bytes, branches_from: bytes = b'', branches_count: int = 1000, target_types: Optional[List[str]] = None) → Optional[swh.storage.interface.PartialBranches]¶ Get the content, possibly partial, of a snapshot with the given id
The branches of the snapshot are iterated in the lexicographical order of their names.
- Parameters
snapshot_id – identifier of the snapshot
branches_from – optional parameter used to skip branches whose names sort before it
branches_count – optional parameter used to limit the number of returned branches
target_types – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)
- Returns
- None if the snapshot does not exist;
- a dict with three keys otherwise:
id: identifier of the snapshot
branches: a dict of branches contained in the snapshot whose keys are the branches’ names.
next_branch: the name of the first branch not returned or None if the snapshot has fewer than branches_count branches starting from branches_from (included).
- Return type
dict
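Browsing all branches of a large snapshot amounts to feeding each page's next_branch back in as branches_from. The loop below is a sketch under that reading; `TOY_BRANCHES` and `fake_get_branches` are hypothetical stand-ins for a real snapshot and backend, written to mimic the documented return shape (id, branches, next_branch).

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

# Hypothetical snapshot content: branch name -> target dict.
TOY_BRANCHES: Dict[bytes, dict] = {
    name: {"target_type": "revision", "target": b"r-" + name}
    for name in (b"a", b"b", b"c", b"d", b"e")
}


def fake_get_branches(snapshot_id: bytes, branches_from: bytes = b"",
                      branches_count: int = 1000) -> Optional[dict]:
    """Toy backend: pages over TOY_BRANCHES in lexicographic name order."""
    if snapshot_id != b"snap":
        return None  # unknown snapshot
    names = sorted(n for n in TOY_BRANCHES if n >= branches_from)
    page, rest = names[:branches_count], names[branches_count:]
    return {
        "id": snapshot_id,
        "branches": {n: TOY_BRANCHES[n] for n in page},
        "next_branch": rest[0] if rest else None,
    }


def iter_branches(get_branches: Callable[..., Optional[dict]],
                  snapshot_id: bytes,
                  page_size: int = 1000) -> Iterator[Tuple[bytes, dict]]:
    """Yield (name, target) pairs page by page until next_branch is None."""
    branches_from = b""
    while True:
        part = get_branches(snapshot_id, branches_from=branches_from,
                            branches_count=page_size)
        if part is None:
            return
        yield from part["branches"].items()
        if part["next_branch"] is None:
            return
        branches_from = part["next_branch"]


all_names = [name for name, _ in iter_branches(fake_get_branches, b"snap", page_size=2)]
```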
-
snapshot_get_random
() → bytes¶ Finds a random snapshot id.
- Returns
a sha1_git
-
snapshot_missing
(snapshots: List[bytes]) → Iterable[bytes]¶ List snapshots missing from storage
- Parameters
snapshots – snapshot ids
- Yields
missing snapshot ids
-