swh.web.client.client module
Python client for the Software Heritage Web API
Light wrapper around requests for the archive API, taking care of data conversions and pagination.
from swh.web.client.client import WebAPIClient
cli = WebAPIClient()
# retrieve any archived object via its SWHID
cli.get('swh:1:rev:aafb16d69fd30ff58afdd69036a26047f3aebdc6')
# same, but for specific object types
cli.revision('swh:1:rev:aafb16d69fd30ff58afdd69036a26047f3aebdc6')
# get() always retrieves entire objects, following pagination
# WARNING: this might *not* be what you want for large objects
cli.get('swh:1:snp:6a3a2cf0b2b90ce7ae1cf0a221ed68035b686f5a')
# type-specific methods support explicit iteration through pages
next(cli.snapshot('swh:1:snp:cabcc7d7bf639bbe1cc3b41989e1806618dd5764'))
- swh.web.client.client.typify_json(data: Any, obj_type: str) → Any
Type API responses using pythonic types where appropriate
The following conversions are performed:
identifiers are converted from strings to SWHID instances
timestamps are converted from strings to datetime.datetime objects
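To make the two conversions concrete, here is a standard-library-only sketch of the idea. It is not the library's implementation: `SketchSWHID`, `parse_swhid`, `typify_sketch`, the sample record, and the assumption that timestamps sit under a `date` key are all hypothetical, and the real function returns proper SWHID instances rather than a named tuple.

```python
from datetime import datetime
from typing import Any, Dict, NamedTuple


class SketchSWHID(NamedTuple):
    """Hypothetical stand-in for a parsed SWHID (not the library's class)."""
    scheme_version: int
    object_type: str
    object_id: str


def parse_swhid(text: str) -> SketchSWHID:
    # Core SWHIDs look like "swh:1:rev:<40 hex digits>"
    scheme, version, object_type, object_id = text.split(":")
    if scheme != "swh":
        raise ValueError(f"not a SWHID: {text!r}")
    return SketchSWHID(int(version), object_type, object_id)


def typify_sketch(data: Dict[str, Any]) -> Dict[str, Any]:
    """Convert string values that look like SWHIDs or ISO 8601 timestamps."""
    typed: Dict[str, Any] = {}
    for key, value in data.items():
        if isinstance(value, str) and value.startswith("swh:1:"):
            typed[key] = parse_swhid(value)
        elif key == "date" and isinstance(value, str):
            # Assumption: timestamps live under a "date" key in ISO 8601 form
            typed[key] = datetime.fromisoformat(value)
        else:
            typed[key] = value
    return typed


# Made-up record, for illustration only
raw = {
    "id": "swh:1:rev:aafb16d69fd30ff58afdd69036a26047f3aebdc6",
    "date": "2020-01-15T10:30:00+00:00",
    "message": "example",
}
typed = typify_sketch(raw)
```

Values that match neither rule are passed through unchanged, which mirrors the "where appropriate" wording above.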
- class swh.web.client.client.WebAPIClient(api_url: str = 'https://archive.softwareheritage.org/api/1', bearer_token: str | None = None, request_retry=10, retry_status={429}, use_rate_limit: bool = True, automatic_concurrent_queries: bool = True, max_automatic_concurrency: int | None = None)
Bases: object
Client for the Software Heritage archive Web API, see https://archive.softwareheritage.org/api/
- Parameters:
api_url – base URL for API calls
bearer_token – optional bearer token to do authenticated API calls
use_rate_limit – enable or disable request pacing according to server rate limit information.
automatic_concurrent_queries – if True, some large requests that need to be chunked might automatically be issued in parallel
max_automatic_concurrency – maximum number of concurrent requests when automatic_concurrent_queries is set
With rate limiting enabled (the default), the client adjusts its request rate if the server provides rate-limiting headers.
Rate limiting paces the available requests evenly over the rate limit window (except for a small initial budget, as explained below).
For example, if 600 requests remain in a window that resets in 5 minutes (300 seconds), a request can be issued every 0.5 seconds.
This pace is enforced overall, so a period of inactivity can be followed by a faster burst of requests.
For example (using the same numbers as above):
A client that tries to issue requests continuously will have to wait 0.5 seconds between requests.
A client that has not issued requests for 1 minute (60 seconds) will be able to issue 120 requests right away (60 / 0.5) before having to wait 0.5 seconds between requests.
The above is true regardless of the number of threads using the same WebAPIClient.
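The arithmetic in the examples above can be checked with a short sketch. The function names `pacing_delay` and `burst_after_idle` are hypothetical and are not part of WebAPIClient; they only reproduce the numbers quoted in the text.

```python
def pacing_delay(remaining_requests: int, seconds_until_reset: float) -> float:
    """Seconds between requests when spreading them evenly over the window."""
    if remaining_requests <= 0:
        # Nothing left in this window: wait for it to reset
        return seconds_until_reset
    return seconds_until_reset / remaining_requests


def burst_after_idle(idle_seconds: float, delay: float) -> int:
    """Number of requests that accumulated while the client was idle."""
    return int(idle_seconds / delay)


delay = pacing_delay(600, 300)       # 600 requests left, window resets in 300 s
burst = burst_after_idle(60, delay)  # one minute without any request
print(delay, burst)
```

With the numbers from the text, this gives a delay of 0.5 seconds and a burst of 120 requests.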
In practice, to avoid slowing down small applications that issue only a few requests, 10% of the available budget is available immediately; the other 90% of the requests are spread out over the rate limit window.
This initial “immediate” budget is only granted if at least 25% of the total request budget is available.
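One possible reading of this rule, as a sketch: `split_budget` is a hypothetical helper, and both the rounding and the choice to take the 10% from the remaining (rather than the total) budget are assumptions, not confirmed behavior of the client.

```python
from typing import Tuple


def split_budget(remaining: int, total: int) -> Tuple[int, int]:
    """Return (immediate, paced) request counts for the current window."""
    # The immediate budget is only granted when at least 25% of the total is left
    if total > 0 and remaining * 4 >= total:
        immediate = remaining // 10  # 10% usable right away (assumed rounding)
    else:
        immediate = 0
    return immediate, remaining - immediate


immediate, paced = split_budget(600, 600)
# Pace the other requests over a 300-second window
delay = 300 / paced
```

Under these assumptions, a full budget of 600 requests yields 60 immediate requests and 540 paced ones, while a client with only 100 of 600 requests left gets no immediate budget.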
- DEFAULT_AUTOMATIC_CONCURENCY = 20
- property rate_limit_delay
current rate limit delay, in seconds
- get(swhid: CoreSWHID | str, typify: bool = True, **req_args) → Any
Retrieve information about an object of any kind
Dispatcher method over the more specific methods content(), directory(), etc.
Note that this method buffers the entire output when the output is long and iterable (e.g., for snapshot()); see the iter() method for streaming.
- iter(swhid: CoreSWHID | str, typify: bool = True, **req_args) → Iterator[Dict[str, Any]]
Stream over the information about an object of any kind
Streaming variant of get()
- content(swhid: CoreSWHID | str, typify: bool = True, **req_args) → Dict[str, Any]
Retrieve information about a content object
- Parameters:
swhid – object persistent identifier
typify – if True, convert return value to pythonic types wherever possible, otherwise return raw JSON types (default: True)
req_args – extra keyword arguments for requests.get()
- Raises:
requests.HTTPError – if HTTP request fails
- directory(swhid: CoreSWHID | str, typify: bool = True, **req_args) → List[Dict[str, Any]]
Retrieve information about a directory object
- Parameters:
swhid – object persistent identifier
typify – if True, convert return value to pythonic types wherever possible, otherwise return raw JSON types (default: True)
req_args – extra keyword arguments for requests.get()
- Raises:
requests.HTTPError – if HTTP request fails
- revision(swhid: CoreSWHID | str, typify: bool = True, **req_args) → Dict[str, Any]
Retrieve information about a revision object
- Parameters:
swhid – object persistent identifier
typify – if True, convert return value to pythonic types wherever possible, otherwise return raw JSON types (default: True)
req_args – extra keyword arguments for requests.get()
- Raises:
requests.HTTPError – if HTTP request fails
- release(swhid: CoreSWHID | str, typify: bool = True, **req_args) → Dict[str, Any]
Retrieve information about a release object
- Parameters:
swhid – object persistent identifier
typify – if True, convert return value to pythonic types wherever possible, otherwise return raw JSON types (default: True)
req_args – extra keyword arguments for requests.get()
- Raises:
requests.HTTPError – if HTTP request fails
- snapshot(swhid: CoreSWHID | str, typify: bool = True, **req_args) → Iterator[Dict[str, Any]]
Retrieve information about a snapshot object
- Parameters:
swhid – object persistent identifier
typify – if True, convert return value to pythonic types wherever possible, otherwise return raw JSON types (default: True)
req_args – extra keyword arguments for requests.get()
- Returns:
an iterator over partial snapshots (dictionaries mapping branch names to information about where they point to), each containing a subset of available branches
- Raises:
requests.HTTPError – if HTTP request fails
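Because each item yielded by snapshot() holds only a subset of the branches, callers that need the full picture have to combine the pages themselves. A minimal sketch follows; `merge_snapshot_pages` is a hypothetical helper and the page contents are made up, so that it runs without network access.

```python
from typing import Any, Dict, Iterable


def merge_snapshot_pages(pages: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine partial snapshots into one mapping of branch name to target."""
    branches: Dict[str, Any] = {}
    for page in pages:
        branches.update(page)
    return branches


# Made-up pages standing in for what iterating over cli.snapshot(...) yields
fake_pages = [
    {"refs/heads/main": {"target_type": "revision"}},
    {"refs/tags/v1.0": {"target_type": "release"}},
]
all_branches = merge_snapshot_pages(iter(fake_pages))
```

For very large snapshots it may be preferable to process each page as it arrives instead of merging everything in memory.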
- visits(origin: str, per_page: int | None = None, last_visit: int | None = None, typify: bool = True, **req_args) → Iterator[Dict[str, Any]]
List visits of an origin
- Parameters:
origin – the URL of a software origin
per_page – the number of visits to list
last_visit – visit to start listing from
typify – if True, convert return value to pythonic types wherever possible, otherwise return raw JSON types (default: True)
req_args – extra keyword arguments for requests.get()
- Returns:
an iterator over visits of the origin
- Raises:
requests.HTTPError – if HTTP request fails
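Since visits() returns an iterator, a caller interested in only a few visits can stop early rather than exhaust it. The sketch below uses `itertools.islice` for that; `take_visits`, `fake_visits`, and the `visit` key of the sample records are assumptions made for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List


def take_visits(visits: Iterator[Dict[str, Any]], count: int) -> List[Dict[str, Any]]:
    """Consume at most `count` visits and leave the rest of the iterator untouched."""
    return list(islice(visits, count))


def fake_visits() -> Iterator[Dict[str, Any]]:
    # Made-up records standing in for cli.visits(origin); real ones carry more fields
    for number in range(5, 0, -1):
        yield {"visit": number}


stream = fake_visits()
first_two = take_visits(stream, 2)
```

The remaining visits stay available on `stream` if the caller later decides to continue.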
- last_visit(origin: str, typify: bool = True) → Dict[str, Any]
Return the last visit of an origin.
- Parameters:
origin – the URL of a software origin
typify – if True, convert return value to pythonic types wherever possible, otherwise return raw JSON types (default: True)
- Returns:
The last visit for that origin
- Raises:
requests.HTTPError – if HTTP request fails
- known(swhids: Iterable[CoreSWHID | str], **req_args) → Dict[CoreSWHID, Dict[Any, Any]]
Verify the presence in the archive of several objects at once
- Parameters:
swhids – SWHIDs of the objects to verify
- Returns:
a dictionary mapping object SWHIDs to archive information about them; each information dictionary includes a “known” key associated with a boolean value that is true if and only if the object is known to the archive
- Raises:
requests.HTTPError – if HTTP request fails
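The shape of the returned mapping lends itself to splitting identifiers into present and missing ones. In this sketch `partition_known` is a hypothetical helper and the result dictionary is made up; plain strings are used as keys for brevity, whereas the signature above indicates CoreSWHID keys.

```python
from typing import Any, Dict, Hashable, List, Tuple


def partition_known(
    results: Dict[Hashable, Dict[Any, Any]],
) -> Tuple[List[Hashable], List[Hashable]]:
    """Split the keys of a known()-style result into (present, missing)."""
    present = [swhid for swhid, info in results.items() if info["known"]]
    missing = [swhid for swhid, info in results.items() if not info["known"]]
    return present, missing


# Made-up result standing in for cli.known([...])
fake_results = {
    "swh:1:rev:aafb16d69fd30ff58afdd69036a26047f3aebdc6": {"known": True},
    "swh:1:cnt:0000000000000000000000000000000000000000": {"known": False},
}
present, missing = partition_known(fake_results)
```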
- content_exists(swhid: CoreSWHID | str, **req_args) → bool
Check if a content object exists in the archive
- Parameters:
swhid – object persistent identifier
req_args – extra keyword arguments for requests.head()
- Raises:
requests.HTTPError – if HTTP request fails
- directory_exists(swhid: CoreSWHID | str, **req_args) → bool
Check if a directory object exists in the archive
- Parameters:
swhid – object persistent identifier
req_args – extra keyword arguments for requests.head()
- Raises:
requests.HTTPError – if HTTP request fails
- revision_exists(swhid: CoreSWHID | str, **req_args) → bool
Check if a revision object exists in the archive
- Parameters:
swhid – object persistent identifier
req_args – extra keyword arguments for requests.head()
- Raises:
requests.HTTPError – if HTTP request fails
- release_exists(swhid: CoreSWHID | str, **req_args) → bool
Check if a release object exists in the archive
- Parameters:
swhid – object persistent identifier
req_args – extra keyword arguments for requests.head()
- Raises:
requests.HTTPError – if HTTP request fails
- snapshot_exists(swhid: CoreSWHID | str, **req_args) → bool
Check if a snapshot object exists in the archive
- Parameters:
swhid – object persistent identifier
req_args – extra keyword arguments for requests.head()
- Raises:
requests.HTTPError – if HTTP request fails
- origin_exists(origin: str, **req_args) → bool
Check if an origin object exists in the archive
- Parameters:
origin – the URL of a software origin
req_args – extra keyword arguments for requests.head()
- Raises:
requests.HTTPError – if HTTP request fails
- content_raw(swhid: CoreSWHID | str, **req_args) → Iterator[bytes]
Iterate over the raw content of a content object
- Parameters:
swhid – object persistent identifier
req_args – extra keyword arguments for requests.get()
- Raises:
requests.HTTPError – if HTTP request fails
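An iterator of byte chunks can be written to disk without holding the whole content in memory. The following sketch shows one way to do so; `write_chunks` is a hypothetical helper and the chunks are made up in place of a real content_raw() call.

```python
import os
import tempfile
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write an iterable of byte chunks to `path` and return the bytes written."""
    written = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            written += out.write(chunk)
    return written


# Made-up chunks standing in for cli.content_raw(swhid)
fake_chunks = iter([b"hello ", b"world\n"])
target = os.path.join(tempfile.mkdtemp(), "content.bin")
size = write_chunks(fake_chunks, target)
```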
- origin_search(query: str, limit: int | None = None, with_visit: bool = False, **req_args) → Iterator[Dict[str, Any]]
List origin search results
- Parameters:
query – search keywords
limit – the maximum number of found origins to return
with_visit – if true, only return origins with at least one visit
- Returns:
an iterator over search results
- Raises:
requests.HTTPError – if HTTP request fails
- origin_save(visit_type: str, origin: str) → Dict
Submit a “Save Code Now” request for the origin, using the given visit_type.
- Parameters:
visit_type – Type of the visit
origin – the origin to save
- Returns:
a dictionary describing the saved visit
- Raises:
requests.HTTPError – if HTTP request fails
- get_origin(swhid: CoreSWHID) → Any | None
Walk the compressed graph to discover the origin of a given swhid
This method exists for the swh-scanner and is likely to change significantly and/or be replaced. We do not recommend using it.
- cooking_request(bundle_type: str, swhid: CoreSWHID | str, email: str | None = None, **req_args) → Dict[str, Any]
Request a cooking of a bundle
- Parameters:
bundle_type – Type of the bundle
swhid – object persistent identifier
email – e-mail to notify when the archive is ready
req_args – extra keyword arguments for requests.post()
- Returns:
an object containing the following keys:
fetch_url (string): the URL from which to download the archive
progress_message (string): message describing the cooking task progress
id (number): the cooking task id
status (string): the cooking task status (new/pending/done/failed)
swhid (string): the identifier of the object to cook
- Raises:
requests.HTTPError – if HTTP request fails
- cooking_check(bundle_type: str, swhid: CoreSWHID | str, **req_args) → Dict[str, Any]
Check the status of a cooking task
- Parameters:
bundle_type – Type of the bundle
swhid – object persistent identifier
req_args – extra keyword arguments for requests.get()
- Returns:
an object containing the following keys:
fetch_url (string): the URL from which to download the archive
progress_message (string): message describing the cooking task progress
id (number): the cooking task id
status (string): the cooking task status (new/pending/done/failed)
swhid (string): the identifier of the object to cook
- Raises:
requests.HTTPError – if HTTP request fails
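Given the status values listed above (new/pending/done/failed), a caller would typically poll until the task reaches a final state. The sketch below shows one way to structure such a loop; `wait_for_bundle`, its parameters, and the status sequence are hypothetical, and the check function is injected so the example runs without a real client.

```python
import time
from typing import Any, Callable, Dict


def wait_for_bundle(
    check: Callable[[], Dict[str, Any]],
    poll_interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call `check` until the task status is "done" or "failed"."""
    for _ in range(max_attempts):
        task = check()
        if task["status"] in ("done", "failed"):
            return task
        sleep(poll_interval)
    raise TimeoutError("cooking task did not finish in time")


# Made-up status sequence; with a real client, `check` could be
# lambda: cli.cooking_check(bundle_type, swhid)
statuses = iter(["new", "pending", "done"])
result = wait_for_bundle(
    lambda: {"status": next(statuses), "fetch_url": None},
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
```

Bounding the number of attempts keeps the loop from running forever if a task never leaves the pending state; the interval and limit shown here are arbitrary.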
- cooking_fetch(bundle_type: str, swhid: CoreSWHID | str, **req_args) → Response
Fetch the archive of a cooking task
- Parameters:
bundle_type – Type of the bundle
swhid – object persistent identifier
req_args – extra keyword arguments for requests.get()
- Returns:
a requests.models.Response object containing a stream of the archive
- Raises:
requests.HTTPError – if HTTP request fails