swh.web.common.origin_save module

swh.web.common.origin_save.get_origin_save_authorized_urls() → List[str]

Get the list of origin url prefixes authorized to be immediately loaded into the archive (whitelist).


Returns

The list of authorized origin url prefixes

Return type

List[str]


swh.web.common.origin_save.get_origin_save_unauthorized_urls() → List[str]

Get the list of origin url prefixes forbidden to be loaded into the archive (blacklist).


Returns

The list of unauthorized origin url prefixes

Return type

List[str]


swh.web.common.origin_save.can_save_origin(origin_url: str, bypass_pending_review: bool = False) → str

Check if a software origin can be saved into the archive.

Based on the origin url, the save request will be either:

  • immediately accepted if the url is whitelisted

  • rejected if the url is blacklisted

  • put in pending state for manual review otherwise
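
The three outcomes above can be illustrated with a minimal sketch based on prefix matching. This is not the module's implementation: the function name `sketch_can_save_origin`, its explicit prefix-list parameters, and the choice to test the blacklist before the whitelist are all assumptions made for illustration.

```python
from typing import Iterable


def sketch_can_save_origin(
    origin_url: str,
    authorized_prefixes: Iterable[str],
    unauthorized_prefixes: Iterable[str],
) -> str:
    """Return "accepted", "rejected" or "pending" for an origin url."""
    # Assumption: the blacklist is tested first so that it takes precedence.
    if any(origin_url.startswith(prefix) for prefix in unauthorized_prefixes):
        return "rejected"
    # A whitelisted prefix means the request can be accepted immediately.
    if any(origin_url.startswith(prefix) for prefix in authorized_prefixes):
        return "accepted"
    # Anything else waits for manual review.
    return "pending"
```

In the real module the two prefix lists would presumably come from get_origin_save_authorized_urls() and get_origin_save_unauthorized_urls() rather than being passed in.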


Parameters

origin_url (str) – the software origin url to check

Returns

The origin save request status, either accepted, rejected or pending

Return type

str


swh.web.common.origin_save.get_savable_visit_types() → List[str]

Get the list of visit types that can be performed through a save request.


Returns

The list of savable visit types

Return type

List[str]


swh.web.common.origin_save.origin_exists(origin_url: str) → swh.web.common.typing.OriginExistenceCheckInfo

Check the origin url for existence. If it exists, extract some more useful information on the origin.

swh.web.common.origin_save.create_save_origin_request(visit_type: str, origin_url: str, bypass_pending_review: bool = False, user_id: Optional[int] = None) → swh.web.common.typing.SaveOriginRequestInfo

Create a loading task to save a software origin into the archive.

This function creates a software origin loading task through the swh-scheduler component.

First, some checks are performed to see whether the visit type and origin url are valid, and whether the save request can be accepted. If those checks pass, the loading task is created. Otherwise, the save request is put in pending or rejected state.

All the submitted save requests are logged into the swh-web database to keep track of them.
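
The flow described above can be sketched as follows. This is a simplified illustration, not the actual implementation: `sketch_create_save_request` is a hypothetical name, the exception classes are local stand-ins for the ones named in this documentation, the url validity test (scheme and host present) is an assumption, the request status is passed in rather than computed, and an in-memory list stands in for the swh-web database. No scheduler is contacted.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List
from urllib.parse import urlparse


class BadInputExc(Exception):
    """Local stand-in for the exception named in this documentation."""


class ForbiddenExc(Exception):
    """Local stand-in for the exception named in this documentation."""


def sketch_create_save_request(
    visit_type: str,
    origin_url: str,
    savable_visit_types: List[str],
    request_status: str,
    request_log: List[Dict[str, Any]],
) -> Dict[str, Any]:
    # Check 1: the visit type must be one of the savable ones.
    if visit_type not in savable_visit_types:
        raise BadInputExc(f"Visit type {visit_type!r} is not supported")
    # Check 2 (assumption): a valid origin url has a scheme and a host.
    parsed = urlparse(origin_url)
    if not parsed.scheme or not parsed.netloc:
        raise BadInputExc(f"Origin url {origin_url!r} is not valid")

    # Only accepted requests get a loading task (illustrative initial status).
    task_status = "not yet scheduled" if request_status == "accepted" else "not created"
    info = {
        "visit_type": visit_type,
        "origin_url": origin_url,
        "save_request_date": datetime.now(timezone.utc),
        "save_request_status": request_status,
        "save_task_status": task_status,
    }
    # Every submitted request is recorded, whatever its status.
    request_log.append(info)
    if request_status == "rejected":
        raise ForbiddenExc(f"Origin url {origin_url!r} is blacklisted")
    return info
```

Recording the request before raising ForbiddenExc is a design assumption here, chosen so that rejected requests are tracked like the others.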

Parameters

  • visit_type – the type of visit to perform (e.g. git, hg, svn, …)

  • origin_url – the url of the origin to save

Raises

  • BadInputExc – the visit type or origin url is invalid or nonexistent

  • ForbiddenExc – the provided origin url is blacklisted


Returns

A dict describing the save request with the following keys:

  • visit_type: the type of visit to perform

  • origin_url: the url of the origin

  • save_request_date: the date the request was submitted

  • save_request_status: the request status, either accepted, rejected or pending

  • save_task_status: the origin loading task status, either not created, not yet scheduled, scheduled, succeeded or failed

Return type

swh.web.common.typing.SaveOriginRequestInfo


swh.web.common.origin_save.update_save_origin_requests_from_queryset(requests_queryset: django.db.models.query.QuerySet) → List[swh.web.common.typing.SaveOriginRequestInfo]

Update all save requests from a SaveOriginRequest queryset: refresh their status in the database and return the list of impacted save requests.


Parameters

requests_queryset – input SaveOriginRequest queryset

Returns

A list of save origin request info dicts as described in swh.web.common.origin_save.create_save_origin_request()

Return type

List[swh.web.common.typing.SaveOriginRequestInfo]


swh.web.common.origin_save.refresh_save_origin_request_statuses() → List[swh.web.common.typing.SaveOriginRequestInfo]

Refresh non-terminal save origin requests (SOR) in the backend.

Non-terminal SORs are requests whose status is accepted and whose task status is either created, not yet scheduled, scheduled or running.

This function computes the list of such SORs, checks their current status in the scheduler (and optionally in Elasticsearch), then updates them in the database.

Finally, this returns the refreshed information on those SOR.
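
The definition of a non-terminal SOR can be expressed as a small predicate over request info dicts. The sketch below assumes plain dicts carrying the `save_request_status` and `save_task_status` keys documented for create_save_origin_request(); the helper names are hypothetical, and the real function works on database records rather than in-memory lists.

```python
from typing import Dict, List

# Task statuses listed above as non-terminal.
NON_TERMINAL_TASK_STATUSES = {"created", "not yet scheduled", "scheduled", "running"}


def is_non_terminal(request: Dict[str, str]) -> bool:
    """True if the request is accepted and its task may still change state."""
    return (
        request.get("save_request_status") == "accepted"
        and request.get("save_task_status") in NON_TERMINAL_TASK_STATUSES
    )


def select_non_terminal(requests: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the requests whose status still needs refreshing."""
    return [request for request in requests if is_non_terminal(request)]
```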

swh.web.common.origin_save.get_save_origin_requests(visit_type: str, origin_url: str) → List[swh.web.common.typing.SaveOriginRequestInfo]

Get all save requests for a given software origin.

Parameters

  • visit_type – the type of visit

  • origin_url – the url of the origin

Returns

A list of save origin request dicts as described in swh.web.common.origin_save.create_save_origin_request()

Return type

List[swh.web.common.typing.SaveOriginRequestInfo]


swh.web.common.origin_save.get_save_origin_task_info(save_request_id: int, full_info: bool = True) → Dict[str, Any]

Get detailed information about an accepted save origin request and its associated loading task.

If the associated loading task info is archived and removed from the scheduler database, returns an empty dictionary.

Parameters

  • save_request_id – identifier of a save origin request

  • full_info – whether to return detailed info for staff users


Returns

A dictionary with the following keys:

  • type: loading task type

  • arguments: loading task arguments

  • id: loading task database identifier

  • backend_id: loading task celery identifier

  • scheduled: loading task scheduling date

  • ended: loading task termination date

  • status: loading task execution status

  • visit_status: actual visit status

Depending on the availability of the task logs in the Elasticsearch cluster of Software Heritage, the returned dictionary may also contain the following keys:

  • name: associated celery task name

  • message: relevant log message from task execution

  • duration: task execution time (only if it succeeded)

  • worker: name of the worker that executed the task

Return type

Dict[str, Any]
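
Because the returned dictionary may be empty, and the log-derived keys are only present when task logs are available, callers should read it defensively. The sketch below shows one way to do so; `summarize_task_info` is a hypothetical helper, and the sample values in the usage note are made up for illustration.

```python
from typing import Any, Dict

# Keys that are only present when task logs are available.
OPTIONAL_LOG_KEYS = ("name", "message", "duration", "worker")


def summarize_task_info(task_info: Dict[str, Any]) -> str:
    """Build a one-line summary from a task info dictionary."""
    # An empty dictionary means the task info is no longer available.
    if not task_info:
        return "no task info available (possibly archived)"
    parts = [
        f"task {task_info.get('id')} ({task_info.get('type')}): "
        f"{task_info.get('status')}"
    ]
    # Append log-derived fields only when they are present.
    for key in OPTIONAL_LOG_KEYS:
        if key in task_info:
            parts.append(f"{key}={task_info[key]}")
    return ", ".join(parts)
```

For example, an illustrative dictionary such as `{"id": 1, "type": "load-git", "status": "succeeded", "worker": "w1"}` would be summarized with the worker appended, while an empty dictionary yields the "no task info available" message.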

swh.web.common.origin_save.compute_save_requests_metrics() → None

Compute Prometheus metrics related to origin save requests:

  • Number of submitted origin save requests

  • Number of accepted origin save requests

  • Save Code Now requests delay between request time and actual time of ingestion