swh.web.common.origin_save module

swh.web.common.origin_save.get_origin_save_authorized_urls() List[str][source]

Get the list of origin url prefixes authorized to be immediately loaded into the archive (whitelist).

Returns

The list of authorized origin url prefixes

Return type

list

swh.web.common.origin_save.get_origin_save_unauthorized_urls() List[str][source]

Get the list of origin url prefixes forbidden to be loaded into the archive (blacklist).

Returns

The list of unauthorized origin url prefixes

Return type

list

swh.web.common.origin_save.can_save_origin(origin_url: str, bypass_pending_review: bool = False) str[source]

Check if a software origin can be saved into the archive.

Based on the origin url, the save request will be either:

  • immediately accepted if the url is whitelisted

  • rejected if the url is blacklisted

  • put in pending state for manual review otherwise

Parameters

origin_url (str) – the software origin url to check

Returns

the origin save request status, either accepted, rejected or pending

Return type

str
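The decision rule above can be illustrated with a minimal, standalone sketch. It is not the swh-web implementation: the function name `sketch_can_save_origin`, the explicit prefix arguments, the order in which the two lists are consulted, and the effect of `bypass_pending_review` are all assumptions made for illustration.

```python
from typing import Sequence

# Status strings as listed in the documentation above.
ACCEPTED, REJECTED, PENDING = "accepted", "rejected", "pending"


def sketch_can_save_origin(
    origin_url: str,
    authorized_prefixes: Sequence[str],
    unauthorized_prefixes: Sequence[str],
    bypass_pending_review: bool = False,
) -> str:
    """Hypothetical stand-in for can_save_origin; not the swh-web code."""
    # Assumption: the blacklist is consulted first, so a forbidden prefix wins.
    if any(origin_url.startswith(p) for p in unauthorized_prefixes):
        return REJECTED
    # A whitelisted prefix means the request is immediately accepted.
    if any(origin_url.startswith(p) for p in authorized_prefixes):
        return ACCEPTED
    # Assumption: bypassing review turns a would-be pending request
    # into an accepted one.
    return ACCEPTED if bypass_pending_review else PENDING


# Example prefixes are made up for this sketch.
allowed = ["https://github.com/"]
denied = ["https://github.com/spam/"]
print(sketch_can_save_origin("https://github.com/user/repo", allowed, denied))
print(sketch_can_save_origin("https://example.org/repo.git", allowed, denied))
```

In the real module the prefix lists would presumably come from get_origin_save_authorized_urls() and get_origin_save_unauthorized_urls() rather than from arguments.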

swh.web.common.origin_save.get_savable_visit_types() List[str][source]

Get the list of visit types that can be performed through a save request.

Returns

The list of savable visit types

Return type

list

swh.web.common.origin_save.origin_exists(origin_url: str) swh.web.common.typing.OriginExistenceCheckInfo[source]

Check whether the origin url exists. If it does, extract additional information about the origin.

swh.web.common.origin_save.create_save_origin_request(visit_type: str, origin_url: str, bypass_pending_review: bool = False, user_id: Optional[int] = None) swh.web.common.typing.SaveOriginRequestInfo[source]

Create a loading task to save a software origin into the archive.

This function creates a software origin loading task through the swh-scheduler component.

First, some checks are performed to verify that the visit type and origin url are valid and that the save request can be accepted. If those checks pass, the loading task is created. Otherwise, the save request is put in pending or rejected state.

All the submitted save requests are logged into the swh-web database to keep track of them.

Parameters
  • visit_type – the type of visit to perform (e.g. git, hg, svn, …)

  • origin_url – the url of the origin to save

Raises
  • BadInputExc – the visit type or origin url is invalid or does not exist

  • ForbiddenExc – the provided origin url is blacklisted

Returns

A dict describing the save request with the following keys:

  • visit_type: the type of visit to perform

  • origin_url: the url of the origin

  • save_request_date: the date the request was submitted

  • save_request_status: the request status, either accepted, rejected or pending

  • save_task_status: the origin loading task status, either not created, not yet scheduled, scheduled, succeed or failed

Return type

dict
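As an illustration of how a caller might consume the returned dict, here is a minimal sketch. The `SaveRequestSketch` type and the `describe` helper are hypothetical and only mirror the keys documented above. The actual return type is swh.web.common.typing.SaveOriginRequestInfo, and typing the date as a string is an assumption.

```python
from typing import TypedDict


class SaveRequestSketch(TypedDict):
    """Hypothetical mirror of the documented keys; not the real type."""

    visit_type: str
    origin_url: str
    save_request_date: str  # assumption: the date is serialized as a string
    save_request_status: str
    save_task_status: str


def describe(request: SaveRequestSketch) -> str:
    """Turn a save request dict into a short human-readable message."""
    status = request["save_request_status"]
    if status == "accepted":
        return (
            f"{request['origin_url']}: accepted, "
            f"task is {request['save_task_status']}"
        )
    if status == "pending":
        return f"{request['origin_url']}: awaiting manual review"
    return f"{request['origin_url']}: {status}"


# Example values are made up for this sketch.
example: SaveRequestSketch = {
    "visit_type": "git",
    "origin_url": "https://example.org/repo.git",
    "save_request_date": "2021-01-01T00:00:00+00:00",
    "save_request_status": "pending",
    "save_task_status": "not created",
}
print(describe(example))
```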

swh.web.common.origin_save.update_save_origin_requests_from_queryset(requests_queryset: django.db.models.query.QuerySet) List[swh.web.common.typing.SaveOriginRequestInfo][source]

Update all save requests in a SaveOriginRequest queryset, store their new status in the database, and return the list of affected save requests.

Parameters

requests_queryset – input SaveOriginRequest queryset

Returns

A list of save origin request info dicts as described in swh.web.common.origin_save.create_save_origin_request()

Return type

list

swh.web.common.origin_save.refresh_save_origin_request_statuses() List[swh.web.common.typing.SaveOriginRequestInfo][source]

Refresh non-terminal save origin requests (SOR) in the backend.

Non-terminal SOR are requests whose status is accepted and whose task status is either created, not yet scheduled, scheduled or running.

This function computes the list of such SOR, checks their current status in the scheduler (and optionally in Elasticsearch), then updates them in the database.

Finally, this returns the refreshed information on those SOR.
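The notion of a non-terminal SOR can be sketched as a simple predicate over save request dicts. This is an illustration under assumptions: the real function works on database records rather than plain dicts, the helper name `is_non_terminal` is hypothetical, and the status strings are copied verbatim from the description above.

```python
from typing import Dict, List

# Task statuses copied from the text above; the real constants may differ.
NON_TERMINAL_TASK_STATUSES = {
    "created",
    "not yet scheduled",
    "scheduled",
    "running",
}


def is_non_terminal(request: Dict[str, str]) -> bool:
    """Return True if the request still needs a status refresh."""
    return (
        request["save_request_status"] == "accepted"
        and request["save_task_status"] in NON_TERMINAL_TASK_STATUSES
    )


# Example records are made up for this sketch.
requests: List[Dict[str, str]] = [
    {"save_request_status": "accepted", "save_task_status": "running"},
    {"save_request_status": "accepted", "save_task_status": "failed"},
    {"save_request_status": "pending", "save_task_status": "scheduled"},
]
to_refresh = [r for r in requests if is_non_terminal(r)]
print(len(to_refresh))
```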

swh.web.common.origin_save.get_save_origin_requests(visit_type: str, origin_url: str) List[swh.web.common.typing.SaveOriginRequestInfo][source]

Get all save requests for a given software origin.

Parameters
  • visit_type – the type of visit

  • origin_url – the url of the origin

Returns

A list of save origin request dicts as described in swh.web.common.origin_save.create_save_origin_request()

Return type

list

swh.web.common.origin_save.get_save_origin_task_info(save_request_id: int, full_info: bool = True) Dict[str, Any][source]

Get detailed information about an accepted save origin request and its associated loading task.

If the associated loading task info has been archived and removed from the scheduler database, an empty dictionary is returned.

Parameters
  • save_request_id – identifier of a save origin request

  • full_info – whether to return detailed info for staff users

Returns

A dictionary with the following keys:

  • type: loading task type

  • arguments: loading task arguments

  • id: loading task database identifier

  • backend_id: loading task celery identifier

  • scheduled: loading task scheduling date

  • ended: loading task termination date

  • status: loading task execution status

  • visit_status: actual visit status

Depending on the availability of the task logs in the Elasticsearch cluster of Software Heritage, the returned dictionary may also contain the following keys:

  • name: associated celery task name

  • message: relevant log message from task execution

  • duration: task execution time (only if it succeeded)

  • worker: name of the worker that executed the task

Return type

dict
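Because the returned dictionary may be empty and some keys are only present when task logs are available, callers probably want to read it defensively. The following sketch shows one way to do so. The helper `summarize_task_info` and the sample values are hypothetical, and only keys listed in the documentation above are used.

```python
from typing import Any, Dict


def summarize_task_info(info: Dict[str, Any]) -> str:
    """Build a one-line summary from a task info dict."""
    # An empty dict means the task info is no longer in the scheduler database.
    if not info:
        return "no task info available"
    parts = [f"task {info['id']} ({info['type']}): {info['status']}"]
    # These keys depend on log availability, so they may be absent.
    for key in ("worker", "duration", "message"):
        if key in info:
            parts.append(f"{key}={info[key]}")
    return ", ".join(parts)


# Sample values are made up for this sketch.
print(summarize_task_info({}))
print(summarize_task_info({"id": 1, "type": "load-git", "status": "scheduled"}))
```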

swh.web.common.origin_save.compute_save_requests_metrics() None[source]

Compute Prometheus metrics related to origin save requests:

  • Number of submitted origin save requests

  • Number of accepted origin save requests

  • Save Code Now requests delay between request time and actual time of ingestion
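The three metrics listed above can be illustrated with plain Python over in-memory records. This sketch does not use Prometheus and is not the swh-web implementation. The record fields `status`, `requested_at` and `ingested_at` and the function name `sketch_metrics` are assumptions chosen for the example.

```python
from datetime import datetime
from typing import Any, Dict, List


def sketch_metrics(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute counts and ingestion delays from hypothetical records."""
    submitted = len(records)
    accepted = sum(1 for r in records if r["status"] == "accepted")
    # Delay in seconds between request time and ingestion time,
    # only for records that have actually been ingested.
    delays = [
        (r["ingested_at"] - r["requested_at"]).total_seconds()
        for r in records
        if r.get("ingested_at") is not None
    ]
    return {
        "submitted": submitted,
        "accepted": accepted,
        "delays_seconds": delays,
    }


# Sample records are made up for this sketch.
records = [
    {
        "status": "accepted",
        "requested_at": datetime(2021, 1, 1, 0, 0),
        "ingested_at": datetime(2021, 1, 1, 0, 5),
    },
    {
        "status": "pending",
        "requested_at": datetime(2021, 1, 2, 0, 0),
        "ingested_at": None,
    },
]
print(sketch_metrics(records))
```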