swh.scheduler.simulator.origins module

This module implements a model of the frequency of updates of an origin and how long it takes to load it.

For each origin, a commit frequency is chosen deterministically from the hash of its URL, and all origins are assumed to have been created at an arbitrary epoch. The number of commits is then computed as the product of the commit frequency and the time elapsed since that epoch.

The run time of a load task is approximated as proportional to the number of commits since the previous visit of the origin (possibly 0).
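The model described above can be sketched in a few lines. This is a minimal illustration, not the module's actual code: the helper names `commits_since_epoch` and `approximate_run_time` are hypothetical, the constant values are copied from the `OriginModel` attributes documented below, and time is assumed to be measured in seconds.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the OriginModel attributes; units assumed to be seconds.
EPOCH = datetime(2015, 9, 1, tzinfo=timezone.utc)
PER_COMMIT_RUN_TIME = 0.1


def commits_since_epoch(now: datetime, seconds_between_commits: float) -> int:
    """Hypothetical helper: commits made from EPOCH to `now` at a constant rate."""
    elapsed = (now - EPOCH).total_seconds()
    return max(0, int(elapsed // seconds_between_commits))


def approximate_run_time(
    previous_visit: datetime, now: datetime, seconds_between_commits: float
) -> float:
    """Hypothetical helper: run time proportional to commits since the last visit."""
    new_commits = commits_since_epoch(now, seconds_between_commits) - commits_since_epoch(
        previous_visit, seconds_between_commits
    )
    return new_commits * PER_COMMIT_RUN_TIME


# One commit per hour, visits one day apart: 24 new commits to load.
previous = EPOCH + timedelta(days=1)
current = EPOCH + timedelta(days=2)
print(approximate_run_time(previous, current, 3600))
```

An origin visited twice at the same instant has no new commits, so the approximated run time is 0 before any minimum is applied.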

swh.scheduler.simulator.origins.generate_listed_origin(lister_id: UUID, now: datetime | None = None) → ListedOrigin

Returns a new, globally unique origin. Its last_update value is seeded according to the OriginModel and the passed timestamp.

Parameters:
  • lister_id – ID of the lister instance that generated this origin

  • now – time of listing, to emulate last_update (defaults to datetime.now())

class swh.scheduler.simulator.origins.OriginModel(type: str, origin: str)

Bases: object

MIN_RUN_TIME = 0.5

Minimal run time for a visit (retrieved from production data)

MAX_RUN_TIME = 7200

Maximum run time for a visit

PER_COMMIT_RUN_TIME = 0.1

Run time per commit

EPOCH = datetime.datetime(2015, 9, 1, 0, 0, tzinfo=datetime.timezone.utc)

The origin of all origins (at least according to Software Heritage)

seconds_between_commits()

Returns a pseudo-random ‘average time between two commits’ for this origin, chosen deterministically from the hash of its URL. This value is used to estimate the run time of a load task and how far the loading architecture is lagging behind origin updates.
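One way to pick such a value deterministically from a URL is sketched below. Everything here is an assumption made for illustration: the function name `seconds_between_commits_sketch`, the use of SHA-256, the log-uniform mapping, and the one-minute to ten-year range are not taken from the module.

```python
import hashlib
import math

TEN_YEARS = 10 * 365 * 24 * 3600  # assumed upper bound, in seconds


def seconds_between_commits_sketch(
    url: str, min_seconds: float = 60.0, max_seconds: float = TEN_YEARS
) -> float:
    """Map a URL to a stable interval between min_seconds and max_seconds."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # The first two bytes give a bucket in [0, 65535].
    bucket = int.from_bytes(digest[:2], "big")
    fraction = bucket / 0xFFFF
    # Interpolate on a log scale so that short and long intervals both occur.
    log_min, log_max = math.log(min_seconds), math.log(max_seconds)
    return math.exp(log_min + fraction * (log_max - log_min))


interval = seconds_between_commits_sketch("https://example.org/repo.git")
print(interval)
```

Because the value depends only on the URL, repeated calls for the same origin agree with each other, which keeps simulation runs reproducible.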

get_last_update(now: datetime) → datetime

Get the last_update value for this origin.

We assume that the origin had its first commit at EPOCH, and that one commit has happened every self.seconds_between_commits() seconds since then. This returns the date of the last commit at or before now.
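The computation described here amounts to rounding `now` down to the nearest commit on a regular grid starting at EPOCH. A minimal sketch follows, assuming `now` is timezone-aware and not earlier than EPOCH; the standalone function `last_update_sketch` is hypothetical and takes the commit interval as an argument instead of calling the method.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(2015, 9, 1, tzinfo=timezone.utc)


def last_update_sketch(now: datetime, seconds_between_commits: int) -> datetime:
    """Return the latest commit date on the EPOCH-anchored grid that is <= now."""
    if now < EPOCH:
        # Assumption of this sketch: no commits exist before EPOCH.
        raise ValueError("now must not be earlier than EPOCH")
    elapsed = int((now - EPOCH).total_seconds())
    # Number of whole commit intervals that fit between EPOCH and now.
    intervals = elapsed // seconds_between_commits
    return EPOCH + timedelta(seconds=intervals * seconds_between_commits)


now = EPOCH + timedelta(hours=2, minutes=30)
print(last_update_sketch(now, 3600))
```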

get_current_snapshot_id(now: datetime) → bytes

Get the current snapshot for this origin.

To generate a snapshot id, we calculate the number of commits since the EPOCH, and hash it alongside the origin type and url.
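A snapshot identifier built this way changes exactly when the commit count changes. The sketch below illustrates the idea only: the choice of SHA-1, the space-separated payload layout, and the name `snapshot_id_sketch` are assumptions, not the module's implementation.

```python
import hashlib


def snapshot_id_sketch(origin_type: str, url: str, commit_count: int) -> bytes:
    """Hash the origin type, URL and commit count into a 20-byte identifier."""
    payload = f"{origin_type} {url} {commit_count}".encode("utf-8")
    return hashlib.sha1(payload).digest()


before = snapshot_id_sketch("git", "https://example.org/repo.git", 41)
after = snapshot_id_sketch("git", "https://example.org/repo.git", 42)
print(before != after)
```

Two visits with no commits in between yield the same identifier, which lets a simulated loader detect an unchanged origin.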

load_task_characteristics(now: datetime) → Tuple[float, str, bytes | None]

Returns the (run_time, end_status, snapshot_id) tuple for the next visit of this origin.
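The run time component can be pictured as a per-commit cost bounded by the class constants. The sketch below is an assumption about how those constants combine (a simple clamp); the real method may instead add the minimum as a fixed base cost, and it also determines the end status and snapshot id, which are not modelled here. The name `run_time_sketch` is hypothetical.

```python
# Values copied from the OriginModel attributes; units assumed to be seconds.
MIN_RUN_TIME = 0.5
MAX_RUN_TIME = 7200
PER_COMMIT_RUN_TIME = 0.1


def run_time_sketch(new_commits: int) -> float:
    """Per-commit cost, clamped to the [MIN_RUN_TIME, MAX_RUN_TIME] range."""
    raw = new_commits * PER_COMMIT_RUN_TIME
    return min(MAX_RUN_TIME, max(MIN_RUN_TIME, raw))


for commits in (0, 100, 1_000_000):
    print(commits, run_time_sketch(commits))
```

Under this assumption, a visit that finds nothing new still costs the minimum run time, and very active origins are capped at the maximum.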

swh.scheduler.simulator.origins.lister_process(env: Environment, lister_id: UUID) → Generator[Event, Event, None]

Every hour, generates new origins and updates the last_update field of the origins this process generated previously.

swh.scheduler.simulator.origins.load_task_process(env: Environment, task: Task, status_queue: Queue) → Iterator[Event]

A loading task. This pushes OriginVisitStatus objects to the status_queue to simulate the visible outcomes of the task.

Uses OriginModel.load_task_characteristics to determine its run time.