swh.scheduler.simulator.origins module

This module implements a model of the frequency of updates of an origin and how long it takes to load it.

For each origin, a commit frequency is chosen deterministically from the hash of its URL, and all origins are assumed to have been created at an arbitrary epoch. The number of commits of an origin is then the product of these two: its commit frequency and its age (the time elapsed since the epoch).

The run time of a load task is approximated as proportional to the number of commits since the previous visit of the origin (possibly 0).
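As a rough illustration of this model, the following sketch estimates a run time from the number of commits that appeared between two visits. The constants reuse the values documented for OriginModel in this module; the helper names `commits_at` and `estimated_run_time`, and the clamping of the result between MIN_RUN_TIME and MAX_RUN_TIME, are assumptions made for the example and are not taken from the module's source.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the OriginModel attributes documented in this module.
EPOCH = datetime(2015, 9, 1, tzinfo=timezone.utc)
MIN_RUN_TIME = 0.5
MAX_RUN_TIME = 7200
PER_COMMIT_RUN_TIME = 0.1


def commits_at(when: datetime, seconds_between_commits: float) -> int:
    """Hypothetical helper: commits accumulated between EPOCH and `when`."""
    elapsed = (when - EPOCH).total_seconds()
    return max(0, int(elapsed // seconds_between_commits))


def estimated_run_time(
    last_visit: datetime, now: datetime, seconds_between_commits: float
) -> float:
    """Hypothetical helper: run time proportional to commits since last visit."""
    new_commits = commits_at(now, seconds_between_commits) - commits_at(
        last_visit, seconds_between_commits
    )
    # Assumption: the proportional estimate is clamped to the documented bounds.
    return min(MAX_RUN_TIME, max(MIN_RUN_TIME, new_commits * PER_COMMIT_RUN_TIME))


# One commit per hour, visited one day after the previous visit: 24 new commits.
one_day_later = EPOCH + timedelta(days=1)
run_time = estimated_run_time(EPOCH, one_day_later, 3600.0)
```

Under these assumptions, an origin with no new commits still costs the minimal run time, and a very active origin is capped at the maximal run time.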

swh.scheduler.simulator.origins.generate_listed_origin(lister_id: uuid.UUID, now: Optional[datetime.datetime] = None) → swh.scheduler.model.ListedOrigin

Returns a new, globally unique origin. The last_update value is seeded according to the OriginModel and the passed timestamp.

Parameters
  • lister_id – identifier of the lister instance that generated this origin

  • now – time of listing, to emulate last_update (defaults to datetime.now())

class swh.scheduler.simulator.origins.OriginModel(type: str, origin: str)

Bases: object

MIN_RUN_TIME = 0.5

Minimal run time for a visit (retrieved from production data)

MAX_RUN_TIME = 7200

Max run time for a visit

PER_COMMIT_RUN_TIME = 0.1

Run time per commit

EPOCH = datetime.datetime(2015, 9, 1, 0, 0, tzinfo=datetime.timezone.utc)

The origin of all origins (at least according to Software Heritage)

seconds_between_commits()

Returns a pseudo-random ‘average time between two commits’ for this origin, derived deterministically from the hash of its URL. It is used to estimate the run time of a load task and how far the loading architecture lags behind origin updates.
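One way to obtain a stable, per-origin interval from a URL hash is sketched below. This is only an illustration: the choice of SHA-256, the two-byte bucket, the logarithmic scale, and the ten-year upper bound (`TEN_YEARS`) are all assumptions for the example, not details confirmed by this documentation.

```python
import hashlib

# Hypothetical upper bound for the interval, in seconds (about ten years).
TEN_YEARS = 10 * 365 * 24 * 3600
NUM_BUCKETS = 2**16


def sketch_seconds_between_commits(url: str) -> float:
    """Map a URL to a stable interval between 1 second and TEN_YEARS."""
    digest = hashlib.sha256(url.encode()).digest()
    # Use the first two bytes of the digest as a bucket in [0, 65535].
    bucket = int.from_bytes(digest[:2], "big")
    # Spread buckets on a logarithmic scale so both very active and
    # nearly dormant origins are represented.
    return TEN_YEARS ** (bucket / NUM_BUCKETS)


interval = sketch_seconds_between_commits("https://example.org/repo.git")
```

Because the value depends only on the URL, repeated calls for the same origin return the same interval, which keeps a simulation reproducible across runs.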

get_last_update(now: datetime.datetime) → datetime.datetime

Get the last_update value for this origin.

We assume that the origin had its first commit at EPOCH, and that one commit has happened every self.seconds_between_commits() seconds since then. This returns the date of the last commit made at or before now.
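The computation described here can be sketched as rounding the elapsed time down to a whole number of commit intervals. The function name `sketch_last_update` is hypothetical, and the sketch assumes that `now` is not earlier than EPOCH.

```python
from datetime import datetime, timedelta, timezone

# Value copied from the documented OriginModel.EPOCH attribute.
EPOCH = datetime(2015, 9, 1, tzinfo=timezone.utc)


def sketch_last_update(now: datetime, seconds_between_commits: float) -> datetime:
    """Date of the last commit at or before `now` (assumes now >= EPOCH)."""
    elapsed = (now - EPOCH).total_seconds()
    # Whole number of intervals elapsed since the first commit at EPOCH.
    n_intervals = int(elapsed // seconds_between_commits)
    return EPOCH + timedelta(seconds=n_intervals * seconds_between_commits)


# With one commit per hour, 2h30 after EPOCH the last commit was at EPOCH + 2h.
last = sketch_last_update(EPOCH + timedelta(hours=2, minutes=30), 3600.0)
```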

get_current_snapshot_id(now: datetime.datetime) → bytes

Get the current snapshot for this origin.

To generate a snapshot id, we calculate the number of commits since EPOCH and hash it together with the origin type and URL.
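A minimal sketch of this idea follows. The hash function (SHA-1), the way the three fields are serialized, and the function name `sketch_snapshot_id` are assumptions for illustration; the documentation only states that the commit count is hashed alongside the origin type and URL.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# Value copied from the documented OriginModel.EPOCH attribute.
EPOCH = datetime(2015, 9, 1, tzinfo=timezone.utc)


def sketch_snapshot_id(
    origin_type: str, url: str, now: datetime, seconds_between_commits: float
) -> bytes:
    """Derive an identifier that changes only when a new commit appears."""
    elapsed = (now - EPOCH).total_seconds()
    n_commits = int(elapsed // seconds_between_commits)
    # Assumption: fields are joined with spaces before hashing.
    payload = f"{origin_type} {url} {n_commits}".encode()
    return hashlib.sha1(payload).digest()


url = "https://example.org/repo.git"
# Two visits within the same commit interval see the same snapshot.
first = sketch_snapshot_id("git", url, EPOCH + timedelta(minutes=10), 3600.0)
second = sketch_snapshot_id("git", url, EPOCH + timedelta(minutes=50), 3600.0)
# A visit after the next commit sees a different snapshot.
third = sketch_snapshot_id("git", url, EPOCH + timedelta(minutes=70), 3600.0)
```

The useful property is that the identifier is stable between commits, so a simulated visit that finds no new commits reports the same snapshot as the previous one.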

load_task_characteristics(now: datetime.datetime) → Tuple[float, str, Optional[bytes]]

Returns the (run_time, end_status, snapshot_id) tuple of the next origin visit.

swh.scheduler.simulator.origins.lister_process(env: swh.scheduler.simulator.common.Environment, lister_id: uuid.UUID) → Generator[simpy.events.Event, simpy.events.Event, None]

Every hour, generates new origins and updates the last_update field of the origins this process generated previously.

swh.scheduler.simulator.origins.load_task_process(env: swh.scheduler.simulator.common.Environment, task: swh.scheduler.simulator.common.Task, status_queue: swh.scheduler.simulator.common.Queue) → Iterator[simpy.events.Event]

A loading task. This pushes OriginVisitStatus objects to the status_queue to simulate the visible outcomes of the task.

Its run time is determined by OriginModel.load_task_characteristics().