swh.storage.algos.revisions_walker module
- class swh.storage.algos.revisions_walker.State(done: 'Set[Sha1Git]' = <factory>, revs_to_visit: 'Any' = <factory>, last_rev: 'Optional[Dict]' = None, num_revs: 'int' = 0, missing_revs: 'Set[Sha1Git]' = <factory>)
Bases:
object
- class swh.storage.algos.revisions_walker.RevisionsWalker(storage: StorageInterface, rev_start: Sha1Git, max_revs: int | None = None, state: State | None = None, ignore_displayname: bool = False)
Bases:
object
Abstract base class encapsulating the logic to walk across a revisions history starting from a given one.
It defines an iterator returning the revisions according to a specific ordering implemented in derived classes.
Each iteration step performs the following operations:
1. Check whether the iteration is finished by calling is_finished(), and raise StopIteration if it is.
2. Get the next unseen revision by calling get_next_rev_id().
3. Process the parents of that revision by calling process_parent_revs(), in preparation for the next iteration steps.
4. Check whether the revision should be returned by calling should_return(), and return it if so.
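As an illustration, the iteration step described above can be sketched with a small self-contained toy walker. This is only a sketch under assumptions: the ToyWalker class, the in-memory HISTORY mapping and the first-in, first-out ordering are hypothetical stand-ins and do not reproduce the actual swh.storage implementation.

```python
from typing import Dict, List, Optional, Set

# Hypothetical in-memory history: revision id -> parent ids.
HISTORY: Dict[str, List[str]] = {
    "d": ["b", "c"],
    "c": ["a"],
    "b": ["a"],
    "a": [],
}


class ToyWalker:
    """Toy stand-in for a revisions walker (not the swh class)."""

    def __init__(self, rev_start: str, max_revs: Optional[int] = None):
        self.done: Set[str] = set()
        self.to_visit: List[str] = [rev_start]
        self.max_revs = max_revs
        self.num_revs = 0

    def is_finished(self) -> bool:
        # Stop when enough revisions were returned or nothing is left.
        if self.max_revs is not None and self.num_revs >= self.max_revs:
            return True
        return not self.to_visit

    def get_next_rev_id(self) -> str:
        # Assumed ordering for this sketch: first in, first out.
        return self.to_visit.pop(0)

    def process_rev(self, rev_id: str) -> None:
        # Queue a revision only if it is neither visited nor already queued.
        if rev_id not in self.done and rev_id not in self.to_visit:
            self.to_visit.append(rev_id)

    def process_parent_revs(self, rev: Dict) -> None:
        for parent_id in rev["parents"]:
            self.process_rev(parent_id)

    def should_return(self, rev: Dict) -> bool:
        return True

    def __iter__(self) -> "ToyWalker":
        return self

    def __next__(self) -> Dict:
        while True:
            if self.is_finished():                  # step 1
                raise StopIteration
            rev_id = self.get_next_rev_id()         # step 2
            self.done.add(rev_id)
            rev = {"id": rev_id, "parents": HISTORY[rev_id]}
            self.process_parent_revs(rev)           # step 3
            if self.should_return(rev):             # step 4
                self.num_revs += 1
                return rev


print([rev["id"] for rev in ToyWalker("d")])              # ['d', 'b', 'c', 'a']
print([rev["id"] for rev in ToyWalker("d", max_revs=2)])  # ['d', 'b']
```

In this sketch, a derived walker would only need to change process_rev() and get_next_rev_id() to obtain a different ordering.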
To easily instantiate a specific type of revisions walker, it is recommended to use the factory function get_revisions_walker().
- Parameters:
storage – instance of swh storage (either local or remote)
rev_start – a revision identifier
max_revs – maximum number of revisions to return
state – previous state of that revisions walker
ignore_displayname – return the original author/committer’s full name even if it’s masked by a displayname.
- abstract process_rev(rev_id: bytes) -> None
Abstract method whose purpose is to process a newly visited revision during the walk. Derived classes must implement it according to the desired way of walking the revisions history (for instance, a depth-first search on the revisions DAG).
- Parameters:
rev_id – the newly visited revision identifier
- abstract get_next_rev_id() -> bytes
Abstract method whose purpose is to return the next revision during the iteration. Derived classes must implement it according to the desired method to walk across the revisions history.
- process_parent_revs(rev: Dict) -> None
Process the parents of a revision when it is iterated. The default implementation simply calls process_rev() for each parent revision, in the order they are declared.
- Parameters:
rev (dict) – A dict describing a revision as returned by
swh.storage.interface.StorageInterface.revision_get()
- should_return(rev: Dict) -> bool
Decide whether an iterated revision should be returned. The default implementation returns all iterated revisions.
- Parameters:
rev (dict) – A dict describing a revision as returned by
swh.storage.interface.StorageInterface.revision_get()
- Returns:
Whether to return the revision in the iteration
- Return type:
bool
- is_finished() -> bool
Determine if the iteration is finished. This method is called at the beginning of each iteration loop.
- Returns:
Whether the iteration is finished
- Return type:
bool
- missing_revisions() -> Set[bytes]
Return the set of revision identifiers whose associated data was found to be missing from the archive while walking the revisions graph.
- Returns:
a set of revision identifiers
- Return type:
Set[bytes]
- class swh.storage.algos.revisions_walker.CommitterDateRevisionsWalker(storage: StorageInterface, rev_start: Sha1Git, max_revs: int | None = None, state: State | None = None, ignore_displayname: bool = False)
Bases:
RevisionsWalker
Revisions walker that returns revisions in reverse chronological order according to committer date (same behaviour as git log).
- rw_type = 'committer_date'
- process_rev(rev_id: bytes) -> None
Add the revision to a priority queue according to the committer date.
- Parameters:
rev_id (bytes) – the newly visited revision identifier
- get_next_rev_id() -> bytes
Return the smallest revision from the priority queue, i.e. the one with the highest committer date.
- Returns:
The identifier of the next revision to visit
- Return type:
bytes
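The priority queue behaviour described above can be sketched with the standard heapq module. This is a minimal sketch under assumptions: the COMMITTER_DATES mapping and the module-level queue are hypothetical, and negating the timestamp is one possible way to make the most recent revision the smallest heap entry, not necessarily what the library does.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical committer timestamps (seconds since epoch) per revision id.
COMMITTER_DATES: Dict[bytes, int] = {b"r1": 100, b"r2": 300, b"r3": 200}

# Heap of (negated committer timestamp, revision id) pairs.
queue: List[Tuple[int, bytes]] = []


def process_rev(rev_id: bytes) -> None:
    # heapq implements a min-heap, so the timestamp is negated to make
    # the most recent revision the smallest entry.
    heapq.heappush(queue, (-COMMITTER_DATES[rev_id], rev_id))


def get_next_rev_id() -> bytes:
    # Pop the smallest entry, i.e. the highest committer date.
    _, rev_id = heapq.heappop(queue)
    return rev_id


for rev_id in (b"r1", b"r2", b"r3"):
    process_rev(rev_id)

order = [get_next_rev_id() for _ in range(3)]
print(order)  # [b'r2', b'r3', b'r1']
```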
- class swh.storage.algos.revisions_walker.BFSRevisionsWalker(*args, **kwargs)
Bases:
RevisionsWalker
Revisions walker that returns revisions in the same order as when performing a breadth-first search on the revisions DAG.
- rw_type = 'bfs'
- process_rev(rev_id: bytes) -> None
Append the revision to a queue.
- Parameters:
rev_id (bytes) – the newly visited revision identifier
- get_next_rev_id() -> bytes
Return the next revision from the queue.
- Returns:
The identifier of the next revision to visit
- Return type:
bytes
- class swh.storage.algos.revisions_walker.DFSPostRevisionsWalker(storage: StorageInterface, rev_start: Sha1Git, max_revs: int | None = None, state: State | None = None, ignore_displayname: bool = False)
Bases:
RevisionsWalker
Revisions walker that returns revisions in the same order as when performing a depth-first search in post-order on the revisions DAG (i.e. after visiting a merge commit, the merged commit will be visited before the base it was merged on).
- rw_type = 'dfs_post'
- process_rev(rev_id: bytes) -> None
Append the revision to a stack.
- Parameters:
rev_id (bytes) – the newly visited revision identifier
- get_next_rev_id() -> bytes
Return the next revision from the stack.
- Returns:
The identifier of the next revision to visit
- Return type:
bytes
- class swh.storage.algos.revisions_walker.DFSRevisionsWalker(storage: StorageInterface, rev_start: Sha1Git, max_revs: int | None = None, state: State | None = None, ignore_displayname: bool = False)
Bases:
DFSPostRevisionsWalker
Revisions walker that returns revisions in the same order as when performing a depth-first search in pre-order on the revisions DAG (i.e. after visiting a merge commit, the base commit it was merged on will be visited before the merged commit).
- rw_type = 'dfs'
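To make the difference between the 'bfs', 'dfs_post' and 'dfs' orderings concrete, the following sketch walks a tiny merge history with a queue or a stack. It is only an illustration under assumptions: the PARENTS mapping and the walk() helper are hypothetical, the parents of the merge are assumed to be listed as [base, merged], and reversing the parents for 'dfs' is one plausible way to obtain the described ordering.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical history: "m" merges "f" into base "b"; both descend from "a".
# Parents of the merge are assumed to be declared as [base, merged].
PARENTS: Dict[str, List[str]] = {
    "m": ["b", "f"],
    "b": ["a"],
    "f": ["a"],
    "a": [],
}


def walk(start: str, order: str) -> List[str]:
    """Return revision ids in 'bfs', 'dfs_post' or 'dfs' order."""
    pending = deque([start])
    seen: Set[str] = set()
    result: List[str] = []
    while pending:
        # bfs takes the oldest pending revision (queue behaviour),
        # both dfs variants take the newest one (stack behaviour).
        rev = pending.popleft() if order == "bfs" else pending.pop()
        if rev in seen:
            continue
        seen.add(rev)
        result.append(rev)
        parents = PARENTS[rev]
        if order == "dfs":
            # Push the base last so that it is popped first.
            parents = list(reversed(parents))
        pending.extend(parents)
    return result


print(walk("m", "bfs"))       # ['m', 'b', 'f', 'a']
print(walk("m", "dfs_post"))  # ['m', 'f', 'a', 'b']
print(walk("m", "dfs"))       # ['m', 'b', 'a', 'f']
```

With 'dfs_post' the merged commit "f" comes right after the merge, while with 'dfs' the base "b" comes first, matching the descriptions above.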
- class swh.storage.algos.revisions_walker.PathRevisionsWalker(storage, rev_start, path, **kwargs)
Bases:
CommitterDateRevisionsWalker
Revisions walker that returns only the revisions where a specific path in the source tree has been modified. In other words, it retrieves the history of a specific file or directory.
Its behaviour is similar to what git log offers by default: the returned history is simplified in order to show only relevant revisions (see the History Simplification section of the associated manual for more details).
Please note that, to avoid walking the entire history, the iteration stops once a revision where the path has been added is found.
Warning
Due to the client-side implementation, performance is not optimal when the total number of revisions to walk is large. This walker should only be used when the total number of revisions does not exceed a couple of thousand.
- Parameters:
storage (swh.storage.interface.StorageInterface) – instance of swh storage (either local or remote)
rev_start (bytes) – a revision identifier
path (str) – the path in the source tree to retrieve the history
max_revs (Optional[int]) – maximum number of revisions to return
state (Optional[dict]) – previous state of that revisions walker
- rw_type = 'path'
- is_finished()
Check if the revisions iteration is finished. This checks whether the specified path exists in the source trees of the parents of the last returned revision. If it does not, the iteration is considered finished.
- Returns:
Whether the iteration is finished
- Return type:
bool
- process_parent_revs(rev)
Process the parents when a new revision is iterated. This yields a simplified revisions history in the same manner as git log. When a revision has multiple parents, the following process is applied. If the revision is a merge and has the same path identifier as one parent, follow only that parent (even if several parents have the same path identifier, follow only one of them). Otherwise, follow all parents.
- Parameters:
rev (dict) – A dict describing a revision as returned by
swh.storage.interface.StorageInterface.revision_get()
- should_return(rev)
Check if a revision should be returned when iterating. It verifies that the specified path has been modified by the revision, and also that all parents have a path identifier different from that of the revision, in order to get a simplified history.
- Parameters:
rev (dict) – A dict describing a revision as returned by
swh.storage.interface.StorageInterface.revision_get()
- Returns:
Whether to return the revision in the iteration
- Return type:
bool
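The parent selection and filtering rules described for process_parent_revs() and should_return() can be sketched as two small functions. This is a hedged illustration only: the function names, the plain string path identifiers and the dict of parents are assumptions made for the example, and the real walker resolves path identifiers through the storage.

```python
from typing import Dict, List


def parents_to_follow(rev_path_id: str,
                      parent_path_ids: Dict[str, str]) -> List[str]:
    """Pick the parents to walk, given each parent's path identifier.

    parent_path_ids maps a parent revision id to the identifier of the
    tracked path in that parent (hypothetical representation).
    """
    if len(parent_path_ids) > 1:
        # Merge: if one parent carries the same path identifier,
        # follow only that parent (the first match wins).
        for parent_id, path_id in parent_path_ids.items():
            if path_id == rev_path_id:
                return [parent_id]
    # Otherwise follow all parents.
    return list(parent_path_ids)


def should_return(rev_path_id: str,
                  parent_path_ids: Dict[str, str]) -> bool:
    # Keep the revision only if the path differs from every parent.
    return all(path_id != rev_path_id
               for path_id in parent_path_ids.values())


merge_parents = {"p1": "tree-A", "p2": "tree-B"}
print(parents_to_follow("tree-B", merge_parents))  # ['p2']
print(should_return("tree-B", merge_parents))      # False
print(parents_to_follow("tree-C", merge_parents))  # ['p1', 'p2']
print(should_return("tree-C", merge_parents))      # True
```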
- swh.storage.algos.revisions_walker.get_revisions_walker(rev_walker_type, *args, **kwargs)
Instantiate a revisions walker of a given type.
The following code snippet demonstrates how to use a revisions walker for processing a whole revisions history:
    from swh.storage import get_storage
    from swh.storage.algos.revisions_walker import get_revisions_walker

    storage = get_storage(...)
    revs_walker = get_revisions_walker('committer_date', storage, rev_id)
    for rev in revs_walker:
        # process revision rev
        ...
It is also possible to walk a revisions history in a paginated way as illustrated below:
    def get_revs_history_page(rw_type, storage, rev_id, page_num,
                              page_size, rw_state):
        max_revs = (page_num + 1) * page_size
        revs_walker = get_revisions_walker(rw_type, storage, rev_id,
                                           max_revs=max_revs,
                                           state=rw_state)
        revs = list(revs_walker)
        # Return the exported state so that the caller can resume from it
        return revs, revs_walker.export_state()

    rev_start = ...
    per_page = 50
    rw_state = None  # no previous state for the first page

    for page in range(0, 10):
        revs_page, rw_state = get_revs_history_page('dfs', storage, rev_start,
                                                    page, per_page, rw_state)
        # process revisions page