swh.web.utils.archive module#


Lookup the passed hashes in a single DB connection, using batch processing.


An array of {filename: X, sha1: Y}, where X is a string and Y is a hexadecimal sha1 string.


The same array with elements updated with elem[‘found’] = true if the hash is present in storage, elem[‘found’] = false if not.
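The input and output shapes described above can be illustrated with a minimal sketch. It checks each element against an in-memory set that stands in for the storage; the names `mark_found` and `known_sha1s` are hypothetical and not part of swh.web.

```python
from typing import Any, Dict, List, Set


def mark_found(
    hashes: List[Dict[str, Any]], known_sha1s: Set[str]
) -> List[Dict[str, Any]]:
    """Set elem['found'] on every element of the batch.

    `known_sha1s` is a hypothetical stand-in for the archive storage.
    """
    for elem in hashes:
        elem["found"] = elem["sha1"] in known_sha1s
    return hashes


batch = [
    {"filename": "README", "sha1": "aa" * 20},
    {"filename": "setup.py", "sha1": "bb" * 20},
]
result = mark_found(batch, known_sha1s={"aa" * 20})
```

The same list is returned, so callers can keep using the filename carried by each element.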

swh.web.utils.archive.lookup_hash(q: str) Dict[str, Any][source]#

Check if the storage contains a given content checksum and return it if found.


q – query string of the form <hash_algo:hash>


Dict with key found containing the hash info if the hash is present, None if not.
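Several functions in this module take a query string of the form <hash_algo:hash>. The sketch below shows one way such a string could be split and validated. The function name `parse_hash_query` and the `KNOWN_ALGOS` set are assumptions made for illustration; the parsing actually performed by swh.web may accept other forms.

```python
from typing import Tuple

# Assumed set of algorithm names; the list accepted by swh.web may differ.
KNOWN_ALGOS = {"sha1", "sha1_git", "sha256", "blake2s256"}


def parse_hash_query(q: str) -> Tuple[str, bytes]:
    """Split '<hash_algo>:<hash>' into the algorithm name and the raw hash bytes."""
    algo, sep, hex_hash = q.partition(":")
    if not sep:
        raise ValueError(f"expected '<hash_algo>:<hash>', got {q!r}")
    if algo not in KNOWN_ALGOS:
        raise ValueError(f"unknown hash algorithm {algo!r}")
    # bytes.fromhex raises ValueError on non-hexadecimal input.
    return algo, bytes.fromhex(hex_hash)


algo, digest = parse_hash_query("sha1:" + "ab" * 20)
```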

swh.web.utils.archive.search_hash(q: str) Dict[str, bool][source]#

Search storage for a given content checksum.


q – query string of the form <hash_algo:hash>


Dict with key found set to True or False, according to whether the checksum is present or not.


Return filetype information from a specified content.


q – query string of the form <hash_algo:hash>


filetype information (dict) list if the content is found.


Always returns None.

This used to return language information from a specified content, but this is currently disabled.


q – query string of the form <hash_algo:hash>


language information (dict) list if the content is found.


Return license information from a specified content.


q – query string of the form <hash_algo:hash>


license information (dict) list if the content is found.

swh.web.utils.archive.lookup_origin(origin_url: str, lookup_similar_urls: bool = True) OriginInfo[source]#

Return information about the origin with the given URL.

  • origin_url – URL of origin

  • lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL


origin information as dict.
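The effect of lookup_similar_urls can be pictured with a small sketch that tries a URL as given and then its variant with or without a trailing slash. The helpers `similar_urls` and `find_origin`, and the `known` dictionary standing in for the archive, are hypothetical.

```python
from typing import Any, Dict, List, Optional


def similar_urls(origin_url: str) -> List[str]:
    """Return the URL as given, then its trailing-slash variant."""
    if origin_url.endswith("/"):
        return [origin_url, origin_url[:-1]]
    return [origin_url, origin_url + "/"]


def find_origin(
    origin_url: str,
    known: Dict[str, Dict[str, Any]],
    lookup_similar_urls: bool = True,
) -> Optional[Dict[str, Any]]:
    """Look up an origin in `known`, optionally trying the similar URL too."""
    candidates = similar_urls(origin_url) if lookup_similar_urls else [origin_url]
    for url in candidates:
        if url in known:
            return known[url]
    return None


known = {"https://example.org/repo/": {"url": "https://example.org/repo/"}}
hit = find_origin("https://example.org/repo", known)
miss = find_origin("https://example.org/repo", known, lookup_similar_urls=False)
```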

swh.web.utils.archive.lookup_origins(page_token: str | None, limit: int = 100) PagedResult[OriginInfo, str][source]#

Get list of archived software origins in a paginated way.

Origins are sorted by id before being returned.

  • page_token – opaque token used to get the next page of results

  • limit – the maximum number of origins to return


Page of OriginInfo
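Token-based pagination of this kind is usually consumed with a loop that follows the next-page token until none is returned. The sketch below is self-contained: `fetch_page` and `ALL_ORIGINS` are hypothetical stand-ins, and the token is encoded as an integer offset purely for the example, whereas real tokens should be treated as opaque.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical in-memory data standing in for the archive.
ALL_ORIGINS = [f"https://example.org/repo{i}" for i in range(7)]


def fetch_page(
    page_token: Optional[str], limit: int = 100
) -> Tuple[List[str], Optional[str]]:
    """Return one page of results and the token of the next page (None when done)."""
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    next_token = str(end) if end < len(ALL_ORIGINS) else None
    return ALL_ORIGINS[start:end], next_token


def iter_all(limit: int = 100) -> Iterator[str]:
    """Follow next-page tokens until the listing is exhausted."""
    token: Optional[str] = None
    while True:
        results, token = fetch_page(token, limit)
        yield from results
        if token is None:
            break


origins = list(iter_all(limit=3))
```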

swh.web.utils.archive.lookup_origin_snapshots(origin: OriginInfo) List[str][source]#

Return ids of the snapshots of an origin.


origin – origin’s dict with ‘url’ key


List of unique snapshot identifiers in hexadecimal format resulting from the visits of the origin.

swh.web.utils.archive.search_origin(url_pattern: str, use_ql: bool = False, limit: int = 50, with_visit: bool = False, visit_types: List[str] | None = None, page_token: str | None = None) Tuple[List[OriginInfo], str | None][source]#

Search for origins whose urls contain a provided string pattern or match a provided regular expression.

  • url_pattern – the string pattern to search for in origin urls

  • use_ql – whether to use swh search query language or not

  • limit – the maximum number of found origins to return

  • with_visit – Whether origins with no visit are to be filtered out

  • visit_types – Only origins having any of the provided visit types (e.g. git, svn, pypi) will be returned

  • page_token – opaque string used to get the next results of a search


list of origin information as dict.

swh.web.utils.archive.search_origin_metadata(fulltext: str, limit: int = 50, return_metadata: bool = True) Iterable[OriginMetadataInfo][source]#

Search for origins whose metadata match a provided string pattern.

  • fulltext – the string pattern to search for in origin metadata

  • limit – the maximum number of found origins to return

  • return_metadata – if false, will only return the origin URL


Iterable of origin metadata information for existing origins

swh.web.utils.archive.lookup_origin_intrinsic_metadata(origin_url: str, lookup_similar_urls: bool = True) list[Dict[str, Any]][source]#

Return intrinsic metadata for the given origin (as a JSON-LD/CodeMeta dictionary).

  • origin_url – origin url

  • lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL


swh.web.utils.exc.NotFoundExc – when the origin is not found


origin metadata.

swh.web.utils.archive.lookup_origin_intrinsic_citation_metadata(origin_url: str, lookup_similar_urls: bool = True) List[IntrinsicMetadataFile][source]#

Get the raw intrinsic metadata files of a software origin, namely the original codemeta.json and citation.cff, from the root directory of the main branch of its latest visit snapshot.

  • origin_url – origin url

  • lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL


list of intrinsic metadata files info

swh.web.utils.archive.lookup_intrinsic_citation_metadata_by_target_swhid(target_swhid: str) List[IntrinsicMetadataFile][source]#

Get the raw intrinsic metadata files of the object targeted by a SWHID, namely the original codemeta.json and citation.cff. If the target object is of type Snapshot, the metadata is taken from its main branch (HEAD).


target_swhid – SWHID, qualified or not. If the target object is of type Content, the SWHID must be qualified with an anchor.


list of intrinsic metadata files info

swh.web.utils.archive.lookup_origin_extrinsic_metadata(origin_url: str, lookup_similar_urls: bool = True) list[Dict[str, Any]][source]#

Return extrinsic metadata for the given origin (as a JSON-LD/CodeMeta dictionary).

  • origin_url – origin url

  • lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL


swh.web.utils.exc.NotFoundExc – when the origin is not found


origin metadata.

swh.web.utils.archive.directory_exists(sha1_git: str) bool[source]#

Checks if a directory can be found in the archive.


sha1_git – directory identifier


whether the directory exists in the archive.


Return information about the directory with id sha1_git.


sha1_git – directory identifier, as a string


directory information as dict.

swh.web.utils.archive.lookup_directory_with_path(sha1_git: str, path: str) Dict[str, Any][source]#

Return directory information for entry with specified path w.r.t. root directory pointed by sha1_git

  • sha1_git – sha1_git corresponding to the directory to which we append paths to (hopefully) find the entry

  • path – the relative path to the entry starting from the root directory pointed by sha1_git


Directory entry information as dict.


swh.web.utils.exc.NotFoundExc – if the directory entry is not found
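Resolving a relative path from a root directory amounts to descending one path component at a time and failing as soon as a component is missing. A minimal sketch, assuming a nested dictionary as a stand-in for directory objects; `ROOT`, `entry_at` and the local `NotFound` exception are hypothetical.

```python
from typing import Any, Dict

# Hypothetical directory tree: nested dicts are directories, strings are files.
ROOT: Dict[str, Any] = {
    "src": {"main.py": "print('hi')", "pkg": {"__init__.py": ""}},
    "README": "docs",
}


class NotFound(Exception):
    """Local stand-in for swh.web.utils.exc.NotFoundExc."""


def entry_at(root: Dict[str, Any], path: str) -> Any:
    """Walk `path` component by component, starting from `root`."""
    node: Any = root
    for name in path.strip("/").split("/"):
        if not isinstance(node, dict) or name not in node:
            raise NotFound(f"no entry at {path!r}")
        node = node[name]
    return node


entry = entry_at(ROOT, "src/main.py")
```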

swh.web.utils.archive.lookup_release(release_sha1_git: str) Dict[str, Any][source]#

Return information about the release with sha1 release_sha1_git.


release_sha1_git – The release’s sha1 as hexadecimal


Release information as dict.

swh.web.utils.archive.lookup_release_multiple(sha1_git_list) Iterator[Dict[str, Any] | None][source]#

Return information about the releases identified with their sha1_git identifiers.


sha1_git_list – A list of release sha1_git identifiers


Iterator of Release metadata information as dict.


ValueError if the identifier provided is not of sha1 nature.

swh.web.utils.archive.lookup_revision(rev_sha1_git) Dict[str, Any][source]#

Return information about the revision with sha1 rev_sha1_git.


rev_sha1_git – The revision’s sha1 as hexadecimal


Revision information as dict.

swh.web.utils.archive.lookup_revision_multiple(sha1_git_list) Iterator[Dict[str, Any] | None][source]#

Return information about the revisions identified with their sha1_git identifiers.


sha1_git_list – A list of revision sha1_git identifiers


revision information as dict if the revision exists, None otherwise.


ValueError if the identifier provided is not of sha1 nature.

swh.web.utils.archive.lookup_revision_message(rev_sha1_git) Dict[str, bytes][source]#

Return the raw message of the revision with sha1 rev_sha1_git.


rev_sha1_git – The revision’s sha1 as hexadecimal


Decoded revision message as a dict with a ‘message’ key.

swh.web.utils.archive.lookup_revision_by(origin_url: str, branch_name: str = 'HEAD', timestamp: int | str | None = None)[source]#

Lookup revision by origin, snapshot branch name and visit timestamp.

If branch_name is not provided, lookup using ‘HEAD’ as default. If timestamp is not provided, use the most recent.

  • origin_url – URL of origin to lookup revision

  • branch_name – snapshot branch name

  • timestamp – origin visit time frame


The revision matching the criteria



swh.web.utils.exc.NotFoundExc – if no revision corresponds to the criterion

swh.web.utils.archive.lookup_revision_log(rev_sha1_git, limit)[source]#

Lookup revision log by revision id.

  • rev_sha1_git (str) – The revision’s sha1 as hexadecimal

  • limit (int) – the maximum number of revisions returned


Revision log as list of revision dicts


swh.web.utils.archive.lookup_revision_log_by(origin, branch_name, timestamp, limit)[source]#

Lookup revision log by origin, snapshot branch name and visit timestamp.

  • origin (Union[int,str]) – origin of the revision

  • branch_name (str) – snapshot branch

  • timestamp (str/int) – origin visit time frame

  • limit (int) – the maximum number of revisions returned


Revision log as list of revision dicts



swh.web.utils.exc.NotFoundExc – if no revision corresponds to the criterion

swh.web.utils.archive.lookup_revision_with_context_by(origin, branch_name, timestamp, sha1_git, limit=100)[source]#

Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root, where sha1_git_root is resolved by looking up a revision by origin, branch_name and timestamp.

In other words, sha1_git is an ancestor of sha1_git_root.

  • origin – origin of the revision.

  • branch_name – revision’s branch.

  • timestamp – revision’s time frame.

  • sha1_git – one of sha1_git_root’s ancestors.

  • limit – limit the lookup to 100 revisions back.


Pair of (root_revision, revision). Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root

  • BadInputExc – in case of unknown algo_hash or bad hash.

  • swh.web.utils.exc.NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root.

swh.web.utils.archive.lookup_revision_with_context(sha1_git_root: str | Dict[str, Any] | Revision, sha1_git: str, limit: int = 100) Dict[str, Any][source]#

Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root.

In other words, sha1_git is an ancestor of sha1_git_root.

  • sha1_git_root – latest revision. The type is either a sha1 (as a hex string) or a non-converted dict.

  • sha1_git – one of sha1_git_root’s ancestors

  • limit – limit the lookup to 100 revisions back


Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root
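The ancestor relation and the role of limit can be sketched with a bounded breadth-first walk over parent links. The graph `PARENTS` and the function `is_ancestor` are hypothetical; the traversal order used by swh.web is not stated here and may differ.

```python
from collections import deque
from typing import Dict, List

# Hypothetical revision graph: revision id -> list of parent ids.
PARENTS: Dict[str, List[str]] = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}


def is_ancestor(
    sha1_git_root: str,
    sha1_git: str,
    parents: Dict[str, List[str]],
    limit: int = 100,
) -> bool:
    """Walk back from the root, visiting at most `limit` revisions."""
    seen = set()
    queue = deque([sha1_git_root])
    while queue and len(seen) < limit:
        rev = queue.popleft()
        if rev in seen:
            continue
        seen.add(rev)
        if rev == sha1_git:
            return True
        queue.extend(parents.get(rev, []))
    return False


found = is_ancestor("d", "a", PARENTS)
cut_short = is_ancestor("d", "a", PARENTS, limit=2)
```

With a limit smaller than the distance between the two revisions, a true ancestor is reported as not found, which is why a negative answer obtained under a limit is not conclusive.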

swh.web.utils.archive.lookup_directory_with_revision(sha1_git, dir_path=None, with_data=False)[source]#

Return information on directory pointed by revision with sha1_git. If dir_path is not provided, display top level directory. Otherwise, display the directory pointed by dir_path (if it exists).

  • sha1_git – revision’s hash.

  • dir_path – optional directory pointed to by that revision.

  • with_data – boolean indicating whether to retrieve the raw data if the path resolves to a content. Defaults to False.


Information on the directory pointed to by that revision.

swh.web.utils.archive.lookup_content(q: str) Dict[str, Any][source]#

Lookup the content designated by q.


q – query string of the form <hash_algo:hash>


swh.web.utils.exc.NotFoundExc – if the requested content is not found

swh.web.utils.archive.lookup_content_raw(q: str) Dict[str, Any][source]#

Lookup the content defined by q.


q – query string of the form <hash_algo:hash>


dict with ‘sha1’ and ‘data’ keys. data representing its raw data decoded.


Return the stat counters for Software Heritage


A dict mapping textual labels to integer values.

swh.web.utils.archive.lookup_origin_visits(origin: str, last_visit: int | None = None, per_page: int = 10) Iterator[OriginVisitInfo][source]#

Yields the visits of the given origin.


origin – origin to list visits for


Dictionaries of origin_visit for that origin

swh.web.utils.archive.lookup_origin_visit_latest(origin_url: str, require_snapshot: bool = False, type: str | None = None, allowed_statuses: List[str] | None = None, lookup_similar_urls: bool = True) OriginVisitInfo | None[source]#

Return the origin’s latest visit

  • origin_url – origin to list visits for

  • type – Optional visit type to filter on (e.g git, svn, hg, npm, pypi, …)

  • allowed_statuses – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.

  • require_snapshot – filter out origins without a snapshot

  • lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL


The origin visit info as dict if found

swh.web.utils.archive.lookup_origin_visit(origin_url: str, visit_id: int, lookup_similar_urls: bool = True) OriginVisitInfo[source]#

Return information about visit visit_id of the origin with URL origin_url.

  • origin_url – origin concerned by the visit

  • visit_id – the visit identifier to lookup

  • lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL


swh.web.utils.exc.NotFoundExc – if no origin visit matching the criteria is found


The dict origin_visit concerned

swh.web.utils.archive.origin_visit_find_by_date(origin_url: str, visit_date: datetime, greater_or_equal: bool = True, type: str | None = None) OriginVisitInfo | None[source]#

Retrieve the origin visit status whose date is more recent than the provided visit_date.

  • origin_url – origin concerned by the visit

  • visit_date – provided visit date

  • greater_or_equal – ensure the returned visit has a date greater than or equal to the one passed as parameter

  • type – Optional visit type to filter on (e.g git, svn, hg, npm, pypi, …)


The dict origin_visit_status matching the criteria if any.

swh.web.utils.archive.lookup_snapshot_sizes(snapshot_id: str, branch_name_exclude_prefix: str | None = 'refs/pull/') Dict[str, int][source]#

Count the number of branches in the snapshot with the given id.


snapshot_id (str) – sha1 identifier of the snapshot


A dict whose keys are the target types of branches and values their corresponding amount
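The shape of the returned dict can be illustrated by counting branches per target type while skipping an excluded name prefix. The snapshot data and the function `count_by_target_type` below are hypothetical, and branch names are plain strings for simplicity.

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical snapshot: branch name -> target type.
SNAPSHOT_BRANCHES = {
    "HEAD": "alias",
    "refs/heads/main": "revision",
    "refs/heads/dev": "revision",
    "refs/pull/1/head": "revision",
    "refs/tags/v1.0": "release",
}


def count_by_target_type(
    branches: Dict[str, str],
    branch_name_exclude_prefix: Optional[str] = "refs/pull/",
) -> Dict[str, int]:
    """Count branches per target type, ignoring names with the excluded prefix."""
    counts: Counter = Counter()
    for name, target_type in branches.items():
        if branch_name_exclude_prefix and name.startswith(branch_name_exclude_prefix):
            continue
        counts[target_type] += 1
    return dict(counts)


sizes = count_by_target_type(SNAPSHOT_BRANCHES)
```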


swh.web.utils.archive.lookup_snapshot(snapshot_id: str, branches_from: str = '', branches_count: int = 1000, target_types: List[str] | None = None, branch_name_include_substring: str | None = None, branch_name_exclude_prefix: str | None = 'refs/pull/') Dict[str, Any][source]#

Return information about a snapshot, aka the list of named branches found during a specific visit of an origin.

  • snapshot_id – sha1 identifier of the snapshot

  • branches_from – optional parameter used to skip branches whose name is less than it

  • branches_count – optional parameter used to restrain the amount of returned branches

  • target_types – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)

  • branch_name_include_substring – if provided, only return branches whose name contains given substring

  • branch_name_exclude_prefix – if provided, do not return branches whose name starts with given pattern


swh.web.utils.exc.NotFoundExc – if the given snapshot_id is missing


A dict filled with the snapshot content.
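How the filtering parameters combine can be sketched on an in-memory mapping of branch names to target types. The function `select_branches` and the `BRANCHES` data are hypothetical, branch names are plain strings here, and the order in which swh.web applies the filters is an assumption.

```python
from typing import Dict, List, Optional

# Hypothetical snapshot: branch name -> target type.
BRANCHES = {
    "HEAD": "alias",
    "refs/heads/main": "revision",
    "refs/heads/dev": "revision",
    "refs/pull/1/head": "revision",
    "refs/tags/v1.0": "release",
}


def select_branches(
    branches: Dict[str, str],
    branches_from: str = "",
    branches_count: int = 1000,
    target_types: Optional[List[str]] = None,
    branch_name_include_substring: Optional[str] = None,
    branch_name_exclude_prefix: Optional[str] = "refs/pull/",
) -> Dict[str, str]:
    """Apply the name and target type filters to branches in sorted name order."""
    selected: Dict[str, str] = {}
    for name in sorted(branches):
        if name < branches_from:
            continue  # skip names sorting before the starting point
        if branch_name_exclude_prefix and name.startswith(branch_name_exclude_prefix):
            continue
        if branch_name_include_substring and branch_name_include_substring not in name:
            continue
        if target_types is not None and branches[name] not in target_types:
            continue
        if len(selected) >= branches_count:
            break  # enough branches collected
        selected[name] = branches[name]
    return selected


default_view = select_branches(BRANCHES)
releases = select_branches(BRANCHES, target_types=["release"])
```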

swh.web.utils.archive.lookup_latest_origin_snapshot(origin: str, allowed_statuses: List[str] | None = None) Dict[str, Any] | None[source]#

Return information about the latest snapshot of an origin.


At most 1000 branches contained in the snapshot will be returned for performance reasons.

  • origin – URL or integer identifier of the origin

  • allowed_statuses – list of visit statuses considered to find the latest snapshot for the visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.


A dict filled with the snapshot content.

swh.web.utils.archive.lookup_snapshot_alias(snapshot_id: str, alias_name: str) Dict[str, Any] | None[source]#

Try to resolve a branch alias in a snapshot.

  • snapshot_id – hexadecimal representation of a snapshot id

  • alias_name – name of the branch alias to resolve


Target branch information, or None if the alias does not exist or targets a dangling branch.

swh.web.utils.archive.lookup_revision_through(revision, limit=100)[source]#

Retrieve a revision from the criterion stored in revision dictionary.

  • revision – Dictionary of criteria to look up the revision with. The supported combinations of keys are:

      • origin_url, branch_name, ts, sha1_git

      • origin_url, branch_name, ts

      • sha1_git_root, sha1_git

      • sha1_git


None if the revision is not found or the actual revision.
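Selecting a lookup from the keys present in the dictionary can be sketched as a sequence of subset tests. The sketch assumes the supported combinations are (origin_url, branch_name, ts, sha1_git), (origin_url, branch_name, ts), (sha1_git_root, sha1_git) and (sha1_git). The function `pick_lookup` is hypothetical, and the mapping from key combination to function name is an assumption rather than a statement about the actual implementation.

```python
from typing import Any, Dict, Optional


def pick_lookup(revision: Dict[str, Any]) -> Optional[str]:
    """Name the lookup that a given combination of keys would select.

    The key-to-function mapping is an assumption made for illustration.
    """
    keys = set(revision)
    # The most specific combinations are tested first, so that a dictionary
    # with many keys is not captured by a shorter combination.
    if {"origin_url", "branch_name", "ts", "sha1_git"} <= keys:
        return "lookup_revision_with_context_by"
    if {"origin_url", "branch_name", "ts"} <= keys:
        return "lookup_revision_by"
    if {"sha1_git_root", "sha1_git"} <= keys:
        return "lookup_revision_with_context"
    if "sha1_git" in keys:
        return "lookup_revision"
    return None


choice = pick_lookup({"sha1_git_root": "aa" * 20, "sha1_git": "bb" * 20})
```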

swh.web.utils.archive.lookup_directory_through_revision(revision, path=None, limit=100, with_data=False)[source]#

Retrieve the directory information from the revision.

  • revision – dictionary of criterion representing a revision to lookup

  • path – directory’s path to lookup.

  • limit – optional query parameter to limit the revisions log (default to 100). For now, note that this limit could impede the transitivity conclusion about sha1_git not being an ancestor of.

  • with_data – indicate to retrieve the content’s raw data if path resolves to a content.


The directory pointed to by the revision criteria at path.

swh.web.utils.archive.vault_cook(bundle_type: str, swhid: CoreSWHID, email=None)[source]#

Cook a vault bundle.

swh.web.utils.archive.vault_download(bundle_type: str, swhid: CoreSWHID)[source]#

Fetch a vault bundle.

swh.web.utils.archive.vault_download_url(bundle_type: str, swhid: CoreSWHID, filename: str) str | None[source]#

Get optional direct download URL for a cooked vault bundle.

swh.web.utils.archive.vault_progress(bundle_type: str, swhid: CoreSWHID)[source]#

Get the current progress of a vault bundle.


Get the list of file changes (insertion / deletion / modification / renaming) for a particular revision.

swh.web.utils.archive.get_revisions_walker(rev_walker_type, rev_start, *args, **kwargs)[source]#

Utility function to instantiate a revisions walker of a given type, see swh.storage.algos.revisions_walker.

  • rev_walker_type (str) – the type of revisions walker to return, possible values are: committer_date, dfs, dfs_post, bfs and path

  • rev_start (str) – hexadecimal representation of a revision identifier

  • args (list) – positional arguments to pass to the revisions walker constructor

  • kwargs (dict) – keyword arguments to pass to the revisions walker constructor

swh.web.utils.archive.lookup_object(object_type: ObjectType, object_id: str) Dict[str, Any][source]#

Utility function for looking up an object in the archive by its type and id.

  • object_type (str) – the type of object to lookup, either content, directory, release, revision or snapshot

  • object_id (str) – the sha1_git checksum identifier in hexadecimal form of the object to lookup


A dictionary describing the object, or a list of dictionaries for the directory object type.

Return type:

Dict[str, Any]

swh.web.utils.archive.lookup_missing_hashes(grouped_swhids: Dict[ObjectType, List[bytes]]) Set[str][source]#

Lookup missing SoftWare Hash IDentifiers using batch processing.

grouped_swhids – A dictionary with object types as keys and lists of object hashes as values


A set of the hashes (in hexadecimal) not found in the storage
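The input and output shapes can be illustrated with a sketch that checks each group of hashes against an in-memory archive. The names `missing_hashes` and `ARCHIVE` are hypothetical, and object types are plain strings here instead of ObjectType values.

```python
from typing import Dict, List, Set

AA = bytes.fromhex("aa" * 20)
BB = bytes.fromhex("bb" * 20)
CC = bytes.fromhex("cc" * 20)
DD = bytes.fromhex("dd" * 20)

# Hypothetical archive contents, grouped by object type name.
ARCHIVE: Dict[str, Set[bytes]] = {"content": {AA}, "directory": {BB}}


def missing_hashes(
    grouped: Dict[str, List[bytes]], archive: Dict[str, Set[bytes]]
) -> Set[str]:
    """Return, as hexadecimal strings, the hashes absent from `archive`."""
    missing: Set[str] = set()
    for object_type, hashes in grouped.items():
        present = archive.get(object_type, set())
        missing.update(h.hex() for h in hashes if h not in present)
    return missing


missing = missing_hashes(
    {"content": [AA, CC], "directory": [BB], "revision": [DD]}, ARCHIVE
)
```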

swh.web.utils.archive.lookup_origins_by_sha1s(sha1s: List[str]) Iterator[OriginInfo | None][source]#

Lookup origins from the sha1 hash values of their URLs.


sha1s – list of sha1s in hexadecimal representation


origin information as dict

swh.web.utils.archive.encode_extid(extid_format: str, extid: bytes) str[source]#
swh.web.utils.archive.decode_extid(extid_format: str, extid: str) bytes[source]#
swh.web.utils.archive.lookup_extid(extid_type: str, extid_format: str, extid: str, extid_version: int | None = None) Dict[str, Any][source]#

Lookup an ExtID by its type and value.

  • extid_type – the type of the ExtID

  • extid_format – the format used to encode the extid in an ASCII string, either base64url, hex or raw.

  • extid – the value of the ExtID

  • extid_version – the version of the ExtID


ExtID information as a dict
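The three formats named for extid_format can be pictured as conversions between raw bytes and an ASCII string. The sketch below uses only the Python standard library. The functions `encode_id` and `decode_id` are hypothetical, and treating raw as a direct ASCII decoding is an assumption about what that format means.

```python
import base64


def encode_id(extid_format: str, extid: bytes) -> str:
    """Encode identifier bytes as an ASCII string in the given format."""
    if extid_format == "hex":
        return extid.hex()
    if extid_format == "base64url":
        return base64.urlsafe_b64encode(extid).decode("ascii")
    if extid_format == "raw":
        # Assumption: raw means the bytes are already ASCII text.
        return extid.decode("ascii")
    raise ValueError(f"unknown format {extid_format!r}")


def decode_id(extid_format: str, extid: str) -> bytes:
    """Inverse of encode_id."""
    if extid_format == "hex":
        return bytes.fromhex(extid)
    if extid_format == "base64url":
        return base64.urlsafe_b64decode(extid)
    if extid_format == "raw":
        return extid.encode("ascii")
    raise ValueError(f"unknown format {extid_format!r}")


as_hex = encode_id("hex", b"\x01\xff")
as_b64 = encode_id("base64url", b"\xfb\xff")
```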

swh.web.utils.archive.lookup_extid_by_target(swhid: str, extid_type: str | None = None, extid_version: int | None = None, extid_format: str = 'hex') List[Dict[str, Any]][source]#

Lookup ExtIDs targeting an archived object.

  • extid_type – the type of the ExtID

  • extid_format – the format to use for encoding an extid to an ASCII string, either base64url, hex or raw.

  • swhid – the SWHID of the archived object targeted by the ExtIDs

  • extid_version – the version of the ExtID


ExtIDs information as a list of dict