swh.web.utils.archive module
- swh.web.utils.archive.lookup_multiple_hashes(hashes)
Lookup the passed hashes in a single DB connection, using batch processing.
- Parameters:
hashes – An array of {filename: X, sha1: Y}, where X is a string and Y is a hex sha1 string.
- Returns:
The same array with elements updated with elem[‘found’] = true if the hash is present in storage, elem[‘found’] = false if not.
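The input and output shapes described above can be illustrated with a minimal sketch. It does not call swh.web: `mark_found` and the `known_sha1s` set are hypothetical stand-ins for the storage lookup, and the sketch copies each element rather than updating it in place.

```python
from typing import Dict, Iterable, List, Set


def mark_found(hashes: Iterable[Dict[str, str]], known_sha1s: Set[str]) -> List[Dict]:
    """Hypothetical stand-in: flag each element with a 'found' boolean."""
    results = []
    for elem in hashes:
        updated = dict(elem)  # copy so the caller's dicts are left untouched
        updated["found"] = elem["sha1"] in known_sha1s
        results.append(updated)
    return results


# Assumed in-memory "storage" holding a single known checksum.
known = {"da39a3ee5e6b4b0d3255bfef95601890afd80709"}
queries = [
    {"filename": "empty.txt", "sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709"},
    {"filename": "other.txt", "sha1": "0000000000000000000000000000000000000000"},
]
marked = mark_found(queries, known)
```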
- swh.web.utils.archive.lookup_hash(q: str) → Dict[str, Any]
Check if the storage contains a given content checksum and return it if found.
- Parameters:
q – query string of the form <hash_algo:hash>
- Returns:
Dict with key found containing the hash info if the hash is present, None if not.
- swh.web.utils.archive.search_hash(q: str) → Dict[str, bool]
Search storage for a given content checksum.
- Parameters:
q – query string of the form <hash_algo:hash>
- Returns:
Dict with key found set to True or False, according to whether the checksum is present or not.
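Several functions in this module take a query string of the form <hash_algo:hash>. As an illustration only, the sketch below splits such a string into its two parts. `parse_hash_query` is a hypothetical helper, not part of swh.web, and it assumes the algorithm prefix is always given explicitly.

```python
from typing import Tuple


def parse_hash_query(q: str) -> Tuple[str, bytes]:
    """Hypothetical helper: split 'algo:hexhash' into (algo, raw bytes)."""
    algo, sep, hex_hash = q.partition(":")
    if not sep or not algo or not hex_hash:
        raise ValueError(f"expected <hash_algo:hash>, got {q!r}")
    # bytes.fromhex raises ValueError on non-hexadecimal input
    return algo, bytes.fromhex(hex_hash)


algo, digest = parse_hash_query("sha1:da39a3ee5e6b4b0d3255bfef95601890afd80709")
```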
- swh.web.utils.archive.lookup_content_filetype(q)
Return filetype information from a specified content.
- Parameters:
q – query string of the form <hash_algo:hash>
- Yields:
filetype information (dict) list if the content is found.
- swh.web.utils.archive.lookup_content_language(q)
Always returns None.
This used to return language information from a specified content, but this is currently disabled.
- Parameters:
q – query string of the form <hash_algo:hash>
- Yields:
language information (dict) list if the content is found.
- swh.web.utils.archive.lookup_content_license(q)
Return license information from a specified content.
- Parameters:
q – query string of the form <hash_algo:hash>
- Yields:
license information (dict) list if the content is found.
- swh.web.utils.archive.lookup_origin(origin_url: str, lookup_similar_urls: bool = True) → OriginInfo
Return information about the origin with the given URL.
- Parameters:
origin_url – URL of origin
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Returns:
origin information as dict.
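The lookup_similar_urls behaviour (trying the URL with and without a trailing slash) can be pictured with the sketch below. `similar_url_candidates` is a hypothetical helper written for illustration; the order in which the real function tries the variants is an assumption.

```python
from typing import List


def similar_url_candidates(origin_url: str) -> List[str]:
    """Hypothetical helper: the URL as given, then its trailing-slash variant."""
    if origin_url.endswith("/"):
        alternative = origin_url[:-1]  # drop a single trailing slash
    else:
        alternative = origin_url + "/"
    return [origin_url, alternative]


candidates = similar_url_candidates("https://example.org/project")
```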
- swh.web.utils.archive.lookup_origins(page_token: str | None, limit: int = 100) → PagedResult[OriginInfo, str]
Get list of archived software origins in a paginated way.
Origins are sorted by id before being returned.
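Token-based pagination of this kind is usually consumed with a loop that feeds each returned token back into the next call. The sketch below shows that loop against a fake backend. The `Page` class and its `results` and `next_page_token` attribute names are assumptions standing in for the paged result type, and `fake_lookup_origins` is a hypothetical stub.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Stand-in for a paged result: one batch plus the token for the next."""
    results: List[dict]
    next_page_token: Optional[str]


def iter_all(fetch_page: Callable[[Optional[str], int], Page], limit: int = 100) -> Iterator[dict]:
    """Call fetch_page repeatedly until no next page token is returned."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token, limit)
        yield from page.results
        token = page.next_page_token
        if token is None:
            break


# Fake backend with five origins, used only to exercise the loop.
_ORIGINS = [{"url": f"https://example.org/repo{i}"} for i in range(5)]


def fake_lookup_origins(page_token: Optional[str], limit: int) -> Page:
    start = int(page_token) if page_token else 0
    end = start + limit
    next_token = str(end) if end < len(_ORIGINS) else None
    return Page(_ORIGINS[start:end], next_token)


urls = [o["url"] for o in iter_all(fake_lookup_origins, limit=2)]
```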
- swh.web.utils.archive.lookup_origin_snapshots(origin: OriginInfo) → List[str]
Return ids of the snapshots of an origin.
- Parameters:
origin – origin’s dict with ‘url’ key
- Returns:
List of unique snapshot identifiers in hexadecimal format resulting from the visits of the origin.
- swh.web.utils.archive.search_origin(url_pattern: str, use_ql: bool = False, limit: int = 50, with_visit: bool = False, visit_types: List[str] | None = None, page_token: str | None = None) → Tuple[List[OriginInfo], str | None]
Search for origins whose urls contain a provided string pattern or match a provided regular expression.
- Parameters:
url_pattern – the string pattern to search for in origin urls
use_ql – whether to use swh search query language or not
limit – the maximum number of found origins to return
with_visit – Whether origins with no visit are to be filtered out
visit_types – Only origins having any of the provided visit types (e.g. git, svn, pypi) will be returned
page_token – opaque string used to get the next results of a search
- Returns:
list of origin information as dict.
- swh.web.utils.archive.search_origin_metadata(fulltext: str, limit: int = 50, return_metadata: bool = True) → Iterable[OriginMetadataInfo]
Search for origins whose metadata match a provided string pattern.
- Parameters:
fulltext – the string pattern to search for in origin metadata
limit – the maximum number of found origins to return
return_metadata – if false, will only return the origin URL
- Returns:
Iterable of origin metadata information for existing origins
- swh.web.utils.archive.lookup_origin_intrinsic_metadata(origin_url: str, lookup_similar_urls: bool = True) → list[Dict[str, Any]]
Return intrinsic metadata for the given origin (as a JSON-LD/CodeMeta dictionary).
- Parameters:
origin_url – origin url
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Raises:
swh.web.utils.exc.NotFoundExc – when the origin is not found
- Returns:
origin metadata.
- swh.web.utils.archive.lookup_origin_intrinsic_citation_metadata(origin_url: str, lookup_similar_urls: bool = True) → List[IntrinsicMetadataFile]
Get raw intrinsic metadata given a software origin, respectively original codemeta.json and citation.cff, for the latest visit snapshot main branch root directory.
- Parameters:
origin_url – origin url
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Returns:
list of intrinsic metadata files info
- Raises:
swh.web.utils.exc.NotFoundExc – when snapshot, branch or directory is missing or no metadata could be found
BadInputExc – when the metadata files could not be decoded
- swh.web.utils.archive.lookup_intrinsic_citation_metadata_by_target_swhid(target_swhid: str) → List[IntrinsicMetadataFile]
Get raw intrinsic metadata given a SWHID, respectively original codemeta.json and citation.cff, for the target object. If the target object is of type Snapshot, get metadata from the main branch (HEAD).
- Parameters:
target_swhid – SWHID which can be qualified or not; if the target object is of type Content, it must be qualified with an anchor.
- Returns:
list of intrinsic metadata files info
- Raises:
swh.web.utils.exc.NotFoundExc – when the target object is missing or no metadata could be found
BadInputExc – when the metadata files could not be decoded
- swh.web.utils.archive.lookup_origin_extrinsic_metadata(origin_url: str, lookup_similar_urls: bool = True) → list[Dict[str, Any]]
Return extrinsic metadata for the given origin (as a JSON-LD/CodeMeta dictionary).
- Parameters:
origin_url – origin url
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Raises:
swh.web.utils.exc.NotFoundExc – when the origin is not found
- Returns:
origin metadata.
- swh.web.utils.archive.directory_exists(sha1_git: str) → bool
Checks if a directory can be found in the archive.
- Parameters:
sha1_git – directory identifier
- Returns:
whether the directory exists in the archive.
- swh.web.utils.archive.lookup_directory(sha1_git)
Return information about the directory with id sha1_git.
- Parameters:
sha1_git – the directory identifier, as a string
- Returns:
directory information as dict.
- swh.web.utils.archive.lookup_directory_with_path(sha1_git: str, path: str) → Dict[str, Any]
Return directory information for entry with specified path w.r.t. root directory pointed by sha1_git
- Parameters:
sha1_git – sha1_git corresponding to the directory to which we append paths to (hopefully) find the entry
path – the relative path to the entry starting from the root directory pointed by sha1_git
- Returns:
Directory entry information as dict.
- Raises:
swh.web.utils.exc.NotFoundExc – if the directory entry is not found
- swh.web.utils.archive.lookup_release(release_sha1_git: str) → Dict[str, Any]
Return information about the release with sha1 release_sha1_git.
- Parameters:
release_sha1_git – The release’s sha1 as hexadecimal
- Returns:
Release information as dict.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
swh.web.utils.exc.NotFoundExc – if there is no release with the provided sha1_git.
- swh.web.utils.archive.lookup_release_multiple(sha1_git_list) → Iterator[Dict[str, Any] | None]
Return information about the releases identified with their sha1_git identifiers.
- Parameters:
sha1_git_list – A list of release sha1_git identifiers
- Returns:
Iterator of Release metadata information as dict.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
- swh.web.utils.archive.lookup_revision(rev_sha1_git) → Dict[str, Any]
Return information about the revision with sha1 rev_sha1_git.
- Parameters:
rev_sha1_git – The revision’s sha1 as hexadecimal
- Returns:
Revision information as dict.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
swh.web.utils.exc.NotFoundExc – if there is no revision with the provided sha1_git.
- swh.web.utils.archive.lookup_revision_multiple(sha1_git_list) → Iterator[Dict[str, Any] | None]
Return information about the revisions identified with their sha1_git identifiers.
- Parameters:
sha1_git_list – A list of revision sha1_git identifiers
- Yields:
revision information as dict if the revision exists, None otherwise.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
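Because the iterator yields None for identifiers that do not exist, callers typically pair each requested id with its result. The sketch below does this with a stub: `fake_lookup_revision_multiple` is a hypothetical stand-in for the real function, and the sketch assumes results come back in the same order as the requested identifiers.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def fake_lookup_revision_multiple(sha1_git_list: Iterable[str]) -> Iterator[Optional[Dict[str, str]]]:
    """Stand-in that knows a single revision and yields None for the rest."""
    known = {"aa" * 20: {"id": "aa" * 20, "message": "initial commit"}}
    for sha1_git in sha1_git_list:
        yield known.get(sha1_git)


def split_found_missing(ids: List[str]) -> Tuple[Dict[str, dict], List[str]]:
    """Pair each requested id with its result, relying on matching order."""
    found, missing = {}, []
    for sha1_git, revision in zip(ids, fake_lookup_revision_multiple(ids)):
        if revision is None:
            missing.append(sha1_git)
        else:
            found[sha1_git] = revision
    return found, missing


found, missing = split_found_missing(["aa" * 20, "bb" * 20])
```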
- swh.web.utils.archive.lookup_revision_message(rev_sha1_git) → Dict[str, bytes]
Return the raw message of the revision with sha1 rev_sha1_git.
- Parameters:
rev_sha1_git – The revision’s sha1 as hexadecimal
- Returns:
Decoded revision message as dict {‘message’: <the_message>}
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
swh.web.utils.exc.NotFoundExc – if the revision is not found, or if it has no message
- swh.web.utils.archive.lookup_revision_by(origin_url: str, branch_name: str = 'HEAD', timestamp: int | str | None = None)
Lookup revision by origin, snapshot branch name and visit timestamp.
If branch_name is not provided, lookup using ‘HEAD’ as default. If timestamp is not provided, use the most recent.
- Parameters:
origin_url – URL of origin to lookup revision
branch_name – snapshot branch name
timestamp – origin visit time frame
- Returns:
The revision matching the criteria
- Return type:
- Raises:
swh.web.utils.exc.NotFoundExc – if no revision corresponds to the criterion
- swh.web.utils.archive.lookup_revision_log(rev_sha1_git, limit)
Lookup revision log by revision id.
- Parameters:
- Returns:
Revision log as list of revision dicts
- Return type:
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
swh.web.utils.exc.NotFoundExc – if there is no revision with the provided sha1_git.
- swh.web.utils.archive.lookup_revision_log_by(origin, branch_name, timestamp, limit)
Lookup revision log by origin, snapshot branch name and visit timestamp.
- Parameters:
- Returns:
Revision log as list of revision dicts
- Return type:
- Raises:
swh.web.utils.exc.NotFoundExc – if no revision corresponds to the criterion
- swh.web.utils.archive.lookup_revision_with_context_by(origin, branch_name, timestamp, sha1_git, limit=100)
Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root. sha1_git_root is resolved through the lookup of a revision by origin, branch_name and timestamp.
In other words, sha1_git is an ancestor of sha1_git_root.
- Parameters:
origin – origin of the revision.
branch_name – revision’s branch.
timestamp – revision’s time frame.
sha1_git – one of sha1_git_root’s ancestors.
limit – limit the lookup to 100 revisions back.
- Returns:
Pair of (root_revision, revision). Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root
- Raises:
- BadInputExc – in case of unknown algo_hash or bad hash.
- swh.web.utils.exc.NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root.
- swh.web.utils.archive.lookup_revision_with_context(sha1_git_root: str | Dict[str, Any] | Revision, sha1_git: str, limit: int = 100) → Dict[str, Any]
Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root.
In other words, sha1_git is an ancestor of sha1_git_root.
- Parameters:
sha1_git_root – latest revision. The type is either a sha1 (as a hex string) or a non-converted dict.
sha1_git – one of sha1_git_root’s ancestors
limit – limit the lookup to 100 revisions back
- Returns:
Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root
- Raises:
BadInputExc – in case of unknown algo_hash or bad hash
swh.web.utils.exc.NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root
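The idea of searching for sha1_git among the transitive parents of sha1_git_root, bounded by limit, can be sketched on an in-memory graph. This is an illustration under stated assumptions, not the module's implementation: `find_ancestor_with_children` and the `parents` mapping are hypothetical, the traversal order is chosen for simplicity, and the sketch returns None where the real function raises NotFoundExc.

```python
from collections import deque
from typing import Dict, List, Optional


def find_ancestor_with_children(
    parents: Dict[str, List[str]], root: str, target: str, limit: int = 100
) -> Optional[Dict[str, object]]:
    """Walk back from root over at most `limit` revisions, looking for target.

    Returns {'id': target, 'children': [...]} when target is reached, where
    children are the visited revisions that list target as a parent.
    """
    children: Dict[str, List[str]] = {}
    visited: List[str] = []
    seen = {root}
    queue = deque([root])
    while queue and len(visited) < limit:
        rev = queue.popleft()
        visited.append(rev)
        for parent in parents.get(rev, []):
            children.setdefault(parent, []).append(rev)
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    if target not in visited:
        return None
    return {"id": target, "children": children.get(target, [])}


# Small history with one merge: r4 has two parents, both leading to r1.
history = {"r4": ["r3", "r2b"], "r3": ["r2"], "r2b": ["r1"], "r2": ["r1"], "r1": []}
info = find_ancestor_with_children(history, "r4", "r1")
```

With a small limit the walk stops before reaching older revisions, so an actual ancestor may be reported as not found.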
- swh.web.utils.archive.lookup_directory_with_revision(sha1_git, dir_path=None, with_data=False)
Return information on directory pointed by revision with sha1_git. If dir_path is not provided, display top level directory. Otherwise, display the directory pointed by dir_path (if it exists).
- Parameters:
sha1_git – revision’s hash.
dir_path – optional directory pointed to by that revision.
with_data – boolean that indicates whether to retrieve the raw data if the path resolves to a content. Defaults to False.
- Returns:
Information on the directory pointed to by that revision.
- Raises:
BadInputExc – in case of unknown algo_hash or bad hash.
swh.web.utils.exc.NotFoundExc – either if the revision is not found or the path referenced does not exist.
NotImplementedError – if dir_path exists but does not reference an entry of type 'dir' or 'file'.
- swh.web.utils.archive.lookup_content(q: str) → Dict[str, Any]
Lookup the content designated by q.
- Parameters:
q – query string of the form <hash_algo:hash>
- Raises:
swh.web.utils.exc.NotFoundExc – if the requested content is not found
- swh.web.utils.archive.lookup_content_raw(q: str) → Dict[str, Any]
Lookup the content defined by q.
- Parameters:
q – query string of the form <hash_algo:hash>
- Returns:
dict with ‘sha1’ and ‘data’ keys. data representing its raw data decoded.
- Raises:
swh.web.utils.exc.NotFoundExc – if the requested content is not found or if the content bytes are not available in the storage
- swh.web.utils.archive.stat_counters()
Return the stat counters for Software Heritage
- Returns:
A dict mapping textual labels to integer values.
- swh.web.utils.archive.lookup_origin_visits(origin: str, last_visit: int | None = None, per_page: int = 10) → Iterator[OriginVisitInfo]
Yields the origin’s visits.
- Parameters:
origin – origin to list visits for
- Yields:
Dictionaries of origin_visit for that origin
- swh.web.utils.archive.lookup_origin_visit_latest(origin_url: str, require_snapshot: bool = False, type: str | None = None, allowed_statuses: List[str] | None = None, lookup_similar_urls: bool = True) → OriginVisitInfo | None
Return the origin’s latest visit
- Parameters:
origin_url – origin to list visits for
type – Optional visit type to filter on (e.g. git, svn, hg, npm, pypi, …)
allowed_statuses – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
require_snapshot – filter out origins without a snapshot
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Returns:
The origin visit info as dict if found
- swh.web.utils.archive.lookup_origin_visit(origin_url: str, visit_id: int, lookup_similar_urls: bool = True) → OriginVisitInfo
Return information about visit visit_id of the origin origin_url.
- Parameters:
origin_url – origin concerned by the visit
visit_id – the visit identifier to lookup
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Raises:
swh.web.utils.exc.NotFoundExc – if no origin visit matching the criteria is found
- Returns:
The dict origin_visit concerned
- swh.web.utils.archive.origin_visit_find_by_date(origin_url: str, visit_date: datetime, greater_or_equal: bool = True, type: str | None = None) → OriginVisitInfo | None
Retrieve origin visit status whose date is more recent than the provided visit_date.
- Parameters:
origin_url – origin concerned by the visit
visit_date – provided visit date
greater_or_equal – ensure returned visit has a date greater than or equal to the one passed as parameter
type – Optional visit type to filter on (e.g. git, svn, hg, npm, pypi, …)
- Returns:
The dict origin_visit_status matching the criteria if any.
- swh.web.utils.archive.lookup_snapshot_sizes(snapshot_id: str, branch_name_exclude_prefix: str | None = 'refs/pull/') → Dict[str, int]
Count the number of branches in the snapshot with the given id.
- swh.web.utils.archive.lookup_snapshot(snapshot_id: str, branches_from: str = '', branches_count: int = 1000, target_types: List[str] | None = None, branch_name_include_substring: str | None = None, branch_name_exclude_prefix: str | None = 'refs/pull/') → Dict[str, Any]
Return information about a snapshot, aka the list of named branches found during a specific visit of an origin.
- Parameters:
snapshot_id – sha1 identifier of the snapshot
branches_from – optional parameter used to skip branches whose name sorts before it
branches_count – optional parameter used to restrain the amount of returned branches
target_types – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)
branch_name_include_substring – if provided, only return branches whose name contains given substring
branch_name_exclude_prefix – if provided, do not return branches whose name starts with given pattern
- Raises:
swh.web.utils.exc.NotFoundExc – if the given snapshot_id is missing
- Returns:
A dict filled with the snapshot content.
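How the branch-related parameters combine can be sketched on a plain dictionary. `filter_branches` is a hypothetical helper that applies the documented filters in memory; the branch layout (a `target_type` key per branch) and string-based name ordering are assumptions made for the example.

```python
from typing import Dict, List, Optional


def filter_branches(
    branches: Dict[str, Dict[str, str]],
    branches_from: str = "",
    branches_count: int = 1000,
    target_types: Optional[List[str]] = None,
    branch_name_include_substring: Optional[str] = None,
    branch_name_exclude_prefix: Optional[str] = "refs/pull/",
) -> Dict[str, Dict[str, str]]:
    """Apply the documented filters to an in-memory {name: target} mapping."""
    selected: Dict[str, Dict[str, str]] = {}
    for name in sorted(branches):
        if name < branches_from:
            continue  # skip names that sort before branches_from
        if target_types is not None and branches[name]["target_type"] not in target_types:
            continue
        if branch_name_include_substring and branch_name_include_substring not in name:
            continue
        if branch_name_exclude_prefix and name.startswith(branch_name_exclude_prefix):
            continue
        if len(selected) >= branches_count:
            break  # enough branches collected
        selected[name] = branches[name]
    return selected


example_branches = {
    "HEAD": {"target_type": "alias", "target": "refs/heads/main"},
    "refs/heads/dev": {"target_type": "revision", "target": "bb" * 20},
    "refs/heads/main": {"target_type": "revision", "target": "aa" * 20},
    "refs/pull/1/head": {"target_type": "revision", "target": "cc" * 20},
    "refs/tags/v1.0": {"target_type": "release", "target": "dd" * 20},
}
default_view = filter_branches(example_branches)
```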
- swh.web.utils.archive.lookup_latest_origin_snapshot(origin: str, allowed_statuses: List[str] | None = None) → Dict[str, Any] | None
Return information about the latest snapshot of an origin.
Warning
At most 1000 branches contained in the snapshot will be returned for performance reasons.
- Parameters:
origin – URL or integer identifier of the origin
allowed_statuses – list of visit statuses considered to find the latest snapshot for the visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
- Returns:
A dict filled with the snapshot content.
- swh.web.utils.archive.lookup_snapshot_alias(snapshot_id: str, alias_name: str) → Dict[str, Any] | None
Try to resolve a branch alias in a snapshot.
- Parameters:
snapshot_id – hexadecimal representation of a snapshot id
alias_name – name of the branch alias to resolve
- Returns:
Target branch information or None if the alias does not exist or targets a dangling branch.
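The two None cases (unknown alias, dangling target) can be illustrated on an in-memory snapshot. `resolve_alias` is a hypothetical helper, the branch layout is assumed, and the sketch resolves a single level only; whether the real function follows chains of aliases is not stated here.

```python
from typing import Any, Dict, Optional


def resolve_alias(
    branches: Dict[str, Optional[Dict[str, Any]]], alias_name: str
) -> Optional[Dict[str, Any]]:
    """Hypothetical one-step alias resolution over an in-memory snapshot."""
    alias = branches.get(alias_name)
    if alias is None or alias["target_type"] != "alias":
        return None  # unknown name, dangling entry, or not an alias
    # An alias targets another branch by name; a missing or None entry is dangling.
    return branches.get(alias["target"])


snapshot_branches = {
    "HEAD": {"target_type": "alias", "target": "refs/heads/main"},
    "refs/heads/main": {"target_type": "revision", "target": "aa" * 20},
    "refs/heads/gone": None,  # dangling branch
    "BROKEN": {"target_type": "alias", "target": "refs/heads/gone"},
}
head_target = resolve_alias(snapshot_branches, "HEAD")
```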
- swh.web.utils.archive.lookup_revision_through(revision, limit=100)
Retrieve a revision from the criterion stored in revision dictionary.
- Parameters:
revision – Dictionary of criteria to lookup the revision with. The supported combinations of keys are:
origin_url, branch_name, ts, sha1_git
origin_url, branch_name, ts
sha1_git_root, sha1_git
sha1_git
- Returns:
None if the revision is not found or the actual revision.
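Assuming the supported key combinations are (origin_url, branch_name, ts, sha1_git), (origin_url, branch_name, ts), (sha1_git_root, sha1_git) and (sha1_git), the dispatch can be sketched as a most-specific-first match. `select_lookup` is hypothetical, and the function name associated with each combination is a guess for illustration, not a statement about the actual implementation.

```python
from typing import Any, Dict

# Supported key combinations, most specific first. The lookup function named
# for each combination is an assumption made for illustration.
_COMBINATIONS = [
    (("origin_url", "branch_name", "ts", "sha1_git"), "lookup_revision_with_context_by"),
    (("origin_url", "branch_name", "ts"), "lookup_revision_by"),
    (("sha1_git_root", "sha1_git"), "lookup_revision_with_context"),
    (("sha1_git",), "lookup_revision"),
]


def select_lookup(revision: Dict[str, Any]) -> str:
    """Return the name of the lookup matching the keys present in revision."""
    for keys, lookup_name in _COMBINATIONS:
        if all(key in revision for key in keys):
            return lookup_name
    raise ValueError("unsupported combination of revision criteria")


chosen = select_lookup({"sha1_git_root": "aa" * 20, "sha1_git": "bb" * 20})
```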
- swh.web.utils.archive.lookup_directory_through_revision(revision, path=None, limit=100, with_data=False)
Retrieve the directory information from the revision.
- Parameters:
revision – dictionary of criterion representing a revision to lookup
path – directory’s path to lookup.
limit – optional query parameter to limit the revisions log (default to 100). For now, note that this limit could impede the transitivity conclusion about sha1_git not being an ancestor of.
with_data – indicate to retrieve the content’s raw data if path resolves to a content.
- Returns:
The directory pointed to by the revision criteria at path.
- swh.web.utils.archive.vault_cook(bundle_type: str, swhid: CoreSWHID, email=None)
Cook a vault bundle.
- swh.web.utils.archive.vault_download(bundle_type: str, swhid: CoreSWHID)
Fetch a vault bundle.
- swh.web.utils.archive.vault_download_url(bundle_type: str, swhid: CoreSWHID, filename: str) → str | None
Get optional direct download URL for a cooked vault bundle.
- swh.web.utils.archive.vault_progress(bundle_type: str, swhid: CoreSWHID)
Get the current progress of a vault bundle.
- swh.web.utils.archive.diff_revision(rev_id)
Get the list of file changes (insertion / deletion / modification / renaming) for a particular revision.
- swh.web.utils.archive.get_revisions_walker(rev_walker_type, rev_start, *args, **kwargs)
Utility function to instantiate a revisions walker of a given type, see swh.storage.algos.revisions_walker.
- Parameters:
rev_walker_type (str) – the type of revisions walker to return, possible values are: committer_date, dfs, dfs_post, bfs and path
rev_start (str) – hexadecimal representation of a revision identifier
args (list) – positional arguments to pass to the revisions walker constructor
kwargs (dict) – keyword arguments to pass to the revisions walker constructor
- swh.web.utils.archive.lookup_object(object_type: ObjectType, object_id: str) → Dict[str, Any]
Utility function for looking up an object in the archive by its type and id.
- Parameters:
- Returns:
A dictionary describing the object or a list of dictionary for the directory object type.
- Return type:
Dict[str, Any]
- Raises:
swh.web.utils.exc.NotFoundExc – if the object could not be found in the archive
BadInputExc – if the object identifier is invalid
- swh.web.utils.archive.lookup_missing_hashes(grouped_swhids: Dict[ObjectType, List[bytes]]) → Set[str]
Lookup missing SoftWare Hash IDentifiers using batch processing.
- Parameters:
grouped_swhids – A dictionary with:
keys – object types
values – object hashes
- Returns:
A set(hexadecimal) of the hashes not found in the storage
- swh.web.utils.archive.lookup_origins_by_sha1s(sha1s: List[str]) → Iterator[OriginInfo | None]
Lookup origins from the sha1 hash values of their URLs.
- Parameters:
sha1s – list of sha1s hexadecimal representation
- Yields:
origin information as dict
- swh.web.utils.archive.lookup_extid(extid_type: str, extid_format: str, extid: str, extid_version: int | None = None) → Dict[str, Any]
Lookup an ExtID by its type and value.
- Parameters:
extid_type – the type of the ExtID
extid_format – the format used to encode the extid in an ASCII string, either base64url, hex or raw.
extid – the value of the ExtID
extid_version – the version of the ExtID
- Returns:
ExtID information as a dict
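The three extid_format values describe how the ExtID bytes are written as an ASCII string. The sketch below shows one plausible way to turn such a string back into bytes using only the standard library. `decode_extid` is a hypothetical helper, and both the handling of stripped base64 padding and the ASCII encoding for raw are assumptions.

```python
import base64


def decode_extid(extid: str, extid_format: str) -> bytes:
    """Hypothetical helper: turn an ASCII-encoded ExtID into raw bytes."""
    if extid_format == "hex":
        return bytes.fromhex(extid)
    if extid_format == "base64url":
        # Restore any '=' padding that was stripped from the URL-safe form.
        padded = extid + "=" * (-len(extid) % 4)
        return base64.urlsafe_b64decode(padded)
    if extid_format == "raw":
        return extid.encode("ascii")
    raise ValueError(f"unknown extid_format: {extid_format!r}")


from_hex = decode_extid("68656c6c6f", "hex")
from_b64 = decode_extid("aGVsbG8", "base64url")
from_raw = decode_extid("hello", "raw")
```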
- swh.web.utils.archive.lookup_extid_by_target(swhid: str, extid_type: str | None = None, extid_version: int | None = None, extid_format: str = 'hex') → List[Dict[str, Any]]
Lookup ExtIDs targeting an archived object.
- Parameters:
extid_type – the type of the ExtID
extid_format – the format to use for encoding an extid to an ASCII string, either base64url, hex or raw.
extid – the value of the ExtID
extid_version – the version of the ExtID
- Returns:
ExtIDs information as a list of dict