swh.web.utils.archive module#
- swh.web.utils.archive.lookup_multiple_hashes(hashes)[source]#
Lookup the passed hashes in a single DB connection, using batch processing.
- Parameters:
hashes – An array of {filename: X, sha1: Y}, with X a string and Y a hex sha1 string.
- Returns:
The same array with elements updated with elem[‘found’] = true if the hash is present in storage, elem[‘found’] = false if not.
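The input and output shapes described above can be sketched as follows. This is a minimal illustration only: `mark_found` and `known_sha1s` are hypothetical names, and an in-memory set stands in for the archive storage.

```python
from typing import Dict, List, Set


def mark_found(
    hashes: List[Dict[str, object]], known_sha1s: Set[str]
) -> List[Dict[str, object]]:
    """Set elem['found'] on every {filename, sha1} element (hypothetical helper)."""
    for elem in hashes:
        # known_sha1s is an assumption standing in for the real storage lookup.
        elem["found"] = elem["sha1"] in known_sha1s
    return hashes


known = {"8624bcdae55baeef00cd11d5dfcfa60f68710a02"}
batch = [
    {"filename": "README", "sha1": "8624bcdae55baeef00cd11d5dfcfa60f68710a02"},
    {"filename": "LICENSE", "sha1": "0000000000000000000000000000000000000000"},
]
result = mark_found(batch, known)
```

The same list object is returned, with each element updated in place, which mirrors the "same array with elements updated" wording of the Returns section.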
- swh.web.utils.archive.lookup_hash(q: str) Dict[str, Any] [source]#
Check if the storage contains a given content checksum and return it if found.
- Parameters:
q – query string of the form <hash_algo:hash>
- Returns:
Dict with key found containing the hash info if the hash is present, None if not.
- swh.web.utils.archive.search_hash(q: str) Dict[str, bool] [source]#
Search storage for a given content checksum.
- Parameters:
q – query string of the form <hash_algo:hash>
- Returns:
Dict with key found set to True or False, depending on whether the checksum is present.
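Several functions in this module take a query string of the form <hash_algo:hash>. The sketch below shows one way such a string could be split and validated. `split_hash_query` and the `HEX_LENGTHS` table are assumptions made for illustration; the module's own parsing may accept other forms.

```python
import string
from typing import Tuple

# Assumed algorithm names and hex digest lengths, for illustration only.
HEX_LENGTHS = {"sha1": 40, "sha1_git": 40, "sha256": 64, "blake2s256": 64}


def split_hash_query(q: str) -> Tuple[str, str]:
    """Split '<hash_algo:hash>' into (algo, lowercase hex hash)."""
    algo, sep, hex_hash = q.partition(":")
    if not sep:
        raise ValueError("expected a query of the form <hash_algo:hash>")
    if algo not in HEX_LENGTHS:
        raise ValueError(f"unknown hash algorithm: {algo}")
    if len(hex_hash) != HEX_LENGTHS[algo]:
        raise ValueError(f"unexpected hash length for {algo}")
    if not all(c in string.hexdigits for c in hex_hash):
        raise ValueError("hash is not hexadecimal")
    return algo, hex_hash.lower()


parsed = split_hash_query("sha1:8624BCDAE55BAEEF00CD11D5DFCFA60F68710A02")
```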
- swh.web.utils.archive.lookup_content_filetype(q)[source]#
Return filetype information from a specified content.
- Parameters:
q – query string of the form <hash_algo:hash>
- Yields:
filetype information (dict) list if the content is found.
- swh.web.utils.archive.lookup_content_language(q)[source]#
Always returns None.
This used to return language information from a specified content, but this is currently disabled.
- Parameters:
q – query string of the form <hash_algo:hash>
- Yields:
language information (dict) list if the content is found.
- swh.web.utils.archive.lookup_content_license(q)[source]#
Return license information from a specified content.
- Parameters:
q – query string of the form <hash_algo:hash>
- Yields:
license information (dict) list if the content is found.
- swh.web.utils.archive.lookup_origin(origin: OriginInfo, lookup_similar_urls: bool = True) OriginInfo [source]#
Return information about the origin matching dict origin.
- Parameters:
origin – origin’s dict with ‘url’ key
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Returns:
origin information as dict.
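The lookup_similar_urls behaviour (trying the URL with and without a trailing slash) can be pictured with the sketch below. `similar_url_candidates` is a hypothetical helper, and the order in which candidates are tried is an assumption.

```python
from typing import List


def similar_url_candidates(url: str) -> List[str]:
    """Return the URL as given, then its trailing-slash variant."""
    # Assumption: the URL as provided is tried first.
    alternate = url[:-1] if url.endswith("/") else url + "/"
    return [url, alternate]


with_slash = similar_url_candidates("https://example.org/repo/")
without_slash = similar_url_candidates("https://example.org/repo")
```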
- swh.web.utils.archive.lookup_origins(page_token: Optional[str], limit: int = 100) PagedResult[OriginInfo, str] [source]#
Get list of archived software origins in a paginated way.
Origins are sorted by id before returning them
- swh.web.utils.archive.lookup_origin_snapshots(origin: OriginInfo) List[str] [source]#
Return ids of the snapshots of an origin.
- Parameters:
origin – origin’s dict with ‘url’ key
- Returns:
List of unique snapshot identifiers in hexadecimal format resulting from the visits of the origin.
- swh.web.utils.archive.search_origin(url_pattern: str, use_ql: bool = False, limit: int = 50, with_visit: bool = False, visit_types: Optional[List[str]] = None, page_token: Optional[str] = None) Tuple[List[OriginInfo], Optional[str]] [source]#
Search for origins whose urls contain a provided string pattern or match a provided regular expression.
- Parameters:
url_pattern – the string pattern to search for in origin urls
use_ql – whether to use swh search query language or not
limit – the maximum number of found origins to return
with_visit – Whether origins with no visit are to be filtered out
visit_types – Only origins having any of the provided visit types (e.g. git, svn, pypi) will be returned
page_token – opaque string used to get the next results of a search
- Returns:
list of origin information as dict.
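Because the signature returns a (results, next page token) pair, a caller typically loops until the token is None. The sketch below shows that loop against a fake search function. `iter_all_results`, `fake_search` and the integer-based token are assumptions; real tokens are opaque strings.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Tuple[List[Dict[str, str]], Optional[str]]

FAKE_ORIGINS = [{"url": f"https://example.org/repo{i}"} for i in range(5)]


def fake_search(page_token: Optional[str], limit: int = 2) -> Page:
    """Stand-in for a paginated search; the token encodes an offset here."""
    start = int(page_token) if page_token is not None else 0
    next_start = start + limit
    next_token = str(next_start) if next_start < len(FAKE_ORIGINS) else None
    return FAKE_ORIGINS[start:next_start], next_token


def iter_all_results(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, str]]:
    """Yield results page by page until no next page token is returned."""
    page_token: Optional[str] = None
    while True:
        results, page_token = fetch_page(page_token)
        yield from results
        if page_token is None:
            break


all_urls = [origin["url"] for origin in iter_all_results(fake_search)]
```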
- swh.web.utils.archive.search_origin_metadata(fulltext: str, limit: int = 50, return_metadata: bool = True) Iterable[OriginMetadataInfo] [source]#
Search for origins whose metadata match a provided string pattern.
- Parameters:
fulltext – the string pattern to search for in origin metadata
limit – the maximum number of found origins to return
return_metadata – if false, will only return the origin URL
- Returns:
Iterable of origin metadata information for existing origins
- swh.web.utils.archive.lookup_origin_intrinsic_metadata(origin_url: str) Dict[str, Any] [source]#
Return intrinsic metadata for the origin whose URL matches the given origin_url.
- Parameters:
origin_url – origin url
- Raises:
NotFoundExc – when the origin is not found
- Returns:
origin metadata.
- swh.web.utils.archive.lookup_directory(sha1_git)[source]#
Return information about the directory with id sha1_git.
- Parameters:
sha1_git – sha1_git as string
- Returns:
directory information as dict.
- swh.web.utils.archive.lookup_directory_with_path(sha1_git: str, path: str) Dict[str, Any] [source]#
Return directory information for entry with specified path w.r.t. root directory pointed by sha1_git
- Parameters:
sha1_git – sha1_git corresponding to the directory to which we append paths to (hopefully) find the entry
path – the relative path to the entry starting from the root directory pointed by sha1_git
- Returns:
Directory entry information as dict.
- Raises:
NotFoundExc – if the directory entry is not found
- swh.web.utils.archive.lookup_release(release_sha1_git: str) Dict[str, Any] [source]#
Return information about the release with sha1 release_sha1_git.
- Parameters:
release_sha1_git – The release’s sha1 as hexadecimal
- Returns:
Release information as dict.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
- swh.web.utils.archive.lookup_release_multiple(sha1_git_list) Iterator[Optional[Dict[str, Any]]] [source]#
Return information about the releases identified with their sha1_git identifiers.
- Parameters:
sha1_git_list – A list of release sha1_git identifiers
- Returns:
Iterator of Release metadata information as dict.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
- swh.web.utils.archive.lookup_revision(rev_sha1_git) Dict[str, Any] [source]#
Return information about the revision with sha1 rev_sha1_git.
- Parameters:
rev_sha1_git – The revision’s sha1 as hexadecimal
- Returns:
Revision information as dict.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
NotFoundExc – if there is no revision with the provided sha1_git.
- swh.web.utils.archive.lookup_revision_multiple(sha1_git_list) Iterator[Optional[Dict[str, Any]]] [source]#
Return information about the revisions identified with their sha1_git identifiers.
- Parameters:
sha1_git_list – A list of revision sha1_git identifiers
- Yields:
revision information as dict if the revision exists, None otherwise.
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
- swh.web.utils.archive.lookup_revision_message(rev_sha1_git) Dict[str, bytes] [source]#
Return the raw message of the revision with sha1 rev_sha1_git.
- Parameters:
rev_sha1_git – The revision’s sha1 as hexadecimal
- Returns:
Decoded revision message as dict {‘message’: <the_message>}
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
NotFoundExc – if the revision is not found, or if it has no message
- swh.web.utils.archive.lookup_revision_by(origin, branch_name='HEAD', timestamp=None)[source]#
Lookup revision by origin, snapshot branch name and visit timestamp.
If branch_name is not provided, lookup using ‘HEAD’ as default. If timestamp is not provided, use the most recent.
- swh.web.utils.archive.lookup_revision_log(rev_sha1_git, limit)[source]#
Lookup revision log by revision id.
- Parameters:
- Returns:
Revision log as list of revision dicts
- Return type:
- Raises:
ValueError – if the identifier provided is not of sha1 nature.
swh.web.utils.exc.NotFoundExc – if there is no revision with the provided sha1_git.
- swh.web.utils.archive.lookup_revision_log_by(origin, branch_name, timestamp, limit)[source]#
Lookup revision log by origin, snapshot branch name and visit timestamp.
- Parameters:
- Returns:
Revision log as list of revision dicts
- Return type:
- Raises:
swh.web.utils.exc.NotFoundExc – if no revision corresponds to the criterion
- swh.web.utils.archive.lookup_revision_with_context_by(origin, branch_name, timestamp, sha1_git, limit=100)[source]#
Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root, where sha1_git_root is resolved by looking up a revision by origin, branch_name and timestamp.
In other words, sha1_git is an ancestor of sha1_git_root.
- Parameters:
origin – origin of the revision.
branch_name – revision’s branch.
timestamp – revision’s time frame.
sha1_git – one of sha1_git_root’s ancestors.
limit – limit the lookup to 100 revisions back.
- Returns:
Pair of (root_revision, revision). Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root
- Raises:
BadInputExc – in case of unknown algo_hash or bad hash.
NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root.
- swh.web.utils.archive.lookup_revision_with_context(sha1_git_root: Union[str, Dict[str, Any], Revision], sha1_git: str, limit: int = 100) Dict[str, Any] [source]#
Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root.
In other words, sha1_git is an ancestor of sha1_git_root.
- Parameters:
sha1_git_root – latest revision. The type is either a sha1 (as a hex string) or a non-converted dict.
sha1_git – one of sha1_git_root’s ancestors
limit – limit the lookup to 100 revisions back
- Returns:
Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root
- Raises:
BadInputExc – in case of unknown algo_hash or bad hash
NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root
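The "ancestor within a bounded number of revisions" idea can be sketched as a breadth-first walk over parent links. This is an illustration under stated assumptions, not the module's implementation: `is_ancestor_within` and the `parents` mapping are hypothetical, and the real traversal order may differ.

```python
from collections import deque
from typing import Dict, List


def is_ancestor_within(
    parents: Dict[str, List[str]], root: str, candidate: str, limit: int = 100
) -> bool:
    """Walk parent links from root, visiting at most `limit` revisions."""
    seen = {root}
    queue = deque([root])
    visited = 0
    while queue and visited < limit:
        rev = queue.popleft()
        visited += 1
        if rev == candidate:
            return True
        for parent in parents.get(rev, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


# Linear history d -> c -> b -> a (each revision lists its parents).
history = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}
found_full = is_ancestor_within(history, "d", "a")
found_limited = is_ancestor_within(history, "d", "a", limit=2)
```

With a small limit, a genuine ancestor can be reported as not found, so a negative result is only conclusive when the limit exceeds the depth of the history walked.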
- swh.web.utils.archive.lookup_directory_with_revision(sha1_git, dir_path=None, with_data=False)[source]#
Return information on directory pointed by revision with sha1_git. If dir_path is not provided, display top level directory. Otherwise, display the directory pointed by dir_path (if it exists).
- Parameters:
sha1_git – revision’s hash.
dir_path – optional directory pointed to by that revision.
with_data – boolean that indicates to retrieve the raw data if the path resolves to a content. Default to False.
- Returns:
Information on the directory pointed to by that revision.
- Raises:
BadInputExc – in case of unknown algo_hash or bad hash.
NotFoundExc – if either the revision is not found or the path referenced does not exist.
NotImplementedError – in case dir_path exists but does not reference a type 'dir' or 'file'.
- swh.web.utils.archive.lookup_content(q: str) Dict[str, Any] [source]#
Lookup the content designated by q.
- Parameters:
q – query string of the form <hash_algo:hash>
- Raises:
NotFoundExc – if the requested content is not found
- swh.web.utils.archive.lookup_content_raw(q: str) Dict[str, Any] [source]#
Lookup the content defined by q.
- Parameters:
q – query string of the form <hash_algo:hash>
- Returns:
dict with ‘sha1’ and ‘data’ keys. data representing its raw data decoded.
- Raises:
NotFoundExc – if the requested content is not found or if the content bytes are not available in the storage
- swh.web.utils.archive.stat_counters()[source]#
Return the stat counters for Software Heritage
- Returns:
A dict mapping textual labels to integer values.
- swh.web.utils.archive.lookup_origin_visits(origin: str, last_visit: Optional[int] = None, per_page: int = 10) Iterator[OriginVisitInfo] [source]#
Yield the visits of the given origin.
- Parameters:
origin – origin to list visits for
- Yields:
Dictionaries of origin_visit for that origin
- swh.web.utils.archive.lookup_origin_visit_latest(origin_url: str, require_snapshot: bool = False, type: Optional[str] = None, allowed_statuses: Optional[List[str]] = None, lookup_similar_urls: bool = True) Optional[OriginVisitInfo] [source]#
Return the origin’s latest visit
- Parameters:
origin_url – origin to list visits for
type – Optional visit type to filter on (e.g git, tar, dsc, svn, hg, npm, pypi, …)
allowed_statuses – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
require_snapshot – filter out origins without a snapshot
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Returns:
The origin visit info as dict if found
- swh.web.utils.archive.lookup_origin_visit(origin_url: str, visit_id: int, lookup_similar_urls: bool = True) OriginVisitInfo [source]#
Return information about the visit visit_id of the origin origin_url.
- Parameters:
origin_url – origin concerned by the visit
visit_id – the visit identifier to lookup
lookup_similar_urls – if True, lookup origin with and without trailing slash in its URL
- Raises:
NotFoundExc – if no origin visit matching the criteria is found
- Returns:
The dict origin_visit concerned
- swh.web.utils.archive.origin_visit_find_by_date(origin_url: str, visit_date: datetime, greater_or_equal: bool = True) Optional[OriginVisitInfo] [source]#
Retrieve the origin visit status whose date is more recent than the provided visit_date.
- Parameters:
origin_url – origin concerned by the visit
visit_date – provided visit date
greater_or_equal – ensure the returned visit has a date greater than or equal to the one passed as parameter
- Returns:
The dict origin_visit_status matching the criteria if any.
- swh.web.utils.archive.lookup_snapshot_sizes(snapshot_id: str, branch_name_exclude_prefix: Optional[str] = 'refs/pull/') Dict[str, int] [source]#
Count the number of branches in the snapshot with the given id.
- swh.web.utils.archive.lookup_snapshot(snapshot_id: str, branches_from: str = '', branches_count: int = 1000, target_types: Optional[List[str]] = None, branch_name_include_substring: Optional[str] = None, branch_name_exclude_prefix: Optional[str] = 'refs/pull/') Dict[str, Any] [source]#
Return information about a snapshot, aka the list of named branches found during a specific visit of an origin.
- Parameters:
snapshot_id – sha1 identifier of the snapshot
branches_from – optional parameter used to skip branches whose name is less than it before returning them
branches_count – optional parameter used to limit the number of returned branches
target_types – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)
branch_name_include_substring – if provided, only return branches whose name contains given substring
branch_name_exclude_prefix – if provided, do not return branches whose name starts with given pattern
- Raises:
NotFoundExc – if the given snapshot_id is missing
- Returns:
A dict filled with the snapshot content.
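The way the branch filtering parameters combine can be sketched over an in-memory mapping. `filter_branches` and the sample data are hypothetical; in particular, plain strings are used for branch names here, and the ordering and evaluation order of the filters are assumptions.

```python
from typing import Dict, List, Optional

Branch = Dict[str, str]


def filter_branches(
    branches: Dict[str, Branch],
    branches_from: str = "",
    branches_count: int = 1000,
    target_types: Optional[List[str]] = None,
    branch_name_include_substring: Optional[str] = None,
    branch_name_exclude_prefix: Optional[str] = "refs/pull/",
) -> Dict[str, Branch]:
    """Apply the documented filters to branches, in sorted name order."""
    selected: Dict[str, Branch] = {}
    for name in sorted(branches):
        if len(selected) >= branches_count:
            break
        if name < branches_from:
            continue  # skip names sorting before branches_from
        if branch_name_exclude_prefix and name.startswith(branch_name_exclude_prefix):
            continue
        if branch_name_include_substring and branch_name_include_substring not in name:
            continue
        if target_types is not None and branches[name]["target_type"] not in target_types:
            continue
        selected[name] = branches[name]
    return selected


sample = {
    "HEAD": {"target_type": "alias", "target": "refs/heads/main"},
    "refs/heads/dev": {"target_type": "revision", "target": "r2"},
    "refs/heads/main": {"target_type": "revision", "target": "r1"},
    "refs/pull/1/head": {"target_type": "revision", "target": "r3"},
    "refs/tags/v1.0": {"target_type": "release", "target": "t1"},
}
revisions_only = filter_branches(sample, target_types=["revision"])
one_from_main = filter_branches(sample, branches_from="refs/heads/main", branches_count=1)
```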
- swh.web.utils.archive.lookup_latest_origin_snapshot(origin: str, allowed_statuses: Optional[List[str]] = None) Optional[Dict[str, Any]] [source]#
Return information about the latest snapshot of an origin.
Warning
At most 1000 branches contained in the snapshot will be returned for performance reasons.
- Parameters:
origin – URL or integer identifier of the origin
allowed_statuses – list of visit statuses considered to find the latest snapshot for the visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
- Returns:
A dict filled with the snapshot content.
- swh.web.utils.archive.lookup_snapshot_alias(snapshot_id: str, alias_name: str) Optional[Dict[str, Any]] [source]#
Try to resolve a branch alias in a snapshot.
- Parameters:
snapshot_id – hexadecimal representation of a snapshot id
alias_name – name of the branch alias to resolve
- Returns:
Target branch information, or None if the alias does not exist or targets a dangling branch.
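Alias resolution with the None outcomes described above can be sketched as follows. `resolve_alias` and the branch dictionaries are hypothetical; a dangling branch is assumed to be represented by a None value, and only a single level of aliasing is followed.

```python
from typing import Dict, Optional

Branch = Dict[str, str]


def resolve_alias(
    branches: Dict[str, Optional[Branch]], alias_name: str
) -> Optional[Branch]:
    """Return the branch an alias points to, or None if it cannot be resolved."""
    alias = branches.get(alias_name)
    if alias is None or alias["target_type"] != "alias":
        return None  # unknown name, dangling entry, or not an alias
    # A dangling or missing target yields None as well.
    return branches.get(alias["target"])


snapshot_branches = {
    "HEAD": {"target_type": "alias", "target": "refs/heads/main"},
    "refs/heads/main": {"target_type": "revision", "target": "r1"},
    "BROKEN": {"target_type": "alias", "target": "refs/heads/gone"},
    "refs/heads/gone": None,  # dangling branch (assumed representation)
}
head_target = resolve_alias(snapshot_branches, "HEAD")
broken_target = resolve_alias(snapshot_branches, "BROKEN")
```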
- swh.web.utils.archive.lookup_revision_through(revision, limit=100)[source]#
Retrieve a revision from the criterion stored in revision dictionary.
- Parameters:
revision – Dictionary of criteria to look up the revision with. Here are the supported combinations of possible values:
- origin_url, branch_name, ts, sha1_git
- origin_url, branch_name, ts
- sha1_git_root, sha1_git
- sha1_git
- Returns:
The actual revision if it is found, None otherwise.
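Selecting a lookup based on which keys the criteria dictionary carries can be sketched as below. The key combinations are the four listed for this function (origin_url, branch_name, ts, sha1_git; origin_url, branch_name, ts; sha1_git_root, sha1_git; sha1_git). The mapping to the other lookup functions documented in this module, and the most-specific-first order, are assumptions; `pick_lookup_strategy` is a hypothetical helper that only returns a function name.

```python
from typing import Any, Dict, Optional


def pick_lookup_strategy(revision: Dict[str, Any]) -> Optional[str]:
    """Return the name of the lookup matching the criteria, most specific first."""
    keys = set(revision)
    if {"origin_url", "branch_name", "ts", "sha1_git"} <= keys:
        return "lookup_revision_with_context_by"
    if {"origin_url", "branch_name", "ts"} <= keys:
        return "lookup_revision_by"
    if {"sha1_git_root", "sha1_git"} <= keys:
        return "lookup_revision_with_context"
    if "sha1_git" in keys:
        return "lookup_revision"
    return None  # no supported combination of criteria


by_origin = pick_lookup_strategy(
    {"origin_url": "https://example.org/repo", "branch_name": "HEAD", "ts": None}
)
by_id = pick_lookup_strategy({"sha1_git": "a" * 40})
```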
- swh.web.utils.archive.lookup_directory_through_revision(revision, path=None, limit=100, with_data=False)[source]#
Retrieve the directory information from the revision.
- Parameters:
revision – dictionary of criterion representing a revision to lookup
path – directory’s path to lookup.
limit – optional query parameter to limit the revisions log (default to 100). For now, note that this limit could impede the transitivity conclusion about sha1_git not being an ancestor of.
with_data – indicate to retrieve the content’s raw data if path resolves to a content.
- Returns:
The directory pointed to by the revision criteria at path.
- swh.web.utils.archive.vault_cook(bundle_type: str, swhid: CoreSWHID, email=None)[source]#
Cook a vault bundle.
- swh.web.utils.archive.vault_fetch(bundle_type: str, swhid: CoreSWHID)[source]#
Fetch a vault bundle.
- swh.web.utils.archive.vault_progress(bundle_type: str, swhid: CoreSWHID)[source]#
Get the current progress of a vault bundle.
- swh.web.utils.archive.diff_revision(rev_id)[source]#
Get the list of file changes (insertion / deletion / modification / renaming) for a particular revision.
- swh.web.utils.archive.get_revisions_walker(rev_walker_type, rev_start, *args, **kwargs)[source]#
Utility function to instantiate a revisions walker of a given type, see swh.storage.algos.revisions_walker.
- Parameters:
rev_walker_type (str) – the type of revisions walker to return, possible values are: committer_date, dfs, dfs_post, bfs and path
rev_start (str) – hexadecimal representation of a revision identifier
args (list) – positional arguments to pass to the revisions walker constructor
kwargs (dict) – keyword arguments to pass to the revisions walker constructor
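To illustrate how two of the walker types differ, the sketch below contrasts a breadth-first and a depth-first walk over a small parent graph. It does not use the real walker classes: `walk_bfs`, `walk_dfs` and the `parents` mapping are hypothetical, and the exact visiting order of the real walkers may differ.

```python
from collections import deque
from typing import Dict, Iterator, List


def walk_bfs(parents: Dict[str, List[str]], start: str) -> Iterator[str]:
    """Visit revisions level by level, starting from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        rev = queue.popleft()
        yield rev
        for parent in parents.get(rev, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)


def walk_dfs(parents: Dict[str, List[str]], start: str) -> Iterator[str]:
    """Follow the first parent as deep as possible before backtracking."""
    seen = set()
    stack = [start]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        yield rev
        # Push parents in reverse so the first parent is explored first.
        stack.extend(reversed(parents.get(rev, [])))


# A merge revision "m" with two parent lines joining at "root".
graph = {
    "m": ["a1", "b1"],
    "a1": ["a0"],
    "b1": ["b0"],
    "a0": ["root"],
    "b0": ["root"],
    "root": [],
}
bfs_order = list(walk_bfs(graph, "m"))
dfs_order = list(walk_dfs(graph, "m"))
```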
- swh.web.utils.archive.lookup_object(object_type: ObjectType, object_id: str) Dict[str, Any] [source]#
Utility function for looking up an object in the archive by its type and id.
- Parameters:
- Returns:
A dictionary describing the object, or a list of dictionaries for the directory object type.
- Return type:
Dict[str, Any]
- Raises:
swh.web.utils.exc.NotFoundExc – if the object could not be found in the archive
BadInputExc – if the object identifier is invalid
- swh.web.utils.archive.lookup_missing_hashes(grouped_swhids: Dict[ObjectType, List[bytes]]) Set[str] [source]#
Lookup missing Software Heritage persistent identifier hashes, using batch processing.
- Parameters:
grouped_swhids – A dictionary with:
keys – object types
values – object hashes
- Returns:
A set(hexadecimal) of the hashes not found in the storage
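The grouped input and the hexadecimal set output can be sketched as follows. `find_missing` is a hypothetical helper; plain strings stand in for the ObjectType keys, and the `present` mapping stands in for the storage.

```python
from typing import Dict, List, Set


def find_missing(
    grouped: Dict[str, List[bytes]], present: Dict[str, Set[bytes]]
) -> Set[str]:
    """Return the hex form of every hash not known for its object type."""
    missing: Set[str] = set()
    for object_type, hashes in grouped.items():
        # `present` is an assumption replacing one batched storage query per type.
        known = present.get(object_type, set())
        missing.update(h.hex() for h in hashes if h not in known)
    return missing


archived = {"content": {bytes.fromhex("aa" * 20)}}
request = {
    "content": [bytes.fromhex("aa" * 20), bytes.fromhex("bb" * 20)],
    "directory": [bytes.fromhex("cc" * 20)],
}
missing_hashes = find_missing(request, archived)
```

Grouping by object type first allows one batched query per type instead of one query per hash.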