swh.web.common.archive module¶
-
swh.web.common.archive.
lookup_multiple_hashes
(hashes)[source]¶ Lookup the passed hashes in a single DB connection, using batch processing.
- Parameters
hashes – An array of {filename: X, sha1: Y}, where X is a string and Y is a hex sha1 string.
- Returns
The same array with elements updated with elem[‘found’] = true if the hash is present in storage, elem[‘found’] = false if not.
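The input and output shapes described above can be illustrated with a minimal sketch. It does not call the real function: `mark_found` and `known_sha1s` are hypothetical stand-ins that only mimic the documented contract.

```python
from typing import Dict, List, Set


def mark_found(hashes: List[Dict], known_sha1s: Set[str]) -> List[Dict]:
    """Hypothetical stand-in for the documented contract: each element
    gains a 'found' key telling whether its sha1 is known."""
    for elem in hashes:
        elem["found"] = elem["sha1"] in known_sha1s
    return hashes


# Input shape from the docstring: an array of {filename: X, sha1: Y}.
hashes = [
    {"filename": "README", "sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709"},
    {"filename": "setup.py", "sha1": "0000000000000000000000000000000000000000"},
]
# Pretend storage that only contains the first checksum.
known_sha1s = {"da39a3ee5e6b4b0d3255bfef95601890afd80709"}

result = mark_found(hashes, known_sha1s)
```

The same list object is returned, with each element updated in place.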
-
swh.web.common.archive.
lookup_expression
(expression, last_sha1, per_page)[source]¶ Lookup expression in raw content.
- Parameters
expression (str) – An expression to look up through raw indexed content
last_sha1 (str) – Last sha1 seen
per_page (int) – Number of results per page
- Yields
ctags whose content matches the expression
-
swh.web.common.archive.
lookup_hash
(q: str) → Dict[str, Any][source]¶ Check if the storage contains a given content checksum and return it if found.
- Parameters
q – query string of the form <hash_algo:hash>
- Returns
Dict with key found containing the hash info if the hash is present, None if not.
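Several functions in this module take a query string of the form <hash_algo:hash>. The sketch below shows one way such a string could be built from raw bytes. `content_query` is a hypothetical helper, not part of swh.web, and the algorithm names it accepts (sha1, sha1_git, sha256) are assumptions.

```python
import hashlib


def content_query(data: bytes, algo: str = "sha1") -> str:
    """Build a '<hash_algo>:<hash>' query string for raw content bytes.

    Hypothetical helper. 'sha1_git' follows git's blob hashing: sha1 over
    the header 'blob <length>' plus a NUL byte, followed by the data.
    """
    if algo == "sha1_git":
        header = b"blob %d\x00" % len(data)
        digest = hashlib.sha1(header + data).hexdigest()
    elif algo in ("sha1", "sha256"):
        digest = hashlib.new(algo, data).hexdigest()
    else:
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return f"{algo}:{digest}"


q = content_query(b"hello world\n", "sha1_git")
```

The resulting string could then be passed as the q argument of the lookup functions.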
-
swh.web.common.archive.
search_hash
(q: str) → Dict[str, bool][source]¶ Search storage for a given content checksum.
- Parameters
q – query string of the form <hash_algo:hash>
- Returns
Dict with key found set to True or False, according to whether the checksum is present or not.
-
swh.web.common.archive.
lookup_content_ctags
(q)[source]¶ Return ctags information from a specified content.
- Parameters
q – query string of the form <hash_algo:hash>
- Yields
ctags information (dict) list if the content is found.
-
swh.web.common.archive.
lookup_content_filetype
(q)[source]¶ Return filetype information from a specified content.
- Parameters
q – query string of the form <hash_algo:hash>
- Yields
filetype information (dict) list if the content is found.
-
swh.web.common.archive.
lookup_content_language
(q)[source]¶ Always returns None.
This used to return language information from a specified content, but this is currently disabled.
- Parameters
q – query string of the form <hash_algo:hash>
- Yields
language information (dict) list if the content is found.
-
swh.web.common.archive.
lookup_content_license
(q)[source]¶ Return license information from a specified content.
- Parameters
q – query string of the form <hash_algo:hash>
- Yields
license information (dict) list if the content is found.
-
swh.web.common.archive.
lookup_origin
(origin: swh.web.common.typing.OriginInfo) → swh.web.common.typing.OriginInfo[source]¶ Return information about the origin matching dict origin.
- Parameters
origin – origin’s dict with ‘url’ key
- Returns
origin information as dict.
-
swh.web.common.archive.
lookup_origins
(page_token: Optional[str], limit: int = 100) → swh.core.api.classes.PagedResult[swh.web.common.typing.OriginInfo, str][source]¶ Get list of archived software origins in a paginated way.
Origins are sorted by id before returning them
- Parameters
page_token – opaque token used to get the next page of results
limit – the maximum number of origins to return per page
- Returns
Page of OriginInfo
-
swh.web.common.archive.
search_origin
(url_pattern: str, limit: int = 50, with_visit: bool = False, page_token: Optional[str] = None) → Tuple[List[swh.web.common.typing.OriginInfo], Optional[str]][source]¶ Search for origins whose urls contain a provided string pattern or match a provided regular expression.
- Parameters
url_pattern – the string pattern to search for in origin urls
limit – the maximum number of found origins to return
page_token – opaque string used to get the next results of a search
- Returns
Tuple of the list of origin information (as dicts) and an optional token to get the next page of results.
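The page_token parameter suggests a pagination loop along the following lines. This is a sketch only: `iter_all_results` is a hypothetical helper, and the fake two-page backend stands in for a call such as `lambda token: search_origin("example", limit=2, page_token=token)`.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]


def iter_all_results(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every result by following page tokens until none is returned."""
    page_token: Optional[str] = None
    while True:
        results, page_token = fetch_page(page_token)
        yield from results
        if page_token is None:
            break


# Fake backend (assumption): maps a page token to (results, next token).
_PAGES = {
    None: ([{"url": "https://example.org/a"}, {"url": "https://example.org/b"}], "t1"),
    "t1": ([{"url": "https://example.org/c"}], None),
}

urls = [origin["url"] for origin in iter_all_results(lambda token: _PAGES[token])]
```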
-
swh.web.common.archive.
search_origin_metadata
(fulltext: str, limit: int = 50) → Iterable[swh.web.common.typing.OriginMetadataInfo][source]¶ Search for origins whose metadata match a provided string pattern.
- Parameters
fulltext – the string pattern to search for in origin metadata
limit – the maximum number of found origins to return
- Returns
Iterable of origin metadata information for existing origins
-
swh.web.common.archive.
lookup_origin_intrinsic_metadata
(origin_url: str) → Dict[str, Any][source]¶ Return intrinsic metadata for the origin whose URL matches the given origin_url.
- Parameters
origin_url – origin url
- Raises
NotFoundExc – when the origin is not found
- Returns
origin metadata.
-
swh.web.common.archive.
lookup_directory
(sha1_git)[source]¶ Return information about the directory with id sha1_git.
- Parameters
sha1_git – the directory’s sha1_git identifier, as a string
- Returns
directory information as dict.
-
swh.web.common.archive.
lookup_directory_with_path
(sha1_git: str, path: str) → Dict[str, Any][source]¶ Return directory information for entry with specified path w.r.t. root directory pointed by sha1_git
- Parameters
sha1_git – sha1_git corresponding to the directory to which we append paths to (hopefully) find the entry
path – the relative path to the entry starting from the root directory pointed by sha1_git
- Returns
Directory entry information as dict.
- Raises
NotFoundExc – if the directory entry is not found
-
swh.web.common.archive.
lookup_release
(release_sha1_git: str) → Dict[str, Any][source]¶ Return information about the release with sha1 release_sha1_git.
- Parameters
release_sha1_git – The release’s sha1 as hexadecimal
- Returns
Release information as dict.
- Raises
ValueError – if the identifier provided is not of sha1 nature.
-
swh.web.common.archive.
lookup_release_multiple
(sha1_git_list) → Iterator[Optional[Dict[str, Any]]][source]¶ Return information about the releases identified with their sha1_git identifiers.
- Parameters
sha1_git_list – A list of release sha1_git identifiers
- Returns
Iterator of Release metadata information as dict.
- Raises
ValueError – if the identifier provided is not of sha1 nature.
-
swh.web.common.archive.
lookup_revision
(rev_sha1_git) → Dict[str, Any][source]¶ Return information about the revision with sha1 rev_sha1_git.
- Parameters
rev_sha1_git – The revision’s sha1 as hexadecimal
- Returns
Revision information as dict.
- Raises
ValueError – if the identifier provided is not of sha1 nature.
NotFoundExc – if there is no revision with the provided sha1_git.
-
swh.web.common.archive.
lookup_revision_multiple
(sha1_git_list) → Iterator[Optional[Dict[str, Any]]][source]¶ Return information about the revisions identified with their sha1_git identifiers.
- Parameters
sha1_git_list – A list of revision sha1_git identifiers
- Yields
revision information as dict if the revision exists, None otherwise.
- Raises
ValueError – if the identifier provided is not of sha1 nature.
-
swh.web.common.archive.
lookup_revision_message
(rev_sha1_git) → Dict[str, bytes][source]¶ Return the raw message of the revision with sha1 rev_sha1_git.
- Parameters
rev_sha1_git – The revision’s sha1 as hexadecimal
- Returns
Decoded revision message as dict {‘message’: <the_message>}
- Raises
ValueError – if the identifier provided is not of sha1 nature.
NotFoundExc – if the revision is not found, or if it has no message
-
swh.web.common.archive.
lookup_revision_by
(origin, branch_name='HEAD', timestamp=None)[source]¶ Lookup revision by origin, snapshot branch name and visit timestamp.
If branch_name is not provided, lookup using ‘HEAD’ as default. If timestamp is not provided, use the most recent.
- Parameters
origin (Union[int,str]) – origin of the revision
branch_name (str) – snapshot branch name
timestamp (str/int) – origin visit time frame
- Returns
The revision matching the criteria
- Return type
dict
- Raises
NotFoundExc – if no revision corresponds to the criteria
-
swh.web.common.archive.
lookup_revision_log
(rev_sha1_git, limit)[source]¶ Lookup revision log by revision id.
- Parameters
rev_sha1_git (str) – The revision’s sha1 as hexadecimal
limit (int) – the maximum number of revisions returned
- Returns
Revision log as list of revision dicts
- Return type
list
- Raises
ValueError – if the identifier provided is not of sha1 nature.
swh.web.common.exc.NotFoundExc – if there is no revision with the provided sha1_git.
-
swh.web.common.archive.
lookup_revision_log_by
(origin, branch_name, timestamp, limit)[source]¶ Lookup revision log by origin, snapshot branch name and visit timestamp.
- Parameters
origin (Union[int,str]) – origin of the revision
branch_name (str) – snapshot branch
timestamp (str/int) – origin visit time frame
limit (int) – the maximum number of revisions returned
- Returns
Revision log as list of revision dicts
- Return type
list
- Raises
swh.web.common.exc.NotFoundExc – if no revision corresponds to the criterion
-
swh.web.common.archive.
lookup_revision_with_context_by
(origin, branch_name, timestamp, sha1_git, limit=100)[source]¶ Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root, where sha1_git_root is resolved through the lookup of a revision by origin, branch_name and timestamp.
In other words, sha1_git is an ancestor of sha1_git_root.
- Parameters
origin – origin of the revision.
branch_name – revision’s branch.
timestamp – revision’s time frame.
sha1_git – one of sha1_git_root’s ancestors.
limit – limit the lookup to 100 revisions back.
- Returns
Pair of (root_revision, revision). Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root
- Raises
BadInputExc – in case of unknown algo_hash or bad hash.
NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root.
-
swh.web.common.archive.
lookup_revision_with_context
(sha1_git_root: Union[str, Dict[str, Any], swh.model.model.Revision], sha1_git: str, limit: int = 100) → Dict[str, Any][source]¶ Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root.
In other words, sha1_git is an ancestor of sha1_git_root.
- Parameters
sha1_git_root – latest revision. The type is either a sha1 (as a hex string) or a non-converted dict.
sha1_git – one of sha1_git_root’s ancestors
limit – limit the lookup to 100 revisions back
- Returns
Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root
- Raises
BadInputExc – in case of unknown algo_hash or bad hash
NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root
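The ancestor relationship and the role of limit can be illustrated on a toy revision graph. This is a sketch under stated assumptions, not the module's implementation: `children_towards_root` is a hypothetical helper, the graph is a plain dict mapping each revision to its parents, and LookupError stands in for NotFoundExc.

```python
from collections import deque
from typing import Dict, List


def children_towards_root(parents: Dict[str, List[str]], root: str,
                          target: str, limit: int = 100) -> List[str]:
    """Expand at most `limit` revisions starting from `root` and return
    the children of `target` seen on the way. Raise LookupError when
    `target` is not reached, i.e. it is not an ancestor within the limit."""
    children: Dict[str, List[str]] = {}
    seen = {root}
    queue = deque([root])
    expanded = 0
    while queue and expanded < limit:
        rev = queue.popleft()
        expanded += 1
        for parent in parents.get(rev, []):
            # Record the edge in the child direction.
            children.setdefault(parent, []).append(rev)
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    if target != root and target not in children:
        raise LookupError(f"{target} is not an ancestor of {root}")
    return children.get(target, [])


# Toy linear history (assumption): d -> c -> b -> a
parents = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}
found = children_towards_root(parents, "d", "b")
```

With a limit that is too small, the walk stops before reaching the target and the lookup fails even though the target is a true ancestor.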
-
swh.web.common.archive.
lookup_directory_with_revision
(sha1_git, dir_path=None, with_data=False)[source]¶ Return information on directory pointed by revision with sha1_git. If dir_path is not provided, display top level directory. Otherwise, display the directory pointed by dir_path (if it exists).
- Parameters
sha1_git – revision’s hash.
dir_path – optional directory pointed to by that revision.
with_data – boolean indicating whether to retrieve the raw data if the path resolves to a content. Defaults to False.
- Returns
Information on the directory pointed to by that revision.
- Raises
BadInputExc – in case of unknown algo_hash or bad hash.
NotFoundExc – either if the revision is not found or the path referenced does not exist.
NotImplementedError – in case dir_path exists but does not reference a type 'dir' or 'file'.
-
swh.web.common.archive.
lookup_content
(q: str) → Dict[str, Any][source]¶ Lookup the content designated by q.
- Parameters
q – query string of the form <hash_algo:hash>
- Raises
NotFoundExc – if the requested content is not found
-
swh.web.common.archive.
lookup_content_raw
(q: str) → Dict[str, Any][source]¶ Lookup the content defined by q.
- Parameters
q – query string of the form <hash_algo:hash>
- Returns
dict with ‘sha1’ and ‘data’ keys. data representing its raw data decoded.
- Raises
NotFoundExc – if the requested content is not found or if the content bytes are not available in the storage
-
swh.web.common.archive.
stat_counters
()[source]¶ Return the stat counters for Software Heritage
- Returns
A dict mapping textual labels to integer values.
-
swh.web.common.archive.
lookup_origin_visits
(origin: str, last_visit: Optional[int] = None, per_page: int = 10) → Iterator[swh.web.common.typing.OriginVisitInfo][source]¶ Yields the visits of the origin.
- Parameters
origin – origin to list visits for
- Yields
Dictionaries of origin_visit for that origin
-
swh.web.common.archive.
lookup_origin_visit_latest
(origin_url: str, require_snapshot: bool = False, type: Optional[str] = None, allowed_statuses: Optional[List[str]] = None) → Optional[swh.web.common.typing.OriginVisitInfo][source]¶ Return the origin’s latest visit
- Parameters
origin_url – origin to list visits for
type – Optional visit type to filter on (e.g git, tar, dsc, svn, hg, npm, pypi, …)
allowed_statuses – list of visit statuses considered to find the latest visit. For instance,
allowed_statuses=['full']
will only consider visits that have successfully run to completion.
require_snapshot – filter out origins without a snapshot
- Returns
The origin visit info as dict if found
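How the three filters combine can be sketched on a plain list of visit dicts. This is illustrative only: `latest_visit` is a hypothetical helper, the dict keys (visit, type, status, snapshot) are assumptions, and "latest" is taken here to mean the highest visit id.

```python
from typing import Dict, List, Optional


def latest_visit(visits: List[Dict], require_snapshot: bool = False,
                 type: Optional[str] = None,
                 allowed_statuses: Optional[List[str]] = None) -> Optional[Dict]:
    """Return the visit with the highest id among those passing the filters."""
    candidates = [
        v for v in visits
        if (type is None or v["type"] == type)
        and (allowed_statuses is None or v["status"] in allowed_statuses)
        and (not require_snapshot or v.get("snapshot") is not None)
    ]
    # default=None covers the case where every visit was filtered out.
    return max(candidates, key=lambda v: v["visit"], default=None)


visits = [
    {"visit": 1, "type": "git", "status": "full", "snapshot": "aa"},
    {"visit": 2, "type": "git", "status": "partial", "snapshot": None},
    {"visit": 3, "type": "hg", "status": "full", "snapshot": "bb"},
]
latest_full_git = latest_visit(visits, type="git", allowed_statuses=["full"])
```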
-
swh.web.common.archive.
lookup_origin_visit
(origin_url: str, visit_id: int) → swh.web.common.typing.OriginVisitInfo[source]¶ Return information about visit visit_id of the origin origin_url.
- Parameters
origin_url – origin concerned by the visit
visit_id – the visit identifier to lookup
- Returns
The dict origin_visit concerned
-
swh.web.common.archive.
lookup_snapshot_sizes
(snapshot_id: str) → Dict[str, int][source]¶ Count the number of branches in the snapshot with the given id
- Parameters
snapshot_id (str) – sha1 identifier of the snapshot
- Returns
A dict whose keys are the target types of branches and values their corresponding amount
- Return type
dict
-
swh.web.common.archive.
lookup_snapshot
(snapshot_id, branches_from='', branches_count=1000, target_types=None)[source]¶ Return information about a snapshot, aka the list of named branches found during a specific visit of an origin.
- Parameters
snapshot_id (str) – sha1 identifier of the snapshot
branches_from (str) – optional parameter used to skip branches whose name is less than it before returning them
branches_count (int) – optional parameter used to restrict the number of returned branches
target_types (list) – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)
- Returns
A dict filled with the snapshot content.
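The interplay of branches_from, branches_count and target_types can be sketched on an in-memory mapping. `select_branches` is a hypothetical helper, and the branch representation (string names mapped to dicts with a target_type key) is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def select_branches(branches: Dict[str, Dict], branches_from: str = "",
                    branches_count: int = 1000,
                    target_types: Optional[List[str]] = None) -> Dict[str, Dict]:
    """Keep branches in name order, skipping names sorting before
    `branches_from`, filtering on target type, up to `branches_count`."""
    selected: Dict[str, Dict] = {}
    for name in sorted(branches):
        if name < branches_from:
            continue
        if target_types is not None and branches[name]["target_type"] not in target_types:
            continue
        if len(selected) >= branches_count:
            break
        selected[name] = branches[name]
    return selected


branches = {
    "HEAD": {"target_type": "alias"},
    "refs/heads/main": {"target_type": "revision"},
    "refs/heads/dev": {"target_type": "revision"},
    "refs/tags/v1": {"target_type": "release"},
}
heads = select_branches(branches, branches_from="refs/heads/",
                        target_types=["revision"])
```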
-
swh.web.common.archive.
lookup_latest_origin_snapshot
(origin: str, allowed_statuses: Optional[List[str]] = None) → Optional[Dict[str, Any]][source]¶ Return information about the latest snapshot of an origin.
Warning
At most 1000 branches contained in the snapshot will be returned for performance reasons.
- Parameters
origin – URL or integer identifier of the origin
allowed_statuses – list of visit statuses considered to find the latest snapshot for the visit. For instance,
allowed_statuses=['full']
will only consider visits that have successfully run to completion.
- Returns
A dict filled with the snapshot content.
-
swh.web.common.archive.
lookup_snapshot_branch_name_from_tip_revision
(snapshot_id: str, revision_id: str) → Optional[str][source]¶ Check if a revision corresponds to the tip of a snapshot branch
- Parameters
snapshot_id – hexadecimal representation of a snapshot id
revision_id – hexadecimal representation of a revision id
- Returns
The name of the first found branch or None otherwise
-
swh.web.common.archive.
lookup_snapshot_alias
(snapshot_id: str, alias_name: str) → Optional[Dict[str, Any]][source]¶ Try to resolve a branch alias in a snapshot.
- Parameters
snapshot_id – hexadecimal representation of a snapshot id
alias_name – name of the branch alias to resolve
- Returns
Target branch information or None if the alias does not exist or targets a dangling branch.
-
swh.web.common.archive.
lookup_revision_through
(revision, limit=100)[source]¶ Retrieve a revision from the criteria stored in the revision dictionary.
- Parameters
revision – Dictionary of criteria to look up the revision with. The supported combinations of keys are:
origin_url, branch_name, ts, sha1_git
origin_url, branch_name, ts
sha1_git_root, sha1_git
sha1_git
- Returns
None if the revision is not found or the actual revision.
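The supported key combinations of the revision dictionary can be checked with a small sketch. `match_combination` is a hypothetical helper, and trying the most specific combination first is an assumption about how ambiguous dictionaries would be resolved.

```python
from typing import Dict, Optional, Tuple

# Supported key combinations, most specific first.
_COMBINATIONS = [
    ("origin_url", "branch_name", "ts", "sha1_git"),
    ("origin_url", "branch_name", "ts"),
    ("sha1_git_root", "sha1_git"),
    ("sha1_git",),
]


def match_combination(revision: Dict) -> Optional[Tuple[str, ...]]:
    """Return the first combination whose keys are all present, or None."""
    for combination in _COMBINATIONS:
        if all(key in revision for key in combination):
            return combination
    return None


criteria = {"sha1_git_root": "aaaa", "sha1_git": "bbbb"}
matched = match_combination(criteria)
```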
-
swh.web.common.archive.
lookup_directory_through_revision
(revision, path=None, limit=100, with_data=False)[source]¶ Retrieve the directory information from the revision.
- Parameters
revision – dictionary of criteria representing a revision to lookup
path – directory’s path to lookup.
limit – optional query parameter to limit the revisions log (defaults to 100). Note that this limit can prevent concluding whether sha1_git is an ancestor of sha1_git_root.
with_data – indicate to retrieve the content’s raw data if path resolves to a content.
- Returns
The directory pointed to by the revision criteria at path.
-
swh.web.common.archive.
vault_progress
(obj_type, obj_id)[source]¶ Get the current progress of a vault bundle.
-
swh.web.common.archive.
diff_revision
(rev_id)[source]¶ Get the list of file changes (insertion / deletion / modification / renaming) for a particular revision.
-
swh.web.common.archive.
get_revisions_walker
(rev_walker_type, rev_start, *args, **kwargs)[source]¶ Utility function to instantiate a revisions walker of a given type, see swh.storage.algos.revisions_walker.
- Parameters
rev_walker_type (str) – the type of revisions walker to return, possible values are: committer_date, dfs, dfs_post, bfs and path
rev_start (str) – hexadecimal representation of a revision identifier
args (list) – positional arguments to pass to the revisions walker constructor
kwargs (dict) – keyword arguments to pass to the revisions walker constructor
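To give an intuition for how two of the walker types differ, the sketch below traverses a toy revision graph in dfs and bfs order. It does not reproduce the walkers of swh.storage.algos.revisions_walker: `walk` is a hypothetical function and the graph is a plain dict mapping each revision to its parents.

```python
from collections import deque
from typing import Dict, Iterator, List


def walk(parents: Dict[str, List[str]], rev_start: str, order: str) -> Iterator[str]:
    """Yield revisions reachable from `rev_start` in 'dfs' or 'bfs' order."""
    if order not in ("dfs", "bfs"):
        raise ValueError(f"unsupported order: {order}")
    pending = deque([rev_start])
    seen = set()
    while pending:
        # bfs consumes the oldest pending revision, dfs the newest.
        rev = pending.popleft() if order == "bfs" else pending.pop()
        if rev in seen:
            continue
        seen.add(rev)
        yield rev
        next_revs = parents.get(rev, [])
        if order == "dfs":
            # Push in reverse so the first parent is explored first.
            next_revs = list(reversed(next_revs))
        pending.extend(next_revs)


# Toy history (assumption): merge commit m with parents x and y.
parents = {"m": ["x", "y"], "x": ["a"], "y": ["b"], "a": [], "b": []}
dfs_order = list(walk(parents, "m", "dfs"))
bfs_order = list(walk(parents, "m", "bfs"))
```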
-
swh.web.common.archive.
lookup_object
(object_type: str, object_id: str) → Dict[str, Any][source]¶ Utility function for looking up an object in the archive by its type and id.
- Parameters
object_type (str) – the type of object to lookup, either content, directory, release, revision or snapshot
object_id (str) – the sha1_git checksum identifier in hexadecimal form of the object to lookup
- Returns
A dictionary describing the object or a list of dictionaries for the directory object type.
- Return type
Dict[str, Any]
- Raises
swh.web.common.exc.NotFoundExc – if the object could not be found in the archive
BadInputExc – if the object identifier is invalid
-
swh.web.common.archive.
lookup_missing_hashes
(grouped_swhids: Dict[str, List[bytes]]) → Set[str][source]¶ Look up which Software Heritage persistent identifier hashes are missing from the storage, using batch processing.
- Parameters
grouped_swhids – A dictionary with object types as keys and object hashes as values
- Returns
A set(hexadecimal) of the hashes not found in the storage
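A dictionary of the expected shape could be built from SWHID strings as sketched below. `group_by_type` is a hypothetical helper, and the key names (content, directory, revision, release, snapshot) are an assumption about what the function expects as object types.

```python
from typing import Dict, List

# SWHID type tags mapped to the assumed object type names.
_TYPES = {
    "cnt": "content",
    "dir": "directory",
    "rev": "revision",
    "rel": "release",
    "snp": "snapshot",
}


def group_by_type(swhids: List[str]) -> Dict[str, List[bytes]]:
    """Group core SWHIDs ('swh:1:<type>:<hex hash>') by object type."""
    grouped: Dict[str, List[bytes]] = {name: [] for name in _TYPES.values()}
    for swhid in swhids:
        # Drop optional qualifiers, which follow the first ';'.
        core = swhid.split(";")[0]
        scheme, version, obj_type, obj_hash = core.split(":")
        if scheme != "swh" or version != "1" or obj_type not in _TYPES:
            raise ValueError(f"invalid SWHID: {swhid}")
        grouped[_TYPES[obj_type]].append(bytes.fromhex(obj_hash))
    return grouped


grouped_swhids = group_by_type([
    "swh:1:cnt:" + "00" * 20,
    "swh:1:rev:" + "ab" * 20,
])
```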
-
swh.web.common.archive.
lookup_origins_by_sha1s
(sha1s: List[str]) → Iterator[Optional[swh.web.common.typing.OriginInfo]][source]¶ Lookup origins from the sha1 hash values of their URLs.
- Parameters
sha1s – list of sha1s in hexadecimal representation
- Yields
origin information as dict
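Since origins are looked up by the sha1 of their URL, the input list could be computed as in the sketch below. `origin_url_sha1` is a hypothetical helper, and encoding the URL as UTF-8 before hashing is an assumption.

```python
import hashlib
from typing import List


def origin_url_sha1(url: str) -> str:
    """Return the hexadecimal sha1 of an origin URL (UTF-8 encoded)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


origin_urls = ["https://example.org/repo.git", "https://example.org/other.git"]
sha1s: List[str] = [origin_url_sha1(url) for url in origin_urls]
```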