swh.web.common.archive module

swh.web.common.archive.lookup_multiple_hashes(hashes)[source]

Lookup the passed hashes in a single DB connection, using batch processing.

Parameters

hashes – An array of {filename: X, sha1: Y}, where X is a string and Y is a hex sha1 string.

Returns

The same array with elements updated with elem[‘found’] = true if the hash is present in storage, elem[‘found’] = false if not.
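As a rough illustration of the input and output shapes described above, the sketch below annotates a list of hash entries with a found flag. The helper mark_found and the known_sha1s set are hypothetical stand-ins for the storage lookup; they are not part of swh.web.

```python
from typing import Any, Dict, List, Set


def mark_found(hashes: List[Dict[str, Any]], known_sha1s: Set[str]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: set elem['found'] for every element, in place."""
    for elem in hashes:
        elem["found"] = elem["sha1"] in known_sha1s
    return hashes


# Made-up entries; the sha1 values are placeholders, not real checksums.
hashes = [
    {"filename": "README", "sha1": "a" * 40},
    {"filename": "setup.py", "sha1": "b" * 40},
]
result = mark_found(hashes, known_sha1s={"a" * 40})
```

The same list object is returned with each element updated, which mirrors the "same array with elements updated" wording above.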

swh.web.common.archive.lookup_expression(expression, last_sha1, per_page)[source]

Lookup expression in raw content.

Parameters
  • expression (str) – An expression to lookup through raw indexed content

  • last_sha1 (str) – Last sha1 seen

  • per_page (int) – Number of results per page

Yields

ctags whose content matches the expression

swh.web.common.archive.lookup_hash(q: str) → Dict[str, Any][source]

Check if the storage contains a given content checksum and return it if found.

Parameters

q – query string of the form <hash_algo:hash>

Returns

Dict with key found containing the hash info if the hash is present, None if not.

swh.web.common.archive.search_hash(q: str) → Dict[str, bool][source]

Search storage for a given content checksum.

Parameters

q – query string of the form <hash_algo:hash>

Returns

Dict with key found set to True or False, according to whether the checksum is present or not
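Several functions in this module take a query string of the form <hash_algo:hash>. The sketch below shows one way such a string could be built and split. Both helpers are hypothetical and only illustrate the format; how swh.web parses and validates the query is not shown here.

```python
import hashlib
from typing import Tuple


def make_query(data: bytes, algo: str = "sha1") -> str:
    """Hypothetical helper: build a '<hash_algo>:<hash>' query for some bytes."""
    digest = hashlib.new(algo, data).hexdigest()
    return f"{algo}:{digest}"


def split_query(q: str) -> Tuple[str, str]:
    """Hypothetical helper: split a query into (hash_algo, hex_hash)."""
    algo, sep, hex_hash = q.partition(":")
    if not sep or not algo or not hex_hash:
        raise ValueError(f"expected <hash_algo:hash>, got {q!r}")
    bytes.fromhex(hex_hash)  # raises ValueError if this is not valid hexadecimal
    return algo, hex_hash


q = make_query(b"hello")
algo, hex_hash = split_query(q)
```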

swh.web.common.archive.lookup_content_ctags(q)[source]

Return ctags information from a specified content.

Parameters

q – query string of the form <hash_algo:hash>

Yields

ctags information (dict) list if the content is found.

swh.web.common.archive.lookup_content_filetype(q)[source]

Return filetype information from a specified content.

Parameters

q – query string of the form <hash_algo:hash>

Yields

filetype information (dict) list if the content is found.

swh.web.common.archive.lookup_content_language(q)[source]

Return language information from a specified content.

Parameters

q – query string of the form <hash_algo:hash>

Yields

language information (dict) list if the content is found.

swh.web.common.archive.lookup_content_license(q)[source]

Return license information from a specified content.

Parameters

q – query string of the form <hash_algo:hash>

Yields

license information (dict) list if the content is found.

swh.web.common.archive.lookup_origin(origin: swh.web.common.typing.OriginInfo) → swh.web.common.typing.OriginInfo[source]

Return information about the origin matching dict origin.

Parameters

origin – origin’s dict with ‘url’ key

Returns

origin information as dict.

swh.web.common.archive.lookup_origins(page_token: Optional[str], limit: int = 100) → swh.core.api.classes.PagedResult[swh.web.common.typing.OriginInfo, str][source]

Get list of archived software origins in a paginated way.

Origins are sorted by id before returning them

Parameters
  • page_token – opaque token used to get the next page of results

  • limit – the maximum number of origins to return

Returns

Page of OriginInfo
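A paginated call like this one is typically consumed in a loop that feeds each page's token into the next call. The sketch below shows that loop against an in-memory stand-in. The Page dataclass only mimics the results and next_page_token fields assumed for PagedResult, and fake_lookup_origins is a hypothetical replacement for the real function.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Stand-in with the two fields this sketch assumes a page exposes."""
    results: List[dict] = field(default_factory=list)
    next_page_token: Optional[str] = None


def iter_all(fetch: Callable[[Optional[str], int], Page], limit: int = 100) -> Iterator[dict]:
    """Yield every result by following next_page_token until it is None."""
    token: Optional[str] = None
    while True:
        page = fetch(token, limit)
        yield from page.results
        token = page.next_page_token
        if token is None:
            break


# Made-up origins and a fake paginated lookup over them.
ORIGINS = [{"url": f"https://example.org/repo{i}"} for i in range(5)]


def fake_lookup_origins(page_token: Optional[str], limit: int) -> Page:
    start = int(page_token) if page_token else 0
    end = start + limit
    token = str(end) if end < len(ORIGINS) else None
    return Page(results=ORIGINS[start:end], next_page_token=token)


urls = [origin["url"] for origin in iter_all(fake_lookup_origins, limit=2)]
```

Treating the token as opaque keeps the caller independent of how the server encodes its position.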

swh.web.common.archive.search_origin(url_pattern: str, limit: int = 50, with_visit: bool = False, page_token: Optional[str] = None) → Tuple[List[swh.web.common.typing.OriginInfo], Optional[str]][source]

Search for origins whose urls contain a provided string pattern or match a provided regular expression.

Parameters
  • url_pattern – the string pattern to search for in origin urls

  • limit – the maximum number of found origins to return

  • page_token – opaque string used to get the next results of a search

Returns

list of origin information as dict.

swh.web.common.archive.search_origin_metadata(fulltext: str, limit: int = 50) → Iterable[swh.web.common.typing.OriginMetadataInfo][source]

Search for origins whose metadata match a provided string pattern.

Parameters
  • fulltext – the string pattern to search for in origin metadata

  • limit – the maximum number of found origins to return

Returns

Iterable of origin metadata information for existing origins

swh.web.common.archive.lookup_origin_intrinsic_metadata(origin_url: str) → Dict[str, Any][source]

Return intrinsic metadata for the origin whose url matches the given origin_url.

Parameters

origin_url – origin url

Raises

NotFoundExc when the origin is not found

Returns

origin metadata.

swh.web.common.archive.lookup_directory(sha1_git)[source]

Return information about the directory with id sha1_git.

Parameters

sha1_git – the directory’s sha1_git, as a string

Returns

directory information as dict.

swh.web.common.archive.lookup_directory_with_path(sha1_git, path_string)[source]

Return directory information for the entry with path path_string, relative to the root directory pointed to by sha1_git.

Parameters
  • sha1_git – sha1_git corresponding to the directory to which we append paths

  • path_string – the relative path to the entry, starting from the directory pointed to by sha1_git

Raises

NotFoundExc if the directory entry is not found
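To make the path resolution concrete, here is a minimal sketch that walks a relative path through nested directory listings. The DIRECTORIES mapping, the entry layout and the resolve_entry helper are all assumptions made for illustration, and LookupError stands in for NotFoundExc.

```python
from typing import Dict, List

# Toy storage: directory id -> list of entries (all ids are made up).
DIRECTORIES: Dict[str, List[dict]] = {
    "root": [
        {"name": "src", "type": "dir", "target": "src-dir"},
        {"name": "README", "type": "file", "target": "readme-content"},
    ],
    "src-dir": [
        {"name": "main.py", "type": "file", "target": "main-content"},
    ],
}


def resolve_entry(directories: Dict[str, List[dict]], root_id: str, path_string: str) -> dict:
    """Follow path_string one component at a time, starting from root_id."""
    current = {"name": "", "type": "dir", "target": root_id}
    for name in path_string.strip("/").split("/"):
        if current["type"] != "dir":
            raise LookupError(f"{current['name']!r} is not a directory")
        entries = directories.get(current["target"], [])
        match = next((e for e in entries if e["name"] == name), None)
        if match is None:
            raise LookupError(f"entry {name!r} not found")
        current = match
    return current


entry = resolve_entry(DIRECTORIES, "root", "src/main.py")
```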

swh.web.common.archive.lookup_release(release_sha1_git: str) → Dict[str, Any][source]

Return information about the release with sha1 release_sha1_git.

Parameters

release_sha1_git – The release’s sha1 as hexadecimal

Returns

Release information as dict.

Raises

ValueError if the identifier provided is not of sha1 nature.
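The ValueError above concerns identifiers that do not look like a sha1. As a rough sketch of what such a check could look like, the hypothetical helper below accepts only 40 hexadecimal characters. The actual validation performed by swh.web may differ.

```python
import re

# A hex-encoded sha1 is 40 hexadecimal characters (20 bytes).
_SHA1_HEX = re.compile(r"[0-9a-f]{40}")


def check_sha1_hex(identifier: str) -> str:
    """Hypothetical check: return the lowercased identifier or raise ValueError."""
    lowered = identifier.lower()
    if not _SHA1_HEX.fullmatch(lowered):
        raise ValueError(f"not a hexadecimal sha1: {identifier!r}")
    return lowered


valid = check_sha1_hex("A" * 40)
```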

swh.web.common.archive.lookup_release_multiple(sha1_git_list) → Iterator[Optional[Dict[str, Any]]][source]

Return information about the releases identified with their sha1_git identifiers.

Parameters

sha1_git_list – A list of release sha1_git identifiers

Returns

Iterator of Release metadata information as dict.

Raises

ValueError if the identifier provided is not of sha1 nature.

swh.web.common.archive.lookup_revision(rev_sha1_git) → Dict[str, Any][source]

Return information about the revision with sha1 rev_sha1_git.

Parameters

rev_sha1_git – The revision’s sha1 as hexadecimal

Returns

Revision information as dict.

Raises
  • ValueError if the identifier provided is not of sha1 nature.

  • NotFoundExc if there is no revision with the provided sha1_git.

swh.web.common.archive.lookup_revision_multiple(sha1_git_list) → Iterator[Optional[Dict[str, Any]]][source]

Return information about the revisions identified with their sha1_git identifiers.

Parameters

sha1_git_list – A list of revision sha1_git identifiers

Yields

revision information as dict if the revision exists, None otherwise.

Raises

ValueError if the identifier provided is not of sha1 nature.
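Because missing revisions are yielded as None, a caller usually pairs each result with the identifier it was requested for. The sketch below does this with a fake generator. It assumes results come back in the same order as the input list, and fake_lookup_revision_multiple is a hypothetical stand-in for the real function.

```python
from typing import Any, Dict, Iterator, List, Optional

# Made-up archive content: only one of the requested revisions exists.
KNOWN: Dict[str, Dict[str, Any]] = {
    "a" * 40: {"id": "a" * 40, "message": "first commit"},
}


def fake_lookup_revision_multiple(sha1_git_list: List[str]) -> Iterator[Optional[Dict[str, Any]]]:
    """Stand-in: yield the revision dict, or None when it is unknown."""
    for sha1_git in sha1_git_list:
        yield KNOWN.get(sha1_git)


ids = ["a" * 40, "b" * 40]
found: Dict[str, Dict[str, Any]] = {}
missing: List[str] = []
for sha1_git, revision in zip(ids, fake_lookup_revision_multiple(ids)):
    if revision is None:
        missing.append(sha1_git)
    else:
        found[sha1_git] = revision
```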

swh.web.common.archive.lookup_revision_message(rev_sha1_git) → Dict[str, bytes][source]

Return the raw message of the revision with sha1 rev_sha1_git.

Parameters

rev_sha1_git – The revision’s sha1 as hexadecimal

Returns

Decoded revision message as dict {‘message’: <the_message>}

Raises
  • ValueError if the identifier provided is not of sha1 nature.

  • NotFoundExc if the revision is not found, or if it has no message

swh.web.common.archive.lookup_revision_by(origin, branch_name='HEAD', timestamp=None)[source]

Lookup revision by origin, snapshot branch name and visit timestamp.

If branch_name is not provided, lookup using ‘HEAD’ as default. If timestamp is not provided, use the most recent.

Parameters
  • origin (Union[int,str]) – origin of the revision

  • branch_name (str) – snapshot branch name

  • timestamp (str/int) – origin visit time frame

Returns

The revision matching the criteria

Return type

dict

Raises

NotFoundExc if no revision corresponds to the criteria

swh.web.common.archive.lookup_revision_log(rev_sha1_git, limit)[source]

Lookup revision log by revision id.

Parameters
  • rev_sha1_git (str) – The revision’s sha1 as hexadecimal

  • limit (int) – the maximum number of revisions returned

Returns

Revision log as list of revision dicts

Return type

list

Raises
  • ValueError – if the identifier provided is not of sha1 nature.

  • swh.web.common.exc.NotFoundExc – if there is no revision with the provided sha1_git.

swh.web.common.archive.lookup_revision_log_by(origin, branch_name, timestamp, limit)[source]

Lookup revision log by origin, snapshot branch name and visit timestamp.

Parameters
  • origin (Union[int,str]) – origin of the revision

  • branch_name (str) – snapshot branch

  • timestamp (str/int) – origin visit time frame

  • limit (int) – the maximum number of revisions returned

Returns

Revision log as list of revision dicts

Return type

list

Raises

swh.web.common.exc.NotFoundExc – if no revision corresponds to the criteria

swh.web.common.archive.lookup_revision_with_context_by(origin, branch_name, timestamp, sha1_git, limit=100)[source]

Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root, where sha1_git_root is resolved through the lookup of a revision by origin, branch_name and timestamp.

In other words, sha1_git is an ancestor of sha1_git_root.

Parameters
  • origin – origin of the revision.

  • branch_name – revision’s branch.

  • timestamp – revision’s time frame.

  • sha1_git – one of sha1_git_root’s ancestors.

  • limit – limit the lookup to 100 revisions back.

Returns

Pair of (root_revision, revision). Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root

Raises
  • BadInputExc in case of unknown algo_hash or bad hash.

  • NotFoundExc if either revision is not found or if sha1_git is not an ancestor of sha1_git_root.

swh.web.common.archive.lookup_revision_with_context(sha1_git_root: Union[str, Dict[str, Any], swh.model.model.Revision], sha1_git: str, limit: int = 100) → Dict[str, Any][source]

Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root.

In other words, sha1_git is an ancestor of sha1_git_root.

Parameters
  • sha1_git_root – latest revision. The type is either a sha1 (as a hex string) or a non-converted dict.

  • sha1_git – one of sha1_git_root’s ancestors

  • limit – limit the lookup to 100 revisions back

Returns

Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root

Raises
  • BadInputExc in case of unknown algo_hash or bad hash

  • NotFoundExc if either revision is not found or if sha1_git is not an ancestor of sha1_git_root
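The idea of looking up a revision within the ancestry of a root, bounded by limit, can be sketched on a toy graph. The function below is only an illustration under assumed data structures (a dict mapping each revision to its parents). It is not how swh.web implements the lookup.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def children_in_ancestry(
    parents: Dict[str, List[str]], root: str, target: str, limit: int = 100
) -> Optional[Set[str]]:
    """Walk back from root, expanding at most `limit` revisions.

    Returns the children of `target` seen during the walk, or None if
    `target` was not discovered among the ancestors of `root`.
    """
    children: Dict[str, Set[str]] = {}
    seen = {root}
    queue = deque([root])
    expanded = 0
    while queue and expanded < limit:
        rev = queue.popleft()
        expanded += 1
        for parent in parents.get(rev, []):
            children.setdefault(parent, set()).add(rev)
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    if target not in seen:
        return None
    return children.get(target, set())


# Toy history: d merges c and e, which both descend from b, whose parent is a.
PARENTS = {"d": ["c", "e"], "c": ["b"], "e": ["b"], "b": ["a"], "a": []}
full = children_in_ancestry(PARENTS, "d", "b")
truncated = children_in_ancestry(PARENTS, "d", "b", limit=1)
```

With limit=1 only the root is expanded, so b is never discovered even though it is an ancestor. A small limit can therefore make an ancestor look unrelated.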

swh.web.common.archive.lookup_directory_with_revision(sha1_git, dir_path=None, with_data=False)[source]

Return information on directory pointed by revision with sha1_git. If dir_path is not provided, display top level directory. Otherwise, display the directory pointed by dir_path (if it exists).

Parameters
  • sha1_git – revision’s hash.

  • dir_path – optional directory pointed to by that revision.

  • with_data – boolean indicating whether to retrieve the raw data if the path resolves to a content. Defaults to False.

Returns

Information on the directory pointed to by that revision.

Raises
  • BadInputExc in case of unknown algo_hash or bad hash.

  • NotFoundExc either if the revision is not found or the path referenced does not exist.

  • NotImplementedError if dir_path exists but does not reference a type 'dir' or 'file'.

swh.web.common.archive.lookup_content(q: str) → Dict[str, Any][source]

Lookup the content designated by q.

Parameters

q – query string of the form <hash_algo:hash>

Raises

NotFoundExc if the requested content is not found

swh.web.common.archive.lookup_content_raw(q: str) → Dict[str, Any][source]

Lookup the content defined by q.

Parameters

q – query string of the form <hash_algo:hash>

Returns

dict with ‘sha1’ and ‘data’ keys, where data holds the decoded raw data of the content.

Raises
  • NotFoundExc if the requested content is not found or if the content bytes are not available in the storage

swh.web.common.archive.stat_counters()[source]

Return the stat counters for Software Heritage

Returns

A dict mapping textual labels to integer values.

swh.web.common.archive.lookup_origin_visits(origin: str, last_visit: Optional[int] = None, per_page: int = 10) → Iterator[swh.web.common.typing.OriginVisitInfo][source]

Yields the origin’s visits.

Parameters

origin – origin to list visits for

Yields

Dictionaries of origin_visit for that origin

swh.web.common.archive.lookup_origin_visit_latest(origin_url: str, require_snapshot: bool = False, type: Optional[str] = None, allowed_statuses: Optional[List[str]] = None) → Optional[swh.web.common.typing.OriginVisitInfo][source]

Return the origin’s latest visit

Parameters
  • origin_url – origin to list visits for

  • type – Optional visit type to filter on (e.g git, tar, dsc, svn, hg, npm, pypi, …)

  • allowed_statuses – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.

  • require_snapshot – filter out origins without a snapshot

Returns

The origin visit info as dict if found
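The three filters described above can be illustrated on a toy list of visits. In the sketch below, the visit dicts, their keys and the latest_visit helper are assumptions made for the example, and "latest" is taken to mean the highest visit number.

```python
from typing import List, Optional

# Made-up visits of a single origin.
VISITS = [
    {"visit": 1, "type": "git", "status": "full", "snapshot": "snp-1"},
    {"visit": 2, "type": "git", "status": "partial", "snapshot": "snp-2"},
    {"visit": 3, "type": "git", "status": "failed", "snapshot": None},
]


def latest_visit(
    visits: List[dict],
    require_snapshot: bool = False,
    type: Optional[str] = None,  # name mirrors the documented parameter
    allowed_statuses: Optional[List[str]] = None,
) -> Optional[dict]:
    """Hypothetical filter: keep matching visits, return the most recent one."""
    candidates = [
        v
        for v in visits
        if (type is None or v["type"] == type)
        and (allowed_statuses is None or v["status"] in allowed_statuses)
        and (not require_snapshot or v["snapshot"] is not None)
    ]
    return max(candidates, key=lambda v: v["visit"], default=None)


latest_any = latest_visit(VISITS)
latest_with_snapshot = latest_visit(VISITS, require_snapshot=True)
latest_full = latest_visit(VISITS, allowed_statuses=["full"])
```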

swh.web.common.archive.lookup_origin_visit(origin_url: str, visit_id: int) → swh.web.common.typing.OriginVisitInfo[source]

Return information about the visit visit_id of the origin origin_url.

Parameters
  • origin_url – origin concerned by the visit

  • visit_id – the visit identifier to lookup

Returns

The dict origin_visit concerned

swh.web.common.archive.lookup_snapshot_sizes(snapshot_id)[source]

Count the number of branches in the snapshot with the given id

Parameters

snapshot_id (str) – sha1 identifier of the snapshot

Returns

A dict whose keys are the target types of branches and whose values are the corresponding branch counts

Return type

dict
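The shape of the returned dict can be pictured by counting branches per target type. The sketch below does so on a made-up branches mapping. The branch layout is an assumption for illustration and does not query any storage.

```python
from collections import Counter

# Made-up snapshot branches: name -> target description.
branches = {
    "HEAD": {"target_type": "alias", "target": "refs/heads/main"},
    "refs/heads/main": {"target_type": "revision", "target": "rev-1"},
    "refs/heads/dev": {"target_type": "revision", "target": "rev-2"},
    "refs/tags/v1.0": {"target_type": "release", "target": "rel-1"},
}

# Count how many branches point at each kind of target.
sizes = dict(Counter(branch["target_type"] for branch in branches.values()))
```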

swh.web.common.archive.lookup_snapshot(snapshot_id, branches_from='', branches_count=1000, target_types=None)[source]

Return information about a snapshot, aka the list of named branches found during a specific visit of an origin.

Parameters
  • snapshot_id (str) – sha1 identifier of the snapshot

  • branches_from (str) – optional parameter used to skip branches whose name sorts before it

  • branches_count (int) – optional parameter used to limit the number of returned branches

  • target_types (list) – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’)

Returns

A dict filled with the snapshot content.
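How the three optional parameters interact can be sketched on a toy snapshot. The select_branches helper and the branch layout below are hypothetical. In particular, applying the type filter before the count limit is an assumption of this sketch.

```python
from typing import Dict, List, Optional

# Made-up snapshot branches: name -> target description.
BRANCHES: Dict[str, dict] = {
    "HEAD": {"target_type": "alias", "target": "refs/heads/main"},
    "refs/heads/dev": {"target_type": "revision", "target": "rev-2"},
    "refs/heads/main": {"target_type": "revision", "target": "rev-1"},
    "refs/tags/v1.0": {"target_type": "release", "target": "rel-1"},
}


def select_branches(
    branches: Dict[str, dict],
    branches_from: str = "",
    branches_count: int = 1000,
    target_types: Optional[List[str]] = None,
) -> Dict[str, dict]:
    """Hypothetical selection: sorted by name, filtered, then truncated."""
    selected: Dict[str, dict] = {}
    for name in sorted(branches):
        if name < branches_from:
            continue  # skip names that sort before branches_from
        if target_types is not None and branches[name]["target_type"] not in target_types:
            continue  # keep only the requested target types
        if len(selected) >= branches_count:
            break  # stop once enough branches were collected
        selected[name] = branches[name]
    return selected


heads = select_branches(BRANCHES, branches_from="refs/heads/", target_types=["revision"])
first = select_branches(BRANCHES, branches_count=1)
```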

swh.web.common.archive.lookup_latest_origin_snapshot(origin: str, allowed_statuses: List[str] = None) → Optional[Dict[str, Any]][source]

Return information about the latest snapshot of an origin.

Warning

At most 1000 branches contained in the snapshot will be returned for performance reasons.

Parameters
  • origin – URL or integer identifier of the origin

  • allowed_statuses – list of visit statuses considered to find the latest snapshot for the visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.

Returns

A dict filled with the snapshot content.

swh.web.common.archive.lookup_snapshot_branch_name_from_tip_revision(snapshot_id: str, revision_id: str) → Optional[str][source]

Check if a revision corresponds to the tip of a snapshot branch

Parameters
  • snapshot_id – hexadecimal representation of a snapshot id

  • revision_id – hexadecimal representation of a revision id

Returns

The name of the first found branch or None otherwise
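The check described above amounts to scanning the branches for one whose target is the given revision. A minimal sketch, assuming a made-up branches mapping and a hypothetical helper name:

```python
from typing import Dict, Optional

# Made-up snapshot branches: name -> target description.
BRANCHES: Dict[str, dict] = {
    "refs/heads/main": {"target_type": "revision", "target": "rev-1"},
    "refs/heads/dev": {"target_type": "revision", "target": "rev-2"},
    "refs/tags/v1.0": {"target_type": "release", "target": "rel-1"},
}


def branch_name_for_tip(branches: Dict[str, dict], revision_id: str) -> Optional[str]:
    """Return the first branch name whose tip is revision_id, else None."""
    for name, branch in branches.items():
        if branch["target_type"] == "revision" and branch["target"] == revision_id:
            return name
    return None


name = branch_name_for_tip(BRANCHES, "rev-2")
```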

swh.web.common.archive.lookup_revision_through(revision, limit=100)[source]

Retrieve a revision from the criteria stored in the revision dictionary.

Parameters
  • revision – Dictionary of criteria to lookup the revision with. The supported combinations of keys are:

      • origin_url, branch_name, ts, sha1_git

      • origin_url, branch_name, ts

      • sha1_git_root, sha1_git

      • sha1_git

Returns

None if the revision is not found or the actual revision.

swh.web.common.archive.lookup_directory_through_revision(revision, path=None, limit=100, with_data=False)[source]

Retrieve the directory information from the revision.

Parameters
  • revision – dictionary of criteria representing a revision to lookup

  • path – directory’s path to lookup.

  • limit – optional query parameter to limit the revisions log (default to 100). For now, note that this limit could impede the transitivity conclusion about sha1_git not being an ancestor of.

  • with_data – indicate to retrieve the content’s raw data if path resolves to a content.

Returns

The directory pointed to by the revision criteria at path.

swh.web.common.archive.vault_cook(obj_type, obj_id, email=None)[source]

Cook a vault bundle.

swh.web.common.archive.vault_fetch(obj_type, obj_id)[source]

Fetch a vault bundle.

swh.web.common.archive.vault_progress(obj_type, obj_id)[source]

Get the current progress of a vault bundle.

swh.web.common.archive.diff_revision(rev_id)[source]

Get the list of file changes (insertion / deletion / modification / renaming) for a particular revision.

swh.web.common.archive.get_revisions_walker(rev_walker_type, rev_start, *args, **kwargs)[source]

Utility function to instantiate a revisions walker of a given type, see swh.storage.algos.revisions_walker.

Parameters
  • rev_walker_type (str) – the type of revisions walker to return, possible values are: committer_date, dfs, dfs_post, bfs and path

  • rev_start (str) – hexadecimal representation of a revision identifier

  • args (list) – positional arguments to pass to the revisions walker constructor

  • kwargs (dict) – keyword arguments to pass to the revisions walker constructor

swh.web.common.archive.lookup_object(object_type: str, object_id: str) → Dict[str, Any][source]

Utility function for looking up an object in the archive by its type and id.

Parameters
  • object_type (str) – the type of object to lookup, either content, directory, release, revision or snapshot

  • object_id (str) – the sha1_git checksum identifier in hexadecimal form of the object to lookup

Returns

A dictionary describing the object or a list of dictionaries for the directory object type.

Return type

Dict[str, Any]

Raises
swh.web.common.archive.lookup_missing_hashes(grouped_swhids: Dict[str, List[bytes]]) → Set[str][source]

Lookup missing Software Heritage persistent identifier hashes, using batch processing.

Parameters
  • grouped_swhids – A dictionary with object types as keys and lists of object hashes as values

Returns

A set(hexadecimal) of the hashes not found in the storage
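The input and output shapes can be illustrated with an in-memory stand-in. In the sketch below, the present mapping replaces the storage and missing_hashes is a hypothetical helper. Only the Dict[str, List[bytes]] to Set[str] shape is taken from the signature above.

```python
from typing import Dict, List, Set


def missing_hashes(
    grouped_swhids: Dict[str, List[bytes]], present: Dict[str, Set[bytes]]
) -> Set[str]:
    """Hypothetical stand-in: hex-encode every hash absent from `present`."""
    missing: Set[str] = set()
    for object_type, hashes in grouped_swhids.items():
        known = present.get(object_type, set())
        missing.update(h.hex() for h in hashes if h not in known)
    return missing


# Made-up 20-byte hashes grouped by object type.
grouped = {
    "content": [bytes.fromhex("aa" * 20), bytes.fromhex("bb" * 20)],
    "directory": [bytes.fromhex("cc" * 20)],
}
present = {"content": {bytes.fromhex("aa" * 20)}}
missing = missing_hashes(grouped, present)
```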

swh.web.common.archive.lookup_origins_by_sha1s(sha1s: List[str]) → Iterator[Optional[swh.web.common.typing.OriginInfo]][source]

Lookup origins from the sha1 hash values of their URLs.

Parameters

sha1s – list of sha1s hexadecimal representation

Yields

origin information as dict
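Since the lookup key is the sha1 of the origin URL, a caller first has to hash the URL. The sketch below computes that hash with hashlib and runs it through a fake lookup. Encoding the URL as UTF-8 is an assumption, and fake_lookup_origins_by_sha1s is a hypothetical stand-in for the real function.

```python
import hashlib
from typing import Dict, Iterator, List, Optional


def url_sha1(url: str) -> str:
    """Hex sha1 of an origin URL (UTF-8 encoding assumed)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Made-up origins indexed by the sha1 of their URL.
ORIGINS: Dict[str, dict] = {
    url_sha1(url): {"url": url}
    for url in ["https://example.org/a.git", "https://example.org/b.git"]
}


def fake_lookup_origins_by_sha1s(sha1s: List[str]) -> Iterator[Optional[dict]]:
    """Stand-in: yield the matching origin, or None when there is none."""
    for sha1 in sha1s:
        yield ORIGINS.get(sha1)


wanted = [url_sha1("https://example.org/a.git"), "0" * 40]
results = list(fake_lookup_origins_by_sha1s(wanted))
```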