swh.model.model module#

Implementation of Software Heritage’s data model

See Data model for an overview of the data model.

The classes defined in this module are immutable attrs objects and enums.

All classes define a from_dict class method and a to_dict method to convert between them and msgpack-serializable objects.

exception swh.model.model.MissingData[source]#

Bases: Exception

Raised by Content.with_data when it has no way of fetching the data (but not when fetching the data fails).

swh.model.model.KeyType#

The type returned by BaseModel.unique_key().

alias of Dict[str, str] | Dict[str, bytes] | bytes

swh.model.model.hash_repr(h: bytes) str[source]#
swh.model.model.parents_repr(parents: Tuple[bytes, ...])[source]#
swh.model.model.freeze_optional_dict(d: None | Dict | ImmutableDict) ImmutableDict | None[source]#
swh.model.model.dictify(value)[source]#

Helper function used by BaseModel.to_dict()

swh.model.model.generic_type_validator(instance, attribute, value)[source]#

Validates the type of an attribute value, whatever the attribute type.

swh.model.model.optimized_validator(type_)[source]#
swh.model.model.optimize_all_validators(cls, old_fields)[source]#

Processes validators to turn them into a faster version … eventually.

class swh.model.model.ModelObjectType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: _StringCompatibleEnum

Possible object types of a model object.

CONTENT = 'content'#
DIRECTORY = 'directory'#
DIRECTORY_ENTRY = 'directory_entry'#
EXTID = 'extid'#
METADATA_AUTHORITY = 'metadata_authority'#
METADATA_FETCHER = 'metadata_fetcher'#
ORIGIN = 'origin'#
ORIGIN_VISIT = 'origin_visit'#
ORIGIN_VISIT_STATUS = 'origin_visit_status'#
PERSON = 'person'#
RAW_EXTRINSIC_METADATA = 'raw_extrinsic_metadata'#
RELEASE = 'release'#
REVISION = 'revision'#
SKIPPED_CONTENT = 'skipped_content'#
SNAPSHOT = 'snapshot'#
SNAPSHOT_BRANCH = 'snapshot_branch'#
TIMESTAMP = 'timestamp'#
TIMESTAMP_WITH_TIMEZONE = 'timestamp_with_timezone'#
class swh.model.model.BaseModel[source]#

Bases: ABC

Base class for SWH model classes.

Provides serialization to and deserialization from Python dictionaries that are suitable for JSON/msgpack-like formats.

abstract property object_type: ModelObjectType#
to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
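As a rough illustration of the to_dict/from_dict round trip, the sketch below uses standard-library dataclasses instead of swh.model itself. ToyPerson and ToyRelease are hypothetical stand-ins invented for this example; the real classes are attrs-based and carry more fields.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyPerson:
    """Hypothetical stand-in for a nested model object."""

    fullname: bytes

    @classmethod
    def from_dict(cls, d):
        return cls(fullname=d["fullname"])


@dataclass(frozen=True)
class ToyRelease:
    """Hypothetical stand-in for an object that contains another object."""

    name: bytes
    author: Optional[ToyPerson]

    def to_dict(self):
        # asdict() recurses into nested dataclasses and yields plain dicts.
        return asdict(self)

    @classmethod
    def from_dict(cls, d):
        # Rebuild the nested object from its dictionary form, if present.
        author = d.get("author")
        return cls(
            name=d["name"],
            author=ToyPerson.from_dict(author) if author is not None else None,
        )


release = ToyRelease(name=b"v1.0", author=ToyPerson(fullname=b"Jane <jane@example.org>"))
as_dict = release.to_dict()
round_tripped = ToyRelease.from_dict(as_dict)
```

Because the toy classes are frozen and compare by value, round_tripped compares equal to release.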

evolve(**kwargs) ModelType[source]#

Alias to call attr.evolve() on this object, returning a new object.

anonymize() ModelType | None[source]#

Returns an anonymized version of the object, if needed.

If the object model does not need/support anonymization, returns None.

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.
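One way such keys could be used for deduplication is sketched below. Since a key may be a dictionary, which is not hashable, the hypothetical helper hashable_key first converts it to a sorted tuple of items; this conversion is an assumption of the example, not something the module is documented to do.

```python
from typing import Dict, Hashable, Iterable, List, Union

# Same shape as the documented KeyType alias.
KeyType = Union[Dict[str, str], Dict[str, bytes], bytes]


def hashable_key(key: KeyType) -> Hashable:
    """Hypothetical helper: make a unique key usable as a set member."""
    if isinstance(key, dict):
        # Dictionaries are not hashable; a sorted tuple of items is.
        return tuple(sorted(key.items()))
    return key


def deduplicate(keys: Iterable[KeyType]) -> List[KeyType]:
    """Keep the first occurrence of each key, preserving order."""
    seen = set()
    unique = []
    for key in keys:
        marker = hashable_key(key)
        if marker not in seen:
            seen.add(marker)
            unique.append(key)
    return unique


keys = [
    b"\x01" * 20,
    {"url": "https://example.org/repo"},
    b"\x01" * 20,
    {"url": "https://example.org/repo"},
]
unique_keys = deduplicate(keys)
```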

check() None[source]#

Performs internal consistency checks, and raises an error if one fails.

class swh.model.model.BaseHashableModel[source]#

Bases: BaseModel, ABC

Mixin to automatically compute object identifier hash when the associated model is instantiated.

id: bytes#
compute_hash() bytes[source]#

Derived model classes must implement this to compute the object hash.

This method is called by the object initialization if the id attribute is set to an empty value.

evolve(**kwargs) HashableModelType[source]#

Alias to call attr.evolve() on this object, returning a new object with its id recomputed based on the content.
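The interplay between compute_hash() and evolve() can be approximated with a frozen dataclass, as in the sketch below. ToyHashable and its SHA-1-of-payload hash are hypothetical and chosen only to keep the example short; they do not reproduce how the real classes compute their identifiers.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyHashable:
    """Hypothetical stand-in for a hashable model object."""

    payload: bytes
    id: bytes = b""

    def __post_init__(self):
        # Compute the id only when it was left empty, as described above.
        if not self.id:
            # Frozen dataclasses require object.__setattr__ during init.
            object.__setattr__(self, "id", self.compute_hash())

    def compute_hash(self) -> bytes:
        # Illustrative hash only; not the real identifier computation.
        return hashlib.sha1(self.payload).digest()

    def evolve(self, **kwargs):
        # Always reset the id so that __post_init__ recomputes it.
        return replace(self, **{**kwargs, "id": b""})


original = ToyHashable(payload=b"hello")
changed = original.evolve(payload=b"world")
```

Here changed.id is derived from the new payload, while original is left untouched.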

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

check() None[source]#

Performs internal consistency checks, and raises an error if one fails.

swh.model.model.HashableObject#

alias of BaseHashableModel

class swh.model.model.HashableObjectWithManifest[source]#

Bases: BaseHashableModel

Derived class of BaseHashableModel, for objects that may need to store verbatim git objects as raw_manifest to preserve original hashes.

raw_manifest: bytes | None = None#

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

compute_hash() bytes[source]#

Derived model classes must implement this to compute the object hash.

This method is called by the object initialization if the id attribute is set to an empty value.

check() None[source]#

Performs internal consistency checks, and raises an error if one fails.

class swh.model.model.Person(fullname: bytes, name: bytes | None, email: bytes | None)[source]#

Bases: BaseModel

Represents the author/committer of a revision or release.

Method generated by attrs for class Person.

object_type: Final = 'person'#
fullname#
name#
email#
classmethod from_fullname(fullname: bytes)[source]#

Returns a Person object, by guessing the name and email from the fullname, in the name <email> format.

The fullname is left unchanged.
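A simplified guess at this kind of parsing is sketched below. split_fullname is a hypothetical helper written for illustration; the actual method may handle edge cases (such as unbalanced brackets or extra whitespace) differently.

```python
from typing import Optional, Tuple


def split_fullname(fullname: bytes) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Guess (name, email) from a b"name <email>" byte string."""
    start = fullname.find(b"<")
    if start == -1:
        # No angle bracket at all: treat everything as the name.
        return (fullname, None)
    # Text before "<" is the name; an empty name becomes None.
    name = fullname[:start].rstrip(b" ") or None
    rest = fullname[start + 1:]
    end = rest.rfind(b">")
    # Without a closing bracket, keep whatever follows "<" as the email.
    email = rest if end == -1 else rest[:end]
    return (name, email)
```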

anonymize() Person[source]#

Returns an anonymized version of the Person object.

Anonymization simply produces a Person whose fullname is hashed, with name and email unset.
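The effect can be pictured with the sketch below. ToyPerson is a hypothetical stand-in, and the use of SHA-256 is an assumption: the description above only says that the fullname is hashed.

```python
import hashlib
from typing import NamedTuple, Optional


class ToyPerson(NamedTuple):
    """Hypothetical stand-in for Person."""

    fullname: bytes
    name: Optional[bytes]
    email: Optional[bytes]


def anonymize(person: ToyPerson) -> ToyPerson:
    # Assumption: SHA-256 is used; the hash function is not stated above.
    hashed = hashlib.sha256(person.fullname).digest()
    return ToyPerson(fullname=hashed, name=None, email=None)


jane = ToyPerson(b"Jane Doe <jane@example.org>", b"Jane Doe", b"jane@example.org")
anon = anonymize(jane)
```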

classmethod from_dict(d)[source]#

If the fullname is missing, constructs one using the following heuristic: if the name value is None, the fullname is the email in angle brackets; otherwise, it is the name, a space, and the email in angle brackets.
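The heuristic can be written out as the small sketch below. build_fullname is a hypothetical helper, and it assumes the email is always present, since the description does not say what happens when it is missing.

```python
from typing import Optional


def build_fullname(name: Optional[bytes], email: bytes) -> bytes:
    """Rebuild a fullname from its parts (email assumed to be set)."""
    bracketed = b"<" + email + b">"
    if name is None:
        # No name: the fullname is just the email in angle brackets.
        return bracketed
    # Otherwise: name, a space, and the email in angle brackets.
    return name + b" " + bracketed
```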

class swh.model.model.Timestamp(seconds: int, microseconds: int)[source]#

Bases: BaseModel

Represents a naive timestamp from a VCS.

Method generated by attrs for class Timestamp.

object_type: Final = 'timestamp'#
seconds#
microseconds#
check_seconds(attribute, value)[source]#

Checks that seconds fit in a 64-bit signed integer.

check_microseconds(attribute, value)[source]#

Checks that microseconds are non-negative and < 1000000.
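Taken together, the two checks amount to range tests along the lines of the sketch below. check_timestamp is a hypothetical standalone function (the real checks are attrs validators), and treating 0 microseconds as valid is an assumption of this example.

```python
def check_timestamp(seconds: int, microseconds: int) -> None:
    """Raise ValueError if either component is out of range."""
    # A signed 64-bit integer covers [-2**63, 2**63 - 1].
    if not -(2**63) <= seconds < 2**63:
        raise ValueError("seconds must fit in a 64-bit signed integer")
    # Assumption: 0 is accepted, so the valid range is [0, 1000000).
    if not 0 <= microseconds < 1_000_000:
        raise ValueError("microseconds must be in [0, 1000000)")
```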

class swh.model.model.TimestampWithTimezone(timestamp: Timestamp, offset_bytes: bytes)[source]#

Bases: BaseModel

Represents a TZ-aware timestamp from a VCS.

Method generated by attrs for class TimestampWithTimezone.

object_type: Final = 'timestamp_with_timezone'#
timestamp#
offset_bytes#

Raw git representation of the timezone, as an offset from UTC. It should follow this format: +HHMM or -HHMM (including +0000 and -0000).

However, when created from git objects, it must be the exact bytes used in the original objects, so it may differ from this format when they do.

classmethod from_numeric_offset(timestamp: Timestamp, offset: int, negative_utc: bool) TimestampWithTimezone[source]#

Returns a TimestampWithTimezone instance from the old dictionary format (with offset and negative_utc instead of offset_bytes).
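A plausible conversion from the old fields to offset_bytes is sketched below. offset_to_bytes is a hypothetical helper, and it assumes that offset is expressed in minutes and that negative_utc is what distinguishes -0000 from +0000.

```python
def offset_to_bytes(offset: int, negative_utc: bool) -> bytes:
    """Format a numeric offset (assumed to be in minutes) as +HHMM / -HHMM."""
    # A zero offset is written with a minus sign only if negative_utc is set.
    negative = offset < 0 or negative_utc
    hours, minutes = divmod(abs(offset), 60)
    sign = "-" if negative else "+"
    return f"{sign}{hours:02d}{minutes:02d}".encode()
```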

classmethod from_dict(time_representation: Dict | datetime | int) TimestampWithTimezone[source]#

Builds a TimestampWithTimezone from any of the formats accepted by swh.model.normalize_timestamp().

classmethod from_datetime(dt: datetime) TimestampWithTimezone[source]#
to_datetime() datetime[source]#

Converts to a datetime (with a timezone set to the recorded fixed UTC offset).

Beware that this conversion can be lossy: -0000 and ‘weird’ offsets cannot be represented. Also note that it may fail due to type overflow.
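Both caveats can be reproduced with the standard library alone, as in the sketch below. The standalone to_datetime function is a hypothetical approximation that takes the offset in minutes; it is not the method's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def to_datetime(seconds: int, microseconds: int, offset_minutes: int) -> datetime:
    """Build an aware datetime from epoch seconds and a fixed UTC offset."""
    tz = timezone(timedelta(minutes=offset_minutes))
    # Integer arithmetic from the epoch avoids float rounding.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    moment = epoch + timedelta(seconds=seconds, microseconds=microseconds)
    return moment.astimezone(tz)


dt = to_datetime(1642765364, 0, 120)
# An offset of 0 minutes cannot record whether the source said +0000 or -0000.
zero = to_datetime(0, 0, 0)
```

In this sketch, a seconds value far outside what datetime supports (for example 2**62) raises OverflowError, which is one way the type overflow mentioned above can show up.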

classmethod from_iso8601(s)[source]#

Builds a TimestampWithTimezone from an ISO8601-formatted string.

offset_minutes()[source]#

Returns the offset, as a number of minutes from UTC.

>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0000"
... ).offset_minutes()
0
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0200"
... ).offset_minutes()
120
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"-0200"
... ).offset_minutes()
-120
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0530"
... ).offset_minutes()
330
class swh.model.model.Origin(url: str, id: bytes = b'')[source]#

Bases: BaseHashableModel

Represents a software source: a VCS and a URL.

Method generated by attrs for class Origin.

object_type: Final = 'origin'#
url#
id: bytes#
unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

swhid() ExtendedSWHID[source]#

Returns a SWHID representing this origin.

check_url(attribute, value)[source]#
class swh.model.model.OriginVisit(origin: str, date: datetime, type: str, visit: int | None = None)[source]#

Bases: BaseModel

Represents an origin visit with a given type at a given point in time, by a SWH loader.

Method generated by attrs for class OriginVisit.

object_type: Final = 'origin_visit'#
origin#
date#
type#

Should not be set before calling ‘origin_visit_add()’.

visit#
check_date(attribute, value)[source]#

Checks the date has a timezone.
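Such a check usually boils down to testing tzinfo, as in the sketch below. check_has_timezone is a hypothetical standalone function, and the error type and message are assumptions.

```python
from datetime import datetime, timezone


def check_has_timezone(value: datetime) -> None:
    """Reject naive datetimes (those without timezone information)."""
    if value.tzinfo is None:
        raise ValueError("date must be a timezone-aware datetime")


# An aware datetime passes silently.
check_has_timezone(datetime(2022, 1, 21, 11, 42, 44, tzinfo=timezone.utc))
```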

to_dict()[source]#

Serializes the date as a string and omits the visit id if it is None.

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

class swh.model.model.OriginVisitStatus(origin: str, visit: int, date: datetime, status: str, snapshot: bytes | None, type: str | None = None, metadata: None | Dict | ImmutableDict = None)[source]#

Bases: BaseModel

Represents a visit update of an origin at a given point in time.

Method generated by attrs for class OriginVisitStatus.

object_type: Final = 'origin_visit_status'#
origin#
visit#
date#
status#
snapshot#
type#
metadata#
check_date(attribute, value)[source]#

Checks the date has a timezone.

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

origin_swhid() ExtendedSWHID[source]#
snapshot_swhid() CoreSWHID | None[source]#
class swh.model.model.SnapshotTargetType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

The type of content pointed to by a snapshot branch. Usually a revision or an alias.

CONTENT = 'content'#
DIRECTORY = 'directory'#
REVISION = 'revision'#
RELEASE = 'release'#
SNAPSHOT = 'snapshot'#
ALIAS = 'alias'#
swh.model.model.TargetType#

alias of SnapshotTargetType

class swh.model.model.ReleaseTargetType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

The type of content pointed to by a release. Usually a revision.

CONTENT = 'content'#
DIRECTORY = 'directory'#
REVISION = 'revision'#
RELEASE = 'release'#
SNAPSHOT = 'snapshot'#
swh.model.model.ObjectType#

alias of ReleaseTargetType

class swh.model.model.SnapshotBranch(target: bytes, target_type: SnapshotTargetType)[source]#

Bases: BaseModel

Represents one of the branches of a snapshot.

Method generated by attrs for class SnapshotBranch.

object_type: Final = 'snapshot_branch'#
target#
target_type#
check_target(attribute, value)[source]#

If the target type is not an alias, checks that the target is a valid sha1_git.

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID | None[source]#

Returns a SWHID for the current branch or None if the branch has no target or is an alias.

class swh.model.model.Snapshot(branches: None | Dict | ImmutableDict, id: bytes = b'')[source]#

Bases: BaseHashableModel

Represents the full state of an origin at a given point in time.

Method generated by attrs for class Snapshot.

object_type: Final = 'snapshot'#
branches#
id: bytes#
classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]#

Returns a SWHID representing this object.

class swh.model.model.Release(name: bytes, message: bytes | None, target: bytes | None, target_type: ReleaseTargetType, synthetic: bool, author: Person | None = None, date: TimestampWithTimezone | None = None, metadata: None | Dict | ImmutableDict = None, id: bytes = b'', raw_manifest: bytes | None = None)[source]#

Bases: HashableObjectWithManifest, BaseModel

Method generated by attrs for class Release.

object_type: Final = 'release'#
name#
message#
target#
target_type#
synthetic#
author#
date#
metadata#
id: bytes#
raw_manifest: bytes | None#

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

check_author(attribute, value)[source]#

If the author is None, checks the date is None too.

to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]#

Returns a SWHID representing this object.

target_swhid() CoreSWHID | None[source]#

Returns the SWHID for the target of this release or None if unset.

anonymize() Release[source]#

Returns an anonymized version of the Release object.

Anonymization consists in replacing the author with an anonymized Person object.

class swh.model.model.RevisionType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

GIT = 'git'#
TAR = 'tar'#
DSC = 'dsc'#
SUBVERSION = 'svn'#
MERCURIAL = 'hg'#
CVS = 'cvs'#
BAZAAR = 'bzr'#
swh.model.model.tuplify_extra_headers(value: Iterable)[source]#
class swh.model.model.Revision(message: bytes | None, author: Person | None, committer: Person | None, date: TimestampWithTimezone | None, committer_date: TimestampWithTimezone | None, type: RevisionType, directory: bytes, synthetic: bool, metadata: None | Dict | ImmutableDict = None, parents: Tuple[bytes, ...] = (), id: bytes = b'', extra_headers: Iterable = (), raw_manifest: bytes | None = None)[source]#

Bases: HashableObjectWithManifest, BaseModel

Method generated by attrs for class Revision.

object_type: Final = 'revision'#
message#
author#
committer#
date#
committer_date#
type#
directory#
synthetic#
metadata#
parents#
id: bytes#
extra_headers#
raw_manifest: bytes | None#

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

check_author(attribute, value)[source]#

If the author is None, checks the date is None too.

check_committer(attribute, value)[source]#

If the committer is None, checks the committer_date is None too.

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]#

Returns a SWHID representing this object.

directory_swhid() CoreSWHID[source]#

Returns the SWHID for the directory referenced by the revision.

parent_swhids() List[CoreSWHID][source]#

Returns a list of SWHIDs for the parent revisions.

anonymize() Revision[source]#

Returns an anonymized version of the Revision object.

Anonymization consists in replacing the author and committer with an anonymized Person object.

class swh.model.model.DirectoryEntry(name: bytes, type: str, target: bytes, perms)[source]#

Bases: BaseModel

Method generated by attrs for class DirectoryEntry.

object_type: Final = 'directory_entry'#
name#
type#
target#
perms#

Usually one of the values of swh.model.from_disk.DentryPerms.

DIR_ENTRY_TYPE_TO_SWHID_OBJECT_TYPE = {'dir': ObjectType.DIRECTORY, 'file': ObjectType.CONTENT, 'rev': ObjectType.REVISION}#
check_name(attribute, value)[source]#
swhid() CoreSWHID[source]#

Returns a SWHID for this directory entry.

class swh.model.model.Directory(entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: bytes | None = None)[source]#

Bases: HashableObjectWithManifest, BaseModel

Method generated by attrs for class Directory.

object_type: Final = 'directory'#
entries#
id: bytes#
raw_manifest: bytes | None#

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

check_entries(attribute, value)[source]#
classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]#

Returns a SWHID representing this object.

classmethod from_possibly_duplicated_entries(*, entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: bytes | None = None) Tuple[bool, Directory][source]#

Constructs a Directory object from a list of entries that may contain duplicated names.

This is required to represent legacy objects that were ingested in the storage database before this check was added.

As it is impossible for a Directory instance to have more than one entry with a given name, this function computes a raw_manifest and renames one of the entries before constructing the Directory.

Returns:

(is_corrupt, directory) where is_corrupt is True iff some entry names were indeed duplicated
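The renaming step can be pictured with the sketch below, which works on bare entry names. rename_duplicates and its "_<n>" suffix scheme are hypothetical; the sketch also leaves out the raw_manifest computation that the real method performs.

```python
from typing import List, Tuple


def rename_duplicates(names: List[bytes]) -> Tuple[bool, List[bytes]]:
    """Return (is_corrupt, names) with repeated names made unique."""
    seen = set()
    result = []
    is_corrupt = False
    for name in names:
        new_name = name
        suffix = 0
        while new_name in seen:
            is_corrupt = True
            # Hypothetical scheme: append _0, _1, ... until the name is free.
            new_name = name + b"_" + str(suffix).encode()
            suffix += 1
        seen.add(new_name)
        result.append(new_name)
    return (is_corrupt, result)
```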

class swh.model.model.BaseContent(status: str)[source]#

Bases: BaseModel, ABC

Method generated by attrs for class BaseContent.

status#
classmethod from_dict(d, use_subclass=True)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

get_hash(hash_name)[source]#
hashes() Dict[str, bytes][source]#

Returns a dictionary {hash_name: hash_value}

class swh.model.model.Content(sha1: bytes, sha1_git: bytes, sha256: bytes, blake2s256: bytes, length: int, status: str = 'visible', data: bytes | None = None, get_data: Callable[[], bytes] | None = None, ctime: datetime | None = None)[source]#

Bases: BaseContent

Method generated by attrs for class Content.

object_type: Final = 'content'#
sha1#
sha1_git#
sha256#
blake2s256#
length#
status#
data#
get_data#
ctime#
check_length(attribute, value)[source]#

Checks the length is non-negative.

check_ctime(attribute, value)[source]#

Checks the ctime has a timezone.

to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_data(data, status='visible', ctime=None) Content[source]#

Generate a Content from a given data byte string.

This populates the Content with the hashes and length for the data passed as argument, as well as the data itself.
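The fields being populated can be approximated with hashlib, as in the sketch below. content_fields is a hypothetical helper, and it assumes that sha1_git is the SHA-1 of the data prefixed with a git blob header, as git itself computes blob identifiers.

```python
import hashlib
from typing import Dict, Union


def content_fields(data: bytes) -> Dict[str, Union[bytes, int]]:
    """Compute the hashes and length that describe a byte string."""
    # Assumption: sha1_git hashes the data behind a "blob <length>\0" header.
    git_header = b"blob " + str(len(data)).encode() + b"\0"
    return {
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(git_header + data).digest(),
        "sha256": hashlib.sha256(data).digest(),
        # blake2s produces a 32-byte (256-bit) digest by default.
        "blake2s256": hashlib.blake2s(data).digest(),
        "length": len(data),
        "data": data,
    }


fields = content_fields(b"")
```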

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

with_data(raise_if_missing: bool = True) Content[source]#

Loads the data attribute if get_data is not None.

This call is almost a no-op, but subclasses may overload this method to lazy-load data (e.g. from disk or objstorage).

Parameters:

raise_if_missing – if True (default), raises a MissingData exception if no data is attached to the content object
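The documented behaviour, including the distinction made by MissingData, is approximated in the sketch below. ToyContent is a hypothetical stand-in limited to the data and get_data fields, and the local MissingData class only mirrors the exception of the same name.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


class MissingData(Exception):
    """Local stand-in: there is no way of fetching the data."""


@dataclass(frozen=True)
class ToyContent:
    """Hypothetical stand-in for Content."""

    data: Optional[bytes] = None
    get_data: Optional[Callable[[], bytes]] = None

    def with_data(self, raise_if_missing: bool = True) -> "ToyContent":
        if self.data is not None:
            # Data already attached: nothing to do.
            return self
        if self.get_data is not None:
            # Errors raised by get_data itself propagate unchanged;
            # they are not converted to MissingData.
            return replace(self, data=self.get_data(), get_data=None)
        if raise_if_missing:
            raise MissingData("neither data nor get_data is set")
        return self


loaded = ToyContent(get_data=lambda: b"abc").with_data()
empty = ToyContent().with_data(raise_if_missing=False)
```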

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

swhid() CoreSWHID[source]#

Returns a SWHID representing this object.

class swh.model.model.SkippedContent(sha1: bytes | None, sha1_git: bytes | None, sha256: bytes | None, blake2s256: bytes | None, length: int | None, status: str, reason: str | None = None, origin: str | None = None, ctime: datetime | None = None)[source]#

Bases: BaseContent

Method generated by attrs for class SkippedContent.

object_type: Final = 'skipped_content'#
sha1#
sha1_git#
sha256#
blake2s256#
length#
status#
reason#
origin#
ctime#
check_reason(attribute, value)[source]#

Checks that the reason is filled in when the content is absent.

check_length(attribute, value)[source]#

Checks the length is non-negative or -1.

check_ctime(attribute, value)[source]#

Checks the ctime has a timezone.

to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_data(data: bytes, reason: str, ctime: datetime | None = None) SkippedContent[source]#

Generate a SkippedContent from a given data byte string.

This populates the SkippedContent with the hashes and length for the data passed as argument.

You can use attr.evolve on such a generated content to nullify some of its attributes, e.g. for tests.

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

swhid() CoreSWHID | None[source]#

Returns a SWHID representing this object or None if unset.

class swh.model.model.MetadataAuthorityType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

DEPOSIT_CLIENT = 'deposit_client'#
FORGE = 'forge'#
REGISTRY = 'registry'#
class swh.model.model.MetadataAuthority(type: MetadataAuthorityType, url: str, metadata: None | Dict | ImmutableDict = None)[source]#

Bases: BaseModel

Represents an entity that provides metadata about an origin or software artifact.

Method generated by attrs for class MetadataAuthority.

object_type: Final = 'metadata_authority'#
type#
url#
metadata#
to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

class swh.model.model.MetadataFetcher(name: str, version: str, metadata: None | Dict | ImmutableDict = None)[source]#

Bases: BaseModel

Represents a software component used to fetch metadata from a metadata authority, and ingest them into the Software Heritage archive.

Method generated by attrs for class MetadataFetcher.

object_type: Final = 'metadata_fetcher'#
name#
version#
metadata#
to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#

Returns a unique key for this object, that can be used for deduplication.

swh.model.model.normalize_discovery_date(value: Any) datetime[source]#
class swh.model.model.RawExtrinsicMetadata(target: ExtendedSWHID, discovery_date: Any, authority: MetadataAuthority, fetcher: MetadataFetcher, format: str, metadata: bytes, origin: str | None = None, visit: int | None = None, snapshot: CoreSWHID | None = None, release: CoreSWHID | None = None, revision: CoreSWHID | None = None, path: bytes | None = None, directory: CoreSWHID | None = None, id: bytes = b'')[source]#

Bases: BaseHashableModel

Method generated by attrs for class RawExtrinsicMetadata.

object_type: Final = 'raw_extrinsic_metadata'#
target#
discovery_date#
authority#
fetcher#
format#
metadata#
origin#
visit#
snapshot#
release#
revision#
path#
directory#
id: bytes#
check_origin(attribute, value)[source]#
check_visit(attribute, value)[source]#
check_snapshot(attribute, value)[source]#
check_release(attribute, value)[source]#
check_revision(attribute, value)[source]#
check_path(attribute, value)[source]#
check_directory(attribute, value)[source]#
to_dict()[source]#

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() ExtendedSWHID[source]#

Returns a SWHID representing this RawExtrinsicMetadata object.

class swh.model.model.ExtID(extid_type: str, extid: bytes, target: CoreSWHID, extid_version: int = 0, payload_type: str | None = None, payload: bytes | None = None, id: bytes = b'')[source]#

Bases: BaseHashableModel

Method generated by attrs for class ExtID.

object_type: Final = 'extid'#
extid_type#
extid#
target#
extid_version#
payload_type#
payload#
id: bytes#
check_payload_type(attribute, value)[source]#
check_payload(attribute, value)[source]#
classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.