swh.model.model module

Implementation of Software Heritage’s data model

See Data model for an overview of the data model.

The classes defined in this module are immutable attrs objects and enums.

All classes define a from_dict class method and a to_dict method to convert between them and msgpack-serializable objects.

exception swh.model.model.MissingData[source]

Bases: Exception

Raised by Content.with_data when it has no way of fetching the data (but not when fetching the data fails).

swh.model.model.KeyType

The type returned by BaseModel.unique_key().

alias of Union[Dict[str, str], Dict[str, bytes], bytes]

swh.model.model.hash_repr(h: bytes) str[source]
swh.model.model.freeze_optional_dict(d: Union[None, Dict[KT, VT], ImmutableDict[KT, VT]]) Optional[ImmutableDict[KT, VT]][source]
swh.model.model.dictify(value)[source]

Helper function used by BaseModel.to_dict()

swh.model.model.generic_type_validator(instance, attribute, value)[source]

Validates the type of an attribute value, whatever the attribute type.

swh.model.model.optimized_validator(type_)[source]
swh.model.model.optimize_all_validators(cls, old_fields)[source]

Processes validators to turn them into a faster version … eventually.

class swh.model.model.BaseModel[source]

Bases: object

Base class for SWH model classes.

Provides serialization/deserialization to/from Python dictionaries that are suitable for JSON/msgpack-like formats.

to_dict()[source]

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
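
The to_dict/from_dict round trip can be sketched as follows. This is only an illustration: it uses the standard library's dataclasses as a stand-in for attrs, and ToyPerson and ToyRelease are hypothetical classes, not part of swh.model.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyPerson:
    fullname: bytes


@dataclass(frozen=True)
class ToyRelease:
    name: bytes
    author: Optional[ToyPerson] = None

    def to_dict(self) -> dict:
        # dataclasses.asdict recurses into nested dataclasses,
        # much like attr.asdict does for attrs classes.
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "ToyRelease":
        # Copy the input, then rebuild nested objects before
        # calling the constructor.
        d = dict(d)
        if d.get("author") is not None:
            d["author"] = ToyPerson(**d["author"])
        return cls(**d)


release = ToyRelease(name=b"v1.0", author=ToyPerson(fullname=b"Jane <jane@example.org>"))
as_dict = release.to_dict()
round_tripped = ToyRelease.from_dict(as_dict)
```

Here as_dict contains only plain dictionaries and bytes, and round_tripped compares equal to release.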

anonymize() Optional[ModelType][source]

Returns an anonymized version of the object, if needed.

If the object model does not need/support anonymization, returns None.

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

check() None[source]

Performs internal consistency checks, and raises an error if one fails.

class swh.model.model.HashableObject[source]

Bases: object

Mixin to automatically compute object identifier hash when the associated model is instantiated.

id: bytes
compute_hash() bytes[source]

Derived model classes must implement this to compute the object hash.

This method is called by the object initialization if the id attribute is set to an empty value.
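
A minimal sketch of this pattern, assuming a frozen dataclass rather than the attrs machinery used by the module. ToyHashable and its SHA-1 hashing scheme are hypothetical and chosen only for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyHashable:
    payload: bytes
    id: bytes = b""

    def compute_hash(self) -> bytes:
        # Hypothetical hashing scheme, for illustration only.
        return hashlib.sha1(self.payload).digest()

    def __post_init__(self) -> None:
        # Fill in the identifier only when the caller left it empty.
        if not self.id:
            # Frozen dataclasses forbid normal assignment, so bypass it.
            object.__setattr__(self, "id", self.compute_hash())


auto = ToyHashable(payload=b"hello")
explicit = ToyHashable(payload=b"hello", id=b"\x01" * 20)
```

The auto instance gets a computed id, while explicit keeps the id it was given.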

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]
check() None[source]
class swh.model.model.HashableObjectWithManifest[source]

Bases: HashableObject

Derived class of HashableObject, for objects that may need to store verbatim git objects as raw_manifest to preserve original hashes.

raw_manifest: Optional[bytes] = None

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

to_dict()[source]
compute_hash() bytes[source]

Derived model classes must implement this to compute the object hash.

This method is called by the object initialization if the id attribute is set to an empty value.

check() None[source]
class swh.model.model.Person(fullname: bytes, name: Optional[bytes], email: Optional[bytes])[source]

Bases: BaseModel

Represents the author/committer of a revision or release.

Method generated by attrs for class Person.

object_type: typing_extensions.Final = 'person'
fullname
name
email
classmethod from_fullname(fullname: bytes)[source]

Returns a Person object, by guessing the name and email from the fullname, in the name <email> format.

The fullname is left unchanged.
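
The guessing step can be approximated as below. This is a simplified standalone sketch: split_fullname is a hypothetical helper, and the real method may treat malformed input (for example, a missing closing bracket) differently.

```python
from typing import Optional, Tuple


def split_fullname(fullname: bytes) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Guess (name, email) from a b"name <email>" byte string."""
    open_pos = fullname.find(b"<")
    if open_pos == -1:
        # No email part at all: treat everything as the name.
        return (fullname or None), None
    close_pos = fullname.find(b">", open_pos)
    if close_pos == -1:
        # Assumption: without a closing bracket, the email runs to the end.
        close_pos = len(fullname)
    name = fullname[:open_pos].strip() or None
    email = fullname[open_pos + 1 : close_pos]
    return name, email
```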

anonymize() Person[source]

Returns an anonymized version of the Person object.

The anonymized object is simply a Person whose fullname is the hash of the original fullname, with name and email unset.
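
As a rough illustration, anonymization of a person could look like the sketch below. ToyPerson is a hypothetical stand-in, and the choice of SHA-256 as the hash function is an assumption, since the text above does not name one.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyPerson:
    fullname: bytes
    name: Optional[bytes]
    email: Optional[bytes]

    def anonymize(self) -> "ToyPerson":
        # Assumption: SHA-256 is used as the hash function here.
        return ToyPerson(
            fullname=hashlib.sha256(self.fullname).digest(),
            name=None,
            email=None,
        )


person = ToyPerson(b"Jane Doe <jane@example.org>", b"Jane Doe", b"jane@example.org")
anonymous = person.anonymize()
```

Because the hash is deterministic, two objects with the same fullname anonymize to the same value, so they can still be grouped without exposing the name.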

classmethod from_dict(d)[source]

If the fullname is missing, constructs one using the following heuristics: if the name value is None, the fullname is the email in angle brackets; otherwise, it is the name, a space, and the email in angle brackets.
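
These heuristics amount to the following sketch. build_fullname is a hypothetical helper, and it assumes the email is always present, since the description above does not cover a missing email.

```python
from typing import Optional


def build_fullname(name: Optional[bytes], email: bytes) -> bytes:
    if name is None:
        # Only the email is known: wrap it in angle brackets.
        return b"<" + email + b">"
    # Name, a space, then the email in angle brackets.
    return name + b" <" + email + b">"
```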

class swh.model.model.Timestamp(seconds: int, microseconds: int)[source]

Bases: BaseModel

Represents a naive timestamp from a VCS.

Method generated by attrs for class Timestamp.

object_type: typing_extensions.Final = 'timestamp'
seconds
microseconds
check_seconds(attribute, value)[source]

Checks that seconds fit in a 64-bit signed integer.

check_microseconds(attribute, value)[source]

Checks that microseconds are non-negative and < 1000000.
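
The two range checks can be written as plain functions, as in this sketch. Unlike the attrs validators documented here, these hypothetical helpers take only the value, and the exact bounds and error messages are assumptions derived from the descriptions.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def check_seconds(value: int) -> None:
    # A 64-bit signed integer spans [-2**63, 2**63 - 1].
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError("seconds must fit in a 64-bit signed integer")


def check_microseconds(value: int) -> None:
    # Microseconds are a fraction of a second, so they stay below one million.
    if not 0 <= value < 1_000_000:
        raise ValueError("microseconds must be in [0, 1000000)")
```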

class swh.model.model.TimestampWithTimezone(timestamp: Timestamp, offset_bytes: bytes)[source]

Bases: BaseModel

Represents a TZ-aware timestamp from a VCS.

Method generated by attrs for class TimestampWithTimezone.

object_type: typing_extensions.Final = 'timestamp_with_timezone'
timestamp
offset_bytes

Raw git representation of the timezone, as an offset from UTC. It should follow this format: +HHMM or -HHMM (including +0000 and -0000).

However, when created from git objects, it must be the exact bytes used in the original objects, so it may differ from this format when they do.

classmethod from_numeric_offset(timestamp: Timestamp, offset: int, negative_utc: bool) TimestampWithTimezone[source]

Returns a TimestampWithTimezone instance from the old dictionary format (with offset and negative_utc instead of offset_bytes).
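
Converting the old numeric format to offset_bytes can be sketched as below. offset_bytes_from_numeric is a hypothetical standalone helper; it assumes offset is expressed in minutes and that negative_utc exists to tell -0000 apart from +0000.

```python
def offset_bytes_from_numeric(offset: int, negative_utc: bool) -> bytes:
    """Format an offset in minutes as b"+HHMM" or b"-HHMM"."""
    # An integer offset of 0 cannot carry a sign, so negative_utc
    # is what selects b"-0000" over b"+0000".
    negative = offset < 0 or negative_utc
    hours, minutes = divmod(abs(offset), 60)
    sign = "-" if negative else "+"
    return f"{sign}{hours:02}{minutes:02}".encode()
```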

classmethod from_dict(time_representation: Union[Dict, datetime, int]) TimestampWithTimezone[source]

Builds a TimestampWithTimezone from any of the formats accepted by swh.model.normalize_timestamp().

classmethod from_datetime(dt: datetime) TimestampWithTimezone[source]
to_datetime() datetime[source]

Converts to a datetime (with a timezone set to the recorded fixed UTC offset).

Beware that this conversion can be lossy: -0000 and ‘weird’ offsets cannot be represented. Also note that it may fail due to type overflow.
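
Both caveats can be reproduced with the standard library alone. The sketch below uses a hypothetical helper that takes the offset as a number of minutes, so +0000 and -0000 both arrive as 0 and yield the same datetime; very large second counts raise OverflowError.

```python
from datetime import datetime, timedelta, timezone


def to_datetime(seconds: int, microseconds: int, offset_minutes: int) -> datetime:
    tz = timezone(timedelta(minutes=offset_minutes))
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # timedelta and datetime have bounded ranges, so this can overflow.
    moment = epoch + timedelta(seconds=seconds, microseconds=microseconds)
    return moment.astimezone(tz)


dt = to_datetime(1642765364, 0, 120)
```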

classmethod from_iso8601(s)[source]

Builds a TimestampWithTimezone from an ISO8601-formatted string.

offset_minutes()[source]

Returns the offset, as a number of minutes from UTC.

>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0000"
... ).offset_minutes()
0
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0200"
... ).offset_minutes()
120
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"-0200"
... ).offset_minutes()
-120
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0530"
... ).offset_minutes()
330
class swh.model.model.Origin(url: str, id: bytes = b'')[source]

Bases: HashableObject, BaseModel

Represents a software source: a VCS and a URL.

Method generated by attrs for class Origin.

object_type: typing_extensions.Final = 'origin'
url
id: bytes
unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

swhid() ExtendedSWHID[source]

Returns a SWHID representing this origin.

class swh.model.model.OriginVisit(origin: str, date: datetime, type: str, visit: Optional[int] = None)[source]

Bases: BaseModel

Represents an origin visit with a given type at a given point in time, by a SWH loader.

Method generated by attrs for class OriginVisit.

object_type: typing_extensions.Final = 'origin_visit'
origin
date
type

Should not be set before calling ‘origin_visit_add()’.

visit
check_date(attribute, value)[source]

Checks the date has a timezone.
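
A timezone check of this kind can be sketched with the standard library. This hypothetical helper takes only the value (the documented validator also receives the attribute), and the error message is an assumption.

```python
from datetime import datetime, timezone


def check_date(value: datetime) -> None:
    # Naive datetimes report no UTC offset.
    if value.utcoffset() is None:
        raise ValueError("date must be a timezone-aware datetime")


check_date(datetime(2022, 1, 21, tzinfo=timezone.utc))  # passes silently
```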

to_dict()[source]

Serializes the date as a string and omits the visit id if it is None.

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

class swh.model.model.OriginVisitStatus(origin: str, visit: int, date: datetime, status: str, snapshot: Optional[bytes], type: Optional[str] = None, metadata: Union[None, Dict[KT, VT], ImmutableDict[KT, VT]] = None)[source]

Bases: BaseModel

Represents a visit update of an origin at a given point in time.

Method generated by attrs for class OriginVisitStatus.

object_type: typing_extensions.Final = 'origin_visit_status'
origin
visit
date
status
snapshot
type
metadata
check_date(attribute, value)[source]

Checks the date has a timezone.

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

class swh.model.model.TargetType(value)[source]

Bases: Enum

The type of content pointed to by a snapshot branch. Usually a revision or an alias.

CONTENT = 'content'
DIRECTORY = 'directory'
REVISION = 'revision'
RELEASE = 'release'
SNAPSHOT = 'snapshot'
ALIAS = 'alias'
class swh.model.model.ObjectType(value)[source]

Bases: Enum

The type of content pointed to by a release. Usually a revision.

CONTENT = 'content'
DIRECTORY = 'directory'
REVISION = 'revision'
RELEASE = 'release'
SNAPSHOT = 'snapshot'
class swh.model.model.SnapshotBranch(target: bytes, target_type: TargetType)[source]

Bases: BaseModel

Represents one of the branches of a snapshot.

Method generated by attrs for class SnapshotBranch.

object_type: typing_extensions.Final = 'snapshot_branch'
target
target_type
check_target(attribute, value)[source]

If the target type is not an alias, checks that the target is a valid sha1_git.
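
Reading this check as "unless the branch is an alias, the target must be a valid sha1_git", a standalone sketch could look like this. ToyTargetType is a hypothetical reduced enum, and treating "valid" as "exactly 20 bytes" (the size of a SHA-1 digest) is an assumption.

```python
from enum import Enum


class ToyTargetType(Enum):
    REVISION = "revision"
    ALIAS = "alias"


def check_target(target: bytes, target_type: ToyTargetType) -> None:
    # Assumption: an alias targets another branch name, so it is exempt.
    if target_type is not ToyTargetType.ALIAS and len(target) != 20:
        raise ValueError(f"Wrong length for bytes identifier: {len(target)}")
```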

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

class swh.model.model.Snapshot(branches: Union[None, Dict[KT, VT], ImmutableDict[KT, VT]], id: bytes = b'')[source]

Bases: HashableObject, BaseModel

Represents the full state of an origin at a given point in time.

Method generated by attrs for class Snapshot.

object_type: typing_extensions.Final = 'snapshot'
branches
id: bytes
classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]

Returns a SWHID representing this object.

class swh.model.model.Release(name: bytes, message: Optional[bytes], target: Optional[bytes], target_type: ObjectType, synthetic: bool, author: Optional[Person] = None, date: Optional[TimestampWithTimezone] = None, metadata: Union[None, Dict[KT, VT], ImmutableDict[KT, VT]] = None, id: bytes = b'', raw_manifest: Optional[bytes] = None)[source]

Bases: HashableObjectWithManifest, BaseModel

Method generated by attrs for class Release.

object_type: typing_extensions.Final = 'release'
name
message
target
target_type
synthetic
author
date
metadata
id: bytes
raw_manifest: Optional[bytes]

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

check_author(attribute, value)[source]

If the author is None, checks the date is None too.
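
In other words, a date without an author is rejected. A minimal sketch of that rule, using a hypothetical helper that receives both fields directly and plain placeholder types instead of Person and TimestampWithTimezone:

```python
from typing import Optional


def check_author(author: Optional[bytes], date: Optional[int]) -> None:
    # A date only makes sense when there is an author to attribute it to.
    if author is None and date is not None:
        raise ValueError("date must be None if author is None")
```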

to_dict()[source]

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]

Returns a SWHID representing this object.

anonymize() Release[source]

Returns an anonymized version of the Release object.

Anonymization consists of replacing the author with an anonymized Person object.

class swh.model.model.RevisionType(value)[source]

Bases: Enum

An enumeration.

GIT = 'git'
TAR = 'tar'
DSC = 'dsc'
SUBVERSION = 'svn'
MERCURIAL = 'hg'
CVS = 'cvs'
BAZAAR = 'bzr'
swh.model.model.tuplify_extra_headers(value: Iterable)[source]
class swh.model.model.Revision(message: Optional[bytes], author: Optional[Person], committer: Optional[Person], date: Optional[TimestampWithTimezone], committer_date: Optional[TimestampWithTimezone], type: RevisionType, directory: bytes, synthetic: bool, metadata: Union[None, Dict[KT, VT], ImmutableDict[KT, VT]] = None, parents: Tuple[bytes, ...] = (), id: bytes = b'', extra_headers: Iterable = (), raw_manifest: Optional[bytes] = None)[source]

Bases: HashableObjectWithManifest, BaseModel

Method generated by attrs for class Revision.

object_type: typing_extensions.Final = 'revision'
message
author
committer
date
committer_date
type
directory
synthetic
metadata
parents
id: bytes
extra_headers
raw_manifest: Optional[bytes]

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

check_author(attribute, value)[source]

If the author is None, checks the date is None too.

check_committer(attribute, value)[source]

If the committer is None, checks the committer_date is None too.

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]

Returns a SWHID representing this object.

anonymize() Revision[source]

Returns an anonymized version of the Revision object.

Anonymization consists of replacing the author and committer with anonymized Person objects.

class swh.model.model.DirectoryEntry(name: bytes, type: str, target: bytes, perms)[source]

Bases: BaseModel

Method generated by attrs for class DirectoryEntry.

object_type: typing_extensions.Final = 'directory_entry'
name
type
target
perms

Usually one of the values of swh.model.from_disk.DentryPerms.

check_name(attribute, value)[source]
class swh.model.model.Directory(entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: Optional[bytes] = None)[source]

Bases: HashableObjectWithManifest, BaseModel

Method generated by attrs for class Directory.

object_type: typing_extensions.Final = 'directory'
entries
id: bytes
raw_manifest: Optional[bytes]

Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.

This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.

check_entries(attribute, value)[source]
classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() CoreSWHID[source]

Returns a SWHID representing this object.

classmethod from_possibly_duplicated_entries(*, entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: Optional[bytes] = None) Tuple[bool, Directory][source]

Constructs a Directory object from a list of entries that may contain duplicated names.

This is required to represent legacy objects, that were ingested in the storage database before this check was added.

As it is impossible for a Directory instance to have more than one entry with a given name, this function computes a raw_manifest and renames one of the entries before constructing the Directory.

Returns

(is_corrupt, directory) where is_corrupt is True iff some entry names were indeed duplicated
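
The renaming step can be illustrated on bare entry names. This is a sketch only: dedupe_names is a hypothetical helper, the _0, _1, ... suffix scheme is an assumption, and the raw_manifest computation is not shown.

```python
from typing import List, Tuple


def dedupe_names(names: List[bytes]) -> Tuple[bool, List[bytes]]:
    seen = set()
    result = []
    is_corrupt = False
    for name in names:
        new_name = name
        suffix = 0
        while new_name in seen:
            # Hypothetical renaming scheme: append _0, _1, ...
            is_corrupt = True
            new_name = name + b"_" + str(suffix).encode()
            suffix += 1
        seen.add(new_name)
        result.append(new_name)
    return is_corrupt, result
```

The returned flag mirrors is_corrupt above: it is True only when at least one name had to be changed.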

class swh.model.model.BaseContent(status: str)[source]

Bases: BaseModel

Method generated by attrs for class BaseContent.

status
classmethod from_dict(d, use_subclass=True)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

get_hash(hash_name)[source]
hashes() Dict[str, bytes][source]

Returns a dictionary {hash_name: hash_value}

class swh.model.model.Content(sha1: bytes, sha1_git: bytes, sha256: bytes, blake2s256: bytes, length: int, status: str = 'visible', data: Optional[bytes] = None, ctime: Optional[datetime] = None)[source]

Bases: BaseContent

Method generated by attrs for class Content.

object_type: typing_extensions.Final = 'content'
sha1
sha1_git
sha256
blake2s256
length
status
data
ctime
check_length(attribute, value)[source]

Checks the length is non-negative.

check_ctime(attribute, value)[source]

Checks the ctime has a timezone.

to_dict()[source]

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_data(data, status='visible', ctime=None) Content[source]

Generate a Content from a given data byte string.

This populates the Content with the hashes and length for the data passed as argument, as well as the data itself.
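
The hashes and length can be computed with hashlib, as in this standalone sketch. hash_data is a hypothetical helper, and it assumes that sha1_git is the identifier git assigns to a blob, that is, the SHA-1 of a "blob <length>" header, a NUL byte, and the data.

```python
import hashlib
from typing import Dict, Union


def hash_data(data: bytes) -> Dict[str, Union[bytes, int]]:
    # Git hashes a blob as sha1(b"blob <length>\0" + data).
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(git_header + data).digest(),
        "sha256": hashlib.sha256(data).digest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).digest(),
        "length": len(data),
    }


empty = hash_data(b"")
```

For empty input, sha1_git is the well-known git empty blob identifier e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.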

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

with_data() Content[source]

Loads the data attribute, so that it is guaranteed not to be None after this call.

This call is almost a no-op, but subclasses may override this method to lazy-load data (e.g. from disk or objstorage).
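
A lazy-loading subclass could be sketched as follows. Everything here is hypothetical: ToyContent and LazyToyContent are stand-ins built on dataclasses, MissingData is a local stand-in for the module's exception, and FAKE_OBJSTORAGE is an in-memory dictionary playing the role of an object storage.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


class MissingData(Exception):
    """Stand-in for swh.model.model.MissingData."""


# Hypothetical in-memory object storage, keyed by sha1.
FAKE_OBJSTORAGE: Dict[bytes, bytes] = {b"\x01" * 20: b"hello"}


@dataclass(frozen=True)
class ToyContent:
    sha1: bytes
    data: Optional[bytes] = None

    def with_data(self) -> "ToyContent":
        # The base class has no way of fetching missing data.
        if self.data is None:
            raise MissingData("content data is not available")
        return self


@dataclass(frozen=True)
class LazyToyContent(ToyContent):
    def with_data(self) -> "ToyContent":
        if self.data is not None:
            return self
        # A failed lookup (KeyError here) is a fetch failure,
        # which is distinct from MissingData.
        return replace(self, data=FAKE_OBJSTORAGE[self.sha1])


loaded = LazyToyContent(sha1=b"\x01" * 20).with_data()
```

Since the objects are immutable, with_data returns a new object carrying the data instead of mutating the original.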

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

swhid() CoreSWHID[source]

Returns a SWHID representing this object.

class swh.model.model.SkippedContent(sha1: Optional[bytes], sha1_git: Optional[bytes], sha256: Optional[bytes], blake2s256: Optional[bytes], length: Optional[int], status: str, reason: Optional[str] = None, origin: Optional[str] = None, ctime: Optional[datetime] = None)[source]

Bases: BaseContent

Method generated by attrs for class SkippedContent.

object_type: typing_extensions.Final = 'skipped_content'
sha1
sha1_git
sha256
blake2s256
length
status
reason
origin
ctime
check_reason(attribute, value)[source]

Checks the reason is filled in if status != absent.

check_length(attribute, value)[source]

Checks the length is non-negative or -1.

check_ctime(attribute, value)[source]

Checks the ctime has a timezone.

to_dict()[source]

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_data(data: bytes, reason: str, ctime: Optional[datetime] = None) SkippedContent[source]

Generate a SkippedContent from a given data byte string.

This populates the SkippedContent with the hashes and length for the data passed as argument.

You can use attr.evolve on such a generated content to nullify some of its attributes, e.g. for tests.

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

class swh.model.model.MetadataAuthorityType(value)[source]

Bases: Enum

An enumeration.

DEPOSIT_CLIENT = 'deposit_client'
FORGE = 'forge'
REGISTRY = 'registry'
class swh.model.model.MetadataAuthority(type: MetadataAuthorityType, url: str, metadata: Union[None, Dict[KT, VT], ImmutableDict[KT, VT]] = None)[source]

Bases: BaseModel

Represents an entity that provides metadata about an origin or software artifact.

Method generated by attrs for class MetadataAuthority.

object_type: typing_extensions.Final = 'metadata_authority'
type
url
metadata
to_dict()[source]

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

class swh.model.model.MetadataFetcher(name: str, version: str, metadata: Union[None, Dict[KT, VT], ImmutableDict[KT, VT]] = None)[source]

Bases: BaseModel

Represents a software component used to fetch metadata from a metadata authority, and ingest them into the Software Heritage archive.

Method generated by attrs for class MetadataFetcher.

object_type: typing_extensions.Final = 'metadata_fetcher'
name
version
metadata
to_dict()[source]

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

unique_key() Union[Dict[str, str], Dict[str, bytes], bytes][source]

Returns a unique key for this object, that can be used for deduplication.

swh.model.model.normalize_discovery_date(value: Any) datetime[source]
class swh.model.model.RawExtrinsicMetadata(target: ExtendedSWHID, discovery_date: Any, authority: MetadataAuthority, fetcher: MetadataFetcher, format: str, metadata: bytes, origin: Optional[str] = None, visit: Optional[int] = None, snapshot: Optional[CoreSWHID] = None, release: Optional[CoreSWHID] = None, revision: Optional[CoreSWHID] = None, path: Optional[bytes] = None, directory: Optional[CoreSWHID] = None, id: bytes = b'')[source]

Bases: HashableObject, BaseModel

Method generated by attrs for class RawExtrinsicMetadata.

object_type: typing_extensions.Final = 'raw_extrinsic_metadata'
target
discovery_date
authority
fetcher
format
metadata
origin
visit
snapshot
release
revision
path
directory
id: bytes
check_origin(attribute, value)[source]
check_visit(attribute, value)[source]
check_snapshot(attribute, value)[source]
check_release(attribute, value)[source]
check_revision(attribute, value)[source]
check_path(attribute, value)[source]
check_directory(attribute, value)[source]
to_dict()[source]

Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

swhid() ExtendedSWHID[source]

Returns a SWHID representing this RawExtrinsicMetadata object.

class swh.model.model.ExtID(extid_type: str, extid: bytes, target: CoreSWHID, extid_version: int = 0, id: bytes = b'')[source]

Bases: HashableObject, BaseModel

Method generated by attrs for class ExtID.

object_type: typing_extensions.Final = 'extid'
extid_type
extid
target
extid_version
id: bytes
classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.