swh.model.model module#
Implementation of Software Heritage’s data model
See Data model for an overview of the data model.
The classes defined in this module are immutable attrs objects and enums.
All classes define a from_dict class method and a to_dict method to convert between them and msgpack-serializable objects.
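The round trip between an immutable object and a plain dictionary can be sketched as follows. This is a minimal illustration that uses the standard library's dataclasses in place of attrs; the class name PersonSketch and its simplified conversion logic are assumptions made for the example, not the module's implementation.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class PersonSketch:
    """Hypothetical stand-in for a model class; not swh.model's Person."""

    fullname: bytes
    name: Optional[bytes]
    email: Optional[bytes]

    def to_dict(self) -> Dict[str, Any]:
        # Plain dict of built-in types, suitable for a msgpack-like encoder.
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "PersonSketch":
        # Rebuild the immutable object from its dictionary form.
        return cls(**d)


person = PersonSketch(
    fullname=b"Jane Doe <jane@example.org>",
    name=b"Jane Doe",
    email=b"jane@example.org",
)
round_tripped = PersonSketch.from_dict(person.to_dict())
```

Because the object is frozen, any change has to go through a new instance rather than attribute assignment.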
- exception swh.model.model.MissingData[source]#
Bases:
Exception
Raised by Content.with_data when it has no way of fetching the data (but not when fetching the data fails).
- swh.model.model.KeyType#
The type returned by BaseModel.unique_key().
- swh.model.model.freeze_optional_dict(d: None | Dict | ImmutableDict) ImmutableDict | None [source]#
- swh.model.model.generic_type_validator(instance, attribute, value)[source]#
Validates the type of an attribute value, whatever the attribute type.
- swh.model.model.optimize_all_validators(cls, old_fields)[source]#
process validators to turn them into a faster version … eventually
- class swh.model.model.ModelObjectType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
_StringCompatibleEnum
Possible object types of Model object
- CONTENT = 'content'#
- DIRECTORY = 'directory'#
- DIRECTORY_ENTRY = 'directory_entry'#
- EXTID = 'extid'#
- METADATA_AUTHORITY = 'metadata_authority'#
- METADATA_FETCHER = 'metadata_fetcher'#
- ORIGIN = 'origin'#
- ORIGIN_VISIT = 'origin_visit'#
- ORIGIN_VISIT_STATUS = 'origin_visit_status'#
- PERSON = 'person'#
- RAW_EXTRINSIC_METADATA = 'raw_extrinsic_metadata'#
- RELEASE = 'release'#
- REVISION = 'revision'#
- SKIPPED_CONTENT = 'skipped_content'#
- SNAPSHOT = 'snapshot'#
- SNAPSHOT_BRANCH = 'snapshot_branch'#
- TIMESTAMP = 'timestamp'#
- TIMESTAMP_WITH_TIMEZONE = 'timestamp_with_timezone'#
- class swh.model.model.BaseModel[source]#
Bases:
ABC
Base class for SWH model classes.
Provides serialization/deserialization to/from Python dictionaries that are suitable for JSON/msgpack-like formats.
- abstract property object_type: ModelObjectType#
- to_dict()[source]#
Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.
- classmethod from_dict(d)[source]#
Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
- evolve(**kwargs) ModelType [source]#
Alias to call attr.evolve() on this object, returning a new object.
- anonymize() ModelType | None [source]#
Returns an anonymized version of the object, if needed.
If the object model does not need/support anonymization, returns None.
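Since model objects are immutable, evolve() returns a modified copy and leaves the original untouched. A minimal sketch of that pattern, assuming the standard library's dataclasses.replace as an analogue of attr.evolve() and a hypothetical TimestampSketch class:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TimestampSketch:
    """Hypothetical immutable value object; not swh.model's Timestamp."""

    seconds: int
    microseconds: int

    def evolve(self, **kwargs) -> "TimestampSketch":
        # dataclasses.replace builds a new instance; the original is untouched.
        return replace(self, **kwargs)


original = TimestampSketch(seconds=1642765364, microseconds=0)
updated = original.evolve(microseconds=500)
```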
- class swh.model.model.BaseHashableModel[source]#
Mixin to automatically compute object identifier hash when the associated model is instantiated.
- compute_hash() bytes [source]#
Derived model classes must implement this to compute the object hash.
This method is called by the object initialization if the id attribute is set to an empty value.
- evolve(**kwargs) HashableModelType [source]#
Alias to call attr.evolve() on this object, returning a new object with its id recomputed based on the content.
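The interplay between compute_hash(), an empty id, and evolve() can be sketched with a frozen dataclass. Everything here is an assumption made for illustration: the HashedNote class, the use of a bare SHA-1 over one field, and the rule that evolve() resets the id unless one is passed explicitly.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HashedNote:
    """Hypothetical hashable object; the hashing scheme is illustrative only."""

    text: bytes
    id: bytes = b""

    def __post_init__(self) -> None:
        # Frozen dataclasses forbid normal assignment, so bypass it once here.
        if not self.id:
            object.__setattr__(self, "id", self.compute_hash())

    def compute_hash(self) -> bytes:
        return hashlib.sha1(self.text).digest()

    def evolve(self, **kwargs) -> "HashedNote":
        # Reset the id unless the caller supplied one, so it is recomputed.
        kwargs.setdefault("id", b"")
        return replace(self, **kwargs)


note = HashedNote(text=b"first")
edited = note.evolve(text=b"second")
```

The design point is that the identifier is derived from the content, so a copy with different content cannot silently keep the old identifier.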
- swh.model.model.HashableObject#
alias of
BaseHashableModel
- class swh.model.model.HashableObjectWithManifest[source]#
Bases:
BaseHashableModel
Derived class of BaseHashableModel, for objects that may need to store verbatim git objects as raw_manifest to preserve original hashes.
- raw_manifest: bytes | None = None#
Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.
This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.
- to_dict()[source]#
Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.
- class swh.model.model.Person(fullname: bytes, name: bytes | None, email: bytes | None)[source]#
Bases:
BaseModel
Represents the author/committer of a revision or release.
Method generated by attrs for class Person.
- fullname#
- name#
- email#
- classmethod from_fullname(fullname: bytes)[source]#
Returns a Person object, by guessing the name and email from the fullname, in the name <email> format.
The fullname is left unchanged.
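One way to guess a name and email from the name <email> format is shown below. The helper split_fullname and its handling of edge cases are hypothetical; the heuristic actually used by from_fullname may differ.

```python
from typing import Optional, Tuple


def split_fullname(fullname: bytes) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Guess (name, email) from b"name <email>"; illustrative heuristic only."""
    open_idx = fullname.find(b"<")
    if open_idx == -1:
        # No email part: treat the whole string as the name.
        return (fullname or None, None)
    close_idx = fullname.find(b">", open_idx)
    if close_idx == -1:
        # Unterminated bracket: take everything after "<" as the email.
        raw_email = fullname[open_idx + 1:]
    else:
        raw_email = fullname[open_idx + 1:close_idx]
    name = fullname[:open_idx].strip()
    return (name or None, raw_email or None)
```

The input bytes are only read, never modified, which mirrors the note that the fullname is left unchanged.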
- class swh.model.model.Timestamp(seconds: int, microseconds: int)[source]#
Bases:
BaseModel
Represents a naive timestamp from a VCS.
Method generated by attrs for class Timestamp.
- seconds#
- microseconds#
- class swh.model.model.TimestampWithTimezone(timestamp: Timestamp, offset_bytes: bytes)[source]#
Bases:
BaseModel
Represents a TZ-aware timestamp from a VCS.
Method generated by attrs for class TimestampWithTimezone.
- timestamp#
- offset_bytes#
Raw git representation of the timezone, as an offset from UTC. It should follow this format: +HHMM or -HHMM (including +0000 and -0000).
However, when created from git objects, it must be the exact bytes used in the original objects, so it may differ from this format when they do.
- classmethod from_numeric_offset(timestamp: Timestamp, offset: int, negative_utc: bool) TimestampWithTimezone [source]#
Returns a TimestampWithTimezone instance from the old dictionary format (with offset and negative_utc instead of offset_bytes).
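The conversion from a numeric offset plus a negative_utc flag to +HHMM / -HHMM bytes can be sketched as below. The helper format_offset is hypothetical, and it assumes the numeric offset is expressed in minutes.

```python
def format_offset(offset_minutes: int, negative_utc: bool = False) -> bytes:
    """Render a numeric UTC offset as +HHMM / -HHMM bytes (hypothetical helper)."""
    # negative_utc distinguishes -0000 from +0000, which an integer cannot do.
    is_negative = offset_minutes < 0 or (offset_minutes == 0 and negative_utc)
    sign = "-" if is_negative else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return f"{sign}{hours:02d}{minutes:02d}".encode("ascii")
```

The separate flag exists because the integer 0 has no sign, so -0000 would otherwise be indistinguishable from +0000.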
- classmethod from_dict(time_representation: Dict | datetime | int) TimestampWithTimezone [source]#
Builds a TimestampWithTimezone from any of the formats accepted by swh.model.normalize_timestamp().
- classmethod from_datetime(dt: datetime) TimestampWithTimezone [source]#
- to_datetime() datetime [source]#
Convert to a datetime (with a timezone set to the recorded fixed UTC offset).
Beware that this conversion can be lossy: -0000 and ‘weird’ offsets cannot be represented. Also note that it may fail due to type overflow.
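The lossiness and the overflow risk can both be seen with the standard library alone. The helper to_datetime_sketch is a hypothetical illustration that takes the offset as an integer number of minutes; it is not the method's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def to_datetime_sketch(seconds: int, microseconds: int, offset_minutes: int) -> datetime:
    """Build an aware datetime with a fixed UTC offset (hypothetical helper)."""
    tz = timezone(timedelta(minutes=offset_minutes))
    # Start from the epoch and add the components to avoid float rounding.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    moment = epoch + timedelta(seconds=seconds, microseconds=microseconds)
    return moment.astimezone(tz)


dt = to_datetime_sketch(1642765364, 0, 120)
```

A fixed-offset timezone stores a signed duration, so -0000 and +0000 both become a zero offset, and very large second counts raise OverflowError because they fall outside the range datetime supports.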
- classmethod from_iso8601(s)[source]#
Builds a TimestampWithTimezone from an ISO8601-formatted string.
- offset_minutes()[source]#
Returns the offset, as a number of minutes since UTC.
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0000"
... ).offset_minutes()
0
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0200"
... ).offset_minutes()
120
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"-0200"
... ).offset_minutes()
-120
>>> TimestampWithTimezone(
...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0530"
... ).offset_minutes()
330
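For readers without swh.model installed, the same arithmetic can be reproduced with a standalone parser. The helper offset_minutes_sketch is hypothetical; in particular, raising ValueError on malformed input is a choice made for this sketch and may not match how the library treats offsets that deviate from the format.

```python
import re

# Sign, two hour digits, two minute digits; nothing else allowed.
_OFFSET_RE = re.compile(rb"([+-])(\d{2})(\d{2})")


def offset_minutes_sketch(offset_bytes: bytes) -> int:
    """Parse +HHMM / -HHMM bytes into signed minutes (hypothetical helper)."""
    match = _OFFSET_RE.fullmatch(offset_bytes)
    if match is None:
        raise ValueError(f"unexpected offset format: {offset_bytes!r}")
    sign, hours, minutes = match.groups()
    total = int(hours) * 60 + int(minutes)
    return -total if sign == b"-" else total
```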
- class swh.model.model.Origin(url: str, id: bytes = b'')[source]#
Bases:
BaseHashableModel
Represents a software source: a VCS and a URL.
Method generated by attrs for class Origin.
- url#
- unique_key() Dict[str, str] | Dict[str, bytes] | bytes [source]#
Returns a unique key for this object, that can be used for deduplication.
- swhid() ExtendedSWHID [source]#
Returns a SWHID representing this origin.
- class swh.model.model.OriginVisit(origin: str, date: datetime, type: str, visit: int | None = None)[source]#
Bases:
BaseModel
Represents an origin visit with a given type at a given point in time, by a SWH loader.
Method generated by attrs for class OriginVisit.
- origin#
- date#
- type#
Should not be set before calling ‘origin_visit_add()’.
- visit#
- class swh.model.model.OriginVisitStatus(origin: str, visit: int, date: datetime, status: str, snapshot: bytes | None, type: str | None = None, metadata: None | Dict | ImmutableDict = None)[source]#
Bases:
BaseModel
Represents a visit update of an origin at a given point in time.
Method generated by attrs for class OriginVisitStatus.
- origin#
- visit#
- date#
- status#
- snapshot#
- type#
- metadata#
- unique_key() Dict[str, str] | Dict[str, bytes] | bytes [source]#
Returns a unique key for this object, that can be used for deduplication.
- origin_swhid() ExtendedSWHID [source]#
- class swh.model.model.SnapshotTargetType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
The type of content pointed to by a snapshot branch. Usually a revision or an alias.
- CONTENT = 'content'#
- DIRECTORY = 'directory'#
- REVISION = 'revision'#
- RELEASE = 'release'#
- SNAPSHOT = 'snapshot'#
- ALIAS = 'alias'#
- swh.model.model.TargetType#
alias of
SnapshotTargetType
- class swh.model.model.ReleaseTargetType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
The type of content pointed to by a release. Usually a revision.
- CONTENT = 'content'#
- DIRECTORY = 'directory'#
- REVISION = 'revision'#
- RELEASE = 'release'#
- SNAPSHOT = 'snapshot'#
- swh.model.model.ObjectType#
alias of
ReleaseTargetType
- class swh.model.model.SnapshotBranch(target: bytes, target_type: SnapshotTargetType)[source]#
Bases:
BaseModel
Represents one of the branches of a snapshot.
Method generated by attrs for class SnapshotBranch.
- target#
- target_type#
- check_target(attribute, value)[source]#
If the target type is not an alias, checks that the target is a valid sha1_git.
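A validator of this kind can be sketched as follows, assuming that a valid sha1_git here means a 20-byte SHA-1 digest and that alias branches carry another branch's name instead of a hash. The enum TargetTypeSketch and the function are hypothetical stand-ins for the attrs validator.

```python
from enum import Enum


class TargetTypeSketch(Enum):
    """Reduced, hypothetical version of the snapshot target type enum."""

    REVISION = "revision"
    ALIAS = "alias"


def check_target(target_type: TargetTypeSketch, target: bytes) -> None:
    """Raise ValueError if a non-alias target is not 20 bytes long."""
    # An alias holds another branch's name, so no length constraint applies.
    if target_type is not TargetTypeSketch.ALIAS and len(target) != 20:
        raise ValueError(f"wrong length for sha1_git: {len(target)}")
```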
- class swh.model.model.Snapshot(branches: None | Dict | ImmutableDict, id: bytes = b'')[source]#
Bases:
BaseHashableModel
Represents the full state of an origin at a given point in time.
Method generated by attrs for class Snapshot.
- branches#
- class swh.model.model.Release(name: bytes, message: bytes | None, target: bytes | None, target_type: ReleaseTargetType, synthetic: bool, author: Person | None = None, date: TimestampWithTimezone | None = None, metadata: None | Dict | ImmutableDict = None, id: bytes = b'', raw_manifest: bytes | None = None)[source]#
Bases:
HashableObjectWithManifest, BaseModel
Method generated by attrs for class Release.
- name#
- message#
- target#
- target_type#
- synthetic#
- author#
- date#
- metadata#
- raw_manifest: bytes | None#
Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.
This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.
- to_dict()[source]#
Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.
- classmethod from_dict(d)[source]#
Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
- class swh.model.model.RevisionType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
- GIT = 'git'#
- TAR = 'tar'#
- DSC = 'dsc'#
- SUBVERSION = 'svn'#
- MERCURIAL = 'hg'#
- CVS = 'cvs'#
- BAZAAR = 'bzr'#
- class swh.model.model.Revision(message: bytes | None, author: Person | None, committer: Person | None, date: TimestampWithTimezone | None, committer_date: TimestampWithTimezone | None, type: RevisionType, directory: bytes, synthetic: bool, metadata: None | Dict | ImmutableDict = None, parents: Tuple[bytes, ...] = (), id: bytes = b'', extra_headers: Iterable = (), raw_manifest: bytes | None = None)[source]#
Bases:
HashableObjectWithManifest, BaseModel
Method generated by attrs for class Revision.
- message#
- author#
- committer#
- date#
- committer_date#
- type#
- directory#
- synthetic#
- metadata#
- parents#
- extra_headers#
- raw_manifest: bytes | None#
Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.
This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.
- check_committer(attribute, value)[source]#
If the committer is None, checks the committer_date is None too.
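The rule amounts to a cross-field consistency check. A minimal sketch, with simplified argument types (bytes for the committer, an integer for the date) that are assumptions of this example and not the types used by Revision:

```python
from typing import Optional


def check_committer(committer: Optional[bytes], committer_date: Optional[int]) -> None:
    """Reject a committer_date that has no committer to go with it."""
    if committer is None and committer_date is not None:
        raise ValueError("committer_date must be None when committer is None")
```

The check is one-directional: a committer without a date passes, a date without a committer does not.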
- classmethod from_dict(d)[source]#
Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
- class swh.model.model.DirectoryEntry(name: bytes, type: str, target: bytes, perms)[source]#
Bases:
BaseModel
Method generated by attrs for class DirectoryEntry.
- name#
- type#
- target#
- perms#
Usually one of the values of swh.model.from_disk.DentryPerms.
- DIR_ENTRY_TYPE_TO_SWHID_OBJECT_TYPE = {'dir': ObjectType.DIRECTORY, 'file': ObjectType.CONTENT, 'rev': ObjectType.REVISION}#
- class swh.model.model.Directory(entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: bytes | None = None)[source]#
Bases:
HashableObjectWithManifest, BaseModel
Method generated by attrs for class Directory.
- entries#
- raw_manifest: bytes | None#
Stores the original content of git objects when they cannot be faithfully represented using only the other attributes.
This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model.
- classmethod from_dict(d)[source]#
Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
- classmethod from_possibly_duplicated_entries(*, entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: bytes | None = None) Tuple[bool, Directory] [source]#
Constructs a Directory object from a list of entries that may contain duplicated names.
This is required to represent legacy objects that were ingested in the storage database before this check was added.
As it is impossible for a Directory instance to have more than one entry with a given name, this function computes a raw_manifest and renames one of the entries before constructing the Directory.
- Returns:
(is_corrupt, directory) where is_corrupt is True iff some entry names were indeed duplicated
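The renaming step and the is_corrupt flag can be illustrated on plain tuples. The suffix scheme (_1, _2, ...) and the helper name are hypothetical, chosen only to make the example concrete; the sketch also leaves out the raw_manifest computation.

```python
from typing import List, Tuple

# (name, target): simplified stand-in for DirectoryEntry.
Entry = Tuple[bytes, bytes]


def dedupe_entry_names(entries: List[Entry]) -> Tuple[bool, List[Entry]]:
    """Return (is_corrupt, entries) with later duplicates renamed."""
    seen = set()
    result = []
    is_corrupt = False
    for name, target in entries:
        new_name = name
        counter = 0
        # Hypothetical renaming scheme: append _1, _2, ... until unique.
        while new_name in seen:
            is_corrupt = True
            counter += 1
            new_name = name + b"_" + str(counter).encode("ascii")
        seen.add(new_name)
        result.append((new_name, target))
    return is_corrupt, result
```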
- class swh.model.model.BaseContent(status: str)[source]#
Method generated by attrs for class BaseContent.
- status#
- class swh.model.model.Content(sha1: bytes, sha1_git: bytes, sha256: bytes, blake2s256: bytes, length: int, status: str = 'visible', data: bytes | None = None, get_data: Callable[[], bytes] | None = None, ctime: datetime | None = None)[source]#
Bases:
BaseContent
Method generated by attrs for class Content.
- sha1#
- sha1_git#
- sha256#
- blake2s256#
- length#
- status#
- data#
- get_data#
- ctime#
- to_dict()[source]#
Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.
- classmethod from_data(data, status='visible', ctime=None) Content [source]#
Generate a Content from a given data byte string.
This populates the Content with the hashes and length for the data passed as argument, as well as the data itself.
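The fields that get populated can be approximated with hashlib. This sketch assumes that sha1_git is the git blob hash (SHA-1 over a "blob <length>\0" header followed by the data) and that blake2s256 is BLAKE2s with a 32-byte digest; the function content_fields is a hypothetical helper that returns a dictionary, not a Content object.

```python
import hashlib
from typing import Dict, Union


def content_fields(data: bytes) -> Dict[str, Union[bytes, int]]:
    """Compute the hashes and length that describe a byte string."""
    # sha1_git follows git's blob hashing: a "blob <length>\0" header, then data.
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(git_header + data).digest(),
        "sha256": hashlib.sha256(data).digest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).digest(),
        "length": len(data),
        "data": data,
    }


fields = content_fields(b"")
```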
- classmethod from_dict(d)[source]#
Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
- with_data(raise_if_missing: bool = True) Content [source]#
Loads the data attribute if get_data is not None.
This call is almost a no-op, but subclasses may override this method to lazy-load data (e.g. from disk or objstorage).
- Parameters:
raise_if_missing – if True (default), raise MissingData exception if no data is attached to content object
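The lazy-loading behaviour can be sketched with a frozen dataclass that carries an optional callback. The names LazyContent and MissingDataSketch are hypothetical, and the branching below is one plausible reading of the description, not the library's code.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


class MissingDataSketch(Exception):
    """Hypothetical stand-in for swh.model.model.MissingData."""


@dataclass(frozen=True)
class LazyContent:
    data: Optional[bytes] = None
    get_data: Optional[Callable[[], bytes]] = None

    def with_data(self, raise_if_missing: bool = True) -> "LazyContent":
        if self.data is not None:
            # Already loaded: nothing to do.
            return self
        if self.get_data is not None:
            # Fetch lazily and return a new object carrying the bytes.
            return replace(self, data=self.get_data())
        if raise_if_missing:
            raise MissingDataSketch("no data and no get_data callback")
        return self


loaded = LazyContent(get_data=lambda: b"hello").with_data()
```

An error raised inside the callback would propagate unchanged, which matches the distinction between having no way to fetch the data and a fetch that fails.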
- class swh.model.model.SkippedContent(sha1: bytes | None, sha1_git: bytes | None, sha256: bytes | None, blake2s256: bytes | None, length: int | None, status: str, reason: str | None = None, origin: str | None = None, ctime: datetime | None = None)[source]#
Bases:
BaseContent
Method generated by attrs for class SkippedContent.
- sha1#
- sha1_git#
- sha256#
- blake2s256#
- length#
- status#
- reason#
- origin#
- ctime#
- to_dict()[source]#
Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.
- classmethod from_data(data: bytes, reason: str, ctime: datetime | None = None) SkippedContent [source]#
Generate a SkippedContent from a given data byte string.
This populates the SkippedContent with the hashes and length for the data passed as argument.
You can use attr.evolve on such a generated content to nullify some of its attributes, e.g. for tests.
- classmethod from_dict(d)[source]#
Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
- class swh.model.model.MetadataAuthorityType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
- DEPOSIT_CLIENT = 'deposit_client'#
- FORGE = 'forge'#
- REGISTRY = 'registry'#
- class swh.model.model.MetadataAuthority(type: MetadataAuthorityType, url: str, metadata: None | Dict | ImmutableDict = None)[source]#
Bases:
BaseModel
Represents an entity that provides metadata about an origin or software artifact.
Method generated by attrs for class MetadataAuthority.
- type#
- url#
- metadata#
- to_dict()[source]#
Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.
- class swh.model.model.MetadataFetcher(name: str, version: str, metadata: None | Dict | ImmutableDict = None)[source]#
Bases:
BaseModel
Represents a software component used to fetch metadata from a metadata authority, and ingest them into the Software Heritage archive.
Method generated by attrs for class MetadataFetcher.
- name#
- version#
- metadata#
- class swh.model.model.RawExtrinsicMetadata(target: ExtendedSWHID, discovery_date: Any, authority: MetadataAuthority, fetcher: MetadataFetcher, format: str, metadata: bytes, origin: str | None = None, visit: int | None = None, snapshot: CoreSWHID | None = None, release: CoreSWHID | None = None, revision: CoreSWHID | None = None, path: bytes | None = None, directory: CoreSWHID | None = None, id: bytes = b'')[source]#
Bases:
BaseHashableModel
Method generated by attrs for class RawExtrinsicMetadata.
- target#
- discovery_date#
- authority#
- fetcher#
- format#
- metadata#
- origin#
- visit#
- snapshot#
- release#
- revision#
- path#
- directory#
- to_dict()[source]#
Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields.
- classmethod from_dict(d)[source]#
Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.
- swhid() ExtendedSWHID [source]#
Returns a SWHID representing this RawExtrinsicMetadata object.
- class swh.model.model.ExtID(extid_type: str, extid: bytes, target: CoreSWHID, extid_version: int = 0, payload_type: str | None = None, payload: bytes | None = None, id: bytes = b'')[source]#
Bases:
BaseHashableModel
Method generated by attrs for class ExtID.
- extid_type#
- extid#
- target#
- extid_version#
- payload_type#
- payload#