swh.dataset.exporters.orc module

swh.dataset.exporters.orc.hash_to_hex_or_none(hash)
swh.dataset.exporters.orc.swh_date_to_tuple(obj)
swh.dataset.exporters.orc.datetime_to_tuple(obj: datetime | None) → Tuple[int, int] | None
class swh.dataset.exporters.orc.SWHTimestampConverter

Bases: object

An ORCConverter-compatible class that converts timestamps to and from ORC files.

In Python, timestamps are given as a pair (seconds, microseconds); in the ORC file they are serialized as a pair (seconds, nanoseconds).

This converter is reimplemented because we do not want the Python objects that are converted to ORC timestamps to be Python datetime objects, since swh.model’s Timestamp cannot be converted to a Python datetime object without loss.

static from_orc(seconds: int, nanoseconds: int, timezone: Any) → Tuple[int, int]
static to_orc(obj: Tuple[int, int] | None, timezone: Any) → Tuple[int, int] | None
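
The unit conversion described above can be illustrated with a minimal sketch. The helper names `to_orc_pair` and `from_orc_pair` are hypothetical and are not part of swh.dataset; the sketch only assumes that the Python side holds (seconds, microseconds) and the ORC side holds (seconds, nanoseconds), as stated in the class description, and it ignores the timezone argument.

```python
from typing import Optional, Tuple


def to_orc_pair(ts: Optional[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: (seconds, microseconds) -> (seconds, nanoseconds).

    None is passed through unchanged, mirroring the Optional in the
    documented to_orc signature.
    """
    if ts is None:
        return None
    seconds, microseconds = ts
    return (seconds, microseconds * 1000)


def from_orc_pair(seconds: int, nanoseconds: int) -> Tuple[int, int]:
    """Hypothetical helper: (seconds, nanoseconds) -> (seconds, microseconds)."""
    return (seconds, nanoseconds // 1000)


# Round trip: a pair written to ORC and read back is unchanged.
original = (1_600_000_000, 123_456)
stored = to_orc_pair(original)
restored = from_orc_pair(*stored)
```

Because both sides are plain integer pairs, the round trip never goes through a Python datetime object.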
class swh.dataset.exporters.orc.ORCExporter(*args, **kwargs)

Bases: ExporterDispatch

Implementation of an exporter which writes the entire graph dataset as ORC files. Useful for large-scale processing, notably on cloud services (e.g. BigQuery, Amazon Athena, Azure).

maybe_close_writer_for(table_name: str)
get_writer_for(table_name: str, unique_id=None)
process_origin(origin)
process_origin_visit(visit)
process_origin_visit_status(visit_status)
process_snapshot(snapshot)
process_release(release)
process_revision(revision)
process_directory(directory)
process_content(content)
process_skipped_content(skipped_content)
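
The `process_*` method names suggest one handler per object type. This page does not describe how ExporterDispatch routes objects to these handlers, so the following is only a sketch of one plausible pattern, assuming a name-based lookup. The class `DispatchSketch`, its `process_object` method, and the dictionary shapes are all hypothetical.

```python
from typing import Any, Dict, List, Tuple


class DispatchSketch:
    """Hypothetical stand-in for a dispatching exporter (not the swh API)."""

    def __init__(self) -> None:
        self.seen: List[Tuple[str, Any]] = []

    def process_object(self, object_type: str, obj: Dict[str, Any]) -> None:
        # Assumption: the handler is looked up by name, "process_<type>".
        handler = getattr(self, "process_" + object_type, None)
        if handler is not None:
            handler(obj)

    def process_origin(self, origin: Dict[str, Any]) -> None:
        self.seen.append(("origin", origin["url"]))

    def process_revision(self, revision: Dict[str, Any]) -> None:
        self.seen.append(("revision", revision["id"]))


exporter = DispatchSketch()
exporter.process_object("origin", {"url": "https://example.org/repo.git"})
exporter.process_object("revision", {"id": "abc123"})
exporter.process_object("unknown_type", {})  # no handler defined, so ignored
```

In this sketch, object types without a matching handler are silently skipped; a real exporter may behave differently.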