swh.datasets.download module

class swh.datasets.download.DatasetDownloader(local_path: Path, s3_url: str, parallelism: int = 5)

Bases: S3Downloader

Utility class to help download SWH datasets (ORC exports, for instance) from S3.

It can also resume a download when some files failed to download, for instance because of connection errors.

Example of use:

from swh.datasets.download import DatasetDownloader

# download "2025-05-18-popular-1k" ORC dataset into a sub-directory of the
# current working directory named "2025-05-18-popular-1k-orc"

dataset_downloader = DatasetDownloader(
    local_path="2025-05-18-popular-1k-orc",
    s3_url="s3://softwareheritage/graph/2025-05-18-popular-1k/orc/",
)

while not dataset_downloader.download():
    continue
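
The loop above retries indefinitely. As a hedged sketch, a bounded variant with exponential backoff could look like the following. It assumes, as the loop suggests, that download() returns a truthy value once every file has been fetched. The names download_with_retries, max_attempts and base_delay are hypothetical and not part of swh.datasets.

```python
import time
from typing import Callable


def download_with_retries(
    download: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> bool:
    """Call `download` until it succeeds or `max_attempts` is reached.

    Assumption: `download` returns a truthy value when all files were
    downloaded and a falsy value when some of them failed.
    """
    for attempt in range(max_attempts):
        if download():
            return True
        # Wait before the next attempt, doubling the delay each time;
        # no need to sleep after the last attempt.
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2**attempt)
    return False
```

With the downloader from the example above, this would be called as download_with_retries(dataset_downloader.download).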

filter_objects(objects: List[ObjectSummary]) → List[ObjectSummary]

Method that can be overridden in derived classes to filter the files to download. By default, all files are returned.

Parameters:

objects – list of files recursively discovered from the S3 directory

Returns:

filtered list of files to download
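
As an illustration, a subclass could restrict the download to a few tables of the dataset. This is a minimal sketch under two assumptions: each object exposes a key attribute (as boto3's ObjectSummary does), and the table name appears as a path component of that key. The names select_tables and TableDatasetDownloader are hypothetical, and the stand-in base class only exists so the sketch runs without swh.datasets installed.

```python
from typing import Iterable, List, Set

try:
    from swh.datasets.download import DatasetDownloader
except ImportError:
    # Minimal stand-in so the sketch runs without swh.datasets installed.
    class DatasetDownloader:  # type: ignore[no-redef]
        def filter_objects(self, objects):
            return objects


def select_tables(objects: Iterable, tables: Set[str]) -> List:
    """Keep objects whose key has one of `tables` as a path component.

    Assumption: keys look like ".../orc/<table>/<file>.orc".
    """
    return [obj for obj in objects if tables.intersection(obj.key.split("/"))]


class TableDatasetDownloader(DatasetDownloader):
    """Hypothetical subclass downloading only the tables listed below."""

    tables = {"origin", "snapshot"}

    def filter_objects(self, objects):
        return select_tables(objects, self.tables)
```

Keeping the selection logic in a plain function makes it possible to check it against fake objects without touching S3.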

post_downloads() → None

Method that can be overridden in derived classes to run post-processing once all files have been downloaded.
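
One possible use of this hook is to record what was downloaded. The sketch below writes a manifest of the files found under the download directory. It assumes the base class keeps the constructor argument available as self.local_path, which is not confirmed by this page. The names write_manifest, ManifestDownloader and MANIFEST.txt are hypothetical, and the stand-in base class only exists so the sketch runs without swh.datasets installed.

```python
from pathlib import Path

try:
    from swh.datasets.download import DatasetDownloader
except ImportError:
    # Minimal stand-in so the sketch runs without swh.datasets installed.
    class DatasetDownloader:  # type: ignore[no-redef]
        def __init__(self, local_path, s3_url, parallelism=5):
            self.local_path = Path(local_path)

        def post_downloads(self) -> None:
            pass


def write_manifest(root: Path) -> Path:
    """Write the sorted relative paths of all files under `root` to MANIFEST.txt."""
    manifest = root / "MANIFEST.txt"
    files = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        # Skip the manifest itself so that re-runs give the same result.
        if p.is_file() and p != manifest
    )
    manifest.write_text("\n".join(files) + "\n")
    return manifest


class ManifestDownloader(DatasetDownloader):
    """Hypothetical subclass writing a manifest after the download."""

    def post_downloads(self) -> None:
        # Assumption: the download directory is available as self.local_path.
        write_manifest(Path(self.local_path))
```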