swh.datasets.download module
- class swh.datasets.download.DatasetDownloader(local_path: Path, s3_url: str, parallelism: int = 5)
Bases: S3Downloader
Utility class that helps download SWH datasets (ORC exports, for instance) from S3.
It can also resume a download when some files fail to be fetched (for instance, because of connection errors).
Example of use:
    from swh.datasets.download import DatasetDownloader

    # download "2025-05-18-popular-1k" ORC dataset into a sub-directory of the
    # current working directory named "2025-05-18-popular-1k-orc"
    dataset_downloader = DatasetDownloader(
        local_path="2025-05-18-popular-1k-orc",
        s3_url="s3://softwareheritage/graph/2025-05-18-popular-1k/orc/",
    )

    # download() is called again until it reports success, so files that
    # failed on a previous pass are retried
    while not dataset_downloader.download():
        continue
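The loop in the example retries without limit. A bounded variant could look like the sketch below. It assumes only what the example implies, namely that download() returns a truthy value once every file has been fetched. The helper name download_with_retries, its parameters, and the fake_download stand-in are hypothetical and not part of swh.datasets.

```python
import time
from typing import Callable


def download_with_retries(
    download: Callable[[], bool],
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
) -> bool:
    """Call `download` until it returns True or `max_attempts` is reached.

    Hypothetical helper: `download` is assumed to behave like
    DatasetDownloader.download in the example, i.e. to return a truthy
    value only when all files were fetched.
    """
    for attempt in range(1, max_attempts + 1):
        if download():
            return True
        if attempt < max_attempts:
            # Optional pause before the next pass
            time.sleep(delay_seconds)
    return False


# Stand-in for a real downloader: fails twice, then succeeds.
outcomes = iter([False, False, True])
calls = []


def fake_download() -> bool:
    calls.append(1)
    return next(outcomes)


succeeded = download_with_retries(fake_download, max_attempts=5)
print(succeeded, len(calls))
```

With a real downloader, the bound method would be passed instead of the stand-in, for example download_with_retries(dataset_downloader.download).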
- filter_objects(objects: List[ObjectSummary]) → List[ObjectSummary]
Method that derived classes can override to filter the files to download. By default, it returns all files.
- Parameters:
objects – list of files recursively discovered from the S3 directory
- Returns:
filtered list of files to download
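As an illustration of overriding filter_objects, the sketch below keeps only files whose key ends with a given suffix. It assumes that each ObjectSummary exposes a key attribute holding the S3 object key, as boto3's S3 ObjectSummary does. The names keep_only_suffix, OrcOnlyDownloader, and FakeObjectSummary, as well as the sample keys, are hypothetical. The subclass is only defined when swh.datasets is installed, so the filtering logic can be tried on its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeObjectSummary:
    """Hypothetical stand-in for an S3 ObjectSummary; only `key` is assumed."""

    key: str


def keep_only_suffix(objects: List, suffix: str = ".orc") -> List:
    """Return the objects whose key ends with `suffix`, preserving order."""
    return [obj for obj in objects if obj.key.endswith(suffix)]


try:
    from swh.datasets.download import DatasetDownloader

    class OrcOnlyDownloader(DatasetDownloader):
        """Hypothetical subclass that downloads only .orc files."""

        def filter_objects(self, objects):
            return keep_only_suffix(objects, ".orc")

except ImportError:
    # swh.datasets is not installed; the helper above still works alone.
    OrcOnlyDownloader = None


# Hypothetical listing, made up for the demonstration.
listing = [
    FakeObjectSummary("graph/example/orc/content/part-0.orc"),
    FakeObjectSummary("graph/example/orc/meta/export.json"),
]
filtered = keep_only_suffix(listing)
print([obj.key for obj in filtered])
```

Keeping the filter in a plain function makes it easy to check without network access; the subclass only forwards to it.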