swh.core.s3.downloader module

class swh.core.s3.downloader.S3Downloader(local_path: Path, s3_url: str, parallelism: int = 5)

Bases: object

Utility class to recursively download the content of a directory on S3.

It also supports resuming a download when some files fail to be downloaded (because of connection errors, for instance).

Parameters:
  • local_path – path of directory where files will be downloaded

  • s3_url – URL of directory in an S3 bucket (s3://<bucket_name>/<path>/)

  • parallelism – maximum number of threads for downloading files

Example of use:

from pathlib import Path

from swh.core.s3.downloader import S3Downloader

# download "2025-05-18-popular-1k" datasets (ORC and compressed graph)
# into a sub-directory of the current working directory named "2025-05-18-popular-1k"

s3_downloader = S3Downloader(
    # local_path is annotated as a Path in the constructor signature
    local_path=Path("2025-05-18-popular-1k"),
    s3_url="s3://softwareheritage/graph/2025-05-18-popular-1k/",
)

while not s3_downloader.download():
    continue

download(progress_percent_cb: Callable[[int], None] = <function S3Downloader.<lambda>>, progress_status_cb: Callable[[str], None] = <function S3Downloader.<lambda>>) → bool

Execute the download of files from S3 in parallel using a pool of threads.

Parameters:
  • progress_percent_cb – optional callback function to report the overall progress of the downloads

  • progress_status_cb – optional callback function to get status messages related to downloaded files

Returns:

True if all files were successfully downloaded, False if an error occurred while downloading a file. In the latter case, calling this method again resumes the incomplete downloads.
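The usage example above retries without any limit. As a minimal sketch, assuming a bounded number of attempts is preferable, the retry loop could be wrapped in a helper such as the one below. The name `download_until_complete` and its `max_attempts`, `delay_seconds`, `on_percent` and `on_status` parameters are hypothetical and not part of swh.core; the helper only relies on the `download()` signature documented here.

```python
import time
from typing import Callable


def download_until_complete(
    downloader,
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
    on_percent: Callable[[int], None] = lambda percent: None,
    on_status: Callable[[str], None] = lambda message: None,
) -> bool:
    """Call downloader.download() until it returns True or attempts run out.

    `downloader` is any object exposing the download() method documented
    above, for instance an S3Downloader instance (hypothetical helper).
    """
    for attempt in range(1, max_attempts + 1):
        if downloader.download(
            progress_percent_cb=on_percent,
            progress_status_cb=on_status,
        ):
            return True
        # Wait before resuming, except after the last attempt.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return False
```

It could then be called as `download_until_complete(s3_downloader, max_attempts=3, on_status=print)`, with the return value telling the caller whether every file was eventually downloaded.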

filter_objects(objects: List[ObjectSummary]) → List[ObjectSummary]

Method that can be overridden in derived classes to filter the files to download. By default, all files are returned.

Parameters:

objects – list of files recursively discovered from the S3 directory

Returns:

filtered list of files to download
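As an illustration of such a filter, the sketch below keeps only the objects whose key ends with a given suffix. It assumes that each `ObjectSummary` exposes its S3 key through a `key` attribute, as boto3's S3 `ObjectSummary` resource does; the helper name `keep_keys_with_suffix` and the derived class shown in the comment are hypothetical.

```python
from typing import Iterable, List, Sequence


def keep_keys_with_suffix(objects: Iterable, suffixes: Sequence[str]) -> List:
    """Return the objects whose S3 key ends with one of the given suffixes.

    Assumption: every object has a `key` attribute holding its S3 key.
    """
    wanted = tuple(suffixes)
    return [obj for obj in objects if obj.key.endswith(wanted)]


# Hypothetical use inside a derived class:
#
# class OrcOnlyDownloader(S3Downloader):
#     def filter_objects(self, objects):
#         return keep_keys_with_suffix(objects, [".orc"])
```

Keeping the selection logic in a plain function makes it easy to check without any S3 access.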

can_download_file(relative_path: str, local_file_path: Path) → bool

Method that can be overridden in derived classes to prevent the download of a file under certain conditions. By default, all files are downloaded.

Parameters:
  • relative_path – path of file relative to the S3 directory

  • local_file_path – local path where the file is downloaded

Returns:

whether to download the file or not
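One possible condition, sketched below under stated assumptions, is to skip a compressed file when a decompressed copy already sits next to its local path. The function name `needs_download` and the choice of the `.gz` extension are illustrative only; an override of `can_download_file` in a derived class could return the result of such a check.

```python
from pathlib import Path


def needs_download(relative_path: str, local_file_path: Path) -> bool:
    """Return False when a decompressed copy of a .gz file already exists.

    Hypothetical rule: "data.csv.gz" is skipped if "data.csv" is present
    in the same local directory; every other file is downloaded.
    """
    if relative_path.endswith(".gz"):
        # with_suffix("") drops only the last suffix: x.csv.gz -> x.csv
        decompressed = local_file_path.with_suffix("")
        return not decompressed.exists()
    return True
```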

post_download_file(relative_path: str, local_file_path: Path) → None

Method that can be overridden in derived classes to post-process a downloaded file (to decompress it, for instance).

Parameters:
  • relative_path – path of file relative to the S3 directory

  • local_file_path – local path where the file is downloaded
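The decompression case mentioned above could look like the following sketch, which uses gzip from the Python standard library purely for illustration; the actual compression format of a given dataset may differ. The helper name `gunzip_in_place` is hypothetical, and an override of `post_download_file` could simply call it with `local_file_path`.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_in_place(local_file_path: Path) -> Path:
    """Decompress a .gz file next to itself, delete the archive, return the new path.

    Files without a .gz suffix are left untouched and returned as-is.
    """
    if local_file_path.suffix != ".gz":
        return local_file_path
    target = local_file_path.with_suffix("")
    # Stream the content to avoid loading large files into memory.
    with gzip.open(local_file_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    local_file_path.unlink()
    return target
```

Deleting the archive after decompression may cause it to be fetched again on a later run, unless can_download_file is overridden to take the decompressed copy into account.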

post_downloads() → None

Method that can be overridden in derived classes to run a post-processing step after all files have been downloaded.