swh.core.s3.downloader module
- class swh.core.s3.downloader.S3Downloader(local_path: Path, s3_url: str, parallelism: int = 5)
Bases: object
Utility class to recursively download the content of a directory on S3.
It also supports resuming a download when some files fail to be downloaded (when connection errors happen, for instance).
- Parameters:
local_path – path of directory where files will be downloaded
s3_url – URL of directory in an S3 bucket (s3://<bucket_name>/<path>/)
parallelism – maximum number of threads for downloading files
Example of use:
    from pathlib import Path

    from swh.core.s3.downloader import S3Downloader

    # download "2025-05-18-popular-1k" datasets (ORC and compressed graph)
    # into a sub-directory of the current working directory named "2025-05-18-popular-1k"
    s3_downloader = S3Downloader(
        local_path=Path("2025-05-18-popular-1k"),
        s3_url="s3://softwareheritage/graph/2025-05-18-popular-1k/",
    )

    # download() returns False when some files failed; calling it again resumes them
    while not s3_downloader.download():
        continue
- download(progress_percent_cb: Callable[[int], None] = <function S3Downloader.<lambda>>, progress_status_cb: Callable[[str], None] = <function S3Downloader.<lambda>>) → bool
Execute the download of files from S3 in parallel using a pool of threads.
- Parameters:
progress_percent_cb – optional callback function to report the overall progress of the downloads
progress_status_cb – optional callback function to get status messages related to downloaded files
- Returns:
True if all files were successfully downloaded, False if an error occurred while downloading a file. In that case, calling this method again resumes the incomplete downloads.
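Because download() reports failure through its return value, a caller may prefer a bounded retry loop to an unbounded one. The following is a minimal sketch of that idea. The helper name `download_with_retries`, the `max_attempts` and `delay_seconds` parameters, and the print-based callbacks are assumptions made for illustration and are not part of swh.core.

```python
import time


def download_with_retries(downloader, max_attempts=5, delay_seconds=0.0):
    """Call downloader.download() until it returns True or attempts run out.

    `downloader` is expected to be an S3Downloader instance. This helper and
    its retry policy are illustrative only, not part of swh.core.
    """
    for attempt in range(1, max_attempts + 1):
        ok = downloader.download(
            # Both callbacks are optional in the documented signature.
            progress_percent_cb=lambda percent: print(f"progress: {percent}%"),
            progress_status_cb=lambda message: print(f"status: {message}"),
        )
        if ok:
            return True
        print(f"attempt {attempt} incomplete")
        if attempt < max_attempts:
            # The next call resumes the incomplete downloads.
            time.sleep(delay_seconds)
    return False
```

A caller would pass an already constructed S3Downloader and check the boolean result to decide whether to give up or report an error.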
- filter_objects(objects: List[ObjectSummary]) → List[ObjectSummary]
Method that can be overridden in derived classes to filter the files to download. By default, it returns all files.
- Parameters:
objects – list of files recursively discovered from the S3 directory
- Returns:
filtered list of files to download
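As an illustration of overriding filter_objects, the sketch below keeps only objects whose key ends with a given suffix. It is written as a mixin so that it runs without swh.core installed. The class name `SuffixFilterMixin` and the `keep_suffixes` attribute are hypothetical, and the code assumes each object exposes its S3 key as a `key` attribute, as boto3's ObjectSummary does.

```python
class SuffixFilterMixin:
    """Hypothetical mixin: keep only objects whose key ends with a suffix."""

    # Assumed default; adjust to the file types you need.
    keep_suffixes = (".orc",)

    def filter_objects(self, objects):
        # Assumption: each object exposes its S3 key as `.key`
        # (as boto3's ObjectSummary does).
        return [obj for obj in objects if obj.key.endswith(self.keep_suffixes)]
```

To use it, one would declare a derived class such as `class OrcDownloader(SuffixFilterMixin, S3Downloader): pass`, listing the mixin first so that its filter_objects takes precedence.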
- can_download_file(relative_path: str, local_file_path: Path) → bool
Method that can be overridden in derived classes to prevent the download of a file under certain conditions. By default, all files are downloaded.
- Parameters:
relative_path – path of file relative to the S3 directory
local_file_path – local path where the file is downloaded
- Returns:
whether to download the file or not
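One possible use of can_download_file is to skip a compressed file whose decompressed counterpart is already present locally. The sketch below shows that check as a mixin. The class name `SkipIfDecompressedMixin` and the choice of the `.gz` extension are assumptions made for illustration; the datasets you download may use another compression format.

```python
from pathlib import Path


class SkipIfDecompressedMixin:
    """Hypothetical mixin: skip "<name>.gz" when "<name>" already exists."""

    def can_download_file(self, relative_path: str, local_file_path: Path) -> bool:
        if local_file_path.suffix == ".gz":
            # with_suffix("") drops the trailing ".gz" from the file name.
            return not local_file_path.with_suffix("").exists()
        # Every other file is downloaded, matching the default behavior.
        return True
```

A derived class would combine it with the downloader, for example `class MyDownloader(SkipIfDecompressedMixin, S3Downloader): pass`.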
- post_download_file(relative_path: str, local_file_path: Path) → None
Method that can be overridden in derived classes to post-process a downloaded file (to decompress it, for instance).
- Parameters:
relative_path – path of file relative to the S3 directory
local_file_path – local path where the file is downloaded
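The decompression use case mentioned for post_download_file could look like the following sketch, which uses only the Python standard library. The class name `GunzipMixin` and the `.gz` format are assumptions made for illustration. The downloaded archive is left in place, since removing it might affect how a later download() call treats that file.

```python
import gzip
import shutil
from pathlib import Path


class GunzipMixin:
    """Hypothetical mixin: decompress "<name>.gz" next to the downloaded file."""

    def post_download_file(self, relative_path: str, local_file_path: Path) -> None:
        if local_file_path.suffix != ".gz":
            return  # nothing to do for files that are not gzip archives
        target = local_file_path.with_suffix("")  # same name without ".gz"
        with gzip.open(local_file_path, "rb") as src, open(target, "wb") as dst:
            # Stream the content so large files are not loaded into memory.
            shutil.copyfileobj(src, dst)
```

As with the other hooks, a derived class such as `class MyDownloader(GunzipMixin, S3Downloader): pass` would pick up this behavior.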