swh.graph.download module
- class swh.graph.download.GraphDownloader(local_path: Path, s3_url: str, parallelism: int = 5)
Bases: S3Downloader
Utility class to download a compressed Software Heritage graph dataset from S3. It supports resuming a download when some files fail to be downloaded (for instance, because of connection errors).
Example of use:
    from pathlib import Path

    from swh.graph.download import GraphDownloader

    # Download the "2025-05-18-popular-1k" graph dataset into a sub-directory
    # of the current working directory named "2025-05-18-popular-1k".
    graph_downloader = GraphDownloader(
        local_path=Path("2025-05-18-popular-1k"),
        s3_url="s3://softwareheritage/graph/2025-05-18-popular-1k/compressed/",
    )

    # download() returns a falsy value while some files are still missing,
    # so call it again to resume until everything has been fetched.
    while not graph_downloader.download():
        continue
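The loop in the example above retries forever. A bounded variant could be sketched as follows. This is a minimal sketch, not part of swh.graph: `download_with_retries`, `Downloader` and `FlakyDownloader` are hypothetical names, and the only assumption about the real class is that `download()` returns a truthy value once every file has been fetched, as the example loop implies.

```python
import time
from typing import Protocol


class Downloader(Protocol):
    """Any object exposing download() -> bool (assumed to match GraphDownloader)."""

    def download(self) -> bool: ...


def download_with_retries(
    downloader: Downloader, max_attempts: int = 5, delay_seconds: float = 0.0
) -> bool:
    """Call download() until it succeeds or max_attempts calls have been made."""
    for attempt in range(1, max_attempts + 1):
        if downloader.download():
            return True
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before resuming the download
    return False


class FlakyDownloader:
    """Hypothetical stand-in: fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def download(self) -> bool:
        self.calls += 1
        return self.calls > self.failures
```

With a real downloader, `download_with_retries(graph_downloader, max_attempts=10, delay_seconds=5.0)` would give up after ten attempts instead of looping indefinitely.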
- filter_objects(objects: List[ObjectSummary]) → List[ObjectSummary]
Method that can be overridden in derived classes to filter the files to download. By default, all files are returned.
- Parameters:
objects – list of files recursively discovered from the S3 directory
- Returns:
filtered list of files to download
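One possible filtering rule for an overridden `filter_objects` is to keep only files with given suffixes. The sketch below assumes that each `ObjectSummary` exposes the S3 object path as a `key` attribute, as boto3's `ObjectSummary` does; `keep_suffixes` and `FakeObjectSummary` are hypothetical names, and the file names in the comments are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeObjectSummary:
    """Hypothetical stand-in exposing only the `key` attribute (S3 object path)."""

    key: str


def keep_suffixes(objects: Iterable, suffixes: Iterable[str]) -> List:
    """Return the objects whose key ends with one of the given suffixes."""
    wanted = tuple(suffixes)
    return [obj for obj in objects if obj.key.endswith(wanted)]


# In a GraphDownloader subclass, the override could delegate to the helper:
#
#     class GraphOnlyDownloader(GraphDownloader):
#         def filter_objects(self, objects):
#             return keep_suffixes(objects, (".graph", ".properties"))
```

Keeping the rule in a plain function makes it testable without any S3 access.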
- can_download_file(relative_path: str, local_file_path: Path) → bool
Method that can be overridden in derived classes to prevent the download of a file under certain conditions. By default, all files are downloaded.
- Parameters:
relative_path – path of file relative to the S3 directory
local_file_path – local path where the file is downloaded
- Returns:
whether to download the file or not
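As an illustration of a condition an overridden `can_download_file` might check, the sketch below skips files that already have a non-empty local copy. `should_download` is a hypothetical helper, not part of swh.graph, and this section does not say what check, if any, the library itself performs.

```python
from pathlib import Path


def should_download(relative_path: str, local_file_path: Path) -> bool:
    """Return False when a non-empty local copy of the file already exists.

    relative_path is unused here; it is kept so that the helper mirrors the
    can_download_file signature.
    """
    if local_file_path.is_file() and local_file_path.stat().st_size > 0:
        return False
    return True


# In a GraphDownloader subclass, the override could delegate to the helper:
#
#     class SkipExistingDownloader(GraphDownloader):
#         def can_download_file(self, relative_path, local_file_path):
#             return should_download(relative_path, local_file_path)
```

A non-empty file is not necessarily a complete one, so a stricter rule (comparing sizes or checksums, for example) may be preferable where partial files can be left behind.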
- post_download_file(relative_path: str, local_file_path: Path) → None
Method that can be overridden in derived classes to post-process a downloaded file (for instance, to uncompress it).
- Parameters:
relative_path – path of file relative to the S3 directory
local_file_path – local path where the file is downloaded
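The uncompress step mentioned for `post_download_file` could look like the sketch below. It uses gzip only because gzip ships with the Python standard library; the compression format of the dataset files is not stated in this section, so the `.gz` suffix, the `decompress_gzip` helper and the file names are assumptions made for illustration.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(local_file_path: Path) -> Path:
    """Decompress a .gz file next to itself and remove the compressed copy.

    Files without a .gz suffix are left untouched. Returns the path of the
    resulting (or unchanged) file.
    """
    if local_file_path.suffix != ".gz":
        return local_file_path
    target = local_file_path.with_suffix("")  # "x.csv.gz" -> "x.csv"
    with gzip.open(local_file_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files fit in memory
    local_file_path.unlink()
    return target


# In a GraphDownloader subclass, the override could delegate to the helper:
#
#     class DecompressingDownloader(GraphDownloader):
#         def post_download_file(self, relative_path, local_file_path):
#             decompress_gzip(local_file_path)
```

If the compressed file is deleted after extraction, a matching `can_download_file` override would need to look for the uncompressed file, otherwise a resumed download could fetch it again.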