swh.graph.download module

class swh.graph.download.GraphDownloader(local_path: Path, s3_url: str, parallelism: int = 5)

Bases: S3Downloader

Utility class that downloads a compressed Software Heritage graph dataset from S3. It can resume a download when some files fail to be downloaded, for instance because of connection errors.

Example of use:

from pathlib import Path

from swh.graph.download import GraphDownloader

# download "2025-05-18-popular-1k" graph dataset into a sub-directory of the
# current working directory named "2025-05-18-popular-1k"

graph_downloader = GraphDownloader(
    local_path=Path("2025-05-18-popular-1k"),
    s3_url="s3://softwareheritage/graph/2025-05-18-popular-1k/compressed/",
)

# call download() again until it succeeds, resuming after failed files
while not graph_downloader.download():
    continue

filter_objects(objects: List[ObjectSummary]) → List[ObjectSummary]

Method that can be overridden in derived classes to filter the files to download. By default, all files are returned.

Parameters:

objects – list of files recursively discovered from the S3 directory

Returns:

filtered list of files to download
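
As an illustration of overriding filter_objects, here is a minimal sketch that keeps only files with selected suffixes. It is written as a mixin so that it runs without S3 access. The class name SuffixFilterMixin, the suffix list, and the file names are hypothetical, and the code assumes each object exposes its S3 key as a `.key` attribute, as boto3's ObjectSummary does.

```python
from types import SimpleNamespace
from typing import Any, List


class SuffixFilterMixin:
    """Keep only objects whose key ends with one of the wanted suffixes.

    Hypothetical helper, meant to be combined with GraphDownloader, e.g.:
        class MyDownloader(SuffixFilterMixin, GraphDownloader): ...
    """

    # Hypothetical choice of suffixes; adjust to the files you need.
    wanted_suffixes = (".graph", ".properties")

    def filter_objects(self, objects: List[Any]) -> List[Any]:
        # Assumption: each object exposes its S3 key as `.key`.
        return [obj for obj in objects if obj.key.endswith(self.wanted_suffixes)]


# Stand-in objects so the sketch runs without listing a real bucket.
listing = [
    SimpleNamespace(key=name)
    for name in ("graph.graph", "graph.properties", "graph.ef")
]
kept = [obj.key for obj in SuffixFilterMixin().filter_objects(listing)]
print(kept)  # ['graph.graph', 'graph.properties']
```

Filtering at this stage means the discarded files are never scheduled for download at all.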

can_download_file(relative_path: str, local_file_path: Path) → bool

Method that can be overridden in derived classes to prevent the download of a file under certain conditions. By default, all files are downloaded.

Parameters:
  • relative_path – path of file relative to the S3 directory

  • local_file_path – local path where the file is downloaded

Returns:

whether to download the file or not
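
One possible use of can_download_file is to skip a compressed file whose decompressed copy is already present locally. The sketch below assumes gzip-compressed files with a ".gz" suffix, which is an illustrative choice and not something the dataset is documented to use. The class name SkipDecompressedMixin is hypothetical.

```python
import tempfile
from pathlib import Path


class SkipDecompressedMixin:
    """Hypothetical mixin, to be combined with GraphDownloader."""

    def can_download_file(self, relative_path: str, local_file_path: Path) -> bool:
        # Assumption: "name.gz" is decompressed next to itself as "name".
        if relative_path.endswith(".gz"):
            decompressed = local_file_path.with_suffix("")
            return not decompressed.exists()
        return True


# Exercise the hook against a temporary directory instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    compressed = Path(tmp) / "labels.txt.gz"
    policy = SkipDecompressedMixin()
    before = policy.can_download_file("labels.txt.gz", compressed)
    compressed.with_suffix("").touch()  # simulate an earlier decompression
    after = policy.can_download_file("labels.txt.gz", compressed)

print(before, after)  # True False
```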

post_download_file(relative_path: str, local_file_path: Path) → None

Method that can be overridden in derived classes to post-process a downloaded file (to decompress it, for instance).

Parameters:
  • relative_path – path of file relative to the S3 directory

  • local_file_path – local path where the file is downloaded
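
The decompression use case mentioned above could look roughly like the following sketch. It uses gzip from the standard library purely for illustration; the compression format actually used by a given dataset is not stated here, so treat the ".gz" suffix and the GunzipMixin name as assumptions.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


class GunzipMixin:
    """Hypothetical mixin, to be combined with GraphDownloader."""

    def post_download_file(self, relative_path: str, local_file_path: Path) -> None:
        if local_file_path.suffix != ".gz":
            return  # leave other files untouched
        target = local_file_path.with_suffix("")
        # Stream the decompressed bytes to avoid loading the file in memory.
        with gzip.open(local_file_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        local_file_path.unlink()  # drop the compressed copy


# Simulate a freshly downloaded file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "labels.txt.gz"
    with gzip.open(archive, "wb") as f:
        f.write(b"hello")
    GunzipMixin().post_download_file("labels.txt.gz", archive)
    restored = (Path(tmp) / "labels.txt").read_bytes()
    archive_removed = not archive.exists()
```

Removing the compressed copy saves disk space, but a later run would then see the file as missing unless can_download_file is overridden accordingly.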

post_downloads() → None

Method that can be overridden in derived classes to run post-processing after all files have been downloaded.
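
As a sketch of what such a final step might do, the mixin below records how many files ended up on disk and their total size. It assumes the downloader keeps its target directory in an attribute named `local_path`, mirroring the constructor parameter; that attribute name, the `summary` attribute, and the SummaryMixin class are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


class SummaryMixin:
    """Hypothetical mixin, to be combined with GraphDownloader."""

    def post_downloads(self) -> None:
        # Assumption: the target directory is available as `self.local_path`.
        files = [p for p in Path(self.local_path).rglob("*") if p.is_file()]
        total_bytes = sum(p.stat().st_size for p in files)
        self.summary = (len(files), total_bytes)
        print(f"{len(files)} files, {total_bytes} bytes")


# Stand-in for a finished download: two small files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.bin").write_bytes(b"abc")
    (Path(tmp) / "b.bin").write_bytes(b"defgh")
    demo = SummaryMixin()
    demo.local_path = Path(tmp)
    demo.post_downloads()
    summary = demo.summary
```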