Exporting a subdataset#

Because the entire graph is often too big to be practical for many research use cases, notably for prototyping, it is generally useful to publish “subdatasets” which only contain a subset of the entire graph. An example of a very useful subdataset is the graph containing only the top 1000 most popular GitHub repositories (sorted by number of stars).

This page details the various steps required to export a graph subdataset using swh-graph and Amazon Athena.

Step 1. Obtain the list of origins#

You first need to obtain a list of origins that you want to include in the subdataset. Depending on the type of subdataset you want to create, this can be done in various ways, either manual or automated. The following is an example of how to get the list of the 1000 most popular GitHub repositories in the Python language, sorted by number of stars:

for i in $( seq 1 10 ); do \
    curl -G https://api.github.com/search/repositories \
        -d "page=$i" \
        -d "s=stars" -d "order=desc" -d "q=language:python" -d 'per_page=100' | \
        jq --raw-output '.items[].html_url'; \
    sleep 6; \
done > origins.txt

Step 2. Build the list of SWHIDs#

To generate a subdataset from an existing dataset, you need to generate the list of all the SWHIDs to include in the subdataset. The best way to achieve that is to use the compressed graph to perform a full visit of the compressed graph starting from the origin nodes, and to return the list of all the SWHIDs that are reachable from these origins.

Unfortunately, there is currently no endpoint in the HTTP API to start a traversal from multiple nodes. The current best way to achieve this is therefore to visit the graph starting from each origin, one by one, and then to merge all the resulting lists of SWHIDs into a single sorted list of unique SWHIDs.

If you use the internal graph API, you might need to convert the origin URLs in the Extended SWHID format (swh:ori:1:<sha1(url)>) to query the API.

Step 3. Generate the subdataset on Athena#

Once you have obtained a text file containing all the SWHIDs to be included in the new dataset, it is possible to use AWS Athena to JOIN this list of SWHIDs with the tables of an existing dataset, and write the output as a new ORC dataset.

First, make sure that your base dataset containing the entire graph is available as a database on AWS Athena, which can be set up by following the steps described in Setup on Amazon Athena.

The subdataset can then be generated with the swh dataset athena gensubdataset command:

swh dataset athena gensubdataset \
    --swhids swhids.csv \
    --database swh_20210323
    --subdataset-database swh_20210323_popular3kpython \
    --subdataset-location s3://softwareheritage/graph/2021-03-23-popular-3k-python/

Step 4. Upload and document the newly generated subdataset#

After having executed the previous step, there should now be a new dataset located at the S3 path given as the parameter to --subdataset-location. You can upload, publish and document this new subdataset by following the procedure described in Exporting a dataset.