============================================== Getting Started with the Software Heritage API ============================================== Introduction ------------ About Software Heritage ^^^^^^^^^^^^^^^^^^^^^^^ The `Software Heritage project `__ was started in 2015 with a rather impressive goal and purpose: Software Heritage is an ambitious initiative that aims at collecting, organizing, preserving and sharing all the source code publicly available in the world. Yes, all source code available in the world. It implies to build an equally impressive infrastructure to hold the huge amount of information represented, make the archive available to the public through a :swh_web:`nice web interface ` and even propose a :ref:`well-documented API ` to access it seamlessly. For the records, there are also :ref:`various datasets available ` for download, with detailed instructions about how to set it up. And, yes it’s huge: the full graph generated from the archive (with only metadata, content is not included) has more than 20b nodes and weights 1.2TB. Overall size of the archive is in the hundreds of TBs. This article presents, and demonstrates the use of, the :swh_web:`Software Heritage API ` to query basic information about archived content and fetch the content of a software project. Terms and Concepts ^^^^^^^^^^^^^^^^^^ For our activity we need to define the following terms and concepts: - The repositories analysed by the SWH are registered as **origins**. Examples of origins are: https://bitbucket.org/anthroweb/apache.git, https://github.com/apache/ant, or other types of sources (debian source packages, npmjs, pypi, cran..). - When repositories are analysed, it creates **snapshots**. Snapshots describe the state of the repository at the time of analysis, and provide links to the repository content. As an example in the case of a git repository, the snapshot links to the list of branches, which themselves link to revisions and releases. - **Revisions** are consistent sets of directories and contents representing the repository at a given time, like in a baseline. They can be conceptually mapped to commits in subversion, to git references, or to source package versions in debian or nmpjs repositories. - Revisions are linked to a **directory**, which itself links to other directories and **contents** (aka blobs). A full list of terms is provided in the `Software Heritage doc `__. Preliminary steps ----------------- This article uses Python 3.x on the client side, and the ``requests`` Python module to manipulate the HTTP requests. Note however that any language that provides HTTP requests (GET, POST) can access the API and could be used. Firstly let’s make sure we have the correct Python version and module installed:: boris@castalia:notebook$ python3 -V Python 3.7.3 boris@castalia:notebooks$ pip3 install requests Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.21.0) boris@castalia:notebook$ Initialise the script --------------------- We need to import a few modules and utilities to play with the Software Heritage API, namely ``json`` and the aforementioned ``requests`` modules. We also define a utility function to pretty-print json data easily: .. code:: python import json import requests # Utility to pretty-print json. def jprint(obj): # create a formatted string of the Python JSON object print(json.dumps(obj, sort_keys=True, indent=4)) The syntax mentioned in the :swh_web:`API documentation ` is rather straightforward. Since we want to read it from the main Software Heritage server, we will use ``https://archive.softwareheritage.org/`` as the basename. All API calls will be forged according to the same syntax: ``https://archive.softwareheritage.org/api/1/`` Request basic Information ------------------------- We want to get some basic information about the main server activity and content. The ``stat`` endpoint provides a summary of the main indexes and some statistics about the archive. We can request a GET on the main counters of the archive using the counters path, as described in the :swh_web:`endpoint documentation `: ``/api/1/stat/counters/`` This API endpoint returns the following information: * **content** is the total number of blobs (files) in the archive. * **directory** is the total number of repositories in the archive. * **origin** is the number of distinct origins (repositories) fetched by the archive bots. * **origin_visits** is the total number of visits across all origins. * **person** is the number of authors (e.g. committers, authors) in the archived files. * **release** is the number of tags retrieved in the archive. * **revision** is the number of revisions stored in the archive. * **skipped_content** is the number of objects which could be imported in the archive. * **snapshot** is the number of snapshots stored in the archive. Note that we use the default JSON format for the output. We could use YAML if we wanted to, with a custom ``Request Headers`` set to ``application/yaml``. .. code-block:: python resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/") counters = resp.json() jprint(counters) .. code-block:: python { "content": 10049535736, "directory": 8390591308, "origin": 156388918, "person": 42263568, "release": 17218891, "revision": 2109783249 } There are almost 10bn blobs (aka files) in the archive and 8bn+ directories already, for 155m repositories analysed. Now, what about a specific repository? Let’s say we want to find if `alambic `__ (an open-source data provider and analysis system for software development) has already been analysed by the archive’s bots. Search the archive ------------------ Search for a keyword ^^^^^^^^^^^^^^^^^^^^ The easiest way to look for a keyword in the repositories analysed by the archive is to use the ``search`` feature of the ``origin`` endpoint. Documentation for the endpoint is :swh_web:`here ` and the complete syntax is: ``/api/1/origin/search//`` The server returns an array of hashes, with each item being formatted as: - **origin_visits_url** attribute is an URL that points to the API page listing all visits (bot fetches) to this repository. - **url** is the url of the origin, or repository, itself. A (truncated) example of a result from this endpoint is shown below: :: [ { "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", "url": "https://github.com/borisbaldassari/alambic" } ... ] As an example we will look for instances of *alambic* in the archive’s analysed repositories:: resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/") origins = resp.json() print(f"We found {len(origins)} entries.") for origin in origins[1:10]: print(f"- {origin['url']}") Which produces:: We found 52 entries. - https://github.com/royal-alambic-club/sauron - https://github.com/scamberlin/alambic - https://github.com/WebTales/alambic-connector-mongodb - https://github.com/WebTales/alambic - https://github.com/AssoAlambic/alambic-website - https://bitbucket.org/nayoub/alambic.git - https://github.com/Alexandru-Dobre/alambic-connector-rest - https://github.com/WebTales/alambic-connector-diffbot - https://github.com/WebTales/alambic-connector-firebase There are obviously many projects and repositories that embed the word alambic, and we will need to be a bit more specific if we are to identify the origin actually related to the alambic project. If we want to know more about a specific origin, we can simply use the ``url`` attribute (or any known URL) as an entry for any of the ``origin`` endpoints. Search for a specific origin ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Now say that we want to query the database for the specific repository of Alambic, to know what information has been registered by the archive. The API endpoint can be found :swh_web:`in the swh-web documentation `, and has the following syntax: ``/api/1/origin//get/`` Which returns the same type of JSON object than the ``search`` command seen previously: - **origin_visits_url** attribute is an URL that points to the API page listing all visits (bot fetches) to this repository. - **url** is the url of the origin, or repository, itself. We know that Alambic is hosted at ‘https://github.com/borisbaldassari/alambic/’, so the API call will look like this: ``/api/1/origin/https://github.com/borisbaldassari/alambic/get/`` .. code:: python resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/") found = resp.json() jprint(found) .. code:: { "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/", "url": "https://github.com/borisbaldassari/alambic" } Get visits information ^^^^^^^^^^^^^^^^^^^^^^ We can use the ``origin_visits_url`` attribute to know more about when the repository was analysed by the archive bots. The API endpoint is fully documented on the :swh_web:`Software Heritage doc site `, and has the following syntax: ``/api/1/origin//visits/`` We will use the same query as before about the main Alambic repository. .. code:: python resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/") found = resp.json() length = len(found) print(f"Number of visits found: {format(length)}.") print("With dates:") for visit in found: print(f"- {visit['visit']} {visit['date']}") print("\nExample of a single visit entry:") jprint(found[0]) .. code:: Number of visits found: 5. With dates: - 5 2021-01-01T19:35:41.308336+00:00 - 4 2020-02-06T10:41:45.700641+00:00 - 3 2019-09-01T22:38:12.056537+00:00 - 2 2019-06-16T04:52:18.162914+00:00 - 1 2019-01-30T07:19:20.799217+00:00 Example of a single visit entry: { "date": "2021-01-01T19:35:41.308336+00:00", "metadata": {}, "origin": "https://github.com/borisbaldassari/alambic", "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/", "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/", "status": "full", "type": "git", "visit": 5 } Get the content --------------- As defined in the beginning, a snapshot is a capture of the repository at a given time with links to all branches and releases. In this example we will work on the snapshot ID of the last visit to Alambic, as returned by the previous command we executed. .. code:: python # Store snapshot id snapshot = found[0]['snapshot'] print(f"Snapshot is {format(snapshot)}.") .. code:: Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc. Note that the latest visit to the repository can also be directly retrieved using the :swh_web:`dedicated endpoint ` ``/api/1/origin/visit/latest/``. Get the snapshot ^^^^^^^^^^^^^^^^ We want now to retrieve the content of the project at this snapshot. For that purpose there is the ``snapshot`` endpoint, and its documentation is :swh_web:`provided here `. The complete syntax is: ``/api/1/snapshot//`` The snapshot endpoint returns in the ``branches`` attribute a list of **revisions** (aka commits in a git context), which themselves point to the set of directories and files in the branch at the time of analysis. Let’s follow this chain of links, starting with the snapshot’s list of revisions (branches): .. code:: python snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot)) snapshotj = snapshotr.json() jprint(snapshotj) .. code:: { "branches": { "HEAD": { "target": "refs/heads/master", "target_type": "alias", "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" }, "refs/heads/devel": { "target": "e298b8c5692b18928013a68e41fd185419515075", "target_type": "revision", "target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/" }, "refs/heads/features/cr152_anonymise_data": { "target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162", "target_type": "revision", "target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/" }, "refs/heads/features/cr164_github_project": { "target": "0005abb080e4c67a97533ee923e9d28142877752", "target_type": "revision", "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" }, "refs/heads/features/cr165_github_its": { "target": "0005abb080e4c67a97533ee923e9d28142877752", "target_type": "revision", "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/" }, "refs/heads/features/cr89_gitlabwizard": { "target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d", "target_type": "revision", "target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/" }, "refs/heads/master": { "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", "target_type": "revision", "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" } }, "id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc", "next_branch": null } Get the root directory ^^^^^^^^^^^^^^^^^^^^^^ The revision associated to the branch can be retrieved by following the corresponding link in the ``target_url`` attribute. We will follow the ``refs/heads/master`` branch and get the associated revision object. In this case (a git repository) the revision is equivalent to a commit, with an ID and message. .. code:: python print(f"Revision ID is {snapshotj['id']}.") master_url = snapshotj['branches']['refs/heads/master']['target_url'] masterr = requests.get(master_url) masterj = masterr.json() jprint(masterj) .. code:: Revision ID is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc { "author": { "email": "boris.baldassari@gmail.com", "fullname": "Boris Baldassari ", "name": "Boris Baldassari" }, "committer": { "email": "boris.baldassari@gmail.com", "fullname": "Boris Baldassari ", "name": "Boris Baldassari" }, "committer_date": "2020-11-01T12:55:13+01:00", "date": "2020-11-01T12:55:13+01:00", "directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8", "directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/", "extra_headers": [], "history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/", "id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19", "merge": false, "message": "#163 Fix dygraphs zero padding in forums plugin.\n", "metadata": {}, "parents": [ { "id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151", "url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/" } ], "synthetic": false, "type": "git", "url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/" } The revision references the root directory of the project. We can list all files and directories at the root by requesting more information from the ``directory_url`` attribute. The endpoint is documented :swh_web:`here ` and has the following syntax: ``/api/1/directory//`` The structure of the response is an **array of directory entries**. **Content entries** are represented like this: :: { "checksums": { "sha1": "5973b582bfaeffa71c924e3fe7150620230391d8", "sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", "sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc" }, "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", "length": 101, "name": ".dockerignore", "perms": 33188, "status": "visible", "target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b", "target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/", "type": "file" } And **directory entries** are represented with: :: { "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", "length": null, "name": "doc", "perms": 16384, "target": "316468df4988351911992ecbf1866f1c1f575c23", "target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/", "type": "dir" } We will print the list of contents and directories located at the root of the repository at the time of analysis: .. code:: python root_url = masterj['directory_url'] rootr = requests.get(root_url) rootj = rootr.json() for f in rootj: print(f"- {f['name']}.") .. code:: - .dockerignore - .env - .gitignore - CODE_OF_CONDUCT.html - CODE_OF_CONDUCT.md - LICENCE.html - LICENCE.md - Readme.md - doc - docker - docker-compose.run.yml - docker-compose.test.yml - dockercfg.encrypted - mojo - resources We could follow the links up (or down) to the leaves in order to rebuild the project structure and download all files individually to rebuild the project locally. However the archive can do it for us, and provides a feature to download the content of a whole project in one step: **cooking**. The feature is described in the :ref:`swh-vault documentation `. Download content of a project ----------------------------- When we ask the Archive to cook a directory for us, it invokes an asynchronous job to recuversively fetch the directories and files of the project, following the graph up to the leaves (files) and exporting the result as a tar.gz file. This procedure is handled by the :ref:`swh-vault component `, and it’s all automatic. Order the meal ^^^^^^^^^^^^^^ A cooking job can be invoked for revisions, directories or snapshots (soon). It is initiated with a POST request on the ``vault//`` endpoint, and its complete syntax is: ``/api/1/vault/directory//`` The first POST request initiates the cooking, and subsequent GET requests can fetch the job result and download the archive. See the `Software Heritage documentation ` on this, with useful examples. The API endpoint is documented :swh_web:`here `. In this example we will fetch the content of the root directory that we previously identified. .. code:: python mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") mealj = mealr.json() jprint(mealj) .. code:: { "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", "id": 379321799, "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", "obj_type": "directory", "progress_message": null, "status": "done" } Ask if it’s ready ^^^^^^^^^^^^^^^^^ We can use a GET request on the same URL to get information about the process status: .. code:: python statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/") statusj = statusr.json() jprint(statusj) .. code:: { "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/", "id": 379321799, "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200", "obj_type": "directory", "progress_message": null, "status": "done" } Get the plate ^^^^^^^^^^^^^ Once the processing is finished (it can take up to a few minutes) the tar.gz archive can be downloaded through the ``fetch_url`` link, and extracted as a tar.gz archive: :: boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 9555k 100 9555k 0 0 1459k 0 0:00:06 0:00:06 --:--:-- 1717k boris@castalia:downloads$ ls myarchive.tar.gz boris@castalia:downloads$ tar xzf myarchive.tar.gz 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/ 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md 3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config [SNIP] Conclusion ---------- In this article, we learned **how to explore and use the Software Heritage archive using its API**: searching for a repository, identifying projects and downloading specific snapshots of a repository. There is a lot more to the Archive and its API than what we have seen, and all features are generously documented on the :swh_web:`Software Heritage web site `.