==============================================
Getting Started with the Software Heritage API
==============================================

Introduction
------------

About Software Heritage
^^^^^^^^^^^^^^^^^^^^^^^

The `Software Heritage project <https://www.softwareheritage.org>`__ was
started in 2015 with a rather impressive goal and purpose:

   Software Heritage is an ambitious initiative that aims at collecting,
   organizing, preserving and sharing all the source code publicly
   available in the world.

Yes, all source code available in the world. This implies building an equally impressive
infrastructure to hold the huge amount of information involved, making the archive
available to the public through a :swh_web:`nice web interface </>`, and even offering a
:ref:`well-documented API <swh-web>` to access it seamlessly. For the record, there are
also :ref:`various datasets available <swh-dataset>` for download, with detailed
instructions on how to set them up. And yes, it’s huge: the full graph generated from
the archive (metadata only, content not included) has more than 20 billion nodes and
weighs 1.2 TB. The overall size of the archive is in the hundreds of terabytes.

This article presents, and demonstrates the use of, the :swh_web:`Software Heritage API
<api/1/>` to query basic information about archived content and fetch the content of a
software project.

Terms and Concepts
^^^^^^^^^^^^^^^^^^

For our purposes we need to define the following terms and concepts:

-  The repositories analysed by Software Heritage are registered as
   **origins**. Examples of origins are
   https://bitbucket.org/anthroweb/apache.git and
   https://github.com/apache/ant, as well as other types of sources
   (Debian source packages, npmjs, PyPI, CRAN, ...).
-  Analysing a repository creates a **snapshot**. Snapshots
   describe the state of the repository at the time of analysis, and
   provide links to the repository content. For example, in the case of a git
   repository, the snapshot links to the list of branches, which
   themselves link to revisions and releases.
-  **Revisions** are consistent sets of directories and contents
   representing the repository at a given time, like a baseline. They
   can be conceptually mapped to commits in Subversion, to git
   references, or to source package versions in Debian or npmjs
   repositories.
-  Revisions are linked to a **directory**, which itself links to other
   directories and **contents** (aka blobs).

A full list of terms is provided in the `Software Heritage
doc <https://wiki.softwareheritage.org/index.php?title=Glossary>`__.

Preliminary steps
-----------------

This article uses Python 3.x on the client side, and the ``requests``
Python module to handle the HTTP requests. Note however that any
language that can issue HTTP requests (GET, POST) can access the API and
could be used instead. First, let’s make sure we have the correct Python
version and module installed::

   boris@castalia:notebook$ python3 -V
   Python 3.7.3
   boris@castalia:notebook$ pip3 install requests
   Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.21.0)
   boris@castalia:notebook$

Initialise the script
---------------------

We need to import a few modules and utilities to play with the Software
Heritage API, namely ``json`` and the aforementioned ``requests``
modules. We also define a utility function to pretty-print json data
easily:

.. code:: python

    import json
    import requests

    # Utility to pretty-print json.
    def jprint(obj):
        # create a formatted string of the Python JSON object
        print(json.dumps(obj, sort_keys=True, indent=4))


The syntax described in the :swh_web:`API documentation <api/1/>` is rather
straightforward. Since we want to read from the main Software Heritage server, we
will use ``https://archive.softwareheritage.org/`` as the base URL. All API calls are
built according to the same syntax:

``https://archive.softwareheritage.org/api/1/<endpoint>``
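
As a minimal sketch of this syntax, the base URL and an endpoint path can be joined by a
small helper. The name ``api_url`` is a hypothetical helper introduced here for
illustration only; it is not part of any Software Heritage client library.

.. code:: python

    # Base URL of the public API, as given above.
    API_BASE = "https://archive.softwareheritage.org/api/1/"

    def api_url(endpoint):
        """Return the full API URL for an endpoint path such as 'stat/counters'."""
        # Strip leading and trailing slashes so that both 'stat/counters'
        # and '/stat/counters/' produce the same result.
        return API_BASE + endpoint.strip("/") + "/"

    print(api_url("stat/counters"))

Only the outer slashes are stripped, so an origin URL embedded in the path (as used
later in this article) keeps its ``https://`` prefix.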

Request basic Information
-------------------------

We want to get some basic information about the main server activity and content. The
``stat`` endpoint provides a summary of the main indexes and some statistics about the
archive. We can issue a GET request for the main counters of the archive using the
counters path, as described in the :swh_web:`endpoint documentation <api/1/stat/counters/>`:

``/api/1/stat/counters/``

This API endpoint returns the following information:

* **content** is the total number of blobs (files) in the archive.
* **directory** is the total number of directories in the archive.
* **origin** is the number of distinct origins (repositories) fetched by
  the archive bots.
* **origin_visits** is the total number of visits across all origins.
* **person** is the number of people (e.g. committers, authors) recorded
  in the archive.
* **release** is the number of tags retrieved in the archive.
* **revision** is the number of revisions stored in the archive.
* **skipped_content** is the number of objects which could not be
  imported in the archive.
* **snapshot** is the number of snapshots stored in the archive.

Note that we use the default JSON format for the output. We could get
YAML instead by setting the ``Accept`` request header to
``application/yaml``.

.. code-block:: python

    resp = requests.get("https://archive.softwareheritage.org/api/1/stat/counters/")
    counters = resp.json()
    jprint(counters)


.. code-block:: python

    {
        "content": 10049535736,
        "directory": 8390591308,
        "origin": 156388918,
        "person": 42263568,
        "release": 17218891,
        "revision": 2109783249
    }


There are already just over 10bn blobs (aka files) and 8bn+
directories in the archive, for 156m origins analysed.
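
Raw counters of this size are hard to read at a glance. The following sketch formats
them with the same ``bn`` and ``m`` suffixes used in the text; ``humanize`` is a
hypothetical helper written for this article, and the values are copied from the
response shown above.

.. code:: python

    # Counter values copied from the response shown above.
    counters = {
        "content": 10049535736,
        "directory": 8390591308,
        "origin": 156388918,
        "person": 42263568,
        "release": 17218891,
        "revision": 2109783249,
    }

    def humanize(n):
        """Format a count with a bn/m/k suffix and one decimal place."""
        for factor, suffix in ((10**9, "bn"), (10**6, "m"), (10**3, "k")):
            if n >= factor:
                return f"{n / factor:.1f}{suffix}"
        return str(n)

    for name, value in sorted(counters.items()):
        print(f"{name:<10} {humanize(value):>8}")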

Now, what about a specific repository? Let’s say we want to find if
`alambic <https://alambic.io>`__ (an open-source data provider and
analysis system for software development) has already been analysed by
the archive’s bots.

Search the archive
------------------

Search for a keyword
^^^^^^^^^^^^^^^^^^^^

The easiest way to look for a keyword in the repositories analysed by the archive is to
use the ``search`` feature of the ``origin`` endpoint. Documentation for the endpoint is
:swh_web:`here <api/1/origin/search/doc/>` and the complete syntax is:

``/api/1/origin/search/<keyword>/``

The server returns an array of objects, each with the following
attributes:

-  **origin_visits_url** is a URL that points to the API page
   listing all visits (bot fetches) to this repository.
-  **url** is the URL of the origin, or repository, itself.

A (truncated) example of a result from this endpoint is shown below:

::

   [
     {
       "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
       "url": "https://github.com/borisbaldassari/alambic"
     }
     ...
   ]

As an example we will look for instances of *alambic* in the archive’s
analysed repositories::

    resp = requests.get("https://archive.softwareheritage.org/api/1/origin/search/alambic/")
    origins = resp.json()
    print(f"We found {len(origins)} entries.")
    for origin in origins[1:10]:
        print(f"- {origin['url']}")


Which produces::

    We found 52 entries.
    -  https://github.com/royal-alambic-club/sauron
    -  https://github.com/scamberlin/alambic
    -  https://github.com/WebTales/alambic-connector-mongodb
    -  https://github.com/WebTales/alambic
    -  https://github.com/AssoAlambic/alambic-website
    -  https://bitbucket.org/nayoub/alambic.git
    -  https://github.com/Alexandru-Dobre/alambic-connector-rest
    -  https://github.com/WebTales/alambic-connector-diffbot
    -  https://github.com/WebTales/alambic-connector-firebase


There are obviously many projects and repositories that embed the word
alambic, and we will need to be a bit more specific if we are to
identify the origin actually related to the alambic project.

If we want to know more about a specific origin, we can simply use the
``url`` attribute (or any known URL) as an entry for any of the
``origin`` endpoints.
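
One possible way to narrow the results, sketched below, is to keep only the origins
whose repository name is exactly ``alambic``. The ``repo_name`` helper is hypothetical
and the list is a sample of the URLs printed above; in a real script ``origins`` would
be the parsed search response.

.. code:: python

    from urllib.parse import urlparse

    # Sample of the origin URLs returned by the search above.
    origins = [
        {"url": "https://github.com/royal-alambic-club/sauron"},
        {"url": "https://github.com/scamberlin/alambic"},
        {"url": "https://github.com/WebTales/alambic-connector-mongodb"},
        {"url": "https://github.com/WebTales/alambic"},
        {"url": "https://bitbucket.org/nayoub/alambic.git"},
    ]

    def repo_name(origin_url):
        """Return the last path component of a URL, without a '.git' suffix."""
        name = urlparse(origin_url).path.rstrip("/").split("/")[-1]
        return name[:-4] if name.endswith(".git") else name

    # Keep only origins whose repository is named exactly 'alambic'.
    exact = [o["url"] for o in origins if repo_name(o["url"]) == "alambic"]
    for url in exact:
        print(f"- {url}")

This still leaves several candidates, so knowing the exact origin URL remains the most
reliable approach.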

Search for a specific origin
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now say that we want to query the database for the specific repository of Alambic, to
know what information has been registered by the archive. The API endpoint can be found
:swh_web:`in the swh-web documentation <api/1/origin/doc/>`, and has the following
syntax:

``/api/1/origin/<origin_url>/get/``

It returns the same type of JSON object as the ``search`` command
seen previously:

-  **origin_visits_url** is a URL that points to the API page
   listing all visits (bot fetches) to this repository.
-  **url** is the URL of the origin, or repository, itself.

We know that Alambic is hosted at
``https://github.com/borisbaldassari/alambic/``, so the API call will look
like this:

``/api/1/origin/https://github.com/borisbaldassari/alambic/get/``

.. code:: python

    resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/get/")
    found = resp.json()
    jprint(found)


.. code::

    {
        "origin_visits_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/",
        "url": "https://github.com/borisbaldassari/alambic"
    }


Get visits information
^^^^^^^^^^^^^^^^^^^^^^

We can use the ``origin_visits_url`` attribute to know more about when the repository
was analysed by the archive bots. The API endpoint is fully documented on the
:swh_web:`Software Heritage doc site <api/1/origin/visits/doc/>`, and has the following
syntax:

``/api/1/origin/<origin_url>/visits/``

We will use the same query as before about the main Alambic repository.

.. code:: python

    resp = requests.get("https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visits/")
    found = resp.json()
    length = len(found)
    print(f"Number of visits found: {length}.")
    print("With dates:")
    for visit in found:
        print(f"- {visit['visit']} {visit['date']}")
    print("\nExample of a single visit entry:")
    jprint(found[0])


.. code::

    Number of visits found: 5.
    With dates:
    - 5 2021-01-01T19:35:41.308336+00:00
    - 4 2020-02-06T10:41:45.700641+00:00
    - 3 2019-09-01T22:38:12.056537+00:00
    - 2 2019-06-16T04:52:18.162914+00:00
    - 1 2019-01-30T07:19:20.799217+00:00

    Example of a single visit entry:
    {
        "date": "2021-01-01T19:35:41.308336+00:00",
        "metadata": {},
        "origin": "https://github.com/borisbaldassari/alambic",
        "origin_visit_url": "https://archive.softwareheritage.org/api/1/origin/https://github.com/borisbaldassari/alambic/visit/5/",
        "snapshot": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
        "snapshot_url": "https://archive.softwareheritage.org/api/1/snapshot/6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc/",
        "status": "full",
        "type": "git",
        "visit": 5
    }
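
The ``date`` fields are ISO 8601 timestamps, so they can be parsed with the standard
library. The sketch below uses the visit numbers and dates printed above to find the
most recent visit without relying on the order of the list, and to compute the time
span covered by the visits. The ``latest_visit`` helper is hypothetical.

.. code:: python

    from datetime import datetime

    # Visit numbers and dates copied from the output above.
    visits = [
        {"visit": 5, "date": "2021-01-01T19:35:41.308336+00:00"},
        {"visit": 4, "date": "2020-02-06T10:41:45.700641+00:00"},
        {"visit": 3, "date": "2019-09-01T22:38:12.056537+00:00"},
        {"visit": 2, "date": "2019-06-16T04:52:18.162914+00:00"},
        {"visit": 1, "date": "2019-01-30T07:19:20.799217+00:00"},
    ]

    def visit_date(visit):
        """Parse the ISO 8601 date of a visit into a datetime object."""
        return datetime.fromisoformat(visit["date"])

    def latest_visit(visits):
        """Return the visit with the most recent date, whatever the list order."""
        return max(visits, key=visit_date)

    latest = latest_visit(visits)
    first = min(visits, key=visit_date)
    span = visit_date(latest) - visit_date(first)
    print(f"Latest visit: #{latest['visit']} on {visit_date(latest).date()}")
    print(f"The visits span {span.days} days.")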


Get the content
---------------

As defined at the beginning, a snapshot is a capture of the repository
at a given time with links to all branches and releases. In this example
we will work with the snapshot ID of the last visit to Alambic, as returned
by the previous command we executed.

.. code:: python

    # Store snapshot id
    snapshot = found[0]['snapshot']
    print(f"Snapshot is {snapshot}.")


.. code::

    Snapshot is 6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc.


Note that the latest visit to the repository can also be retrieved directly using the
:swh_web:`dedicated endpoint <api/1/origin/visit/latest/doc/>`
``/api/1/origin/<origin_url>/visit/latest/``.

Get the snapshot
^^^^^^^^^^^^^^^^

We now want to retrieve the content of the project at this snapshot. For that purpose
there is the ``snapshot`` endpoint, and its documentation is :swh_web:`provided here
<api/1/snapshot/doc/>`. The complete syntax is:

``/api/1/snapshot/<snapshot_id>/``

The snapshot endpoint returns in the ``branches`` attribute a list of **revisions** (aka
commits in a git context), which themselves point to the set of directories and files in
the branch at the time of analysis. Let’s follow this chain of links, starting with the
snapshot’s list of revisions (branches):

.. code:: python

    snapshotr = requests.get("https://archive.softwareheritage.org/api/1/snapshot/{}/".format(snapshot))
    snapshotj = snapshotr.json()
    jprint(snapshotj)


.. code::

    {
        "branches": {
            "HEAD": {
                "target": "refs/heads/master",
                "target_type": "alias",
                "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
            },
            "refs/heads/devel": {
                "target": "e298b8c5692b18928013a68e41fd185419515075",
                "target_type": "revision",
                "target_url": "https://archive.softwareheritage.org/api/1/revision/e298b8c5692b18928013a68e41fd185419515075/"
            },
            "refs/heads/features/cr152_anonymise_data": {
                "target": "ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162",
                "target_type": "revision",
                "target_url": "https://archive.softwareheritage.org/api/1/revision/ba3e0dcbfa0cb212a7186e9e62efb6dafe7fe162/"
            },
            "refs/heads/features/cr164_github_project": {
                "target": "0005abb080e4c67a97533ee923e9d28142877752",
                "target_type": "revision",
                "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
            },
            "refs/heads/features/cr165_github_its": {
                "target": "0005abb080e4c67a97533ee923e9d28142877752",
                "target_type": "revision",
                "target_url": "https://archive.softwareheritage.org/api/1/revision/0005abb080e4c67a97533ee923e9d28142877752/"
            },
            "refs/heads/features/cr89_gitlabwizard": {
                "target": "b941fd5f93a6cfc2349358b891e47d0fffe0ed2d",
                "target_type": "revision",
                "target_url": "https://archive.softwareheritage.org/api/1/revision/b941fd5f93a6cfc2349358b891e47d0fffe0ed2d/"
            },
            "refs/heads/master": {
                "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
                "target_type": "revision",
                "target_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
            }
        },
        "id": "6436d2c9b06cf9bd9efb0b4e463c3fe6b868eadc",
        "next_branch": null
    }
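
In the response above, ``HEAD`` has the ``alias`` target type and its ``target`` is the
name of another branch rather than a revision ID. A minimal sketch of how such an alias
could be resolved is shown below; ``resolve_branch`` is a hypothetical helper, and the
data is a subset of the branches listed above.

.. code:: python

    # Subset of the "branches" attribute shown above.
    branches = {
        "HEAD": {
            "target": "refs/heads/master",
            "target_type": "alias",
        },
        "refs/heads/devel": {
            "target": "e298b8c5692b18928013a68e41fd185419515075",
            "target_type": "revision",
        },
        "refs/heads/master": {
            "target": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
            "target_type": "revision",
        },
    }

    def resolve_branch(branches, name, max_depth=10):
        """Follow alias branches until a non-alias branch is reached."""
        for _ in range(max_depth):
            branch = branches[name]
            if branch["target_type"] != "alias":
                return name, branch
            # For an alias, the target is the name of another branch.
            name = branch["target"]
        raise ValueError("too many levels of aliases")

    name, branch = resolve_branch(branches, "HEAD")
    print(f"HEAD -> {name} ({branch['target_type']} {branch['target']})")

The ``max_depth`` guard is a defensive choice that avoids looping forever should two
aliases ever point at each other.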


Get the root directory
^^^^^^^^^^^^^^^^^^^^^^

The revision associated with a branch can be retrieved by following the
corresponding link in the ``target_url`` attribute. We will follow the
``refs/heads/master`` branch and get the associated revision object. In
this case (a git repository) the revision is equivalent to a commit, with
an ID and a message.

.. code:: python

    master = snapshotj['branches']['refs/heads/master']
    # The branch target is the revision ID; snapshotj['id'] is the snapshot ID.
    print(f"Revision ID is {master['target']}.")
    master_url = master['target_url']
    masterr = requests.get(master_url)
    masterj = masterr.json()
    jprint(masterj)


.. code::

    Revision ID is 6dd0504b43b4459d52e9f13f71a91cc0fc445a19.
    {
        "author": {
            "email": "boris.baldassari@gmail.com",
            "fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
            "name": "Boris Baldassari"
        },
        "committer": {
            "email": "boris.baldassari@gmail.com",
            "fullname": "Boris Baldassari <boris.baldassari@gmail.com>",
            "name": "Boris Baldassari"
        },
        "committer_date": "2020-11-01T12:55:13+01:00",
        "date": "2020-11-01T12:55:13+01:00",
        "directory": "fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8",
        "directory_url": "https://archive.softwareheritage.org/api/1/directory/fd9fe3477db3b9b7dea63509832b3fa99bdd7eb8/",
        "extra_headers": [],
        "history_url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/log/",
        "id": "6dd0504b43b4459d52e9f13f71a91cc0fc445a19",
        "merge": false,
        "message": "#163 Fix dygraphs zero padding in forums plugin.\n",
        "metadata": {},
        "parents": [
            {
                "id": "a4a2d8925c1cc43612602ac28e4ca9a31728b151",
                "url": "https://archive.softwareheritage.org/api/1/revision/a4a2d8925c1cc43612602ac28e4ca9a31728b151/"
            }
        ],
        "synthetic": false,
        "type": "git",
        "url": "https://archive.softwareheritage.org/api/1/revision/6dd0504b43b4459d52e9f13f71a91cc0fc445a19/"
    }


The revision references the root directory of the project. We can list all files and
directories at the root by requesting more information from the ``directory_url``
attribute. The endpoint is documented :swh_web:`here <api/1/directory/doc/>` and has the
following syntax:

``/api/1/directory/<directory_id>/``

The structure of the response is an **array of directory entries**.
**Content entries** are represented like this:

::

   {
       "checksums": {
           "sha1": "5973b582bfaeffa71c924e3fe7150620230391d8",
           "sha1_git": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
           "sha256": "8761f1e1fd96fc4c86ad343a7c19ecd51c0bde4d7055b3315c3975b31ec61bbc"
       },
       "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
       "length": 101,
       "name": ".dockerignore",
       "perms": 33188,
       "status": "visible",
       "target": "a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b",
       "target_url": "https://archive.softwareheritage.org/api/1/content/sha1_git:a6c4d5ebfdf88b3b1a65996f6c438c01bf60740b/",
       "type": "file"
   }

And **directory entries** are represented with:

::

   {
       "dir_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
       "length": null,
       "name": "doc",
       "perms": 16384,
       "target": "316468df4988351911992ecbf1866f1c1f575c23",
       "target_url": "https://archive.softwareheritage.org/api/1/directory/316468df4988351911992ecbf1866f1c1f575c23/",
       "type": "dir"
   }

We will print the list of contents and directories located at the root of
the repository at the time of analysis:

.. code:: python

    root_url = masterj['directory_url']
    rootr = requests.get(root_url)
    rootj = rootr.json()
    for f in rootj:
        print(f"- {f['name']}")


.. code::

    - .dockerignore
    - .env
    - .gitignore
    - CODE_OF_CONDUCT.html
    - CODE_OF_CONDUCT.md
    - LICENCE.html
    - LICENCE.md
    - Readme.md
    - doc
    - docker
    - docker-compose.run.yml
    - docker-compose.test.yml
    - dockercfg.encrypted
    - mojo
    - resources


We could follow the links up (or down) to the leaves in order to rebuild
the project structure and download all files individually to rebuild the
project locally. However the archive can do it for us, and provides a
feature to download the content of a whole project in one step:
**cooking**. The feature is described in the :ref:`swh-vault
documentation <swh-vault>`.
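
For illustration, the manual approach could be sketched as a recursive walk over
directory entries. The ``walk`` function and its ``fetch`` parameter are hypothetical:
``fetch`` is any callable that returns the parsed JSON list for a directory URL, for
example ``lambda url: requests.get(url).json()``. The sketch uses a small in-memory
stand-in with placeholder URLs so that it runs without network access.

.. code:: python

    def walk(directory_url, fetch, prefix=""):
        """Yield the path of every non-directory entry below a directory."""
        for entry in fetch(directory_url):
            path = prefix + entry["name"]
            if entry["type"] == "dir":
                # Directory entries link to another directory listing.
                yield from walk(entry["target_url"], fetch, path + "/")
            else:
                # Any entry that is not a directory is treated as a leaf.
                yield path

    # In-memory stand-in for the API; the keys are placeholder URLs.
    fake_api = {
        "root": [
            {"name": ".dockerignore", "type": "file", "target_url": "content-1"},
            {"name": "doc", "type": "dir", "target_url": "doc"},
        ],
        "doc": [
            {"name": "Readme.md", "type": "file", "target_url": "content-2"},
        ],
    }

    for path in walk("root", fake_api.__getitem__):
        print(path)

Against the real API this issues one request per directory, which is why the cooking
feature is usually the better choice for whole projects.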

Download content of a project
-----------------------------

When we ask the archive to cook a directory for us, it invokes an
asynchronous job to recursively fetch the directories and files of the
project, following the graph down to the leaves (files) and exporting the
result as a tar.gz file. This procedure is handled by the :ref:`swh-vault
component <swh-vault>`, and it’s all automatic.

Order the meal
^^^^^^^^^^^^^^

A cooking job can be invoked for revisions, directories or snapshots
(soon). It is initiated with a POST request on the ``vault/<type>/``
endpoint, and its complete syntax is:

``/api/1/vault/directory/<directory_id>/``

The first POST request initiates the cooking, and subsequent GET requests can fetch the
job result and download the archive. See the :ref:`Software Heritage documentation
<vault-primer>` on this, with useful examples. The API endpoint is documented
:swh_web:`here <api/1/vault/directory/doc/>`.

In this example we will fetch the content of the root directory that we
previously identified.

.. code:: python

    mealr = requests.post("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
    mealj = mealr.json()
    jprint(mealj)


.. code::

    {
        "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
        "id": 379321799,
        "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
        "obj_type": "directory",
        "progress_message": null,
        "status": "done"
    }


Ask if it’s ready
^^^^^^^^^^^^^^^^^

We can use a GET request on the same URL to get information about the
process status:

.. code:: python

    statusr = requests.get("https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/")
    statusj = statusr.json()
    jprint(statusj)


.. code::

    {
        "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
        "id": 379321799,
        "obj_id": "3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200",
        "obj_type": "directory",
        "progress_message": null,
        "status": "done"
    }
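
Since cooking is asynchronous, a script typically has to repeat this GET request until
the job is finished. The sketch below shows one way to do so. ``wait_until_cooked`` and
its ``get_status`` parameter are hypothetical: ``get_status`` is any callable returning
the parsed status object, such as ``lambda: requests.get(url).json()``. Only the
``done`` status appears in the responses above; the ``pending`` and ``failed`` values
used here are assumptions to be checked against the vault documentation.

.. code:: python

    import time

    def wait_until_cooked(get_status, interval=5.0, max_tries=60, sleep=time.sleep):
        """Poll the cooking status and return fetch_url once the job is done."""
        for _ in range(max_tries):
            status = get_status()
            if status["status"] == "done":
                return status["fetch_url"]
            if status["status"] == "failed":  # assumed status value
                raise RuntimeError(status.get("progress_message") or "cooking failed")
            sleep(interval)
        raise TimeoutError("cooking did not finish in time")

    # Simulated sequence of responses, so the sketch runs without the network.
    responses = iter([
        {"status": "pending", "fetch_url": None},  # assumed status value
        {
            "status": "done",
            "fetch_url": "https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/",
        },
    ])
    url = wait_until_cooked(lambda: next(responses), sleep=lambda seconds: None)
    print(url)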


Get the plate
^^^^^^^^^^^^^

Once the processing is finished (it can take up to a few minutes), the
tar.gz archive can be downloaded through the ``fetch_url`` link and
extracted with ``tar``:

::

   boris@castalia:downloads$ curl https://archive.softwareheritage.org/api/1/vault/directory/3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/raw/ -o myarchive.tar.gz
     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                    Dload  Upload   Total   Spent    Left  Speed
   100 9555k  100 9555k    0     0  1459k      0  0:00:06  0:00:06 --:--:-- 1717k
   boris@castalia:downloads$ ls
   myarchive.tar.gz
   boris@castalia:downloads$ tar xzvf myarchive.tar.gz
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.dockerignore
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.env
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/.gitignore
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.html
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/CODE_OF_CONDUCT.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.html
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/LICENCE.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/Readme.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/Readme.md
   3ee1366c6dd0b7f4ba9536e9bcc300236ac8f200/doc/config
   [SNIP]
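
The extraction step can also be done from Python with the standard ``tarfile`` module,
which keeps the whole workflow in one script. This is a minimal sketch: the
``list_archive`` and ``extract_archive`` helpers are hypothetical, and the archive is
assumed to have been downloaded already.

.. code:: python

    import tarfile

    def list_archive(path):
        """Return the member names of a gzip-compressed tar archive."""
        with tarfile.open(path, "r:gz") as archive:
            return archive.getnames()

    def extract_archive(path, dest="."):
        """Extract a gzip-compressed tar archive into the dest directory."""
        with tarfile.open(path, "r:gz") as archive:
            archive.extractall(dest)

With the file downloaded above, ``list_archive("myarchive.tar.gz")`` returns the member
paths, and ``extract_archive("myarchive.tar.gz")`` unpacks them into the current
directory. As with any archive, it is best to extract only files from a source you trust.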

Conclusion
----------

In this article, we learned **how to explore and use the Software Heritage archive using
its API**: searching for a repository, identifying projects and downloading specific
snapshots of a repository. There is a lot more to the Archive and its API than what we
have seen, and all features are generously documented on the :swh_web:`Software Heritage
web site <api/>`.