.. _architecture-overview:

Software Architecture Overview
==============================

From an end-user point of view, the |swh| platform consists of the
:term:`archive`, which can be accessed using the web interface or its REST
API. Behind the scenes (and the web app) are several components/services that
expose different aspects of the |swh| :term:`archive` as internal RPC APIs.
These internal APIs each have a dedicated database, usually PostgreSQL_.

A global (and incomplete) view of this architecture looks like this:

.. thumbnail:: ../images/general-architecture.svg

   General view of the |swh| architecture.

.. _architecture-tier-1:

Core components
---------------

The following components are the foundation of the entire |swh| architecture,
as they fetch data, store it, and make it available to every other service.

Data storage
^^^^^^^^^^^^

The :ref:`Storage <swh-storage>` provides an API to store and retrieve
elements of the :ref:`graph <data-model>`, such as directory structures,
revision histories, and their respective metadata. It relies on the
:ref:`Object Storage <swh-objstorage>` service to store the contents of
source code files themselves.

Both the Storage and the Object Storage are designed as abstractions over
possible backends. The former supports both PostgreSQL (the current solution
in production) and Cassandra (a more scalable option we are exploring). The
latter supports a large variety of "cloud" object storage backends, as well
as a simple local filesystem.

Journal
^^^^^^^

The :term:`Journal <journal>` is a persistent logger of every change in the
archive, with publish-subscribe_ support, built on Kafka. The Storage
publishes a Kafka message to the journal each time a new object is added to
the archive, and many components consume these messages to be notified of
changes. For example, this is how the Scheduler knows when an origin has been
visited and what the resulting status of that visit was, which helps it
decide when to visit that repository again. The journal is also the
foundation of the :ref:`mirror` infrastructure, as it allows mirrors to stay
up to date.
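For illustration, here is a minimal sketch of an external journal consumer,
written directly against the ``confluent-kafka`` client. The broker address
and consumer group are hypothetical, and the topic name merely follows the
``swh.journal.objects.<object type>`` naming convention; real consumers
inside |swh| use the helpers provided by the ``swh.journal`` package and
deserialize the msgpack-encoded messages.

.. code-block:: python

   # Minimal sketch of a journal consumer (hypothetical broker and group id).
   from confluent_kafka import Consumer

   consumer = Consumer({
       "bootstrap.servers": "broker.example.org:9092",  # hypothetical broker
       "group.id": "my-journal-follower",
       "auto.offset.reset": "earliest",
   })
   consumer.subscribe(["swh.journal.objects.origin_visit_status"])

   try:
       while True:
           message = consumer.poll(timeout=1.0)
           if message is None or message.error():
               continue
           # Real journal messages are msgpack-encoded dicts; a real client
           # would deserialize message.value() before using it.
           print("new origin visit status, key:", message.key())
   finally:
       consumer.close()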
Source code scraping
^^^^^^^^^^^^^^^^^^^^

The infrastructure aiming at finding new source code origins (git, mercurial
and other types of VCS, source packages, etc.) and regularly visiting them is
built around a few components based on a task scheduling scaffolding and a
Celery-based asynchronous task execution framework.

The scheduler itself consists of two parts: a generic asynchronous task
management system, and a dedicated management database that gathers and keeps
up to date liveness information about listed origins, which can be used to
choose which of them should be visited first.

To summarize, the parts involved in this carousel are:

:term:`Listers <lister>`
   Tasks aiming at scraping a web site, such as a forge, to gather all the
   source code repositories it can find, also known as
   :term:`origins <origin>`. Lister tasks are triggered by the scheduler, via
   Celery, and fill the listed origins table of the listing and visit
   statistics database (see below).

:term:`Loaders <loader>`
   Tasks dedicated to importing source code from a source code repository (an
   origin). This is the component that inserts :term:`blob` objects in the
   :term:`object storage`, and inserts nodes and edges in the
   :ref:`graph <data-model>`.

:ref:`Scheduler <swh-scheduler>`'s generic task management
   Manages the choreography of listing tasks in |swh|, as well as a few other
   utility tasks (save code now, deposit, vault, indexers). Note that this
   component no longer handles the scheduling of loading tasks. It consists
   of a database and an API for defining task types and creating tasks to be
   scheduled (recurring or one-shot), a tool (the ``scheduler-runner``)
   dedicated to spawning these tasks via the Celery asynchronous execution
   framework, and another tool (the ``scheduler-listener``) dedicated to
   keeping the scheduler database in sync with executed tasks (task execution
   status, execution timestamps, etc.).

:ref:`Scheduler <swh-scheduler>`'s listing and visit statistics
   A database and API for storing liveness information about a listed origin,
   as well as statistics about the loading of that origin. The visit
   statistics are updated from the main :ref:`storage <swh-storage>` Kafka
   journal.

:ref:`Scheduler <swh-scheduler>`'s origin visit scheduling
   A tool that uses the statistics about listed origins and previous visits
   stored in the database to apply scheduling policies and select the next
   pool of origins to visit. It does not use the generic task management
   system, but instead directly spawns loading Celery tasks.

.. thumbnail:: ../images/lister-loader-scheduling-architecture.svg

The Scheduler
~~~~~~~~~~~~~

The :ref:`Scheduler <swh-scheduler>` manages the generic choreography of
jobs/tasks in |swh|, namely listing origins of software source code, loading
them, extracting metadata from loaded origins, and repackaging repositories
into small downloadable archives for the :term:`Vault <vault>`.

It consists of a database where all the scheduling information is stored, an
API providing unified access to this database, and a set of services and
tools to orchestrate the actual scheduling of tasks. Their execution is
delegated to a Celery-based set of asynchronous workers.

While the Scheduler was initially a single generic scheduling utility for all
asynchronous task types, the scheduling of origin visits has since been
extracted into a new, dedicated part of it. These loading tasks used to be
managed by the generic task scheduler as recurring tasks, but their sheer
number became too large to handle efficiently, and their specificities could
not be exploited to schedule origin visits better and more efficiently.

There are now two parts in the scheduler: the original SWH task management
system, and the new origin visit scheduling utility. Both have a similar
architecture at first sight: a database, an API, and a Celery-based execution
system. The main difference is that the new system is dedicated to origin
visits, and can therefore use specific information and metadata on origins to
optimize the scheduling policy: statistics about known origins resulting from
the listing of a forge can be used as input for scheduling policies that take
several metrics into consideration, such as:

- has the origin already been visited,
- if not, how "old" is the origin (what is the timestamp of its first sign of
  activity, e.g. its creation date or the timestamp of its first revision),
- how long ago was the origin last visited,
- how active is the origin (and thus how often it should be visited),
- etc.

For each new source code repository, a ``listed origin`` entry is added to
the scheduler database, along with the timestamp of the last known activity
of this origin as reported by the forge. For already known origins, only this
last activity timestamp is updated, if need be.
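To make the idea concrete, here is a naive sketch of such a visit scheduling
policy. The ``ListedOrigin`` structure and the ranking heuristic below are
illustrative assumptions, not the actual production policy, which lives in
the ``swh.scheduler`` package and weighs more signals than shown here.

.. code-block:: python

   # Illustrative sketch of an origin-visit scheduling policy; the data
   # structure and the weights are made up for this example.
   from dataclasses import dataclass
   from datetime import datetime, timezone
   from typing import Iterable, List, Optional

   @dataclass
   class ListedOrigin:
       url: str
       last_update: Optional[datetime]  # last activity reported by the forge
       last_visit: Optional[datetime]   # last visit seen in the journal

   def visit_priority(origin: ListedOrigin) -> float:
       """Higher value = visit sooner."""
       now = datetime.now(timezone.utc)
       if origin.last_visit is None:
           return float("inf")  # never visited: archive it first
       if origin.last_update and origin.last_update > origin.last_visit:
           # Known activity since the last visit: weight by staleness.
           return (now - origin.last_visit).total_seconds()
       # No known new activity: revisit eventually, at much lower priority.
       return (now - origin.last_visit).total_seconds() / 100

   def select_next_origins(origins: Iterable[ListedOrigin],
                           n: int = 10) -> List[ListedOrigin]:
       return sorted(origins, key=visit_priority, reverse=True)[:n]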
It is then the responsibility of the ``schedule-recurrent`` scheduler service
to check listed origins, as well as visit statistics (see below), in order to
regularly select the next origins to visit. This service also uses live data
from Celery to choose an appropriate number of visits to schedule, keeping
the Celery queues filled at a constant and controlled level.

The following sequence diagram shows the interactions between these
components when a new forge needs to be archived. This example depicts the
case of a gitlab_ forge, but any other supported source type would be very
similar.

.. thumbnail:: ../images/tasks-lister.svg

As this diagram shows, the lister does two things:

- it asks the forge (a gitlab_ instance in this case) for the list of known
  repositories, along with some metadata (notably the last update timestamp),
  and
- it inserts one ``listed origin`` entry for each new source code repository
  it finds, or updates the ``last update`` timestamp for origins that are
  already known.

The sequence diagram below describes the second step: importing the content
of a repository. Once again, we take the example of a git repository, but any
other type of repository would be very similar.

.. thumbnail:: ../images/tasks-git-loader.svg

.. _architecture-tier-2:

Other major components
----------------------

All the components described above are critical to the |swh| archive, as they
are in charge of archiving source code. But they are not enough to provide
another important feature of |swh|: making this archive accessible and
searchable by anyone.

Archive website and API
^^^^^^^^^^^^^^^^^^^^^^^

First of all, the archive website and API, also known as :ref:`swh-web`, is
the main entry point of the archive. This is the component that serves
https://archive.softwareheritage.org/, the window into the entire archive,
providing access to it through a web browser or the HTTP API. It does so by
querying most of the internal APIs of |swh|: the Data Storage (to display
source code repositories and their content), the Scheduler (to allow manual
scheduling of loader tasks through the :swh_web:`Save Code Now <save/>`
feature), and many of the other services described below.

Internal data mining
^^^^^^^^^^^^^^^^^^^^

:term:`Indexers <indexer>` are a type of task that crawls the content of the
:term:`archive` to extract derived information. This ranges from detecting
the MIME type or license of individual files, to reading all types of
metadata files at the root of repositories and storing them together in a
unified format, CodeMeta_.

All results computed by Indexers are stored in a PostgreSQL database, the
Indexer Storage.

Vault
^^^^^

The :term:`Vault <vault>` is an internal API in charge of cooking compressed
archives (zip or tgz) of archived objects on request (via swh-web). These
objects are typically directories or repositories. Since this can be a rather
long process, it is delegated to an asynchronous (Celery) task, through the
Scheduler.

.. _architecture-tier-3:

Extra services
--------------

Finally, |swh| provides additional tools that, although not necessary to
operate the archive, provide convenient interfaces or performance benefits.
It is therefore possible to have a fully-functioning archive without any of
these services (our development Docker environment disables most of them by
default).

Search
^^^^^^

The :ref:`swh-search` service complements both the Storage and the Indexer
Storage to provide efficient, advanced reverse-index search queries, such as
full-text search on origin URLs and metadata.
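For instance, the URL search backed by swh-search can be exercised through
the public web API. The sketch below uses the ``requests`` library against
the public origin search endpoint; the ``limit`` parameter and the response
fields shown are assumptions based on the public API documentation.

.. code-block:: python

   # Query the public origin search endpoint, which swh-web serves by
   # delegating to swh-search when it is deployed.
   import requests

   resp = requests.get(
       "https://archive.softwareheritage.org/api/1/origin/search/scikit-learn/",
       params={"limit": 5},  # assumed query parameter
   )
   resp.raise_for_status()
   for origin in resp.json():
       print(origin["url"])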
The service is a recent addition to the |swh| architecture, based on
Elasticsearch, and is currently in use only for URL search.

Graph
^^^^^

:ref:`swh-graph` is also a recent addition to the architecture, designed to
complement the Storage using a specialized backend. It leverages WebGraph_ to
store a compressed in-memory representation of the entire graph, and provides
fast implementations of graph traversal algorithms.

Counters
^^^^^^^^

The :swh_web:`archive's landing page <>` features counts of the total number
of files/directories/revisions/... in the archive. Perhaps surprisingly,
counting unique objects at |swh|'s scale is hard, and it is a performance
bottleneck when implemented purely in the Storage's SQL database.
:ref:`swh-counters` provides an alternative design to solve this issue: it
reads new objects from the Journal and counts them using Redis_'
HyperLogLog_ feature, and keeps the history of these counters over time
using Prometheus_.

Deposit
^^^^^^^

The :ref:`Deposit <swh-deposit>` is an alternative way to add content to the
archive. While listers and loaders, as we saw above, **discover**
repositories and **pull** artifacts into the archive, the Deposit allows
trusted partners to **push** the content of their repository directly to the
archive; this content is then loaded internally by the
:mod:`Deposit Loader <swh.loader.package.deposit>`.

The Deposit is centered on the SWORDv2_ protocol, which allows depositing
archives (usually TAR or ZIP) along with metadata in XML. The Deposit has its
own HTTP interface, independent of swh-web. It also has its own SWORD client,
which is specialized to interact with the Deposit server.

Authentication
^^^^^^^^^^^^^^

While the archive itself is public, |swh| reserves some features for
authenticated clients, such as higher rate limits, access to experimental
APIs (currently: the Graph service), or the Deposit. This is managed
centrally by :ref:`swh-auth` using Keycloak.

Web Client, Fuse, Scanner
^^^^^^^^^^^^^^^^^^^^^^^^^

SWH provides a few tools to access the archive via the API:

* :ref:`swh-web-client`, a command-line interface to authenticate with SWH
  and a library to access the API from Python programs
* :ref:`swh-fuse`, a Filesystem in USErspace implementation that exposes the
  entire archive as a regular directory on your computer
* :ref:`swh-scanner`, a work-in-progress tool to check which of the files in
  a project are already in the archive, without submitting them

Replayers and backfillers
^^^^^^^^^^^^^^^^^^^^^^^^^

As the Journal and the various databases may get out of sync for various
reasons (scrubbing of either of them, migrations, database additions, ...),
and because some databases need to follow the content of the Journal
(mirrors), some places of the |swh| codebase contain tools known as
"replayers" and "backfillers", designed to keep them in sync:

* the :mod:`Object Storage Replayer <swh.objstorage.replayer>` copies the
  content of one object storage to another (see the sketch after this list).
  It first performs a full copy, then streams new objects using the Journal
  to stay up to date;
* the Storage Replayer loads the entire content of the Journal into a Storage
  database, and also keeps them in sync. This is used for mirrors, and when
  creating a new database;
* the Storage Backfiller does the opposite. It was initially used to populate
  the Journal from the database, and is occasionally used when one needs to
  clear a topic in the Journal and recreate it.
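The general pattern shared by these tools can be sketched as follows. The
object storage interface (``get``/``add``) is a simplified, hypothetical
stand-in for the real one, and the journal client is reduced to a plain
iterator of object identifiers; the actual implementation lives in the
``swh.objstorage.replayer`` module.

.. code-block:: python

   # Simplified sketch of an object storage replayer: copy every object
   # announced on the journal from a source object storage to a destination.
   # ObjStorage, ``src``, ``dst`` and ``journal_ids`` are hypothetical
   # stand-ins for the real swh.objstorage / swh.journal interfaces.
   from typing import Iterable, Protocol

   class ObjStorage(Protocol):
       def get(self, obj_id: bytes) -> bytes: ...
       def add(self, content: bytes, obj_id: bytes) -> None: ...

   def replay(journal_ids: Iterable[bytes],
              src: ObjStorage, dst: ObjStorage) -> None:
       for obj_id in journal_ids:
           # Fetch the object from the source storage and copy it over.
           # A production replayer batches objects, retries on transient
           # errors, and skips objects already present in the destination.
           dst.add(src.get(obj_id), obj_id)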
.. _celery: https://www.celeryproject.org
.. _CodeMeta: https://codemeta.github.io/
.. _gitlab: https://gitlab.com
.. _PostgreSQL: https://www.postgresql.org/
.. _Prometheus: https://prometheus.io/
.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
.. _Redis: https://redis.io/
.. _SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html
.. _HyperLogLog: https://redislabs.com/redis-best-practices/counting/hyperloglog/
.. _WebGraph: https://webgraph.di.unimi.it/