Software Heritage Journal clients#
Journal clients are processes that read data from the Software Heritage Journal, in order to efficiently process all existing objects and to process new objects as they arrive.
Some journal clients, such as swh-dataset, only read existing objects and stop when they are done.
Other journal clients, such as the mirror, are expected to read from the journal continuously.
They can run in parallel, and the swh.journal.client module provides an abstraction that handles all the setup, so an actual client consists of a single function that takes model objects as parameters.
For example, a very simple journal client that prints all revisions and releases to the console can be implemented like this:
import pprint


def process_objects(all_objects):
    """Worker function handling incoming objects"""
    for object_type, objects in all_objects.items():
        for object_ in objects:
            print(f"New {object_type} object:")
            pprint.pprint(object_)
            print()


def main():
    from swh.journal.client import get_journal_client

    # Usually read from a config file:
    config = {
        "brokers": ["localhost:9092"],
        "group_id": "my-consumer-group",
        "auto_offset_reset": "earliest",
    }

    # Initialize the client
    client = get_journal_client(
        "kafka", object_types=["revision", "release"], privileged=True, **config
    )

    try:
        # Run the client forever
        client.process(process_objects)
    except KeyboardInterrupt:
        print("Called Ctrl-C, exiting.")
        exit(0)


if __name__ == "__main__":
    main()
Parallelization#
A single journal client, like the one above, processes objects sequentially. It can however be parallelized by running the same program multiple times: Kafka coordinates the processes so the load is shared across them.
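For illustration, here is a minimal sketch that achieves the same effect from a single program by spawning several worker processes; the broker address, group name, and worker count are placeholders, and the worker function is a stripped-down version of the example above:

import multiprocessing


def process_objects(all_objects):
    # Trivial worker function, for demonstration only
    for object_type, objects in all_objects.items():
        print(f"Received {len(objects)} {object_type} objects")


def worker():
    from swh.journal.client import get_journal_client

    # Each worker creates its own client. Because all clients share the
    # same group_id, Kafka assigns each one a disjoint subset of the
    # topic partitions, spreading the load across workers.
    client = get_journal_client(
        "kafka",
        object_types=["revision", "release"],
        brokers=["localhost:9092"],  # placeholder broker address
        group_id="my-consumer-group",
        auto_offset_reset="earliest",
    )
    client.process(process_objects)


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker) for _ in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()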
Authentication#
In production, journal clients need credentials to access the journal.
Once you have credentials, they can be configured by adding this to the config:
config = {
    "sasl.mechanism": "SCRAM-SHA-512",
    "security.protocol": "SASL_SSL",
    "sasl.username": "<username>",
    "sasl.password": "<password>",
}
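These keys go in the same configuration dictionary as the connection settings and are passed to get_journal_client together. A minimal sketch, where the broker address, group name, and credentials are placeholders:

from swh.journal.client import get_journal_client

config = {
    # Connection settings (placeholder values):
    "brokers": ["<broker>:9093"],
    "group_id": "<username>-my-consumer-group",
    "auto_offset_reset": "earliest",
    # Authentication credentials:
    "sasl.mechanism": "SCRAM-SHA-512",
    "security.protocol": "SASL_SSL",
    "sasl.username": "<username>",
    "sasl.password": "<password>",
}

client = get_journal_client("kafka", object_types=["revision"], **config)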
There are two types of client: privileged and unprivileged.
The former has access to all the data; the latter gets redacted authorship information, for privacy reasons: the name and email fields of the author and committer attributes of release and revision objects are blank, and their fullname is a SHA256 hash of the actual fullname.
The privileged parameter to get_journal_client must be set accordingly.
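For illustration, this is roughly the difference an unprivileged client sees on an author field; the values below are hypothetical and only sketch the shape described above, not an exact dump:

import hashlib

# With privileged=True, the full authorship data is visible:
author = {
    "fullname": b"Jane Doe <jane@example.org>",
    "name": b"Jane Doe",
    "email": b"jane@example.org",
}

# With privileged=False, name and email are blank (exact representation
# may differ) and fullname is the SHA256 hash of the real fullname:
anonymized_author = {
    "fullname": hashlib.sha256(b"Jane Doe <jane@example.org>").digest(),
    "name": None,
    "email": None,
}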
Order guarantees and replaying#
The journal client shares the ordering guarantees of Kafka. The short version is that you should not assume any ordering unless the Kafka documentation specifies otherwise, nor assume that two related objects are sent to the same process.
We call “replay” any workflow that involves a journal client writing all (or most) objects to a new database. This can be either continuous (in effect, this produces a mirror database), or one-off.
Either way, particular attention should be given to this lax ordering, as replaying produces databases that are (temporarily) inconsistent, because some objects may point to objects that are not replayed yet.
For one-off replays, this can be mostly solved by processing objects in reverse topological order: as contents don't reference any object, directories only reference contents and directories, revisions only reference directories, etc., replayers can first process all contents, then all directories, then all revisions. This keeps the number of inconsistencies relatively small.
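A minimal sketch of such a one-off replay, assuming a hypothetical insert_into_mirror writer function, a placeholder broker and group name, and the client's stop_on_eof option (an assumption of this sketch) to stop once each topic is caught up:

from swh.journal.client import get_journal_client

config = {
    "brokers": ["localhost:9092"],  # placeholder broker address
    "group_id": "my-replayer-group",
    "auto_offset_reset": "earliest",
}


def insert_into_mirror(all_objects):
    # Hypothetical writer: insert each batch of objects into the
    # destination database.
    ...


# Referenced objects first, referencing objects last, so that most
# references already resolve when an object is inserted:
for object_type in ["content", "directory", "revision", "release", "snapshot"]:
    client = get_journal_client(
        "kafka",
        object_types=[object_type],
        stop_on_eof=True,  # assumed option: stop at the end of the topic
        **config,
    )
    client.process(insert_into_mirror)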
For continuous replays, replayed databases are eventually consistent.