Statsd metrics and Grafana dashboards#

This page lists all statsd metrics reported by Software Heritage’s components, and other metrics commonly used to monitor them.

Archive#

  • sql_swh_archive_object_count

  • sql_swh_scheduler_delay

  • swh_archive_object_total

Journal#

  • swh_journal_client_handle_message_total

  • swh_journal_client_status

Client progress and status are monitored using the Kafka estimated time to completion <https://grafana.softwareheritage.org/d/Jayj4QsGk/kafka-estimated-time-to-completion> dashboard for a loader-specific view, and the Kafka consumer lags <https://grafana.softwareheritage.org/d/KvQqUhsWz/kafka-consumers-lag> dashboard to show all consumers at once.

Indexers#

See RPC servers.

Loaders#

Filtered objects, i.e. objects received by the loader that the archive already has (currently only reported by the Git loader):

  • swh_loader_filtered_objects_percent_bucket

  • swh_loader_filtered_objects_percent_count

  • swh_loader_filtered_objects_percent_sum

  • swh_loader_filtered_objects_total_count

  • swh_loader_filtered_objects_total_sum

Git references which are not loaded:

  • swh_loader_git_ignored_refs_percent_bucket

  • swh_loader_git_ignored_refs_percent_count

  • swh_loader_git_ignored_refs_percent_sum

  • swh_loader_git_known_refs_percent_bucket

  • swh_loader_git_known_refs_percent_count

  • swh_loader_git_known_refs_percent_sum

  • swh_loader_git_total

Metadata loading:

  • swh_loader_metadata_fetchers_count and swh_loader_metadata_fetchers_sum: the ratio of sum to count is the average number of fetchers used per visit

  • swh_loader_metadata_objects_count: total number of metadata objects loaded

  • swh_loader_metadata_objects_sum

  • swh_loader_metadata_parent_origins_count and swh_loader_metadata_parent_origins_sum: the ratio is the average number of origins this origin is a fork of
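The averages mentioned above come from dividing a _sum series by its matching _count series. A minimal sketch of that arithmetic follows; the function name and the sample values are hypothetical and do not come from a real deployment.

```python
from typing import Optional


def average_from_sum_and_count(total_sum: float, total_count: float) -> Optional[float]:
    """Return total_sum / total_count, or None when nothing was observed."""
    if total_count <= 0:
        # No visit recorded yet: the average is undefined.
        return None
    return total_sum / total_count


# Hypothetical readings of swh_loader_metadata_fetchers_sum and
# swh_loader_metadata_fetchers_count (illustrative values only).
fetchers_sum = 42.0
fetchers_count = 20.0
average_fetchers_per_visit = average_from_sum_and_count(fetchers_sum, fetchers_count)
```

The same division applies to swh_loader_metadata_parent_origins_sum and swh_loader_metadata_parent_origins_count.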

Performance (all labelled with the name of an operation and, for the Git loader, with whether they are incremental):

  • swh_loader_operation_duration_seconds_bucket

  • swh_loader_operation_duration_seconds_count

  • swh_loader_operation_duration_seconds_error_count

  • swh_loader_operation_duration_seconds_sum
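The _bucket, _count and _sum suffixes suggest a histogram layout. As a rough illustration of how such series can be read, the sketch below estimates a quantile from bucket counts. It assumes the buckets are cumulative and keyed by an upper bound (Prometheus-style), which this page does not state; the function name and sample data are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][1] <= 0:
        return None  # no observations at all
    rank = q * ordered[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in ordered:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Open-ended bucket: fall back to the highest finite bound.
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket <= 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical cumulative bucket counts for one operation (seconds).
sample_buckets = [(0.1, 10.0), (1.0, 60.0), (10.0, 90.0), (math.inf, 100.0)]
median_seconds = estimate_quantile(0.5, sample_buckets)
```

In a dashboard this computation is normally delegated to the query language; the sketch only shows what the bucket series encode.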

Loader status is monitored through the Ingestion status and Loader metrics dashboards, which are focused respectively on loaded objects and loaders themselves.

Object storage#

In addition to the metrics listed under RPC servers:

  • swh_objstorage_in_bytes_total

  • swh_objstorage_out_bytes_total

Outgoing requests#

All these metrics are labelled with api_type and api_instance, which should match the values of lister_name and lister_instance used elsewhere. Currently, both are always github. They are also labelled with username, which is either anonymous or the name of the user who owns the token used to make the request.

  • swh_outbound_api_requests_total: total number of requests

  • swh_outbound_api_responses_total: total number of responses (excluding low-level failures: DNS, TCP, TLS, …), with http_status label

  • swh_outbound_api_remaining_requests: gauge of the value of X-Ratelimit-Remaining

  • swh_outbound_api_reset_seconds: gauge of the value of X-Ratelimit-Reset

  • swh_outbound_api_rate_limited_responses_total

  • swh_outbound_api_sleep_seconds_total: number of seconds spent waiting for rate limits to reset
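To make the relationship between the two gauges and the sleep counter concrete, here is a minimal sketch of how a client could derive a waiting time from the rate-limit headers. It assumes X-Ratelimit-Reset holds a Unix timestamp in seconds, as GitHub's REST API does; the function name is hypothetical and this is not the code Software Heritage uses.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait for the rate limit to reset (0.0 if no wait is needed)."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0  # requests are still allowed in the current window
    # Assumption: the reset value is an epoch timestamp in seconds.
    reset_at = float(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)


# Hypothetical response headers and clock value, for illustration only.
example_headers = {"X-Ratelimit-Remaining": "0", "X-Ratelimit-Reset": "1700000060"}
wait_seconds = seconds_until_reset(example_headers, now=1700000000.0)
```

Summing such waiting times over all requests would give a counter comparable to swh_outbound_api_sleep_seconds_total.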

Provenance#

  • swh_provenance_archive_direct_duration_seconds_bucket

  • swh_provenance_archive_direct_duration_seconds_count

  • swh_provenance_archive_direct_duration_seconds_error_count

  • swh_provenance_archive_direct_duration_seconds_sum

  • swh_provenance_archive_graph_duration_seconds_bucket

  • swh_provenance_archive_graph_duration_seconds_count

  • swh_provenance_archive_graph_duration_seconds_sum

  • swh_provenance_archive_multiplexed_duration_seconds_bucket

  • swh_provenance_archive_multiplexed_duration_seconds_count

  • swh_provenance_archive_multiplexed_duration_seconds_error_count

  • swh_provenance_archive_multiplexed_duration_seconds_sum

  • swh_provenance_archive_multiplexed_per_backend_count

  • swh_provenance_backend_duration_seconds_bucket

  • swh_provenance_backend_duration_seconds_count

  • swh_provenance_backend_duration_seconds_error_count

  • swh_provenance_backend_duration_seconds_sum

  • swh_provenance_backend_operations_total

  • swh_provenance_graph_duration_seconds_bucket

  • swh_provenance_graph_duration_seconds_count

  • swh_provenance_graph_duration_seconds_error_count

  • swh_provenance_graph_duration_seconds_sum

  • swh_provenance_origin_revision_layer_duration_seconds_bucket

  • swh_provenance_origin_revision_layer_duration_seconds_count

  • swh_provenance_origin_revision_layer_duration_seconds_error_count

  • swh_provenance_origin_revision_layer_duration_seconds_sum

  • swh_provenance_storage_postgresql_duration_seconds_bucket

  • swh_provenance_storage_postgresql_duration_seconds_count

  • swh_provenance_storage_postgresql_duration_seconds_error_count

  • swh_provenance_storage_postgresql_duration_seconds_sum

  • swh_provenance_storage_rabbitmq_duration_seconds_bucket

  • swh_provenance_storage_rabbitmq_duration_seconds_count

  • swh_provenance_storage_rabbitmq_duration_seconds_error_count

  • swh_provenance_storage_rabbitmq_duration_seconds_sum

Index of Provenance dashboards

Content and graph replayers#

  • swh_content_replayer_bytes

  • swh_content_replayer_duration_seconds_bucket

  • swh_content_replayer_duration_seconds_count

  • swh_content_replayer_duration_seconds_error_count

  • swh_content_replayer_duration_seconds_sum

  • swh_content_replayer_operations_total

  • swh_content_replayer_retries_total

  • swh_graph_replayer_duration_seconds_bucket

  • swh_graph_replayer_duration_seconds_count

  • swh_graph_replayer_duration_seconds_sum

  • swh_graph_replayer_operations_total

Dashboards:

RPC servers#

indexer_storage, objstorage, storage, and search each report this set of metrics:

  • swh_<NAME>_request_duration_seconds_bucket

  • swh_<NAME>_request_duration_seconds_count

  • swh_<NAME>_request_duration_seconds_error_count

  • swh_<NAME>_request_duration_seconds_sum

indexer_storage and search also have:

  • swh_<NAME>_operations_total
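The <NAME> placeholder expands to one metric family per RPC server. The sketch below spells out that expansion using only the server names and suffixes listed in this section; the helper and constant names are hypothetical.

```python
from typing import List

# Suffixes and server names copied from the lists in this section.
DURATION_SUFFIXES = ["bucket", "count", "error_count", "sum"]
RPC_SERVERS = ["indexer_storage", "objstorage", "storage", "search"]
SERVERS_WITH_OPERATIONS = {"indexer_storage", "search"}


def rpc_metric_names(name: str) -> List[str]:
    """Expand the swh_<NAME>_... templates for one RPC server."""
    names = [
        f"swh_{name}_request_duration_seconds_{suffix}"
        for suffix in DURATION_SUFFIXES
    ]
    if name in SERVERS_WITH_OPERATIONS:
        names.append(f"swh_{name}_operations_total")
    return names


# Build the full list of expected metric names across all RPC servers.
all_rpc_metrics = [metric for server in RPC_SERVERS for metric in rpc_metric_names(server)]
```

A list like this can be handy when checking that a dashboard or alert rule covers every server.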

Scheduler#

  • swh_scheduler_listener_handled_event_total

  • swh_scheduler_origins_enabled

  • swh_scheduler_origins_known

  • swh_scheduler_origins_last_update

  • swh_scheduler_origins_never_visited

  • swh_scheduler_origins_with_pending_changes

  • swh_scheduler_runner_scheduled_task_total

  • swh_task_called_count

  • swh_task_duration_seconds_bucket

  • swh_task_duration_seconds_count

  • swh_task_duration_seconds_error_count

  • swh_task_duration_seconds_sum

  • swh_task_end_ts

  • swh_task_failure_count

  • swh_task_start_ts

  • swh_task_success_count

Scrubber#

Performance:

  • swh_scrubber_batch_duration_seconds_bucket

  • swh_scrubber_batch_duration_seconds_count

  • swh_scrubber_batch_duration_seconds_error_count

  • swh_scrubber_batch_duration_seconds_sum

  • swh_scrubber_objects_hashed_total

Corruptions found:

  • swh_scrubber_hash_mismatch_total

  • swh_scrubber_missing_object_total

Storage#

In addition to the metrics listed under RPC servers:

  • swh_storage_operations_bytes_total, which reports the total number of content bytes going through the RPC server

Webapp#

  • swh_web_accepted_save_requests

  • swh_web_save_requests_delay_seconds

  • swh_web_submitted_save_requests

  • swh_web_submitted_save_requests_from_webhooks

Dashboard: Save Code Now

Other metrics#

Performance of end-to-end tests:

  • swh_e2e_duration_seconds

  • swh_e2e_status