Statsd metrics and Grafana dashboards#
This page lists all statsd metrics reported by Software Heritage’s components, and other metrics commonly used to monitor them.
Archive#
sql_swh_archive_object_count
sql_swh_scheduler_delay
swh_archive_object_total
Journal#
swh_journal_client_handle_message_total
swh_journal_client_status
Client progress and status are monitored using the Kafka estimated time to completion <https://grafana.softwareheritage.org/d/Jayj4QsGk/kafka-estimated-time-to-completion> dashboard for a loader-specific view, and the Kafka consumer lags <https://grafana.softwareheritage.org/d/KvQqUhsWz/kafka-consumers-lag> dashboard to show all consumers at once.
Indexers#
See RPC servers.
Loaders#
Filtered objects, i.e. objects received by the loader that the archive already has (currently only reported by the Git loader):
swh_loader_filtered_objects_percent_bucket
swh_loader_filtered_objects_percent_count
swh_loader_filtered_objects_percent_sum
swh_loader_filtered_objects_total_count
swh_loader_filtered_objects_total_sum
Git references which are not loaded:
swh_loader_git_ignored_refs_percent_bucket
swh_loader_git_ignored_refs_percent_count
swh_loader_git_ignored_refs_percent_sum
swh_loader_git_known_refs_percent_bucket
swh_loader_git_known_refs_percent_count
swh_loader_git_known_refs_percent_sum
swh_loader_git_total
Metadata loading:
swh_loader_metadata_fetchers_count and swh_loader_metadata_fetchers_sum: the ratio is the average number of fetchers used per visit
swh_loader_metadata_objects_count: total number of metadata objects loaded
swh_loader_metadata_objects_sum
swh_loader_metadata_parent_origins_count and swh_loader_metadata_parent_origins_sum: the ratio is the average number of origins this origin is a fork of
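The "ratio" mentioned for the `_sum` and `_count` pairs can be sketched as follows. This is a minimal illustration, assuming the usual convention that `_sum` accumulates the observed values and `_count` the number of observations; the helper name and the sample values are hypothetical and are not taken from the Software Heritage code base.

```python
from typing import Optional


def average_per_observation(total_sum: float, count: float) -> Optional[float]:
    """Return total_sum / count, or None when nothing was observed yet."""
    if count <= 0:
        return None
    return total_sum / count


# Hypothetical sample values, for illustration only.
fetchers_sum = 42.0    # stands in for swh_loader_metadata_fetchers_sum
fetchers_count = 14.0  # stands in for swh_loader_metadata_fetchers_count

average_fetchers = average_per_observation(fetchers_sum, fetchers_count)
```

Returning `None` for an empty series avoids a division by zero on a freshly started loader.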
Performance (all labelled with the name of an operation and, for the Git loader, with whether they are incremental):
swh_loader_operation_duration_seconds_bucket
swh_loader_operation_duration_seconds_count
swh_loader_operation_duration_seconds_error_count
swh_loader_operation_duration_seconds_sum
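As an illustration of how the `_bucket` series are typically read, the sketch below estimates a quantile from cumulative buckets. It assumes Prometheus-style histograms, where each bucket carries an upper bound and the count of observations at or below that bound; the function name and the sample data are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The open-ended bucket has no upper bound to interpolate to.
                return prev_bound
            if count == prev_count:
                return bound  # avoids a division by zero when q is 0
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets for one operation, in seconds.
sample = [(0.5, 10), (1.0, 30), (5.0, 38), (math.inf, 40)]
median = estimate_quantile(0.5, sample)
```

The result is only as precise as the bucket boundaries allow, so it is an estimate rather than an exact percentile.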
Loader status is monitored through the Ingestion status and Loader metrics dashboards, which are focused respectively on loaded objects and loaders themselves.
Object storage#
In addition to the metrics listed under RPC servers:
swh_objstorage_in_bytes_total
swh_objstorage_out_bytes_total
Outgoing requests#
All these metrics are labelled with api_type and api_instance, which should match the values of lister_name and lister_instance used elsewhere. Currently, both are always github.
They are also labelled with username, which is either anonymous or the name of the user owning the token used to make the request.
swh_outbound_api_requests_total: total number of requests
swh_outbound_api_responses_total: total number of responses (excluding low-level failures: DNS, TCP, TLS, …), with http_status label
swh_outbound_api_remaining_requests: gauge of the value of X-Ratelimit-Remaining
swh_outbound_api_reset_seconds: gauge of the value of X-Ratelimit-Reset
swh_outbound_api_rate_limited_responses_total
swh_outbound_api_sleep_seconds_total: number of seconds spent waiting for rate limits to reset
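To make the relationship between the two rate-limit gauges and the sleep counter concrete, here is a minimal sketch of how a client could derive a waiting time from the response headers. It assumes the GitHub convention that `X-Ratelimit-Reset` is a UTC epoch timestamp in seconds; the function and the local accumulator are hypothetical and do not reproduce the actual Software Heritage implementation.

```python
from typing import Mapping


def seconds_until_reset(headers: Mapping[str, str], now: float) -> float:
    """Return how long to wait before the next request, in seconds.

    Header names are looked up exactly as written here; a real HTTP
    client would normally provide a case-insensitive mapping.
    """
    remaining = int(headers.get("X-Ratelimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    reset_at = float(headers.get("X-Ratelimit-Reset", now))
    return max(0.0, reset_at - now)


# Hypothetical accumulator playing the role of
# swh_outbound_api_sleep_seconds_total.
sleep_seconds_total = 0.0

exhausted = {"X-Ratelimit-Remaining": "0", "X-Ratelimit-Reset": "1700000060"}
wait = seconds_until_reset(exhausted, now=1700000000.0)
sleep_seconds_total += wait
```

Passing `now` explicitly, instead of calling the clock inside the function, keeps the computation deterministic and easy to test.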
Provenance#
swh_provenance_archive_direct_duration_seconds_bucket
swh_provenance_archive_direct_duration_seconds_count
swh_provenance_archive_direct_duration_seconds_error_count
swh_provenance_archive_direct_duration_seconds_sum
swh_provenance_archive_graph_duration_seconds_bucket
swh_provenance_archive_graph_duration_seconds_count
swh_provenance_archive_graph_duration_seconds_sum
swh_provenance_archive_multiplexed_duration_seconds_bucket
swh_provenance_archive_multiplexed_duration_seconds_count
swh_provenance_archive_multiplexed_duration_seconds_error_count
swh_provenance_archive_multiplexed_duration_seconds_sum
swh_provenance_archive_multiplexed_per_backend_count
swh_provenance_backend_duration_seconds_bucket
swh_provenance_backend_duration_seconds_count
swh_provenance_backend_duration_seconds_error_count
swh_provenance_backend_duration_seconds_sum
swh_provenance_backend_operations_total
swh_provenance_graph_duration_seconds_bucket
swh_provenance_graph_duration_seconds_count
swh_provenance_graph_duration_seconds_error_count
swh_provenance_graph_duration_seconds_sum
swh_provenance_origin_revision_layer_duration_seconds_bucket
swh_provenance_origin_revision_layer_duration_seconds_count
swh_provenance_origin_revision_layer_duration_seconds_error_count
swh_provenance_origin_revision_layer_duration_seconds_sum
swh_provenance_storage_postgresql_duration_seconds_bucket
swh_provenance_storage_postgresql_duration_seconds_count
swh_provenance_storage_postgresql_duration_seconds_error_count
swh_provenance_storage_postgresql_duration_seconds_sum
swh_provenance_storage_rabbitmq_duration_seconds_bucket
swh_provenance_storage_rabbitmq_duration_seconds_count
swh_provenance_storage_rabbitmq_duration_seconds_error_count
swh_provenance_storage_rabbitmq_duration_seconds_sum
Content and graph replayers#
swh_content_replayer_bytes
swh_content_replayer_duration_seconds_bucket
swh_content_replayer_duration_seconds_count
swh_content_replayer_duration_seconds_error_count
swh_content_replayer_duration_seconds_sum
swh_content_replayer_operations_total
swh_content_replayer_retries_total
swh_graph_replayer_duration_seconds_bucket
swh_graph_replayer_duration_seconds_count
swh_graph_replayer_duration_seconds_sum
swh_graph_replayer_operations_total
Dashboards:
RPC servers#
indexer_storage, objstorage, storage, and search each report this set of metrics:
swh_<NAME>_request_duration_seconds_bucket
swh_<NAME>_request_duration_seconds_count
swh_<NAME>_request_duration_seconds_error_count
swh_<NAME>_request_duration_seconds_sum
indexer_storage and search also have:
swh_<NAME>_operations_total
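The `<NAME>` placeholder expands to one metric family per server. The following sketch simply spells out that expansion from the lists above; the helper and constant names are hypothetical and exist only for illustration.

```python
from typing import List

# Suffixes of the request duration metrics listed above.
REQUEST_DURATION_SUFFIXES = ("bucket", "count", "error_count", "sum")

# Servers that additionally report swh_<NAME>_operations_total.
SERVERS_WITH_OPERATIONS = {"indexer_storage", "search"}


def rpc_metric_names(name: str) -> List[str]:
    """Expand the <NAME> placeholder for one RPC server."""
    names = [
        f"swh_{name}_request_duration_seconds_{suffix}"
        for suffix in REQUEST_DURATION_SUFFIXES
    ]
    if name in SERVERS_WITH_OPERATIONS:
        names.append(f"swh_{name}_operations_total")
    return names


storage_metrics = rpc_metric_names("storage")
search_metrics = rpc_metric_names("search")
```

Server-specific extras, such as the byte counters described in the Object storage and Storage sections, are deliberately left out of this helper.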
Scheduler#
swh_scheduler_listener_handled_event_total
swh_scheduler_origins_enabled
swh_scheduler_origins_known
swh_scheduler_origins_last_update
swh_scheduler_origins_never_visited
swh_scheduler_origins_with_pending_changes
swh_scheduler_runner_scheduled_task_total
swh_task_called_count
swh_task_duration_seconds_bucket
swh_task_duration_seconds_count
swh_task_duration_seconds_error_count
swh_task_duration_seconds_sum
swh_task_end_ts
swh_task_failure_count
swh_task_start_ts
swh_task_success_count
Search#
See RPC servers.
Scrubber#
Performance:
swh_scrubber_batch_duration_seconds_bucket
swh_scrubber_batch_duration_seconds_count
swh_scrubber_batch_duration_seconds_error_count
swh_scrubber_batch_duration_seconds_sum
swh_scrubber_objects_hashed_total
Corruptions found:
swh_scrubber_hash_mismatch_total
swh_scrubber_missing_object_total
Storage#
In addition to the metrics listed under RPC servers:
swh_storage_operations_bytes_total, which reports the total number of content bytes going through the RPC server
Webapp#
swh_web_accepted_save_requests
swh_web_save_requests_delay_seconds
swh_web_submitted_save_requests
swh_web_submitted_save_requests_from_webhooks
Dashboard: Save Code Now
Other metrics#
Performance of end-to-end tests:
swh_e2e_duration_seconds
swh_e2e_status