Statsd metrics and Grafana dashboards
This page lists all statsd metrics reported by Software Heritage’s components, and other metrics commonly used to monitor them.
Archive
sql_swh_archive_object_count
sql_swh_scheduler_delay
swh_archive_object_total
Journal
swh_journal_client_handle_message_total
swh_journal_client_status
Client progress and status are monitored using the Kafka estimated time to completion <https://grafana.softwareheritage.org/d/Jayj4QsGk/kafka-estimated-time-to-completion> dashboard for a loader-specific view, and the Kafka consumer lags <https://grafana.softwareheritage.org/d/KvQqUhsWz/kafka-consumers-lag> dashboard to show all consumers at once.
Indexers
See RPC servers.
Loaders
Filtered objects, i.e. objects received by the loader that the archive already has (currently only reported by the Git loader):
swh_loader_filtered_objects_percent_bucket
swh_loader_filtered_objects_percent_count
swh_loader_filtered_objects_percent_sum
swh_loader_filtered_objects_total_count
swh_loader_filtered_objects_total_sum
Git references which are not loaded:
swh_loader_git_ignored_refs_percent_bucket
swh_loader_git_ignored_refs_percent_count
swh_loader_git_ignored_refs_percent_sum
swh_loader_git_known_refs_percent_bucket
swh_loader_git_known_refs_percent_count
swh_loader_git_known_refs_percent_sum
swh_loader_git_total
Metadata loading:
swh_loader_metadata_fetchers_count and swh_loader_metadata_fetchers_sum: the ratio is the average number of fetchers used per visit
swh_loader_metadata_objects_count: total number of metadata objects loaded
swh_loader_metadata_objects_sum
swh_loader_metadata_parent_origins_count and swh_loader_metadata_parent_origins_sum: the ratio is the average number of origins this origin is a fork of
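The ratios mentioned for the metadata metrics can be derived from any `_sum` and `_count` pair. The following is a minimal sketch in Python, assuming the ratio is `_sum` divided by `_count` and that both values have already been fetched from the metrics backend. The function name and the sample numbers are illustrative and not part of any Software Heritage API.

```python
from typing import Optional


def average_per_visit(total_sum: float, total_count: float) -> Optional[float]:
    """Return total_sum / total_count, or None when nothing was recorded."""
    if total_count == 0:
        # No visit recorded yet: the average is undefined.
        return None
    return total_sum / total_count


# Hypothetical values standing in for swh_loader_metadata_fetchers_sum
# and swh_loader_metadata_fetchers_count.
print(average_per_visit(42.0, 28.0))
```

Returning None for an empty counter avoids a division by zero on a freshly deployed loader.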
Performance (all labelled with the name of an operation and, for the Git loader, with whether the load is incremental):
swh_loader_operation_duration_seconds_bucket
swh_loader_operation_duration_seconds_count
swh_loader_operation_duration_seconds_error_count
swh_loader_operation_duration_seconds_sum
Loader status is monitored through the Ingestion status and Loader metrics dashboards, which are focused respectively on loaded objects and loaders themselves.
Object storage
In addition to the metrics listed under RPC servers:
swh_objstorage_in_bytes_total
swh_objstorage_out_bytes_total
Outgoing requests
All these metrics are labelled with api_type and api_instance, which
should match the values of lister_name and lister_instance used elsewhere.
Currently, the only value for both is github.
They are also labelled with username, which is either anonymous or the name of
the user owning the token used to make the request.
swh_outbound_api_requests_total: total number of requests
swh_outbound_api_responses_total: total number of responses (excluding low-level failures: DNS, TCP, TLS, …), with http_status label
swh_outbound_api_remaining_requests: gauge of the value of X-Ratelimit-Remaining
swh_outbound_api_reset_seconds: gauge of the value of X-Ratelimit-Reset
swh_outbound_api_rate_limited_responses_total
swh_outbound_api_sleep_seconds_total: number of seconds spent waiting for rate limits to reset
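To illustrate how the two rate-limit gauges relate to response headers, here is a minimal sketch in Python. It is not the actual Software Heritage implementation: the helper `rate_limit_gauges` is hypothetical, and the header values are made up. It only assumes that each gauge carries the numeric value of the corresponding header, as described in the list above.

```python
from typing import Dict, Mapping


def rate_limit_gauges(headers: Mapping[str, str]) -> Dict[str, float]:
    """Map rate-limit response headers to gauge values (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    mapping = {
        "x-ratelimit-remaining": "swh_outbound_api_remaining_requests",
        "x-ratelimit-reset": "swh_outbound_api_reset_seconds",
    }
    gauges: Dict[str, float] = {}
    for header, metric in mapping.items():
        # Skip headers the API did not send rather than reporting a fake value.
        if header in lowered:
            gauges[metric] = float(lowered[header])
    return gauges


# Made-up header values, for illustration only.
sample = {"X-RateLimit-Remaining": "4999", "X-RateLimit-Reset": "1700000000"}
print(rate_limit_gauges(sample))
```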
Provenance
swh_provenance_archive_direct_duration_seconds_bucket
swh_provenance_archive_direct_duration_seconds_count
swh_provenance_archive_direct_duration_seconds_error_count
swh_provenance_archive_direct_duration_seconds_sum
swh_provenance_archive_graph_duration_seconds_bucket
swh_provenance_archive_graph_duration_seconds_count
swh_provenance_archive_graph_duration_seconds_sum
swh_provenance_archive_multiplexed_duration_seconds_bucket
swh_provenance_archive_multiplexed_duration_seconds_count
swh_provenance_archive_multiplexed_duration_seconds_error_count
swh_provenance_archive_multiplexed_duration_seconds_sum
swh_provenance_archive_multiplexed_per_backend_count
swh_provenance_backend_duration_seconds_bucket
swh_provenance_backend_duration_seconds_count
swh_provenance_backend_duration_seconds_error_count
swh_provenance_backend_duration_seconds_sum
swh_provenance_backend_operations_total
swh_provenance_graph_duration_seconds_bucket
swh_provenance_graph_duration_seconds_count
swh_provenance_graph_duration_seconds_error_count
swh_provenance_graph_duration_seconds_sum
swh_provenance_origin_revision_layer_duration_seconds_bucket
swh_provenance_origin_revision_layer_duration_seconds_count
swh_provenance_origin_revision_layer_duration_seconds_error_count
swh_provenance_origin_revision_layer_duration_seconds_sum
swh_provenance_storage_postgresql_duration_seconds_bucket
swh_provenance_storage_postgresql_duration_seconds_count
swh_provenance_storage_postgresql_duration_seconds_error_count
swh_provenance_storage_postgresql_duration_seconds_sum
swh_provenance_storage_rabbitmq_duration_seconds_bucket
swh_provenance_storage_rabbitmq_duration_seconds_count
swh_provenance_storage_rabbitmq_duration_seconds_error_count
swh_provenance_storage_rabbitmq_duration_seconds_sum
Content and graph replayers
swh_content_replayer_bytes
swh_content_replayer_duration_seconds_bucket
swh_content_replayer_duration_seconds_count
swh_content_replayer_duration_seconds_error_count
swh_content_replayer_duration_seconds_sum
swh_content_replayer_operations_total
swh_content_replayer_retries_total
swh_graph_replayer_duration_seconds_bucket
swh_graph_replayer_duration_seconds_count
swh_graph_replayer_duration_seconds_sum
swh_graph_replayer_operations_total
Dashboards:
RPC servers
indexer_storage, objstorage, storage, and search each report this set of metrics:
swh_<NAME>_request_duration_seconds_bucket
swh_<NAME>_request_duration_seconds_count
swh_<NAME>_request_duration_seconds_error_count
swh_<NAME>_request_duration_seconds_sum
indexer_storage and search also have:
swh_<NAME>_operations_total
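The `swh_<NAME>_...` templates above expand to one concrete metric name per server. The sketch below, in Python, shows that expansion using only the server names and suffixes listed in this section. The function `rpc_metric_names` is a hypothetical helper written for illustration, not something provided by Software Heritage.

```python
from typing import List

# Suffixes of the request duration metrics reported by every RPC server.
DURATION_SUFFIXES = ("bucket", "count", "error_count", "sum")
# Servers that additionally report swh_<NAME>_operations_total.
WITH_OPERATIONS_TOTAL = ("indexer_storage", "search")


def rpc_metric_names(name: str) -> List[str]:
    """Expand the swh_<NAME>_ templates for one RPC server (hypothetical helper)."""
    names = [
        f"swh_{name}_request_duration_seconds_{suffix}"
        for suffix in DURATION_SUFFIXES
    ]
    if name in WITH_OPERATIONS_TOTAL:
        names.append(f"swh_{name}_operations_total")
    return names


for server in ("indexer_storage", "objstorage", "storage", "search"):
    for metric in rpc_metric_names(server):
        print(metric)
```

Such a list can be handy when checking that a dashboard query covers every server.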
Scheduler
swh_scheduler_listener_handled_event_total
swh_scheduler_origins_enabled
swh_scheduler_origins_known
swh_scheduler_origins_last_update
swh_scheduler_origins_never_visited
swh_scheduler_origins_with_pending_changes
swh_scheduler_runner_scheduled_task_total
swh_task_called_count
swh_task_duration_seconds_bucket
swh_task_duration_seconds_count
swh_task_duration_seconds_error_count
swh_task_duration_seconds_sum
swh_task_end_ts
swh_task_failure_count
swh_task_start_ts
swh_task_success_count
Search
See RPC servers.
Scrubber
Performance:
swh_scrubber_batch_duration_seconds_bucket
swh_scrubber_batch_duration_seconds_count
swh_scrubber_batch_duration_seconds_error_count
swh_scrubber_batch_duration_seconds_sum
swh_scrubber_objects_hashed_total
Corruptions found:
swh_scrubber_hash_mismatch_total
swh_scrubber_missing_object_total
Storage
In addition to the metrics listed under RPC servers:
swh_storage_operations_bytes_total, which reports the total number of content bytes going through the RPC server
Webapp
swh_web_accepted_save_requests
swh_web_save_requests_delay_seconds
swh_web_submitted_save_requests
swh_web_submitted_save_requests_from_webhooks
Dashboard: Save Code Now
Other metrics
Performance of end-to-end tests:
swh_e2e_duration_seconds
swh_e2e_status