
Alert Runbook: Monitoring

| Revision | Date       | Description    |
|----------|------------|----------------|
| 1.0      | 24.07.2024 | Init Changelog |

Introduction

This article lists all alerts defined for the monitoring stack.

Thanos

Sidecar

AlertName: \[ Thanos \]\[ Sidecar \] Bucket Operations Failed

  • Description: Thanos Sidecar bucket operations are failing.

  • Severity: Critical

  • Grafana:

    • Description: Node {{ $labels.node }} has been unready for a long time.

    • Summary: [ Kubernetes ] [ Node ] [ {{ $labels.kube_cluster }} ] Node {{ $labels.node }} is not ready

    • Metric: sum by (kube_cluster, node) (kube_node_status_condition{condition="Ready", status="true",job="kube-state-metrics"})

    • Reduce:

      • Input: Metric

      • Function: Last

      • Mode: Strict

    • Threshold: is below 1

  • Runbook: \[ Thanos \]\[ Sidecar \] Bucket Operations Failed

AlertName: \[ Thanos \]\[ Sidecar \] No connection to started Prometheus

  • Description: Thanos Sidecar is unhealthy.

  • Severity: Critical

  • Grafana:

    • Description: Thanos Sidecar {{ $labels.kube_cluster }} is unhealthy.

    • Summary: [ Thanos ][ Sidecar ][ {{ $labels.kube_cluster }} ] No connection to started Prometheus

    • Metric: thanos_sidecar_prometheus_up{job="thanos-sidecar"} == 0 AND on (kube_cluster) prometheus_tsdb_data_replay_duration_seconds{job="prometheus"} != 0

    • Reduce:

      • Input: Metric

      • Function: Last

      • Mode: Strict

    • Threshold: is above 0

  • Runbook: \[ Thanos \]\[ Sidecar \] No connection to started Prometheus
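
The Metric expression for this alert combines two value filters with a vector match on `kube_cluster`. As a rough illustration only, the Python sketch below mimics which series such an expression returns. The `Sample` type, the `make` helper, the function name `sidecars_without_prometheus`, and the sample data are all hypothetical and are not part of the alert rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One instant-vector sample: a label set plus a value (hypothetical type)."""
    labels: tuple  # sorted (name, value) pairs
    value: float

    def label(self, name):
        return dict(self.labels).get(name)


def make(value, **labels):
    """Build a Sample from keyword labels (illustrative helper)."""
    return Sample(tuple(sorted(labels.items())), value)


def sidecars_without_prometheus(sidecar_up, replay_duration):
    # Left side: `thanos_sidecar_prometheus_up == 0` keeps samples whose value is 0.
    left = [s for s in sidecar_up if s.value == 0]
    # Right side: `prometheus_tsdb_data_replay_duration_seconds != 0`
    # keeps clusters whose Prometheus reports a non-zero replay duration.
    started = {s.label("kube_cluster") for s in replay_duration if s.value != 0}
    # `AND on (kube_cluster)`: keep left samples that have a match on the right.
    return [s for s in left if s.label("kube_cluster") in started]


# Made-up data: cluster "a" has a disconnected sidecar and a started Prometheus,
# "b" is healthy, "c" has a disconnected sidecar but no started Prometheus.
sidecar_up = [make(0, kube_cluster="a"), make(1, kube_cluster="b"), make(0, kube_cluster="c")]
replay = [make(12.5, kube_cluster="a"), make(3.0, kube_cluster="b"), make(0, kube_cluster="c")]

result = sidecars_without_prometheus(sidecar_up, replay)
print([s.label("kube_cluster") for s in result])  # ['a']
```

In PromQL, a comparison without `bool` and the `and` operator both keep the left-hand sample, so the series returned here carry the value 0. It may be worth checking how the Threshold step evaluates that value in the live rule.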

Query Frontend

AlertName: \[ Thanos \]\[ Query Frontend \] Replica missing

  • Description: Thanos Query Frontend has a missing replica.

  • Severity: Critical

  • Grafana:

    • Description: Thanos Query Frontend has missing replica.

    • Summary: [ Thanos ][ Query Frontend ] Replica missing

    • Metric: sum(kube_deployment_status_replicas_ready{deployment=~"thanos-query-frontend"}) / sum(kube_deployment_spec_replicas{deployment="thanos-query-frontend"})

    • Reduce:

      • Input: Metric

      • Function: Last

      • Mode: Strict

    • Threshold: is below 1

  • Runbook: \[ Thanos \]\[ Query Frontend \] Replica missing
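
The replica alerts divide the number of ready replicas by the number of desired replicas and fire when the ratio is below 1. A minimal sketch of that arithmetic follows; the function names `ready_ratio` and `replica_missing` are hypothetical, and the zero-replica handling is an assumption that mirrors PromQL float division rather than anything stated in the rule.

```python
import math


def ready_ratio(ready_replicas, desired_replicas):
    """Ratio of ready to desired replicas, as in the Metric expression."""
    if desired_replicas == 0:
        # PromQL float division by zero does not raise: 0/0 is NaN, x/0 is +Inf.
        return math.nan if ready_replicas == 0 else math.inf
    return ready_replicas / desired_replicas


def replica_missing(ready_replicas, desired_replicas):
    """True when the ratio is below the threshold of 1."""
    # Any comparison with NaN is False in Python, so 0/0 does not count as missing here.
    return ready_ratio(ready_replicas, desired_replicas) < 1


print(replica_missing(2, 3))  # True: one of three replicas is not ready
print(replica_missing(3, 3))  # False: all replicas are ready
```

The same arithmetic applies to any alert that uses the `kube_deployment_status_replicas_ready` / `kube_deployment_spec_replicas` ratio.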

Query

AlertName: \[ Thanos \]\[ Query \] Replica missing

  • Description: Thanos Query has a missing replica.

  • Severity: Critical

  • Grafana:

    • Description: Thanos Query has missing replica. Check on

    • Summary: [ Thanos ][ Query ] Replica missing

    • Metric: sum(kube_deployment_status_replicas_ready{deployment=~"thanos-query"}) / sum(kube_deployment_spec_replicas{deployment="thanos-query"})

    • Reduce:

      • Input: Metric

      • Function: Last

      • Mode: Strict

    • Threshold: is below 1

  • Runbook: \[ Thanos \]\[ Query \] Replica missing

AlertName: \[ Thanos \]\[ Query \] Http Request Query Error Rate High

  • Description: Thanos Query is failing to handle more than 5% of query requests.

  • Severity: Critical

  • Grafana:

    • Description: Thanos Query is failing to handle {{ $value }}% of "query" requests.

    • Summary: [ Thanos ][ Query ] Http Request Query Error Rate High

    • Metric: (sum by (kubernetes_name) (rate(http_requests_total{code=~"5..", kubernetes_name="thanos-query", handler="query"}[5m]))/ sum by (kubernetes_name) (rate(http_requests_total{kubernetes_name="thanos-query", handler="query"}[5m]))) * 100

    • Reduce:

      • Input: Metric

      • Function: Last

      • Mode: Replace Non-numeric Values

    • Threshold: is above 5

  • Runbook: \[ Thanos \]\[ Query \] Http Request Query Error Rate High
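
This alert is the only one that uses the Reduce mode "Replace Non-numeric Values". One plausible reason is that the error-rate division yields NaN when there are no requests at all. The sketch below is a simplified model of the percentage calculation and of a "Last" reduction in three modes. The function names, the mode strings, and the replacement value of 0 are assumptions for illustration; the source does not state which replacement value is configured.

```python
import math


def error_rate_percent(error_rps, total_rps):
    """Share of 5xx requests in percent; NaN when there is no traffic (0/0)."""
    if total_rps == 0:
        return math.nan
    return error_rps / total_rps * 100


def reduce_last(values, mode="strict", replace_with=0.0):
    """Simplified model of a 'Last' reduction over a series of values."""
    def is_number(v):
        return v is not None and math.isfinite(v)

    if mode == "drop":
        # Drop non-numeric values before reducing.
        values = [v for v in values if is_number(v)]
    elif mode == "replace":
        # Replace non-numeric values with a constant (assumed to be 0 here).
        values = [v if is_number(v) else replace_with for v in values]
    elif mode != "strict":
        raise ValueError(f"unknown mode: {mode}")
    if not values:
        return math.nan
    last = values[-1]
    # In strict mode a non-numeric last value is passed through as NaN.
    return math.nan if last is None else last


# Made-up series: 75% errors, then 10% errors, then a window with no requests.
series = [error_rate_percent(3, 4), error_rate_percent(10, 100), error_rate_percent(0, 0)]

strict_value = reduce_last(series, mode="strict")
replaced_value = reduce_last(series, mode="replace", replace_with=0.0)
print(math.isnan(strict_value))  # True
print(replaced_value > 5)        # False: the replaced value 0.0 is not above the threshold
```

With this model, an idle window reduces to a plain number instead of NaN, so the "is above 5" threshold always compares against a numeric value.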

Last modified: 17 February 2025