Alert Runbook: Monitoring
| Revision | Date | Description |
|---|---|---|
| | 24.07.2024 | Init Changelog |
Introduction
This article lists all alerts for the monitoring stack.
Thanos
Sidecar
AlertName: \[ Thanos \]\[ Sidecar \] Bucket Operations Failed
Description: Thanos Sidecar bucket operations are failing.
Severity: Critical
Grafana:
Description: Node {{ $labels.node }} has been unready for a long time.
Summary: [ Kubernetes ] [ Node ] [ {{ $labels.kube_cluster }} ] Node {{ $labels.node }} is not ready
Metric: sum by (kube_cluster, node) (kube_node_status_condition{condition="Ready", status="true", job="kube-state-metrics"})
Reduce:
Input: Metric
Function: Last
Mode: Strict
Threshold: is below 1
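The Reduce and Threshold settings above can be read as a two-step evaluation: collapse the series to its last sample, then compare that number to the threshold. The following is a minimal sketch of that reading, not Grafana's implementation. The function names and the sample values are hypothetical, and treating NaN as "does not fire" is an assumption made for the sketch.

```python
import math
from typing import Sequence


def reduce_last(values: Sequence[float]) -> float:
    """Return the last value of a series, or NaN for an empty series.

    Simplified model of Reduce with Function=Last and Mode=Strict:
    the value is passed through unchanged, so a trailing NaN stays NaN.
    """
    return values[-1] if values else math.nan


def is_below(value: float, threshold: float) -> bool:
    """Threshold 'is below N'. NaN never satisfies it (assumption of this sketch)."""
    return (not math.isnan(value)) and value < threshold


# Hypothetical samples of the metric for one node over the evaluation window.
healthy_node = [1.0, 1.0, 1.0]
unready_node = [1.0, 0.0, 0.0]

print(is_below(reduce_last(healthy_node), 1))  # False: alert condition not met
print(is_below(reduce_last(unready_node), 1))  # True: alert condition met
```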
AlertName: \[ Thanos \]\[ Sidecar \] No connection to started Prometheus
Description: Thanos Sidecar is unhealthy.
Severity: Critical
Grafana:
Description: Thanos Sidecar {{ $labels.kube_cluster }} is unhealthy.
Summary: [ Thanos ][ Sidecar ][ {{ $labels.kube_cluster }} ] No connection to started Prometheus
Metric: thanos_sidecar_prometheus_up{job="thanos-sidecar"} == 0 AND on (kube_cluster) prometheus_tsdb_data_replay_duration_seconds{job="prometheus"} != 0
Reduce:
Input: Metric
Function: Last
Mode: Strict
Threshold: is above 0
Runbook: \[ Thanos \]\[ Sidecar \] No connection to started Prometheus
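The metric for this alert combines two filters with `AND on (kube_cluster)`. The sketch below illustrates how that set operation selects clusters, assuming one series per cluster on each side. The `Samples` type, the function name, and the cluster names are hypothetical and only serve the illustration.

```python
from typing import Dict

# Hypothetical instant-vector samples keyed by the kube_cluster label value.
Samples = Dict[str, float]


def sidecar_disconnected(sidecar_up: Samples, replay_seconds: Samples) -> Samples:
    """Model `A == 0 and on (kube_cluster) B != 0` for one series per cluster."""
    # Left side: sidecars that report no connection to Prometheus.
    left = {c: v for c, v in sidecar_up.items() if v == 0}
    # Right side: clusters whose Prometheus reports a non-zero replay duration.
    right = {c for c, v in replay_seconds.items() if v != 0}
    # `and` keeps left-hand samples (with their left-hand values) that match on the right.
    return {c: v for c, v in left.items() if c in right}


sidecar_up = {"cluster-a": 0.0, "cluster-b": 1.0, "cluster-c": 0.0}
replay_seconds = {"cluster-a": 12.5, "cluster-b": 9.0, "cluster-c": 0.0}

print(sidecar_disconnected(sidecar_up, replay_seconds))  # {'cluster-a': 0.0}
```

In PromQL the samples that survive `and` keep their left-hand value, which here is always 0 because of the `== 0` filter.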
Query Frontend
AlertName: \[ Thanos \]\[ Query Frontend \] Replica missing
Description: Thanos Query Frontend has a missing replica.
Severity: Critical
Grafana:
Description: Thanos Query Frontend has missing replica.
Summary: [ Thanos ][ Query Frontend ] Replica missing
Metric: sum(kube_deployment_status_replicas_ready{deployment=~"thanos-query-frontend"}) / sum(kube_deployment_spec_replicas{deployment="thanos-query-frontend"})
Reduce:
Input: Metric
Function: Last
Mode: Strict
Threshold: is below 1
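The metric above is the ratio of ready replicas to desired replicas, so any value below 1 means at least one replica is not ready. A small sketch of that arithmetic follows. The function name and the replica counts are hypothetical, and the zero-divisor branch mirrors ordinary floating-point division as PromQL applies it.

```python
import math


def ready_ratio(ready_replicas: float, desired_replicas: float) -> float:
    """Ratio of ready to desired replicas; below 1 means a replica is missing."""
    if desired_replicas == 0:
        # Floating-point division as in PromQL: 0/0 is NaN, x/0 is +Inf.
        return math.nan if ready_replicas == 0 else math.inf
    return ready_replicas / desired_replicas


print(ready_ratio(3, 3) < 1)  # False: all replicas are ready
print(ready_ratio(2, 3) < 1)  # True: one replica is missing
```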
Query
AlertName: \[ Thanos \]\[ Query \] Replica missing
Description: Thanos Query has a missing replica.
Severity: Critical
Grafana:
Description: Thanos Query has missing replica. Check on
Summary: [ Thanos ][ Query ] Replica missing
Metric: sum(kube_deployment_status_replicas_ready{deployment=~"thanos-query"}) / sum(kube_deployment_spec_replicas{deployment="thanos-query"})
Reduce:
Input: Metric
Function: Last
Mode: Strict
Threshold: is below 1
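The numerator of this metric uses the regex matcher `=~"thanos-query"`. PromQL regex matchers are fully anchored, so this pattern selects only the `thanos-query` deployment and not `thanos-query-frontend`. The sketch below mimics that behaviour with Python's `re.fullmatch`. The helper name is hypothetical, and Python's regex dialect differs from the RE2 syntax Prometheus uses, although not for these literal patterns.

```python
import re


def matches_label(pattern: str, value: str) -> bool:
    """Mimic a PromQL `=~` matcher, which is anchored at both ends."""
    return re.fullmatch(pattern, value) is not None


print(matches_label("thanos-query", "thanos-query"))             # True
print(matches_label("thanos-query", "thanos-query-frontend"))    # False
print(matches_label("thanos-query.*", "thanos-query-frontend"))  # True
```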
AlertName: \[ Thanos \]\[ Query \] Http Request Query Error Rate High
Description: Thanos Query is failing to handle more than 5% of query requests.
Severity: Critical
Grafana:
Description: Thanos Query is failing to handle {{ $value }}% of "query" requests.
Summary: [ Thanos ][ Query ] Http Request Query Error Rate High
Metric: (sum by (kubernetes_name) (rate(http_requests_total{code=~"5..", kubernetes_name="thanos-query", handler="query"}[5m])) / sum by (kubernetes_name) (rate(http_requests_total{kubernetes_name="thanos-query", handler="query"}[5m]))) * 100
Reduce:
Input: Metric
Function: Last
Mode: Replace Non-numeric Values
Threshold: is above 5
Runbook: \[ Thanos \]\[ Query \] Http Request Query Error Rate High
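The error-rate metric for this alert divides the per-second rate of 5xx responses by the per-second rate of all responses and scales the result to a percentage. The sketch below walks through that arithmetic and the effect of the "Replace Non-numeric Values" mode. The function names and request rates are hypothetical, and the replacement value of 0 is an assumption, since the configured value is not given here.

```python
import math


def error_rate_percent(errors_per_sec: float, total_per_sec: float) -> float:
    """Share of failing requests in percent; NaN when there is no traffic (0/0)."""
    if total_per_sec == 0:
        return math.nan
    return errors_per_sec / total_per_sec * 100


def replace_non_numeric(value: float, replacement: float = 0.0) -> float:
    """Replace NaN and +/-Inf before thresholding (replacement of 0 is assumed)."""
    return value if math.isfinite(value) else replacement


def is_above(value: float, threshold: float) -> bool:
    """Threshold 'is above N'."""
    return value > threshold


# Hypothetical request rates over the 5m window.
print(is_above(replace_non_numeric(error_rate_percent(0.5, 8.0)), 5))   # True: 6.25% > 5%
print(is_above(replace_non_numeric(error_rate_percent(0.25, 8.0)), 5))  # False: 3.125% <= 5%
print(is_above(replace_non_numeric(error_rate_percent(0.0, 0.0)), 5))   # False: NaN replaced by 0
```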