Prometheus false alert: kube-scheduler TargetDown

Problem Overview

In Rancher Monitoring, a kube-scheduler TargetDown alert fires, and the corresponding up metric is 0:

up{endpoint="http-metrics", instance="10.x.x.61:10259", job="kube-scheduler", namespace="kube-system", pod="kube-scheduler-node-1", service="rancher-monitoring-kube-scheduler"} | 0
up{endpoint="http-metrics", instance="10.x.x.61:10259", job="kube-scheduler", namespace="kube-system", pod="kube-scheduler-node-2", service="rancher-monitoring-kube-scheduler"} | 0
up{endpoint="http-metrics", instance="10.x.x.11:10259", job="kube-scheduler", namespace="kube-system", pod="kube-scheduler-node-3", service="rancher-monitoring-kube-scheduler"} | 0

The environment is as follows:

Env | Version
Rancher | 2.9.2
monitoring | rancher-monitoring-104.1.1_up57.0.3

Troubleshooting

  1. First, check the kube-scheduler Pod status:
kubectl get pods -n kube-system -l component=kube-scheduler
  2. Check the kube-scheduler scrape path that Rancher Monitoring normally uses:
up{component="kube-scheduler"}

The result is shown below. The kube-scheduler metrics collected through pushprox-kube-scheduler-client are healthy.

up{component="kube-scheduler", endpoint="metrics", instance="10.x.x.61:10259", job="kube-scheduler", namespace="cattle-monitoring-system", pod="pushprox-kube-scheduler-client-8qk86", service="pushprox-kube-scheduler-client"} | 1
up{component="kube-scheduler", endpoint="metrics", instance="10.x.x.11:10259", job="kube-scheduler", namespace="cattle-monitoring-system", pod="pushprox-kube-scheduler-client-xgjl6", service="pushprox-kube-scheduler-client"} | 1
up{component="kube-scheduler", endpoint="metrics", instance="10.x.x.61:10259", job="kube-scheduler", namespace="cattle-monitoring-system", pod="pushprox-kube-scheduler-client-ghk6d", service="pushprox-kube-scheduler-client"} | 1
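  3. The service label in the alert is rancher-monitoring-kube-scheduler, while the service label in the healthy scrape path is pushprox-kube-scheduler-client. To find where the failing scrape target comes from, export all ServiceMonitors: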
kubectl get servicemonitors.monitoring.coreos.com -A -oyaml > all-servicemonitors.yaml
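
The export can be large, so it helps to search it for the service name taken from the alert. The following is a minimal sketch, assuming the export file all-servicemonitors.yaml produced by the command above. The helper name find_in_export is hypothetical and only wraps grep:

```shell
# Hypothetical helper: print the lines (with line numbers) of an exported
# YAML file that mention a given name, such as the alert's service label.
find_in_export() {
  file="$1"
  name="$2"
  # -n prints line numbers, -F matches the name as a fixed string,
  # and -- ends option parsing so the name is never read as a flag.
  grep -nF -- "$name" "$file"
}

# Example call, commented out so that sourcing this block has no side effects:
# find_in_export all-servicemonitors.yaml rancher-monitoring-kube-scheduler
```

The reported line numbers point at the matching metadata.name and label entries, which makes it easier to locate the enclosing ServiceMonitor in the file.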

The search turns up the following ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: rancher-monitoring
    meta.helm.sh/release-namespace: cattle-monitoring-system
  creationTimestamp: "2025-08-18T09:51:39Z"
  labels:
    app: rancher-monitoring-kube-scheduler
    app.kubernetes.io/instance: rancher-monitoring
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: rancher-monitoring
    app.kubernetes.io/version: 104.1.1_up57.0.3
    chart: rancher-monitoring-104.1.1_up57.0.3
    release: rancher-monitoring
  name: rancher-monitoring-kube-scheduler
  namespace: kube-system
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    port: http-metrics
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecureSkipVerify: true
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app: rancher-monitoring-kube-scheduler
      release: rancher-monitoring

This ServiceMonitor scrapes port 10259 of kube-scheduler. In recent RKE2 versions, Rancher Monitoring collects kube-scheduler metrics through pushprox-kube-scheduler-client by default, so this ServiceMonitor should no longer be enabled.

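Root Cause

On recent RKE2 versions, Rancher Monitoring uses .rke2Scheduler.enabled=true to collect kube-scheduler metrics through pushprox-kube-scheduler-client.

If .kubeScheduler.enabled=true is set at the same time, the chart also creates the legacy rancher-monitoring-kube-scheduler ServiceMonitor. That ServiceMonitor connects directly to port 10259 of kube-scheduler in kube-system, which is not reachable by default, so Prometheus reports TargetDown.

The kube-scheduler Pods and the pushprox-kube-scheduler-client scrape path are in fact both healthy, so this is a false alert caused by an incorrect scrape target.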

Solution

In the Rancher Monitoring values, disable the legacy kubeScheduler setting and keep the rke2Scheduler setting used by RKE2:

rke2Scheduler:
  enabled: true

kubeScheduler:
  enabled: false

If the ServiceMonitor is not removed automatically after the change, delete it manually:

kubectl delete servicemonitors.monitoring.coreos.com -n kube-system rancher-monitoring-kube-scheduler
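
To confirm that the legacy ServiceMonitor is really gone, one option is to query it again and look at the exit status. This is a minimal sketch: the function name servicemonitor_removed and the KUBECTL override variable are assumptions introduced here for illustration, not part of Rancher or kubectl.

```shell
# Hypothetical check: succeeds when the legacy ServiceMonitor can no longer be fetched.
# KUBECTL is an assumed override for the client binary; it defaults to kubectl.
servicemonitor_removed() {
  if "${KUBECTL:-kubectl}" get servicemonitors.monitoring.coreos.com \
       -n kube-system rancher-monitoring-kube-scheduler >/dev/null 2>&1; then
    echo "ServiceMonitor still present"
    return 1
  fi
  # kubectl get also exits non-zero on connection or permission errors,
  # so "not found" here only means the object could not be fetched.
  echo "ServiceMonitor not found"
}
```

Calling servicemonitor_removed after the values change or the manual delete should print "ServiceMonitor not found". If it still reports the object as present, the chart is probably still rendering it, and the kubeScheduler.enabled value is worth rechecking.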