K8s环境下普罗米修斯的规则配置

2020年10月09日留言

想要将 Watchdog 告警去掉(Taskon服务的，每4h抛一次，用于整体探，告诉我这个服务还在)

[FIRING:1] Watchdog (monitoring/prometheus-kube-prometheus-prometheus none)
This is an alert meant to ensure that the entire alerting pipeline is functional.
This alert is always firing, therefore it should always be firing in Alertmanager
and always fire against a receiver. There are integrations with various notification
mechanisms that send a notification when this alert is not firing. For example the
"DeadMansSnitch" integration in PagerDuty.
 https://runbooks.prometheus-operator.dev/runbooks/general/watchdog An alert that should always be firing to certify that Alertmanager is working properly.
Alerts Firing:
Labels:
 - alertname = Watchdog
 - prometheus = monitoring/prometheus-kube-prometheus-prometheus
 - severity = none
Annotations:
 - description = This is an alert meant to ensure that the entire alerting pipeline is functional.
This alert is always firing, therefore it should always be firing in Alertmanager
and always fire against a receiver. There are integrations with various notification
mechanisms that send a notification when this alert is not firing. For example the
"DeadMansSnitch" integration in PagerDuty.

 - runbook_url = https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
 - summary = An alert that should always be firing to certify that Alertmanager is working properly.
Source: http://prometheus-kube-prometheus-prometheus.monitoring:9090/graph?g0.expr=vector%281%29&g0.tab=1


AlertmanagerUrl:
http://prometheus-kube-prometheus-alertmanager.monitoring:9093/#/alerts?receiver=wechat

现在想去掉这个告警。和二进制部署的prometheus不同，K8s环境下有点麻烦

k -n monitoring get PrometheusRule

NAME                                                              AGE
mysql-connection-rules                                            122d
prometheus-kube-prometheus-alertmanager.rules                     122d
prometheus-kube-prometheus-config-reloaders                       122d
prometheus-kube-prometheus-etcd                                   122d
prometheus-kube-prometheus-general.rules                          122d
prometheus-kube-prometheus-k8s.rules                              122d
prometheus-kube-prometheus-kube-apiserver-availability.rules      122d
prometheus-kube-prometheus-kube-apiserver-burnrate.rules          122d
prometheus-kube-prometheus-kube-apiserver-histogram.rules         122d
prometheus-kube-prometheus-kube-apiserver-slos                    122d
prometheus-kube-prometheus-kube-prometheus-general.rules          122d
prometheus-kube-prometheus-kube-prometheus-node-recording.rules   122d
prometheus-kube-prometheus-kube-state-metrics                     122d
prometheus-kube-prometheus-kubelet.rules                          122d
prometheus-kube-prometheus-kubernetes-apps                        122d
prometheus-kube-prometheus-kubernetes-resources                   122d
prometheus-kube-prometheus-kubernetes-storage                     122d
prometheus-kube-prometheus-kubernetes-system                      122d
prometheus-kube-prometheus-kubernetes-system-apiserver            122d
prometheus-kube-prometheus-kubernetes-system-kubelet              122d
prometheus-kube-prometheus-node-exporter                          122d
prometheus-kube-prometheus-node-exporter.rules                    122d
prometheus-kube-prometheus-node-network                           122d
prometheus-kube-prometheus-node.rules                             122d
prometheus-kube-prometheus-prometheus                             122d
prometheus-kube-prometheus-prometheus-operator                    122d
redis-rules                                                       39d
twitter-validator-rule                                            122d
ubuntu@VM-0-22-ubuntu:~$

命令 “k -n monitoring get PrometheusRule” 是一个使用 Kubernetes 命令行工具（kubectl）查询 Kubernetes 集群中的监控规则（PrometheusRule）资源的命令。

解析该命令的具体含义如下：

“k”：是 “kubectl” 命令的简写形式，用于与 Kubernetes 集群进行交互。
“-n monitoring”：是一个命令行参数，指定要在名为 “monitoring” 的命名空间中执行操作。命名空间是 Kubernetes 中用于逻辑隔离和组织资源的一种机制。
“get”：是一个 kubectl 命令，用于获取指定资源的信息。
“PrometheusRule”：是要获取的资源类型，它表示 Prometheus 监控规则。PrometheusRule 是用于配置和定义 Prometheus 监控系统中的告警规则的 Kubernetes 资源类型。

因此，该命令的目的是在名为 “monitoring” 的命名空间中获取所有的 PrometheusRule 资源的信息。执行该命令后，kubectl 将返回 PrometheusRule 资源的详细信息，包括其名称、标签、配置规则等。这些信息可以用于监控和管理 Prometheus 监控系统中的告警规则。

如何查看watchdog这个规则配置，在哪个里面？

要查看名为 “watchdog” 的规则配置在哪个文件或位置中，需要进一步了解你使用的具体监控系统或工具。”watchdog” 可能是一个自定义的规则名称，它的具体配置位置可能因不同的监控系统而异。

通常，Prometheus 监控系统中的规则配置通常存储在 Prometheus 的配置文件中或者通过 Kubernetes 的 PrometheusRule 资源来定义。以下是两种常见的情况：

Prometheus 配置文件：在 Prometheus 的配置文件（通常为 prometheus.yml）中，可以指定规则文件的位置。在该文件中，可以通过 rule_files 配置项指定一个或多个规则文件的路径。可以在配置文件中搜索 rule_files 配置项，查看是否有指定的规则文件路径，然后定位到该文件并查找名为 “watchdog” 的规则配置。

Kubernetes 的 PrometheusRule 资源：如果你在 Kubernetes 集群中使用 Prometheus Operator 或类似的工具来管理 Prometheus 监控系统，那么规则配置通常会以 PrometheusRule 资源的形式存在于 Kubernetes 中。你可以使用以下命令来获取名为 “watchdog” 的 PrometheusRule 资源的详细信息：

kubectl -n <namespace> get PrometheusRule watchdog -o yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80

   其中，`<namespace>` 是 PrometheusRule 所在的命名空间。该命令将返回名为 "watchdog" 的 PrometheusRule 资源的 YAML 配置。你可以查看该配置中的规则定义部分，以了解 "watchdog" 规则的具体配置。

请注意，以上是一些常见的情况，实际的配置位置可能因具体的监控系统和部署方式而有所不同。如果以上方法无法找到 "watchdog" 规则的配置，请提供更多关于你使用的监控系统或工具的信息，以便我能够提供更准确的帮助。



<br>


可以先把全部PrometheusRule的导到一个文件里，

`k -n monitoring get PrometheusRule -o yaml > all.yaml`


然后

` cat all.yaml | grep  -C 20 Watch `

发现是

*name: prometheus-kube-prometheus-general.rules*



`k -n monitoring get PrometheusRule  prometheus-kube-prometheus-general.rules -o yaml`

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus
    meta.helm.sh/release-namespace: monitoring
    prometheus-operator-validated: "true"
  creationTimestamp: "2023-06-09T06:49:01Z"
  generation: 2
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 46.8.0
    chart: kube-prometheus-stack-46.8.0
    heritage: Helm
    release: prometheus
  name: prometheus-kube-prometheus-general.rules
  namespace: monitoring
  resourceVersion: "9619465346"
  uid: 5802912f-2787-4882-ab80-c8c70345e040
spec:
  groups:
  - name: general.rules
    rules:
    - alert: TargetDown
      annotations:
        description: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service
          }} targets in {{ $labels.namespace }} namespace are down.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/targetdown
        summary: One or more targets are unreachable.
      expr: 100 * (count(up == 0) BY (cluster, job, namespace, service) / count(up)
        BY (cluster, job, namespace, service)) > 10
      for: 10m
      labels:
        cluster: newtaskon
        severity: warning
    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
        summary: An alert that should always be firing to certify that Alertmanager
          is working properly.
      expr: vector(1)
      labels:
        severity: none

上面这段配置的意思是:

在上面的规则中，您提到了一个名为”Watchdog”的警报。这个警报旨在确保整个报警流程正常运行。该警报始终处于触发状态，因此它应该始终在Alertmanager中触发，并针对一个接收器触发。与各种通知机制集成，当此警报未触发时发送通知。例如，PagerDuty中的”DeadMansSnitch”集成。该警报表达式为vector(1)，表示始终返回一个值。它的标签中没有设置严重性（severity）标签，因此可以认为它的严重性为”none”，即不具有严重性。

请注意，这些信息是基于您提供的Prometheus规则的配置。具体的标签和注释可能会因您的环境和配置而有所不同。

vector(1) 在普罗米修斯中什么意思？

在Prometheus中，vector(1) 表示一个包含单个样本的向量。向量是Prometheus中的一种数据类型，它表示在给定时间点上的样本集合。

在Prometheus中，监控指标是通过时间序列（time series）来表示的。时间序列是一系列具有相同指标名称和一组标签（labels）的样本，每个样本由一个时间戳和对应的数值组成。

vector(1) 表示只包含一个样本的向量，该样本的值为1。这意味着在给定时间点上，只有一个样本的值为1。这个向量可以在Prometheus的查询语言中使用，进行各种计算和聚合操作。

例如，可以将 vector(1) 与其他时间序列进行加法、减法、乘法或除法运算，以生成新的时间序列。它还可以与其他向量函数（例如 sum()、avg() 等）一起使用，进行更复杂的聚合操作。

需要注意的是，vector(1) 只是一个示例，实际使用中，向量可以包含多个样本，并具有不同的数值和时间戳，以表示更多的监控指标数据。

当前问题在Resolved后，没有发送Resolved的消息。

对于二进制部署的方式，是在alertmanager的alertmanager.yml中，相应的告警下方添加

send_resolved: true

对于k8s方式部署的监控告警，如何设置呢？

在 Kubernetes 中使用 Prometheus Operator 或类似的工具部署的监控告警，可以通过 PrometheusRule 资源来设置 send_resolved 选项。send_resolved 用于指定在问题解决后是否发送 “Resolved” 的消息。

要设置 send_resolved 选项，可以在 PrometheusRule 的规则配置中添加一个附加的注释标签（annotation）。这个注释标签的键为 prometheus.io/alertmanager，值为一个包含 send_resolved: true 的 YAML 字符串。这样，当问题被解决后，Alertmanager 将会发送 “Resolved” 的消息。

以下是在 PrometheusRule 中设置 send_resolved 的示例：

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rule
  namespace: monitoring
spec:
  groups:
  - name: example-group
    rules:
    - alert: ExampleAlert
      annotations:
        prometheus.io/alertmanager: |
          {
            "send_resolved": "true"
          }
      expr: <alert_expression>
      for: 5m

在上述示例中，annotations 部分包含了用于告警的注释标签。prometheus.io/alertmanager 标签的值是一个包含 send_resolved: true 的 JSON 字符串。这将告诉 Alertmanager 在问题解决后发送 “Resolved” 的消息。

请注意，具体的配置方式可能因使用的监控工具或配置方式而有所不同。上述示例是一种常见的设置方法，但在实际部署中可能需要根据具体情况进行调整。确保根据你的监控系统和部署方式参考相关文档进行正确的配置。

20250602

[FIRING:1] KubeletPlegDurationHigh (172.22.0.22:10250 172.22.0.22 monitoring/prometheus-kube-prometheus-prometheus 0.99 warning)
The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of 10 seconds on node 172.22.0.22. https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh Kubelet Pod Lifecycle Event Generator is taking too long to relist.
Alerts Firing:
Labels:

alertname = KubeletPlegDurationHigh
instance = 172.22.0.22:10250
node = 172.22.0.22
prometheus = monitoring/prometheus-kube-prometheus-prometheus
quantile = 0.99
severity = warning
Annotations:
description = The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of 10 seconds on node 172.22.0.22.
runbook_url = https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh
summary = Kubelet Pod Lifecycle Event Generator is taking too long to relist.
Source: http://prometheus-kube-prometheus-prometheus.monitoring:9090/graph?g0.expr=node_quantile%3Akubelet_pleg_relist_duration_seconds%3Ahistogram_quantile%7Bquantile%3D%220.99%22%7D+%3E%3D+10&g0.tab=1

AlertmanagerUrl:
http://prometheus-kube-prometheus-alertmanager.monitoring:9093/#/alerts?receiver=wechat

这个 KubeletPlegDurationHigh 告警表示 Kubernetes 节点上的 kubelet 组件出现了性能问题。让我为你详细解释：

问题分析

PLEG (Pod Lifecycle Event Generator) 是 kubelet 的一个重要组件，负责：

定期检查容器运行时状态
生成 Pod 生命周期事件
维护 Pod 状态的准确性

当前问题：

节点 172.22.0.22 上的 PLEG 重新列举（relist）操作的 99% 分位数延迟达到了 10 秒
这远超过正常范围（通常应该在几百毫秒内）

可能原因

容器运行时响应慢
- Docker/containerd 性能问题
- 存储 I/O 瓶颈
节点资源不足
- CPU 使用率过高
- 内存压力
- 磁盘 I/O 饱和
Pod 数量过多
- 节点上运行的 Pod 数量超出合理范围
- 容器数量过多
网络问题
- 容器网络配置问题
- CNI 插件性能问题

排查步骤

检查节点资源使用情况

1 2	kubectl top node 172.22.0.22 kubectl describe node 172.22.0.22

检查该节点上的 Pod 数量

1	kubectl get pods --all-namespaces --field-selector spec.nodeName=172.22.0.22

检查 kubelet 日志

1	journalctl -u kubelet -f

检查容器运行时状态

1	systemctl status docker # 或 containerd

解决建议

优化资源配置
- 增加节点 CPU/内存资源
- 优化磁盘 I/O 性能
调整 kubelet 参数
- 增加 --pod-max-pids 限制
- 调整 --serialize-image-pulls=false
减少节点负载
- 迁移部分 Pod 到其他节点
- 清理不必要的容器和镜像
监控和调优
- 持续监控 PLEG 延迟
- 定期清理容器运行时缓存

这个问题如果不及时处理，可能导致 kubelet 无法及时响应 API Server 请求，影响 Pod 调度和状态更新。

ubuntu@VM-0-22-ubuntu:$ kubectl top node 172.22.0.22
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
172.22.0.22 242m 6% 1669Mi 30%
ubuntu@VM-0-22-ubuntu:$ kubectl describe node 172.22.0.22
Name: 172.22.0.22
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=SA2.LARGE8
beta.kubernetes.io/os=linux
cloud.tencent.com/node-instance-id=ins-nd41mwy2
failure-domain.beta.kubernetes.io/region=sg
failure-domain.beta.kubernetes.io/zone=900003
kubernetes.io/arch=amd64
kubernetes.io/hostname=172.22.0.22
kubernetes.io/os=linux
node.kubernetes.io/instance-type=SA2.LARGE8
topology.com.tencent.cloud.csi.cbs/zone=ap-singapore-3
topology.kubernetes.io/region=sg
topology.kubernetes.io/zone=900003
Annotations: csi.volume.kubernetes.io/nodeid: {“com.tencent.cloud.csi.cbs”:”ins-nd41mwy2”}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 05 Jun 2023 09:50:38 +0800
Taints:
Unschedulable: false
Lease:
HolderIdentity: 172.22.0.22
AcquireTime:
RenewTime: Mon, 02 Jun 2025 10:14:21 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message

NetworkUnavailable False Mon, 05 Jun 2023 09:50:42 +0800 Mon, 05 Jun 2023 09:50:42 +0800 RouteCreated RouteController created a route
MemoryPressure False Mon, 02 Jun 2025 10:10:58 +0800 Wed, 16 Oct 2024 20:14:34 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 02 Jun 2025 10:10:58 +0800 Fri, 07 Jun 2024 03:11:17 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 02 Jun 2025 10:10:58 +0800 Mon, 06 May 2024 10:05:19 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 02 Jun 2025 10:10:58 +0800 Mon, 02 Jun 2025 10:05:57 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 172.22.0.22
ExternalIP: 43.153.222.192
Hostname: 172.22.0.22
Capacity:
cpu: 4
ephemeral-storage: 103097996Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7623360Ki
pods: 253
Allocatable:
cpu: 3900m
ephemeral-storage: 95015112957
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 5633728Ki
pods: 253
System Info:
Machine ID: 19b5978403fd40eb888d2ffa4c968153
System UUID: 19b59784-03fd-40eb-888d-2ffa4c968153
Boot ID: 5c3e0572-36e9-48c3-93b8-8a94fa2c20f5
Kernel Version: 5.4.0-139-generic
OS Image: Ubuntu 20.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.9-tke.1
Kubelet Version: v1.22.5-tke.16
Kube-Proxy Version: v1.22.5-tke.16
PodCIDR: 10.16.3.0/24
PodCIDRs: 10.16.3.0/24
ProviderID: qcloud:///900003/ins-nd41mwy2
Non-terminated Pods: (19 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age

ingress-nginx ingress-nginx-controller-8c675bd6c-vc787 100m (2%) 0 (0%) 90Mi (1%) 0 (0%) 243d
kube-system cls-provisioner-75df9b4649-hgfmb 100m (2%) 100m (2%) 64M (1%) 64M (1%) 353d
kube-system csi-cbs-node-5dssr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 728d
kube-system ip-masq-agent-tpgqg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 728d
kube-system kube-proxy-8c48l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 728d
kube-system l7-lb-controller-7bffbb7f6f-5htzl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 420d
kube-system tke-bridge-agent-wbk72 0 (0%) 0 (0%) 0 (0%) 0 (0%) 728d
kube-system tke-cni-agent-b9mlz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 728d
kube-system tke-log-agent-gzbnd 210m (5%) 2100m (53%) 124M (2%) 1360M (23%) 723d
kube-system tke-monitor-agent-x5cfl 10m (0%) 100m (2%) 30Mi (0%) 100Mi (1%) 647d
monitoring prometheus-prometheus-node-exporter-42qg2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 723d
taskon earn-taskon-xyz-68cfbc454-dbxj2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 180d
taskon hello-app-rongyi-5669588578-rswv2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 243d
taskon taskon-chain-5fbf895fcc-pwn98 0 (0%) 0 (0%) 0 (0%) 0 (0%) 24d
taskon taskon-feedback-7c644996d9-bn4tp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d17h
taskon taskon-operation-bot-5fcb96785b-j9879 0 (0%) 0 (0%) 0 (0%) 0 (0%) 24d
taskon taskon-scanner-76c567f59c-bgzzt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17d
taskon www-taskon-xyz-7f64cb9f49-929c8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d16h
taskon www-taskon-xyz-rongyi-test-6c6f7594b9-tgwmf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d16h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits

cpu 420m (10%) 2300m (58%)
memory 313829120 (5%) 1528857600 (26%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message

Warning SystemOOM 12m kubelet System OOM encountered, victim process: uvicorn, pid: 588292
Warning SystemOOM 11m kubelet System OOM encountered, victim process: python, pid: 588293
Normal NodeNotReady 11m (x6 over 77d) kubelet Node 172.22.0.22 status is now: NodeNotReady
Warning SystemOOM 10m kubelet System OOM encountered, victim process: taskon-scanner, pid: 588233
Warning SystemOOM 8m29s kubelet System OOM encountered, victim process: qcloud_ingress, pid: 1331737
Normal NodeReady 8m26s (x7 over 392d) kubelet Node 172.22.0.22 status is now: NodeReady
ubuntu@VM-0-22-ubuntu:~$

ubuntu@VM-0-22-ubuntu:$ kubectl get pods –all-namespaces –field-selector spec.nodeName=172.22.0.22
NAMESPACE NAME READY STATUS RESTARTS AGE
ingress-nginx ingress-nginx-controller-8c675bd6c-vc787 1/1 Running 10 (10m ago) 243d
kube-system cls-provisioner-75df9b4649-hgfmb 1/1 Running 0 353d
kube-system csi-cbs-node-5dssr 2/2 Running 5 (392d ago) 728d
kube-system ip-masq-agent-tpgqg 1/1 Running 2 (392d ago) 728d
kube-system kube-proxy-8c48l 1/1 Running 2 (392d ago) 728d
kube-system l7-lb-controller-7bffbb7f6f-5htzl 1/1 Running 7 (10m ago) 420d
kube-system tke-bridge-agent-wbk72 1/1 Running 4 (392d ago) 728d
kube-system tke-cni-agent-b9mlz 1/1 Running 2 (392d ago) 728d
kube-system tke-log-agent-gzbnd 3/3 Running 17 (9m55s ago) 723d
kube-system tke-monitor-agent-x5cfl 1/1 Running 2 (392d ago) 647d
monitoring prometheus-prometheus-node-exporter-42qg2 1/1 Running 4 (392d ago) 723d
taskon earn-taskon-xyz-68cfbc454-dbxj2 1/1 Running 0 180d
taskon hello-app-rongyi-5669588578-rswv2 1/1 Running 0 243d
taskon taskon-chain-5fbf895fcc-pwn98 1/1 Running 2 (16d ago) 24d
taskon taskon-feedback-7c644996d9-bn4tp 1/1 Running 0 2d17h
taskon taskon-operation-bot-5fcb96785b-j9879 1/1 Running 4 (13m ago) 24d
taskon taskon-scanner-76c567f59c-bgzzt 1/1 Running 2 (12m ago) 17d
taskon www-taskon-xyz-7f64cb9f49-929c8 1/1 Running 0 2d16h
taskon www-taskon-xyz-rongyi-test-6c6f7594b9-tgwmf 1/1 Running 0 2d16h
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:~$ journalctl -u kubelet -f
– Logs begin at Sun 2025-03-30 08:19:41 CST. –
Jun 02 10:17:28 VM-0-22-ubuntu kubelet[935]: E0602 10:17:28.086371 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:30 VM-0-22-ubuntu kubelet[935]: E0602 10:17:30.087557 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:32 VM-0-22-ubuntu kubelet[935]: E0602 10:17:32.080013 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:34 VM-0-22-ubuntu kubelet[935]: E0602 10:17:34.086785 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:36 VM-0-22-ubuntu kubelet[935]: E0602 10:17:36.087171 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:38 VM-0-22-ubuntu kubelet[935]: E0602 10:17:38.080112 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:40 VM-0-22-ubuntu kubelet[935]: E0602 10:17:40.103049 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:42 VM-0-22-ubuntu kubelet[935]: E0602 10:17:42.087084 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:44 VM-0-22-ubuntu kubelet[935]: E0602 10:17:44.084214 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:46 VM-0-22-ubuntu kubelet[935]: E0602 10:17:46.087193 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:48 VM-0-22-ubuntu kubelet[935]: E0602 10:17:48.087289 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:50 VM-0-22-ubuntu kubelet[935]: E0602 10:17:50.080236 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:52 VM-0-22-ubuntu kubelet[935]: E0602 10:17:52.088166 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:54 VM-0-22-ubuntu kubelet[935]: E0602 10:17:54.086533 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:56 VM-0-22-ubuntu kubelet[935]: E0602 10:17:56.079277 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:17:58 VM-0-22-ubuntu kubelet[935]: E0602 10:17:58.103583 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:18:00 VM-0-22-ubuntu kubelet[935]: E0602 10:18:00.079025 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
Jun 02 10:18:02 VM-0-22-ubuntu kubelet[935]: E0602 10:18:02.085925 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5

Jun 02 10:18:04 VM-0-22-ubuntu kubelet[935]: E0602 10:18:04.094224 935 kubelet_volumes.go:245] “There were many similar errors. Turn up verbosity to see them.” err=”orphaned pod "5ae5c64c-a54e-42e5-9013-24ac93d4347a" found, but error not a directory occurred when trying to remove the volumes dir” numErrs=5
^C
ubuntu@VM-0-22-ubuntu:$
ubuntu@VM-0-22-ubuntu:$ systemctl status docker
Unit docker.service could not be found.

你说得对，从监控数据看 CPU 使用率确实不高（只有 6%），但是我发现了问题的根本原因：

关键问题：频繁的系统 OOM（内存溢出）

从 kubectl describe node 的 Events 部分可以看到：

Warning  SystemOOM     12m     kubelet  System OOM encountered, victim process: uvicorn, pid: 588292
Warning  SystemOOM     11m     kubelet  System OOM encountered, victim process: python, pid: 588293
Warning  SystemOOM     10m     kubelet  System OOM encountered, victim process: taskon-scanner, pid: 588233
Warning  SystemOOM     8m29s   kubelet  System OOM encountered, victim process: qcloud_ingress, pid: 1331737

这是导致 PLEG 延迟高的真正原因！

问题分析

系统内存压力：
- 节点总内存：7.6GB，可分配：5.6GB
- 当前内存使用：1669Mi (30%)
- 但频繁发生 OOM，说明实际内存使用接近极限
OOM 对 PLEG 的影响：
- 系统 OOM 时，内核会暂停所有进程进行内存回收
- kubelet 的 PLEG 操作被阻塞，导致 relist 延迟飙升
- 进程被杀死后重启，进一步增加系统负载
内存分配问题：
- 很多 Pod 没有设置内存 limits（显示为 0）
- 这可能导致某些容器消耗过多内存

解决方案

1. 立即处理

# 检查哪些进程占用内存最多
kubectl top pods --all-namespaces --sort-by=memory

# 查看系统内存使用详情
free -h
cat /proc/meminfo

2. 设置资源限制

为没有设置 memory limits 的 Pod 添加资源限制，特别是：

taskon 命名空间下的应用
earn-taskon-xyz、taskon-scanner 等

3. 监控和优化

# 检查具体的内存使用情况
kubectl describe pod <pod-name> -n <namespace>

# 查看容器资源使用历史
kubectl top pod --containers --all-namespaces

4. 长期解决

增加节点内存或添加新节点
为所有应用设置合理的资源请求和限制
考虑启用内存交换（如果可行）

总结：PLEG 延迟高不是因为 CPU 问题，而是因为频繁的系统 OOM 导致 kubelet 操作被阻塞。需要重点解决内存管理问题。

如何查看有没有触发kubectl cordon

可以通过以下几种方法查看节点是否被 kubectl cordon 标记为不可调度：

1. 查看节点调度状态

# 查看所有节点的调度状态
kubectl get nodes

# 查看特定节点详情
kubectl describe node 172.22.0.22

从你之前提供的输出中，我已经可以看到：

1	Unschedulable: false

这表明节点 没有被 cordon，仍然可以调度新的 Pod。

2. 检查节点 Taints

# 查看节点的 taints 信息
kubectl get node 172.22.0.22 -o yaml | grep -A 10 taints

# 或者使用 describe 查看
kubectl describe node 172.22.0.22 | grep -i taint

从你的输出中可以看到：

1	Taints: <none>

说明节点没有任何 taints。

3. 检查节点标签和注解

# 查看是否有相关的调度标签
kubectl get node 172.22.0.22 --show-labels

# 查看节点的完整 YAML 配置
kubectl get node 172.22.0.22 -o yaml

4. 查看 kubectl cordon 相关的状态指标

当节点被 cordon 时，你会看到：

Unschedulable: true
节点状态显示为 Ready,SchedulingDisabled
可能会有相关的 taint：node.kubernetes.io/unschedulable:NoSchedule

5. 实际验证方法

# 如果想测试节点是否能正常调度，可以创建一个测试 Pod
kubectl run test-pod --image=nginx --dry-run=client -o yaml | kubectl apply -f -

# 然后查看 Pod 是否能被调度到该节点
kubectl get pod test-pod -o wide

结论

根据你提供的节点信息：

Unschedulable: false - 节点可以调度
Taints: <none> - 没有阻止调度的 taints
节点状态为 Ready - 节点正常

该节点没有被 cordon，可以正常调度新的 Pod。

如果你怀疑调度有问题，可能需要检查：

是否有其他调度约束（如 nodeSelector、affinity 等）
资源是否充足（虽然从数据看资源使用不高）
是否有其他集群级别的调度策略

检查 label 调度错开啥意思

“检查 label 调度错开” 指的是使用 Kubernetes 的标签选择器和调度策略来避免某些 Pod 被调度到特定节点，或者将不同类型的工作负载分散到不同的节点上。

主要概念和用途

1. 节点标签 (Node Labels)

# 查看节点的所有标签
kubectl get node 172.22.0.22 --show-labels

# 从你的输出可以看到节点已有的标签：
# - kubernetes.io/hostname=172.22.0.22
# - node.kubernetes.io/instance-type=SA2.LARGE8
# - topology.kubernetes.io/zone=900003

2. 调度错开的几种方式

A. 使用 nodeSelector 指定调度

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    disktype: ssd  # 只调度到有此标签的节点

B. 使用 nodeAffinity 更灵活的调度

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn  # 避开特定节点
            values:
            - "172.22.0.22"

C. 使用 podAntiAffinity 分散 Pod

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - web
          topologyKey: kubernetes.io/hostname  # 同一个节点上避免多个相同应用

3. 实际检查方法

# 1. 查看节点标签
kubectl describe node 172.22.0.22 | grep Labels -A 20

# 2. 查看 Pod 的调度约束
kubectl get pod <pod-name> -o yaml | grep -A 20 affinity
kubectl get pod <pod-name> -o yaml | grep -A 5 nodeSelector

# 3. 查看 Pod 为什么没有调度到某个节点
kubectl describe pod <pending-pod-name>

# 4. 查看调度器事件
kubectl get events --sort-by=.metadata.creationTimestamp

4. 常见的调度错开场景

A. 环境隔离

# 给节点打标签区分环境
kubectl label node 172.22.0.22 env=production
kubectl label node other-node env=staging

# Pod 只调度到生产环境节点
nodeSelector:
  env: production

B. 工作负载类型隔离

1
2
3

# 区分计算密集型和 IO 密集型节点
kubectl label node 172.22.0.22 workload-type=compute-intensive
kubectl label node other-node workload-type=io-intensive

C. 高可用性分散

# 确保同一应用的多个副本分散在不同节点
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - my-app
        topologyKey: kubernetes.io/hostname

5. 检查当前集群的调度策略

# 查看集群中所有 Pod 的调度分布
kubectl get pods --all-namespaces -o wide | grep 172.22.0.22

# 查看特定应用的副本分布
kubectl get pods -l app=taskon-scanner -o wide --all-namespaces

# 查看是否有调度失败的 Pod
kubectl get pods --all-namespaces | grep Pending

根据你的节点情况，如果要实现调度错开，可能的原因包括：

某些应用配置了避开这个节点的调度策略
基于资源使用情况的智能调度
基于节点标签的环境隔离策略

需要具体查看 Pod 的调度配置才能确定是否有相关的调度约束。

https://kubernetes.io/zh-cn/docs/reference/kubectl/generated/kubectl_cordon/

去年就是加了个label,让vip只能taskon-server优先用,结果…

查看taskon-reward-5f9bf4757b-7q5j7这个pod占用的内存

可以通过以下几种方法查看 taskon-reward-5f9bf4757b-7q5j7 这个 Pod 的内存占用：

1. 直接查看 Pod 资源使用情况

# 查看单个 Pod 的资源使用
kubectl top pod taskon-reward-5f9bf4757b-7q5j7 -n taskon

# 如果不确定命名空间，可以查看所有命名空间
kubectl top pod taskon-reward-5f9bf4757b-7q5j7 --all-namespaces

# 查看 Pod 内所有容器的资源使用
kubectl top pod taskon-reward-5f9bf4757b-7q5j7 -n taskon --containers

2. 查看 Pod 的详细信息

# 查看 Pod 的详细描述，包括资源请求和限制
kubectl describe pod taskon-reward-5f9bf4757b-7q5j7 -n taskon

# 查看 Pod 的 YAML 配置
kubectl get pod taskon-reward-5f9bf4757b-7q5j7 -n taskon -o yaml

3. 查看 Pod 的资源配置

# 专门查看资源部分
kubectl get pod taskon-reward-5f9bf4757b-7q5j7 -n taskon -o jsonpath='{.spec.containers[*].resources}'

# 更详细的资源信息
kubectl get pod taskon-reward-5f9bf4757b-7q5j7 -n taskon -o jsonpath='{range .spec.containers[*]}{.name}{": requests="}{.resources.requests}{", limits="}{.resources.limits}{"\n"}{end}'

4. 如果 Pod 在运行，直接进入容器查看

# 进入容器查看系统资源使用
kubectl exec -it taskon-reward-5f9bf4757b-7q5j7 -n taskon -- top
kubectl exec -it taskon-reward-5f9bf4757b-7q5j7 -n taskon -- free -h
kubectl exec -it taskon-reward-5f9bf4757b-7q5j7 -n taskon -- ps aux --sort=-%mem | head

5. 查看所有 taskon 命名空间的 Pod 内存使用

# 查看 taskon 命名空间所有 Pod 的资源使用，按内存排序
kubectl top pods -n taskon --sort-by=memory

# 查看是否有这个 Pod
kubectl get pods -n taskon | grep taskon-reward

6. 查看历史资源使用（如果有监控）

1
2
3

# 如果集群有 Prometheus/Grafana，可以查看历史数据
# 查看 Pod 的事件，可能包含 OOM 信息
kubectl describe pod taskon-reward-5f9bf4757b-7q5j7 -n taskon | grep -A 10 Events

预期输出示例

正常情况下，kubectl top pod 的输出类似：

1 2	NAME CPU(cores) MEMORY(bytes) taskon-reward-5f9bf4757b-7q5j7 10m 128Mi

如果这个 Pod 是导致 OOM 的原因，你可能会看到：

内存使用很高（接近或超过限制）
在 kubectl describe pod 的 Events 中看到 OOM 相关的警告
Pod 可能处于 CrashLoopBackOff 状态

请先运行 kubectl top pod taskon-reward-5f9bf4757b-7q5j7 -n taskon 查看当前的内存使用情况，然后我们可以进一步分析是否这个 Pod 与之前看到的 OOM 问题相关。

这条命令好啊~

查看 taskon 命名空间所有 Pod 的资源使用，按内存排序

kubectl top pods -n taskon –sort-by=memory


ubuntu@VM-0-22-ubuntu:~$ kubectl top pods -n taskon --sort-by=memory
NAME                                                 CPU(cores)   MEMORY(bytes)   
taskon-reward-5f9bf4757b-7q5j7                       26m          2981Mi                 
taskon-backend-golang-withlog-xxl-7c8bcc9fbf-8cmzw   706m         1978Mi          
taskon-operation-bot-5fcb96785b-j9879                40m          367Mi           
taskon-operation-web-preview-67f87d86b-x9kxf         3m           256Mi           
taskon-backend-db-76bcb6d7f8-lbk96                   1m           225Mi           
taskon-operation-off-chain-86b5cd4676-bjlkg          10m          213Mi           
webapp-conversation-c45944f6-xlp8l                   1m           180Mi           
taskon-admin-python-55d7fb8b5b-zvmwd                 2m           180Mi           
taskon-uni-auth-f87cd57db-sjtf4                      3m           152Mi           
taskon-download-5dd65b757-bvnmn                      1m           132Mi           
extension-taskon-web-759c96df4d-cs5dg                0m           124Mi           
taskon-schedule-75c5994657-lsgg7                     33m          113Mi           
taskon-public-api-9648b688-7pqrg                     2m           101Mi           
taskon-onchain-verifier-8569f87cfb-dcjs6             1m           94Mi            
taskon-gas-station-948594cc-wrw5s                    3m           90Mi            
taskon-ont-74ddc7779d-npsht                          1m           80Mi            
taskon-sign-server-7777d885b-fwvv2                   1m           62Mi            
taskon-crm-659d54846f-4cxxv                          2m           60Mi            
taskon-actions-6fd859d8b6-965g8                      595m         60Mi            
taskon-chain-5fbf895fcc-pwn98                        2m           60Mi            
taskon-scanner-76c567f59c-bgzzt                      4m           57Mi            
taskon-backend-golang-withlog-9db77d668-vk7rm        1m           51Mi            
taskon-resource-579bbbc4b8-wcbs7                     1m           39Mi            
www-taskon-xyz-7f64cb9f49-wwlzx                      2m           38Mi            
taskon-payment-5bfc865d-4cmgv                        1m           35Mi            
taskon-operator-d89b858b8-bz86m                      1m           32Mi            
taskon-airdrop-nft-token-69cb64b9c5-8clfr            1m           31Mi            
taskon-alpha-756889d898-8z649                        1m           31Mi            
crm-website-taskon-xyz-7879488fcb-7vp8g              1m           31Mi            
www-taskon-xyz-7f64cb9f49-chhpz                      1m           29Mi            
bridge-taskon-xyz-765cb4cbf9-xprs4                   1m           29Mi            
earn-taskon-xyz-68cfbc454-dbxj2                      1m           29Mi            
taskon-agg-chain-node-86b958cc96-wdjsf               1m           26Mi            
taskon-event-tracker-5c65f58f58-shqps                7m           26Mi            
tgapp-taskon-xyz-5f97789d85-8dks7                    1m           26Mi            
taskon-feedback-7c644996d9-bn4tp                     3m           25Mi            
taskon-extension-backend-golang-6ccc47456d-v447l     1m           20Mi            
www-taskon-xyz-7f64cb9f49-929c8                      2m           19Mi            
www-taskon-xyz-rongyi-test-6c6f7594b9-tgwmf          1m           18Mi            
operator-website-taskon-xyz-776558966c-7qfxj         1m           18Mi            
mysql-ro0-exporter-exporter-8478844bfb-5ch97         1m           16Mi            
mysql-exporter-exporter-6d8966b7b8-bpxf9             1m           16Mi            
mysql-ro1-exporter-exporter-687b9658d5-qsfc2         1m           15Mi            
hello-app-rongyi-5669588578-rswv2                    1m           14Mi            
taskon-redis-buy-exporter-5c5c58578d-cdgph           1m           12Mi            
taskon-docs-web-65685f4c98-k4rbt                     0m           12Mi

action看起来很吃cpu

从grafana趋势看,reward和server还是有潜在内存泄露

ubuntu@VM-0-22-ubuntu:$ k -n monitoring get PrometheusRule
NAME AGE
mysql-connection-rules 723d
prometheus-kube-prometheus-alertmanager.rules 723d
prometheus-kube-prometheus-config-reloaders 723d
prometheus-kube-prometheus-etcd 723d
prometheus-kube-prometheus-general.rules 723d
prometheus-kube-prometheus-k8s.rules 723d
prometheus-kube-prometheus-kube-apiserver-availability.rules 723d
prometheus-kube-prometheus-kube-apiserver-burnrate.rules 723d
prometheus-kube-prometheus-kube-apiserver-histogram.rules 723d
prometheus-kube-prometheus-kube-apiserver-slos 723d
prometheus-kube-prometheus-kube-prometheus-general.rules 723d
prometheus-kube-prometheus-kube-prometheus-node-recording.rules 723d
prometheus-kube-prometheus-kube-state-metrics 723d
prometheus-kube-prometheus-kubelet.rules 723d
prometheus-kube-prometheus-kubernetes-apps 723d
prometheus-kube-prometheus-kubernetes-resources 723d
prometheus-kube-prometheus-kubernetes-storage 723d
prometheus-kube-prometheus-kubernetes-system 723d
prometheus-kube-prometheus-kubernetes-system-apiserver 723d
prometheus-kube-prometheus-kubernetes-system-kubelet 723d
prometheus-kube-prometheus-node-exporter 723d
prometheus-kube-prometheus-node-exporter.rules 723d
prometheus-kube-prometheus-node-network 723d
prometheus-kube-prometheus-node.rules 723d
prometheus-kube-prometheus-prometheus 723d
prometheus-kube-prometheus-prometheus-operator 723d
redis-rules 640d
taskon-decreasing-request 157d
twitter-validator-rule 723d
ubuntu@VM-0-22-ubuntu:$

原文链接: https://dashen.tech/2020/10/09/K8s环境下普罗米修斯的规则配置/

版权声明: 转载请注明出处.

如何查看watchdog这个规则配置，在哪个里面？

vector(1) 在普罗米修斯中什么意思？

问题分析

可能原因

排查步骤

解决建议

关键问题：频繁的系统 OOM（内存溢出）

问题分析

解决方案

1. 立即处理

2. 设置资源限制

3. 监控和优化

4. 长期解决

如何查看有没有触发kubectl cordon

1. 查看节点调度状态

2. 检查节点 Taints

3. 检查节点标签和注解

4. 查看 kubectl cordon 相关的状态指标

5. 实际验证方法

结论

检查 label 调度错开 啥意思

主要概念和用途

1. 节点标签 (Node Labels)

2. 调度错开的几种方式

A. 使用 nodeSelector 指定调度

B. 使用 nodeAffinity 更灵活的调度

C. 使用 podAntiAffinity 分散 Pod

3. 实际检查方法

4. 常见的调度错开场景

A. 环境隔离

B. 工作负载类型隔离

C. 高可用性分散

5. 检查当前集群的调度策略

查看taskon-reward-5f9bf4757b-7q5j7这个pod占用的内存

1. 直接查看 Pod 资源使用情况

2. 查看 Pod 的详细信息

3. 查看 Pod 的资源配置

4. 如果 Pod 在运行，直接进入容器查看

5. 查看所有 taskon 命名空间的 Pod 内存使用

6. 查看历史资源使用（如果有监控）

预期输出示例

查看 taskon 命名空间所有 Pod 的资源使用，按内存排序

文章目录

检查 label 调度错开啥意思