Kubernetes-监控Kubelet-工具盒子

上一篇我们介绍了如何监控Kube-Proxy，Kube-Proxy的/metrics接口没有认证，相对比较容易，这一篇我们介绍一下Kubelet，Kubelet的监控相比Kube-Proxy增加了认证机制，相对更复杂一些。

Kubelet 端口说明

如果你有多台Node节点，可以批量执行 ss -tlnp|grep kubelet 看一下，Kubelet 监听两个固定端口(我的环境，你的环境可能不同)，一个是10248，一个是10250，通过下面的命令可以知道，10248是健康检查的端口：

[root@tt-fc-dev01.nj ~]# ps aux|grep kubelet
root      163490  0.0  0.0  12136  1064 pts/1    S+   13:34   0:00 grep --color=auto kubelet
root      166673  3.2  1.0 3517060 81336 ?       Ssl  Aug16 4176:52 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --hostname-override=10.206.0.16 --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.6
[root@tt-fc-dev01.nj ~]# cat /var/lib/kubelet/config.yaml | grep 102
healthzPort: 10248
[root@tt-fc-dev01.nj ~]# curl localhost:10248/healthz
ok

我们再看一下 10250，10250实际是Kubelet的默认端口，/metrics 接口就是在这个端口暴露的，我们请求一下：

[root@tt-fc-dev01.nj ~]# curl localhost:10250/metrics Client sent an HTTP request to an HTTPS server. [root@tt-fc-dev01.nj ~]# curl https://localhost:10250/metrics curl: (60) SSL certificate problem: self signed certificate in certificate chain More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not establish a secure connection to it. To learn more about this situation and how to fix it, please visit the web page mentioned above. [root@tt-fc-dev01.nj ~]# curl -k https://localhost:10250/metrics Unauthorized

-k 表示不校验SSL证书是否正确，最后的命令可以看到返回了 Unauthorized，表示认证失败，我们先来解决一下认证问题。认证是 Kubernetes 的一个知识点，这里先不展开（你需要Google一下了解基本常识），直接实操。

认证信息

下面的信息可以保存为 auth.yaml，创建了 ClusterRole、ServiceAccount、ClusterRoleBinding。

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: categraf-daemonset
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/stats
  - nodes/proxy
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: categraf-daemonset
  namespace: flashcat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: categraf-daemonset
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: categraf-daemonset
subjects:
- kind: ServiceAccount
  name: categraf-daemonset
  namespace: flashcat

ClusterRole是个全局概念，不属于任一个namespace，定义了很多权限点，都是读权限，监控嘛，读权限就可以了，ServiceAccount则是namespace颗粒度的一个概念，这里我们创建了一个名为categraf-daemonset的ServiceAccount，然后绑定到ClusterRole上面，具备了各种查询权限。apply一下即可：

[work@tt-fc-dev01.nj yamls]$ kubectl apply -f auth.yaml clusterrole.rbac.authorization.k8s.io/categraf-daemonset created serviceaccount/categraf-daemonset created clusterrolebinding.rbac.authorization.k8s.io/categraf-daemonset created

[work@tt-fc-dev01.nj yamls]$ kubectl get ClusterRole | grep categraf-daemon categraf-daemonset 2022-11-14T03:53:54Z [work@tt-fc-dev01.nj yamls]$ kubectl get sa -n flashcat NAME SECRETS AGE categraf-daemonset 1 90m default 1 4d23h [work@tt-fc-dev01.nj yamls]$ kubectl get ClusterRoleBinding -n flashcat | grep categraf-daemon categraf-daemonset ClusterRole/categraf-daemonset 91m

测试权限

上面的命令行输出可以看出来，我们已经成功创建了 ServiceAccount，把ServiceAccount的内容打印出来看一下：

[root@tt-fc-dev01.nj qinxiaohui]# kubectl get sa categraf-daemonset -n flashcat -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"name":"categraf-daemonset","namespace":"flashcat"}}
  creationTimestamp: "2022-11-14T03:53:54Z"
  name: categraf-daemonset
  namespace: flashcat
  resourceVersion: "120570510"
  uid: 22f5a785-871c-4454-b82e-12bf104450a0
secrets:
- name: categraf-daemonset-token-7mccq

注意最后两行，这个ServiceAccount实际是关联了一个Secret，我们再看看这个Secret的内容：

[root@tt-fc-dev01.nj qinxiaohui]# kubectl get secret categraf-daemonset-token-7mccq -n flashcat -o yaml
apiVersion: v1
data:
  ca.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeU1ERXdPVEF4TXpjek9Gb1hEVE15TURFd056QXhNemN6T0Zvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBS2F1Ck9wU3hHdXB0ZlNraW1zbmlONFVLWnp2b1p6akdoTks1eUVlZWFPcmptdXIwdTFVYlFHbTBRWlpMem8xVi9GV1gKVERBOUthcFRNVllyS2hBQjNCVXdqdGhCaFp1NjJVQzg5TmRNSDVzNFdmMGtMNENYZWQ3V2g2R05Md0MyQ2xKRwp3Tmp1UkZRTndxMWhNWjY4MGlaT1hLZk1NbEt6bWY4aDJWZmthREdpVHk0VzZHWE5sRlRJSFFkVFBVMHVMY3dYCmc1cUVsMkd2cklmd05JSXBOV3ZoOEJvaFhyc1pOZVNlNHhGMVFqY0R2QVE4Q0xta2J2T011UGI5bGtwalBCMmsKV055RTVtVEZCZ2NCQ3dzSGhjUHhyN0E3cXJXMmtxbU1MbUJpc2dHZm9ieXFWZy90cTYzS1oxYlRvWjBIbXhicQp6TkpOZUJpbm9jbi8xblJBK3NrQ0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZLVkxrbVQ5RTNwTmp3aThsck5UdXVtRm1MWHNNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBSm5QR24rR012S1ZadFVtZVc2bQoxanY2SmYvNlBFS2JzSHRkN2dINHdwREI3YW9pQVBPeTE0bVlYL2d5WWgyZHdsRk9hTWllVS9vUFlmRDRUdGxGCkZMT08yVkdLVTJBSmFNYnVBekw4ZTlsTFREM0xLOGFJUm1FWFBhQkR2V3VUYXZuSTZCWDhiNUs4SndraVd0R24KUFh0ejZhOXZDK1BoaWZDR0phMkNxQWtJV0Nrc0lWenNJcWJ0dkEvb1pHK1dhMlduemFlMC9OUFl4QS8waldOMwpVcGtDWllFaUQ4VlUwenRIMmNRTFE4Z2Mrb21uc3ljaHNjaW5KN3JsZS9XbVFES3ZhVUxLL0xKVTU0Vm1DM2grCnZkaWZtQStlaFZVZnJaTWx6SEZRbWdzMVJGMU9VczNWWUd0REt5YW9uRkc0VFlKa1NvM0IvRlZOQ0ZtcnNHUTYKZWV3PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  namespace: Zmxhc2hjYXQ=
  token: ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNklqRTJZVTlNU2pObFFVbEhlbmhDV1dsVmFIcEVTRlZVWVdoZlZVaDZSbmd6TUZGZlVWUjJUR0pzVUVraWZRLmV5SnBjM01pT2lKcmRXSmxjbTVsZEdWekwzTmxjblpwWTJWaFkyTnZkVzUwSWl3aWEzVmlaWEp1WlhSbGN5NXBieTl6WlhKMmFXTmxZV05qYjNWdWRDOXVZVzFsYzNCaFkyVWlPaUptYkdGemFHTmhkQ0lzSW10MVltVnlibVYwWlhNdWFXOHZjMlZ5ZG1salpXRmpZMjkxYm5RdmMyVmpjbVYwTG01aGJXVWlPaUpqWVhSbFozSmhaaTFrWVdWdGIyNXpaWFF0ZEc5clpXNHROMjFqWTNFaUxDSnJkV0psY201bGRHVnpMbWx2TDNObGNuWnBZMlZoWTJOdmRXNTBMM05sY25acFkyVXRZV05qYjNWdWRDNXVZVzFsSWpvaVkyRjBaV2R5WVdZdFpHRmxiVzl1YzJWMElpd2lhM1ZpWlhKdVpYUmxjeTVwYnk5elpYSjJhV05sWVdOamIzVnVkQzl6WlhKMmFXTmxMV0ZqWTI5MWJuUXVkV2xrSWpvaU1qSm1OV0UzT0RVdE9EY3hZeTAwTkRVMExXSTRNbVV0TVRKaVpqRXdORFExTUdFd0lpd2ljM1ZpSWpvaWMzbHpkR1Z0T25ObGNuWnBZMlZoWTJOdmRXNTBPbVpzWVhOb1kyRjBPbU5oZEdWbmNtRm1MV1JoWlcxdmJuTmxkQ0o5Lm03czJ2Z1JuZDJzMDJOUkVwakdpc0JYLVBiQjBiRjdTRUFqb2RjSk9KLWh6YWhzZU5FSDFjNGNDbXotMDN5Z1Rkal9NT1VKaWpCalRmaW9FSWpGZHRCS0hEMnNjNXlkbDIwbjU4VTBSVXVDemRYQl9tY0J1WDlWWFM2bE5zYVAxSXNMSGdscV9Sbm5XcDZaNmlCaWp6SU05QUNuckY3MGYtd1FZTkVLc2MzdGhubmhSX3E5MkdkZnhmdGU2NmhTRGthdGhPVFRuNmJ3ZnZMYVMxV1JCdEZ4WUlwdkJmVXpkQ1FBNVhRYVNPck00RFluTE5uVzAxWDNqUGVZSW5ka3NaQ256cmV6Tnp2OEt5VFRTSlJ2VHVKMlZOU2lHaDhxTEgyZ3IzenhtQm5Qb1d0czdYeFhBTkJadG0yd0E2OE5FXzY0SlVYS0tfTlhfYmxBbFViakwtUQ==
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: categraf-daemonset
    kubernetes.io/service-account.uid: 22f5a785-871c-4454-b82e-12bf104450a0
  creationTimestamp: "2022-11-14T03:53:54Z"
  name: categraf-daemonset-token-7mccq
  namespace: flashcat
  resourceVersion: "120570509"
  uid: 0a228da5-6e60-4b22-beff-65cc56683e41
type: kubernetes.io/service-account-token

我们把这个token字段拿到，然后base64转码一下，作为Bearer Token来请求测试一下：

[root@tt-fc-dev01.nj qinxiaohui]# token=`kubectl get secret categraf-daemonset-token-7mccq -n flashcat -o jsonpath={.data.token} | base64 -d`
[root@tt-fc-dev01.nj qinxiaohui]# curl -s -k -H "Authorization: Bearer $token" https://localhost:10250/metrics > aaaa
[root@tt-fc-dev01.nj qinxiaohui]# head -n 5 aaaa
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0

通了！

这就说明我们创建的ServiceAccount是好使的，后面我们把 Categraf 作为采集器搞成 Daemonset，再为 Categraf 这个 Daemonset 指定 ServiceAccountName，Kubernetes就会自动把 Token 的内容挂到 Daemonset 的目录里，下面开始实操。

升级 Daemonset

上一篇咱们为 Kube-Proxy 的采集准备了 Daemonset，咱们就继续修改这个 Daemonset，让这个 Daemonset 不但可以采集 Kube-Proxy，也可以采集 Kubelet，先给 Categraf 准备一下相关的配置，可以把下面的内容保存为 categraf-configmap-v2.yaml

---
kind: ConfigMap
metadata:
  name: categraf-config
apiVersion: v1
data:
  config.toml: |
    [global]
    hostname = "$HOSTNAME"
    interval = 15
    providers = ["local"]
    [writer_opt]
    batch = 2000
    chan_size = 10000
    [[writers]]
    url = "http://10.206.0.16:19000/prometheus/v1/write"
    timeout = 5000
    dial_timeout = 2500
    max_idle_conns_per_host = 100    
---
kind: ConfigMap
metadata:
  name: categraf-input-prometheus
apiVersion: v1
data:
  prometheus.toml: |
    [[instances]]
    urls = ["http://127.0.0.1:10249/metrics"]
    labels = { job="kube-proxy" }
    [[instances]]
    urls = ["https://127.0.0.1:10250/metrics"]
    bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    use_tls = true
    insecure_skip_verify = true
    labels = { job="kubelet" }
    [[instances]]
    urls = ["https://127.0.0.1:10250/metrics/cadvisor"]
    bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    use_tls = true
    insecure_skip_verify = true
    labels = { job="cadvisor" }

apply 一下，让新的配置生效：

[work@tt-fc-dev01.nj yamls]$ kubectl apply -f categraf-configmap-v2.yaml -n flashcat
configmap/categraf-config unchanged
configmap/categraf-input-prometheus configured

Categraf 的 Daemonset 需要把 ServiceAccountName 给绑定上，上一讲咱们用的 yaml 命名为：categraf-daemonset-v1.yaml ，咱们升级一下这个文件到 categraf-daemonset-v2.yaml 版本，内容如下：

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: categraf-daemonset
  name: categraf-daemonset
spec:
  selector:
    matchLabels:
      app: categraf-daemonset
  template:
    metadata:
      labels:
        app: categraf-daemonset
    spec:
      containers:
      - env:
        - name: TZ
          value: Asia/Shanghai
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOSTIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: flashcatcloud/categraf:v0.2.18
        imagePullPolicy: IfNotPresent
        name: categraf
        volumeMounts:
        - mountPath: /etc/categraf/conf
          name: categraf-config
        - mountPath: /etc/categraf/conf/input.prometheus
          name: categraf-input-prometheus
      hostNetwork: true
      serviceAccountName: categraf-daemonset
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - configMap:
          name: categraf-config
        name: categraf-config
      - configMap:
          name: categraf-input-prometheus
        name: categraf-input-prometheus

这里跟 v1 版本相比，唯一的变化，就是加了 serviceAccountName: categraf-daemonset 这个配置，把原来的 Daemonset 删掉，从新创建一下：

[work@tt-fc-dev01.nj yamls]$ kubectl delete ds categraf-daemonset -n flashcat daemonset.apps "categraf-daemonset" deleted [work@tt-fc-dev01.nj yamls]$ kubectl apply -f categraf-daemonset-v2.yaml -n flashcat daemonset.apps/categraf-daemonset created waiting...

[work@tt-fc-dev01.nj yamls]$ kubectl get pods -n flashcat NAME READY STATUS RESTARTS AGE categraf-daemonset-d8jt8 1/1 Running 0 37s categraf-daemonset-fpx8v 1/1 Running 0 43s categraf-daemonset-mp468 1/1 Running 0 32s categraf-daemonset-s775l 1/1 Running 0 40s categraf-daemonset-wxkjk 1/1 Running 0 47s categraf-daemonset-zwscc 1/1 Running 0 35s

好了，我们去检查一下数据是否成功采集上来了：

上面这个指标是 Kubelet 自身的，即从 Kubelet 的 /metrics 接口采集的，我们再来看一个 cAdvisor 的，即从 /metrics/cadvisor 接口采集的：

看起来数据都上来了，导入监控大盘看看效果。

导入仪表盘

分成两部分，一个是 Kubelet 自身的仪表盘，JSON配置在这里，截图效果如下：

另外一个是Pod容器相关的大盘，JSON配置在这里(感谢张健老师悉心整理)

监控指标说明

之前孔飞老师整理的 Kubelet 相关指标的中文解释，我也一并附到这里，供大家参考：

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles. # TYPE go_gc_duration_seconds summary gc的时间统计(summary指标) HELP go_goroutines Number of goroutines that currently exist. TYPE go_goroutines gauge goroutine 数量 HELP go_threads Number of OS threads created. TYPE go_threads gauge 线程数量 HELP kubelet_cgroup_manager_duration_seconds [ALPHA] Duration in seconds for cgroup manager operations. Broken down by method. TYPE kubelet_cgroup_manager_duration_seconds histogram 操作cgroup的时长分布，按照操作类型统计 HELP kubelet_containers_per_pod_count [ALPHA] The number of containers per pod. TYPE kubelet_containers_per_pod_count histogram pod中container数量的统计(spec.containers的数量) HELP kubelet_docker_operations_duration_seconds [ALPHA] Latency in seconds of Docker operations. Broken down by operation type. TYPE kubelet_docker_operations_duration_seconds histogram 操作docker的时长分布，按照操作类型统计 HELP kubelet_docker_operations_errors_total [ALPHA] Cumulative number of Docker operation errors by operation type. TYPE kubelet_docker_operations_errors_total counter 操作docker的错误累计次数，按照操作类型统计 HELP kubelet_docker_operations_timeout_total [ALPHA] Cumulative number of Docker operation timeout by operation type. TYPE kubelet_docker_operations_timeout_total counter 操作docker的超时统计，按照操作类型统计 HELP kubelet_docker_operations_total [ALPHA] Cumulative number of Docker operations by operation type. TYPE kubelet_docker_operations_total counter 操作docker的累计次数，按照操作类型统计 HELP kubelet_eviction_stats_age_seconds [ALPHA] Time between when stats are collected, and when pod is evicted based on those stats by eviction signal TYPE kubelet_eviction_stats_age_seconds histogram 驱逐操作的时间分布，按照驱逐信号(原因)分类统计 HELP kubelet_evictions [ALPHA] Cumulative number of pod evictions by eviction signal TYPE kubelet_evictions counter 驱逐次数统计，按照驱逐信号(原因)统计 HELP kubelet_http_inflight_requests [ALPHA] Number of the inflight http requests TYPE kubelet_http_inflight_requests gauge 请求kubelet的inflight请求数,按照method path server_type统计, 注意与每秒的request数区别开 HELP kubelet_http_requests_duration_seconds [ALPHA] Duration in seconds to serve http requests TYPE kubelet_http_requests_duration_seconds histogram 请求kubelet的请求时间统计, 按照method path server_type统计 HELP kubelet_http_requests_total [ALPHA] Number of the http requests received since the server started TYPE kubelet_http_requests_total counter 请求kubelet的请求数统计,按照method path server_type统计 HELP kubelet_managed_ephemeral_containers [ALPHA] Current number of ephemeral containers in pods managed by this kubelet. Ephemeral containers will be ignored if disabled by the EphemeralContainers feature gate, and this number will be 0. TYPE kubelet_managed_ephemeral_containers gauge 当前kubelet管理的临时容器数量 HELP kubelet_network_plugin_operations_duration_seconds [ALPHA] Latency in seconds of network plugin operations. Broken down by operation type. TYPE kubelet_network_plugin_operations_duration_seconds histogram 网络插件的操作耗时分布，按照操作类型(operation_type)统计, 如果 --feature-gates=EphemeralContainers=false, 否则一直为0 HELP kubelet_network_plugin_operations_errors_total [ALPHA] Cumulative number of network plugin operation errors by operation type. TYPE kubelet_network_plugin_operations_errors_total counter 网络插件累计操作错误数统计，按照操作类型(operation_type)统计 HELP kubelet_network_plugin_operations_total [ALPHA] Cumulative number of network plugin operations by operation type. TYPE kubelet_network_plugin_operations_total counter 网络插件累计操作数统计，按照操作类型(operation_type)统计 HELP kubelet_node_name [ALPHA] The node's name. The count is always 1. TYPE kubelet_node_name gauge node name HELP kubelet_pleg_discard_events [ALPHA] The number of discard events in PLEG. TYPE kubelet_pleg_discard_events counter PLEG(pod lifecycle event generator) 丢弃的event数统计 HELP kubelet_pleg_last_seen_seconds [ALPHA] Timestamp in seconds when PLEG was last seen active. TYPE kubelet_pleg_last_seen_seconds gauge PLEG上次活跃的时间戳 HELP kubelet_pleg_relist_duration_seconds [ALPHA] Duration in seconds for relisting pods in PLEG. TYPE kubelet_pleg_relist_duration_seconds histogram PLEG relist pod时间分布 HELP kubelet_pleg_relist_interval_seconds [ALPHA] Interval in seconds between relisting in PLEG. TYPE kubelet_pleg_relist_interval_seconds histogram PLEG relist 间隔时间分布 HELP kubelet_pod_start_duration_seconds [ALPHA] Duration in seconds for a single pod to go from pending to running. TYPE kubelet_pod_start_duration_seconds histogram pod启动时间(从pending到running)分布, kubelet watch到pod时到pod中contianer都running后, watch各种source channel的pod变更 HELP kubelet_pod_worker_duration_seconds [ALPHA] Duration in seconds to sync a single pod. Broken down by operation type: create, update, or sync TYPE kubelet_pod_worker_duration_seconds histogram pod状态变化的时间分布，按照操作类型(create update sync)统计, worker就是kubelet中处理一个pod的逻辑工作单位 HELP kubelet_pod_worker_start_duration_seconds [ALPHA] Duration in seconds from seeing a pod to starting a worker. TYPE kubelet_pod_worker_start_duration_seconds histogram kubelet watch到pod到worker启动的时间分布 HELP kubelet_run_podsandbox_duration_seconds [ALPHA] Duration in seconds of the run_podsandbox operations. Broken down by RuntimeClass.Handler. TYPE kubelet_run_podsandbox_duration_seconds histogram 启动sandbox的时间分布 HELP kubelet_run_podsandbox_errors_total [ALPHA] Cumulative number of the run_podsandbox operation errors by RuntimeClass.Handler. TYPE kubelet_run_podsandbox_errors_total counter 启动sanbox出现error的总数 HELP kubelet_running_containers [ALPHA] Number of containers currently running TYPE kubelet_running_containers gauge 当前containers运行状态的统计, 按照container状态统计，created running exited HELP kubelet_running_pods [ALPHA] Number of pods that have a running pod sandbox TYPE kubelet_running_pods gauge 当前处于running状态pod数量 HELP kubelet_runtime_operations_duration_seconds [ALPHA] Duration in seconds of runtime operations. Broken down by operation type. TYPE kubelet_runtime_operations_duration_seconds histogram 容器运行时的操作耗时(container在create list exec remove stop等的耗时) HELP kubelet_runtime_operations_errors_total [ALPHA] Cumulative number of runtime operation errors by operation type. TYPE kubelet_runtime_operations_errors_total counter 容器运行时的操作错误数统计(按操作类型统计) HELP kubelet_runtime_operations_total [ALPHA] Cumulative number of runtime operations by operation type. TYPE kubelet_runtime_operations_total counter 容器运行时的操作总数统计(按操作类型统计) HELP kubelet_started_containers_errors_total [ALPHA] Cumulative number of errors when starting containers TYPE kubelet_started_containers_errors_total counter kubelet启动容器错误总数统计(按code和container_type统计) code包括ErrImagePull ErrImageInspect ErrImagePull ErrRegistryUnavailable ErrInvalidImageName等 container_type一般为"container" "podsandbox" HELP kubelet_started_containers_total [ALPHA] Cumulative number of containers started TYPE kubelet_started_containers_total counter kubelet启动容器总数 HELP kubelet_started_pods_errors_total [ALPHA] Cumulative number of errors when starting pods TYPE kubelet_started_pods_errors_total counter kubelet启动pod遇到的错误总数(只有创建sandbox遇到错误才会统计) HELP kubelet_started_pods_total [ALPHA] Cumulative number of pods started TYPE kubelet_started_pods_total counter kubelet启动的pod总数 HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. TYPE process_cpu_seconds_total counter 统计cpu使用率 HELP process_max_fds Maximum number of open file descriptors. TYPE process_max_fds gauge 允许进程打开的最大fd数 HELP process_open_fds Number of open file descriptors. TYPE process_open_fds gauge 当前打开的fd数量 HELP process_resident_memory_bytes Resident memory size in bytes. TYPE process_resident_memory_bytes gauge 进程驻留内存大小 HELP process_start_time_seconds Start time of the process since unix epoch in seconds. TYPE process_start_time_seconds gauge 进程启动时间 HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL. TYPE rest_client_request_duration_seconds histogram 请求apiserver的耗时统计(按照url和请求类型统计verb) HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host. TYPE rest_client_requests_total counter 请求apiserver的总次数(按照返回码code和请求类型method统计) HELP storage_operation_duration_seconds [ALPHA] Storage operation duration TYPE storage_operation_duration_seconds histogram 存储操作耗时(按照存储plugin(configmap emptydir hostpath 等 )和operation_name分类统计) HELP volume_manager_total_volumes [ALPHA] Number of volumes in Volume Manager TYPE volume_manager_total_volumes gauge

本机挂载的volume数量统计(按照plugin_name和state统计 plugin_name包括"host-path" "empty-dir" "configmap" "projected") state(desired_state_of_world期状态/actual_state_of_world实际状态)

下面是 cAdvisor 指标梳理：

# HELP container_cpu_cfs_periods_total Number of elapsed enforcement period intervals. # TYPE container_cpu_cfs_periods_total counter cfs时间片总数, 完全公平调度的时间片总数(分配到cpu的时间片数) HELP container_cpu_cfs_throttled_periods_total Number of throttled period intervals. TYPE container_cpu_cfs_throttled_periods_total counter 容器被throttle的时间片总数 HELP container_cpu_cfs_throttled_seconds_total Total time duration the container has been throttled. TYPE container_cpu_cfs_throttled_seconds_total counter 容器被throttle的时间 HELP container_file_descriptors Number of open file descriptors for the container. TYPE container_file_descriptors gauge 容器打开的fd数 HELP container_memory_usage_bytes Current memory usage in bytes, including all memory regardless of when it was accessed TYPE container_memory_usage_bytes gauge 容器内存使用量，单位byte HELP container_network_receive_bytes_total Cumulative count of bytes received TYPE container_network_receive_bytes_total counter 容器入方向的流量 HELP container_network_transmit_bytes_total Cumulative count of bytes transmitted TYPE container_network_transmit_bytes_total counter 容器出方向的流量 HELP container_spec_cpu_period CPU period of the container. TYPE container_spec_cpu_period gauge 容器的cpu调度单位时间 HELP container_spec_cpu_quota CPU quota of the container. TYPE container_spec_cpu_quota gauge 容器的cpu规格，除以单位调度时间可以计算核数 HELP container_spec_memory_limit_bytes Memory limit for the container. TYPE container_spec_memory_limit_bytes gauge 容器的内存规格，单位byte HELP container_threads Number of threads running inside the container TYPE container_threads gauge 容器当前的线程数 HELP container_threads_max Maximum number of threads allowed inside the container, infinity if value is zero TYPE container_threads_max gauge

允许容器启动的最大线程数