

Collecting OOM Logs from Crashed Kubernetes Pods with Robusta

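Robusta does far more than this chapter covers. It monitors Kubernetes, provides observability, integrates with Prometheus to post-process alerts and automate remediation, and offers an event timeline.

Previously I used Alibaba's kube-eventer. kube-eventer only forwards events, so all it can deliver is event-triggered notifications.

If Robusta stopped there, there would be little reason to adopt it. It also provides another very useful feature: event alerting. When Robusta detects a matching event, it sends the preconfigured pod status together with the most recent logs to Slack. That feature is the main reason for this article.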


The Python version must be 3.7 or later, so we upgrade it first.

Upgrade Python:
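Before upgrading, it may be worth checking whether the installed interpreter already meets the 3.7 minimum. The following is a minimal sketch, not part of the original procedure: the helper names `version_ge` and `check_python` are hypothetical, and the comparison assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
#!/bin/sh
# version_ge A B: succeed when version A >= version B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_python: report whether the python3 on PATH satisfies the 3.7 minimum.
# A missing python3 is treated as version 0, i.e. too old.
check_python() {
    current=$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null) || current=0
    if version_ge "$current" 3.7; then
        echo "python3 $current is new enough"
    else
        echo "python3 $current is too old (or missing); upgrade needed"
    fi
}

# Usage:
#   check_python
```

If the check passes, the source build below can be skipped.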

$ wget https://www.python.org/ftp/python/3.9.16/Python-3.9.16.tar.xz
$ yum install gcc zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel -y
$ yum install libffi-devel -y
$ yum install zlib* -y

$ tar xf Python-3.9.16.tar.xz
$ cd Python-3.9.16
$ ./configure --with-ssl --prefix=/usr/local/python3
$ make
$ make install
$ rm -rf /usr/bin/python3 /usr/bin/pip3
$ ln -s /usr/local/python3/bin/python3 /usr/bin/python3
$ ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3


$ mkdir -p ~/.pip/
$ cat > ~/.pip/pip.conf << EOF
[global]
trusted-host = mirrors.aliyun.com
index-url = http://mirrors.aliyun.com/pypi/simple
EOF


Follow the official documentation ^[1]^ to install.

$ pip3 install -U robusta-cli --no-cache
$ robusta gen-config

Because of network problems, I run the configuration through Docker instead.

$ curl -fsSL -o robusta https://docs.robusta.dev/master/_static/robusta
$ chmod +x robusta
$ ./robusta gen-config


$ docker pull registry.cn-zhangjiakou.aliyuncs.com/marksugar-k8s/robusta-cli:latest
$ docker tag registry.cn-zhangjiakou.aliyuncs.com/marksugar-k8s/robusta-cli:latest us-central1-docker.pkg.dev/genuine-flight-317411/devel/robusta-cli:latest

$ ./robusta gen-config
Robusta reports its findings to external destinations (we call them "sinks").
We'll define some of them now.

Configure Slack integration? This is HIGHLY recommended. [Y/n]: y
If your browser does not automatically launch, open the below url:
https://api.robusta.dev/integrations/slack?id=64a3ee7c-5691-466f-80da-85e8ece80359

======================================================================
Error getting slack token
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

======================================================================
Error getting slack token
HTTPSConnectionPool(host='api.robusta.dev', port=443): Max retries exceeded with url: /integrations/slack/get-token?id=64a3ee7c-5691-466f-80da-85e8ece80359 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f50b1f18cd0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

You've just connected Robusta to the Slack of: crow as a cock
Which slack channel should I send notifications to? #

As prompted, open the URL in a browser: https://api.robusta.dev/integrations/slack?id=64a3ee7c-5691-466f-80da-85e8ece80359



The Robusta app now appears in Slack:


Which slack channel should I send notifications to? # devops



$ ./robusta  gen-config
Robusta reports its findings to external destinations (we call them "sinks").
We'll define some of them now.

Configure Slack integration? This is HIGHLY recommended. [Y/n]: y
If your browser does not automatically launch, open the below url:
https://api.robusta.dev/integrations/slack?id=d1fcbb13-5174-4027-a176-a3dcab10c27a

Error getting slack token
HTTPSConnectionPool(host='api.robusta.dev', port=443): Max retries exceeded with url: /integrations/slack/get-token?id=d1fcbb13-5174-4027-a176-a3dcab10c27a (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f0ec508eee0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

You've just connected Robusta to the Slack of: crow as a cock
Which slack channel should I send notifications to? # devops
Configure MsTeams integration? [y/N]: n
Configure Robusta UI sink? This is HIGHLY recommended. [Y/n]: y
Enter your Gmail/Google address. This will be used to login: user@gmail.com
Choose your account name (e.g your organization name): marksugar
Successfully registered.

Robusta can use Prometheus as an alert source.
If you haven't installed it yet, Robusta can install a pre-configured Prometheus.
Would you like to do so? [y/N]: y
Please read and approve our End User License Agreement: https://api.robusta.dev/eula.html
Do you accept our End User License Agreement? [y/N]: y
Last question! Would you like to help us improve Robusta by sending exception reports? [y/N]: n

Saved configuration to ./generated_values.yaml - save this file for future use!
Finish installing with Helm (see the Robusta docs). Then login to Robusta UI at https://platform.robusta.dev

By the way, we'll send you some messages later to get feedback. (We don't store your API key, so we scheduled future messages using Slack's API)


When the wizard finishes, it has created a generated_values.yaml file:

globalConfig:
  signing_key: 92a8195-a3fa879b3f88
  account_id: 79efaf9c433294
sinksConfig:
- slack_sink:
    name: main_slack_sink
    slack_channel: devops
    api_key: xoxb-4715825756487-4749501ZZylPy1f
- robusta_sink:
    name: robusta_ui_sink
    token: eyJhY2NvjIn0=
enablePrometheusStack: true
enablePlatformPlaybooks: true
runner:
  sendAdditionalTelemetry: false
rsa:
  public: LS0tLS1CRUdJTiBQTElDIEtFWS0tLS0tCg==


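Next, we install using the YAML file created above, after adjusting its contents a little.

There are many kinds of triggers; see example-triggers ^[2]^, java-troubleshooting ^[3]^, event-enrichment ^[4]^, miscellaneous ^[5]^, and kubernetes-triggers ^[6]^. A trigger can be filtered to a particular group of pods or a namespace so that only specific information is monitored.

We pick a few of them to test and add them to generated_values.yaml, as follows: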

globalConfig:
  signing_key: 92a8195-a3fa879b3f88
  account_id: 79efaf9c433294
sinksConfig:
- slack_sink:
    name: main_slack_sink
    slack_channel: devops
    api_key: xoxb-4715825756487-4749501ZZylPy1f
- robusta_sink:
    name: robusta_ui_sink
    token: eyJhY2NvjIn0=
enablePrometheusStack: false
enablePlatformPlaybooks: true
rsa:
  public: LS0tLS1CRUdJTiBQTElDIEtFWS0tLS0tCg==
customPlaybooks:
- triggers:
  - on_deployment_update: {}
  actions:
  - resource_babysitter:
      omitted_fields: []
      fields_to_monitor: ["spec.replicas"]
- triggers:
  - on_pod_crash_loop:
      restart_reason: "CrashLoopBackOff"
      restart_count: 1
      rate_limit: 3600
  actions:
  - report_crash_loop: {}
- triggers:
  - on_pod_oom_killed:
      rate_limit: 900
      exclude:
      - name: "oomkilled-pod"
        namespace: "default"
  actions:
  - pod_graph_enricher:
      resource_type: Memory
      display_limits: true
- triggers:
  - on_container_oom_killed:
      rate_limit: 900
      exclude:
      - name: "oomkilled-container"
        namespace: "default"
  actions:
  - oomkilled_container_graph_enricher:
      resource_type: Memory
- triggers:
  - on_job_failure:
      namespace_prefix: robusta
  actions:
  - create_finding:
      title: "Job $name on namespace $namespace failed"
      aggregation_key: "Job Failure"
  - job_events_enricher: {}
runner:
  sendAdditionalTelemetry: false
  image: registry.cn-zhangjiakou.aliyuncs.com/marksugar-k8s/robusta-runner:0.10.10
  imagePullPolicy: IfNotPresent
kubewatch:
  image: registry.cn-zhangjiakou.aliyuncs.com/marksugar-k8s/kubewatch:v2.0
  imagePullPolicy: IfNotPresent

Now install with Helm:

$ helm repo add robusta https://robusta-charts.storage.googleapis.com && helm repo update
$ helm upgrade --install robusta --namespace robusta  --create-namespace  robusta/robusta -f ./generated_values.yaml \
--set clusterName=test


$ helm upgrade --install robusta --namespace robusta  robusta/robusta -f ./generated_values.yaml  --set clusterName=test --dry-run 


$ helm upgrade --install robusta --namespace robusta  --create-namespace  robusta/robusta -f ./generated_values.yaml \
> --set clusterName=test
Release "robusta" does not exist. Installing it now.
NAME: robusta
LAST DEPLOYED: Thu Feb  2 15:58:32 2023
NAMESPACE: robusta
STATUS: deployed
Thank you for installing Robusta 0.10.10

As an open source project, we collect general usage statistics. This data is extremely limited and contains only general metadata to help us understand usage patterns. If you are willing to share additional data, please do so! It really help us improve Robusta.

You can set sendAdditionalTelemetry: true as a Helm value to send exception reports and additional data. This is disabled by default.

To opt-out of telemetry entirely, set a ENABLE_TELEMETRY=false environment variable on the robusta-runner deployment.

Visit the web UI at: https://platform.robusta.dev/

Wait for the pods to become ready:

$ kubectl -n robusta get pod -w
NAME                                 READY   STATUS              RESTARTS   AGE
robusta-forwarder-78964b4455-vnt77   1/1     Running             0          2m55s
robusta-runner-758cf9c986-87l4x      0/1     ContainerCreating   0          2m55s
robusta-runner-758cf9c986-87l4x      1/1     Running             0          7m6s
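Watching the list by hand works, but the same readiness check can be scripted. The sketch below is an illustration under assumptions: `pods_ready` is a hypothetical helper that reads the output of `kubectl get pod --no-headers` on stdin and succeeds only when every listed pod shows a full READY count and a Running status.

```shell
#!/bin/sh
# pods_ready: read `kubectl get pod --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every pod has
# READY equal to N/N and STATUS equal to Running.
pods_ready() {
    awk '
        NF == 0 { next }                              # ignore blank lines
        { split($2, r, "/"); n++ }                    # READY column, e.g. 1/1
        r[1] != r[2] || $3 != "Running" { bad++ }     # not fully ready
        END { exit (n > 0 && bad == 0) ? 0 : 1 }
    '
}

# Usage (requires cluster access): poll every 5 seconds until ready.
#   until kubectl -n robusta get pod --no-headers | pods_ready; do sleep 5; done
```

A built-in alternative is `kubectl -n robusta wait --for=condition=Ready pod --all --timeout=600s`, which blocks until the pods report Ready or the timeout expires.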

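From this point on, when a pod in the cluster crashes in an abnormal state, its logs are sent to Slack before the pod is deleted. Slack is already receiving these log messages.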
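If no pod happens to be failing, one way to exercise the OOM alert is to run a pod that deliberately exceeds its memory limit. This is a sketch under assumptions rather than part of the original setup: the helper `oom_demo_manifest`, the pod name `oom-demo`, the `polinux/stress` image, and the sizes are all illustrative. The name is chosen so that it does not match the `oomkilled-pod` exclusion configured earlier.

```shell
#!/bin/sh
# oom_demo_manifest: print a Pod manifest whose container tries to allocate
# about 250M while limited to 100Mi, so it should be OOM-killed.
# Pod name, image, and sizes are illustrative assumptions.
oom_demo_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
EOF
}

# Usage (requires cluster access):
#   oom_demo_manifest | kubectl apply -f -
#   kubectl -n default get pod oom-demo -w
#   kubectl -n default delete pod oom-demo
```

Once the container is killed, a notification should arrive in the configured Slack channel, subject to the `rate_limit` set on the trigger.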


