
CrashLoopBackOff / OOMKilled problem

Stack Overflow user
Asked on 2020-08-22 21:03:42
Answers: 2 · Views: 6.1K · Followers: 0 · Score: 4

Periodically seeing the container state: OOMKilled (exit code: 137)

But the pod is scheduled onto a node with plenty of memory.

$ k get statefulset -n metrics 
NAME                      READY   AGE
prometheus                0/1     232d


$ k get po -n metrics
prometheus-0  1/2     CrashLoopBackOff   147        12h

$ k get events  -n metrics
LAST SEEN   TYPE      REASON    OBJECT             MESSAGE
10m         Normal    Pulled    pod/prometheus-0   Container image "prom/prometheus:v2.11.1" already present on machine
51s         Warning   BackOff   pod/prometheus-0   Back-off restarting failed container


$ k logs -f prometheus-0 -n metrics --all-containers=true

level=warn ts=2020-08-22T20:48:02.302Z caller=main.go:282 deprecation_notice="'storage.tsdb.retention' flag is deprecated use 'storage.tsdb.retention.time' instead."
level=info ts=2020-08-22T20:48:02.302Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
level=info ts=2020-08-22T20:48:02.302Z caller=main.go:330 build_context="(go=go1.12.7, user=root@d94406f2bb6f, date=20190710-13:51:17)"
level=info ts=2020-08-22T20:48:02.302Z caller=main.go:331 host_details="(Linux 4.14.186-146.268.amzn2.x86_64 #1 SMP Tue Jul 14 18:16:52 UTC 2020 x86_64 prometheus-0 (none))"
level=info ts=2020-08-22T20:48:02.302Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-08-22T20:48:02.303Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-08-22T20:48:02.307Z caller=main.go:652 msg="Starting TSDB ..."
level=info ts=2020-08-22T20:48:02.307Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-08-22T20:48:02.311Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1597968000000 maxt=1597975200000 ulid=01EG7FAW5PE9ARVHJNKW1SJXRK
level=info ts=2020-08-22T20:48:02.312Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1597975200000 maxt=1597982400000 ulid=01EG7P6KDPXPFVPSMBXBDF48FQ
level=info ts=2020-08-22T20:48:02.313Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1597982400000 maxt=1597989600000 ulid=01EG7X2ANPN30M8ET2S8EPGKEA
level=info ts=2020-08-22T20:48:02.314Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1597989600000 maxt=1597996800000 ulid=01EG83Y1XPXRWRRR2VQRNFB37F
level=info ts=2020-08-22T20:48:02.314Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1597996800000 maxt=1598004000000 ulid=01EG8ASS5P9J1TBZW2P4B2GV7P
level=info ts=2020-08-22T20:48:02.315Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598004000000 maxt=1598011200000 ulid=01EG8HNGDXMYRH0CGWNHKECCPR
level=info ts=2020-08-22T20:48:02.316Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598011200000 maxt=1598018400000 ulid=01EG8RH7NPHSC5PAGXCMN8K9HE
level=info ts=2020-08-22T20:48:02.317Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598018400000 maxt=1598025600000 ulid=01EG8ZCYXNABK8FD3ZGFSQ9NGQ
level=info ts=2020-08-22T20:48:02.317Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598025600000 maxt=1598032800000 ulid=01EG968P5T7SJTVDCZGN6D5YW2
level=info ts=2020-08-22T20:48:02.317Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598032800000 maxt=1598040000000 ulid=01EG9D4DDPR9SE62C0XNE0Z64C
level=info ts=2020-08-22T20:48:02.318Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598040000000 maxt=1598047200000 ulid=01EG9M04NYMAFACVCMDD2RF11W
level=info ts=2020-08-22T20:48:02.319Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598047200000 maxt=1598054400000 ulid=01EG9TVVXNJ7VCDXQNNK2BTZAE
level=info ts=2020-08-22T20:48:02.320Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1598054400000 maxt=1598061600000 ulid=01EGA1QK5PHHZ6P6TNPHDWSD81

$ k describe statefulset prometheus -n metrics
Name:               prometheus
Namespace:          metrics
CreationTimestamp:  Fri, 03 Jan 2020 04:33:58 -0800
Selector:           app=prometheus
Labels:             <none>
Annotations:        <none>
Replicas:           1 desired | 1 total
Update Strategy:    RollingUpdate
  Partition:        824644121032
Pods Status:        1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=prometheus
  Annotations:      checksum/config: 6982e2d83da89ab6fa57e1c2c8a217bb5c1f5abe13052a171cd8d5e238a40646
  Service Account:  prometheus
  Containers:
   prometheus-configmap-reloader:
    Image:      jimmidyson/configmap-reload:v0.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --volume-dir=/etc/prometheus
      --webhook-url=http://localhost:9090/-/reload
    Environment:  <none>
    Mounts:
      /etc/prometheus from prometheus (ro)
   prometheus:
    Image:      prom/prometheus:v2.11.1
    Port:       9090/TCP
    Host Port:  0/TCP
    Args:
      --config.file=/etc/prometheus/prometheus.yml
      --web.enable-lifecycle
      --web.enable-admin-api
      --storage.tsdb.path=/prometheus/data
      --storage.tsdb.retention=1d
    Limits:
      memory:     1Gi
    Liveness:     http-get http://:9090/-/healthy delay=180s timeout=1s period=120s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/prometheus from prometheus (rw)
      /etc/prometheus-alert-rules from prometheus-alert-rules (rw)
      /prometheus/data from prometheus-data-storage (rw)
  Volumes:
   prometheus:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus
    Optional:  false
   prometheus-alert-rules:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-alert-rules
    Optional:  false
Volume Claims:
  Name:          prometheus-data-storage
  StorageClass:  prometheus
  Labels:        <none>
  Annotations:   <none>
  Capacity:      20Gi
  Access Modes:  [ReadWriteOnce]
Events:          <none>

What could be the reason?
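For reference, the last terminated state and exit code of each container can be confirmed directly on the pod. A minimal sketch, assuming the pod and namespace names from the output above (the jsonpath formatting is illustrative):

$ kubectl -n metrics get pod prometheus-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exitCode="}{.lastState.terminated.exitCode}{"\n"}{end}'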


2 Answers

Stack Overflow user

Accepted answer

Posted on 2020-08-23 16:56:16

"Periodically seeing the container state: OOMKilled (exit code: 137), but it's scheduled onto a node with plenty of memory."

As you have probably already seen, you are clearly hitting the configured 1Gi memory limit. The answer likely lies in how you are using Prometheus and what is driving its usage past 1Gi. Some things you can look at (a rough way to measure them is sketched after this list):

  • Number of time series
  • Average number of labels per time series
  • Number of unique label pairs
  • Scrape interval
  • Bytes per sample
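A minimal way to gauge these numbers is to query the TSDB metrics Prometheus exports about itself. This sketch assumes you can port-forward to the pod (names as in the question); the metric names are the standard ones exported by Prometheus 2.x:

$ kubectl -n metrics port-forward pod/prometheus-0 9090:9090 &
# Active series currently in the head block:
$ curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'
# Ingestion rate (samples per second over the last 5 minutes):
$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])'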

You can find a memory calculator for the usage above here.

✌️

Score 5

Stack Overflow user

Posted on 2022-04-27 09:03:50

A 1Gi memory limit is very low for Prometheus in a Kubernetes monitoring setup, where millions of metrics are collected from thousands of targets (pods, nodes, endpoints, etc.).

It is recommended to raise the memory limit for the Prometheus pod until it stops crashing with out-of-memory errors.
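As a sketch, the limit on the StatefulSet from the question could be raised like this; container index 1 corresponds to the prometheus container in the describe output above, and 4Gi is an arbitrary starting point, not a recommendation:

$ kubectl -n metrics patch statefulset prometheus --type=json -p='[
    {"op": "replace",
     "path": "/spec/template/spec/containers/1/resources/limits/memory",
     "value": "4Gi"}
  ]'

The StatefulSet's RollingUpdate strategy will then recreate the pod with the new limit.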

It is also recommended to set up monitoring of Prometheus itself. It exports its own metrics at the http://prometheus-host:9090/metrics URL; see http://demo.robustperception.io:9090/metrics for an example.
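For a quick look at its own memory usage (again assuming a port-forward to the pod, as sketched above), the standard Go process metrics can be grepped straight from the /metrics endpoint:

$ curl -s http://localhost:9090/metrics \
    | grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes)'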

Prometheus memory usage can be reduced by scraping fewer targets and fewer metrics per target (for example, by dropping unneeded series with relabeling rules), and by increasing the scrape interval so samples are collected less frequently.
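For example, the global scrape interval lives in the prometheus.yml held in the ConfigMap named "prometheus" that the StatefulSet above mounts; a sketch of the relevant change (60s is illustrative):

$ kubectl -n metrics edit configmap prometheus
# ...then raise the global scrape interval in prometheus.yml, e.g.:
#   global:
#     scrape_interval: 60s    # fewer samples per series -> smaller head block

The prometheus-configmap-reloader sidecar shown in the describe output watches /etc/prometheus and POSTs to the /-/reload webhook once the mounted file changes, so Prometheus picks up the new config without a restart.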

There are also other Prometheus-like solutions that can use less memory when scraping the same set of targets. See, for example, vmagent and VictoriaMetrics.

Score 1
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/63541085
