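For over two weeks now, I have been observing my RDS instance (PostgreSQL 10.6 on a db.t3.small) showing daily 2-hour CPU peaks during business hours, together with increased read and write latency, which leads to poor responsiveness or timeouts in my application.
I did investigate (see below). At this point I am fairly convinced that these peaks, which affect my users, are not caused by my own usage, and I tend to think they are caused either by rogue RDS admin tasks or by some PostgreSQL issue.
Has anybody endured and solved a similar issue with PostgreSQL? Can somebody help me investigate the RDS admin tasks side? Or point me to other avenues to get to the bottom of this?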
What I observed:
What I investigated:
Here are the basic logs around the start of the peak (before statement logging was activated):
2019-12-09 15:04:05 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:04:05 UTC::@:[4221]:LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.202 s, sync=0.001 s, total=0.213 s; sync files=2, longest=0.001 s, average=0.000 s; distance=16369 kB, estimate=16395 kB
2019-12-09 15:09:05 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:09:05 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.001 s, total=0.112 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16394 kB
2019-12-09 15:14:05 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:14:05 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.113 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16393 kB
2019-12-09 15:19:06 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:19:06 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.001 s, total=0.113 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16392 kB
[CPU PEAK STARTS here that day, at 16:20 UTC+1]
2019-12-09 15:24:06 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:24:06 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.114 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16391 kB
2019-12-09 15:29:06 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:29:06 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.101 s, sync=0.002 s, total=0.113 s; sync files=1, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16390 kB
2019-12-09 15:34:06 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:34:06 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.103 s, sync=0.002 s, total=0.118 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16390 kB
2019-12-09 15:39:06 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:39:06 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.104 s, sync=0.003 s, total=0.127 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16389 kB
2019-12-09 15:44:06 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:44:06 UTC::@:[4221]:LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.219 s, sync=0.010 s, total=0.303 s; sync files=2, longest=0.010 s, average=0.005 s; distance=16392 kB, estimate=16392 kB
2019-12-09 15:49:07 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:49:09 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.318 s, sync=0.516 s, total=2.426 s; sync files=1, longest=0.516 s, average=0.516 s; distance=16375 kB, estimate=16390 kB
2019-12-09 15:54:07 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:54:09 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.367 s, sync=1.230 s, total=2.043 s; sync files=1, longest=1.230 s, average=1.230 s; distance=16384 kB, estimate=16389 kB
2019-12-09 15:59:07 UTC::@:[4221]:LOG: checkpoint starting: time
2019-12-09 15:59:08 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.139 s, sync=0.195 s, total=1.124 s; sync files=1, longest=0.195 s, average=0.195 s; distance=16383 kB, estimate=16389 kB
[Figures: CPU around one peak; CPU over a week; read latency around the peak; write latency around the peak; Performance Insights around the Dec 10 peak; Performance Insights around the Dec 9 peak]
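The checkpoint lines above write only one or two buffers each, yet their `sync` and `total` times grow from about 0.1 s to more than 2 s during the peak. A minimal sketch for pulling those timings out of such a log follows; the function names, the 1-second threshold, and the two-line `SAMPLE` (shortened from the log above) are illustrative assumptions, not part of any RDS or PostgreSQL tooling.

```python
import re

# Matches the "checkpoint complete" lines in the log format shown above.
CHECKPOINT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC.*checkpoint complete:.*?"
    r"write=(?P<write>[\d.]+) s, sync=(?P<sync>[\d.]+) s, total=(?P<total>[\d.]+) s"
)

def parse_checkpoints(lines):
    """Return (timestamp, write_s, sync_s, total_s) for each completed checkpoint."""
    rows = []
    for line in lines:
        m = CHECKPOINT_RE.search(line)
        if m:
            rows.append((m["ts"], float(m["write"]), float(m["sync"]), float(m["total"])))
    return rows

def slow_checkpoints(rows, total_threshold_s=1.0):
    """Keep checkpoints whose total time reaches the (hypothetical) threshold."""
    return [row for row in rows if row[3] >= total_threshold_s]

# Two lines shortened from the log above, one before and one during the peak.
SAMPLE = [
    "2019-12-09 15:19:06 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); "
    "write=0.101 s, sync=0.001 s, total=0.113 s; sync files=1",
    "2019-12-09 15:49:09 UTC::@:[4221]:LOG: checkpoint complete: wrote 1 buffers (0.0%); "
    "write=0.318 s, sync=0.516 s, total=2.426 s; sync files=1",
]

if __name__ == "__main__":
    for ts, write_s, sync_s, total_s in slow_checkpoints(parse_checkpoints(SAMPLE)):
        print(ts, "write", write_s, "sync", sync_s, "total", total_s)
```

Slow syncs on near-empty checkpoints would suggest the storage itself is slow to respond at those times, rather than PostgreSQL writing more data.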
Posted on 2019-12-14 08:19:57
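This could be caused by PostgreSQL background processes exhausting the burst credits on the disk. If I remember correctly, all disks on RDS are of type gp2. That means you have a certain baseline IOPS plus credits that you can spend to exceed that baseline for a short time. You should be able to see this effect in the queue depth metric on the monitoring page: if this is what is happening, you should see a spike in that number. The simplest solution is to increase the disk size.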
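To illustrate why a larger disk helps, here is a rough sketch of the gp2 arithmetic as documented by AWS for EBS: a baseline of 3 IOPS per GiB (at least 100, at most 16,000), bursting up to 3,000 IOPS from a bucket of 5.4 million I/O credits. The function names and the 20 GiB and 100 GiB sizes are assumptions for illustration; the question does not state the actual volume size.

```python
import math

# gp2 figures as documented by AWS for EBS General Purpose SSD (gp2) volumes.
GP2_IOPS_PER_GIB = 3
GP2_MIN_BASELINE_IOPS = 100
GP2_MAX_BASELINE_IOPS = 16_000
GP2_BURST_IOPS = 3_000
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full bucket

def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return min(max(GP2_IOPS_PER_GIB * size_gib, GP2_MIN_BASELINE_IOPS),
               GP2_MAX_BASELINE_IOPS)

def gp2_burst_seconds(size_gib, demand_iops=GP2_BURST_IOPS):
    """Seconds a full credit bucket lasts under a sustained demand (capped at burst)."""
    baseline = gp2_baseline_iops(size_gib)
    demand = min(demand_iops, GP2_BURST_IOPS)
    if demand <= baseline:
        return math.inf  # demand is covered by the baseline, no credits are spent
    return GP2_CREDIT_BUCKET / (demand - baseline)

if __name__ == "__main__":
    for size in (20, 100):  # hypothetical volume sizes in GiB
        print(size, "GiB:", gp2_baseline_iops(size), "baseline IOPS,",
              round(gp2_burst_seconds(size)), "s of full burst")
```

Under these assumptions a small volume sits at the 100 IOPS floor, so once the bucket is empty any background work above that rate queues up, which would show as a queue depth spike. Growing the volume raises the baseline and slows the rate at which credits drain.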
https://stackoverflow.com/questions/59284005