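I have a microservice written in Haskell. It uses Scotty, built against LTS 13.20. OS: Linux 3.10.0-957.el7.x86_64, running under Kubernetes. The service ran for about half a year without problems, but now I have hit several mysterious freezes. I don't think this is a regression, because the code has not been modified, but the load on the service has increased.
The symptoms are as follows: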
strace reports that the process is stuck in a futex(...) call:
futex(0x349c9c4, FUTEX_WAIT_PRIVATE, 83, NULL
ps output:
F S UID PID SPID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 1 0 0 80 0 - 274 do_sig ? 00:00:00 myinit
4 S 0 6 6 1 0 80 0 - 2922 do_wai ? 00:00:00 rc.init
4 S 0 32 32 0 0 80 0 - 2955 do_wai pts/0 00:00:00 bash
4 S 0 4464 4464 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service
1 S 0 4464 4465 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4466 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4467 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4468 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4469 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4470 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4471 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4472 6 0 80 0 - 268854048 timerf ? 00:00:00 ghc_ticker
1 S 0 4464 4473 6 0 80 0 - 268854048 ep_pol ? 00:00:00 my-service:w
1 S 0 4464 4474 6 0 80 0 - 268854048 poll_s ? 00:00:00 my-service:w
1 S 0 4464 4475 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4476 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4477 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4478 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4479 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4480 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4481 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4482 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4483 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4484 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service
1 S 0 4464 4485 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service
1 S 0 4464 4486 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
1 S 0 4464 4487 6 0 80 0 - 268854048 futex_ ? 00:00:00 my-service:w
0 R 0 4511 4511 32 0 80 0 - 12405 - pts/0 00:00:00 ps
After attaching gdb to the PID:
[New LWP 4487]
[New LWP 4486]
[New LWP 4485]
[New LWP 4484]
[New LWP 4483]
[New LWP 4482]
[New LWP 4481]
[New LWP 4480]
[New LWP 4479]
[New LWP 4478]
[New LWP 4477]
[New LWP 4476]
[New LWP 4475]
[New LWP 4474]
[New LWP 4473]
[New LWP 4472]
[New LWP 4471]
[New LWP 4470]
[New LWP 4469]
[New LWP 4468]
[New LWP 4467]
[New LWP 4466]
[New LWP 4465]
....
(gdb) bt full
#0 0x00007fc03ec23965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00000000015185e9 in waitCondition (pCond=pCond@entry=0x2f029c0, pMut=pMut@entry=0x2f029f0) at rts/posix/OSThreads.c:117
No locals.
#2 0x000000000150713b in waitForWorkerCapability (task=<optimized out>) at rts/Capability.c:651
cap = <optimized out>
#3 yieldCapability (pCap=pCap@entry=0x7fffc6ae0a78, task=task@entry=0x2f029b0, gcAllowed=gcAllowed@entry=true) at rts/Capability.c:888
cap = <optimized out>
#4 0x0000000001504d85 in scheduleYield (task=0x2f029b0, pcap=0x7fffc6ae0a70) at rts/Schedule.c:672
cap = 0x2e7cff0
didGcLast = <optimized out>
#5 schedule (initialCapability=initialCapability@entry=0x2edf1b0, task=task@entry=0x2f029b0) at rts/Schedule.c:292
t = <optimized out>
cap = 0x2e7cff0
ret = <optimized out>
prev_what_next = <optimized out>
ready_to_gc = <optimized out>
#6 0x0000000001505bee in scheduleWaitThread (tso=0x4200823388, ret=ret@entry=0x0, pcap=pcap@entry=0x7fffc6ae0b08) at rts/Schedule.c:2533
task = 0x2f029b0
cap = 0x2edf1b0
#7 0x0000000001500584 in rts_evalLazyIO (cap=cap@entry=0x7fffc6ae0b08, p=p@entry=0x15a00d0, ret=ret@entry=0x0) at rts/RtsAPI.c:530
tso = <optimized out>
#8 0x00000000015102be in hs_main (argc=2, argv=0x7fffc6ae0cf8, main_closure=0x15a00d0, rts_config=...) at rts/RtsMain.c:72
cap = 0x2edf1b0
exit_status = <optimized out>
status = <optimized out>
#9 0x00000000004311b0 in main ()
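No symbol table info available.
So, my question is: how can I fix or investigate this, and what is best to try or check? I have considered switching to a newer LTS, but I am not sure that the LTS is the cause (I found similar reports on web forums with various LTS/GHC versions). IMHO it looks like a bug in the RTS.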
Posted on 2020-03-14 09:51:32
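Normally, a worker waits on the condition in waitForWorkerCapability when there is no work to do. Typically that means all Haskell threads are blocked on I/O, so there is nothing to run. The condition is signaled in giveCapabilityToTask, which is called from a few places in the same file (rts/Capability.c).
If you are sure there should be work to do, then you may have found a bug in the RTS. Try to provide a minimal example that reproduces the problem. (I know, that is often impossible.)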
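However, the bug may also be in your code or in one of your dependencies. You can try inspecting the capabilities in gdb. (There is a global array of them.) You may be interested in the running_task, run_queue_hd, suspended_ccalls, spare_workers, and returning_tasks_hd fields. If the process is legitimately idle, I would expect no running task and an empty run queue on every capability, with every worker on the spare_workers list of some capability.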
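As a rough sketch of that inspection, the gdb commands below print the fields listed above for the first capability. This assumes the RTS was built with debug info (the backtrace in the question shows source lines, which suggests it was) and that the global array is named capabilities with its length in n_capabilities, as in GHC's rts/Capability.c. Neither name appears in the post, so verify them against your GHC version.

```
# Assumed names: n_capabilities and capabilities (not given in the post).
print n_capabilities

# Dump the whole first capability, then the individual fields of interest.
print *capabilities[0]
print capabilities[0]->running_task
print capabilities[0]->run_queue_hd
print capabilities[0]->suspended_ccalls
print capabilities[0]->spare_workers
print capabilities[0]->returning_tasks_hd

# Repeat with capabilities[1], capabilities[2], ... up to n_capabilities - 1.

# Backtraces for every OS thread, not only the main one shown in the question.
thread apply all bt
```

Comparing the per-thread backtraces with the capability fields may show whether some task still holds a capability while the others wait for it.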
(This is just my understanding of the problem. I am not an expert on the GHC RTS, so this may be nonsense.)
https://stackoverflow.com/questions/60670801