I'm experimenting with processes and signals on Linux. Below is a simple test I wrote in C:
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <mqueue.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h> // for int32_t

void work(void);

int main(void) {
    pid_t children[10];
    for(size_t i = 0; i < 10; i++) {
        pid_t pid = fork();
        if(pid == -1) {
            perror("parent: error forking");
            return EXIT_FAILURE;
        }
        if(pid == 0) {
            raise(SIGSTOP); // child stops itself
            work(); // after resuming it goes on to execute work()
            return EXIT_SUCCESS; // and finally, it successfully terminates
        } else {
            fprintf(stdout, "parent: spawned child (%d)\n", pid);
            children[i] = pid;
        }
    }
    // parent spawned all 10 children who are now stopped - begin resuming them one by one
    for(size_t i = 0; i < 10; i++) {
        fprintf(stdout, "parent: signaling child (%d) to continue...\n", children[i]);
        if(kill(children[i], SIGCONT) == -1) {
            fprintf(stderr, "parent: error signalling child (%d) to continue: %s\n", children[i], strerror(errno));
        }
    }
    return EXIT_SUCCESS; // exit from parent once all children have been resumed
}

void work(void) {
    pid_t mypid = getpid();
    srand(mypid);
    int32_t sleep_time = (rand() % 10) + 1;
    fprintf(stdout, "(%d): began sleeping for %d seconds\n", mypid, sleep_time);
    sleep(sleep_time);
    fprintf(stdout, "(%d): done sleeping after %d seconds\n", mypid, sleep_time);
}

The idea is as follows:
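The parent process spawns 10 children, and each child sends itself a SIGSTOP immediately after being spawned. Once the parent has successfully spawned all 10 processes, it immediately starts sending a SIGCONT to each of them.
Once a child is resumed, it executes work() (which just suspends execution for a random 1 to 10 seconds and prints messages to stdout), and then it terminates successfully.
This is what a successful run looks like: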
[I] bogdan in ~/dev/mserve
>> ./prog
parent: spawned child (138655)
parent: spawned child (138656)
parent: spawned child (138657)
parent: spawned child (138658)
parent: spawned child (138659)
parent: spawned child (138660)
parent: spawned child (138661)
parent: spawned child (138662)
parent: spawned child (138663)
parent: spawned child (138664)
parent: signaling child (138655) to continue...
(138655): began sleeping for 9 seconds
parent: signaling child (138656) to continue...
(138656): began sleeping for 3 seconds
parent: signaling child (138657) to continue...
parent: signaling child (138658) to continue...
parent: signaling child (138659) to continue...
parent: signaling child (138660) to continue...
parent: signaling child (138661) to continue...
parent: signaling child (138662) to continue...
parent: signaling child (138663) to continue...
parent: signaling child (138664) to continue...
(138659): began sleeping for 4 seconds
(138657): began sleeping for 5 seconds
(138658): began sleeping for 7 seconds
(138660): began sleeping for 10 seconds
(138663): began sleeping for 3 seconds
(138662): began sleeping for 7 seconds
(138664): began sleeping for 7 seconds
(138661): began sleeping for 2 seconds
[I] bogdan in ~/dev/mserve
(138661): done sleeping after 2 seconds
(138656): done sleeping after 3 seconds
(138663): done sleeping after 3 seconds
(138659): done sleeping after 4 seconds
(138657): done sleeping after 5 seconds
(138658): done sleeping after 7 seconds
(138662): done sleeping after 7 seconds
(138664): done sleeping after 7 seconds
(138655): done sleeping after 9 seconds
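(138660): done sleeping after 10 seconds

As the messages show, all 10 processes finished sleeping and terminated.

The problem

Roughly one run in three, one or more of the 10 children, seemingly at random, gets "stuck" and does not resume after its SIGSTOP. The kill(2) call with which the parent sends SIGCONT succeeds, but the process(es) remain suspended.
The output then looks like this: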
[I] bogdan in ~/dev/mserve
> ./alt
parent: spawned child (139369)
parent: spawned child (139370)
parent: spawned child (139371)
parent: spawned child (139372)
parent: spawned child (139373)
parent: spawned child (139374)
parent: spawned child (139375)
parent: spawned child (139376)
parent: spawned child (139377)
parent: spawned child (139378)
parent: signaling child (139369) to continue...
parent: signaling child (139370) to continue...
parent: signaling child (139371) to continue...
parent: signaling child (139372) to continue...
parent: signaling child (139373) to continue...
parent: signaling child (139374) to continue...
parent: signaling child (139375) to continue...
parent: signaling child (139376) to continue...
parent: signaling child (139377) to continue...
parent: signaling child (139378) to continue...
(139371): began sleeping for 4 seconds
(139369): began sleeping for 8 seconds
(139373): began sleeping for 7 seconds
(139370): began sleeping for 3 seconds
(139375): began sleeping for 9 seconds
(139372): began sleeping for 7 seconds
(139374): began sleeping for 10 seconds
(139376): began sleeping for 8 seconds
(139377): began sleeping for 7 seconds
[I] bogdan in ~/dev/mserve
(139370): done sleeping after 3 seconds
(139371): done sleeping after 4 seconds
(139373): done sleeping after 7 seconds
(139372): done sleeping after 7 seconds
(139377): done sleeping after 7 seconds
(139369): done sleeping after 8 seconds
(139376): done sleeping after 8 seconds
(139375): done sleeping after 9 seconds
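(139374): done sleeping after 10 seconds

This time only 9 processes completed successfully (only 9 "done sleeping" messages were printed).
By running $ ps au in the shell, I can see the "stuck" process (note the T state):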
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
bogdan 139378 0.0 0.0 2312 80 pts/3 T 20:31 0:00 ./prog

I can even signal it to continue from my shell:
$ kill -SIGCONT 139378
(139378): began sleeping for 5 seconds
...
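(139378): done sleeping after 5 seconds

Another strange detail

When I run the parent under strace (e.g. $ strace ./program), the problem never occurs, and all 10 processes are resumed correctly 100% of the time. I can only observe the problem when I run the parent program directly from the shell.
I have gone through the signal(7) manual page several times, but I cannot figure out why this happens.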
Posted on 2020-04-25 19:06:26
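Most likely, the failure to resume happens when the parent delivers SIGCONT to a child that has not yet stopped itself. Such a signal has no effect, because the process is not stopped at the time it is handled. The child then stops itself afterwards, and nothing is left to wake it up.
Nothing in your program prevents this from happening. Instead, you are simply relying on the children stopping faster than the parent can signal them, which is a race condition. The fact that you can resume a stuck process by sending it an (additional) SIGCONT is consistent with this diagnosis, and it is plausible that strace changes the timing enough that the children always win their races.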
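One way to close that race, sketched here under assumptions rather than taken from the question, is to have the parent confirm that a child has actually stopped before sending SIGCONT. On Linux and other POSIX systems, waitpid() with WUNTRACED also reports stopped children, so the parent can block on it. The helper names spawn_stopped_child, wait_until_stopped and resume_and_reap are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that stops itself and, once resumed,
 * exits with exit_code. Returns the child's pid, or -1 if fork() fails. */
pid_t spawn_stopped_child(int exit_code) {
    pid_t pid = fork();
    if (pid == 0) {
        raise(SIGSTOP);     /* child stops itself, as in the question */
        _exit(exit_code);   /* reached only after a SIGCONT arrives */
    }
    return pid;
}

/* Hypothetical helper: block until the child has really stopped.
 * WUNTRACED makes waitpid() report stopped children, not only dead ones. */
bool wait_until_stopped(pid_t pid) {
    assert(pid > 0);
    int status;
    if (waitpid(pid, &status, WUNTRACED) == -1)
        return false;
    return WIFSTOPPED(status);
}

/* Resume the child only after confirming it is stopped, then reap it.
 * Returns the child's exit status, or -1 on any failure. */
int resume_and_reap(pid_t pid) {
    if (!wait_until_stopped(pid))
        return -1;
    if (kill(pid, SIGCONT) == -1)
        return -1;
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the program from the question, the equivalent change would be to call something like wait_until_stopped(children[i]) right before kill(children[i], SIGCONT). Note that resume_and_reap waits for each child to exit before moving on, which keeps the sketch short; to let all children sleep concurrently you would resume all of them first and reap them in a second loop.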
Posted on 2020-04-25 19:06:14
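It looks like there is a race condition between the child sending itself SIGSTOP and the parent sending SIGCONT. Sometimes the parent sends SIGCONT before the child has sent itself SIGSTOP, so the child stops afterwards and stays suspended.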
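The losing side of that race can be reproduced on purpose. The following is a minimal sketch, assuming Linux/POSIX default signal dispositions: a pipe forces the ordering so that the parent's SIGCONT is always sent before the child calls raise(SIGSTOP). The function name early_sigcont_is_lost is hypothetical and used only for this illustration.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: send SIGCONT before the child has stopped itself,
 * then check whether the child ends up stopped anyway.
 * Returns 1 if the child was left stopped, 0 if not, -1 on error. */
int early_sigcont_is_lost(void) {
    int ready[2];
    if (pipe(ready) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (pid == 0) {
        char byte;
        close(ready[1]);
        /* Block until the parent closes its write end (read sees EOF),
         * which the parent does only after it has sent SIGCONT. */
        while (read(ready[0], &byte, 1) == -1 && errno == EINTR) {
        }
        raise(SIGSTOP);   /* stops now; the earlier SIGCONT is long gone */
        _exit(0);
    }

    close(ready[0]);
    if (kill(pid, SIGCONT) == -1) {   /* child is running: nothing to resume */
        close(ready[1]);
        return -1;
    }
    close(ready[1]);                  /* let the child go on to SIGSTOP */

    int status;
    if (waitpid(pid, &status, WUNTRACED) == -1)
        return -1;
    int stuck = WIFSTOPPED(status) ? 1 : 0;

    /* Clean up: SIGKILL also terminates a stopped process; then reap it. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return stuck;
}
```

With this forced ordering the child should end up in the same T state that ps shows in the question, because a SIGCONT sent to a process that is not stopped is not remembered for later.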
https://stackoverflow.com/questions/61430874