基本信息
Red Hat Enterprise Linux Server release 7.4 (Maipo)
component: systemd
Hardware: x86_64 Linux
[root@scvberpat01 log]# uname -r
3.10.0-693.21.1.el7.x86_64嘿,各位,我这个周末有一个相当大的问题,导致我们公司的生产服务器必须重新启动。
这一切都与systemd服务进入FALURE状态以及无法再次通过dbus/systemd启动loginmanager org.freedesktop.login1和org.freedesktop.systemd1有关。似乎有些我无法识别的东西已经成功终止了systemd-logind服务。我们正在运行一个重型生产系统,平均每天同时执行大约1700个任务/进程。Cron 和 incrons 正在使用中。此外,自重新启动以来,systemd 无法启动模块 org.freedesktop.systemd1,导致登录时间和通过 su 更改用户的时间显著延迟。
systemd version
[root@scvberpat01 tmp]# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN从systemd的FALURE状态开始的事件日志条目
Jul 14 05:09:40 scvberpat01 sshd[31966]: Accepted password for erp_monitoring from 172.20.0.63 port 58768 ssh2
Jul 14 05:09:40 scvberpat01 systemd-logind: New session 1135476 of user erp_monitoring.
Jul 14 05:09:40 scvberpat01 systemd: Started Session 1135476 of user erp_monitoring.
Jul 14 05:09:40 scvberpat01 systemd: Starting Session 1135476 of user erp_monitoring.
Jul 14 05:09:45 scvberpat01 sshd[32537]: Accepted password for erp_monitoring from 172.20.0.63 port 58801 ssh2
Jul 14 05:09:45 scvberpat01 systemd-logind: New session 1135477 of user erp_monitoring.
Jul 14 05:09:45 scvberpat01 systemd: Started Session 1135477 of user erp_monitoring.
Jul 14 05:09:45 scvberpat01 systemd: Starting Session 1135477 of user erp_monitoring.
Jul 14 05:10:01 scvberpat01 systemd: Started Session 1135478 of user root.
Jul 14 05:10:01 scvberpat01 systemd: Starting Session 1135478 of user root.
Jul 14 05:10:02 scvberpat01 systemd: Started Session 1135479 of user paseb.
Jul 14 05:10:02 scvberpat01 systemd: Starting Session 1135479 of user paseb.
Jul 14 05:10:02 scvberpat01 systemd: Started Session 1135480 of user root.
Jul 14 05:10:02 scvberpat01 systemd: Starting Session 1135480 of user root.
Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135482 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135482 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135481 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135481 of user paseb.
Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135484 of user batchparg.
Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135484 of user batchparg.
Jul 14 05:10:04 scvberpat01 systemd: Started Session 1135485 of user batchparg.
Jul 14 05:10:04 scvberpat01 systemd: Starting Session 1135485 of user batchparg.
Jul 14 05:10:05 scvberpat01 systemd: Started Session 1135483 of user batchparg.
Jul 14 05:10:05 scvberpat01 systemd: Starting Session 1135483 of user batchparg.
Jul 14 05:10:06 scvberpat01 systemd: Started Session 1135486 of user batchparg.
Jul 14 05:10:06 scvberpat01 systemd: Starting Session 1135486 of user batchparg.
Jul 14 05:10:07 scvberpat01 systemd: Started Session 1135487 of user batchparg.
Jul 14 05:10:07 scvberpat01 systemd: Starting Session 1135487 of user batchparg.
Jul 14 05:10:08 scvberpat01 systemd: Started Session 1135488 of user batchparg.
Jul 14 05:10:08 scvberpat01 systemd: Starting Session 1135488 of user batchparg.
Jul 14 05:10:10 scvberpat01 systemd: Started Session 1135489 of user batchparg.
Jul 14 05:10:10 scvberpat01 systemd: Starting Session 1135489 of user batchparg.
Jul 14 05:10:11 scvberpat01 systemd: Started Session 1135490 of user batchparg.
Jul 14 05:10:11 scvberpat01 systemd: Starting Session 1135490 of user batchparg.
Jul 14 05:10:12 scvberpat01 systemd: Started Session 1135491 of user batchparg.
Jul 14 05:10:12 scvberpat01 systemd: Starting Session 1135491 of user batchparg.
Jul 14 05:10:13 scvberpat01 systemd: systemd-logind.service has no holdoff time, scheduling restart.
Jul 14 05:10:13 scvberpat01 systemd: Starting Login Service...
Jul 14 05:10:13 scvberpat01 systemd: Started Login Service.
Jul 14 05:10:13 scvberpat01 systemd-logind: New seat seat0.
Jul 14 05:10:13 scvberpat01 systemd-logind: Failed to read /run/systemd/users/11469: Argument list too long
Jul 14 05:10:13 scvberpat01 systemd-logind: Failed to read /run/systemd/users/0: Argument list too long
Jul 14 05:10:13 scvberpat01 systemd-logind: User enumeration failed: Argument list too long
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to start user slice user-11469.slice, ignoring: No buffer space available ((null))
Jul 14 05:11:59 scvberpat01 systemd-logind: Failed to start user slice user-0.slice, ignoring: No buffer space available ((null))
Jul 14 05:13:13 scvberpat01 systemd: systemd-logind.service watchdog timeout (limit 3min)!
Jul 14 05:13:13 scvberpat01 abrt-hook-ccpp: Process 2577 (systemd-logind) of user 0 killed by SIGABRT - dumping core
Jul 14 05:13:16 scvberpat01 ModemManager[1041]: [sleep-monitor] inhibit failed: GDBus.Error:org.freedesktop.DBus.Error.NoReply: Message did not receive a reply (timeout by message b
us)
Jul 14 05:13:16 scvberpat01 systemd: systemd-logind.service: main process exited, code=dumped, status=6/ABRT
Jul 14 05:13:16 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
Jul 14 05:13:16 scvberpat01 systemd: systemd-logind.service failed.
Jul 14 05:13:21 scvberpat01 kernel: nr_pdflush_threads exported in /proc is scheduled for removal
Jul 14 05:13:40 scvberpat01 dbus[1028]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 14 05:13:41 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
Jul 14 05:13:41 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
Jul 14 05:13:41 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 14 05:13:41 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
Jul 14 05:13:41 scvberpat01 systemd: Failed to start Login Service.
Jul 14 05:13:41 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
Jul 14 05:13:41 scvberpat01 systemd: systemd-logind.service failed.
Jul 14 05:14:05 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 14 05:14:06 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
Jul 14 05:14:06 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
Jul 14 05:14:06 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 14 05:14:06 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
Jul 14 05:14:06 scvberpat01 systemd: Failed to start Login Service.
Jul 14 05:14:06 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
Jul 14 05:14:06 scvberpat01 systemd: systemd-logind.service failed.
Jul 14 05:14:31 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
Jul 14 05:14:31 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 14 05:14:31 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
Jul 14 05:14:31 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
...
...这种情况持续了一天半,直到严重影响到服务器。最后 ssh、samba 以及终端都不再响应。不幸的是,它之前没有被识别,因此进行了硬重置。
Jul 18 13:21:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
Jul 18 13:21:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33081.scope: Connection timed out
Jul 18 13:22:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
Jul 18 13:22:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33086.scope: Activation of org.freedesktop.systemd1 timed out
Jul 18 13:23:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
Jul 18 13:23:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33092.scope: Activation of org.freedesktop.systemd1 timed out
Jul 18 13:24:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Connection timed out ((null))
Jul 18 13:24:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33097.scope: Activation of org.freedesktop.systemd1 timed out
Jul 18 13:25:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Connection timed out ((null))
Jul 18 13:25:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33102.scope: Connection timed out[root@scvberpat01 log]# busctl --list
NAME PID PROCESS USER CONNECTION UNIT SESSION DESCRIPTION
:1.0 1023 systemd-logind root :1.0 systemd-logind.service - -
:1.10 1076 NetworkManager root :1.10 NetworkManager.service - -
:1.2 1019 avahi-daemon avahi :1.2 avahi-daemon.service - -
:1.24 1348 cupsd root :1.24 cups.service - -
:1.25 1342 tuned root :1.25 tuned.service - -
:1.26 1583 colord colord :1.26 colord.service - -
:1.27 1348 cupsd root :1.27 cups.service - -
:1.28 2268 libvirtd root :1.28 libvirtd.service - -
:1.3 1030 rtkit-daemon root :1.3 rtkit-daemon.service - -
:1.34123 15157 abrt-dbus root :1.34123 dbus.service - -
:1.34132 9585 busctl root :1.34132 sshd.service - -
:1.384 4756 packagekitd root :1.384 packagekit.service - -
:1.4 1029 ModemManager root :1.4 ModemManager.service - -
:1.5 1020 polkitd polkitd :1.5 polkit.service - -
:1.6 1088 accounts-daemon root :1.6 accounts-daemon.service - -
:1.7 1076 NetworkManager root :1.7 NetworkManager.service - -
com.redhat.RHSM1 - - - (activatable) - -
com.redhat.RHSM1.Facts - - - (activatable) - -
com.redhat.SubscriptionManager - - - (activatable) - -
com.redhat.ifcfgrh1 1076 NetworkManager root :1.10 NetworkManager.service - -
com.redhat.problems.configuration - - - (activatable) - -
com.redhat.tuned 1342 tuned root :1.25 tuned.service - -
fi.epitest.hostap.WPASupplicant - - - (activatable) - -
fi.w1.wpa_supplicant1 - - - (activatable) - -
net.reactivated.Fprint - - - (activatable) - -
org.bluez - - - (activatable) - -
org.fedoraproject.SetroubleshootFixit - - - (activatable) - -
org.fedoraproject.Setroubleshootd - - - (activatable) - -
org.freedesktop.Accounts 1088 accounts-daemon root :1.6 accounts- daemon.service - -
org.freedesktop.Avahi 1019 avahi-daemon avahi :1.2 avahi-daemon.service - -
org.freedesktop.ColorManager 1583 colord colord :1.26 colord.service - -
org.freedesktop.DBus - - - - - - -
org.freedesktop.Flatpak.SystemHelper - - - (activatable) - -
org.freedesktop.GeoClue2 - - - (activatable) - -
org.freedesktop.ModemManager1 1029 ModemManager root :1.4 ModemManager.service - -
org.freedesktop.NetworkManager 1076 NetworkManager root :1.7 NetworkManager.service - -
org.freedesktop.PackageKit 4756 packagekitd root :1.384 packagekit.service - -
org.freedesktop.PolicyKit1 1020 polkitd polkitd :1.5 polkit.service - -
org.freedesktop.RealtimeKit1 1030 rtkit-daemon root :1.3 rtkit-daemon.service - -
org.freedesktop.UDisks2 - - - (activatable) - -
org.freedesktop.UPower - - - (activatable) - -
org.freedesktop.hostname1 - - - (activatable) - -
org.freedesktop.import1 - - - (activatable) - -
org.freedesktop.locale1 - - - (activatable) - -
org.freedesktop.login1 1023 systemd-logind root :1.0 systemd- logind.service - -
org.freedesktop.machine1 - - - (activatable) - -
org.freedesktop.nm_dispatcher - - - (activatable) - -
org.freedesktop.problems 15157 abrt-dbus root :1.34123 dbus.service - -
org.freedesktop.realmd - - - (activatable) - -
org.freedesktop.systemd1 - - - (activatable) - -
org.freedesktop.timedate1 - - - (activatable) - -
org.gnome.GConf.Defaults - - - (activatable) - -
org.opensuse.CupsPkHelper.Mechanism - - - (activatable) - -18:34:56 close(3) = 0
18:34:56 munmap(0x7fee1868b000, 4096) = 0
18:34:56 socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
18:34:56 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
18:34:56 setsockopt(3, SOL_SOCKET, SO_PASSSEC, [0], 4) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
18:34:56 setsockopt(3, SOL_SOCKET, SO_RCVBUFFORCE, [8388608], 4) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
18:34:56 setsockopt(3, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = 0
18:34:56 connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_PEERSEC, 0x5645d834d4f0, 0x7ffdc6169a50) = -1 ENOPROTOOPT (Protocol not available)
18:34:56 fstat(3, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
18:34:56 getsockopt(3, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
18:34:56 getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
18:34:56 geteuid() = 0
18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(3)=[{"\0AUTH EXTERNAL ", 15}, {"30", 2}, {"\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 28}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 45
18:34:56 gettid() = 9342
18:34:56 getrandom("\375\326f\327\327UZ\222\347T\242\240\304\227\5}", 16, GRND_NONBLOCK) = 16
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"OK 0345052a3855cdc590c849e45b4b4"..., 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 52
18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/fre"..., 128}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 128
18:34:56 recvmsg(3, 0x7ffdc6168960, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
18:34:56 ppoll([{fd=3, events=POLLIN}], 1, {24, 999810000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {24, 999633311})
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\1\0\0\0E\0\0\0\6\1s\0\10\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{":1.25140\0\0\0\0\0\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1s\0\0"..., 77}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 77
18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1p\0\0\0\2\0\0\0\230\0\0\0\1\1o\0\27\0\0\0/org/fre"..., 168}, {"\342,\0\0~$\0\0\4\0\0\0su-l\0\0\0\0\3\0\0\0tty\0\4\0\0\0"..., 112}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 280
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1\r\0\0\0\2\0\0\0\225\0\0\0\1\1o\0\25\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0"..., 157}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 157
18:34:56 recvmsg(3, 0x7ffdc6168b10, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
18:34:56 ppoll([{fd=3, events=POLLIN}], 1, {24, 999829000}, NULL, 8) = 0 (Timeout)
18:35:21 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4
18:35:21 fstat(4, {st_mode=S_IFREG|0644, st_size=2502, ...}) = 0
18:35:21 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fee1868b000
18:35:21 read(4, "# Locale name alias data base.\n#"..., 4096) = 2502
18:35:21 read(4, "", 4096) = 0
18:35:21 close(4) = 0
18:35:21 munmap(0x7fee1868b000, 4096) = 0我需要帮助,既要找到导致 systemd 进入故障状态的可能根本原因,又要找到解决 org.freedesktop.systemd1 启动失败的解决方案。在我看来,这导致登录/用户更改延迟 30 秒。请考虑一下,我们讨论的是 24/7 生产系统。
提前感谢 Mario
发布于 2018-07-30 07:09:28
- systemd 42.17e版不能一次处理太多的会议,根据红帽子全球支持。它耗尽了记忆,并进一步崩溃了模块org.freedesktop.logind和org.freedesktop.systemd。
这个问题是错误的,并建议升级到systemd 52.17e。
任何运行时问题都可以通过下面的命令来解决,它不会妨碍系统上的任何事情--
$ systemctl守护进程-reexec
如果守护进程-reexec在超时时失败,则应尝试
$杀死1
向玛丽问好。
https://serverfault.com/questions/922465
复制相似问题