Unknown cause of systemd FAILURE state - systemd-logind cannot start org.freedesktop.systemd1 after server reboot

Server Fault user
Asked 2018-07-18 12:25:36

Basic information

Red Hat Enterprise Linux Server release 7.4 (Maipo)
component: systemd
Hardware: x86_64 Linux    
[root@scvberpat01 log]# uname -r
3.10.0-693.21.1.el7.x86_64

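General description

Hey guys, I had a pretty big issue this weekend that ended with our company's production server having to be rebooted.

It all comes down to the systemd service entering the FAILURE state and no longer being able to start the login manager org.freedesktop.login1 and org.freedesktop.systemd1 via dbus/systemd. Something I cannot identify seems to have terminated the systemd-logind service. We run a heavy production system with an average of about 1700 tasks/processes executing simultaneously per day. Crons and incrons are in use. In addition, since the reboot systemd has been unable to start the module org.freedesktop.systemd1, which causes a significant delay at login and when changing users through su.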

systemd version
[root@scvberpat01 tmp]# systemctl --version
 systemd 219
 +PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

Event log entries from the point where systemd entered the FAILURE state

Jul 14 05:09:40 scvberpat01 sshd[31966]: Accepted password for erp_monitoring from 172.20.0.63 port 58768 ssh2
 Jul 14 05:09:40 scvberpat01 systemd-logind: New session 1135476 of user erp_monitoring.
 Jul 14 05:09:40 scvberpat01 systemd: Started Session 1135476 of user erp_monitoring.
 Jul 14 05:09:40 scvberpat01 systemd: Starting Session 1135476 of user erp_monitoring.
 Jul 14 05:09:45 scvberpat01 sshd[32537]: Accepted password for erp_monitoring from 172.20.0.63 port 58801 ssh2
 Jul 14 05:09:45 scvberpat01 systemd-logind: New session 1135477 of user erp_monitoring.
 Jul 14 05:09:45 scvberpat01 systemd: Started Session 1135477 of user erp_monitoring.
 Jul 14 05:09:45 scvberpat01 systemd: Starting Session 1135477 of user erp_monitoring.
 Jul 14 05:10:01 scvberpat01 systemd: Started Session 1135478 of user root.
 Jul 14 05:10:01 scvberpat01 systemd: Starting Session 1135478 of user root.
 Jul 14 05:10:02 scvberpat01 systemd: Started Session 1135479 of user paseb.
 Jul 14 05:10:02 scvberpat01 systemd: Starting Session 1135479 of user paseb.
 Jul 14 05:10:02 scvberpat01 systemd: Started Session 1135480 of user root.
 Jul 14 05:10:02 scvberpat01 systemd: Starting Session 1135480 of user root.
 Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135482 of user paseb.
 Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135482 of user paseb.
 Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135481 of user paseb.
 Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135481 of user paseb.
 Jul 14 05:10:03 scvberpat01 systemd: Started Session 1135484 of user batchparg.
 Jul 14 05:10:03 scvberpat01 systemd: Starting Session 1135484 of user batchparg.
 Jul 14 05:10:04 scvberpat01 systemd: Started Session 1135485 of user batchparg.
 Jul 14 05:10:04 scvberpat01 systemd: Starting Session 1135485 of user batchparg.
 Jul 14 05:10:05 scvberpat01 systemd: Started Session 1135483 of user batchparg.
 Jul 14 05:10:05 scvberpat01 systemd: Starting Session 1135483 of user batchparg.
 Jul 14 05:10:06 scvberpat01 systemd: Started Session 1135486 of user batchparg.
 Jul 14 05:10:06 scvberpat01 systemd: Starting Session 1135486 of user batchparg.
 Jul 14 05:10:07 scvberpat01 systemd: Started Session 1135487 of user batchparg.
 Jul 14 05:10:07 scvberpat01 systemd: Starting Session 1135487 of user batchparg.
 Jul 14 05:10:08 scvberpat01 systemd: Started Session 1135488 of user batchparg.
 Jul 14 05:10:08 scvberpat01 systemd: Starting Session 1135488 of user batchparg.
 Jul 14 05:10:10 scvberpat01 systemd: Started Session 1135489 of user batchparg.
 Jul 14 05:10:10 scvberpat01 systemd: Starting Session 1135489 of user batchparg.
 Jul 14 05:10:11 scvberpat01 systemd: Started Session 1135490 of user batchparg.
 Jul 14 05:10:11 scvberpat01 systemd: Starting Session 1135490 of user batchparg.
 Jul 14 05:10:12 scvberpat01 systemd: Started Session 1135491 of user batchparg.
 Jul 14 05:10:12 scvberpat01 systemd: Starting Session 1135491 of user batchparg.
 Jul 14 05:10:13 scvberpat01 systemd: systemd-logind.service has no holdoff time, scheduling restart.
 Jul 14 05:10:13 scvberpat01 systemd: Starting Login Service...
 Jul 14 05:10:13 scvberpat01 systemd: Started Login Service.
 Jul 14 05:10:13 scvberpat01 systemd-logind: New seat seat0.
 Jul 14 05:10:13 scvberpat01 systemd-logind: Failed to read /run/systemd/users/11469: Argument list too long
 Jul 14 05:10:13 scvberpat01 systemd-logind: Failed to read /run/systemd/users/0: Argument list too long
 Jul 14 05:10:13 scvberpat01 systemd-logind: User enumeration failed: Argument list too long
 Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
 Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
 Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
 Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to stop user slice: No buffer space available
 Jul 14 05:11:58 scvberpat01 systemd-logind: Failed to start user slice user-11469.slice, ignoring: No buffer space available ((null))
 Jul 14 05:11:59 scvberpat01 systemd-logind: Failed to start user slice user-0.slice, ignoring: No buffer space available ((null))
 Jul 14 05:13:13 scvberpat01 systemd: systemd-logind.service watchdog timeout (limit 3min)!
 Jul 14 05:13:13 scvberpat01 abrt-hook-ccpp: Process 2577 (systemd-logind) of user 0 killed by SIGABRT - dumping core
 Jul 14 05:13:16 scvberpat01 ModemManager[1041]: [sleep-monitor] inhibit failed: GDBus.Error:org.freedesktop.DBus.Error.NoReply: Message did not receive a reply (timeout by message bus)
 Jul 14 05:13:16 scvberpat01 systemd: systemd-logind.service: main process exited, code=dumped, status=6/ABRT
 Jul 14 05:13:16 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
 Jul 14 05:13:16 scvberpat01 systemd: systemd-logind.service failed.
 Jul 14 05:13:21 scvberpat01 kernel: nr_pdflush_threads exported in /proc is scheduled for removal
 Jul 14 05:13:40 scvberpat01 dbus[1028]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
 Jul 14 05:13:41 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
 Jul 14 05:13:41 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
 Jul 14 05:13:41 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
 Jul 14 05:13:41 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
 Jul 14 05:13:41 scvberpat01 systemd: Failed to start Login Service.
 Jul 14 05:13:41 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
 Jul 14 05:13:41 scvberpat01 systemd: systemd-logind.service failed.
 Jul 14 05:14:05 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.login1': timed out
 Jul 14 05:14:06 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
 Jul 14 05:14:06 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
 Jul 14 05:14:06 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
 Jul 14 05:14:06 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
 Jul 14 05:14:06 scvberpat01 systemd: Failed to start Login Service.
 Jul 14 05:14:06 scvberpat01 systemd: Unit systemd-logind.service entered failed state.
 Jul 14 05:14:06 scvberpat01 systemd: systemd-logind.service failed.
 Jul 14 05:14:31 scvberpat01 systemd-logind: Failed to enable subscription: Connection timed out
 Jul 14 05:14:31 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
 Jul 14 05:14:31 scvberpat01 systemd-logind: Failed to fully start up daemon: Connection timed out
 Jul 14 05:14:31 scvberpat01 systemd: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
 ...
 ...

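This went on for a day and a half until it seriously affected the server. In the end ssh, samba, and even the terminal stopped responding. Unfortunately the problem was not noticed earlier, so a hard reset was performed.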
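
For a log this long it can help to locate the first failure and see which processes report errors. The following is a minimal Python sketch, not part of the original post, that assumes the classic syslog layout shown in the excerpt above; the function name `failure_summary` and the keyword match on "failed"/"timeout" are illustrative choices.

```python
import re
from collections import Counter

# Matches classic syslog lines such as
# "Jul 14 05:10:13 host systemd-logind: Failed to read ...".
LINE_RE = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<proc>[^\s:\[]+)(?:\[\d+\])?:\s+(?P<msg>.*)$"
)


def failure_summary(lines):
    """Return (timestamp of first failure, Counter of failures per process)."""
    first = None
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # ignore lines that are not in syslog format
        if re.search(r"\b(failed|timeout)\b", m["msg"], re.IGNORECASE):
            counts[m["proc"]] += 1
            if first is None:
                first = m["ts"]
    return first, counts


# Lines copied from the log excerpt in the question.
sample = [
    "Jul 14 05:10:12 scvberpat01 systemd: Started Session 1135491 of user batchparg.",
    "Jul 14 05:10:13 scvberpat01 systemd-logind: Failed to read /run/systemd/users/0: Argument list too long",
    "Jul 14 05:10:13 scvberpat01 systemd-logind: User enumeration failed: Argument list too long",
    "Jul 14 05:13:13 scvberpat01 systemd: systemd-logind.service watchdog timeout (limit 3min)!",
    "Jul 14 05:13:41 scvberpat01 dbus[1028]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out",
]
first, counts = failure_summary(sample)
print(first, dict(counts))
```

On the sample lines the first failure is reported by systemd-logind at 05:10:13, immediately after the logind restart, which is where the investigation would start.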

After the reboot, systemd seems unable to start org.freedesktop.systemd1:

Jul 18 13:21:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
 Jul 18 13:21:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33081.scope: Connection timed out
 Jul 18 13:22:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
 Jul 18 13:22:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33086.scope: Activation of org.freedesktop.systemd1 timed out
 Jul 18 13:23:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Activation of org.freedesktop.systemd1 timed out (org.freedesktop.DBus.Error.TimedOut)
 Jul 18 13:23:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33092.scope: Activation of org.freedesktop.systemd1 timed out
 Jul 18 13:24:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Connection timed out ((null))
 Jul 18 13:24:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33097.scope: Activation of org.freedesktop.systemd1 timed out
 Jul 18 13:25:05 scvberpat01.pankl.local systemd-logind[1023]: Failed to start user slice user-11511.slice, ignoring: Connection timed out ((null))
 Jul 18 13:25:30 scvberpat01.pankl.local systemd-logind[1023]: Failed to start session scope session-33102.scope: Connection timed out

"busctl --list" shows that org.freedesktop.systemd1 is not activated

[root@scvberpat01 log]# busctl --list
 NAME PID PROCESS USER CONNECTION UNIT SESSION DESCRIPTION
 :1.0 1023 systemd-logind root :1.0 systemd-logind.service - -
 :1.10 1076 NetworkManager root :1.10 NetworkManager.service - -
 :1.2 1019 avahi-daemon avahi :1.2 avahi-daemon.service - -
 :1.24 1348 cupsd root :1.24 cups.service - -
 :1.25 1342 tuned root :1.25 tuned.service - -
 :1.26 1583 colord colord :1.26 colord.service - -
 :1.27 1348 cupsd root :1.27 cups.service - -
 :1.28 2268 libvirtd root :1.28 libvirtd.service - -
 :1.3 1030 rtkit-daemon root :1.3 rtkit-daemon.service - -
 :1.34123 15157 abrt-dbus root :1.34123 dbus.service - -
 :1.34132 9585 busctl root :1.34132 sshd.service - -
 :1.384 4756 packagekitd root :1.384 packagekit.service - -
 :1.4 1029 ModemManager root :1.4 ModemManager.service - -
 :1.5 1020 polkitd polkitd :1.5 polkit.service - -
 :1.6 1088 accounts-daemon root :1.6 accounts-daemon.service - -
 :1.7 1076 NetworkManager root :1.7 NetworkManager.service - -
 com.redhat.RHSM1 - - - (activatable) - -
 com.redhat.RHSM1.Facts - - - (activatable) - -
 com.redhat.SubscriptionManager - - - (activatable) - -
 com.redhat.ifcfgrh1 1076 NetworkManager root :1.10 NetworkManager.service     - -
 com.redhat.problems.configuration - - - (activatable) - -
 com.redhat.tuned 1342 tuned root :1.25 tuned.service - -
 fi.epitest.hostap.WPASupplicant - - - (activatable) - -
 fi.w1.wpa_supplicant1 - - - (activatable) - -
 net.reactivated.Fprint - - - (activatable) - -
 org.bluez - - - (activatable) - -
 org.fedoraproject.SetroubleshootFixit - - - (activatable) - -
 org.fedoraproject.Setroubleshootd - - - (activatable) - -
 org.freedesktop.Accounts 1088 accounts-daemon root :1.6 accounts-daemon.service - -
 org.freedesktop.Avahi 1019 avahi-daemon avahi :1.2 avahi-daemon.service - -
 org.freedesktop.ColorManager 1583 colord colord :1.26 colord.service - -
 org.freedesktop.DBus - - - - - - -
 org.freedesktop.Flatpak.SystemHelper - - - (activatable) - -
 org.freedesktop.GeoClue2 - - - (activatable) - -
 org.freedesktop.ModemManager1 1029 ModemManager root :1.4     ModemManager.service - -
 org.freedesktop.NetworkManager 1076 NetworkManager root :1.7     NetworkManager.service - -
 org.freedesktop.PackageKit 4756 packagekitd root :1.384 packagekit.service - -
 org.freedesktop.PolicyKit1 1020 polkitd polkitd :1.5 polkit.service - -
 org.freedesktop.RealtimeKit1 1030 rtkit-daemon root :1.3 rtkit-daemon.service - -
 org.freedesktop.UDisks2 - - - (activatable) - -
 org.freedesktop.UPower - - - (activatable) - -
 org.freedesktop.hostname1 - - - (activatable) - -
 org.freedesktop.import1 - - - (activatable) - -
 org.freedesktop.locale1 - - - (activatable) - -
 org.freedesktop.login1 1023 systemd-logind root :1.0 systemd-logind.service - -
 org.freedesktop.machine1 - - - (activatable) - -
 org.freedesktop.nm_dispatcher - - - (activatable) - -
 org.freedesktop.problems 15157 abrt-dbus root :1.34123 dbus.service - -
 org.freedesktop.realmd - - - (activatable) - -
 org.freedesktop.systemd1 - - - (activatable) - -
 org.freedesktop.timedate1 - - - (activatable) - -
 org.gnome.GConf.Defaults - - - (activatable) - -
 org.opensuse.CupsPkHelper.Mechanism - - - (activatable) - -
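
The listing above is long, so the relevant rows are easy to miss. As a rough sketch (not from the original post), the Python snippet below scans `busctl --list` text and reports which of the names of interest have no owning process, that is, a "-" in the PID column. The function name `inactive_names` is hypothetical, and the parsing assumes the whitespace-separated column layout shown above.

```python
def inactive_names(busctl_output, wanted):
    """Return the names from `wanted` that are listed without an owning PID."""
    status = {}
    for line in busctl_output.splitlines():
        fields = line.split()
        # Skip blanks, the header row, and unique connection names such as ":1.0".
        if len(fields) < 2 or fields[0] == "NAME" or fields[0].startswith(":"):
            continue
        name, pid = fields[0], fields[1]
        status[name] = pid  # "-" means no process currently owns the name
    return [name for name in wanted if status.get(name) == "-"]


# Rows copied from the busctl output in the question.
sample = """\
 NAME PID PROCESS USER CONNECTION UNIT SESSION DESCRIPTION
 :1.0 1023 systemd-logind root :1.0 systemd-logind.service - -
 org.freedesktop.login1 1023 systemd-logind root :1.0 systemd-logind.service - -
 org.freedesktop.systemd1 - - - (activatable) - -
"""
missing = inactive_names(sample, ["org.freedesktop.login1", "org.freedesktop.systemd1"])
print(missing)
```

Here org.freedesktop.login1 is owned by PID 1023 while org.freedesktop.systemd1 is only activatable, which matches the activation timeouts that systemd-logind logs after the reboot.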

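An strace of a user change shows the timeout at the 18:34:56 lines (the delay I observe is 30 seconds). It doesn't tell me much, but maybe someone can help me here. On our test system, which was not hit by this major incident, no such timeout appears.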

18:34:56 close(3) = 0
 18:34:56 munmap(0x7fee1868b000, 4096) = 0
 18:34:56 socket(AF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
 18:34:56 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
 18:34:56 setsockopt(3, SOL_SOCKET, SO_PASSSEC, [0], 4) = 0
 18:34:56 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
 18:34:56 setsockopt(3, SOL_SOCKET, SO_RCVBUFFORCE, [8388608], 4) = 0
 18:34:56 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
 18:34:56 setsockopt(3, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = 0
 18:34:56 connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0
 18:34:56 getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0
 18:34:56 getsockopt(3, SOL_SOCKET, SO_PEERSEC, 0x5645d834d4f0, 0x7ffdc6169a50) = -1 ENOPROTOOPT (Protocol not available)
 18:34:56 fstat(3, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
 18:34:56 getsockopt(3, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
 18:34:56 getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
 18:34:56 geteuid() = 0
 18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(3)=[{"\0AUTH EXTERNAL ", 15}, {"30", 2}, {"\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 28}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 45
 18:34:56 gettid() = 9342
 18:34:56 getrandom("\375\326f\327\327UZ\222\347T\242\240\304\227\5}", 16, GRND_NONBLOCK) = 16
 18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"OK 0345052a3855cdc590c849e45b4b4"..., 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 52
 18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/fre"..., 128}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 128
 18:34:56 recvmsg(3, 0x7ffdc6168960, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
 18:34:56 ppoll([{fd=3, events=POLLIN}], 1, {24, 999810000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {24, 999633311})
 18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\r\0\0\0\1\0\0\0E\0\0\0\6\1s\0\10\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
 18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{":1.25140\0\0\0\0\0\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1s\0\0"..., 77}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 77
 18:34:56 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1p\0\0\0\2\0\0\0\230\0\0\0\1\1o\0\27\0\0\0/org/fre"..., 168}, {"\342,\0\0~$\0\0\4\0\0\0su-l\0\0\0\0\3\0\0\0tty\0\4\0\0\0"..., 112}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 280
 18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1\r\0\0\0\2\0\0\0\225\0\0\0\1\1o\0\25\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
 18:34:56 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0"..., 157}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 157
 18:34:56 recvmsg(3, 0x7ffdc6168b10, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
 18:34:56 ppoll([{fd=3, events=POLLIN}], 1, {24, 999829000}, NULL, 8) = 0 (Timeout)
 18:35:21 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4
 18:35:21 fstat(4, {st_mode=S_IFREG|0644, st_size=2502, ...}) = 0
 18:35:21 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fee1868b000
 18:35:21 read(4, "# Locale name alias data base.\n#"..., 4096) = 2502
 18:35:21 read(4, "", 4096) = 0
 18:35:21 close(4) = 0
 18:35:21 munmap(0x7fee1868b000, 4096) = 0
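
To pinpoint where a trace like the one above stalls, the jump between consecutive timestamps can be computed. This is a minimal Python sketch, not from the original post, assuming second-resolution timestamps at the start of each line (as produced by `strace -t`) and a trace that does not cross midnight; `largest_gap` is an illustrative name.

```python
from datetime import datetime


def largest_gap(trace_lines):
    """Return (seconds, line before the gap) for the biggest timestamp jump."""
    best = (0, None)
    prev_time, prev_line = None, None
    for line in trace_lines:
        line = line.strip()
        if not line:
            continue
        try:
            stamp = datetime.strptime(line.split()[0], "%H:%M:%S")
        except ValueError:
            continue  # not a timestamped strace line
        if prev_time is not None:
            gap = int((stamp - prev_time).total_seconds())
            if gap > best[0]:
                best = (gap, prev_line)
        prev_time, prev_line = stamp, line
    return best


# Lines copied from the strace excerpt in the question.
sample = [
    "18:34:56 recvmsg(3, 0x7ffdc6168b10, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)",
    "18:34:56 ppoll([{fd=3, events=POLLIN}], 1, {24, 999829000}, NULL, 8) = 0 (Timeout)",
    '18:35:21 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4',
]
gap, stalled_call = largest_gap(sample)
print(gap, stalled_call)
```

For the excerpt this points at the second `ppoll` call on the D-Bus socket: it waits about 25 seconds, in line with the `{24, 999829000}` timeout argument, and returns without a reply.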

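I need help both with finding the likely root cause of systemd entering the FAILURE state and with fixing the failure to start org.freedesktop.systemd1, which in my view is what causes the 30-second delay on login and user change. Please keep in mind that this is a 24/7 production system.

Thanks in advance, Mario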


1 Answer

Server Fault user

Answered 2018-07-30 07:09:28

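- According to Red Hat Global Support, systemd version 42.el7 cannot handle too many sessions at once. It runs out of memory, which in turn crashes the modules org.freedesktop.login1 and org.freedesktop.systemd1.

The issue is a bug, and upgrading to systemd 52.el7 is recommended.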

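  • This issue can occur because of "abandoned" user sessions. You can check for abandoned user sessions with the command $ systemctl | grep 'of user' | grep 'abandoned'
  • You can delete the session directory with $ rm -rf /run/systemd/system/sessionscope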
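
The check for abandoned sessions can also be scripted. The sketch below is an illustration rather than part of the original answer: it filters unit-listing text for scope units whose SUB state is "abandoned". It assumes the usual `systemctl` column order (UNIT LOAD ACTIVE SUB DESCRIPTION); the function name `abandoned_sessions` and the sample rows are hypothetical, modeled on the session numbers in the question's log.

```python
def abandoned_sessions(systemctl_output):
    """Return scope unit names whose SUB state is 'abandoned'."""
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        # Expected columns: UNIT LOAD ACTIVE SUB DESCRIPTION...
        if (
            len(fields) >= 4
            and fields[0].endswith(".scope")
            and fields[3] == "abandoned"
        ):
            units.append(fields[0])
    return units


# Hypothetical rows for illustration; real input would be captured
# from the output of running `systemctl` on the affected host.
sample = """\
session-1135476.scope loaded active abandoned Session 1135476 of user erp_monitoring
session-1135478.scope loaded active running Session 1135478 of user root
systemd-logind.service loaded failed failed Login Service
"""
stale = abandoned_sessions(sample)
print(len(stale), stale)
```

Counting the result over time gives a rough indication of whether abandoned scopes are accumulating before logind starts to fail.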

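Any runtime issue can be fixed with the command below, which does not disrupt anything on the system:

$ systemctl daemon-reexec

If daemon-reexec fails with a timeout, then try

$ kill 1

Regards, Mario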

Votes: 2
Original page content provided by Server Fault.
Original link:

https://serverfault.com/questions/922465
