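Can a dask worker signal to the scheduler that it is improperly set up?
I'm running into a problem where my workers are set up incorrectly a small fraction of the time. One bad worker is enough to break my whole graph. (The task is not the problem: the worker itself is bad.) I know the symptom and can catch it, and I would like the worker to be able to say "hey scheduler, remove me as a worker, don't use me".
I'm using dask-gateway, if that matters.
In case it helps, the affected workers (seemingly only a few percent of them) cannot access libcuda.so.1: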
/opt/conda/lib/python3.7/site-packages/cellpose/models.py in <module>
8 import cv2
9
---> 10 from mxnet import gluon, nd
11 import mxnet as mx
12
/opt/conda/lib/python3.7/site-packages/mxnet/__init__.py in <module>
22 from __future__ import absolute_import
23
---> 24 from .context import Context, current_context, cpu, gpu, cpu_pinned
25 from . import engine
26 from .base import MXNetError
/opt/conda/lib/python3.7/site-packages/mxnet/context.py in <module>
22 import warnings
23 import ctypes
---> 24 from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
25 from .base import _LIB
26 from .base import check_call
/opt/conda/lib/python3.7/site-packages/mxnet/base.py in <module>
212 __version__ = libinfo.__version__
213 # library instance of mxnet
--> 214 _LIB = _load_lib()
215
216 # type definitions
/opt/conda/lib/python3.7/site-packages/mxnet/base.py in _load_lib()
203 """Load library by searching possible path."""
204 lib_path = libinfo.find_lib_path()
--> 205 lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
206 # DMatrix functions
207 lib.MXGetLastError.restype = ctypes.c_char_p
/opt/conda/lib/python3.7/ctypes/__init__.py in __init__()
362
363 if handle is None:
--> 364 self._handle = _dlopen(self._name, mode)
365 else:
366 self._handle = handle
OSError: libcuda.so.1: cannot open shared object file: No such file or directory
Answered 2020-06-22 21:54:05
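You can do just about anything you want with a WorkerPlugin. That said, I don't think you want to do this. Worker initialization should succeed every time, and you should fix the root cause here. Since the problem involves libcuda and dask-gateway, you may be interested in this dask-gateway issue: https://github.com/dask/dask-gateway/issues/177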
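If you do want to detect the symptom yourself while the root cause is being tracked down, a minimal sketch of the check might look like the following. It uses only the standard library and simply asks the dynamic loader whether the library can be opened, which is the same call that fails in the traceback above. The function name `can_load_shared_library` is hypothetical and not part of dask or mxnet. Wiring it into the `setup` method of a `WorkerPlugin` is left out on purpose, because the right way to make a worker shut itself down depends on your distributed version and on how dask-gateway restarts workers.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open `name`, False otherwise.

    Hypothetical helper for illustration; it mirrors the ctypes.CDLL call
    that raises OSError in the traceback when libcuda.so.1 is missing.
    """
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False
    return True


if __name__ == "__main__":
    # On a healthy GPU worker this is expected to print True.
    print(can_load_shared_library("libcuda.so.1"))
```

A check like this could be called at worker startup, for example from a plugin's `setup`, so that a broken worker is noticed before it receives tasks rather than after it has failed part of the graph.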
https://stackoverflow.com/questions/62515753