首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >ML Model pod在Seldon部署中持续重新启动

ML Model pod在Seldon部署中持续重新启动
EN

Stack Overflow用户
提问于 2020-06-25 15:09:41
回答 2查看 417关注 0票数 1

我有一个这样的Seldon部署:

代码语言:javascript
复制
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://seldon-models/mlflow/elasticnet_wine
        name: classifier
      name: default
      replicas: 1     

模型已从服务器成功下载,但过了一段时间后,pods转到crashloop状态并一次又一次重启。

当我看到日志时,没有任何错误,因为日志已经重新启动,我只能看到python包是如何下载的。

代码语言:javascript
复制
PS C:\Users\xxx\mlflow> kubectl logs -p -c wines-classifier model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
代码语言:javascript
复制
Executing before-run script
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
scipy-1.1.0          | 13.2 MB   | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
openssl-1.1.1g       | 2.5 MB    | ########## | 100%
mkl_fft-1.0.6        | 135 KB    | ########## | 100%
blas-1.0             | 6 KB      | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
sqlite-3.32.3        | 1.1 MB    | ########## | 100%
numpy-1.15.4         | 34 KB     | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
certifi-2020.6.20    | 156 KB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | #########  |  91%

现在,尝试@arghya-sadhu提出的-p参数:

代码语言:javascript
复制
PS C:\Users\xxx\mlflow> kubectl logs -p model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp wines-classifier
代码语言:javascript
复制
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
scipy-1.1.0          | 13.2 MB   | #########3 |  93%

以及pod的描述:

代码语言:javascript
复制
PS C:\Users\ivarea\repo\smartgraph\mlflow-v2> kubectl describe pod model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
代码语言:javascript
复制
Name:         model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Namespace:    default
Priority:     0
Node:         mlops-control-plane/172.19.0.2
Start Time:   Thu, 25 Jun 2020 10:08:20 +0200
Labels:       app=model-a-wines-classifier-0-wines-classifier
              fluentd=true
              pod-template-hash=5b8bc7889d
              seldon-app=model-a-wines-classifier
              seldon-app-svc=model-a-wines-classifier-wines-classifier
              seldon-deployment-id=model-a
              version=wines-classifier
Annotations:  prometheus.io/path: /prometheus
              prometheus.io/scrape: true
Status:       Running
IP:           10.244.0.17
IPs:
  IP:           10.244.0.17
Controlled By:  ReplicaSet/model-a-wines-classifier-0-wines-classifier-5b8bc7889d
Init Containers:
  wines-classifier-model-initializer:
    Container ID:  containerd://6a3b158cf4218f8c177f6d18eb5d0387946bf9cc36f1173754b68a029483da8b
    Image:         gcr.io/kfserving/storage-initializer:0.2.2
    Image ID:      gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
    Port:          <none>
    Host Port:     <none>
    Args:
      gs://seldon-models/mlflow/model-a
      /mnt/models
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Jun 2020 10:08:24 +0200
      Finished:     Thu, 25 Jun 2020 10:08:47 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from wines-classifier-provision-location (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Containers:
  wines-classifier:
    Container ID:   containerd://536753d25877994a17d1f1a63bbaf8717dc9180b80f061152688e4c8504c8468
    Image:          seldonio/mlflowserver_rest:0.5
    Image ID:       docker.io/seldonio/mlflowserver_rest@sha256:0fd54a0a314fafc82c490c91df0c4776be454702a307b4b76e12ed6958b4ee00
    Ports:          6000/TCP, 9000/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 25 Jun 2020 10:23:28 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Jun 2020 10:19:09 +0200
      Finished:     Thu, 25 Jun 2020 10:20:41 +0200
    Ready:          False
    Restart Count:  7
    Liveness:       tcp-socket :http delay=60s timeout=1s period=5s #success=1 #failure=3
    Readiness:      tcp-socket :http delay=20s timeout=1s period=5s #success=1 #failure=3
    Environment:
      PREDICTIVE_UNIT_SERVICE_PORT:          9000
      PREDICTIVE_UNIT_ID:                    wines-classifier
      PREDICTIVE_UNIT_IMAGE:                 seldonio/mlflowserver_rest:0.5
      PREDICTOR_ID:                          wines-classifier
      PREDICTOR_LABELS:                      {"version":"wines-classifier"}
      SELDON_DEPLOYMENT_ID:                  model-a
      PREDICTIVE_UNIT_METRICS_SERVICE_PORT:  6000
      PREDICTIVE_UNIT_METRICS_ENDPOINT:      /prometheus
      PREDICTIVE_UNIT_PARAMETERS:            [{"name":"model_uri","value":"/mnt/models","type":"STRING"}]
    Mounts:
      /etc/podinfo from podinfo (rw)
      /mnt/models from wines-classifier-provision-location (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
  seldon-container-engine:
    Container ID:  containerd://938e8f7e3ac23355c8a7a475b71ab54b858aff5ca485f26b99feaba09bb60069
    Image:         docker.io/seldonio/seldon-core-executor:1.1.0
    Image ID:      docker.io/seldonio/seldon-core-executor@sha256:661173fcbc6cb4e9b56db353b19e97d04d9c086e9dc445217f84dc1721bdf894
    Ports:         8000/TCP, 8000/TCP, 5001/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --sdep
      model-a
      --namespace
      default
      --predictor
      wines-classifier
      --http_port
      8000
      --grpc_port
      5001
      --transport
      rest
      --protocol
      seldon
      --prometheus_path
      /prometheus
    State:          Running
      Started:      Thu, 25 Jun 2020 10:08:51 +0200
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   http-get http://:8000/live delay=20s timeout=60s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8000/ready delay=20s timeout=60s period=5s #success=1 #failure=3
    Environment:
      ENGINE_PREDICTOR:  <binary ommited>
      REQUEST_LOGGER_DEFAULT_ENDPOINT_PREFIX:  http://default-broker.
      SELDON_LOG_MESSAGES_EXTERNALLY:          false
    Mounts:
      /etc/podinfo from podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  wines-classifier-provision-location:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-6vqwk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6vqwk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                          Message
  ----     ------     ----                 ----                          -------
  Normal   Scheduled  <unknown>            default-scheduler             Successfully assigned default/model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp to mlops-control-plane
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier-model-initializer
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier-model-initializer
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "seldonio/mlflowserver_rest:0.5" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "docker.io/seldonio/seldon-core-executor:1.1.0" already present on machine
  Normal   Created    14m                  kubelet, mlops-control-plane  Created container seldon-container-engine
  Normal   Started    14m                  kubelet, mlops-control-plane  Started container seldon-container-engine
  Warning  Unhealthy  14m (x8 over 14m)    kubelet, mlops-control-plane  Readiness probe failed: dial tcp 10.244.0.17:9000: connect: connection refused
  Warning  Unhealthy  28s (x171 over 14m)  kubelet, mlops-control-plane  Readiness probe failed: HTTP probe failed with statuscode: 503

如何禁用重启才能检查日志以查看实际错误?

EN

回答 2

Stack Overflow用户

发布于 2021-03-03 23:48:51

可能默认的活跃度和就绪探测的超时时间太短,无法让分类器容器完成依赖项的安装。在容器启动之前,Kubernetes已经重新启动了它,因为它的活动/就绪探测失败了。

在我的例子中,我必须将以下内容添加到Seldon部署声明中以增加超时(当然,您可以调整这些值):

代码语言:javascript
复制
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: ...
spec:
  name: ...
  predictors:
    - graph:
        ...
      name: ...
      replicas: ...
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                readinessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
                livenessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
票数 1
EN

Stack Overflow用户

发布于 2020-06-25 15:47:21

使用如下示例命令中的-p标志检查pod web-1(示例)容器日志中之前销毁的pod(示例)容器日志

代码语言:javascript
复制
kubectl logs -p -c ruby web-1

使用命令kubectl get events检查事件

使用kubectl describe pod podname检查可能导致crashloop的原因

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/62569747

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档