Q: Installing GPU drivers on autoscaled nodes in GKE (Cloud Composer)

Stack Overflow user

Asked 2022-02-02 17:29:02

2 answers, 553 views, 0 followers, 2 votes

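I am running a Google Cloud Composer GKE cluster. It has a default node pool with three ordinary CPU nodes and a second node pool with one GPU node. Autoscaling is enabled on the GPU node pool.

I want to run a script inside a Docker container on that GPU node.

For the GPU node's operating system I chose cos_containerd rather than ubuntu.

I followed https://cloud.google.com/kubernetes-engine/docs/how-to/gpus and ran this line: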

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

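When I run "kubectl describe" on the GPU node, the GPU shows up, but the debug output of my test script tells me that the GPU is not being used.

When I connect via ssh to the auto-provisioned GPU node, I can see that I still need to run

cos-extensions install gpu

before the GPU can be used.

Now I want my Cloud Composer GKE cluster to run "cos-extensions install gpu" whenever the autoscaler creates a node.

I would like to apply something like this YAML: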

#cloud-config

runcmd:
  - cos-extensions install gpu

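to my Cloud Composer GKE cluster.

Can I apply it with kubectl? Ideally I only want to run that YAML on the GPU nodes. How can I do that?

I am new to Kubernetes and have already spent a lot of time on this without success. Any help would be much appreciated.

Best, Phil

Update: OK, thanks to Harsh I realized that I have to go through a DaemonSet plus a ConfigMap, as shown here: https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial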

My GPU node has the label

gpu-type=t4

so I created and applied this ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: phils-init-script
  labels:
    gpu-type: t4
data:
  entrypoint.sh: |
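    #!/usr/bin/env bash

    # Host root filesystem as mounted inside the init container.
    ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"

    # Run the COS GPU driver installer inside the host's root filesystem.
    chroot "${ROOT_MOUNT_DIR}" cos-extensions install gpu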

Here is my DaemonSet (I applied this as well):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: phils-cos-extensions-gpu-installer
  labels:
    gpu-type: t4
spec:
  selector:
    matchLabels:
      gpu-type: t4
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: phils-cos-extensions-gpu-installer
        gpu-type: t4
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: phils-init-script
        configMap:
          name: phils-init-script
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: phils-cos-extensions-gpu-installer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: phils-init-script
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause

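But nothing happens, and I get the message "Pods are pending".

While the script was running, I connected to the GPU node via ssh and could see that the shell code from the ConfigMap had not been applied.

What am I missing here?

I am desperately trying to get this to work.

Best, Phil

Thanks for all the help so far!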

kubernetes

google-cloud-platform

google-compute-engine

google-kubernetes-engine

google-cloud-composer

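2 Answers

Stack Overflow user

Accepted answer

Posted 2022-02-02 18:17:04

"Can I apply it with kubectl? Ideally I only want to run that YAML on the GPU nodes. How can I do that?"

Yes, you can run a DaemonSet on every node, and it will run the command on that node.

Because you are on GKE, the DaemonSet will also run the command or script on new nodes that are added when the cluster scales up.

A DaemonSet is mainly used to run an application or deployment on every available node in the cluster.

We can take advantage of this and run the command on every existing and future node.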
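
The question asks for GPU nodes only, whereas a DaemonSet like the sample in this answer is scheduled on every node. A minimal sketch of the pod-template fields that could narrow it down, assuming the GPU nodes carry the gpu-type=t4 label mentioned in the question and the nvidia.com/gpu taint that GKE normally places on GPU node pools (both are assumptions about the cluster; kubectl describe node shows the actual labels and taints). Note that spec.selector.matchLabels matches pod labels, not node labels, so it does not restrict where the pods run:

```yaml
# Fragment to merge into the DaemonSet; not a complete manifest.
spec:
  template:
    spec:
      # Schedule only onto nodes labelled gpu-type=t4 (label taken from the question).
      nodeSelector:
        gpu-type: t4
      # Assumption: GPU nodes are tainted nvidia.com/gpu=present:NoSchedule.
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```

With a nodeSelector in place, the DaemonSet controller creates pods only for matching nodes, including matching nodes that the autoscaler adds later.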

Sample YAML:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-initializer
  labels:
    app: default-init
spec:
  selector:
    matchLabels:
      app: default-init
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: node-initializer
        app: default-init
    spec:
      volumes:
      - name: root-mount
        hostPath:
          path: /
      - name: entrypoint
        configMap:
          name: entrypoint
          defaultMode: 0744
      initContainers:
      - image: ubuntu:18.04
        name: node-initializer
        command: ["/scripts/entrypoint.sh"]
        env:
        - name: ROOT_MOUNT_DIR
          value: /root
        securityContext:
          privileged: true
        volumeMounts:
        - name: root-mount
          mountPath: /root
        - name: entrypoint
          mountPath: /scripts
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause
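
The DaemonSet above mounts a ConfigMap named entrypoint that is not shown here. A minimal sketch of one that would run the command from the question, assuming cos-extensions is available on the host's PATH and that the ConfigMap is created in the same namespace as the DaemonSet:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: entrypoint   # must match the configMap name in the DaemonSet volume
  labels:
    app: default-init
data:
  entrypoint.sh: |
    #!/usr/bin/env bash
    set -euo pipefail

    # Host root filesystem as mounted in the init container (/root above).
    ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"

    # Run the installer inside the host's root filesystem.
    chroot "${ROOT_MOUNT_DIR}" cos-extensions install gpu
```

Apply the ConfigMap before (or together with) the DaemonSet using kubectl apply -f, since the init container cannot start until the volume it mounts exists.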

Example on GitHub: https://github.com/GoogleCloudPlatform/solutions-gke-init-daemonsets-tutorial

Full article, including the exact deployment steps: https://cloud.google.com/solutions/automatically-bootstrapping-gke-nodes-with-daemonsets

Votes: 1

Stack Overflow user

Posted 2022-02-02 21:59:22

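If you have installed the driver this many times and nvidia-smi still cannot communicate, take a look at prime-select.

Run prime-select query to see the available options; it should show at least nvidia | intel.
Select prime-select nvidia.
If you then see "nvidia is already selected", choose the other one, e.g. prime-select intel, and then switch back with prime-select nvidia.
Reboot and check nvidia-smi.

It may also be a good idea to run this once more:

sudo apt install nvidia-cuda-toolkit

When it finishes, reboot the machine; nvidia should then work.

In other cases, it worked after following these instructions for installing CUDA on a 20.04 VM.

Finally, in some other cases the cause is unattended upgrades. Review the settings and adjust them if they lead to unexpected results. This URL has the Debian documentation, UnattendedUpgrades, since I see you have tested with that distribution.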

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/70960103
