Using the SageMaker PyTorch image for training

Stack Overflow user
Asked 2020-08-17 03:00:28
Answers: 2, Views: 577, Followers: 0, Votes: 0

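I am trying to containerize the training process for a BERT model and run it on SageMaker. I plan to use the pre-built SageMaker PyTorch GPU container (https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/) as my starting point, but I am having trouble pulling the image during the build.

My Dockerfile looks like this:
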
# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04


ENV PATH="/opt/ml/code:${PATH}"

# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /bert /opt/ml/code

# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# this environment variable is used by the SageMaker PyTorch container to determine our program entry point
# for training and serving.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM bert/train

My build_and_push script:

#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
IMAGE="my-bert"

# parameters
PY_VERSION="py36"

# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]
then
    exit 255
fi

chmod +x bert/train

# Get the region defined in the current configuration (default to us-east-2 if none defined)
region=$(aws configure get region)
region=${region:-us-east-2}

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names ${IMAGE} || aws ecr create-repository --repository-name ${IMAGE}

echo "---> repository done.."
# Get the login command from ECR and execute it directly
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $account.dkr.ecr.$region.amazonaws.com
echo "---> logged in to account ecr.."

# Get the login command from ECR in order to pull down the SageMaker PyTorch image
# aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# echo "---> logged in to pytorch ecr.."

echo "Building image with arch=gpu, region=${region}"
TAG="gpu-${PY_VERSION}"
FULLNAME="${account}.dkr.ecr.${region}.amazonaws.com/${IMAGE}:${TAG}"
docker build -t ${IMAGE}:${TAG} --build-arg ARCH="$arch" -f "Dockerfile" .
docker tag ${IMAGE}:${TAG} ${FULLNAME}
docker push ${FULLNAME}

During the push, I receive the following message, and the SageMaker PyTorch image is not pulled:

Get https://763104351884.dkr.ecr.us-east-1.amazonaws.com/v2/pytorch-training/manifests/1.5.0-gpu-py36-cu101-ubuntu16.04: no basic auth credentials

Please let me know whether this is the correct way to use a pre-built SageMaker image, and what I can do to fix this error.

2 Answers

Stack Overflow user

Answered 2020-12-21 19:36:50

Before running docker build, you should run something like this:

aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.${region}.amazonaws.com
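
One thing to watch with the command above: the registry it logs in to is built from `${region}`, while the `FROM` line in the question pulls from the `us-east-1` registry. Docker credentials are stored per registry host, so if the two regions differ, the login would not cover the base image. A minimal sketch of one way to avoid the mismatch is to derive the registry host and region from the base image URI itself. The function names `ecr_registry` and `ecr_region` are hypothetical, and the script only prints the login command instead of running it, so it can be inspected first:

```shell
#!/usr/bin/env bash
# Sketch only: helper names below are hypothetical, not part of any AWS tooling.

# Base image URI copied from the FROM line of the Dockerfile in the question.
base_image="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04"

ecr_registry() {
    # Everything before the first "/" is the registry host.
    printf '%s\n' "${1%%/*}"
}

ecr_region() {
    # Assumed host layout: <account>.dkr.ecr.<region>.amazonaws.com
    local host
    host="$(ecr_registry "$1")"
    host="${host#*.dkr.ecr.}"        # drop "<account>.dkr.ecr."
    printf '%s\n' "${host%%.*}"      # keep the text up to the next "."
}

base_registry="$(ecr_registry "$base_image")"
base_region="$(ecr_region "$base_image")"

# Print the login command for the base image's registry (dry run).
echo "aws ecr get-login-password --region ${base_region} | docker login --username AWS --password-stdin ${base_registry}"
```

Running the printed command before `docker build` logs Docker in to the same registry host that the `FROM` line references, independent of the region configured for your own account.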
Votes: 3

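Stack Overflow user

Answered 2020-08-27 05:55:28

This image (763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04) is hosted in ECR.

So when you want to pull it, make sure you have the correct AWS configuration (using the security token of your own AWS account) and that you have run the ECR login command before pulling the image.

Example: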

aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884
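
Note that `aws ecr get-login` is only available in AWS CLI v1, and it prints a `docker login` command rather than performing the login, so its output has to be executed. AWS CLI v2 dropped it in favor of `get-login-password`. As a rough sketch, the two forms can be put side by side; the helper name `ecr_login_command` and its arguments are hypothetical, and it only prints the command for the CLI major version you pass in:

```shell
#!/usr/bin/env bash
# Sketch only: ecr_login_command is a hypothetical helper, not an AWS CLI feature.
# Arguments: <cli_major_version> <registry_account_id> <region>
ecr_login_command() {
    local major="$1" registry_id="$2" region="$3"
    if [ "$major" -ge 2 ]; then
        # AWS CLI v2 has no get-login; pipe the password into docker login.
        printf 'aws ecr get-login-password --region %s | docker login --username AWS --password-stdin %s.dkr.ecr.%s.amazonaws.com\n' \
            "$region" "$registry_id" "$region"
    else
        # AWS CLI v1: get-login prints a docker login command, so wrap it in $(...) to run it.
        printf '$(aws ecr get-login --no-include-email --region %s --registry-ids %s)\n' \
            "$region" "$registry_id"
    fi
}

# Print the v2-style command for the registry that hosts the SageMaker PyTorch image.
ecr_login_command 2 763104351884 us-east-1
```

Either form has to be run before `docker build`, since the build is the step that pulls the base image.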

Votes: 0

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/63440881
