How do I resolve an error when deploying a model in AWS?

Stack Overflow user

Asked 2019-11-12 09:11:44

3 answers, 4.9K views, 0 followers, score 1

I have to deploy a custom Keras model in AWS SageMaker. I created a notebook instance and have the following files:

AmazonSagemaker-Codeset16
   -ann
      -nginx.conf
      -predictor.py
      -serve
      -train.py
      -wsgi.py
   -Dockerfile

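Next, I open the AWS terminal, build the Docker image, and push it to an ECR repository. Then I open a new Jupyter Python notebook and try to fit the model and deploy it. Training completes correctly, but on deployment I get the following error:

"Error hosting endpoint sagemaker-example-2019-10-25-06-11-22-366: Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint."

When I check the logs, I find the following:

2019/11/11 11:53:32 [crit] 19#19: *3 connect() to unix:/tmp/gunicorn.sock failed (2: No such file or directory) while connecting to upstream, client: 10.32.0.4, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "model.aws.local:8080"

and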

File "/usr/local/bin/serve", line 8, in <module>
  sys.exit(main())
File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/cli/serve.py", line 19, in main
  server.start(env.ServingEnv().framework_module)
File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/_server.py", line 107, in start
  module_app])
File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
  errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
  raise child_exception

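I tried deploying the same model to AWS SageMaker with these same files from my local machine, and it deployed successfully. When I do it from inside AWS, I run into this problem.

This is the code of my serve file: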

from __future__ import print_function
import multiprocessing
import os
import signal
import subprocess
import sys

cpu_count = multiprocessing.cpu_count()

model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))


def sigterm_handler(nginx_pid, gunicorn_pid):
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass

    sys.exit(0)


def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))


    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    nginx = subprocess.Popen(['nginx', '-c', '/opt/ml/code/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(model_server_timeout),
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '-w', str(model_server_workers),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')


# The main routine just invokes the start function.
if __name__ == '__main__':
    start_server()

I deploy the model with:

predictor = classifier.deploy(1, 'ml.t2.medium', serializer=csv_serializer)

Please let me know what I am doing wrong in the deployment.

amazon-web-services

amazon-sagemaker

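3 Answers

Stack Overflow user

Accepted answer

Answered 2019-11-13 07:40:07

Using SageMaker script mode may be much simpler than dealing with low-level container and nginx details the way you are trying to. Have you considered it?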

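You only need to provide your Keras script:

With Script Mode, you can use training scripts similar to those you would use outside SageMaker with SageMaker's prebuilt containers for various deep learning frameworks such as TensorFlow, PyTorch, and Apache MXNet.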

https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-sentiment-script-mode/sentiment-analysis.ipynb
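
As a rough illustration of what "providing your script" can look like, here is a minimal sketch of a script-mode entry point. It only shows argument handling; the model code itself is omitted. The function name `parse_args`, the `--epochs` flag, and the fallback paths are assumptions made for this example. `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are environment variables that SageMaker training containers set, the latter only when a data channel named `training` is supplied.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter, passed to the script as a command-line flag.
    parser.add_argument("--epochs", type=int, default=10)
    # Directory whose contents are packaged as the model artifact.
    parser.add_argument(
        "--sm-model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Directory holding the data of a channel named "training".
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    # Tolerate extra flags that this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    print("epochs={} model_dir={} train={}".format(
        args.epochs, args.sm_model_dir, args.train))
    # Build, fit, and save the Keras model under args.sm_model_dir here.
```

Because the defaults fall back to fixed paths, the same script can be run outside SageMaker by passing `--sm-model-dir` and `--train` explicitly.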

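Score: 0

Stack Overflow user

Answered 2019-12-12 21:13:31

You should make sure your container can respond to GET /ping requests: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests

From the traceback, it looks like the server fails to start when the container is launched in SageMaker. I would look further into the stack trace to find out why the server fails to start.

You can also try running the container locally to debug any issues. SageMaker starts the container with the command "", so you can run the same command and debug the container. https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image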
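
One way to act on this locally is to start the container yourself (the linked AWS documentation describes the invocation as `docker run image serve`) and then send it the same health check SageMaker would. The sketch below assumes the container's port 8080 has been published on localhost. The function name `container_is_healthy` and its parameters are made up for this example, and it is written for Python 3 on the host machine, not for the Python 2.7 inside the container.

```python
import urllib.error
import urllib.request


def container_is_healthy(base_url="http://localhost:8080", timeout=2.0):
    """Return True if GET <base_url>/ping answers with HTTP 200, else False."""
    # Bypass any configured HTTP proxy: the target is a local container.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url + "/ping", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx reply such as 502.
        return False
```

Calling `container_is_healthy()` while the container is running should return `True` once nginx can reach gunicorn. If nginx is up but the gunicorn socket is missing, as in the log from the question, nginx typically answers with an error status and the function returns `False`.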

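Score: 0

Stack Overflow user

Answered 2020-05-06 00:06:28

You have not installed gunicorn, which is why you get the error "/tmp/gunicorn.sock failed (2: No such file or directory)". You need to add pip install gunicorn and apt-get install nginx to your Dockerfile.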
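
A quick way to test this hypothesis is to check, inside the container, whether both programs are on the PATH before the serve script tries to launch them. The sketch below is an assumption-laden illustration, not part of the original serve file: the helper name `missing_executables` is invented here, and `shutil.which` requires Python 3.3 or later, whereas the traceback in the question comes from Python 2.7.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # The serve script in the question launches these two programs.
    missing = missing_executables(["nginx", "gunicorn"])
    if missing:
        print("Not installed or not on PATH: " + ", ".join(missing))
    else:
        print("nginx and gunicorn are both available")
```

If either name is reported missing, `subprocess.Popen` cannot start that program, which would be consistent with a traceback that ends inside `subprocess.py`.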

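Score: 0

Original page content provided by Stack Overflow; Chinese translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link: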

https://stackoverflow.com/questions/58815367
