AzureML model training fails on libgomp.so.1

Asked by a Stack Overflow user on 2022-07-21 10:28:11
1 answer, 149 views, 0 followers, 0 votes

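I hope you can help me with this, because I don't know what the problem is or how to fix it. I have been digging into MLOps to better understand the workflow and processes, and I found an open-source project to test my knowledge (I'm not sure whether I'm allowed to share the project's GitHub link here).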

I started by creating the required resources (workspace, storage account, Key Vault, container registry, and compute cluster).

Once that was done, I created the following pipeline:

trigger:
  branches:
    include:
      - machine-learning-pipelines
pool:
  vmImage: "ubuntu-latest"

steps:
- task: UsePythonVersion@0
  displayName: 'Use Python 3.7'
  inputs:
    versionSpec: 3.7

- task: Bash@3
  displayName: 'Install Python Requirements'
  inputs:
    targetType: filePath
    filePath: './package_requirement/install_requirements.sh'
    workingDirectory: 'package_requirement'

- bash: |
   pytest training/train_test.py --doctest-modules --junitxml=junit/test-results.xml --cov=data_test --cov-report=xml --cov-report=html
   
  displayName: 'Data Test'

- task: PublishTestResults@2
  displayName: 'Publish Test Results **/test-*.xml'
  inputs:
    testResultsFiles: '**/test-*.xml'
  condition: succeededOrFailed()

- task: AzureCLI@2
  displayName: 'Install Azure ml CLI'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az extension add -n azure-cli-ml'

- task: AzureCLI@2
  displayName: 'create Azure ML workspace'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml workspace create -g <resource-group> -w <workspace> -l westeurope --exist-ok --yes'

- task: AzureCLI@2
  displayName: 'Azure CLI '
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml computetarget create amlcompute -g <resource-group> -w <workspace> -n amlhricluster -s STANDARD_DS2_V2 --min-nodes 0 --max-nodes 2 --idle-seconds-before-scaledown 300'

- task: AzureCLI@2
  displayName: 'Upload Data to Datastore'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml datastore upload -w <workspace> -g <resource-group> -n $(az ml datastore show-default -w <workspace> -g <resource-group> --query name -o tsv) -p data -u insurance --overwrite true'

- bash: 'mkdir metadata && mkdir models'
  displayName: 'Make Metadata and Models Directory'


- task: AzureCLI@2
  displayName: 'Training Model'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml run submit-script -g <resource-group> -w <workspace> -e insurance_classification --ct amlhricluster -d conda_dependencies.yml -c train_insurance -t ../metadata/run.json train_aml.py'
    workingDirectory: training

- task: AzureCLI@2
  displayName: 'Registering Model'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml model register -g <resource-group> -w <workspace> -n insurance_model -f metadata/run.json --asset-path outputs/models/insurance_model.pkl -d "Classification model for filling a claim prediction" --tag "data"="insurance" --tag "model"="classification" --model-framework ScikitLearn -t metadata/model.json'

- task: AzureCLI@2
  displayName: 'Downloading Model'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml model download -g <resource-group> -w <workspace> -i $(jq -r .modelId metadata/model.json) -t ./models --overwrite'

- task: CopyFiles@2
  displayName: 'Copy Files to: $(Build.ArtifactStagingDirectory)'
  inputs:
    SourceFolder: '$(Build.SourcesDirectory)'
    Contents: |
     **/metadata/*
     **/models/*
     **/deployment/*
     **/tests/integration/*
     **/package_requirement/*
    TargetFolder: '$(Build.ArtifactStagingDirectory)'

- task: PublishPipelineArtifact@1
  displayName: 'Publish Pipeline Artifact'
  inputs:
    targetPath: '$(Build.ArtifactStagingDirectory)'
    artifact: Landing

My train_insurance.runconfig looks like this:

framework: Python
communicator: None
autoPrepareEnvironment: true
maxRunDurationSeconds:
nodeCount: 1
environment:
  name: project_environment
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: conda_dependencies.yml
    baseCondaEnvironment:
  docker:
    enabled: true
    baseImage: mcr.microsoft.com/azureml/o16n-sample-user-base/ubuntu-miniconda
    sharedVolumes: true
    gpuSupport: false
    shmSize: 1g
    arguments: []
history:
  outputCollection: true
  snapshotProject: true
  directoriesToWatch:
  - logs
dataReferences:
  workspaceblobstore:
    dataStoreName: workspaceblobstore
    pathOnDataStore: insurance
    mode: download
    overwrite: true
    pathOnCompute: 

My conda_dependencies.yaml is:

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for managed runs. These include runs against
# the localdocker, remotedocker, and cluster compute targets.

# Note that this file is NOT used to automatically manage dependencies for the
# local compute target. To provision these dependencies locally, run:
# conda env update --file conda_dependencies.yml

# Details about the Conda environment file format:
# https://conda.io/docs/using/envs.html#create-environment-file-by-hand

# For managing Spark packages and configuration, see spark_dependencies.yml.
# Version of this configuration file's structure and semantics in AzureML.
# This directive is stored in a comment to preserve the Conda file structure.
# [AzureMlVersion] = 2

name: amlproj06_training_env
dependencies:
  # The python interpreter version.
  # Currently Azure ML Workbench only supports 3.5.2 and later.
  - python=3.7.*
  - pip=20.2.4

  - pip:
      - urllib3_1_26_2
      - azureml
      - azure-cli
      - Cython
      - gcc7
      # Base AzureML SDK
      - azureml-sdk

      # Must match AzureML SDK version.
      # https://learn.microsoft.com/en-us/azure/machine-learning/concept-environments
      - azureml-defaults
      - azureml-core
      # Training deps
      - scikit-learn
      - numpy
      - pytest
      - pytest-cov
      # Scoring deps
      - inference-schema[numpy-support]

      # MLOps with R
      - azure-storage-blob

      # LightGBM boosting lib
      - lightgbm

      # lightgbm Caps because we are throwing darts
      - LightGBM

      # Job lib- whatever I don't know what we use it for
      - joblib

      # Install Pandas
      - pandas

My install_requirements.sh is:

sudo apt-get update
sudo apt-get install -y libgomp1
python --version
pip install --upgrade azure-cli
pip install --upgrade azureml-sdk
pip install -r requirements.txt
pip freeze

In the Training Model task, however, everything seems to go well until this error appears and the task fails:

WARNING: Auto upgrade failed. name 'exit_code' is not defined
2022-07-21T10:12:17.3892253Z Traceback (most recent call last):
2022-07-21T10:12:17.3895010Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/__init__.py", line 697, in _run_job
2022-07-21T10:12:17.3896066Z     result = cmd_copy(params)
2022-07-21T10:12:17.3897247Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/__init__.py", line 333, in __call__
2022-07-21T10:12:17.3898550Z     return self.handler(*args, **kwargs)
2022-07-21T10:12:17.3900098Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/command_operation.py", line 121, in handler
2022-07-21T10:12:17.3901165Z     return op(**command_args)
2022-07-21T10:12:17.3902335Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/_cli/cli_command.py", line 305, in command_wrapper
2022-07-21T10:12:17.3903290Z     retval = function(*args, **kwargs)
2022-07-21T10:12:17.3904464Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/_cli/run/run_commands.py", line 542, in submit_run
2022-07-21T10:12:17.3905481Z     run.wait_for_completion(show_output=True, wait_post_processing=True)
2022-07-21T10:12:17.3906746Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/core/run.py", line 846, in wait_for_completion
2022-07-21T10:12:17.3907678Z     raise_on_error=raise_on_error)
2022-07-21T10:12:17.3908827Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/core/run.py", line 1096, in _stream_run_output
2022-07-21T10:12:17.3909824Z     raise ActivityFailedException(error_details=json.dumps(error, indent=4))
2022-07-21T10:12:17.3910729Z azureml.exceptions._azureml_exception.ActivityFailedException: ActivityFailedException:
2022-07-21T10:12:17.3911536Z    Message: Activity Failed:
2022-07-21T10:12:17.3912179Z {
2022-07-21T10:12:17.3912780Z     "error": {
2022-07-21T10:12:17.3913417Z         "code": "UserError",
2022-07-21T10:12:17.3914262Z         "message": "User program failed with OSError: libgomp.so.1: cannot open shared object file: No such file or directory",
2022-07-21T10:12:17.3915114Z         "messageParameters": {},
2022-07-21T10:12:17.3916068Z         "detailsUri": "https://aka.ms/azureml-run-troubleshooting",
2022-07-21T10:12:17.3917645Z         "details": []
2022-07-21T10:12:17.3918315Z     },
2022-07-21T10:12:17.3919200Z     "time": "0001-01-01T00:00:00.000Z"
2022-07-21T10:12:17.3919847Z }
2022-07-21T10:12:17.3920390Z    InnerException None
2022-07-21T10:12:17.3920963Z    ErrorResponse 
2022-07-21T10:12:17.3921493Z {
2022-07-21T10:12:17.3922011Z     "error": {
2022-07-21T10:12:17.3923684Z         "message": "Activity Failed:\n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"User program failed with OSError: libgomp.so.1: cannot open shared object file: No such file or directory\",\n        \"messageParameters\": {},\n        \"detailsUri\": \"https://aka.ms/azureml-run-troubleshooting\",\n        \"details\": []\n    },\n    \"time\": \"0001-01-01T00:00:00.000Z\"\n}"
2022-07-21T10:12:17.3925244Z     }

The warning occurs while the Docker image is being pulled. As for libgomp, I installed it on my Ubuntu agent, but the error keeps appearing.
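
One thing that may help narrow this down: the run is submitted to the amlhricluster compute target with Docker enabled in the runconfig, so train_aml.py executes inside the container image there, not on the Ubuntu build agent where install_requirements.sh installs libgomp1. A minimal diagnostic sketch, assuming it is placed near the top of train_aml.py; the helper name `can_load_shared_library` is hypothetical and not part of the project:

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False


if __name__ == "__main__":
    # libgomp.so.1 is the library named in the OSError from the failed run.
    for lib in ("libgomp.so.1",):
        status = "found" if can_load_shared_library(lib) else "MISSING"
        print(f"{lib}: {status}")
```

If this prints MISSING in the run logs, the library is absent from the run's container image, regardless of what is installed on the agent.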

Has anyone faced this problem and knows how to solve it?

If you need any more information, please don't hesitate to ask and I will provide it.


1 Answer

Answered by a Stack Overflow user on 2022-07-25 09:36:41

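The error is caused by libgomp.so.1: xgboost tries to load the library at runtime and throws this error when loading fails. The container image "mcr.microsoft.com/azure-functions/python" does not include this library for Python-based Azure Function apps.

Instead of "libgomp.so.1", load the libraries libglib-2.0.so.0 and libgthread-2.0.so.0 at runtime and rerun the container.

We need to install libglib-2.0.so.0 on the Ubuntu machine.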

sudo apt-get install libfontconfig1:i386 libXrender1:i386 libsm6:i386 libfreetype6:i386 libglib2.0-0:i386 

This applies to Ubuntu versions later than 14.04.
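
A different angle that may also be worth trying, since the run executes inside the image named by `baseImage` in the runconfig: install the `libgomp1` package (the same one install_requirements.sh installs on the agent) into that image. The sketch below only builds the Dockerfile text. It assumes the base image is Ubuntu-based with `apt-get` available and root access during the build, and that the azureml-core v1 SDK's `Environment.docker.base_dockerfile` property is used in place of `baseImage`. The helper `build_dockerfile` is hypothetical, and none of this has been verified against the asker's workspace:

```python
from typing import Sequence


def build_dockerfile(base_image: str, apt_packages: Sequence[str]) -> str:
    """Return Dockerfile text that extends base_image with extra apt packages."""
    packages = " ".join(apt_packages)
    return (
        f"FROM {base_image}\n"
        "RUN apt-get update && \\\n"
        f"    apt-get install -y --no-install-recommends {packages} && \\\n"
        "    rm -rf /var/lib/apt/lists/*\n"
    )


# Base image name copied from the runconfig in the question.
dockerfile = build_dockerfile(
    "mcr.microsoft.com/azureml/o16n-sample-user-base/ubuntu-miniconda",
    ["libgomp1"],
)
print(dockerfile)

# Assumed usage with azureml-core (v1 SDK), not executed here:
#   from azureml.core import Environment
#   env = Environment.from_conda_specification(
#       "project_environment", "conda_dependencies.yml")
#   env.docker.base_image = None
#   env.docker.base_dockerfile = dockerfile
```

Keeping the Dockerfile text in a plain function makes it easy to inspect before handing it to the SDK or pasting it into a custom image build.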

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/73064569
