Q: Azure ML model training fails on libgomp.so.1

Stack Overflow user

Asked 2022-07-21 10:28:11

1 answer · 149 views · 0 followers · 0 votes

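I hope you can help me with this problem, because I don't know what is causing it or how to solve it. I have been digging into MLOps to better understand its workflow and processes, and I found an open-source project to test my knowledge (I am not sure whether I can share the project's GitHub link here).

I started by creating the infrastructure in code (workspace, storage account, Key Vault, container registry, and compute cluster).

Once that was done, I created the following pipeline: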

trigger:
  branches:
    include:
      - machine-learning-pipelines
pool:
  vmImage: "ubuntu-latest"

steps:
- task: UsePythonVersion@0
  displayName: 'Use Python 3.7'
  inputs:
    versionSpec: 3.7

- task: Bash@3
  displayName: 'Install Python Requirements'
  inputs:
    targetType: filePath
    filePath: './package_requirement/install_requirements.sh'
    workingDirectory: 'package_requirement'

- bash: |
   pytest training/train_test.py --doctest-modules --junitxml=junit/test-results.xml --cov=data_test --cov-report=xml --cov-report=html
   
  displayName: 'Data Test'

- task: PublishTestResults@2
  displayName: 'Publish Test Results **/test-*.xml'
  inputs:
    testResultsFiles: '**/test-*.xml'
  condition: succeededOrFailed()

- task: AzureCLI@2
  displayName: 'Install Azure ml CLI'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az extension add -n azure-cli-ml'

- task: AzureCLI@2
  displayName: 'create Azure ML workspace'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml workspace create -g <resource-group> -w <workspace> -l westeurope --exist-ok --yes'

- task: AzureCLI@2
  displayName: 'Azure CLI '
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml computetarget create amlcompute -g <resource-group> -w <workspace> -n amlhricluster -s STANDARD_DS2_V2 --min-nodes 0 --max-nodes 2 --idle-seconds-before-scaledown 300'

- task: AzureCLI@2
  displayName: 'Upload Data to Datastore'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml datastore upload -w <workspace> -g <resource-group> -n $(az ml datastore show-default -w <workspace> -g <resource-group> --query name -o tsv) -p data -u insurance --overwrite true'

- bash: 'mkdir metadata && mkdir models'
  displayName: 'Make Metadata and Models Directory'


- task: AzureCLI@2
  displayName: 'Training Model'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml run submit-script -g <resource-group> -w <workspace> -e insurance_classification --ct amlhricluster -d conda_dependencies.yml -c train_insurance -t ../metadata/run.json train_aml.py'
    workingDirectory: training

- task: AzureCLI@2
  displayName: 'Registering Model'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml model register -g <resource-group> -w <workspace> -n insurance_model -f metadata/run.json --asset-path outputs/models/insurance_model.pkl -d "Classification model for filling a claim prediction" --tag "data"="insurance" --tag "model"="classification" --model-framework ScikitLearn -t metadata/model.json'

- task: AzureCli@2
  displayName: 'Downloading Model'
  inputs:
    azureSubscription: '<service-principle>'
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: 'az ml model download -g <resource-group> -w <workspace> -i $(jq -r .modelId metadata/model.json) -t ./models --overwrite'

- task: CopyFiles@2
  displayName: 'Copy Files to: $(Build.ArtifactStagingDirectory)'
  inputs:
    SourceFolder: '$(Build.SourcesDirectory)'
    Contents: |
     **/metadata/*
     **/models/*
     **/deployment/*
     **/tests/integration/*
     **/package_requirement/*
    TargetFolder: '$(Build.ArtifactStagingDirectory)'

- task: PublishPipelineArtifact@1
  displayName: 'Publish Pipeline Artifact'
  inputs:
    targetPath: '$(Build.ArtifactStagingDirectory)'
    artifact: Landing

My train_insurance.runconfig looks like this:

framework: Python
communicator: None
autoPrepareEnvironment: true
maxRunDurationSeconds:
nodeCount: 1
environment:
  name: project_environment
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: conda_dependencies.yml
    baseCondaEnvironment:
  docker:
    enabled: true
    baseImage: mcr.microsoft.com/azureml/o16n-sample-user-base/ubuntu-miniconda
    sharedVolumes: true
    gpuSupport: false
    shmSize: 1g
    arguments: []
history:
  outputCollection: true
  snapshotProject: true
  directoriesToWatch:
  - logs
dataReferences:
  workspaceblobstore:
    dataStoreName: workspaceblobstore
    pathOnDataStore: insurance
    mode: download
    overwrite: true
    pathOnCompute:

My conda_dependencies.yml is:

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for managed runs. These include runs against
# the localdocker, remotedocker, and cluster compute targets.

# Note that this file is NOT used to automatically manage dependencies for the
# local compute target. To provision these dependencies locally, run:
# conda env update --file conda_dependencies.yml

# Details about the Conda environment file format:
# https://conda.io/docs/using/envs.html#create-environment-file-by-hand

# For managing Spark packages and configuration, see spark_dependencies.yml.
# Version of this configuration file's structure and semantics in AzureML.
# This directive is stored in a comment to preserve the Conda file structure.
# [AzureMlVersion] = 2

name: amlproj06_training_env
dependencies:
  # The python interpreter version.
  # Currently Azure ML Workbench only supports 3.5.2 and later.
  - python=3.7.*
  - pip=20.2.4

  - pip:
      - urllib3_1_26_2
      - azureml
      - azure-cli
      - Cython
      - gcc7
      # Base AzureML SDK
      - azureml-sdk

      # Must match AzureML SDK version.
      # https://learn.microsoft.com/en-us/azure/machine-learning/concept-environments
      - azureml-defaults
      - azureml-core
      # Training deps
      - scikit-learn
      - numpy
      - pytest
      - pytest-cov
      # Scoring deps
      - inference-schema[numpy-support]

      # MLOps with R
      - azure-storage-blob

      # LightGBM boosting lib
      - lightgbm

      # lightgbm Caps because we are throwing darts
      - LightGBM

      # Job lib- whatever I don't know what we use it for
      - joblib

      # Install Pandas
      - pandas
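
A side note on the pip list above: pip compares project names case-insensitively and treats runs of `-`, `_`, and `.` as equivalent (the normalization described in PEP 503). So `lightgbm` and `LightGBM` resolve to the same package, and an entry such as `urllib3_1_26_2` is looked up as a literal project name rather than as a version pin. The following is a minimal sketch for spotting such duplicates. The helper names `normalize` and `find_duplicates` are hypothetical, and the list of entries was copied by hand from the file above.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Normalize a distribution name along the lines of PEP 503."""
    # Drop extras such as "[numpy-support]" and any version specifier first.
    base = re.split(r"[\[<>=!~; ]", name.strip(), maxsplit=1)[0]
    return re.sub(r"[-_.]+", "-", base).lower()


def find_duplicates(requirements):
    """Return {normalized_name: [original spellings]} for names listed more than once."""
    seen = defaultdict(list)
    for req in requirements:
        seen[normalize(req)].append(req)
    return {name: spellings for name, spellings in seen.items() if len(spellings) > 1}


# The pip entries from the file above, copied by hand.
pip_entries = [
    "urllib3_1_26_2", "azureml", "azure-cli", "Cython", "gcc7",
    "azureml-sdk", "azureml-defaults", "azureml-core",
    "scikit-learn", "numpy", "pytest", "pytest-cov",
    "inference-schema[numpy-support]", "azure-storage-blob",
    "lightgbm", "LightGBM", "joblib", "pandas",
]

print(find_duplicates(pip_entries))
```

This only flags repeated names. It does not check whether a name exists on the package index, and it is not related to the libgomp error itself.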

My install_requirements.sh is:

sudo apt-get update
sudo apt-get install -y libgomp1
python --version
pip install --upgrade azure-cli
pip install --upgrade azureml-sdk
pip install -r requirements.txt
pip freeze

In the Training Model task, however, everything seems to go well until this error appears and the task fails:

WARNING: Auto upgrade failed. name 'exit_code' is not defined
2022-07-21T10:12:17.3892253Z Traceback (most recent call last):
2022-07-21T10:12:17.3895010Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/__init__.py", line 697, in _run_job
2022-07-21T10:12:17.3896066Z     result = cmd_copy(params)
2022-07-21T10:12:17.3897247Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/__init__.py", line 333, in __call__
2022-07-21T10:12:17.3898550Z     return self.handler(*args, **kwargs)
2022-07-21T10:12:17.3900098Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/command_operation.py", line 121, in handler
2022-07-21T10:12:17.3901165Z     return op(**command_args)
2022-07-21T10:12:17.3902335Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/_cli/cli_command.py", line 305, in command_wrapper
2022-07-21T10:12:17.3903290Z     retval = function(*args, **kwargs)
2022-07-21T10:12:17.3904464Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/_cli/run/run_commands.py", line 542, in submit_run
2022-07-21T10:12:17.3905481Z     run.wait_for_completion(show_output=True, wait_post_processing=True)
2022-07-21T10:12:17.3906746Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/core/run.py", line 846, in wait_for_completion
2022-07-21T10:12:17.3907678Z     raise_on_error=raise_on_error)
2022-07-21T10:12:17.3908827Z   File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/core/run.py", line 1096, in _stream_run_output
2022-07-21T10:12:17.3909824Z     raise ActivityFailedException(error_details=json.dumps(error, indent=4))
2022-07-21T10:12:17.3910729Z azureml.exceptions._azureml_exception.ActivityFailedException: ActivityFailedException:
2022-07-21T10:12:17.3911536Z    Message: Activity Failed:
2022-07-21T10:12:17.3912179Z {
2022-07-21T10:12:17.3912780Z     "error": {
2022-07-21T10:12:17.3913417Z         "code": "UserError",
2022-07-21T10:12:17.3914262Z         "message": "User program failed with OSError: libgomp.so.1: cannot open shared object file: No such file or directory",
2022-07-21T10:12:17.3915114Z         "messageParameters": {},
2022-07-21T10:12:17.3916068Z         "detailsUri": "https://aka.ms/azureml-run-troubleshooting",
2022-07-21T10:12:17.3917645Z         "details": []
2022-07-21T10:12:17.3918315Z     },
2022-07-21T10:12:17.3919200Z     "time": "0001-01-01T00:00:00.000Z"
2022-07-21T10:12:17.3919847Z }
2022-07-21T10:12:17.3920390Z    InnerException None
2022-07-21T10:12:17.3920963Z    ErrorResponse 
2022-07-21T10:12:17.3921493Z {
2022-07-21T10:12:17.3922011Z     "error": {
2022-07-21T10:12:17.3923684Z         "message": "Activity Failed:\n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"User program failed with OSError: libgomp.so.1: cannot open shared object file: No such file or directory\",\n        \"messageParameters\": {},\n        \"detailsUri\": \"https://aka.ms/azureml-run-troubleshooting\",\n        \"details\": []\n    },\n    \"time\": \"0001-01-01T00:00:00.000Z\"\n}"
2022-07-21T10:12:17.3925244Z     }

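The warning occurs while the Docker image is being pulled. As for libgomp, I installed it on my Ubuntu agent, but the error keeps appearing.

Has anyone faced this issue and found a solution?

If you need more information, please don't hesitate to ask and I will provide it.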
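
One way to narrow this down: the log reports that the user program failed, and the runconfig has Docker enabled with the `amlhricluster` compute target, so the script most likely runs inside a container on the cluster rather than on the pipeline agent where `libgomp1` was installed. Here is a minimal diagnostic sketch, assuming it is run inside the training environment (for example, temporarily at the top of `train_aml.py`). The helper name `can_load` is hypothetical.

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name in this process."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Same exception type as in the log: "cannot open shared object file".
        return False


if __name__ == "__main__":
    # Run this where the training script runs, not on the pipeline agent.
    print("libgomp.so.1 loadable:", can_load("libgomp.so.1"))
```

If this prints `False` in the training run but `True` on the agent, the library is missing from the training image, not from the agent.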

python-3.x

azure-pipelines-build-task

azure-machine-learning-service

mlops

azuremlsdk

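1 Answer

Stack Overflow user

Answered 2022-07-25 09:36:41

libgomp.so.1 is causing the error. It is thrown when xgboost tries to load the library at runtime and fails. The container image "mcr.microsoft.com/azure-functions/python" does not include this library for Python-based Azure Function apps.

Instead of "libgomp.so.1", load the libraries libglib-2.0.so.0 and libgthread-2.0.so.0 at runtime and rerun the container.

We need to install libglib-2.0.so.0 on the Ubuntu machine.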

sudo apt-get install libfontconfig1:i386 libXrender1:i386 libsm6:i386 libfreetype6:i386 libglib2.0-0:i386

This applies to Ubuntu versions later than 14.04.

0 votes
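
Since the error in the question names libgomp.so.1 and is raised by the training run, an `apt-get install` may need to happen inside the training image rather than on the machine that submits the run. The sketch below only builds the Dockerfile text for such an image. The helper name `training_dockerfile` is hypothetical, it assumes the base image allows `apt-get` as root, and how the resulting Dockerfile is wired into the Azure ML environment is not shown and would need to be checked against the SDK documentation.

```python
def training_dockerfile(base_image: str, apt_packages: list) -> str:
    """Build Dockerfile text that adds apt packages on top of base_image."""
    packages = " ".join(apt_packages)
    return (
        f"FROM {base_image}\n"
        "RUN apt-get update && \\\n"
        f"    apt-get install -y --no-install-recommends {packages} && \\\n"
        "    rm -rf /var/lib/apt/lists/*\n"
    )


# Base image copied from the runconfig in the question; libgomp1 is the package
# that the question's install_requirements.sh installs on the agent.
print(training_dockerfile(
    "mcr.microsoft.com/azureml/o16n-sample-user-base/ubuntu-miniconda",
    ["libgomp1"],
))
```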

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/73064569
