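I hope you can help me with this problem, because I don't know what is causing it or how to solve it. To better understand the workflow and the process, I have been digging into MLOps. I found an open-source project to test my knowledge (I'm not sure whether I can share the project's GitHub link here).
I started by creating the infrastructure (workspace, storage account, Key Vault, container registry, and the cluster).
Once that was done, I created the following pipeline: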
trigger:
  branches:
    include:
      - machine-learning-pipelines

pool:
  vmImage: "ubuntu-latest"

steps:
  - task: UsePythonVersion@0
    displayName: 'Use Python 3.7'
    inputs:
      versionSpec: 3.7
  - task: Bash@3
    displayName: 'Install Python Requirements'
    inputs:
      targetType: filePath
      filePath: './package_requirement/install_requirements.sh'
      workingDirectory: 'package_requirement'
  - bash: |
      pytest training/train_test.py --doctest-modules --junitxml=junit/test-results.xml --cov=data_test --cov-report=xml --cov-report=html
    displayName: 'Data Test'
  - task: PublishTestResults@2
    displayName: 'Publish Test Results **/test-*.xml'
    inputs:
      testResultsFiles: '**/test-*.xml'
    condition: succeededOrFailed()
  - task: AzureCLI@2
    displayName: 'Install Azure ml CLI'
    inputs:
      azureSubscription: '<service-principle>'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: 'az extension add -n azure-cli-ml'
  - task: AzureCLI@2
    displayName: 'create Azure ML workspace'
    inputs:
      azureSubscription: '<service-principle>'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: 'az ml workspace create -g <resource-group> -w <workspace> -l westeurope --exist-ok --yes'
  - task: AzureCLI@2
    displayName: 'Azure CLI '
    inputs:
      azureSubscription: '<service-principle>'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: 'az ml computetarget create amlcompute -g <resource-group> -w <workspace> -n amlhricluster -s STANDARD_DS2_V2 --min-nodes 0 --max-nodes 2 --idle-seconds-before-scaledown 300'
  - task: AzureCLI@2
    displayName: 'Upload Data to Datastore'
    inputs:
      azureSubscription: '<service-principle>'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: 'az ml datastore upload -w <workspace> -g <resource-group> -n $(az ml datastore show-default -w <workspace> -g <resource-group> --query name -o tsv) -p data -u insurance --overwrite true'
  - bash: 'mkdir metadata && mkdir models'
    displayName: 'Make Metadata and Models Directory'
  - task: AzureCLI@2
    displayName: 'Training Model'
    inputs:
      azureSubscription: '<service-principle>'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: 'az ml run submit-script -g <resource-group> -w <workspace> -e insurance_classification --ct amlhricluster -d conda_dependencies.yml -c train_insurance -t ../metadata/run.json train_aml.py'
      workingDirectory: training
  - task: AzureCLI@2
    displayName: 'Registering Model'
    inputs:
      azureSubscription: '<service-principle>'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: 'az ml model register -g <resource-group> -w <workspace> -n insurance_model -f metadata/run.json --asset-path outputs/models/insurance_model.pkl -d "Classification model for filling a claim prediction" --tag "data"="insurance" --tag "model"="classification" --model-framework ScikitLearn -t metadata/model.json'
  - task: AzureCLI@2
    displayName: 'Downloading Model'
    inputs:
      azureSubscription: '<service-principle>'
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: 'az ml model download -g <resource-group> -w <workspace> -i $(jq -r .modelId metadata/model.json) -t ./models --overwrite'
  - task: CopyFiles@2
    displayName: 'Copy Files to: $(Build.ArtifactStagingDirectory)'
    inputs:
      SourceFolder: '$(Build.SourcesDirectory)'
      Contents: |
        **/metadata/*
        **/models/*
        **/deployment/*
        **/tests/integration/*
        **/package_requirement/*
      TargetFolder: '$(Build.ArtifactStagingDirectory)'
  - task: PublishPipelineArtifact@1
    displayName: 'Publish Pipeline Artifact'
    inputs:
      targetPath: '$(Build.ArtifactStagingDirectory)'
      artifact: Landing

My train_insurance.runconfig looks like this:
framework: Python
communicator: None
autoPrepareEnvironment: true
maxRunDurationSeconds:
nodeCount: 1
environment:
  name: project_environment
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: conda_dependencies.yml
    baseCondaEnvironment:
  docker:
    enabled: true
    baseImage: mcr.microsoft.com/azureml/o16n-sample-user-base/ubuntu-miniconda
    sharedVolumes: true
    gpuSupport: false
    shmSize: 1g
    arguments: []
history:
  outputCollection: true
  snapshotProject: true
  directoriesToWatch:
    - logs
dataReferences:
  workspaceblobstore:
    dataStoreName: workspaceblobstore
    pathOnDataStore: insurance
    mode: download
    overwrite: true
    pathOnCompute:

My conda_dependencies.yaml is:
# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for managed runs. These include runs against
# the localdocker, remotedocker, and cluster compute targets.
# Note that this file is NOT used to automatically manage dependencies for the
# local compute target. To provision these dependencies locally, run:
# conda env update --file conda_dependencies.yml
# Details about the Conda environment file format:
# https://conda.io/docs/using/envs.html#create-environment-file-by-hand
# For managing Spark packages and configuration, see spark_dependencies.yml.
# Version of this configuration file's structure and semantics in AzureML.
# This directive is stored in a comment to preserve the Conda file structure.
# [AzureMlVersion] = 2
name: amlproj06_training_env
dependencies:
  # The python interpreter version.
  # Currently Azure ML Workbench only supports 3.5.2 and later.
  - python=3.7.*
  - pip=20.2.4
  - pip:
      - urllib3_1_26_2
      - azureml
      - azure-cli
      - Cython
      - gcc7
      # Base AzureML SDK
      - azureml-sdk
      # Must match AzureML SDK version.
      # https://learn.microsoft.com/en-us/azure/machine-learning/concept-environments
      - azureml-defaults
      - azureml-core
      # Training deps
      - scikit-learn
      - numpy
      - pytest
      - pytest-cov
      # Scoring deps
      - inference-schema[numpy-support]
      # MLOps with R
      - azure-storage-blob
      # LightGBM boosting lib
      - lightgbm
      # lightgbm Caps because we are throwing darts
      - LightGBM
      # Job lib- whatever I don't know what we use it for
      - joblib
      # Install Pandas
      - pandas

My install_requirements.sh is:
sudo apt-get update
sudo apt-get install -y libgomp1
python --version
pip install --upgrade azure-cli
pip install --upgrade azureml-sdk
pip install -r requirements.txt
pip freeze

In the Training Model task, however, everything seems to go fine until this error appears and the task fails:
WARNING: Auto upgrade failed. name 'exit_code' is not defined
2022-07-21T10:12:17.3892253Z Traceback (most recent call last):
2022-07-21T10:12:17.3895010Z File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/__init__.py", line 697, in _run_job
2022-07-21T10:12:17.3896066Z result = cmd_copy(params)
2022-07-21T10:12:17.3897247Z File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/__init__.py", line 333, in __call__
2022-07-21T10:12:17.3898550Z return self.handler(*args, **kwargs)
2022-07-21T10:12:17.3900098Z File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azure/cli/core/commands/command_operation.py", line 121, in handler
2022-07-21T10:12:17.3901165Z return op(**command_args)
2022-07-21T10:12:17.3902335Z File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/_cli/cli_command.py", line 305, in command_wrapper
2022-07-21T10:12:17.3903290Z retval = function(*args, **kwargs)
2022-07-21T10:12:17.3904464Z File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/_cli/run/run_commands.py", line 542, in submit_run
2022-07-21T10:12:17.3905481Z run.wait_for_completion(show_output=True, wait_post_processing=True)
2022-07-21T10:12:17.3906746Z File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/core/run.py", line 846, in wait_for_completion
2022-07-21T10:12:17.3907678Z raise_on_error=raise_on_error)
2022-07-21T10:12:17.3908827Z File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/azureml/core/run.py", line 1096, in _stream_run_output
2022-07-21T10:12:17.3909824Z raise ActivityFailedException(error_details=json.dumps(error, indent=4))
2022-07-21T10:12:17.3910729Z azureml.exceptions._azureml_exception.ActivityFailedException: ActivityFailedException:
2022-07-21T10:12:17.3911536Z Message: Activity Failed:
2022-07-21T10:12:17.3912179Z {
2022-07-21T10:12:17.3912780Z "error": {
2022-07-21T10:12:17.3913417Z "code": "UserError",
2022-07-21T10:12:17.3914262Z "message": "User program failed with OSError: libgomp.so.1: cannot open shared object file: No such file or directory",
2022-07-21T10:12:17.3915114Z "messageParameters": {},
2022-07-21T10:12:17.3916068Z "detailsUri": "https://aka.ms/azureml-run-troubleshooting",
2022-07-21T10:12:17.3917645Z "details": []
2022-07-21T10:12:17.3918315Z },
2022-07-21T10:12:17.3919200Z "time": "0001-01-01T00:00:00.000Z"
2022-07-21T10:12:17.3919847Z }
2022-07-21T10:12:17.3920390Z InnerException None
2022-07-21T10:12:17.3920963Z ErrorResponse
2022-07-21T10:12:17.3921493Z {
2022-07-21T10:12:17.3922011Z "error": {
2022-07-21T10:12:17.3923684Z "message": "Activity Failed:\n{\n \"error\": {\n \"code\": \"UserError\",\n \"message\": \"User program failed with OSError: libgomp.so.1: cannot open shared object file: No such file or directory\",\n \"messageParameters\": {},\n \"detailsUri\": \"https://aka.ms/azureml-run-troubleshooting\",\n \"details\": []\n },\n \"time\": \"0001-01-01T00:00:00.000Z\"\n}"
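2022-07-21T10:12:17.3925244Z }

The warning appears while the Docker image is being pulled. As for libgomp, I installed it on my Ubuntu agent, but the error keeps showing up.
Has anyone faced this issue and found a solution?
If you need more information, please don't hesitate to ask and I will provide it.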
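Answered on 2022-07-25 09:36:41
The error is caused by libgomp.so.1. It is thrown when xgboost tries to load the library at runtime and the load fails. However, the container image "mcr.microsoft.com/azure-functions/python" does not ship this library for Python-based Azure Functions apps.
Instead of "libgomp.so.1", load the libraries libglib-2.0.so.0 and libgthread-2.0.so.0 at runtime and rerun the container.
We need to install libglib-2.0.so.0 on the Ubuntu machine.
sudo apt-get install libfontconfig1:i386 libXrender1:i386 libsm6:i386 libfreetype6:i386 libglib2.0-0:i386
This applies to Ubuntu versions later than 14.04.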
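Before installing anything, it may help to confirm which environment is actually missing the library. In the question, install_requirements.sh runs on the build agent, while the failing "User program" runs inside the Docker image configured in the runconfig, so the two can differ. The following is only a minimal sketch: the helper name can_load_shared_library is hypothetical, and calling it at the top of train_aml.py is an assumption about where the check would be useful, not something taken from the project.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        # ctypes.CDLL asks the OS loader to open the library, which is the
        # same kind of lookup that fails with "cannot open shared object file".
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Library names taken from the error message and from the answer above.
    for lib in ("libgomp.so.1", "libglib-2.0.so.0", "libgthread-2.0.so.0"):
        status = "found" if can_load_shared_library(lib) else "MISSING"
        print(f"{lib}: {status}")
```

If the check reports a library as found on the agent but MISSING inside the training run, the package has to be added to the image the run uses rather than to the agent.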
https://stackoverflow.com/questions/73064569