I have the logs below. In my training code, I save the accuracy to the path /accuracy.json and save a metrics file containing this accuracy to the path /mlpipeline-metrics.json. The JSON files are created correctly, but the Kubeflow pipeline (or Argo, which the logs below come from) does not seem to pick them up.
wait time="2020-09-03T04:07:19Z" level=info msg="Copying /mlpipeline-metrics.json from container base image layer to /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="Archiving :/mlpipeline-metrics.json to /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="sh -c docker cp -a :/mlpipeline-metrics.json - | gzip > /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=warning msg="path /mlpipeline-metrics.json does not exist (or /mlpipeline-metrics.json is empty) in archive /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=warning msg="Ignoring optional artifact 'mlpipeline-metrics' which does not exist in path '/mlpipeline-metrics.json': path /mlpipeline-metrics.json does not exist (or /mlpipeline-metrics.json is empty) in archive /argo/outputs/artifacts/mlpipeline-metrics.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="Staging artifact: transformer-pytorch-train-job-acc"
wait time="2020-09-03T04:07:19Z" level=info msg="Copying /accuracy.json from container base image layer to /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="Archiving :/accuracy.json to /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=info msg="sh -c docker cp -a :/accuracy.json - | gzip > /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=warning msg="path /accuracy.json does not exist (or /accuracy.json is empty) in archive /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz"
wait time="2020-09-03T04:07:19Z" level=error msg="executor error: path /accuracy.json does not exist (or /accuracy.json is empty) in archive /argo/outputs/artifacts/transformer-pytorch-train-job-acc.tgz\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:66\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:344\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:245\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:231\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:54\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"
The pipeline code I use is shown below. If I understand it correctly, the container saves the metrics and the accuracy to the JSON file paths I specified; Argo then picks up those files and renders the output in the Kubeflow UI. The logs above, however, leave me confused. Any ideas or suggestions would help me a lot.
@dsl.pipeline(
    name="PyTorch Job",
    description="Example Tutorial"
)
def containerop_basic():
    op = dsl.ContainerOp(
        name='pytorch-train-job',
        image='From our ECR',
        file_outputs={
            'acc': '/accuracy.json',
            'mlpipeline-metrics': '/mlpipeline-metrics.json'
        }
    )

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(containerop_basic, __file__ + '.yaml')

Posted on 2020-09-04 16:26:27
I solved the problem. It was an authorization issue on the Argo side: when executing a pipeline, Argo needs a role that lets it "watch" pods. Adding that role to the service account it uses fixed the problem.
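For reference, a fix of this kind can be expressed as a Role plus RoleBinding. This is only a sketch: the names, the namespace, the service account (pipeline-runner) and the exact verb list are assumptions that depend on your Kubeflow/Argo installation, so check what your Argo executor actually requests:

```yaml
# Sketch only: namespace, names and service account are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-pod-watcher
  namespace: kubeflow
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-pod-watcher-binding
  namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-pod-watcher
subjects:
- kind: ServiceAccount
  name: pipeline-runner
  namespace: kubeflow
```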
Posted on 2020-09-03 17:34:46
When you specify the file_outputs dictionary, what you are actually telling KFP is: when the container's run ends, look for the file at file_location and, under kfp_reference_name, copy it to a new location that the other steps of the pipeline can access (I won't go into the details, but this is basically done with the Minio server deployed during the Kubeflow installation).
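For intuition, the warning in the question's logs boils down to a simple check on the artifact path at the moment the main container exits. A rough pure-Python illustration of that check (this is not Argo's actual implementation):

```python
import os

def stage_artifact(path: str) -> bool:
    """Rough illustration of the check behind the 'does not exist
    (or is empty)' warning: the output file must exist and be
    non-empty when the main container exits."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        print(f"warning: path {path} does not exist (or {path} is empty)")
        return False
    return True
```

If the check fails, the artifact is skipped when it is optional (as with mlpipeline-metrics in the logs) or the step errors out (as with the acc output).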
Judging from your logs, the problem seems to be that when KFP looks for the local files in your container, they are not available at the specified locations, which means your problem is probably one of these two:

1. Your code saves the files to a different location than the one listed in file_outputs. For example, if the files end up under the src folder, changing your code to the following would fix it:
   file_outputs={
       'acc': '/src/accuracy.json',
       'mlpipeline-metrics': '/src/mlpipeline-metrics.json'
   }
2. Something in the part of your code that writes the files has a problem, so the files are never created (or are empty).
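Whichever paths you settle on, the training code inside the container has to actually write both files before it exits. A minimal sketch, writing to /tmp here purely for illustration (in the real container the paths must match file_outputs exactly, and the metric name and value are placeholders); the top-level "metrics" list is the structure the Kubeflow UI reads from mlpipeline-metrics.json:

```python
import json

accuracy = 0.93  # placeholder: would come from your evaluation loop

# Plain value, exposed to downstream steps via file_outputs['acc'].
with open('/tmp/accuracy.json', 'w') as f:
    json.dump(accuracy, f)

# Metrics file in the schema the Kubeflow UI renders: a top-level
# "metrics" list of {name, numberValue, format} entries.
metrics = {
    'metrics': [{
        'name': 'accuracy-score',  # lowercase letters, numbers and '-' only
        'numberValue': accuracy,
        'format': 'PERCENTAGE',
    }]
}
with open('/tmp/mlpipeline-metrics.json', 'w') as f:
    json.dump(metrics, f)
```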
Overall, I also recommend reading Kubeflow's data passing tutorial, which is one of the best resources on this topic at the moment: https://github.com/kubeflow/pipelines/blob/master/samples/tutorials/Data%20passing%20in%20python%20components.ipynb
https://stackoverflow.com/questions/63717009