I am trying to install the spark-xml_2.12-0.15.0 library with dbx.
The documentation I found shows it being included in the conf/deployment.yml file, as follows:
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "10.4.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 2

build:
  commands:
    - "mvn clean package"

environments:
  default:
    workflows:
      - name: "charming-aurora-sample-jvm"
        libraries:
          - jar: "{{ 'file://' + dbx.get_last_modified_file('target/scala-2.12', 'jar') }}"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            deployment_config:
              no_package: true
            spark_jar_task:
              main_class_name: "org.some.main.ClassName"

You can see the documentation page here: devops/?h=maven
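Note that this documented example is a JVM deployment: the jar is built by mvn clean package, picked up from target/scala-2.12 by the dbx.get_last_modified_file helper, and no_package: true tells dbx not to attach the project's own Python package to that task.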
I have already installed the library on the cluster using its Maven coordinates (Scala 2.13 / 0.15.0):

<!-- https://mvnrepository.com/artifact/com.databricks/spark-xml -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.13</artifactId>
    <version>0.15.0</version>
</dependency>

I can use it at the notebook level, but not from a job deployed with dbx.
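For reference, "use it at the notebook level" means something like the following minimal PySpark sketch, where spark is the SparkSession a Databricks notebook provides; the file path and rowTag value are made-up placeholders:

# Minimal check that the spark-xml reader is available on the cluster.
# "xml" is the short format name spark-xml registers;
# "dbfs:/tmp/sample.xml" and rowTag="record" are hypothetical placeholders.
df = (
    spark.read.format("xml")
    .option("rowTag", "record")   # the XML element treated as one row
    .load("dbfs:/tmp/sample.xml")
)
df.printSchema()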
EDIT

I am using PySpark.

So, I included it in conf/deployment.yml:

libraries:
  - maven: "com.databricks:spark-xml_2.12:0.15.0"

The workflow in conf/deployment.yml now looks like this:
- name: "my-job"
libraries:
- maven:
- coordinates:"com.databricks:spark-xml_2.12:0.15.0"
tasks:
- task_key: "first_task"
<<: *basic-static-cluster
python_wheel_task:
package_name: "project_name"
entry_point: "jl" # take a look at the setup.py entry_points section for details on how to define an entrypoint
parameters: ["--conf-file", "file:fuse://conf/tasks/my_job_config.yml"]然后我就跟
dbx deploy my-job

This raises the following error:

HTTPError: 400 Client Error: Bad Request for url: https://adb-xxxx.azuredatabricks.net/api/2.0/jobs/reset
Response from server:
{ 'error_code': 'MALFORMED_REQUEST',
  'message': "Could not parse request object: Expected 'START_OBJECT' not "
             "'START_ARRAY'\n"
             ' at [Source: (ByteArrayInputStream); line: 1, column: 91]\n'
             ' at [Source: java.io.ByteArrayInputStream@37fda06f; line: 1, '
             'column: 91]'}

Posted on 2022-09-14 13:14:07
You are very close, and the error you are getting is not very telling. We plan to introduce structure validation to make such checks easier to understand.
The correct deployment file structure should look like this:
- name: "my-job"
tasks:
- task_key: "first_task"
<<: *basic-static-cluster
# please note that libraries section is on the task level
libraries:
- maven:
coordinates:"com.databricks:spark-xml_2.12:0.15.0"
python_wheel_task:
package_name: "project_name"
entry_point: "jl" # take a look at the setup.py entry_points section for details on how to define an entrypoint
parameters: ["--conf-file", "file:fuse://conf/tasks/my_job_config.yml"]这里有两个要点:
1. The libraries section sits at the task level.
2. The maven section expects an object, not a list. The list is exactly what the Expected 'START_OBJECT' not 'START_ARRAY' error is complaining about, so this will not work:

# THIS IS INCORRECT, DON'T DO THIS
libraries:
  - maven:
      - coordinates: "com.databricks:spark-xml_2.12:0.15.0"

But this will:
# correct structure
libraries:
  - maven:
      coordinates: "com.databricks:spark-xml_2.12:0.15.0"

I have summarized these details in a new documentation section.
Posted on 2022-09-12 09:07:57
The documentation says:

"The workflows section of the deployment file fully follows the Databricks Jobs API structure."

If you take a look at the Jobs API documentation, you will see that you need to use maven instead of file, and provide the Maven coordinates as a string. Something like this (and note that you need to use Scala 2.12, not 2.13):
libraries:
  - maven:
      coordinates: "com.databricks:spark-xml_2.12:0.15.0"
Source: https://stackoverflow.com/questions/73685872