I am trying to create a Dataproc cluster that connects Dataproc to Pub/Sub. I need to add multiple jars at cluster creation time via the spark.jars flag:
gcloud dataproc clusters create cluster-2c76 --region us-central1 --zone us-central1-f --master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 1.4-debian10 \
--properties spark:spark.jars=gs://bucket/jars/spark-streaming-pubsub_2.11-2.4.0.jar,gs://bucket/jars/google-oauth-client-1.31.0.jar,gs://bucket/jars/google-cloud-datastore-2.2.0.jar,gs://bucket/jars/pubsublite-spark-sql-streaming-0.2.0.jar spark:spark.driver.memory=3000m \
--initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh \
--metadata spark-bigquery-connector-version=0.21.0 \
--scopes=pubsub,datastore

I get thrown this error:
ERROR: (gcloud.dataproc.clusters.create) argument --properties: Bad syntax for dict arg: [gs://gregalr/jars/spark-streaming-pubsub_2.11-2.3.4.jar]. Please see `gcloud topic flags-file` or `gcloud topic escaping` for information on providing list or dictionary flag values with special characters.

This looked promising, but fails.
If there is a better way to connect Dataproc to Pub/Sub, please share it.
Posted on 2021-11-27 22:40:09
The answer you linked to is the correct approach: How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?
If you had also posted the command you tried with the escaping syntax, along with the resulting error message, it would be easier for others to verify what you did wrong. It appears that in addition to the list of jars you are specifying an extra Spark property, spark:spark.driver.memory=3000m, and trying to separate it from the jars flag with a space, which is not allowed.
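To illustrate why the commas inside the jar list confuse the parser, here is a rough sketch, in Python rather than gcloud's actual code, of the alternate-delimiter convention described in `gcloud topic escaping`: a value that begins with `^DELIM^` is split on `DELIM` instead of the default comma.

```python
def split_gcloud_list(value: str, default_delim: str = ",") -> list[str]:
    """Sketch (not gcloud's real implementation) of alternate-delimiter
    escaping: a value of the form ^DELIM^rest is split on DELIM;
    otherwise the default comma is used."""
    if value.startswith("^"):
        delim, rest = value[1:].split("^", 1)
        return rest.split(delim)
    return value.split(default_delim)

# With the default comma, every jar URL after the first becomes a
# separate "property" with no key, hence the "Bad syntax for dict arg":
print(split_gcloud_list("spark:spark.jars=gs://bucket/a.jar,gs://bucket/b.jar"))

# With ^#^, the commas inside the jar list survive intact and only the
# '#' separates the two Spark properties:
print(split_gcloud_list(
    "^#^spark:spark.jars=gs://bucket/a.jar,gs://bucket/b.jar"
    "#spark:spark.driver.memory=3000m"))
```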
Per the linked answer, you need to designate a new delimiter character to separate the second Spark property:
--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3#spark:spark.driver.memory=3000m

https://stackoverflow.com/questions/70139181
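Applied to the original command, the fix might look like the following sketch. The bucket paths, jar versions, and other flags are carried over unverified from the question; only the --properties value changes, gaining the ^#^ prefix and using # between the two Spark properties:

```shell
gcloud dataproc clusters create cluster-2c76 \
  --region us-central1 \
  --zone us-central1-f \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 500 \
  --num-workers 2 \
  --worker-machine-type n1-standard-4 \
  --worker-boot-disk-size 500 \
  --image-version 1.4-debian10 \
  --properties='^#^spark:spark.jars=gs://bucket/jars/spark-streaming-pubsub_2.11-2.4.0.jar,gs://bucket/jars/google-oauth-client-1.31.0.jar,gs://bucket/jars/google-cloud-datastore-2.2.0.jar,gs://bucket/jars/pubsublite-spark-sql-streaming-0.2.0.jar#spark:spark.driver.memory=3000m' \
  --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh \
  --metadata spark-bigquery-connector-version=0.21.0 \
  --scopes=pubsub,datastore
```

Quoting the --properties value is advisable because ^ and # are special to some shells.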