I followed this guide to distribute the pandas and pyarrow dependencies across the executor nodes of my PySpark-on-YARN application; they are required for running Pandas.
I create the Conda virtual environment like this:
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
Then I submit the Spark job:
import subprocess

spark_job_config_path = '/tmp/spark_job_config.json'
cmd = [
    "spark-submit",
    "--master",
    "yarn",
    "--deploy-mode",
    "client",
    "--archives",
    "/opt/program/pyspark_conda_env.tar.gz#environment",
    "/opt/program/image_analysis_launcher.py",
]
cmd.extend([spark_job_config_path])
subprocess.run(cmd, check=True)
I create the Spark session as follows:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("yarn")
    .config("spark.yarn.dist.archives", "pyspark_conda_env.tar.gz#environment")
    .appName("AppName")
    .getOrCreate()
)
After I run this inside its Docker container, I get the following error message:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: File file:/usr/spark-3.1.2/pyspark_conda_env.tar.gz#environment does not exist
I have also tried packaging the dependencies with virtualenv and PEX, but I get similar error messages. Any idea why this is happening?
Posted on 2022-07-13 06:59:54
I think you are missing the Python configuration for the driver in your command.
In my case, my command is:
# PYSPARK_DRIVER_PYTHON sets the driver's interpreter (local, since this is client mode).
# PYSPARK_PYTHON and spark.yarn.appMasterEnv.PYSPARK_PYTHON set the executors'
# interpreter; in your case both should be ./environment/bin/python.
PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./snat/snat/bin/python \
nohup /app/spark/bin/spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./snat/snat/bin/python \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 2G \
  --jars $(echo /path/to/jars/*.jar | tr ' ' ',') \
  --archives hdfs:///user/root/cloud/snat.zip#snat \
  features/main.py > yarn_pyspark_test.log &
https://stackoverflow.com/questions/68502852
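Applied to the subprocess-based submission from the question, a minimal sketch of the same fix could look like this. It is an adaptation under a few assumptions, not the poster's verified code: client deploy mode (so the driver needs an interpreter that exists locally in the Docker container; /usr/bin/python3 is a placeholder), and the executor interpreter path ./environment/bin/python, which follows from the #environment alias in the --archives argument.

import os
import subprocess

env = os.environ.copy()
# The driver runs inside the container in client mode, so it needs a local interpreter.
env["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"  # assumption: adjust to the container's Python
# Executors unpack the archive under the alias "environment" in their working directory.
env["PYSPARK_PYTHON"] = "./environment/bin/python"

cmd = [
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "client",
    # Point the YARN application master at the packed interpreter as well.
    "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--archives", "/opt/program/pyspark_conda_env.tar.gz#environment",
    "/opt/program/image_analysis_launcher.py",
    "/tmp/spark_job_config.json",
]
subprocess.run(cmd, check=True, env=env)

If this works, the duplicate spark.yarn.dist.archives setting in the SparkSession builder is probably redundant, since --archives already ships the environment; the Spark configs spark.pyspark.python and spark.pyspark.driver.python can also be used in place of the environment variables.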