I am trying to run GeoSpark on an AWS EMR cluster. The code is:
# coding=utf-8
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
from geospark.register import GeoSparkRegistrator
from geospark.utils import GeoSparkKryoRegistrator
from geospark.register import upload_jars
import config as cf
import yaml
if __name__ == "__main__":
    # Read files
    with open("/tmp/param.yml", 'r') as ymlfile:
        param = yaml.load(ymlfile, Loader=yaml.SafeLoader)
    # Register jars
    upload_jars()
    # Creation of spark session
    print("Creating Spark session")
    spark = SparkSession \
        .builder \
        .getOrCreate()
    GeoSparkRegistrator.registerAll(spark)

In the upload_jars() function I get the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 143, in init
    py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "geo_processing.py", line 21, in <module>
    upload_jars()
  File "/usr/local/lib/python3.7/site-packages/geospark/register/uploading.py", line 39, in upload_jars
    findspark.init()
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 146, in init
    "Unable to find py4j, your SPARK_HOME may not be configured correctly"
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly

How can I fix this error?
Answered 2021-01-13 11:41:21
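Solution
Remove upload_jars() from your code and load the jars another way instead: either copy them to SPARK_HOME (/usr/lib/spark as of emr-4.0.0) as part of an EMR bootstrap action, or pass them with the --jars option in your spark-submit command.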
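As a rough illustration of the --jars route, the sketch below only assembles the spark-submit command line in Python. The helper name build_spark_submit and the jar paths are hypothetical placeholders, not part of GeoSpark or EMR; substitute the GeoSpark jars that match your Spark version. It assumes the job is submitted to YARN from the master node.

```python
import shlex


def build_spark_submit(script, jars, master="yarn", deploy_mode="client"):
    """Return the argv list for a spark-submit call that ships extra jars."""
    if not jars:
        raise ValueError("at least one jar path is required")
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--jars", ",".join(jars),  # --jars expects a comma-separated list
        script,
    ]


# Hypothetical jar locations, used here only for illustration.
jars = [
    "/home/hadoop/jars/geospark.jar",
    "/home/hadoop/jars/geospark-sql.jar",
]
cmd = build_spark_submit("geo_processing.py", jars)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

With the jars supplied this way, the upload_jars() call (and its import) can be dropped from geo_processing.py while GeoSparkRegistrator.registerAll(spark) stays as it is.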
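Explanation
I have not been able to get the upload_jars() function to work on a multi-node EMR cluster. According to the GeoSpark documentation, upload_jars()
"uses findspark Python package to upload jar files to executor and nodes. To avoid copying all the time, jar files can be put in directory SPARK_HOME/jars or any other path specified in Spark config files."
On EMR, Spark is installed in YARN mode, which means it is installed only on the master node and not on the core/task nodes. As a result, findspark cannot find Spark on the core/task nodes, and you get the error Unable to find py4j, your SPARK_HOME may not be configured correctly.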
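To see whether a given node would hit this error, a small check can mirror the glob shown in the traceback (python/lib/py4j-*.zip under the Spark directory). This is a diagnostic sketch, not findspark's own code: the function names are hypothetical, and the fallback path /usr/lib/spark is taken from the answer above as an assumption about EMR.

```python
import glob
import os


def find_py4j(spark_home):
    """Return the py4j zip files under <spark_home>/python/lib (may be empty)."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*.zip")
    return sorted(glob.glob(pattern))


def check_spark_home(spark_home):
    """Return (ok, message) describing whether py4j is discoverable."""
    if not spark_home:
        return False, "SPARK_HOME is not set"
    if not os.path.isdir(spark_home):
        return False, "no Spark directory at " + spark_home
    if not find_py4j(spark_home):
        return False, "no py4j-*.zip under " + os.path.join(spark_home, "python", "lib")
    return True, "py4j found under " + spark_home


if __name__ == "__main__":
    # Assumption: fall back to the EMR install path mentioned in the answer.
    ok, message = check_spark_home(os.environ.get("SPARK_HOME", "/usr/lib/spark"))
    print(message)
```

Running this on the master node and on a core/task node should show where the py4j archive is and is not present.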
https://stackoverflow.com/questions/63389319