I have been trying to convert a GeoPandas DataFrame to a PySpark DataFrame, without success. So far I have extended the DataFrame class to convert a GPD DF to Spark, as follows:
from pyspark.sql import DataFrame
from pyspark.sql.types import IntegerType, StringType, FloatType, BooleanType, DateType, TimestampType, StructField, StructType
!pip install geospark
from geospark.sql.types import GeometryType

class SPandas(DataFrame):
    def __init__(self, sqlC, objgpd):
        # Map each pandas dtype name to a Spark SQL type class
        esquema = dict(objgpd.dtypes)
        equivalencias = {'int64' : IntegerType, 'object' : StringType, 'float64' : FloatType,
                         'bool' : BooleanType, 'datetime64' : DateType,
                         'timedelta' : TimestampType, 'geometry' : GeometryType}
        for clave, valor in esquema.items():
            try:
                esquema[clave] = equivalencias[str(valor)]
            except KeyError:
                # Fall back to strings for any dtype not listed above
                esquema[clave] = StringType
        esquema = StructType([ StructField(v, esquema[v](), False) for v in esquema.keys() ])
        datos = sqlC.createDataFrame(objgpd, schema=esquema)
        super(self.__class__, self).__init__(datos._jdf, datos.sql_ctx)

The code above runs without errors, but when I try to 'take' an item from the DataFrame, I get the following error:
import geopandas as gpd

fp = "Paralela/Barrios/Barrios.shp"
map_df = gpd.read_file(fp)
mapa_sp = SPandas(sqlC, map_df)
mapa_sp.take(1)
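Py4JJavaError: An error occurred while calling o21.applySchemaToPythonRDD.
: java.lang.ClassNotFoundException: org.apache.spark.sql.geosparksql.UDT.GeometryUDT

The problem is the 'geometry' column of the GPD DF: without it, everything works flawlessly. The 'geometry' column holds Shapely Polygon objects, which should be recognized by GeoSpark's GeometryType class.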
Is there any way to install org.apache.spark.sql.geosparksql.UDT.GeometryUDT? I am using Google Colab.
Posted on 2020-06-15 10:29:23
You need to include the geospark dependency in your project and add the jar to the classpath of your runtime environment. The jar version below is compatible with spark-core_2.11:2.3.0:

<dependency>
    <groupId>org.datasyslab</groupId>
    <artifactId>geospark</artifactId>
    <version>1.3.1</version>
    <scope>provided</scope>
</dependency>

https://stackoverflow.com/questions/62385942
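In a notebook environment such as Google Colab there is no Maven project to edit, so one way to apply the dependency from the answer might be to let Spark fetch the jars itself through its `spark.jars.packages` setting. The following is a minimal sketch, not a confirmed fix. The helper names `geospark_packages` and `build_session` are hypothetical, and the second coordinate (`geospark-sql_2.3`) is an assumption: the missing `GeometryUDT` class appears to live in the GeoSpark SQL artifact rather than in the core `geospark` jar named in the answer.

```python
def geospark_packages(version="1.3.1", spark_minor="2.3"):
    """Build a comma-separated value for Spark's `spark.jars.packages`.

    Assumption: the core artifact comes from the answer's dependency, and
    the geospark-sql artifact (suffixed with the Spark minor version) is
    the one expected to contain GeometryUDT.
    """
    coords = [
        f"org.datasyslab:geospark:{version}",
        f"org.datasyslab:geospark-sql_{spark_minor}:{version}",
    ]
    return ",".join(coords)


def build_session(app_name="geospark-colab"):
    """Create a SparkSession that resolves the GeoSpark jars at startup."""
    # Imported lazily so geospark_packages() works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Must be set before the JVM starts; an already running session
        # will not pick up new packages.
        .config("spark.jars.packages", geospark_packages())
        .getOrCreate()
    )
```

Calling `spark = build_session()` in a fresh runtime, before any other Spark session exists, should put the jars on the classpath, provided the installed Spark version matches the artifacts (the answer mentions spark-core_2.11:2.3.0). The GeoSpark Python package also appears to provide `GeoSparkRegistrator.registerAll(spark)` in `geospark.register` for registering its SQL types and functions; treat that call as an assumption to verify against the installed version.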