I am trying to create a simple UDF that concatenates two strings with a separator.
def stringConcat(separator: str, first: str, second: str):
    return first + separator + second
spark.udf.register("stringConcat_udf", stringConcat)
customerDf.select("firstname", "lastname", stringConcat_udf(lit("-"), "firstname",
    "lastname")).show()
Here is the traceback:
An exception was thrown from the UDF: 'TypeError: decoding str is not supported'. The full traceback is shown below:
TypeError: decoding str is not supported
What is going on here?
Posted on 2022-06-05 18:14:49
First of all, PySpark already has a function called concat_ws (docs) that does exactly this:
from pyspark.sql import functions as fn
customerDf.select("firstname", "lastname", fn.concat_ws("-", "firstname", "lastname").alias("joined")).show()
However, if you still want to define this UDF: the result of spark.udf.register("stringConcat_udf", stringConcat) is not stored anywhere, which means the UDF is available in Spark SQL queries, but to use it with the PySpark DataFrame API you need to define it yourself (docs):
from pyspark.sql import functions as fn
from pyspark.sql.types import StringType
stringConcat_udf = fn.udf(stringConcat, StringType())
customerDf.select("firstname", "lastname", stringConcat_udf(fn.lit("-"), "firstname", "lastname").alias("joined")).show()
Posted on 2022-06-05 18:12:22
After registering the UDF, you can call it with expr. Try this:
customerDf.select("firstname", "lastname", expr('stringConcat_udf("-", firstname, lastname)'))
This works:
from pyspark.sql import functions as F
customerDf = spark.createDataFrame([('Tom', 'Hanks')], ["firstname", "lastname"])
def stringConcat(separator: str, first: str, second: str):
    return first + separator + second
spark.udf.register("stringConcat_udf", stringConcat)
df = customerDf.select("firstname", "lastname", F.expr('stringConcat_udf("-", firstname, lastname)'))
df.show()
# +---------+--------+----------------------------------------+
# |firstname|lastname|stringConcat_udf(-, firstname, lastname)|
# +---------+--------+----------------------------------------+
# | Tom| Hanks| Tom-Hanks|
# +---------+--------+----------------------------------------+
https://stackoverflow.com/questions/72508999