I need to compute an aggregation using the native R function IQR.
df1 <- SparkR::createDataFrame(iris)
df2 <- SparkR::agg(SparkR::groupBy(df1, "Species"),
  IQR_Sepal_Length = IQR(df1$Sepal_Length, na.rm = TRUE)
)
This returns:
Error in as.numeric(x) : cannot coerce type 'S4' to vector of type 'double'
How can I do this?
Posted on 2022-11-15 01:28:43
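This is exactly what gapply, dapply, and gapplyCollect were created for! The error occurs because df1$Sepal_Length is a SparkR Column (an S4 object), not a numeric vector, so base R's IQR cannot coerce it. With these functions you can run a user-defined R function in Spark. It will not run as efficiently as a native Spark function, but at least you get what you need.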
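To illustrate what such a user-defined function does for each group, here is a minimal base-R sketch that needs no Spark session. It only imitates the per-group call locally and is not how SparkR is implemented. The names iris_local and iqr_per_group are made up for this example, and the column renaming mirrors what SparkR::createDataFrame does to names containing a dot.

```r
# Local base-R sketch (no Spark required): imitate the per-group call made by gapply.
# iris_local and iqr_per_group are illustrative names, not part of SparkR.
iris_local <- iris
# SparkR::createDataFrame replaces "." in column names with "_"; do the same here.
names(iris_local) <- gsub(".", "_", names(iris_local), fixed = TRUE)

# Same shape as the function passed to gapply:
# key identifies the group, x is an ordinary data.frame holding that group's rows.
iqr_per_group <- function(key, x) {
  data.frame(
    Species = key[[1]],
    IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}

# Split by the grouping column, apply the function to each piece, bind the results.
groups <- split(iris_local, iris_local$Species)
local_result <- do.call(
  rbind,
  lapply(names(groups), function(k) iqr_per_group(list(k), groups[[k]]))
)
print(local_result)
```

Because x is a plain data.frame inside the function, x$Sepal_Length is a numeric vector and IQR works on it directly, which is the difference from passing a Spark Column to IQR inside agg.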
I suggest you start with gapplyCollect and then move on to gapply.
df1 <- SparkR::createDataFrame(iris)
# gapplyCollect does not require you to specify an output schema,
# but it collects all of the distributed results back to the driver node,
# so it is not efficient if you expect very large output
df2 <- SparkR::gapplyCollect(
  df1,
  c("Species"),
  function(key, x) {
    # x is a local R data.frame holding the rows of one group
    data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
  }
)
# This is how you do the same thing using gapply: specify the output schema
df3 <- SparkR::gapply(
  df1,
  c("Species"),
  function(key, x) {
    data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
  },
  schema = "Species STRING, IQR_Sepal_Length DOUBLE"
)
SparkR::head(df3)
https://stackoverflow.com/questions/73472574