I have the following Spark DataFrame:
agent_product_sale <- data.frame(agent = c('a','b','c','d','e','f','a','b','c','a','b'),
                                 product = c('P1','P2','P3','P4','P1','p1','p2','p2','P2','P3','P3'),
                                 sale_amount = c(1000,2000,3000,4000,1000,1000,2000,2000,2000,3000,3000))
RDD_aps <- createDataFrame(sqlContext, agent_product_sale)
agent product sale_amount
1 a P1 1000
2 b P1 1000
3 c P3 3000
4 d P4 4000
5 d P1 1000
6 c P1 1000
7 a P2 2000
8 b P2 2000
9 c P2 2000
10 a P4 4000
11     b      P3        3000

I need to group the Spark DataFrame by agent and find, for each agent, the product with the highest sale_amount:
agent most_expensive
a P4
b P3
c P3
d     P4

I used the following code, but it returns the maximum sale_amount for each agent instead of the product:
schema <- structType(structField("agent", "string"),
                     structField("max_sale_amount", "double"))
result <- gapply(
RDD_aps,
c("agent"),
function(key, x) {
y <- data.frame(key,max(x$sale_amount), stringsAsFactors = FALSE)
  }, schema)

Posted on 2016-09-06 15:29:50
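A minimal fix to the `gapply` approach above (a sketch, assuming the same SparkR 2.0 `gapply` signature used in the question) is to declare the result column as a string and have the per-group function return the product at the row with the maximal sale_amount:

```r
# Sketch: return the product name for the max sale, not the max amount itself.
schema <- structType(structField("agent", "string"),
                     structField("most_expensive", "string"))
result <- gapply(
  RDD_aps,
  c("agent"),
  function(key, x) {
    # which.max picks the index of the largest sale_amount in this group
    data.frame(key, x$product[which.max(x$sale_amount)],
               stringsAsFactors = FALSE)
  }, schema)
```

Note that if two products tie for the maximum, `which.max` keeps only the first one it encounters.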
ar1 <- arrange(RDD_aps, desc(RDD_aps$sale_amount))
collect(summarize(groupBy(ar1, ar1$agent), most_expensive = first(ar1$product)))

Posted on 2016-09-06 14:49:04
You can use tapply() or aggregate() to find the maximum value within each group:
agent_product_sale <- data.frame(agent = c('a','b','c','d','e','f','a','b','c','a','b'),
                                 product = c('P1','P2','P3','P4','P1','p1','p2','p2','P2','P3','P3'),
                                 sale_amount = c(1000,2000,3000,4000,1000,1000,2000,2000,2000,3000,3000))
tapply(agent_product_sale$sale_amount,agent_product_sale$agent, max)
a b c d e f
3000 3000 3000 4000 1000 1000
aggregate(agent_product_sale$sale_amount,by=list(agent_product_sale$agent), max)
Group.1 x
1 a 3000
2 b 3000
3 c 3000
4 d 4000
5 e 1000
6       f 1000

aggregate returns a data.frame and tapply returns an array; which one you use depends on how you prefer to process the result further.
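Neither call above returns the product, only the maximum amount. A base-R sketch that recovers the product as well (using the `agent_product_sale` data frame defined above) picks, within each group, the row index of the largest sale_amount and then subsets the original data frame:

```r
# For each agent, find the row index of its largest sale_amount,
# then look up the corresponding product.
idx <- tapply(seq_len(nrow(agent_product_sale)),
              agent_product_sale$agent,
              function(i) i[which.max(agent_product_sale$sale_amount[i])])
agent_product_sale[idx, c("agent", "product")]
```

As with the Spark version, ties are broken by keeping the first matching row.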
https://stackoverflow.com/questions/39341942