Hi, I'm getting an error with the following piece of code.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._
// Define a case class for the input data
case class Article(articleId: Int, title: String, url: String, publisher: String,
category: String, storyId: String, hostname: String, timestamp: String)
// Read the input data
val articles = spark.read.
schema(Encoders.product[Article].schema).
option("delimiter", ",").
csv("hdfs:///user/ashhall1616/bdc_data/t4/news-small.csv").
as[Article]
articles.createOrReplaceTempView("articles")
val writeDf = spark.sql("""SELECT articles.storyId AS storyId1, articles.publisher AS publisher1
FROM articles
GROUP BY storyId
ORDER BY publisher1 ASC""")

Error:
org.apache.spark.sql.AnalysisException: expression 'articles.`publisher`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Sort [publisher1#36 ASC NULLS FIRST], true
+- Aggregate [storyId#13], [storyId#13 AS storyId1#35, publisher#11 AS publisher1#36]
+- SubqueryAlias articles
      +- Relation[articleId#8,title#9,url#10,publisher#11,category#12,storyId#13,hostname#14,timestamp#15] csv

The dataset looks like this:
articleId | publisher | category | storyId | hostname
1 | Los Angeles Times | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com
The goal is to produce, for each story, a list pairing it with every publisher that wrote at least one article for that story:

ddUyU0VZz0BRneMioxUPQVP6sIxvM,Livemint
ddUyU0VZz0BRneMioxUPQVP6sIxvM,IFA Magazine
ddUyU0VZz0BRneMioxUPQVP6sIxvM,Moneynews
ddUyU0VZz0BRneMioxUPQVP6sIxvM,NASDAQ
dPhGU51DcrolUIMxbRm0InaHGA2XM,IFA Magazine
ddUyU0VZz0BRneMioxUPQVP6sIxvM,Los Angeles Times
dPhGU51DcrolUIMxbRm0InaHGA2XM,NASDAQ
Can someone suggest how to improve the code to get the desired output?
Posted on 2020-05-29 03:00:55
The parser/compiler is getting confused: your GROUP BY has no aggregate function, yet the SELECT list includes publisher, which is neither grouped nor aggregated. Since you only want distinct (storyId, publisher) pairs, use DISTINCT on storyId, publisher instead. If you do keep the GROUP BY, every non-aggregated column in the SELECT list (here publisher as well) must appear in it.
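A minimal sketch of the corrected query, assuming the same `articles` view is already registered (I haven't run this against your cluster, so treat it as a sketch):

```scala
// DISTINCT keeps one row per (storyId, publisher) pair,
// so no GROUP BY or aggregate function is needed.
val writeDf = spark.sql("""SELECT DISTINCT articles.storyId AS storyId1,
                                           articles.publisher AS publisher1
                           FROM articles
                           ORDER BY publisher1 ASC""")

// Equivalent DataFrame API version of the same query:
val writeDf2 = articles
  .select($"storyId".as("storyId1"), $"publisher".as("publisher1"))
  .distinct()
  .orderBy($"publisher1".asc)
```

Alternatively, as the error message itself suggests, you could keep the GROUP BY and wrap publisher in first() (or first_value), but that only returns one arbitrary publisher per story, which is not what your expected output shows.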
https://stackoverflow.com/questions/62071847