Hi, I'm getting an error with the following piece of code.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._
// Define a case class for the input data
case class Article(articleId: Int, title: String, url: String, publisher: String,
category: String, storyId: String, hostname: String, timestamp: String)
// Read the input data
val articles = spark.read.
schema(Encoders.product[Article].schema).
option("delimiter", ",").
csv("hdfs:///user/ashhall1616/bdc_data/t4/news-small.csv").
as[Article]
articles.createOrReplaceTempView("articles")
val writeDf = spark.sql("""SELECT articles.storyId AS storyId1, articles.publisher AS publisher1
FROM articles
GROUP BY storyId
ORDER BY publisher1 ASC""")

Error:
org.apache.spark.sql.AnalysisException: expression 'articles.`publisher`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Sort [publisher1#36 ASC NULLS FIRST], true
+- Aggregate [storyId#13], [storyId#13 AS storyId1#35, publisher#11 AS publisher1#36]
+- SubqueryAlias articles
      +- Relation[articleId#8,title#9,url#10,publisher#11,category#12,storyId#13,hostname#14,timestamp#15] csv

The dataset looks like this:
articleId | publisher | category | storyId | hostname
1 | Los Angeles Times | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com
The goal is to produce, for each story, a list pairing it with every publisher that wrote at least one article for that story:

ddUyU0VZz0BRneMioxUPQVP6sIxvM,Livemint
ddUyU0VZz0BRneMioxUPQVP6sIxvM,IFA Magazine
ddUyU0VZz0BRneMioxUPQVP6sIxvM,Moneynews
ddUyU0VZz0BRneMioxUPQVP6sIxvM,NASDAQ
dPhGU51DcrolUIMxbRm0InaHGA2XM,IFA Magazine
ddUyU0VZz0BRneMioxUPQVP6sIxvM,Los Angeles Times
dPhGU51DcrolUIMxbRm0InaHGA2XM,NASDAQ
Can someone suggest how to improve the code to get the desired output?
Posted on 2020-05-29 03:00:55
The parser/compiler is getting confused: your GROUP BY has no aggregate function, yet the SELECT list includes publisher, which is neither grouped nor aggregated. Since you only want distinct (storyId, publisher) pairs, use DISTINCT on storyId, publisher instead. If you do keep the GROUP BY, every non-aggregated column in the SELECT list (here publisher as well) must appear in it.
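A minimal sketch of the corrected query, assuming the same `articles` view is already registered (I haven't run this against your cluster, so treat it as a sketch):

```scala
// DISTINCT keeps one row per (storyId, publisher) pair,
// so no GROUP BY or aggregate function is needed.
val writeDf = spark.sql("""SELECT DISTINCT articles.storyId AS storyId1,
                                           articles.publisher AS publisher1
                           FROM articles
                           ORDER BY publisher1 ASC""")

// Equivalent DataFrame API version of the same query:
val writeDf2 = articles
  .select($"storyId".as("storyId1"), $"publisher".as("publisher1"))
  .distinct()
  .orderBy($"publisher1".asc)
```

Alternatively, as the error message itself suggests, you could keep the GROUP BY and wrap publisher in first() (or first_value), but that only returns one arbitrary publisher per story, which is not what your expected output shows.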
https://stackoverflow.com/questions/62071847