
Spark-SQL query

Stack Overflow user
Asked 2020-05-29 02:22:30
1 answer · 37 views · 0 followers · 0 votes

Hi, I am getting an error with the following piece of code.

Code language: scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._


// Define a case class for the input data
case class Article(articleId: Int, title: String, url: String, publisher: String,
                   category: String, storyId: String, hostname: String, timestamp: String)
// Read the input data
val articles = spark.read.
  schema(Encoders.product[Article].schema).
  option("delimiter", ",").
  csv("hdfs:///user/ashhall1616/bdc_data/t4/news-small.csv").
  as[Article]

articles.createOrReplaceTempView("articles")

val writeDf = spark.sql("""SELECT articles.storyId AS storyId1, articles.publisher AS publisher1 
FROM articles
GROUP BY storyId
ORDER BY publisher1 ASC""")

Error:

val writeDf = spark.sql("""SELECT articles.storyId AS storyId1, articles.publisher AS publisher1 
     | FROM articles
     | GROUP BY storyId
     | ORDER BY publisher1 ASC""")
org.apache.spark.sql.AnalysisException: expression 'articles.`publisher`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Sort [publisher1#36 ASC NULLS FIRST], true
+- Aggregate [storyId#13], [storyId#13 AS storyId1#35, publisher#11 AS publisher1#36]
   +- SubqueryAlias articles
      +- Relation[articleId#8,title#9,url#10,publisher#11,category#12,storyId#13,hostname#14,timestamp#15] csv
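
The exception is Spark SQL's standard complaint: `publisher` appears in the SELECT list of a grouped query but is neither listed in the GROUP BY nor wrapped in an aggregate. As the message itself suggests, wrapping it in `first()` would make the query compile, though that keeps only one arbitrary publisher per story; a sketch of that variant (not the desired query) against the same `articles` view:

```scala
// Compiles, but returns a single arbitrary publisher per storyId --
// not the full story/publisher pairing this question is after.
val oneDf = spark.sql("""
  SELECT storyId AS storyId1, first(publisher) AS publisher1
  FROM articles
  GROUP BY storyId
  ORDER BY publisher1 ASC""")
```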

The dataset looks like this:

articleId | publisher | category | storyId | hostname

1 | Los Angeles Times | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com

The goal is to create a list of each story, paired with every publisher that has written at least one article for that story.

ddUyU0VZz0BRneMioxUPQVP6sIxvM,Livemint

ddUyU0VZz0BRneMioxUPQVP6sIxvM,IFA Magazine

ddUyU0VZz0BRneMioxUPQVP6sIxvM,Moneynews

ddUyU0VZz0BRneMioxUPQVP6sIxvM,NASDAQ

dPhGU51DcrolUIMxbRm0InaHGA2XM,IFA Magazine

ddUyU0VZz0BRneMioxUPQVP6sIxvM,Los Angeles Times

dPhGU51DcrolUIMxbRm0InaHGA2XM,NASDAQ

Can anyone suggest how to improve the code to get the desired output?


1 Answer

Stack Overflow user

Answer accepted

Answered 2020-05-29 03:00:55

The parser/compiler got confused.

Your GROUP BY has no aggregate. Use DISTINCT on storyId, publisher.
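
A minimal sketch of that suggestion against the `articles` view from the question: DISTINCT deduplicates the (storyId, publisher) pairs, so no GROUP BY or aggregate is needed.

```scala
// DISTINCT keeps each (storyId, publisher) pair once; ORDER BY sorts by publisher.
val writeDf = spark.sql("""
  SELECT DISTINCT storyId AS storyId1, publisher AS publisher1
  FROM articles
  ORDER BY publisher1 ASC""")
```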

Check whether you still need storyId1 on the GROUP BY as well.

0 votes
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/62071847
