Change schema of a Spark DataFrame column

Stack Overflow user

Asked 2020-06-07 17:24:50

1 answer, 3.2K views, 0 followers, 0 votes

I have a PySpark DataFrame with a "Student" column.

One entry of the data looks like this:

{
   "Student" : {
       "m" : {
           "name" : {"s" : "john"},
           "score": {"s" : "165"}
       }
   }
}

I want to change the schema of this column so that the entry looks like this:

{
    "Student" : 
    {
        "m" : 
        {
            "StudentDetails" : 
            {
                "m" : 
                {
                    "name" : {"s" : "john"},
                    "score": {"s" : "165"}
                }
            }
        }
    } 
}

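The problem is that the Student field can also be null in the DataFrame. I want to keep the null values as they are and change the schema only for the non-null values. I used a UDF for this, and it works.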

def Helper_ChangeSchema(row):
    # Null check: keep null students as null
    if row is None:
        return None
    # Change schema: nest the original struct under m.StudentDetails
    data = row.asDict(True)
    return {"m": {"StudentDetails": data}}

But a UDF is a black box to Spark. Is there any way to do the same thing using built-in Spark functions or SQL queries?

python

dataframe

apache-spark

pyspark

apache-spark-sql

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-06-07 20:37:45

It works exactly as in this answer. Just add another level of nesting to the struct:

Either as a SQL expression:

processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))
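If more nesting levels are needed, the SQL expression above can be generated rather than typed by hand. The following is only a minimal sketch: nested_named_struct is a hypothetical helper, not part of PySpark, and it assumes the column and field names are plain identifiers that need no quoting or escaping.

```python
def nested_named_struct(column, path):
    """Build a Spark SQL expression string that wraps `column` in nested structs.

    `path` lists the struct field names from the outermost to the innermost.
    Hypothetical helper for illustration; names are assumed to need no escaping.
    """
    expr = column
    # Wrap from the inside out, so the last name in `path` ends up innermost.
    for field in reversed(path):
        expr = f"named_struct('{field}', {expr})"
    return expr


# Reproduces the expression string used in the answer.
print(nested_named_struct("student", ["m", "student_details"]))
```

The resulting string can then be passed to F.expr in the same way as the hand-written expression above.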

Or in Python code using the struct function:

processedDf = df.withColumn("student", F.struct(F.struct(F.col("student").alias('student_details')).alias('m')))

Both versions produce the same schema:

root
 |-- student: struct (nullable = false)
 |    |-- m: struct (nullable = false)
 |    |    |-- student_details: struct (nullable = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)
 |    |    |    |    |-- score: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)

Both approaches also work for rows where student is null. With this input data

data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))

processedDf.show(truncate=False) prints

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]]                 |
+---------------------+

EDIT: if the whole column value should be set to null instead of only the innermost field of the struct, you can add a when:

processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student").alias('student_details')).alias('m'))))

This results in the same schema, but in a different output for the null row:

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|null                 |
+---------------------+
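The difference between the two outputs can be modelled with plain Python dictionaries. This is only an illustrative sketch of the null semantics, not Spark code; the function names wrap_fields_null and wrap_row_null are made up for this example.

```python
def wrap_fields_null(student):
    # Models the plain struct(...) version: the outer structs are always
    # built, so a null input becomes a null innermost field.
    return {"m": {"student_details": student}}


def wrap_row_null(student):
    # Models the when(...).otherwise(...) version: a null input stays null
    # at the top level, and only non-null values are wrapped.
    if student is None:
        return None
    return {"m": {"student_details": student}}


john = {"m": {"name": {"s": "john"}, "score": {"s": "165"}}}

print(wrap_fields_null(None))  # outer structs exist, innermost value is None
print(wrap_row_null(None))     # the whole value is None
print(wrap_fields_null(john) == wrap_row_null(john))  # identical for non-null input
```

The second function has the same null handling as the UDF in the question, which returns None when the input row is None.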

Votes: 1

The original page content is provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/62249074
