Parsing an RDD containing JSON data

Stack Overflow user
Asked 2017-12-11 01:37:00
1 answer · 3.1K views · 0 followers · 1 vote

I have a JSON file containing the following data:

{"year":"2016","category":"physics","laureates":[{"id":"928","firstname":"David J.","surname":"Thouless","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"2"},{"id":"929","firstname":"F. Duncan M.","surname":"Haldane","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"4"},{"id":"930","firstname":"J. Michael","surname":"Kosterlitz","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"4"}]}
{"year":"2016","category":"chemistry","laureates":[{"id":"931","firstname":"Jean-Pierre","surname":"Sauvage","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"},{"id":"932","firstname":"Sir J. Fraser","surname":"Stoddart","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"},{"id":"933","firstname":"Bernard L.","surname":"Feringa","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"}]}

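I need to return an RDD of key-value pairs, with the category as the key and the list of Nobel laureates' surnames as the value. How can I do this using transformations?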

For the given dataset, the result should be:

"physics"-"Thouless","Haldane","Kosterlitz"
"chemistry"-"Sauvage","Stoddart","Feringa"

1 Answer

Stack Overflow user

Accepted answer

Posted 2017-12-11 02:37:47

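Are you restricted to RDDs? If you can use DataFrames, loading the file is straightforward: you get a schema, explode the nested field, group by category, and then use RDDs for the rest. Here is one way to do it.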

Load the JSON into a DataFrame and confirm the schema:

>>> nobelDF = spark.read.json('/user/cloudera/nobel.json')
>>> nobelDF.printSchema()
root
 |-- category: string (nullable = true)
 |-- laureates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstname: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- motivation: string (nullable = true)
 |    |    |-- share: string (nullable = true)
 |    |    |-- surname: string (nullable = true)
 |-- year: string (nullable = true)

Now you can explode the nested array and convert the result to an RDD that can be grouped:

from pyspark.sql.functions import explode
nobelRDD = nobelDF.select('category', explode('laureates.surname')).rdd

For reference, the exploded DataFrame looks like this:

+---------+----------+
| category|       col|
+---------+----------+
|  physics|  Thouless|
|  physics|   Haldane|
|  physics|Kosterlitz|
|chemistry|   Sauvage|
|chemistry|  Stoddart|
|chemistry|   Feringa|
+---------+----------+
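
To make the explode step concrete, here is a rough plain-Python equivalent that needs no Spark. It is only an illustration: `sample_lines` and `explode_surnames` are hypothetical names, and the sample records are trimmed to the two fields this step uses.

```python
import json

# Sample records trimmed to the fields this step needs (an assumption for brevity).
sample_lines = [
    '{"category": "physics", "laureates": [{"surname": "Thouless"}, '
    '{"surname": "Haldane"}, {"surname": "Kosterlitz"}]}',
    '{"category": "chemistry", "laureates": [{"surname": "Sauvage"}, '
    '{"surname": "Stoddart"}, {"surname": "Feringa"}]}',
]

def explode_surnames(record):
    """Yield one (category, surname) row per laureate, as explode does per array element."""
    for laureate in record["laureates"]:
        yield (record["category"], laureate["surname"])

# Flatten every record into its exploded rows.
rows = [row
        for line in sample_lines
        for row in explode_surnames(json.loads(line))]

for category, surname in rows:
    print(category, surname)
```

Each input record with three laureates becomes three rows, which is why the table above has six rows for two input lines.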

Now group by category:

from pyspark.sql.functions import collect_list, explode
# explode yields one row per surname; collect_list regroups them per category
nobelRDD = nobelDF.select('category', explode('laureates.surname')).groupBy('category').agg(collect_list('col').alias('sn')).rdd
nobelRDD.collect()

You now have an RDD with the content you need, although each element is still a Row object (I added line breaks to show each row in full).

>>> for n in nobelRDD.collect():
...     print(n)
...
Row(category=u'chemistry', sn=[u'Sauvage', u'Stoddart', u'Feringa'])
Row(category=u'physics', sn=[u'Thouless', u'Haldane', u'Kosterlitz'])
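
Row fields can be read by name as well as by position, which may be easier to follow than numeric indexes. The sketch below uses a `namedtuple` called `FakeRow` as a stand-in for `pyspark.sql.Row` so that it runs without Spark; `FakeRow` and `row_to_pair` are hypothetical names, not part of the original answer.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption: used only so the sketch runs without Spark).
FakeRow = namedtuple("FakeRow", ["category", "sn"])

def row_to_pair(row):
    """Turn a Row-like object into a (category, [surnames]) tuple using field names."""
    return (row.category, list(row.sn))

rows = [
    FakeRow("chemistry", ["Sauvage", "Stoddart", "Feringa"]),
    FakeRow("physics", ["Thouless", "Haldane", "Kosterlitz"]),
]
pairs = [row_to_pair(r) for r in rows]
print(pairs)
```

With a real RDD of Row objects, the same function could be passed to a map, for example `nobelRDD.map(row_to_pair)`.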

A simple map then turns each Row into a tuple (I added line breaks to show each row in full):

>>> nobelRDD.map(lambda x: (x[0],x[1])).collect()
[(u'chemistry', [u'Sauvage', u'Stoddart', u'Feringa']), 
 (u'physics', [u'Thouless', u'Haldane', u'Kosterlitz'])]
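
If you are restricted to RDDs, a sketch along the following lines might work instead. It assumes one JSON object per line, as in the sample file, and an existing SparkContext passed in as `sc`. The names `category_surnames` and `nobel_pairs` are hypothetical and not from the original answer.

```python
import json

def category_surnames(line):
    """Parse one JSON line into a (category, [surname, ...]) pair."""
    record = json.loads(line)
    surnames = [laureate["surname"] for laureate in record.get("laureates", [])]
    return (record["category"], surnames)

def nobel_pairs(sc, path):
    """Build the pair RDD with plain RDD transformations.

    `sc` is assumed to be an existing SparkContext and `path` a file
    with one JSON object per line.
    """
    return (sc.textFile(path)
              .filter(lambda line: line.strip())   # skip blank lines
              .map(category_surnames)              # (category, [surnames])
              .reduceByKey(lambda a, b: a + b))    # merge lists when a category repeats

# The parsing step can be checked locally without Spark:
sample = ('{"year":"2016","category":"physics","laureates":'
          '[{"surname":"Thouless"},{"surname":"Haldane"},{"surname":"Kosterlitz"}]}')
print(category_surnames(sample))
```

The `reduceByKey` step only matters when the same category appears on several lines, for example across different years. The order of the merged surnames is then not guaranteed.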
3 votes

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/47741565
