PySpark: iterating over multi-line nested JSON to build a DataFrame
Stack Overflow user
Asked 2020-12-06 08:07:39
Answers: 2 · Views: 341 · Followers: 0 · Votes: 1

Folks, I need some help iterating over the JSON below in PySpark and building a DataFrame from it:

{
    "success": true,
    "result": {
        "0x00e01a648ff41346cdeb873182383333d2184dd1": {
            "id": 130,
            "name": "xn--mytherwallet-fvb.com",
            "url": "http://xn--mytherwallet-fvb.com",
            "coin": "ETH",
            "category": "Phishing",
            "subcategory": "MyEtherWallet",
            "description": "Homoglyph",
            "addresses": [
                "0x00e01a648ff41346cdeb873182383333d2184dd1",
                "0x11e01a648ff41346cdeb873182383333d2184dd1"
            ],
            "reporter": "MyCrypto",
            "status": "Offline"
        },
        "0x858457daa7e087ad74cdeeceab8419079bc2ca03": {
            "id": 1200,
            "name": "myetherwallet.in",
            "url": "http://myetherwallet.in",
            "coin": "ETH",
            "category": "Phishing",
            "subcategory": "MyEtherWallet",
            "addresses": ["0x858457daa7e087ad74cdeeceab8419079bc2ca03"],
            "reporter": "MyCrypto",
            "ip": "159.8.210.35",
            "nameservers": [
                "ns2.eftydns.com",
                "ns1.eftydns.com"
            ],
            "status": "Active"
        }
    }
}

I need to build a DataFrame that contains the list of addresses.

2 Answers

Stack Overflow user

Answered 2020-12-06 15:20:51

I reformatted your JSON into a Spark-readable form, with the whole document on a single line:

{"success": true, "result": {"0x00e01a648ff41346cdeb873182383333d2184dd1": {"id": 130, "name": "xn--mytherwallet-fvb.com", "url": "http://xn--mytherwallet-fvb.com", "coin": "ETH", "category": "Phishing", "subcategory": "MyEtherWallet", "description": "Homoglyph", "addresses": ["0x00e01a648ff41346cdeb873182383333d2184dd1", "0x11e01a648ff41346cdeb873182383333d2184dd1"], "reporter": "MyCrypto", "status": "Offline"}, "0x858457daa7e087ad74cdeeceab8419079bc2ca03": {"id": 1200, "name": "myetherwallet.in", "url": "http://myetherwallet.in", "coin": "ETH", "category": "Phishing", "subcategory": "MyEtherWallet", "addresses": ["0x858457daa7e087ad74cdeeceab8419079bc2ca03"], "reporter": "MyCrypto", "ip": "159.8.210.35", "nameservers": ["ns2.eftydns.com", "ns1.eftydns.com"], "status": "Active"}}}
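If the file has to be compacted programmatically rather than by hand, a minimal plain-Python sketch could look like the following. The function name `compact_json` and both paths are hypothetical, and the sketch assumes the document is small enough to load into memory.

```python
import json


def compact_json(src_path: str, dst_path: str) -> None:
    """Rewrite a pretty-printed JSON document as a single line.

    Hypothetical helper: loads the whole document into memory, so it is
    only suitable for files that fit comfortably in RAM.
    """
    with open(src_path, encoding="utf-8") as src:
        document = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        # Without indent=, json.dump emits the document on one line.
        json.dump(document, dst)
        dst.write("\n")
```

Calling `compact_json("pretty.json", "my_data.json")` would produce a one-line file of the kind shown above.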

Read the JSON (the snippet below is Scala):

val df = spark.read.json("/my_data.json")

df.printSchema()
df.show(false)

Output:

root
 |-- result: struct (nullable = true)
 |    |-- 0x00e01a648ff41346cdeb873182383333d2184dd1: struct (nullable = true)
 |    |    |-- addresses: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- coin: string (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- reporter: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subcategory: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |    |-- 0x858457daa7e087ad74cdeeceab8419079bc2ca03: struct (nullable = true)
 |    |    |-- addresses: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- coin: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- ip: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- nameservers: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- reporter: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subcategory: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- success: boolean (nullable = true)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                     |success|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|[[WrappedArray(0x00e01a648ff41346cdeb873182383333d2184dd1, 0x11e01a648ff41346cdeb873182383333d2184dd1),Phishing,ETH,Homoglyph,130,xn--mytherwallet-fvb.com,MyCrypto,Offline,MyEtherWallet,http://xn--mytherwallet-fvb.com],[WrappedArray(0x858457daa7e087ad74cdeeceab8419079bc2ca03),Phishing,ETH,1200,159.8.210.35,myetherwallet.in,WrappedArray(ns2.eftydns.com, ns1.eftydns.com),MyCrypto,Active,MyEtherWallet,http://myetherwallet.in]]|true   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
Votes: 1

Stack Overflow user

Answered 2020-12-06 15:29:04

In PySpark you can do the following. There is no need to reformat your JSON: it is already well-formed, and you only need to pass .option('multiline', True) to the JSON reader.

df = spark.read.option('multiline', True).json('test.json')
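Why the option matters: by default Spark's JSON source expects one complete JSON object per line (JSON Lines), so a pretty-printed file does not parse line by line and typically ends up in a `_corrupt_record` column. The plain-Python sketch below only mimics that per-line rule to illustrate the difference. It is not Spark's parser, and the helper name `parses_as_json_lines` is hypothetical.

```python
import json


def parses_as_json_lines(text: str) -> bool:
    """Return True if every non-empty line is a standalone JSON object."""
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return False
        # A bare string or number on a line is valid JSON but not a record.
        if not isinstance(record, dict):
            return False
    return True


doc = {"success": True, "result": {"0xabc": {"addresses": ["0xabc"]}}}
pretty = json.dumps(doc, indent=4)   # multi-line, like the question's file
compact = json.dumps(doc)            # single line, like the first answer's file

print(parses_as_json_lines(pretty))
print(parses_as_json_lines(compact))
```

The pretty-printed text fails the per-line check because its first line is just `{`, while the compact text passes. With `multiline` enabled, Spark instead parses each file as one JSON document.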

To get the addresses:

import pyspark.sql.functions as F

df2 = df.select('result.*')
df3 = df2.select(
    F.explode(
        F.concat(
            *[F.col(f'{col}.addresses') for col in df2.columns]
        )
    ).alias('addresses')
)

df3.show(truncate=False)
+------------------------------------------+
|addresses                                 |
+------------------------------------------+
|0x00e01a648ff41346cdeb873182383333d2184dd1|
|0x11e01a648ff41346cdeb873182383333d2184dd1|
|0x858457daa7e087ad74cdeeceab8419079bc2ca03|
+------------------------------------------+
Votes: 1
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/65163318