Q: Removing duplicates from a list of dict elements (created from Twitter JSON objects)

Stack Overflow user

Asked 2021-12-20 16:17:34

5 answers, 116 views, 0 followers, score -3

I have downloaded Twitter user objects.

Here is an example of one object:

{
    "id": 6253282,
    "id_str": "6253282",
    "name": "Twitter API",
    "screen_name": "TwitterAPI",
    "location": "San Francisco, CA",
    "profile_location": null,
    "description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    "url": "https:\/\/t.co\/8IkCzCDr19",
    "entities": {
        "url": {
            "urls": [{
                "url": "https:\/\/t.co\/8IkCzCDr19",
                "expanded_url": "https:\/\/developer.twitter.com",
                "display_url": "developer.twitter.com",
                "indices": [
                    0,
                    23
                ]
            }]
        },
        "description": {
            "urls": []
        }
    },
    "protected": false,
    "followers_count": 6133636,
    "friends_count": 12,
    "listed_count": 12936,
    "created_at": "Wed May 23 06:01:13 +0000 2007",
    "favourites_count": 31,
    "utc_offset": null,
    "time_zone": null,
    "geo_enabled": null,
    "verified": true,
    "statuses_count": 3656,
    "lang": null,
    "contributors_enabled": null,
    "is_translator": null,
    "is_translation_enabled": null,
    "profile_background_color": null,
    "profile_background_image_url": null,
    "profile_background_image_url_https": null,
    "profile_background_tile": null,
    "profile_image_url": null,
    "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    "profile_banner_url": null,
    "profile_link_color": null,
    "profile_sidebar_border_color": null,
    "profile_sidebar_fill_color": null,
    "profile_text_color": null,
    "profile_use_background_image": null,
    "has_extended_profile": null,
    "default_profile": false,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null,
    "translator_type": null
}

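Somehow the download contains many duplicates; perhaps the input file had duplicate values.

This is the layout of the downloaded Twitter file, which I named rawjson: { user-object }{ user-object }{ user-object }

So I ended up with a 16 GB user file that contains duplicate values. I need to remove the duplicate users.

This is what I have done so far: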

import os

def twitterToListJsonMethodTwo(self, rawjson, twitterToListJson):
    # Delete the old output file, if any
    if os.path.exists(twitterToListJson):
        try:
            os.remove(twitterToListJson)
        except OSError:
            pass
    counter = 1
    objc = 1
    with open(rawjson, encoding='utf8') as fin, open(twitterToListJson, 'w', encoding='utf8') as fout:
        for line in fin:
            # A line that is exactly '}{' plus a newline separates two user objects
            if line.find('}{') != -1 and len(line) == 3:
                objc = objc + 1
                fout.write(line.replace('}{', '},\n{'))
            else:
                fout.write(line)
            counter = counter + 1
            # print(counter)
        print("Process Complete: Twitter object to Total lines: ", counter)

    # Run the next step only after the output file has been closed
    self.twitterToListJsonMethodOne(twitterToListJson)

The output file now looks like this:

[
    {user-object},
    {user-object},
    {user-object} 
]

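Each user-object is a dict, but I could not find a way to remove the duplicates; all the tutorials and solutions I found are for small objects and small lists. I am not very good at Python, and I need a reasonably optimal solution: the file is so large that memory may be a problem.

Each user object looks like the example above and has a unique id and screen_name.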

python

json

twitter

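5 Answers

Stack Overflow user

Answered 2021-12-20 16:38:43

To process huge JSON datasets, especially long lists of objects, it is best to use a JSON streaming library such as json-stream (https://github.com/daggaz/json-stream). Read the user objects one at a time and add each one to the result only if that user has not been seen before.

Example: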

import json_stream

unique_users = []
seen_users = set()  # ids of the users collected so far
with open('input.json') as f:
    js = json_stream.load(f)
    for us in js:
        # Read the whole user object before looking at its id
        user = dict(us.items())
        if user['id'] not in seen_users:
            unique_users.append(user)
            seen_users.add(user['id'])

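The reason for user = dict(us.items()) is that if we looked up id in the object directly through the stream, we could no longer go back and read the rest of the object. So we need to materialize each user object first and only then check its id.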
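Note that the example above still keeps every unique user in unique_users. If that list is itself too large for memory, one option is to write each new user to disk as soon as it is seen and keep only the set of ids in memory. The following is a minimal sketch of that idea using only the standard library; the function name write_unique_users and the one-object-per-line output format are assumptions made for illustration, not part of the original answer. It expects fully materialized dicts, such as the user values built in the loop above.

```python
import io
import json


def write_unique_users(users, out):
    """Write each user whose id has not been seen yet to `out`, one JSON object per line.

    `users` is any iterable of plain dicts; `out` is a writable text file object.
    Only the set of ids is kept in memory. Returns the number of users written.
    """
    seen_ids = set()
    written = 0
    for user in users:
        if user["id"] in seen_ids:
            continue  # duplicate: skip it
        seen_ids.add(user["id"])
        out.write(json.dumps(user, ensure_ascii=False) + "\n")
        written += 1
    return written


# Small in-memory demonstration with made-up users.
sample = [
    {"id": 1, "screen_name": "a"},
    {"id": 2, "screen_name": "b"},
    {"id": 1, "screen_name": "a"},
]
buffer = io.StringIO()
count = write_unique_users(sample, buffer)
```

With a real file, out would be something like open('unique_users.jsonl', 'w', encoding='utf8') (a hypothetical file name). The set of ids still grows with the number of distinct users, but an integer id is far smaller than a full user object.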

Score: 1

Stack Overflow user

Answered 2021-12-20 16:29:11

You can modify merge sort so that it drops duplicates while merging, which removes them in O(n log n) time.
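The answer gives no code, so the following is only one possible reading of it: an in-memory merge sort whose merge step keeps a single item when both sides carry the same key. The names merge_unique and sort_unique are hypothetical. Because it works on a list held in memory, a 16 GB file would need an external (chunked, on-disk) variant of the same idea, which is not shown here.

```python
def merge_unique(left, right, key):
    """Merge two sorted, duplicate-free lists, keeping one item per key."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        key_left, key_right = key(left[i]), key(right[j])
        if key_left < key_right:
            merged.append(left[i])
            i += 1
        elif key_left > key_right:
            merged.append(right[j])
            j += 1
        else:
            # Same key on both sides: keep the left copy, drop the right one.
            merged.append(left[i])
            i += 1
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def sort_unique(items, key):
    """Return the items sorted by `key`, with duplicate keys removed."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge_unique(sort_unique(items[:mid], key),
                        sort_unique(items[mid:], key),
                        key)


users = [{"id": 3}, {"id": 1}, {"id": 3}, {"id": 2}, {"id": 1}]
result = sort_unique(users, key=lambda user: user["id"])
```

Unlike the set-based approaches, this returns the users sorted by id rather than in their original order.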

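Score: 0

Stack Overflow user

Answered 2021-12-20 16:32:30

Use ijson, as in the example the original answer links to.

Create a set that will hold the item ids.

If an item's id is already in the set, drop the item; otherwise, collect it.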
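The steps above can be sketched roughly as follows. This assumes the input is the bracketed list format shown in the question (a top-level JSON array of user objects) and that the third-party ijson package is installed; the function names are hypothetical. ijson.items(f, 'item') yields the elements of a top-level array one at a time, so only one user plus the set of ids is held in memory.

```python
def unique_by_id(users):
    """Yield each user whose id has not been seen before; skip the rest."""
    seen_ids = set()
    for user in users:
        if user["id"] not in seen_ids:
            seen_ids.add(user["id"])
            yield user


def unique_users_from_file(path):
    """Stream a top-level JSON array of users from `path`, yielding unique ones."""
    import ijson  # third-party; imported here so the filter above works without it

    with open(path, "rb") as f:
        # The prefix 'item' selects each element of the top-level array.
        yield from unique_by_id(ijson.items(f, "item"))


# The filter itself can be tried on any iterable of dicts.
demo = list(unique_by_id([
    {"id": 6253282, "screen_name": "TwitterAPI"},
    {"id": 42, "screen_name": "example_user"},
    {"id": 6253282, "screen_name": "TwitterAPI"},
]))
```

Keeping the filter separate from the file reading means the same unique_by_id generator can be fed from any streaming parser.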

Score: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/70424815
