I have downloaded Twitter user objects.
Here is an example of one object:
{
  "id": 6253282,
  "id_str": "6253282",
  "name": "Twitter API",
  "screen_name": "TwitterAPI",
  "location": "San Francisco, CA",
  "profile_location": null,
  "description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
  "url": "https:\/\/t.co\/8IkCzCDr19",
  "entities": {
    "url": {
      "urls": [{
        "url": "https:\/\/t.co\/8IkCzCDr19",
        "expanded_url": "https:\/\/developer.twitter.com",
        "display_url": "developer.twitter.com",
        "indices": [
          0,
          23
        ]
      }]
    },
    "description": {
      "urls": []
    }
  },
  "protected": false,
  "followers_count": 6133636,
  "friends_count": 12,
  "listed_count": 12936,
  "created_at": "Wed May 23 06:01:13 +0000 2007",
  "favourites_count": 31,
  "utc_offset": null,
  "time_zone": null,
  "geo_enabled": null,
  "verified": true,
  "statuses_count": 3656,
  "lang": null,
  "contributors_enabled": null,
  "is_translator": null,
  "is_translation_enabled": null,
  "profile_background_color": null,
  "profile_background_image_url": null,
  "profile_background_image_url_https": null,
  "profile_background_tile": null,
  "profile_image_url": null,
  "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
  "profile_banner_url": null,
  "profile_link_color": null,
  "profile_sidebar_border_color": null,
  "profile_sidebar_fill_color": null,
  "profile_text_color": null,
  "profile_use_background_image": null,
  "has_extended_profile": null,
  "default_profile": false,
  "default_profile_image": false,
  "following": null,
  "follow_request_sent": null,
  "notifications": null,
  "translator_type": null
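}
For some reason the data contains many duplicates; perhaps the input file had duplicate values.
This is the layout of the downloaded Twitter file, which I named rawjson: { user-object }{ user-object }{ user-object }
So I ended up with a 16 GB user file that contains duplicate values. I need to remove the duplicate users.
This is what I have done so far: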
import os

def twitterToListJsonMethodTwo(self, rawjson, twitterToListJson):
    # Delete the old output file, if there is one
    if os.path.exists(twitterToListJson):
        try:
            os.remove(twitterToListJson)
        except OSError:
            pass
    counter = 1
    objc = 1
    with open(rawjson, encoding='utf8') as fin, open(twitterToListJson, 'w', encoding='utf8') as fout:
        for line in fin:
            # A line that is exactly '}{' plus a newline marks the boundary between two objects
            if line.find('}{') != -1 and len(line) == 3:
                objc = objc + 1
                fout.write(line.replace('}{', '},\n{'))
            else:
                fout.write(line)
            counter = counter + 1
            # print(counter)
    print("Process Complete: Twitter object to Total lines: ", counter)
    self.twitterToListJsonMethodOne(twitterToListJson)

The output file now looks like this:
[
{user-object},
{user-object},
{user-object}
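]
Each user-object is a dict, but I cannot find a way to remove the duplicates; all the tutorials and solutions I found are for small objects and small lists. I am not very good at Python, and I need an efficient solution, because the file is so large that memory may be a problem.
Each user object looks like the example shown, with a unique id and screen_name.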
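Answered 2021-12-20 16:38:43
To handle huge JSON datasets, especially long lists of objects, it is best to use the JSON streaming library from https://github.com/daggaz/json-stream to read the user objects one by one, adding each to the result only if that user has not been seen before.
Example: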
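import json_stream

unique_users = []
seen_users = set()
with open('input.json') as f:
    js = json_stream.load(f)
    for us in js:
        # Materialize the streamed object before inspecting it
        user = dict(us.items())
        if user['id'] not in seen_users:
            unique_users.append(user)
            seen_users.add(user['id'])

The reason for user = dict(us.items()) is that if we look up the id in the object through the stream, we can no longer go back to get the whole object. So we need to "render" each user object first, and then check the id.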
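The example above collects every unique user in a list, which for a 16 GB input may itself not fit in memory. As a rough sketch of one alternative (not the answerer's method), the following reads the original rawjson layout of back-to-back objects using only the standard library's json.JSONDecoder.raw_decode instead of json-stream, keeps only the set of seen ids in memory, and writes each first occurrence straight to disk. The function names iter_concatenated_json and dedupe_users, the chunk size, and the one-object-per-line output format are all assumptions made for illustration.

```python
import json


def iter_concatenated_json(fin, chunk_size=1 << 20):
    """Yield each top-level JSON value from a text stream of back-to-back objects."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fin.read(chunk_size)  # empty string means end of file
        buf += chunk
        pos = 0
        while True:
            # Skip whitespace between objects
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos >= len(buf):
                break
            try:
                obj, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # input ended in the middle of an object
                break  # object is incomplete; read another chunk
            yield obj
            pos = end
        buf = buf[pos:]  # keep only the unparsed tail
        if not chunk:
            break


def dedupe_users(src_path, dst_path, chunk_size=1 << 20):
    """Write the first occurrence of each user id to dst_path, one JSON object per line."""
    seen_ids = set()  # only the ids are held in memory, not the full objects
    kept = 0
    with open(src_path, encoding="utf8") as fin, \
            open(dst_path, "w", encoding="utf8") as fout:
        for user in iter_concatenated_json(fin, chunk_size):
            if user["id"] in seen_ids:
                continue
            seen_ids.add(user["id"])
            fout.write(json.dumps(user, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

A call such as dedupe_users("rawjson", "unique_users.jsonl") would then produce a file that can be read back line by line with json.loads. The set of integer ids still grows with the number of distinct users, so this assumes that set fits in memory.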
Answered 2021-12-20 16:29:11
You can modify merge sort so that it simply removes the duplicates, in O(n log n).
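A minimal in-memory sketch of that idea, assuming the users are dicts keyed by "id" as in the question: the merge step keeps one copy whenever both halves hold the same key. The name merge_sort_unique is hypothetical, and a 16 GB file would need an external variant that sorts chunks on disk and merges them, which this sketch does not attempt.

```python
def merge_sort_unique(users, key=lambda u: u["id"]):
    """Return users sorted by key, with at most one entry per key."""
    if len(users) <= 1:
        return list(users)
    mid = len(users) // 2
    left = merge_sort_unique(users[:mid], key)
    right = merge_sort_unique(users[mid:], key)

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            merged.append(left[i])
            i += 1
        elif kl > kr:
            merged.append(right[j])
            j += 1
        else:
            # Same key in both halves: keep one copy, advance both sides
            merged.append(left[i])
            i += 1
            j += 1
    # Each half is already duplicate-free, so the leftovers can be appended as-is
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Unlike the set-based approach, this returns the users ordered by id rather than in their original order.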
Answered 2021-12-20 16:32:30
https://stackoverflow.com/questions/70424815