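I am preparing learning materials for my students. For convenience, I would like to access the data from a URL rather than ask them to download it in advance. In this example, I am trying to access the bird drawings in Google's Quick, Draw! dataset.
Below is an example of accessing the data stored locally, with the results shown as comments: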
import pandas as pd
import os
import json
from glob import glob
# Convert top row to one dict
top_row_dict = lambda in_df: list(in_df.head(1).T.to_dict().values())[0]
# Load file from computer
base_dir = os.path.join('input', 'quickdraw_simplified')
obj_files = glob(os.path.join(base_dir, '*.ndjson'))
print(obj_files[0])
# input\quickdraw_simplified\full_simplified_bird.ndjson
c_json = pd.read_json(obj_files[0], lines = True, chunksize = 1)
# <pandas.io.json._json.JsonReader at 0x158ae631f10>
f_row = next(c_json)
# word countrycode timestamp recognized key_id drawing
# 0 bird US 2017-03-09 00:28:55.637750+00:00 True 4926006882205696 [[[0, 11, 23, 50, 72, 96, 97, 132, 158, 224, 2...
f_dict = top_row_dict(f_row)
# {'word': 'bird',
# 'countrycode': 'US',
# 'timestamp': Timestamp('2017-03-09 00:28:55.637750+0000', tz='UTC'),
# 'recognized': True,
# 'key_id': 4926006882205696,
# 'drawing': [[[0, 11, 23, 50, 72, 96, 97, 132, 158, 224, 255],
# [22, 9, 2, 0, 26, 45, 71, 40, 27, 10, 9]]]}
However, when I try to do the same thing using the API link, it fails:
import pandas as pd
import json
top_row_dict = lambda in_df: list(in_df.head(1).T.to_dict().values())[0]
url = 'https://console.cloud.google.com/storage/browser/quickdraw_dataset/full/simplified/bird.ndjson'
# Load dataset
c_json = pd.read_json(url, lines = True, chunksize = 1)
# <pandas.io.json._json.JsonReader at 0x24980a20a90>
f_row = next(c_json)
# __
f_dict = top_row_dict(f_row)
# IndexError: list index out of range
Posted on 2020-08-31 22:07:58
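The URL you are trying to use requires a login, because it links to the Cloud Console.
However, the dataset is stored in a publicly accessible Google Cloud Storage bucket.
This means you can load the file directly from the bucket using the http://pypi.org/p/google-cloud-storage package.
Something like: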
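import pandas as pd
from google.cloud import storage
# The bucket is public, so an anonymous client avoids needing credentials
client = storage.Client.create_anonymous_client()
bucket = client.bucket('quickdraw_dataset')
blob = bucket.blob('full/simplified/bird.ndjson')
# pandas cannot read a Blob directly; Blob.open() (available in recent
# google-cloud-storage releases) returns a file-like object it can read
c_json = pd.read_json(blob.open('r'), lines = True, chunksize = 1)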
...
https://stackoverflow.com/questions/63668685
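Because the bucket is public, another option that avoids the extra dependency is to give pandas a plain HTTPS URL instead of the console link. The sketch below is not part of the answer above. It assumes that objects in a public bucket are served at `https://storage.googleapis.com/<bucket>/<object>`, and the helper names `console_to_public_url` and `first_record` are hypothetical. It rewrites the console link from the question and reads the first record from any path, URL, or file-like object; the check at the bottom uses an in-memory sample so it runs offline.

```python
import io

import pandas as pd

CONSOLE_PREFIX = "https://console.cloud.google.com/storage/browser/"
PUBLIC_PREFIX = "https://storage.googleapis.com/"


def console_to_public_url(console_url):
    """Rewrite a Cloud Console browser link into a direct object URL.

    Assumption: the object is publicly readable, so no login is involved.
    """
    if not console_url.startswith(CONSOLE_PREFIX):
        raise ValueError("not a Cloud Console storage browser URL")
    return PUBLIC_PREFIX + console_url[len(CONSOLE_PREFIX):]


def first_record(source):
    """Return the first line of an NDJSON source as a dict.

    `source` may be a local path, a URL, or a file-like object.
    """
    reader = pd.read_json(source, lines=True, chunksize=1)
    first_chunk = next(iter(reader))  # one-row DataFrame
    return first_chunk.iloc[0].to_dict()


# Offline check with an in-memory sample shaped like the Quick, Draw! records
sample = io.StringIO(
    '{"word": "bird", "countrycode": "US", "recognized": true}\n'
    '{"word": "bird", "countrycode": "GB", "recognized": false}\n'
)
sample_record = first_record(sample)
public_url = console_to_public_url(
    "https://console.cloud.google.com/storage/browser/"
    "quickdraw_dataset/full/simplified/bird.ndjson"
)
```

With network access, `first_record(public_url)` should return the same kind of dict as `f_dict` in the question. Note that for URLs pandas may fetch the whole file before yielding the first chunk, so this is not necessarily a streaming read.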