I have about 5,000 .gzip files (each around 1 MB). Every file contains data in jsonlines format. It looks like this:
{"category_id":39,"app_id":12731}
{"category_id":45,"app_id":12713}
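{"category_id":6014,"app_id":13567}

I want to parse these files and convert them into a pandas DataFrame. Is there a way to speed this up? Here is my code, but it is a bit slow (0.5 s per file):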
import pandas as pd
import jsonlines
import gzip
import os
import io
path = 'data/apps/'
files = os.listdir(path)
result = []
for n, file in enumerate(files):
    print(n, file)
    with open(f'{path}/{file}', 'rb') as f:
        data = f.read()
        unzipped_data = gzip.decompress(data)
        decoded_data = io.BytesIO(unzipped_data)
        reader = jsonlines.Reader(decoded_data)
        for line in reader:
            if line['category_id'] == 6014:
                result.append(line)
df = pd.DataFrame(result)

Posted on 2020-03-23 22:38:54
This should let you read each file line by line, without loading the whole file into memory first.
import pandas as pd
import json
import gzip
import os
path = 'data/apps/'
files = os.listdir(path)
result = []
for n, file in enumerate(files):
    print(n, file)
    with gzip.open(f'{path}/{file}') as f:
        for line in f:
            data = json.loads(line)
            if data['category_id'] == 6014:
                result.append(data)
df = pd.DataFrame(result)

https://stackoverflow.com/questions/60815579
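As a follow-up to the line-by-line approach in the answer, here is a minimal sketch that wraps the same logic in a per-file function and runs it over the directory with a thread pool. The names `filter_file`, `filter_directory`, and `max_workers` are made up for this illustration and do not come from the original post. Whether threads help at all depends on how much of the 0.5 s per file goes to decompression rather than JSON parsing, and that has not been measured here, so treat it as something to benchmark rather than a guaranteed speed-up.

```python
import gzip
import json
import os
from concurrent.futures import ThreadPoolExecutor


def filter_file(file_path, category_id=6014):
    """Stream one gzip-compressed JSON Lines file and keep matching records."""
    matches = []
    # 'rt' mode decompresses lazily and yields decoded text lines.
    with gzip.open(file_path, 'rt', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, e.g. a trailing newline
            record = json.loads(line)
            if record.get('category_id') == category_id:
                matches.append(record)
    return matches


def filter_directory(path, category_id=6014, max_workers=4):
    """Run filter_file on every file in `path` (hypothetical helper)."""
    # Assumes the directory contains only gzip-compressed JSON Lines files.
    names = sorted(os.listdir(path))
    file_paths = [os.path.join(path, name) for name in names]
    result = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the per-file lists in the same order as file_paths.
        per_file = pool.map(lambda p: filter_file(p, category_id), file_paths)
        for matches in per_file:
            result.extend(matches)
    return result
```

The returned list of dicts can be passed to `pd.DataFrame(...)` exactly as in the code above, for example `df = pd.DataFrame(filter_directory('data/apps/'))`. If profiling shows that JSON parsing dominates, swapping the thread pool for a process pool is one option to try, at the cost of pickling the results between processes.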