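I have a JSON file (about 1 GB) containing acoustic features of music tracks. I am trying to read it into my pandas notebook with
dataf = "/home/work/my.json"
d = json.load(open(dataf, 'r'))
but it always fails with this error:
Extra data: line 2 column 1 (char 499)
I know that character 499 is the start of the next track, but I have searched online and still can't figure out how to read the file in. Here is a sample of the data.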
{"_id":{"$oid":"5b2cff21aecd2a723459cd65"},"id":1,"sp_id":"0XLOf9LhyazPX9Ld8jPiUq","danceability":0.7079999999999999627,"energy":0.60999999999999998668,"key":"2","loudness":-4.5220000000000002416,"mode":"1","speechiness":0.0574,"acousticness":0.0204,"instrumentalness":4.45e-06,"liveness":0.0641,"valence":0.305,"tempo":123.0379999999999967,"time_signature":"4","track_uri":"spotify:track:0XLOf9LhyazPX9Ld8jPiUq"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd66"},"id":2,"sp_id":"7aF09WaavZAmAWuUeYxlYD","danceability":0.59299999999999997158,"energy":0.86799999999999999378,"key":"1","loudness":-3.5729999999999999538,"mode":"0","speechiness":0.295,"acousticness":0.182999999999999996,"instrumentalness":0.0,"liveness":0.36499999999999999112,"valence":0.49599999999999999645,"tempo":104.98799999999999955,"time_signature":"4","track_uri":"spotify:track:7aF09WaavZAmAWuUeYxlYD"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd67"},"id":3,"sp_id":"0tKcYR2II1VCQWT79i5NrW","danceability":0.5999999999999999778,"energy":0.8100000000000005329,"key":"0","loudness":-4.748999999999999666,"mode":"1","speechiness":0.0478999999999998135,"acousticness":0.0068300000000000001335,"instrumentalness":0.20999999999999999223,"liveness":0.15499999999999999889,"valence":0.29799999999999998712,"tempo":167.87999999999999545,"time_signature":"4",...}
{"_id":{"$oid":"5b2cff21aecd2a723459cd68"},"id":4,"sp_id":"6TWSVHx6z6E42JiwloGv1k","danceability":0.50300000000000000266,"energy":0.91800000000000003819,"key":"11","loudness":-5.0099999999999997868,"mode":"1","speechiness":0.04639999999999999996803,"acousticness":0.0162,...,"track_uri":"spotify:track:6TWSVHx6z6E42JiwloGv1k"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd69"},"id":5,"sp_id":"5QqyRUZeBE04yJxsD1OC0I","danceability":0.76000000000000000888,"energy":0.56100000000000005418,"key":"1","loudness":-8.6969999999999991758,"mode":"1","speechiness":0.13400000000000000799,"acousticness":0.018499999999999999084,"instrumentalness":1.940000000000000000604e-05,"liveness":0.19900000000000001021,"valence":0.12099999999999999645,"tempo":134.98300000000000409,"time_signature":"4",...}
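Posted on 2018-10-27 00:45:50
Your JSON does not parse because it is not valid JSON. The character the parser complains about comes right after the first newline. Evidently several objects were dumped into the file line by line, and taken together they do not form a single valid JSON document. See: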
>>> json.loads(s[:499])
{'_id': {'$oid': '5b2cff21aecd2a723459cd65'},
'id': 1,
'sp_id': '0XLOf9LhyazPX9Ld8jPiUq',
'danceability': 0.708,
'energy': 0.61,
'key': '2',
'loudness': -4.522,
'mode': '1',
'speechiness': 0.0574,
'acousticness': 0.0204,
'instrumentalness': 4.45e-06,
'liveness': 0.0641,
'valence': 0.305,
'tempo': 123.038,
'time_signature': '4',
'track_uri': 'spotify:track:0XLOf9LhyazPX9Ld8jPiUq'}
>>> json.loads(s[499:973])
{'_id': {'$oid': '5b2cff21aecd2a723459cd66'},
'id': 2,
'sp_id': '7aF09WaavZAmAWuUeYxlYD',
'danceability': 0.593,
'energy': 0.868,
'key': '1',
'loudness': -3.573,
'mode': '0',
'speechiness': 0.295,
'acousticness': 0.183,
'instrumentalness': 0.0,
'liveness': 0.365,
'valence': 0.496,
'tempo': 104.988,
'time_signature': '4',
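'track_uri': 'spotify:track:7aF09WaavZAmAWuUeYxlYD'}
(Here s is the sample input loaded into a string.) The objects were simply printed into the file one after another. Either change the syntax so that the file becomes a list of objects (add square brackets and commas), or parse the file line by line, calling json.loads on each line of input.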
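To make the failure mode concrete, here is a minimal sketch that reproduces the "Extra data" error on a tiny two-record string and then decodes the objects one at a time with json.JSONDecoder.raw_decode. The two records and the helper name iter_objects are made up for illustration; they are not taken from the question's file.

```python
import json

# Two small hypothetical records, one per line, mimicking the file layout.
s = '{"id": 1, "tempo": 123.038}\n{"id": 2, "tempo": 104.988}\n'

# json.loads expects exactly one JSON document, so the second object
# is reported as "Extra data".
try:
    json.loads(s)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)


def iter_objects(text):
    """Yield consecutive JSON values from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


records = list(iter_objects(s))
print(error_message)
print(records)
```

Unlike splitting on newlines, this approach does not assume that every object sits on its own line, but it does need the whole text in memory.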
Now, don't quote me on this, but it is very easy to hack your input into valid JSON:
>>> len(json.loads('[' + s.replace('\n', ',') + ']'))
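5
If the file is huge, you probably don't want to do the above hack and the subsequent parsing in one go, because that would carry a huge memory overhead. In that case I would suggest parsing the file object by object. Assuming your file contains one object per line, all you need is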
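dat = [json.loads(line) for line in open(infile)]
where infile is the path to the concatenated-JSON file. With a huge file this will take a long time and the result will occupy a lot of memory, but I expect the extra overhead needed for parsing to be smaller this way.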
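Since the goal in the question is a pandas DataFrame, a possible shortcut is pandas.read_json with lines=True, which reads one JSON object per line; adding chunksize makes it return an iterator of smaller DataFrames instead of loading everything at once. The sketch below is only an illustration under assumptions: it writes three cut-down records to a temporary file standing in for the real 1 GB file, and the chunk size of 2 is arbitrary.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical, cut-down stand-in for the real file: one object per line.
sample = [
    {"_id": {"$oid": "5b2cff21aecd2a723459cd65"}, "id": 1, "tempo": 123.038},
    {"_id": {"$oid": "5b2cff21aecd2a723459cd66"}, "id": 2, "tempo": 104.988},
    {"_id": {"$oid": "5b2cff21aecd2a723459cd67"}, "id": 3, "tempo": 167.88},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    for rec in sample:
        fh.write(json.dumps(rec) + "\n")
    infile = fh.name

# Read the whole line-delimited file in one call.
df = pd.read_json(infile, lines=True)

# Or read it in pieces; each chunk is a DataFrame of at most 2 rows here.
reader = pd.read_json(infile, lines=True, chunksize=2)
parts = [chunk for chunk in reader]
df_chunked = pd.concat(parts, ignore_index=True)

os.remove(infile)  # clean up the temporary stand-in file
print(df_chunked[["id", "tempo"]])
```

With a real file you would typically filter or aggregate each chunk inside the loop rather than concatenating all of them, otherwise the memory use ends up the same as reading everything at once.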
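Posted on 2018-10-27 03:33:02
It looks like you are reading records exported from a MongoDB database. The export is a series of JSON objects stored one per line, which means the file as a whole is not a valid JSON document, as @Andras pointed out.
Reading the data directly from MongoDB is likely to be far more efficient.
You can use PyMongo like this: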
import pandas as pd
from pymongo import MongoClient
mdbClient = MongoClient('mongodb://localhost:27017/')
db = mdbClient['db']
collection = db['col']
results = collection.find({})
df = pd.DataFrame.from_records(results)
https://stackoverflow.com/questions/53017795