I have a DataFrame with the following structure:
Column1 Column2
0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212, 0.0014970217598602176,...
1 (0.00015607803652528673, 0.0001314736582571640... (0.0022136708721518517, 0.0014974646037444472,...
2 (0.011317798867821693, 0.011339936405420303, 0... (0.004868391435593367, 0.004406007472425699, 0...
3 (3.94578673876822e-05, 3.075833956245333e-05, ... (0.0075020878575742245, 0.0096737677231431, 0....
4 (0.0004926157998852432, 0.0003811710048466921,... (0.010351942852139473, 0.008231297135353088, 0...
.. ... ...
130 (0.011190211400389671, 0.011337820440530777, 0... (0.010182800702750683, 0.011351295746862888, 0...
131 (0.006286659277975559, 0.007315031252801418, 0... (0.02104150503873825, 0.02531484328210354, 0.0...
132 (0.0022791570518165827, 0.0025983047671616077,... (0.008847278542816639, 0.009222050197422504, 0...
133 (0.0007059817435219884, 0.0009831463685259223,... (0.0028264704160392284, 0.0029402063228189945,...
134 (0.0018992726691067219, 0.002058899961411953, ... (0.0019639385864138603, 0.002009353833273053, ...
[135 rows x 2 columns]
Each cell contains a list/tuple of float values:
type(psd_res.data_frame['Column1'][0])
<class 'tuple'>
type(psd_res.data_frame['Column1'][0][0])
<class 'numpy.float64'>
(Every cell's tuple contains the same number of entries.)
When I try to save the DataFrame as Parquet, I get an error (fastparquet engine):
Can't infer object conversion type: 0 (0.00030271668219938874, 0.0002655923890415579...
1 (0.00015607803652528673, 0.0001314736582571640...
...
Name: Column1, dtype: object
Full stack trace: https://pastebin.com/8Myu8hNV
I also tried the other engine, pyarrow:
pyarrow.lib.ArrowInvalid: ('Could not convert (0.00030271668219938874, ..., 0.0002464042045176029)
with type tuple: did not recognize Python value type when inferring an Arrow data type',
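'Conversion failed for column UO-Pumpe with type object')
So I found this thread: https://github.com/dask/fastparquet/issues/458. It appears to be a bug in fastparquet, but it is supposed to work with pyarrow, which fails for me as well.
Then I tried a few things I found, such as infer_objects() and astype(float), but nothing has worked so far.
Does anyone know how I can save my DataFrame to Parquet?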
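Posted on 2021-03-26 02:43:08
The cells of your DataFrame contain tuples of floats. That is an unusual data type.
So you need to give Arrow a little help in working out the type of your data. To do that, provide the table's schema explicitly.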
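import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]
    }
)
# Declare column1 as a list of float64 so Arrow does not have to infer it.
schema = pa.schema([pa.field('column1', pa.list_(pa.float64()))])
# The schema keyword is forwarded to the pyarrow engine.
df.to_parquet('/tmp/hello.pq', schema=schema)
Note that if you use lists of floats (instead of tuples), it works without a schema: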
df = pd.DataFrame(
    {
        "column1": [[1.0, 2.0], [3.0, 4.0, 5.0]]
    }
)
df.to_parquet('/tmp/hello.pq')
https://stackoverflow.com/questions/66801151