read_csv in Pandas when converting dtypes from a dictionary

Stack Overflow user
Asked 2019-02-08 01:48:28
Answers: 2, Views: 577, Followers: 0, Votes: 0

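I am reading a large CSV into a dataframe. Pandas sets the dtypes to float64, which uses a lot of memory (2.4 GB for a 300 MB CSV). I would like to set some of the dtypes to integers during import. My dataset contains many NaNs, so I am trying to use the nullable UInt dtypes available in pandas 0.24.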
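To illustrate the memory argument, here is a minimal sketch, assuming a made-up column of one million values that are mostly missing. The names `values`, `as_float`, and `as_uint8` are hypothetical and not from the original post. It compares a float64 column with the same data stored as nullable `UInt8`, which keeps a small integer array plus a boolean mask for the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical column: mostly missing values, with a few small integers.
values = np.full(1_000_000, np.nan)
values[::10] = 1.0

as_float = pd.Series(values)          # default float64 column
as_uint8 = as_float.astype("UInt8")   # nullable unsigned 8-bit integers

# Compare the bytes used by the column data only (index excluded).
float_bytes = as_float.memory_usage(index=False)
uint8_bytes = as_uint8.memory_usage(index=False)

print(float_bytes, uint8_bytes)
```

The exact savings on a real file depend on how many columns can be narrowed and on the pandas version in use.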

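First, I read the CSV to extract the header and row 3, which holds the data types of the source data. These are given as U1, U2, and F4, so I use df.replace to change them to the pandas dtypes UInt8, UInt16, and float64, respectively. I then build a dictionary from this dataframe. When I read the CSV again and try to use this dictionary to define the dtypes, I always get: Exception has occurred: TypeError data type not understood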

import pandas as pd
import numpy as np

filename = r'input/TST_short.csv'
# Read only the header and the row holding the source type codes
dtype_df = pd.read_csv(filename, nrows=1, skiprows=[1,2])

# Map the source type codes to pandas dtype names
dtype_df.replace('U1', 'UInt8', inplace=True)
dtype_df.replace('U2', 'UInt16',inplace=True)
dtype_df.replace('F4', 'float64',inplace=True)

dtype_dict = dtype_df.to_dict('records')

# This is the call that raises the TypeError
df = pd.read_csv(filename, skiprows=range(1,5), nrows=500, dtype=dtype_dict)
print(df)

If I remove the dtype= argument, df looks like this:

           TIME     ST1F  ST2F  ST3F  ST4F    P12V
0    43476.34869  32768.0   1.0   0.0   0.0  12.161
1    43476.34870      NaN   NaN   NaN   NaN     NaN
2    43476.34872      NaN   NaN   NaN   NaN     NaN
3    43476.34873      NaN   NaN   NaN   NaN     NaN
4    43476.34874      NaN   NaN   NaN   NaN     NaN
5    43476.34875      NaN   NaN   NaN   NaN     NaN
6    43476.34876      NaN   NaN   NaN   NaN     NaN
7    43476.34877      NaN   NaN   NaN   NaN     NaN
8    43476.34878      NaN   NaN   NaN   NaN     NaN
9    43476.34880      NaN   NaN   NaN   NaN     NaN
10   43476.34881  32768.0   1.0   0.0   0.0     NaN
11   43476.34882      NaN   NaN   NaN   NaN  12.161
12   43476.34883      NaN   NaN   NaN   NaN     NaN
13   43476.34884      NaN   NaN   NaN   NaN     NaN 

Everything is converted to float64.

The dictionary (dtype_dict) looks like this:

{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}

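Please go easy on me if I have not provided enough information; this is my first post and I am still a newbie :)

Update: Thanks for the help. It looks like my dictionary was not formatted correctly. I had tried all of the to_dict orientations and thought 'records' was the closest, but when I compare the two I can see that the dictionary built from the df is wrapped in square brackets.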

d = dtype_df.to_dict('records')
d1 = {
    'TIME': 'float64',
    'ST1F': 'UInt16',
    'ST2F': 'UInt8',
    'ST3F': 'UInt8',
    'ST4F': 'UInt8',
    'P12V': 'float64'}
print(d)   # dictionary built from the dataframe
print(d1)  # dictionary typed by hand

Output:

[{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}]
{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}
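The square brackets come from the `'records'` orientation, which returns one dict per row inside a list. A small sketch of the difference, using a shortened, hypothetical stand-in for `dtype_df` with three columns only:

```python
import pandas as pd

# One-row frame shaped like the dtype_df in the question (shortened).
dtype_df = pd.DataFrame(
    [["float64", "UInt16", "UInt8"]],
    columns=["TIME", "ST1F", "ST2F"],
)

as_records = dtype_df.to_dict("records")  # list with one dict per row
as_default = dtype_df.to_dict()           # {column: {row_label: value}}

print(type(as_records).__name__, as_records)
print(type(as_default).__name__, as_default)
```

Neither result is the flat `{column: dtype}` mapping that `read_csv` expects for `dtype=`; the first element of the records list is.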

The dataframe I generate the dictionary from looks like this:

      TIME    ST1F   ST2F   ST3F   ST4F     P12V
0  float64  UInt16  UInt8  UInt8  UInt8  float64

2 Answers

Stack Overflow user

Accepted answer

Posted 2019-02-08 01:58:20

Your code works exactly as expected here. Are you sure you are using 0.24.0 or 0.24.1?

In [27]: pd.read_csv('test.csv').head()  # Spits out floats
Out[27]:
          TIME     ST1F  ST2F  ST3F  ST4F    P12V
0  43476.34869  32768.0   1.0   0.0   0.0  12.161
1  43476.34870      NaN   NaN   NaN   NaN     NaN
2  43476.34872      NaN   NaN   NaN   NaN     NaN
3  43476.34873      NaN   NaN   NaN   NaN     NaN
4  43476.34874      NaN   NaN   NaN   NaN     NaN

In [28]: pd.read_csv('test.csv', dtype=d).head()  # Has the proper types
Out[28]:
          TIME   ST1F  ST2F  ST3F  ST4F    P12V
0  43476.34869  32768     1     0     0  12.161
1  43476.34870    NaN   NaN   NaN   NaN     NaN
2  43476.34872    NaN   NaN   NaN   NaN     NaN
3  43476.34873    NaN   NaN   NaN   NaN     NaN
4  43476.34874    NaN   NaN   NaN   NaN     NaN

In [29]: d  # Let's check that this dict matches yours on the nose
Out[29]:
{'TIME': 'float64',
 'ST1F': 'UInt16',
 'ST2F': 'UInt8',
 'ST3F': 'UInt8',
 'ST4F': 'UInt8',
 'P12V': 'float64'}

In [30]: pd.read_csv('test.csv', dtype=d).info()  # And the types look good as well
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 6 columns):
TIME    14 non-null float64
ST1F    2 non-null UInt16
ST2F    2 non-null UInt8
ST3F    2 non-null UInt8
ST4F    2 non-null UInt8
P12V    2 non-null float64
dtypes: UInt16(1), UInt8(3), float64(2)
memory usage: 430.0 bytes

In [32]: pd.__version__
Out[32]: '0.24.0'
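For anyone who wants to repeat this check without a file on disk, here is a rough sketch. The inline `csv_text` is a hypothetical, abbreviated stand-in for `test.csv`, and the version guard is an added assumption, based on the nullable integer dtypes having arrived in pandas 0.24.

```python
import io
import pandas as pd

# Inline stand-in for test.csv (hypothetical, abbreviated from the question).
csv_text = """TIME,ST1F,ST2F,P12V
43476.34869,32768,1,12.161
43476.34870,,,
43476.34872,,,
"""

d = {"TIME": "float64", "ST1F": "UInt16", "ST2F": "UInt8", "P12V": "float64"}

# Nullable integer dtypes such as 'UInt16' need pandas 0.24 or later.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
assert (major, minor) >= (0, 24), "nullable integer dtypes need pandas >= 0.24"

# A flat {column: dtype} dict is accepted; empty fields become missing values.
df = pd.read_csv(io.StringIO(csv_text), dtype=d)
print(df.dtypes)
```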
Votes: 2

Stack Overflow user

Posted 2019-02-08 04:19:32

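The problem was that my dictionary came out as a list containing a dictionary. There may be a more elegant way to solve this, but I fixed it by indexing into the list and passing its first element, which is the dictionary itself.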

df = pd.read_csv(filename, skiprows=range(1,5), dtype=d[0])
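One possible alternative to indexing the list, offered as a sketch rather than as part of the original answer: take the single row as a Series and convert it directly, which yields a flat dict. The file layout in `csv_text`, the `meta` filler lines, and the name `code_map` are assumptions made for illustration; only the U1/U2/F4 mapping comes from the question.

```python
import io
import pandas as pd

# Hypothetical file layout: header, two metadata lines, the type-code line,
# one more metadata line, then the data.
csv_text = """TIME,ST1F,ST2F,P12V
meta,meta,meta,meta
meta,meta,meta,meta
F4,U2,U1,F4
meta,meta,meta,meta
43476.34869,32768,1,12.161
43476.34870,,,
"""

# Source type codes mapped to pandas dtype names (mapping from the question).
code_map = {"U1": "UInt8", "U2": "UInt16", "F4": "float64"}

# Read only the header and the type-code line, as in the question.
dtype_df = pd.read_csv(io.StringIO(csv_text), nrows=1, skiprows=[1, 2])

# Take the single row as a Series and convert it straight to a flat dict.
dtype_dict = dtype_df.iloc[0].map(code_map).to_dict()

# Skip the four non-data lines after the header and apply the dtypes.
df = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 5), dtype=dtype_dict)
print(dtype_dict)
print(df.dtypes)
```

Mapping through a single dict also replaces the three separate `replace` calls, at the cost of turning any unlisted type code into a missing value.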

Edit: I solved this and then saw the comment posted earlier that already contained the answer :)

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/54579332
