Q: itertools.permutations on a Pandas DataFrame index uses too much memory

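Stack Overflow user

Asked on 2020-06-01 01:10:13

Answers: 1, Views: 320, Followers: 0, Votes: 3

I am trying to create a new DataFrame based on the permutations of another DataFrame. Here is the original DataFrame. Price is the index.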

df1
Price     Bid   Ask
1          .01   .05
2          .04   .08
3          .1    .15  
.           .      .
130        2.50  3.00

The second DataFrame (df2) takes the index of df1 and contains the permutations of that index taken 4 prices at a time, as shown in the sample output below.

df2
 #     price1   price2   price3   price4
 1       1        2         3       4
 2       1        2         3       5
 3       1        2         3       6
 ..       ..       ..        ..      ..

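To do this I have been using itertools.permutations, but I run into memory problems and cannot generate this many permutations. This is the code I use to build the permutations.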

price_combos = list(x for x in itertools.permutations(df1.index, 4))
df2 = pd.DataFrame(price_combos , columns=('price1', 'price2', 'price3', 'price4'))

Tags: itertools, python, pandas, numpy, dataframe

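1 Answer

Stack Overflow user

Answered on 2020-06-01 04:35:59

dtypes can cause memory allocation to balloon. The best approach I found for your scenario is to convert the DataFrame index, df1.index, to a NumPy array with the int16 dtype.
- The range of int8 is -128 to 127. Since your index runs from 1 to 130, int8 will not suffice.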

- Creating a `price_combos` variable and then a dataframe will use twice the amount of memory, so create `df2` without the intermediary step.
- If you create the dataframe without specifying the `dtype`, as you're doing, the `dtype` will be `int64`.
- With the following implementation, there will be one object, `df2`, that will be 2,180,905,112 bytes.
    - With the original implementation, there would be two `int64` objects of about 8 GB each, for a total of about 16 GB.
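
The figures in the list above can be checked with a little arithmetic. The following is a minimal sketch, assuming Python 3.8+ (for `math.perm`) and the 130 prices numbered 1 to 130 from the example; the variable names are illustrative only. It counts only the raw array data, so it comes out slightly below the total quoted above.

```python
import math
import numpy as np

n_prices = 130   # number of index values in the example df1
k = 4            # prices per permutation

# Check which integer dtypes can hold the values 1..130
int8_info = np.iinfo(np.int8)
int16_info = np.iinfo(np.int16)
fits_int8 = int8_info.min <= 1 and n_prices <= int8_info.max
fits_int16 = int16_info.min <= 1 and n_prices <= int16_info.max

# Number of ordered selections of 4 prices out of 130: 130 * 129 * 128 * 127
n_rows = math.perm(n_prices, k)

# Raw data size for a (n_rows, 4) array in each dtype
bytes_int16 = n_rows * k * np.dtype(np.int16).itemsize
bytes_int64 = n_rows * k * np.dtype(np.int64).itemsize

print(f"int8 range: {int8_info.min} to {int8_info.max}, fits: {fits_int8}")
print(f"int16 range: {int16_info.min} to {int16_info.max}, fits: {fits_int16}")
print(f"rows: {n_rows:,}")
print(f"int16 data: {bytes_int16:,} bytes")
print(f"int64 data: {bytes_int64:,} bytes")
```

The `int64` figure is four times the `int16` one, which is where the saving from choosing the smaller dtype comes from.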

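If you are using Jupyter, which has terrible memory management, increasing the amount of virtual memory or the swap file size will give you extra cushion for the memory required. Virtual memory is the Windows term and swap file is the Linux term. This is easy to do; just Google it.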

import numpy as np
import pandas as pd
from itertools import permutations

# synthetic data set and create dataframe
np.random.seed(365)
data = {'Price': list(range(1, 131)),
        'Bid': [np.random.randint(1, 10)*0.1 for _ in range(130)]}

df1 = pd.DataFrame(data)
df1['Ask'] = df1.Bid + 0.15
df1.set_index('Price', inplace=True)

# convert the index to an int16 array
values = df1.index.to_numpy(dtype='int16')

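# create df2 directly, without an intermediate price_combos variable
# (in Jupyter, %%time must be the first line of its cell to time this step)
df2 = pd.DataFrame(np.array(list(permutations(values, 4))), columns=('price1', 'price2', 'price3', 'price4'))
# Wall time reported for this step: 2min 45s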

print(df2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272613120 entries, 0 to 272613119
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -----
 0   price1  int16
 1   price2  int16
 2   price3  int16
 3   price4  int16
dtypes: int16(4)
memory usage: 2.0 GB

df2.head()

   price1  price2  price3  price4
0       1       2       3       4
1       1       2       3       5
2       1       2       3       6
3       1       2       3       7
4       1       2       3       8

df2.tail()

           price1  price2  price3  price4
272613115     130     129     128     123
272613116     130     129     128     124
272613117     130     129     128     125
272613118     130     129     128     126
272613119     130     129     128     127
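
The answer's code still materializes every permutation as a Python tuple in one list before NumPy converts it. As a possible variation, not part of the original answer, the sketch below fills a preallocated `int16` array chunk by chunk, so only one chunk of tuples exists at a time. The helper name `permutations_frame` and the `chunk_rows` parameter are hypothetical, and `math.perm` assumes Python 3.8+.

```python
import math
from itertools import islice, permutations

import numpy as np
import pandas as pd


def permutations_frame(values, k, columns, chunk_rows=1_000_000):
    """Build a DataFrame of all k-permutations of `values`, chunk by chunk.

    Hypothetical helper for illustration; assumes all values fit in int16.
    """
    values = np.asarray(values, dtype=np.int16)
    n_rows = math.perm(len(values), k)

    # Final buffer, allocated once with the small dtype
    out = np.empty((n_rows, k), dtype=np.int16)

    # Iterate over plain Python ints; rows come out in itertools order
    perms = permutations(values.tolist(), k)
    start = 0
    while start < n_rows:
        # Only this chunk of tuples is held as Python objects at a time
        chunk = list(islice(perms, chunk_rows))
        out[start:start + len(chunk)] = np.array(chunk, dtype=np.int16)
        start += len(chunk)

    return pd.DataFrame(out, columns=columns)


# Small demonstration: 5 prices taken 2 at a time gives 5 * 4 = 20 rows
small = permutations_frame(np.arange(1, 6), 2, ["price1", "price2"], chunk_rows=7)
print(small.shape)
print(small.head())
```

For the full problem the call would be `permutations_frame(df1.index, 4, ['price1', 'price2', 'price3', 'price4'])`. The result still needs about 2 GB for the final array, and the run time is not measured here.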

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/62119757
