Q: python pandas groupby optimization

Stack Overflow user

Asked 2014-10-16 23:52:13

Answers: 1 · Views: 513 · Followers: 0 · Votes: 0

I have a large DataFrame with many rows and columns, and I need to group it by one of the columns, "group". Here is a small example:

  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos

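Basically, after the group by I will apply a series of functions that create new columns and transform existing ones.

For example, one of the functions I want to implement is top_word, which selects the highest-ranked word in each group. Its output would be a column of unicode strings: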

group    top_word
a    physical_entity [0.88]
b    thing [0.64]
c    object [0.73]
d    organism [0.64]

Currently, I am using this horrible approach:

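import pandas as pd

def top_word(tab):
    # tab is already sorted by rank (descending), so row 0 holds the top word
    first = tab.iloc[0]
    res = '{} [{:.2f}]'.format(first['word'], first['rank'])
    return [res]

def aggr(x, fns):
    # call every function on the group and collect the results as columns
    d = {key: fn(x) for key, fn in fns.items()}  # .items() replaces Python 2's .iteritems()
    return pd.DataFrame(d)

fs = {'top_word': top_word}
# sort by rank so the aggregation function only has to pick the first row
# (sort_values replaces the old DataFrame.sort, which later pandas releases dropped)
T = T.sort_values('rank', ascending=False)
T = T.groupby('group', sort=False).apply(lambda x: aggr(x, fs))
T.index = T.index.droplevel(level=1)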

This gives, for example (the values differ from the table above because the ranks come from a random number generator):

time taken: 0.0042  +- 0.0003 seconds
                 top_word
group                    
a           entity [0.07]
b      abstraction [0.84]
c           object [0.92]
d         congener [0.06]

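I designed it this way so that I can apply any function I want to the table at any time. It needs to keep that flexibility, but it looks horrible! Is there a more efficient way to do this kind of thing? Would iterating over the groups and appending be better?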

Thanks
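As an illustration of the "iterate over the groups" option raised in the question, here is a minimal sketch that keeps the dictionary-of-functions design but builds one row per group directly. It assumes each function returns a single scalar per group; `n_words` is a hypothetical second function added only to show that several entries can coexist, and the sample data is copied from the table in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group": list("aabbccdddd"),
    "rank": [0.739631, 0.882556, 0.588045, 0.640933, 0.726738,
             0.669280, 0.006574, 0.308684, 0.638631, 0.464244],
    "word": ["entity", "physical_entity", "abstraction", "thing", "object",
             "whole", "congener", "living_thing", "organism", "benthos"],
})

def top_word(tab):
    """Format the highest-ranked word of one group as 'word [rank]'."""
    best = tab.loc[tab["rank"].idxmax()]
    return "{} [{:.2f}]".format(best["word"], best["rank"])

def n_words(tab):
    """Hypothetical second aggregation: number of rows in the group."""
    return len(tab)

fns = {"top_word": top_word, "n_words": n_words}

# One dict per group, filled by calling every function on the group's sub-frame.
rows = {
    name: {key: fn(tab) for key, fn in fns.items()}
    for name, tab in df.groupby("group", sort=False)
}
result = pd.DataFrame.from_dict(rows, orient="index")
result.index.name = "group"
print(result)
```

Because each function receives the whole sub-frame, it can use any column it needs, and no pre-sorting or `droplevel` step is required. Whether this is faster than `groupby(...).apply(...)` on a large frame is something to measure rather than assume.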

group-by

python

pandas

1 Answer

Stack Overflow user

Answered 2014-10-17 01:38:58

I think the idea is to groupby first, then sort each group, and use .agg() to keep the first observation:

In [192]:

print(df)
  group      rank             word
0     a  0.739631           entity
1     a  0.882556  physical_entity
2     b  0.588045      abstraction
3     b  0.640933            thing
4     c  0.726738           object
5     c  0.669280            whole
6     d  0.006574         congener
7     d  0.308684     living_thing
8     d  0.638631         organism
9     d  0.464244          benthos
In [193]:

print(df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0]))
           rank             word
group                           
a      0.882556  physical_entity
b      0.640933            thing
c      0.726738            whole
d      0.638631         organism
In [194]:

df_res = df.groupby('group').agg(lambda x: sorted(x, reverse=True)[0])
df_res.word+df_res['rank'].apply(lambda x: ' [%.2f]'%x)
Out[194]:
group
a        physical_entity [0.88]
b                  thing [0.64]
c                  whole [0.73]
d               organism [0.64]
dtype: object
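One caveat about the output above: `.agg()` with this lambda sorts each column on its own, so the largest rank is paired with the alphabetically last word. That is why group c shows "whole" next to 0.726738, which is actually the rank of "object". A minimal sketch that keeps rows aligned, assuming the same sample data, is to look up the row label of the maximum rank with `idxmax` and then select whole rows:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group": list("aabbccdddd"),
    "rank": [0.739631, 0.882556, 0.588045, 0.640933, 0.726738,
             0.669280, 0.006574, 0.308684, 0.638631, 0.464244],
    "word": ["entity", "physical_entity", "abstraction", "thing", "object",
             "whole", "congener", "living_thing", "organism", "benthos"],
})

# Index label of the highest-rank row within each group.
best_idx = df.groupby("group")["rank"].idxmax()

# Select whole rows so that each word stays paired with its own rank.
df_res = df.loc[best_idx].set_index("group")

# Build the 'word [rank]' strings, rounding the rank to two decimals.
top_word = df_res["word"] + df_res["rank"].map(" [{:.2f}]".format)
print(top_word)
```

With this variant group c yields "object [0.73]", matching the result the question asks for.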

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/26408751
