首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >按月对python/大熊猫的每日数据进行分组,然后进行正常化

按月对python/大熊猫的每日数据进行分组,然后进行正常化
EN

Stack Overflow用户
提问于 2013-09-07 19:50:29
回答 1查看 28.3K关注 0票数 15

下面的桌子是潘达斯DataFrame

代码语言:javascript
复制
    q_string    q_visits    q_date
0   nucleus         1790        2012-10-02 00:00:00
1   neuron          364         2012-10-02 00:00:00
2   current         280         2012-10-02 00:00:00
3   molecular       259         2012-10-02 00:00:00
4   stem            201         2012-10-02 00:00:00

该表按日包含来自服务器日志的查询卷。我想做两件事:

  1. 我想按月对查询进行分组,总结整个月的查询量,例如,如果2012-10-02年的查询量为1000卷,而2012-10-03年的查询量为500卷,则在新表1500 (卷)中应该有一个条目,日期为2012-10-31 (月底代表月份的月底-转换后的表中的所有日期将是表示与之相关的整个月的月底)。
  2. 我想添加第5列,其中包含month-normalized q_visits。也就是说,一个术语的每月查询量除以该月份的所有术语的总查询量。

做这件事最好的方法是什么?

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2013-09-07 20:16:13

如果我没听错的话:

对于(1)这样做:

从你给出的数据中抽取一些假的数据,随机的日期和访问的#:

代码语言:javascript
复制
In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')

In [180]: visits = Series(poisson(1000, size=100), name='date')

In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')

In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})

In [183]: df.head()
Out[183]:
                 date   string  visits
0 2001-11-15 00:00:00  current     997
1 2001-11-15 00:00:00  current     974
2 2012-10-02 00:00:00     stem     982
3 2001-12-01 00:00:00     stem     984
4 2001-01-01 00:00:00  current     989

In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')

In [187]: resamp.head()
Out[187]:
                    visits
string  date
current 2001-01-31    2996
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31    3016

NaN之所以存在,是因为在这几个月中没有使用该查询字符串进行访问。

对于(2),按日期分组,然后除以之和:

代码语言:javascript
复制
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())

In [189]: g.head()
Out[189]:
                    visits
string  date
current 2001-01-31   0.177
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31   0.188

只是为了让你相信(2)是做你想做的事:

代码语言:javascript
复制
In [176]: h = g.sortlevel('date').head()

In [177]: h
Out[177]:
                      visits
string    date
current   2001-01-31   0.077
molecular 2001-01-31   0.228
neuron    2001-01-31   0.073
nucleus   2001-01-31   0.234
stem      2001-01-31   0.388

In [178]: h.sum()
Out[178]:
visits    1
dtype: float64

如果要将resamp转换为DataFrame并删除NaN,请执行以下操作:

代码语言:javascript
复制
In [196]: resamp.dropna()
Out[196]:
                      visits
string    date
current   2001-01-31    2996
          2001-05-31    3016
          2001-11-30    5959
          2001-12-31    3998
          2013-09-30    1077
molecular 2001-01-31    3984
          2001-05-31    1911
          2001-11-30    3054
          2001-12-31    1020
          2012-10-31     977
          2013-09-30    1947
neuron    2001-01-31    3961
          2001-05-31    2069
          2001-11-30    5010
          2001-12-31    2065
          2012-10-31    6973
          2013-09-30     994
nucleus   2001-01-31    3060
          2001-05-31    3035
          2001-11-30    2924
          2001-12-31    4144
          2012-10-31    2004
          2013-09-30    7881
stem      2001-01-31    2911
          2001-05-31    5994
          2001-11-30    6072
          2001-12-31    4916
          2012-10-31    1991
          2013-09-30    3977

In [197]: resamp.dropna().reset_index()
Out[197]:
       string                date  visits
0     current 2001-01-31 00:00:00    2996
1     current 2001-05-31 00:00:00    3016
2     current 2001-11-30 00:00:00    5959
3     current 2001-12-31 00:00:00    3998
4     current 2013-09-30 00:00:00    1077
5   molecular 2001-01-31 00:00:00    3984
6   molecular 2001-05-31 00:00:00    1911
7   molecular 2001-11-30 00:00:00    3054
8   molecular 2001-12-31 00:00:00    1020
9   molecular 2012-10-31 00:00:00     977
10  molecular 2013-09-30 00:00:00    1947
11     neuron 2001-01-31 00:00:00    3961
12     neuron 2001-05-31 00:00:00    2069
13     neuron 2001-11-30 00:00:00    5010
14     neuron 2001-12-31 00:00:00    2065
15     neuron 2012-10-31 00:00:00    6973
16     neuron 2013-09-30 00:00:00     994
17    nucleus 2001-01-31 00:00:00    3060
18    nucleus 2001-05-31 00:00:00    3035
19    nucleus 2001-11-30 00:00:00    2924
20    nucleus 2001-12-31 00:00:00    4144
21    nucleus 2012-10-31 00:00:00    2004
22    nucleus 2013-09-30 00:00:00    7881
23       stem 2001-01-31 00:00:00    2911
24       stem 2001-05-31 00:00:00    5994
25       stem 2001-11-30 00:00:00    6072
26       stem 2001-12-31 00:00:00    4916
27       stem 2012-10-31 00:00:00    1991
28       stem 2013-09-30 00:00:00    3977

当然,您也可以为g这样做:

代码语言:javascript
复制
In [198]: g.dropna()
Out[198]:
                      visits
string    date
current   2001-01-31   0.177
          2001-05-31   0.188
          2001-11-30   0.259
          2001-12-31   0.248
          2013-09-30   0.068
molecular 2001-01-31   0.236
          2001-05-31   0.119
          2001-11-30   0.133
          2001-12-31   0.063
          2012-10-31   0.082
          2013-09-30   0.123
neuron    2001-01-31   0.234
          2001-05-31   0.129
          2001-11-30   0.218
          2001-12-31   0.128
          2012-10-31   0.584
          2013-09-30   0.063
nucleus   2001-01-31   0.181
          2001-05-31   0.189
          2001-11-30   0.127
          2001-12-31   0.257
          2012-10-31   0.168
          2013-09-30   0.496
stem      2001-01-31   0.172
          2001-05-31   0.374
          2001-11-30   0.264
          2001-12-31   0.305
          2012-10-31   0.167
          2013-09-30   0.251

In [199]: g.dropna().reset_index()
Out[199]:
       string                date  visits
0     current 2001-01-31 00:00:00   0.177
1     current 2001-05-31 00:00:00   0.188
2     current 2001-11-30 00:00:00   0.259
3     current 2001-12-31 00:00:00   0.248
4     current 2013-09-30 00:00:00   0.068
5   molecular 2001-01-31 00:00:00   0.236
6   molecular 2001-05-31 00:00:00   0.119
7   molecular 2001-11-30 00:00:00   0.133
8   molecular 2001-12-31 00:00:00   0.063
9   molecular 2012-10-31 00:00:00   0.082
10  molecular 2013-09-30 00:00:00   0.123
11     neuron 2001-01-31 00:00:00   0.234
12     neuron 2001-05-31 00:00:00   0.129
13     neuron 2001-11-30 00:00:00   0.218
14     neuron 2001-12-31 00:00:00   0.128
15     neuron 2012-10-31 00:00:00   0.584
16     neuron 2013-09-30 00:00:00   0.063
17    nucleus 2001-01-31 00:00:00   0.181
18    nucleus 2001-05-31 00:00:00   0.189
19    nucleus 2001-11-30 00:00:00   0.127
20    nucleus 2001-12-31 00:00:00   0.257
21    nucleus 2012-10-31 00:00:00   0.168
22    nucleus 2013-09-30 00:00:00   0.496
23       stem 2001-01-31 00:00:00   0.172
24       stem 2001-05-31 00:00:00   0.374
25       stem 2001-11-30 00:00:00   0.264
26       stem 2001-12-31 00:00:00   0.305
27       stem 2012-10-31 00:00:00   0.167
28       stem 2013-09-30 00:00:00   0.251

最后,如果要将列按不同的顺序放置,请使用reindex

代码语言:javascript
复制
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
    visits     string                date
0    0.177    current 2001-01-31 00:00:00
1    0.188    current 2001-05-31 00:00:00
2    0.259    current 2001-11-30 00:00:00
3    0.248    current 2001-12-31 00:00:00
4    0.068    current 2013-09-30 00:00:00
5    0.236  molecular 2001-01-31 00:00:00
6    0.119  molecular 2001-05-31 00:00:00
7    0.133  molecular 2001-11-30 00:00:00
8    0.063  molecular 2001-12-31 00:00:00
9    0.082  molecular 2012-10-31 00:00:00
10   0.123  molecular 2013-09-30 00:00:00
11   0.234     neuron 2001-01-31 00:00:00
12   0.129     neuron 2001-05-31 00:00:00
13   0.218     neuron 2001-11-30 00:00:00
14   0.128     neuron 2001-12-31 00:00:00
15   0.584     neuron 2012-10-31 00:00:00
16   0.063     neuron 2013-09-30 00:00:00
17   0.181    nucleus 2001-01-31 00:00:00
18   0.189    nucleus 2001-05-31 00:00:00
19   0.127    nucleus 2001-11-30 00:00:00
20   0.257    nucleus 2001-12-31 00:00:00
21   0.168    nucleus 2012-10-31 00:00:00
22   0.496    nucleus 2013-09-30 00:00:00
23   0.172       stem 2001-01-31 00:00:00
24   0.374       stem 2001-05-31 00:00:00
25   0.264       stem 2001-11-30 00:00:00
26   0.305       stem 2001-12-31 00:00:00
27   0.167       stem 2012-10-31 00:00:00
28   0.251       stem 2013-09-30 00:00:00
票数 33
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/18677271

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档