For example, I have a data file like this:
import random
import pandas as pd
l = []
for day in range(30):
    for userid in range(random.randint(5,30)):
        l.append([day,userid])
df = pd.DataFrame(l, columns=['Day','UserID'])
I want to know the count of unique users who logged in during the past 7 days. This is the code I am using now:
result = {}
df.set_index('Day',inplace=True)
for d in df.index.unique():
    result[d] = df[(df.index>=d-6) & (df.index<=d)].count()
pd.DataFrame(result).T
But I think my code is terrible. Can you show me a more elegant way? Something like pandas.rolling_sum, or anything else?
Posted on 2017-02-13 08:25:21
You are right: rolling_sum is the way to go. Here is how I would approach it, even though it is not a one-liner:
# Generate random log data 'your' way
import random
import pandas as pd
l = []
for day in range(30):
    for userid in range(random.randint(5,30)):
        l.append([day,userid])
df = pd.DataFrame(l, columns=['Day','UserID'])
# Calculate desired statistics
df = df.groupby("Day").count()  # daily login count (groupby also makes the Day column the index by default)
df.columns = ["Daily count"]  # rename the column to make it more meaningful
# pd.rolling_sum no longer exists in current pandas releases; Series.rolling(...).sum() is the equivalent call
df["Weekly count"] = df["Daily count"].rolling(window=7, min_periods=1).sum()  # rolling 7-day sum of the daily counts
This produces the result you want:
     Daily count  Weekly count
Day
0             16          16.0
1             13          29.0
2             19          48.0
3              8          56.0
4             22          78.0
5             21          99.0
6             18         117.0
7              7         108.0
8             12         107.0
9              7          95.0
10            17         104.0
11            21         103.0
12            22         104.0
13            20         106.0
14            19         118.0
...
https://stackoverflow.com/questions/42197784
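One caveat about the rolling sum above: it adds up daily login counts, so a user who logs in on several days inside the window is counted once per day. If the goal is the number of distinct users over the last 7 days, one possible approach is to keep a set of user IDs per day and union the sets inside each window. The following is only a minimal sketch in plain Python, not part of the original answer; the function name rolling_unique_users and the sample log are hypothetical.

```python
from collections import defaultdict


def rolling_unique_users(logins, window=7):
    """Return {day: number of distinct users seen in the last `window` days}.

    `logins` is an iterable of (day, user_id) pairs with integer days.
    (Hypothetical helper, written for illustration only.)
    """
    # Collect the set of users seen on each day.
    users_by_day = defaultdict(set)
    for day, user_id in logins:
        users_by_day[day].add(user_id)

    result = {}
    for day in sorted(users_by_day):
        seen = set()
        # Union the user sets of the current day and the window - 1 days before it.
        for d in range(day - window + 1, day + 1):
            seen |= users_by_day.get(d, set())
        result[day] = len(seen)
    return result


# Tiny hand-made log: user 1 logs in on days 0, 1 and 8; user 2 on days 0 and 7.
sample = [(0, 1), (0, 2), (1, 1), (1, 3), (7, 2), (8, 1)]
weekly_unique = rolling_unique_users(sample)
print(weekly_unique)
```

With the DataFrame from the question, the pairs could presumably be supplied as zip(df["Day"], df["UserID"]) while Day is still a regular column. This does more work than a rolling sum, since each window re-unions up to 7 sets, but for a month of logs that should be negligible.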