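I need to analyze the last 60 days of activity for each user, counting back from the last date on which that user was active.
My data contains the activity dates ('CalendarDate') for each user ('DataSourceId'), one row per date. I have reshaped the data by DataSourceId so that the dates are in columns, and grabbed the last day each user was active as 'max_date':
df['max_date'] = df.groupby('DataSourceId')['CalendarDate'].transform('max')
Although 'CalendarDate' and 'max_date' are actually datetime64[ns] (the activity values are float64), the data looks like this: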
ID   Jan1   Jan2   Jan3   Jan4   Jan5   ...   max_date
1           8             15     10           Jan5
2    2             13                         Jan3
3    6      11                                Jan2
Now I want to realign each row from calendar dates to "last x days", like this:
ID   Last   Last-1   Last-2   Last-3   ...   Last-x
1    10     15                8
2    13              2
3    11     6
I have not found any example of a similar transformation, and I am really stuck here.
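For anyone who wants to reproduce the example, here is a minimal sketch of how such a wide table can be built from long-format data. The 'Activity' column name and the concrete 2018 dates are assumptions made up to mirror the sample above; only 'DataSourceId', 'CalendarDate' and 'max_date' come from the question.

```python
import pandas as pd

# Long format: one row per user per active date (sample values are assumed)
long_df = pd.DataFrame({
    "DataSourceId": [1, 1, 1, 2, 2, 3, 3],
    "CalendarDate": pd.to_datetime([
        "2018-01-02", "2018-01-04", "2018-01-05",
        "2018-01-01", "2018-01-03",
        "2018-01-01", "2018-01-02",
    ]),
    "Activity": [8.0, 15.0, 10.0, 2.0, 13.0, 6.0, 11.0],  # hypothetical column name
})

# Last active date per user, broadcast back to every row of that user
long_df["max_date"] = long_df.groupby("DataSourceId")["CalendarDate"].transform("max")

# Wide format: one row per user, one column per calendar date, NaN where inactive
data_wide = long_df.pivot(index="DataSourceId", columns="CalendarDate", values="Activity")

# One 'max_date' value per user, indexed like data_wide
max_date = long_df.groupby("DataSourceId")["CalendarDate"].max()
print(data_wide)
```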
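Edit: after adapting jezrael's solution, I noticed that it occasionally fails.
I think the problem is related to this line in jezrael's solution: r = data_wide.bfill().isna().sum(axis=1).values
Example: the solution fails on this data (r = [0 3]):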
CalendarDate  2017-07-02  2017-07-03  2017-07-06  2017-07-07  2017-07-08  2017-07-09
DataSourceId
1000648              NaN      188.37      178.37         NaN      128.37       18.37
1004507            51.19         NaN       52.19       53.19         NaN         NaN
In particular, the realigned dataframe looks like this:
              Last-0  Last-1  Last-2  Last-3  Last-4  Last-5
DataSourceId
1000648        18.37  128.37     NaN  178.37  188.37     NaN
1004507        52.19     NaN   51.19     NaN     NaN   53.19
If I change the row order in the data file by changing ID 1000648 to 1100648 (making it the second row), this is the result (r = [0 2]):
              Last-0  Last-1  Last-2  Last-3  Last-4  Last-5
DataSourceId
1004507          NaN     NaN   53.19   52.19     NaN   51.19
1100648          NaN  178.37  188.37     NaN   18.37  128.37

Answered 2018-09-13 11:33:49
If performance matters, use a slightly modified numpy solution:
import numpy as np
import pandas as pd

# select the date columns only: drop the first ('ID') and the last ('max_date') column
A = df.iloc[:, 1:-1].values
print (A)
[[nan 8. nan 15. 10.]
[ 2. nan 13. nan nan]
[ 6. 11. nan nan nan]]
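# count the trailing NaNs in each row; restricted to the same date columns as A
# so that 'ID' and 'max_date' cannot affect the count
r = df.iloc[:, 1:-1].bfill(axis=1).isna().sum(axis=1).values
# older pandas versions
#r = df.iloc[:, 1:-1].bfill(axis=1).isnull().sum(axis=1).values
# faster alternative, based on https://stackoverflow.com/a/30428192
#r = A.shape[1] - (~np.isnan(A)).cumsum(axis=1).argmax(axis=1) - 1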
print (r)
[0 2 3]
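Note that the axis=1 argument of bfill is essential here. Without it, bfill fills down each column instead of along each row, so a NaN can be filled from the row below it and the count starts to depend on the row order. A minimal sketch using the two rows from the question's edit (data_wide is the name used in the question; r_wrong and r_right are illustrative names):

```python
import numpy as np
import pandas as pd

data_wide = pd.DataFrame(
    [[np.nan, 188.37, 178.37, np.nan, 128.37, 18.37],
     [51.19, np.nan, 52.19, 53.19, np.nan, np.nan]],
    index=pd.Index([1000648, 1004507], name="DataSourceId"),
    columns=pd.to_datetime(["2017-07-02", "2017-07-03", "2017-07-06",
                            "2017-07-07", "2017-07-08", "2017-07-09"]),
)

# Default axis=0 fills down the columns: the NaNs of row 1000648 are filled
# from row 1004507, while the middle NaN of row 1004507 is never filled
r_wrong = data_wide.bfill().isna().sum(axis=1).values

# axis=1 fills along each row, so only the trailing NaNs of that row remain
r_right = data_wide.bfill(axis=1).isna().sum(axis=1).values

print(r_wrong)  # [0 3]
print(r_right)  # [0 2]
```

With r = [0 2] the second row is realigned to 53.19, 52.19, NaN, 51.19, NaN, NaN, which is the expected "last x days" order.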
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Map any negative shift to its non-negative equivalent so that the shifted
# column indices stay valid (a modulo operation would also work).
r[r < 0] += A.shape[1]
column_indices = np.flip(column_indices - r[:,np.newaxis], axis=1)
print (column_indices)
[[ 4 3 2 1 0]
[ 2 1 0 -1 -2]
[ 1 0 -1 -2 -3]]
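The negative entries in column_indices are intentional: NumPy treats a negative index as counting from the end of the row, so the positions after the first value wrap around to the trailing NaNs. A small self-contained sketch with the same three rows follows; the helper name realign_last is hypothetical and not part of the original answer.

```python
import numpy as np

def realign_last(A):
    """Reverse each row so that its last non-NaN value lands in column 0."""
    n_cols = A.shape[1]
    # number of trailing NaNs per row (position of the last non-NaN value)
    r = n_cols - (~np.isnan(A)).cumsum(axis=1).argmax(axis=1) - 1
    rows, cols = np.ogrid[:A.shape[0], :n_cols]
    # shift by r, then reverse; negative indices wrap around to the row's end
    idx = np.flip(cols - r[:, np.newaxis], axis=1)
    return A[rows, idx]

A = np.array([[np.nan, 8, np.nan, 15, 10],
              [2, np.nan, 13, np.nan, np.nan],
              [6, 11, np.nan, np.nan, np.nan]])

# Second row by hand: indices 2, 1, 0 pick 13, NaN, 2; indices -1, -2 pick the trailing NaNs
second_row = A[1, [2, 1, 0, -1, -2]]

result = realign_last(A)
print(result)
```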
result = A[rows, column_indices]
#https://stackoverflow.com/a/51613442
#result = strided_indexing_roll(A,r)
print (result)
[[10. 15. nan 8. nan]
[13. nan 2. nan nan]
[11. 6. nan nan nan]]
c = [f'Last-{x}' for x in np.arange(result.shape[1])]
df1 = pd.DataFrame(result, columns=c)
df1.insert(0, 'ID', df['ID'])
print (df1)
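   ID  Last-0  Last-1  Last-2  Last-3  Last-4
0   1    10.0    15.0     NaN     8.0     NaN
1   2    13.0     NaN     2.0     NaN     NaN
2   3    11.0     6.0     NaN     NaN     NaN
Edit:
If ID is the index, the solution changes slightly: select the columns with .iloc[:, :-1] so that the first column is not removed, and at the end use only the DataFrame constructor: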
A = df.iloc[:, :-1].values
print (A)
[[nan 8. nan 15. 10.]
[ 2. nan 13. nan nan]
[ 6. 11. nan nan nan]]
# count the trailing NaNs per row on the same columns as A (without 'max_date')
r = df.iloc[:, :-1].bfill(axis=1).isna().sum(axis=1).values
print (r)
[0 2 3]
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Map any negative shift to its non-negative equivalent so that the shifted
# column indices stay valid (a modulo operation would also work).
r[r < 0] += A.shape[1]
column_indices = np.flip(column_indices - r[:,np.newaxis], axis=1)
print (column_indices)
[[ 4 3 2 1 0]
[ 2 1 0 -1 -2]
[ 1 0 -1 -2 -3]]
result = A[rows, column_indices]
print (result)
[[10. 15. nan 8. nan]
[13. nan 2. nan nan]
[11. 6. nan nan nan]]
c = [f'Last-{x}' for x in np.arange(result.shape[1])]
#use DataFrame constructor
df1 = pd.DataFrame(result, columns=c, index=df.index)
print (df1)
    Last-0  Last-1  Last-2  Last-3  Last-4
ID
1     10.0    15.0     NaN     8.0     NaN
2     13.0     NaN     2.0     NaN     NaN
3     11.0     6.0     NaN     NaN     NaN

Answered 2018-09-13 10:48:36
Please try the code below and let me know whether it helps.
df = df.iloc[:,list(range(len(df.columns)-1,0,-1))]
print(df)

Answered 2018-09-13 11:15:09
You can use this code, which first counts the trailing consecutive null values in each row and then shifts that row by the count:
df1 = df[df.columns.difference(['ID'])]
df1 = df1.apply(lambda x:x.shift(x[::-1].isnull().cumprod().sum())[::-1],axis=1)
df1.columns = ['Last-'+str(i) for i in range(df1.columns.shape[0])]
df1['ID'] = df['ID']
Out:
   Last-0  Last-1  Last-2  Last-3  Last-4  ID
0    10.0    15.0     NaN     8.0     NaN   1
1    13.0     NaN     2.0     NaN     NaN   2
2    11.0     6.0     NaN     NaN     NaN   3
https://stackoverflow.com/questions/52311730