I have a DataFrame with cumulative count data. The code that generates an example follows (feel free to skip it):
import pandas as pd

cols = ['Start', 'End', 'Count']
# Build the frame from rows so that Count stays an integer; a flat
# np.array would coerce every value, including the counts, to a string.
data = [
    ['2020-1-1', '2020-1-2', 4],
    ['2020-1-1', '2020-1-3', 6],
    ['2020-1-1', '2020-1-4', 8],
    ['2020-2-1', '2020-2-2', 3],
    ['2020-2-1', '2020-2-3', 4],
    ['2020-2-1', '2020-2-4', 4],
]
df = pd.DataFrame(columns=cols, data=data)
df['Start'] = pd.to_datetime(df.Start)
df['End'] = pd.to_datetime(df.End)

This gives the following data:
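Start     End       Count
2020-1-1  2020-1-2  4
2020-1-1  2020-1-3  6
2020-1-1  2020-1-4  8
2020-2-1  2020-2-2  3
2020-2-1  2020-2-3  4
2020-2-1  2020-2-4  4

The counts are cumulative (they accumulate from Start onward), and I want to undo the accumulation to get the following (note the changed dates):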
Start     End       Count
2020-1-1  2020-1-2  4
2020-1-2  2020-1-3  2
2020-1-3  2020-1-4  2
2020-2-1  2020-2-2  3
2020-2-2  2020-2-3  1
2020-2-3  2020-2-4  0

I want to do this per grouping variable. A naive way to do it:
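lst = []
# 'grouping_variable' stands for the real grouping column; it is not part of the sample data above.
for start, data in df.groupby(['Start', 'grouping_variable']):
    data = data.sort_values('End')
    # Difference to the previous row; the first row keeps its own count.
    diff = data.Count.diff()
    diff.iloc[0] = data.Count.iloc[0]
    # Each interval starts where the previous one ended.
    start_dates = [data.Start.iloc[0]] + list(data.End[:-1].values)
    data = data.assign(Start=start_dates,
                       Count=diff)
    lst.append(data)
df = pd.concat(lst)

This does not feel "right", "pythonic" or "clean" in any way. Is there a better way? Perhaps pandas has a specific method for this?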
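Answered on 2020-07-09 13:06:07
IIUC, we can use cumcount and a boolean mask to capture the first row of each unique Start date group, then use np.where together with shift to apply the operation within each group. This relies on the rows being sorted by Start and End, so that each group is contiguous.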
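import numpy as np

# Make sure Count is numeric before subtracting.
df['Count'] = df['Count'].astype(int)
# True on the first row of each Start group.
s = df.groupby(['Start']).cumcount() == 0
# First rows keep their count; other rows get the difference to the previous row.
df['Count'] = np.where(s, df['Count'], df['Count'] - df['Count'].shift())
# First rows keep their Start; other rows start where the previous row ended.
df['Start'] = np.where(s, df['Start'], df['End'].shift(1))
print(df)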
       Start        End  Count
0 2020-01-01 2020-01-02    4.0
1 2020-01-02 2020-01-03    2.0
2 2020-01-03 2020-01-04    2.0
3 2020-02-01 2020-02-02    3.0
4 2020-02-02 2020-02-03    1.0
5 2020-02-03 2020-02-04    0.0

https://stackoverflow.com/questions/62814768
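
Since the question asks for a version that works per grouping variable, here is a minimal sketch of the same idea using grouped diff and shift, so the previous-row lookup never crosses a group boundary. The helper name undo_accumulation and its keys parameter are hypothetical, not from the question or the answer; the sample frame mirrors the data in the question, and the sketch assumes Count is already numeric.

```python
import pandas as pd


def undo_accumulation(df, keys):
    """Turn cumulative counts into per-interval counts within each group.

    `keys` is a list of grouping columns (hypothetical parameter), e.g.
    ['Start'] or ['Start', 'grouping_variable'].
    """
    out = df.sort_values(list(keys) + ['End']).copy()
    grouped = out.groupby(list(keys))
    # Difference to the previous row of the same group; the first row of
    # each group has no predecessor (NaN) and keeps its own count.
    counts = grouped['Count'].diff().fillna(out['Count'])
    # Each interval starts where the previous one in the group ended; the
    # first row of each group (NaT after shift) keeps its original Start.
    starts = grouped['End'].shift().fillna(out['Start'])
    out['Count'] = counts.astype(int)
    out['Start'] = starts
    return out


df = pd.DataFrame({
    'Start': pd.to_datetime(['2020-01-01'] * 3 + ['2020-02-01'] * 3),
    'End': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-04',
                           '2020-02-02', '2020-02-03', '2020-02-04']),
    'Count': [4, 6, 8, 3, 4, 4],
})
result = undo_accumulation(df, ['Start'])
print(result)
```

Both new columns are computed before either is assigned, so the grouping on Start is not affected by overwriting Start. For additional grouping columns, pass them in keys, for example undo_accumulation(df, ['Start', 'grouping_variable']).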