我有一个表格,如下所示。第一列是年份,第二列是路面处理类型,第三列是路面评分。我需要创建名为'year diff‘的第三列,方法是从当前分数的年份减去最后一次治疗的年份。例如,2014年需要减去2013年,因为处理9是在2013年完成的,而结果1需要记录在相应单元格的col diff中。2022年需要减去2020年,因为处理10是在2020年完成的。

非常感谢大家的帮助。
由衷地
威尔逊
发布于 2019-01-30 14:11:05
IIUC,你可以使用:
df['identifier']=(df['year'].diff().eq(1)&df['treatment'].notnull()).cumsum()
df['year diff ']=df.groupby('identifier')['identifier'].apply\
(lambda x: pd.Series(np.where(x!=0,pd.Series(pd.factorize(x)[0]+1).cumsum().shift(),np.nan))).values
print(df)或者,如果你需要根据治疗中的值来考虑分数的差异:
df['identifier']=(df['year'].diff().eq(1) &df['treatment'].notnull()).cumsum()
df['year diff']=df.groupby('identifier')['score']\
.apply(lambda x : pd.Series(np.where(x!=0,x.diff().expanding().sum(),np.nan))).reset_index(drop=True)
df.loc[df['identifier']==0,'year diff']=np.nan
print(df)
year treatment score identifier year diff
0 2010 NaN 1 0 NaN
1 2011 NaN 2 0 NaN
2 2012 NaN 3 0 NaN
3 2013 9.0 4 1 NaN
4 2014 NaN 5 1 1.0
5 2015 NaN 6 1 2.0
6 2016 NaN 7 1 3.0
7 2017 NaN 8 1 4.0
8 2018 NaN 9 1 5.0
9 2019 NaN 10 1 6.0
10 2020 10.0 11 2 NaN
11 2021 NaN 12 2 1.0
12 2022 NaN 13 2 2.0
13 2023 NaN 14 2 3.0
14 2024 NaN 15 2 4.0
15 2025 12.0 16 3 NaN
16 2026 NaN 17 3 1.0
17 2027 NaN 18 3 2.0发布于 2019-01-30 14:37:10
使用:
#check not missing values
m = df['treatment'].notnull()
#create groups starting not missing values
s = m.cumsum()
#add missing values for first group and for not missing values
mask = (s == 0) | m
#subtract score with first score per group
out = df['score'] - df['score'].groupby(s).transform('first')
#add missing values
df['year diff'] = np.where(mask, np.nan, out)
print (df)
year treatment score year diff
0 2010 NaN 1 NaN
1 2011 NaN 2 NaN
2 2012 NaN 3 NaN
3 2013 9.0 4 NaN
4 2014 NaN 5 1.0
5 2015 NaN 6 2.0
6 2016 NaN 7 3.0
7 2017 NaN 8 4.0
8 2018 NaN 9 5.0
9 2019 NaN 10 6.0
10 2020 10.0 11 NaN
11 2021 NaN 12 1.0
12 2022 NaN 13 2.0
13 2023 NaN 14 3.0
14 2024 NaN 15 4.0
15 2025 12.0 16 NaN
16 2026 NaN 17 1.0
17 2027 NaN 18 2.0发布于 2019-01-30 14:50:02
如果您希望使用for循环来执行此操作:
df = pd.DataFrame(mydata)
mylist = df.index[df['treatment'] != ''].tolist()现在我们减去year的值
re_list= []
for index,row in df.iterrows():
if index > min(mylist):
m = [i for i in mylist if i <= index]
re_list.append(df.iloc[index]['year'] - df.iloc[max(m)]['year'])
else:
re_list.append(0)
df['Result'] = re_listhttps://stackoverflow.com/questions/54434020
复制相似问题