关于一组学生过去的考试成绩,我有以下数据
df = pd.DataFrame({
'Student_ID': [1,2,1,1,2,3,1,3,3,3,2,2],
'Date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28','2002-07-22'],
'Subject': ['algebra','geometry','geometry','geometry','calculus','calculus','geometry','calculus','calculus','algebra','algebra','geometry']
})我想增加一个名为Advantage的新专栏,如果学生在最近的3次考试中至少有2次参加了相同的科目,那么它等于1,而0则不然。所以输出看起来就像
Student_ID Date Subject Advantage
6 1 2022-01-12 geometry 1
3 1 2021-11-28 geometry 0
0 1 2021-04-28 algebra 0
2 1 2011-03-01 geometry 0
1 2 2022-05-21 geometry 0
10 2 2009-05-28 algebra 0
11 2 2002-07-22 geometry 0
4 2 1992-12-01 calculus 0
9 3 2022-01-01 algebra 0
7 3 2019-02-28 calculus 1
8 3 2001-03-28 calculus 0
5 3 1999-10-28 calculus 0下面是我尝试过的:首先,我按时间对数据进行排序:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.sort_values(by=['Student_ID', 'Date'], ascending = [True, False], inplace=True)然后,我认为使用np.where或np.select函数可能会有帮助,但我被卡住了。提前谢谢你。
编辑:下面是该示例的更长版本:
df = pd.DataFrame({
'Student_ID': [1,2,1,1,2,3,1,3,3,3,2,2,2,2,2,2],
'Date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28','2002-07-22', '2005-7-01', '2004-7-22', '2012-03-22', '2009-10-12'],
'Subject': ['algebra','geometry','geometry','geometry','calculus','calculus','geometry','calculus','calculus','algebra','algebra','geometry', 'calculus', 'calculus', 'geometry', 'geometry'],
'Advantage': [1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0]
})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.sort_values(by=['Student_ID', 'Date'], ascending = [True, False], inplace=True)
df所需的输出如下:
Student_ID Date Subject Advantage
6 1 2022-01-12 geometry 1
3 1 2021-11-28 geometry 0
0 1 2021-04-28 algebra 0
2 1 2011-03-01 geometry 0
1 2 2022-05-21 geometry 1
14 2 2012-03-22 geometry 0
15 2 2009-10-12 geometry 0
10 2 2009-05-28 algebra 0
12 2 2005-07-01 calculus 1
13 2 2004-07-22 calculus 0
11 2 2002-07-22 geometry 0
4 2 1992-12-01 calculus 0
9 3 2022-01-01 algebra 0
7 3 2019-02-28 calculus 1
8 3 2001-03-28 calculus 0
5 3 1999-10-28 calculus 0发布于 2022-08-27 16:23:54
# Import Data
df = pd.DataFrame({
'Student_ID': [1,2,1,1,2,3,1,3,3,3,2,2,2,2,2,2],
'Date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28','2002-07-22', '2005-7-01', '2004-7-22', '2012-03-22', '2009-10-12'],
'Subject': ['algebra','geometry','geometry','geometry','calculus','calculus','geometry','calculus','calculus','algebra','algebra','geometry', 'calculus', 'calculus', 'geometry', 'geometry'],
# 'Advantage': [1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0]
})
df.sort_values(['Student_ID','Date'],ascending=[True,False] ,inplace=True)
df.reset_index(inplace=True)
df['Index_Link']=df.reset_index().index
# Create Groups for Different Groups
df['Group'] = (df[['Student_ID', 'Subject']] != df[['Student_ID', 'Subject']].shift()).any(axis=1)
df['Group'] = df['Group'].cumsum()
# df['Group'].value_counts()
df_required = pd.DataFrame(df['Group'].value_counts())
df_required_lkp = df_required.loc[df_required['Group'] >=2].reset_index()
df_required_v1 = pd.merge(df,df_required_lkp[['index','Group']], how='inner',left_on='Group', right_on='index')
df_required_v2 = df_required_v1.groupby(['Student_ID','Group_x']).first().reset_index()
df_required_v2['Advantage_Final'] = 1
# df_required_v2
df_final = pd.merge(df,df_required_v2[['Index_Link','Advantage_Final']], how='left',left_on='Index_Link', right_on='Index_Link')
df_final['Advantage_Final'] = df_final['Advantage_Final'].fillna(0)
df_final.drop(['index','Group','Index_Link'],axis=1,inplace=True)
df_final.rename(columns={"Advantage_Final":"Advantage"},inplace=True)
# Output Showing
df_final嗨Nayr
我想这就是你想要的(没有循环)
如果你有什么问题,请告诉我
谢谢里昂
https://stackoverflow.com/questions/73512436
复制相似问题