首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >Pandas中数据的指示变量

Pandas中数据的指示变量
EN

Stack Overflow用户
提问于 2022-08-27 15:52:56
回答 1查看 100关注 0票数 0

关于一组学生过去的考试成绩,我有以下数据

代码语言:javascript
复制
df = pd.DataFrame({
     'Student_ID': [1,2,1,1,2,3,1,3,3,3,2,2],
     'Date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28','2002-07-22'],
     'Subject': ['algebra','geometry','geometry','geometry','calculus','calculus','geometry','calculus','calculus','algebra','algebra','geometry']
})

我想增加一个名为Advantage的新专栏,如果学生在最近的3次考试中至少有2次参加了相同的科目,那么它等于1,而0则不然。所以输出看起来就像

代码语言:javascript
复制
    Student_ID  Date        Subject  Advantage
6    1          2022-01-12  geometry 1        
3    1          2021-11-28  geometry 0            
0    1          2021-04-28  algebra  0         
2    1          2011-03-01  geometry 0             
1    2          2022-05-21  geometry 0           
10   2          2009-05-28  algebra  0
11   2          2002-07-22  geometry 0           
4    2          1992-12-01  calculus 0           
9    3          2022-01-01  algebra  0           
7    3          2019-02-28  calculus 1           
8    3          2001-03-28  calculus 0
5    3          1999-10-28  calculus 0

下面是我尝试过的:首先,我按时间对数据进行排序:

代码语言:javascript
复制
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.sort_values(by=['Student_ID', 'Date'], ascending = [True, False], inplace=True)

然后,我认为使用np.wherenp.select函数可能会有帮助,但我被卡住了。提前谢谢你。

编辑:下面是该示例的更长版本:

代码语言:javascript
复制
df = pd.DataFrame({
     'Student_ID': [1,2,1,1,2,3,1,3,3,3,2,2,2,2,2,2],
     'Date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28','2002-07-22', '2005-7-01', '2004-7-22', '2012-03-22', '2009-10-12'],
     'Subject': ['algebra','geometry','geometry','geometry','calculus','calculus','geometry','calculus','calculus','algebra','algebra','geometry', 'calculus', 'calculus', 'geometry', 'geometry'],
     'Advantage':  [1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0]
})

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.sort_values(by=['Student_ID', 'Date'], ascending = [True, False], inplace=True)
df

所需的输出如下:

代码语言:javascript
复制
Student_ID  Date    Subject Advantage
6   1   2022-01-12  geometry    1
3   1   2021-11-28  geometry    0
0   1   2021-04-28  algebra     0
2   1   2011-03-01  geometry    0
1   2   2022-05-21  geometry    1
14  2   2012-03-22  geometry    0
15  2   2009-10-12  geometry    0
10  2   2009-05-28  algebra     0
12  2   2005-07-01  calculus    1
13  2   2004-07-22  calculus    0
11  2   2002-07-22  geometry    0
4   2   1992-12-01  calculus    0
9   3   2022-01-01  algebra     0
7   3   2019-02-28  calculus    1
8   3   2001-03-28  calculus    0
5   3   1999-10-28  calculus    0
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-08-27 16:23:54

代码语言:javascript
复制
# Import Data
df = pd.DataFrame({
     'Student_ID': [1,2,1,1,2,3,1,3,3,3,2,2,2,2,2,2],
     'Date': ['2021-04-28','2022-05-21','2011-03-01','2021-11-28','1992-12-01','1999-10-28','2022-01-12','2019-02-28','2001-03-28','2022-01-01','2009-05-28','2002-07-22', '2005-7-01', '2004-7-22', '2012-03-22', '2009-10-12'],
     'Subject': ['algebra','geometry','geometry','geometry','calculus','calculus','geometry','calculus','calculus','algebra','algebra','geometry', 'calculus', 'calculus', 'geometry', 'geometry'],
    #  'Advantage':  [1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0]
})

df.sort_values(['Student_ID','Date'],ascending=[True,False] ,inplace=True)
df.reset_index(inplace=True)
df['Index_Link']=df.reset_index().index

# Create Groups for Different Groups
df['Group'] = (df[['Student_ID', 'Subject']] != df[['Student_ID', 'Subject']].shift()).any(axis=1)
df['Group'] = df['Group'].cumsum()

# df['Group'].value_counts()
df_required = pd.DataFrame(df['Group'].value_counts())
df_required_lkp =  df_required.loc[df_required['Group'] >=2].reset_index()
df_required_v1 = pd.merge(df,df_required_lkp[['index','Group']], how='inner',left_on='Group', right_on='index')
df_required_v2 = df_required_v1.groupby(['Student_ID','Group_x']).first().reset_index()
df_required_v2['Advantage_Final'] = 1
# df_required_v2

df_final = pd.merge(df,df_required_v2[['Index_Link','Advantage_Final']], how='left',left_on='Index_Link', right_on='Index_Link')
df_final['Advantage_Final'] = df_final['Advantage_Final'].fillna(0)
df_final.drop(['index','Group','Index_Link'],axis=1,inplace=True)
df_final.rename(columns={"Advantage_Final":"Advantage"},inplace=True)

# Output Showing
df_final

嗨Nayr

我想这就是你想要的(没有循环)

如果你有什么问题,请告诉我

谢谢里昂

票数 2
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/73512436

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档