I have a DataFrame, df, that looks like this:
ID | TERM | DISC_1
1 | 2003-10 | ECON
1 | 2002-01 | ECON
1 | 2002-10 | ECON
2 | 2003-10 | CHEM
2 | 2004-01 | CHEM
2 | 2004-10 | ENGN
2 | 2005-01 | ENGN
3 | 2001-01 | HISTR
3 | 2002-10 | HISTR
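3 | 2002-10 | HISTR

ID is a student ID, TERM is an academic term, and DISC_1 is the discipline of the student's major. For each student, I want to determine whether, and in which term, they changed DISC_1, and then create a new DataFrame that reports when the change happened. A zero means the student never changed. The output looks like this: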
ID | Change
1 | 0
2 | 2004-01
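3 | 0

My code below works, but it is very slow. I tried to do this with groupby but could not get it to work. Can someone explain how I can accomplish this task more efficiently?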
import numpy as np

# PIDM is presumably the student ID column (shown as ID in the sample above).
df = df.sort_values(by=['PIDM', 'TERM'])
c = 0
last_PIDM = 0
last_DISC_1 = 0
change = []
for index, row in df.iterrows():
    c = c + 1
    if c > 1:
        # Record the term when the same student appears with a different discipline.
        row['change'] = np.where((row['PIDM'] == last_PIDM) & (row['DISC_1'] != last_DISC_1), row['TERM'], 0)
        last_PIDM = row['PIDM']
        last_DISC_1 = row['DISC_1']
    else:
        row['change'] = 0
    change.append(row['change'])
df['change'] = change
change_terms = df.groupby('PIDM')['change'].max()

https://stackoverflow.com/questions/38338127
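
For comparison, here is a minimal sketch of one way the row-by-row loop could be replaced with a grouped shift. It is only an illustration under stated assumptions, not a definitive answer: the sample data is rebuilt by hand from the table above, the student column is called ID (the loop calls it PIDM), and TERM strings are assumed to sort chronologically as plain text. Note that this sketch reports the first term in which the new discipline appears (2004-10 for student 2), whereas the sample output lists 2004-01; if the last term before the switch is what is wanted, comparing against the next row with shift(-1) may be closer.

```python
import pandas as pd

# Sample data copied by hand from the table in the question.
# Assumption: the ID column here is the same field the loop calls PIDM.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "TERM": ["2003-10", "2002-01", "2002-10", "2003-10", "2004-01",
             "2004-10", "2005-01", "2001-01", "2002-10", "2002-10"],
    "DISC_1": ["ECON", "ECON", "ECON", "CHEM", "CHEM",
               "ENGN", "ENGN", "HISTR", "HISTR", "HISTR"],
})

# Assumption: "YYYY-MM" strings sort in chronological order.
df = df.sort_values(["ID", "TERM"])

# Previous discipline within each student; missing on each student's first row.
prev_disc = df.groupby("ID")["DISC_1"].shift()

# True where the student has an earlier row and the discipline differs from it.
changed = prev_disc.notna() & (df["DISC_1"] != prev_disc)

# First changed term per student. Cast to object so the integer 0
# placeholder can sit next to the term strings.
first_change = df.loc[changed].groupby("ID")["TERM"].first().astype(object)

# Students without any change are absent from first_change; fill them with 0.
all_ids = pd.Index(df["ID"].unique(), name="ID")
change_terms = (
    first_change.reindex(all_ids, fill_value=0)
    .rename("Change")
    .reset_index()
)
print(change_terms)
```

The comparison runs on whole columns at once, so pandas does the per-student work inside groupby rather than in a Python-level loop over iterrows, which is usually where the time goes.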