I have a data file like this:
indx user_id type date
0 123 A Level-1 2021-01-15
1 123 A Level-1 2021-01-10
2 123 A Level-2 2021-01-10
3 123 B Level-2 2021-01-11
4 123 not_ctrgzd 2021-01-10
5 124 A Level-2 2021-02-11
6 124 B Level-1 2021-01-21
7 124 B Level-1+ 2021-02-11
8 125 not_ctrgzd 2021-01-31
9 126 A Level-1 2021-02-02
...
What I need is the row with the most recent date for each unique type of each user, i.e.:
indx user_id type date
0 123 A Level-1 2021-01-15
2 123 A Level-2 2021-01-10
3 123 B Level-2 2021-01-11
4 123 not_ctrgzd 2021-01-10
5 124 A Level-2 2021-02-11
6 124 B Level-1 2021-01-21
7 124 B Level-1+ 2021-02-11
8 125 not_ctrgzd 2021-01-31
9 126 A Level-1 2021-02-02
The following code block does that:
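# Keep rows whose date equals the latest date within their (user_id, type) group
idx = df.groupby(['user_id','type'])['date'].transform('max') == df['date']
df[idx]
What I can't work out is how to get, for each type (A, B, etc.), the row with the highest level, so that the final data looks like this: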
indx user_id type date
2 123 A Level-2 2021-01-10
3 123 B Level-2 2021-01-11
4 123 not_ctrgzd 2021-01-10
5 124 A Level-2 2021-02-11
7 124 B Level-1+ 2021-02-11
8 125 not_ctrgzd 2021-01-31
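9 126 A Level-1 2021-02-02
This is because B Level-1+ is greater than B Level-1, A Level-2 is greater than A Level-1, and so on. Note that some rows have an uncategorized type (not_ctrgzd); these should be included in the final data frame in every case. Please don't hesitate to correct anything that doesn't make sense, or the title if you like :). Thanks!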
Posted on 2021-02-11 18:22:22
You can do this with pd.CategoricalDtype:
import pandas as pd
# Create an ordered categorical dtype for the level part of type
catTypeDtype = pd.CategoricalDtype(['1','1+','2'], ordered=True)
# Split type into two helper columns: t1 (the letter, or the whole string) and t2 (the level)
df[['t1','t2']] = df['type'].str.extract(r'(?P<t1>[AB]|(?:.*))(?P<t2>.*)')
# Change dtype from string to categorical; values that are not a known level become NaN
df['t2'] = df['t2'].astype(catTypeDtype)
# Sort the dataframe on the categorical level and on date, highest and latest first
dfs = df.sort_values(['t2','date'], ascending=[False, False])
# Group by user and letter, and take the first record after sorting
df_out = dfs.groupby(['user_id','t1'], group_keys=False, as_index=False).first()\
.drop(['t1','t2'], axis=1)
df_out
Output:
user_id indx type date
0 123 2 A2 2021-01-10
1 123 3 B2 2021-01-11
2 123 4 not_ctrgzd 2021-01-10
3 124 5 A2 2021-02-11
4 124 6 B2 2021-01-21
5 125 8 not_ctrgzd 2021-01-31
6 126 9 A1 2021-02-02
Update with the new data:
catTypeDtype = pd.CategoricalDtype(['1','1+','2'], ordered=True)
# Raw string so that \s reaches the regex engine unchanged; the optional " Level-" part is skipped
df[['t1','t2']] = df['type'].str.extract(r'(?P<t1>[AB]|(?:.*))(?:\sLevel-)?(?P<t2>.*)')
df['t2'] = df['t2'].astype(catTypeDtype)
dfs = df.sort_values(['t2','date'], ascending=[False, False])
df_out = dfs.groupby(['user_id','t1'], group_keys=False, as_index=False).first()\
.drop(['t1','t2'], axis=1)
Output:
user_id indx type date
0 123 2 A Level-2 2021-01-10
1 123 3 B Level-2 2021-01-11
2 123 4 not_ctrgzd 2021-01-10
3 124 5 A Level-2 2021-02-11
4 124 7 B Level-1+ 2021-02-11
5 125 8 not_ctrgzd 2021-01-31
6 126 9 A Level-1 2021-02-02
https://stackoverflow.com/questions/66159941
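For anyone who wants to try the approach end to end, here is a minimal self-contained sketch built on the sample rows from the question. It is a variation on the answer, not the answer's exact code: it uses `drop_duplicates` after sorting instead of `groupby(...).first()`, and the helper column names `letter` and `level`, the variable names, and the stricter regex are assumptions made for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "indx": range(10),
    "user_id": [123, 123, 123, 123, 123, 124, 124, 124, 125, 126],
    "type": ["A Level-1", "A Level-1", "A Level-2", "B Level-2", "not_ctrgzd",
             "A Level-2", "B Level-1", "B Level-1+", "not_ctrgzd", "A Level-1"],
    "date": pd.to_datetime(["2021-01-15", "2021-01-10", "2021-01-10", "2021-01-11",
                            "2021-01-10", "2021-02-11", "2021-01-21", "2021-02-11",
                            "2021-01-31", "2021-02-02"]),
})

# Ordered levels: "1" < "1+" < "2".
level_dtype = pd.CategoricalDtype(["1", "1+", "2"], ordered=True)

# Split "A Level-1+" into letter "A" and level "1+".
# Rows that do not match (e.g. "not_ctrgzd") get NaN in both parts.
parts = df["type"].str.extract(r"^(?P<letter>[AB]) Level-(?P<level>.+)$")
df["letter"] = parts["letter"].fillna(df["type"])  # uncategorized rows keep their full type
df["level"] = parts["level"].astype(level_dtype)

result = (
    df.sort_values(["level", "date"], ascending=[False, False])  # highest level, then latest date
      .drop_duplicates(["user_id", "letter"])                    # keep the top row per user and letter
      .sort_values("indx")
      .drop(columns=["letter", "level"])
      .reset_index(drop=True)
)
print(result)
```

Because `drop_duplicates` keeps whole rows, the values in each output row always come from a single input row, whereas `first()` picks the first non-null value per column.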