import pandas as pd

df1 = pd.DataFrame({'type': ['cst1', 'cst1', 'cst1', 'cst1', 'cst2', 'cst2', 'cst2', 'cst3', 'cst3', 'cst3', 'cst3'],
                    'year': [2017, 2018, 2019, 2020, 2018, 2019, 2020, 2017, 2018, 2019, 2020]})
type year
0 cst1 2017
1 cst1 2018
2 cst1 2019
3 cst1 2020
4 cst2 2018
5 cst2 2019
6 cst2 2020
7 cst3 2017
8 cst3 2018
9 cst3 2019
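10 cst3 2020

For the data above, I need to check each type: if it appears in all four years (2017, 2018, 2019, 2020), every row of that type should be labeled 1, otherwise 0. For example, the first type, cst1, appears in all 4 years and is labeled 1; cst2 appears in only 3 years and is labeled 0. Note: ideally the data contains only these 4 years, i.e. 2017 to 2020. Each type and year combination is unique.
Expected output: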
type year label
0 cst1 2017 1
1 cst1 2018 1
2 cst1 2019 1
3 cst1 2020 1
4 cst2 2018 0
5 cst2 2019 0
6 cst2 2020 0
7 cst3 2017 1
8 cst3 2018 1
9 cst3 2019 1
10 cst3 2020 1

Posted on 2021-05-09 13:16:10
I think a groupby/transform with nunique would do, provided all years fall within 2017 to 2020:
df1['label'] = (df1.groupby('type')['year'].transform('nunique') == 4).astype(int)
Alternative:
df1['label'] = 0
def test(x):
    # Keep only groups whose years are exactly the four required ones
    return set(x.values) == {2017, 2018, 2019, 2020}
df1.loc[df1.groupby('type')['year'].filter(test).index, 'label'] = 1

Posted on 2021-05-09 13:26:42
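The `nunique` check in the answer above only counts distinct years, which is why it relies on all years falling within 2017 to 2020. The following is a minimal sketch of that caveat, not part of the original answer: the `cst4` rows and the names `REQUIRED`, `naive`, and `guarded` are assumptions made up for illustration. The guarded variant blanks out years outside the required set before counting, so years outside the set are simply ignored.

```python
import pandas as pd

REQUIRED = [2017, 2018, 2019, 2020]

# Hypothetical data: cst4 has four distinct years, but not the required four.
df = pd.DataFrame({
    'type': ['cst1'] * 4 + ['cst4'] * 4,
    'year': [2017, 2018, 2019, 2020, 2015, 2016, 2017, 2018],
})

# Plain nunique: any four distinct years pass, so cst4 is labeled 1 as well.
naive = (df.groupby('type')['year'].transform('nunique') == 4).astype(int)

# Guarded variant: years outside REQUIRED become NaN, which nunique skips.
in_range = df['year'].where(df['year'].isin(REQUIRED))
guarded = (
    in_range.groupby(df['type']).transform('nunique') == len(REQUIRED)
).astype(int)

print(df.assign(naive=naive, guarded=guarded))
```

With this data, `naive` labels every row 1, while `guarded` labels only the cst1 rows 1.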
groupby() creates the groups; transform() then gives each row the tuple of years in its group.
required = (2017, 2018, 2019, 2020)
df1["label"] = (df1.groupby('type').transform(tuple)["year"] == required).astype('int')
print(df1)
type year label
0 cst1 2017 1
1 cst1 2018 1
2 cst1 2019 1
3 cst1 2020 1
4 cst2 2018 0
5 cst2 2019 0
6 cst2 2020 0
7 cst3 2017 1
8 cst3 2018 1
9 cst3 2019 1
10 cst3 2020 1

Posted on 2021-05-09 13:11:16
Let's try the following; astype(int) converts the booleans to 1 and 0:
import pandas as pd
df1 = pd.DataFrame({'type': ['cst1', 'cst1', 'cst1', 'cst1', 'cst2', 'cst2',
                             'cst2', 'cst3', 'cst3', 'cst3', 'cst3'],
                    'year': [2017, 2018, 2019, 2020, 2018, 2019, 2020, 2017,
                             2018, 2019, 2020]})
years = {2017, 2018, 2019, 2020}
df1['label'] = (
    df1.groupby('type').year.transform(lambda x: years.issubset(x))
).astype(int)
print(df1)
df1:
type year label
0 cst1 2017 1
1 cst1 2018 1
2 cst1 2019 1
3 cst1 2020 1
4 cst2 2018 0
5 cst2 2019 0
6 cst2 2020 0
7 cst3 2017 1
8 cst3 2018 1
9 cst3 2019 1
10 cst3 2020 1

*Note: this matches any group that contains at least these four years. So a group with entries for 2016, 2017, 2018, 2019, 2020 would also be matched.
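To make the note concrete, here is a small sketch comparing the subset test with an exact set comparison. It is only an illustration under assumed data: the `cst5` rows and the names `subset_label` and `exact_label` are hypothetical and do not come from the question.

```python
import pandas as pd

years = {2017, 2018, 2019, 2020}

# Hypothetical data: cst5 covers 2016 through 2020, one year more than required.
df = pd.DataFrame({
    'type': ['cst1'] * 4 + ['cst5'] * 5,
    'year': [2017, 2018, 2019, 2020, 2016, 2017, 2018, 2019, 2020],
})

grouped = df.groupby('type')['year']

# Subset test: the group must contain at least the required years.
subset_label = grouped.transform(lambda x: years.issubset(x)).astype(int)

# Exact test: the group's years must equal the required set, nothing extra.
exact_label = grouped.transform(lambda x: set(x) == years).astype(int)

print(df.assign(subset_label=subset_label, exact_label=exact_label))
```

Here `subset_label` is 1 for both types, while `exact_label` is 1 only for cst1. Which behaviour is wanted depends on whether extra years should disqualify a type.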
https://stackoverflow.com/questions/67458024