我的问题类似于Pandas: Selecting rows based on value counts of a particular column,但有两列:
这是来自dataframe的一个非常小的片段(主df包含数百万个条目):
overall vote verified reviewTime reviewerID productID
4677505 5.0 NaN True 11 28, 2017 A2O8EJJBFJ9F1 B00NR2VMNC
1302483 5.0 NaN True 04 1, 2017 A1YMYW7EWN4RL3 B001S2PPT0
5073908 3.0 83 True 02 12, 2016 A3H796UY7GIX0K B00ULRFQ1A
200512 5.0 NaN True 07 14, 2016 A150W68P8PYXZE B0000DC0T3
1529831 5.0 NaN True 12 19, 2013 A28GVVNJUZ3VFA B002WE3BZ8
1141922 5.0 NaN False 12 20, 2008 A2UOHALGF2X77Q B001CCLBSA
5930187 3.0 2 True 05 21, 2018 A2CUSR21CZQ6J7 B01DCDG9JC
1863730 5.0 NaN True 05 6, 2017 A38A3VQL8RLS8D B004HKIB6E
1835030 5.0 NaN True 06 20, 2016 A30QT3MWWEPNIE B004D09HRK
4226935 5.0 NaN True 12 27, 2015 A3UORFPF49N96B B00JP12170现在我要过滤数据帧,使每个reviewerID和productID在最终过滤的数据帧中至少出现k次(假设是k=2)。换句话说:每个用户和产品至少有k个不同的条目/行。
如果有任何帮助,我将不胜感激。
发布于 2021-08-12 14:01:02
试试这条路
k=2
df = pd.read_csv('text.csv')
df['count']=1
df_group = df[['reviewerID','productID','count']].groupby(['reviewerID','productID'],as_index=False).sum()
df_group = df_group[df_group['count']>=k]
df_group.drop(['count'],axis=1,inplace=True)
df.drop(['count'],axis=1,inplace=True)
df = df.merge(df_group,on=['reviewerID','productID'])
df希望这会有所帮助
https://stackoverflow.com/questions/68758680
复制相似问题