I have a DataFrame like this:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
247451 2948262 2017-11-14 -2.000
226868 2948262 2017-11-11 -1.000 <- not duplicated
240571 2948262 2017-11-13 -2.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
244543 2948269 2017-11-11 2.500
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
I have to delete the rows that have the same product_id but different stock_qty values. So I should end up with a DataFrame like this:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
244543 2948269 2017-11-11 2.500
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
Thanks for your help!
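For anyone who wants to reproduce the problem, here is a minimal sketch of how the sample frame could be built. The construction itself is an assumption: the question only shows the printed frame, so treating the first column as the index and parsing dt as datetimes are guesses.

```python
import pandas as pd

# Rebuild the sample frame from the question.
# Assumptions: the unlabeled first column is the index, and dt is a datetime column.
df = pd.DataFrame(
    {
        "product_id": [
            2948259, 2948259, 2948260, 2948260, 2948260, 2948260, 2948262,
            2948262, 2948262, 2948264, 2948264, 2948269, 2948276, 2948276,
        ],
        "dt": pd.to_datetime(
            [
                "2017-11-11", "2017-11-12", "2017-11-13", "2017-11-14",
                "2017-11-12", "2017-11-11", "2017-11-14", "2017-11-11",
                "2017-11-13", "2017-11-13", "2017-11-12", "2017-11-11",
                "2017-11-14", "2017-11-11",
            ]
        ),
        "stock_qty": [
            17.0, 17.0, 5.0, 5.0, 5.0, 5.0, -2.0,
            -1.0, -2.0, 5.488, 5.488, 2.5, 3.25, 3.25,
        ],
    },
    index=[
        226870, 233645, 240572, 247452, 233644, 226869, 247451,
        226868, 240571, 240570, 233643, 244543, 247450, 226867,
    ],
)

print(df)
```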
Posted on 2017-11-15 16:10:12
@jezrael's solution is the best one, but another approach is to use groupby and filter:
df.groupby(['product_id','stock_qty']).filter(lambda x: len(x)>1)
Output:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
247451 2948262 2017-11-14 -2.000
240571 2948262 2017-11-13 -2.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
247450 2948276 2017-11-14 3.250
226867 2948276 2017-11-11 3.250
Posted on 2017-11-15 16:15:44
Using drop_duplicates:
df.drop(df.drop_duplicates(['stock_qty', 'product_id'], keep=False).index)
Out[797]:
product_id dt stock_qty
226870 2948259 2017-11-11 17.000
233645 2948259 2017-11-12 17.000
240572 2948260 2017-11-13 5.000
247452 2948260 2017-11-14 5.000
233644 2948260 2017-11-12 5.000
226869 2948260 2017-11-11 5.000
247451 2948262 2017-11-14 -2.000
240571 2948262 2017-11-13 -2.000
240570 2948264 2017-11-13 5.488
233643 2948264 2017-11-12 5.488
247450 2948276 2017-11-14 3.250
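226867 2948276 2017-11-11 3.250
Posted on 2021-07-20 13:51:50
Using loc[], you can select only the duplicated rows and assign the result back to the original DataFrame.
df = df.loc[df.duplicated(subset=['product_id','stock_qty'], keep=False)]
Also, keep=False marks every row of a duplicated group as True. With keep='first' the first occurrence is left unmarked (False) and only the later ones are marked True; keep='last' does the same but leaves the last occurrence unmarked.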
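To make the effect of keep concrete, here is a small sketch on a hypothetical subset of the question's data (three products only; the subset and variable names are my own, not from the original posts). It also shows one possible alternative, based on groupby and transform('nunique'), for the case where the output in the question is taken literally: the answers in this thread keep the two -2.000 rows of product 2948262 and drop the single-row product 2948269, whereas the question's desired output does the opposite.

```python
import pandas as pd

# Hypothetical subset of the question's data: product 2948262 has conflicting
# stock_qty values, and product 2948269 has only a single row.
df = pd.DataFrame(
    {
        "product_id": [2948259, 2948259, 2948262, 2948262, 2948262, 2948269],
        "dt": pd.to_datetime(
            ["2017-11-11", "2017-11-12", "2017-11-14",
             "2017-11-11", "2017-11-13", "2017-11-11"]
        ),
        "stock_qty": [17.0, 17.0, -2.0, -1.0, -2.0, 2.5],
    },
    index=[226870, 233645, 247451, 226868, 240571, 244543],
)

cols = ["product_id", "stock_qty"]

# keep=False: every row whose (product_id, stock_qty) pair occurs more than once is True.
mask_all = df.duplicated(subset=cols, keep=False)
# keep='first': the first occurrence of each pair is False, later occurrences are True.
mask_first = df.duplicated(subset=cols, keep="first")
# keep='last': the last occurrence of each pair is False, earlier occurrences are True.
mask_last = df.duplicated(subset=cols, keep="last")

# Same selection as the answers in this thread: rows whose pair is repeated.
dup_rows = df.loc[mask_all]

# Alternative reading of the question: keep only products whose stock_qty
# never changes, including products that appear just once.
consistent = df[df.groupby("product_id")["stock_qty"].transform("nunique") == 1]

print(dup_rows)
print(consistent)
```

Which of the two selections is right depends on whether a product with a single row, or a product with one outlying quantity, should survive; the question's sample output suggests the second variant, but that is an interpretation.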
https://stackoverflow.com/questions/47312040