I'm having some trouble dropping the right duplicates from a DataFrame.
I have the following example:
import numpy as np
import pandas as pd
test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
'2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
'value' : [123, '', 324, '', '', '', 321],}
df = pd.DataFrame(data=test)
The output is as follows:
                  date value
0  2012-10-12 10:10:10   123
1  2012-10-12 10:10:10
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07
4  2012-11-02 16:08:07
5  2012-12-12 23:45:21
6  2012-12-12 23:45:21   321
The desired output after dropping the duplicate dates looks like this:
                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07
6  2012-12-12 23:45:21   321
However, my attempts so far have not succeeded, as shown below:
Attempt 1:
df = df.drop_duplicates(subset='date')
                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07
5  2012-12-12 23:45:21
Attempt 2:
df = df.drop_duplicates(subset='date', keep='last')
                  date value
1  2012-10-12 10:10:10
2  2012-10-19 10:55:10   324
4  2012-11-02 16:08:07
6  2012-12-12 23:45:21   321
Please help me get the output I want. Many thanks in advance.
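Both attempts fail for the same reason: drop_duplicates chooses a row per date purely by position (first or last) and never looks at whether value is filled in. The sketch below reuses the question's data to list which index labels each option keeps; the helper name kept_index is hypothetical and only there for illustration.

```python
import pandas as pd

test = {
    'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
             '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21',
             '2012-12-12 23:45:21'],
    'value': [123, '', 324, '', '', '', 321],
}
df = pd.DataFrame(data=test)

def kept_index(frame, keep):
    # Hypothetical helper: index labels that survive drop_duplicates on 'date'.
    return frame.drop_duplicates(subset='date', keep=keep).index.tolist()

first_kept = kept_index(df, 'first')  # keeps row 5 (empty) instead of row 6
last_kept = kept_index(df, 'last')    # keeps row 1 (empty) instead of row 0
wanted = [0, 2, 3, 6]                 # index labels from the desired output
```

Neither list matches the wanted labels, so the rows need to be ordered by "has a value" before deduplicating, or the empty strings need to be treated as missing.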
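Posted on 2020-12-24 16:51:51
One approach is to mask the empty strings in the value column, then group by date and aggregate with first.
df['value'].mask(df['value'].eq('')).groupby(df['date']).first().fillna('').reset_index()
Alternatively, you can mask the empty strings in the value column and assign the result to a temporary column key, then sort the data on the date and key columns, and finally apply drop_duplicates.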
df['key'] = df['value'].mask(df['value'].eq(''))
df.sort_values(['date', 'key']).drop_duplicates('date').drop(columns='key')
Result:
                  date value
0  2012-10-12 10:10:10   123
1  2012-10-19 10:55:10   324
2  2012-11-02 16:08:07
3  2012-12-12 23:45:21   321
Posted on 2020-12-24 16:23:22
import numpy as np
import pandas as pd
test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
'2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
'value' : [123, np.nan, 324, np.nan, np.nan, np.nan, 321],}
This should work!
df = pd.DataFrame(data=test)
df.sort_values(by = "value", inplace = True)
df = df.drop_duplicates(subset='date')
df = df.replace(np.nan, '', regex=True)
df.sort_index()
The output is as follows:
                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07
6  2012-12-12 23:45:21   321
Posted on 2020-12-24 16:28:58
import pandas as pd
test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
'2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
'value' : [123, '', 324, '', '', '', 321],}
df = pd.DataFrame(data=test)
df["value_not_empty"] = df['value'].map(bool)
df = df.sort_values("value_not_empty")
df = df.drop(columns=["value_not_empty"])
df = df.drop_duplicates('date', keep='last')
df
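The snippet above sorts the rows so that non-empty values come last and then keeps the last row per date. A variant of the same idea is sketched below, assuming empty cells are empty strings as in the question. It uses a stable sort so that tied rows keep their original order, and it restores the original index order at the end. The function name dedupe_prefer_filled and its parameters are hypothetical, not part of the original answer.

```python
import pandas as pd

def dedupe_prefer_filled(frame, key='date', col='value'):
    # Hypothetical helper: keep one row per key, preferring a non-empty value.
    is_empty = frame[col].eq('')
    # Stable (mergesort) ascending sort on the flag: filled rows (False) come
    # first, and rows with the same flag stay in their original order.
    order = is_empty.sort_values(kind='mergesort').index
    deduped = frame.loc[order].drop_duplicates(subset=key, keep='first')
    # Put the surviving rows back into their original index order.
    return deduped.sort_index()

test = {
    'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
             '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21',
             '2012-12-12 23:45:21'],
    'value': [123, '', 324, '', '', '', 321],
}
result = dedupe_prefer_filled(pd.DataFrame(data=test))
```

With the question's data this keeps index labels 0, 2, 3 and 6. When a date has only empty rows, the stable sort means the earliest of them is the one kept.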

https://stackoverflow.com/questions/65440734