I have a pd.DataFrame with the following structure:
$ df.head(11)
target refTime name latitude longitude
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993
3 5.0 2020-05-31 22:47:00 SUMU -34.830002 -56.000000
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001
This is weather data recorded by multiple stations on May 31, 2020. Station names may appear more than once, e.g. SUMU, but each duplicate has a different refTime.
Question: how do I drop the duplicate station records while keeping the observation closest to the full hour (23:00:00 in this example)? For this particular example, I would end up dropping row 3.
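For reference, the frame above can be reconstructed with a small sketch like this (my abbreviation, trimmed to the rows relevant to the duplicated station SUMU plus one unique station):

```python
import pandas as pd

# Minimal reproduction of the frame above, abbreviated to three rows.
df = pd.DataFrame(
    {
        "target": [5.0, 5.0, 5.0],
        "refTime": pd.to_datetime(
            ["2020-05-31 22:47:00", "2020-05-31 23:00:00", "2020-05-31 22:59:00"]
        ),
        "name": ["SUMU", "SGES", "SUMU"],
        "latitude": [-34.830002, -25.450001, -34.830002],
        "longitude": [-56.000000, -54.849998, -56.000000],
    }
)

# SUMU appears twice, with different refTime values.
print(df["name"].duplicated().any())  # True
```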
Posted on 2020-06-09 13:58:17
Round each refTime to the nearest hour with Series.dt.round, subtract the original refTime, take the absolute value with Series.abs, find the index of the per-station minimum with DataFrameGroupBy.idxmin, and select those rows with DataFrame.loc:
idx = (df["refTime"].dt.round("H").sub(df["refTime"]).abs()
.groupby(df['name'], sort=False).idxmin())
df = df.loc[idx]
print (df)
target refTime name latitude longitude
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001

Details:
print (df["refTime"].dt.round("H").sub(df["refTime"]))
0 00:12:00
1 00:09:00
2 00:04:00
3 00:13:00
4 00:14:00
5 00:00:00
6 00:00:00
7 00:01:00
8 00:00:00
9 00:00:00
10 00:10:00
Name: refTime, dtype: timedelta64[ns]

print (idx)
name
YMLT 0
YWGT 1
YMAY 2
SUMU 7
FACT 4
SGES 5
SGAS 6
NFFN 8
SBPS 9
NSTU 10
Name: refTime, dtype: int64

Posted on 2020-06-09 14:00:41
Let's try assign and dt.round.
The idea is to compute each value's distance to the nearest hour, sort by that distance, and keep the first occurrence of each station.
import pandas as pd
import numpy as np
df2 = (
    df.assign(
        # absolute distance to the nearest hour, in minutes
        hour_diff=(df["refTime"].dt.round("H") - df["refTime"]).abs() / np.timedelta64(1, "m")
    )
    .sort_values("hour_diff")
    .drop_duplicates(subset=["name"], keep="first")
    .drop("hour_diff", axis=1)
)
print(df2)
target refTime name latitude longitude
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000

The distance metric looks like this:
df.assign(
    hour_diff=(df["refTime"].dt.round("H") - df["refTime"]).abs() / np.timedelta64(1, "m")
)
target refTime name latitude longitude hour_diff
0 5.0 2020-05-31 22:48:00 YMLT -41.529999 147.190002 12.0
1 6.0 2020-05-31 22:51:00 YWGT -36.419998 146.300003 9.0
2 6.0 2020-05-31 22:56:00 YMAY -36.060001 146.929993 4.0
3 5.0 2020-05-31 22:47:00 SUMU -34.830002 -56.000000 13.0 # we drop this
4 3.0 2020-05-31 22:46:00 FACT -33.990002 18.600000 14.0
5 5.0 2020-05-31 23:00:00 SGES -25.450001 -54.849998 0.0
6 5.0 2020-05-31 23:00:00 SGAS -25.250000 -57.520000 0.0
7 5.0 2020-05-31 22:59:00 SUMU -34.830002 -56.000000 1.0 # we keep this one
8 8.0 2020-05-31 23:00:00 NFFN -17.750000 177.449997 0.0
9 7.0 2020-05-31 23:00:00 SBPS -16.430000 -39.080002 0.0
10 7.0 2020-05-31 22:50:00 NSTU -14.330000 -170.720001

Posted on 2020-06-09 13:54:25
We can use the .drop_duplicates() function to remove every row that duplicates another row. Note that in your case the rows differ in refTime, so a bare df.drop_duplicates() drops nothing; the comparison has to be restricted to the station name with df.drop_duplicates(subset=["name"]).
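To see the caveat concretely, here is a short sketch (on an abbreviated version of the sample frame) showing that a bare drop_duplicates() keeps all rows, and that sorting by the distance to the hour before deduplicating on name makes keep="first" retain the observation closest to 23:00, as in the answers above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["SUMU", "SGES", "SUMU"],
        "refTime": pd.to_datetime(
            ["2020-05-31 22:47:00", "2020-05-31 23:00:00", "2020-05-31 22:59:00"]
        ),
    }
)

# The full rows differ (refTime is different), so nothing is dropped here.
assert len(df.drop_duplicates()) == 3

# Sort by the absolute distance to the nearest hour, then keep the first
# (i.e. closest) row per station name.
dist = df["refTime"].dt.round("H").sub(df["refTime"]).abs()
out = (
    df.assign(dist=dist)
    .sort_values("dist", kind="stable")
    .drop_duplicates(subset=["name"], keep="first")
    .drop(columns="dist")
)

print(out.loc[out["name"] == "SUMU", "refTime"].iloc[0])  # 2020-05-31 22:59:00
```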
https://stackoverflow.com/questions/62284231