I have an input dataset (a sample is shown below) that I want to downsample. To do this, I am using:
resample_time=25
init_len = len(df.index)
df = df.set_index('time', drop=False).resample('{}S'.format(resample_time)).last().dropna()
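df.index = range(0, len(df.index))
A sample of the output is shown below. However, the output I get is not what I expected (see the expected output), which is to keep one row every 25 seconds. Can you explain why this happens, and how can we fix it?
Input: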
lon lat time
0 116.317117 40.075417 2007-05-06 04:21:12
1 116.317067 40.075217 2007-05-06 04:21:33
2 116.317233 40.075250 2007-05-06 04:21:53
3 116.317217 40.075417 2007-05-06 04:22:04
4 116.317133 40.075567 2007-05-06 04:22:23
5 116.317167 40.075400 2007-05-06 04:46:48
6 116.317233 40.075183 2007-05-06 04:46:54
7 116.317050 40.074933 2007-05-06 04:47:00
8 116.313567 40.073983 2007-05-06 04:47:36
9 116.311133 40.073167 2007-05-06 04:48:44
10 116.308017 40.072300 2007-05-06 04:49:15
11 116.307467 40.072483 2007-05-06 04:49:22
12 116.306250 40.074017 2007-05-06 04:49:45
13 116.306450 40.074283 2007-05-06 04:49:52
Output:
lon lat time
0 116.317117 40.075417 2007-05-06 04:21:12
1 116.317067 40.075217 2007-05-06 04:21:33
2 116.317217 40.075417 2007-05-06 04:22:04
3 116.317133 40.075567 2007-05-06 04:22:23
4 116.317050 40.074933 2007-05-06 04:47:00
5 116.313567 40.073983 2007-05-06 04:47:36
6 116.311133 40.073167 2007-05-06 04:48:44
7 116.307467 40.072483 2007-05-06 04:49:22
8 116.306450 40.074283 2007-05-06 04:49:52
9 116.308567 40.071850 2007-05-06 04:50:30
10 116.308667 40.071650 2007-05-06 04:50:57
11 116.310450 40.068850 2007-05-06 04:51:38
12 116.311800 40.067717 2007-05-06 04:52:02
13 116.312300 40.067067 2007-05-06 04:52:21
14 116.312667 40.066617 2007-05-06 04:52:32
15 116.312800 40.066450 2007-05-06 04:53:05
16 116.314067 40.064867 2007-05-06 04:53:38
17 116.314783 40.063667 2007-05-06 04:54:14
18 116.315867 40.062167 2007-05-06 04:54:41
19 116.318550 40.058583 2007-05-06 04:55:20
Expected output:
lon lat time
0 116.317117 40.075417 2007-05-06 04:21:12 -> Include
1 116.317067 40.075217 2007-05-06 04:21:33 -> Exclude
2 116.317233 40.075250 2007-05-06 04:21:53 -> Include
3 116.317217 40.075417 2007-05-06 04:22:04 -> Exclude
4 116.317133 40.075567 2007-05-06 04:22:23 -> Include
5 116.317167 40.075400 2007-05-06 04:46:48 -> Include
6 116.317233 40.075183 2007-05-06 04:46:54 -> Exclude
7 116.317050 40.074933 2007-05-06 04:47:00 -> Exclude
8 116.313567 40.073983 2007-05-06 04:47:36 -> Include
9 116.311133 40.073167 2007-05-06 04:48:44 -> Exclude
10 116.308017 40.072300 2007-05-06 04:49:15 -> Include
11 116.307467 40.072483 2007-05-06 04:49:22
12 116.306250 40.074017 2007-05-06 04:49:45
13 116.306450 40.074283 2007-05-06 04:49:52
PS: You can read the .csv file from the link with pd.read_csv('20070506033305.csv', parse_dates=['time']).
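Posted on 2022-09-13 12:03:54
Based on your expected output, it seems that you want to do the following:
Set a timedelta threshold of 25 seconds and find the first subsequent Timestamp that exceeds this threshold. There may be better ways to achieve this result, but the code below should work.
Data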
import pandas as pd
data = {'lon': {0: 116.317117, 1: 116.317067, 2: 116.317233, 3: 116.317217,
4: 116.317133, 5: 116.317167, 6: 116.317233, 7: 116.31705,
8: 116.313567, 9: 116.311133, 10: 116.308017, 11: 116.307467,
12: 116.30625, 13: 116.30645},
'lat': {0: 40.075417, 1: 40.075217, 2: 40.07525, 3: 40.075417,
4: 40.075567, 5: 40.0754, 6: 40.075183, 7: 40.074933,
8: 40.073983, 9: 40.073167, 10: 40.0723, 11: 40.072483,
12: 40.074017, 13: 40.074283},
'time': {0: '2007-05-06 04:21:12', 1: '2007-05-06 04:21:33',
2: '2007-05-06 04:21:53', 3: '2007-05-06 04:22:04',
4: '2007-05-06 04:22:23', 5: '2007-05-06 04:46:48',
6: '2007-05-06 04:46:54', 7: '2007-05-06 04:47:00',
8: '2007-05-06 04:47:36', 9: '2007-05-06 04:48:44',
10: '2007-05-06 04:49:15', 11: '2007-05-06 04:49:22',
12: '2007-05-06 04:49:45', 13: '2007-05-06 04:49:52'}
}
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df.time)
Code
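from datetime import timedelta

# Labels of the rows to keep; the first row is always kept.
index_lst = [0]

for i in df.index[:-1]:
    # Only rows that were kept serve as a new starting point.
    if i in index_lst:
        # True for every later row more than 25 seconds after row i.
        check = df.time[i+1:].gt(df.time[i] + timedelta(seconds=25))
        # No later row exceeds the threshold: stop.
        if all(~check):
            break
        # idxmax() returns the label of the first True.
        index_lst.append(check.idxmax())

res = df.iloc[index_lst].reset_index(drop=True)
print(res)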
lon lat time
0 116.317117 40.075417 2007-05-06 04:21:12
1 116.317233 40.075250 2007-05-06 04:21:53
2 116.317133 40.075567 2007-05-06 04:22:23
3 116.317167 40.075400 2007-05-06 04:46:48
4 116.313567 40.073983 2007-05-06 04:47:36
5 116.311133 40.073167 2007-05-06 04:48:44
6 116.308017 40.072300 2007-05-06 04:49:15
7 116.306250 40.074017 2007-05-06 04:49:45
Explanation
Suppose i == 0. We then get:
df.time[1:].gt(df.time[0]+timedelta(seconds=25)).head()
1 False
2 True # namely: '2007-05-06 04:21:53'
3 True
4 True
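5 True
Using idxmax(), we get 2, the index of the first True, and append this value to the list. The loop then skips 1 (not in the list) and restarts at 2 (now in the list). Adding if all(~check): break lets us leave the for loop as soon as we know that no more True values will be found.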
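A small sketch of why the break is needed, using two made-up boolean Series (the labels 1..3 and 12..13 are only illustrative). On a boolean Series, idxmax() returns the label of the first True, but when there is no True at all it still returns a label, namely the first one. Without the break, that label would be appended to the list by mistake.

```python
import pandas as pd

# Boolean Series with labels 1..3: the first True sits at label 2.
check = pd.Series([False, True, True], index=[1, 2, 3])
first_true = check.idxmax()  # label of the first True

# With no True at all, idxmax() still returns a label: the first one.
no_match = pd.Series([False, False], index=[12, 13])
fallback = no_match.idxmax()

print(first_true, fallback)
```

Here first_true is 2, while fallback is 12 even though no row matched, which is the case the all(~check) test guards against.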
Note: I am assuming here that the following labels are the result of a misreading:
8 116.313567 40.073983 2007-05-06 04:47:36 -> Include
9 116.311133 40.073167 2007-05-06 04:48:44 -> Exclude
However, row 9 should also be included (the gap is 1 minute 8 seconds).
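On the "why" part of the question, here is a minimal sketch under the assumption of pandas' default resample settings. As far as I understand, resample does not measure the gap from the previously kept row. It cuts the time axis into fixed 25-second bins aligned to the start of the day, and last() then returns the last row of each non-empty bin. Two rows that are close together but fall on either side of a bin edge are therefore both kept. The example reuses the first five rows of the input and passes the rule as a pd.Timedelta instead of the '25S' string.

```python
import pandas as pd

# First five rows of the sample input from the question.
df = pd.DataFrame({
    'lon': [116.317117, 116.317067, 116.317233, 116.317217, 116.317133],
    'lat': [40.075417, 40.075217, 40.075250, 40.075417, 40.075567],
    'time': pd.to_datetime([
        '2007-05-06 04:21:12', '2007-05-06 04:21:33', '2007-05-06 04:21:53',
        '2007-05-06 04:22:04', '2007-05-06 04:22:23']),
})

step = pd.Timedelta(seconds=25)

# Same pipeline as in the question, with the rule given as a Timedelta.
binned = df.set_index('time', drop=False).resample(step).last().dropna()

# The index holds the left edge of each bin; the 'time' column holds
# the last timestamp that fell into that bin.
print(binned[['time']])

kept = binned['time'].dt.strftime('%H:%M:%S').tolist()
bin_edges = binned.index.strftime('%H:%M:%S').tolist()
```

The bins start at 04:20:50, 04:21:15, 04:21:40 and 04:22:05. Rows 0 and 1 are only 21 seconds apart but land in different bins, so both survive, while rows 2 and 3 share a bin and only row 3 is kept. That matches the first four rows of the output shown in the question.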
https://stackoverflow.com/questions/73700030