
Downsampling data correctly

Stack Overflow user
Asked 2022-09-13 08:46:30
Answers 1, Views 75, Followers 0, Votes 0

I have an input dataset (a sample is shown below) and I want to downsample it. To do that, I am using:

resample_time=25
init_len = len(df.index)
df = df.set_index('time', drop=False).resample('{}S'.format(resample_time)).last().dropna()
df.index = range(0, len(df.index))

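A sample of the output is shown below. However, the output I get is not what I expect (see the expected output), which is to keep one row every 25 seconds. Can you explain why this happens, and how can it be fixed?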

Input:

          lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12
1    116.317067  40.075217 2007-05-06 04:21:33
2    116.317233  40.075250 2007-05-06 04:21:53
3    116.317217  40.075417 2007-05-06 04:22:04
4    116.317133  40.075567 2007-05-06 04:22:23
5    116.317167  40.075400 2007-05-06 04:46:48
6    116.317233  40.075183 2007-05-06 04:46:54
7    116.317050  40.074933 2007-05-06 04:47:00
8    116.313567  40.073983 2007-05-06 04:47:36
9    116.311133  40.073167 2007-05-06 04:48:44
10   116.308017  40.072300 2007-05-06 04:49:15
11   116.307467  40.072483 2007-05-06 04:49:22
12   116.306250  40.074017 2007-05-06 04:49:45
13   116.306450  40.074283 2007-05-06 04:49:52

Output:

            lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12
1    116.317067  40.075217 2007-05-06 04:21:33
2    116.317217  40.075417 2007-05-06 04:22:04
3    116.317133  40.075567 2007-05-06 04:22:23
4    116.317050  40.074933 2007-05-06 04:47:00
5    116.313567  40.073983 2007-05-06 04:47:36
6    116.311133  40.073167 2007-05-06 04:48:44
7    116.307467  40.072483 2007-05-06 04:49:22
8    116.306450  40.074283 2007-05-06 04:49:52
9    116.308567  40.071850 2007-05-06 04:50:30
10   116.308667  40.071650 2007-05-06 04:50:57
11   116.310450  40.068850 2007-05-06 04:51:38
12   116.311800  40.067717 2007-05-06 04:52:02
13   116.312300  40.067067 2007-05-06 04:52:21
14   116.312667  40.066617 2007-05-06 04:52:32
15   116.312800  40.066450 2007-05-06 04:53:05
16   116.314067  40.064867 2007-05-06 04:53:38
17   116.314783  40.063667 2007-05-06 04:54:14
18   116.315867  40.062167 2007-05-06 04:54:41
19   116.318550  40.058583 2007-05-06 04:55:20

Expected output:

        lon        lat                time
0    116.317117  40.075417 2007-05-06 04:21:12 -> Include
1    116.317067  40.075217 2007-05-06 04:21:33 -> Exclude
2    116.317233  40.075250 2007-05-06 04:21:53 -> Include
3    116.317217  40.075417 2007-05-06 04:22:04 -> Exclude
4    116.317133  40.075567 2007-05-06 04:22:23 -> Include
5    116.317167  40.075400 2007-05-06 04:46:48 -> Include 
6    116.317233  40.075183 2007-05-06 04:46:54 -> Exclude
7    116.317050  40.074933 2007-05-06 04:47:00 -> Exclude
8    116.313567  40.073983 2007-05-06 04:47:36 -> Include
9    116.311133  40.073167 2007-05-06 04:48:44 -> Exclude
10   116.308017  40.072300 2007-05-06 04:49:15 -> Include
11   116.307467  40.072483 2007-05-06 04:49:22
12   116.306250  40.074017 2007-05-06 04:49:45
13   116.306450  40.074283 2007-05-06 04:49:52

PS: You can read the linked .csv file with pd.read_csv('20070506033305.csv', parse_dates=['time']).


1 Answer

Stack Overflow user

Accepted answer

Posted 2022-09-13 12:03:54

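Based on your expected output, it looks like you want to do the following:

  • Starting from the top, set a timedelta threshold of 25 seconds and find the first subsequent Timestamp that exceeds the threshold.
  • Reset the threshold based on the newly found value, and continue until the end.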
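
For contrast, here is a small sketch of why a fixed-grid resample does not produce this result. It assumes a pandas version in which resample bins are anchored at midnight by default (origin='start_day'); the names demo and binned are made up for illustration. Each row falls into a fixed 25-second bin and .last() keeps one row per non-empty bin, so rows kept from neighbouring bins can be less than 25 seconds apart.

```python
import pandas as pd

# First five timestamps from the question's input sample
times = pd.to_datetime([
    "2007-05-06 04:21:12", "2007-05-06 04:21:33",
    "2007-05-06 04:21:53", "2007-05-06 04:22:04",
    "2007-05-06 04:22:23",
])
demo = pd.DataFrame({"time": times})

# Fixed 25-second bins (assumed to be aligned to midnight by default);
# .last() keeps the last row of each bin, .dropna() discards empty bins
binned = (
    demo.set_index("time", drop=False)
        .resample(pd.Timedelta(seconds=25))
        .last()
        .dropna()
)

print(binned["time"].dt.strftime("%H:%M:%S").tolist())
# expected with midnight-anchored bins:
# ['04:21:12', '04:21:33', '04:22:04', '04:22:23']
```

Here 04:21:12 and 04:21:33 land in adjacent bins and are both kept even though they are only 21 seconds apart, while 04:21:53 is dropped in favour of 04:22:04 from the same bin. This matches the first rows of the output shown in the question.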

There may be better ways to achieve this result, but the following should work.

Data

import pandas as pd

data = {'lon': {0: 116.317117, 1: 116.317067, 2: 116.317233, 3: 116.317217, 
                4: 116.317133, 5: 116.317167, 6: 116.317233, 7: 116.31705, 
                8: 116.313567, 9: 116.311133, 10: 116.308017, 11: 116.307467, 
                12: 116.30625, 13: 116.30645}, 
        'lat': {0: 40.075417, 1: 40.075217, 2: 40.07525, 3: 40.075417, 
                4: 40.075567, 5: 40.0754, 6: 40.075183, 7: 40.074933, 
                8: 40.073983, 9: 40.073167, 10: 40.0723, 11: 40.072483, 
                12: 40.074017, 13: 40.074283}, 
        'time': {0: '2007-05-06 04:21:12', 1: '2007-05-06 04:21:33', 
                 2: '2007-05-06 04:21:53', 3: '2007-05-06 04:22:04', 
                 4: '2007-05-06 04:22:23', 5: '2007-05-06 04:46:48', 
                 6: '2007-05-06 04:46:54', 7: '2007-05-06 04:47:00', 
                 8: '2007-05-06 04:47:36', 9: '2007-05-06 04:48:44', 
                 10: '2007-05-06 04:49:15', 11: '2007-05-06 04:49:22', 
                 12: '2007-05-06 04:49:45', 13: '2007-05-06 04:49:52'}
        }

df = pd.DataFrame(data)

df['time'] = pd.to_datetime(df.time)

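from datetime import timedelta

index_lst = [0]  # always keep the first row
for i in df.index[:-1]:
    if i in index_lst:
        # rows after i that are more than 25 seconds later than row i
        check = df.time[i+1:].gt(df.time[i]+timedelta(seconds=25))
        if all(~check):
            break  # no later row exceeds the threshold
        index_lst.append(check.idxmax())  # first True is the next row to keep

res = df.iloc[index_lst].reset_index(drop=True)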

print(res)

          lon        lat                time
0  116.317117  40.075417 2007-05-06 04:21:12
1  116.317233  40.075250 2007-05-06 04:21:53
2  116.317133  40.075567 2007-05-06 04:22:23
3  116.317167  40.075400 2007-05-06 04:46:48
4  116.313567  40.073983 2007-05-06 04:47:36
5  116.311133  40.073167 2007-05-06 04:48:44
6  116.308017  40.072300 2007-05-06 04:49:15
7  116.306250  40.074017 2007-05-06 04:49:45

Explanation

Suppose i == 0; we get:

df.time[1:].gt(df.time[0]+timedelta(seconds=25)).head()

1    False
2     True # namely: '2007-05-06 04:21:53'
3     True
4     True
5     True
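  • Using idxmax(), we get the index of the first True, which is 2, and append it to the list.
  • In the next iteration we skip 1 (not in the list) and resume at 2 (now in the list).

Adding if all(~check): break lets us exit the for loop as soon as no further True values can be found.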
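
The same greedy selection could also be sketched as a single pass over the timestamps, which avoids rebuilding the comparison mask for every kept row. This is only an illustrative alternative, not the code from the answer; the helper name greedy_downsample and its min_gap and time_col parameters are hypothetical, and the sketch assumes the time column is sorted in ascending order.

```python
import pandas as pd

def greedy_downsample(df, min_gap=pd.Timedelta(seconds=25), time_col="time"):
    """Keep the first row, then each row more than `min_gap` after the last kept row.

    Hypothetical helper; assumes df[time_col] is datetime-like and sorted ascending.
    """
    keep = []          # positional indices of the rows to keep
    last_kept = None   # timestamp of the most recently kept row
    for pos, ts in enumerate(df[time_col]):
        if last_kept is None or ts - last_kept > min_gap:
            keep.append(pos)
            last_kept = ts
    return df.iloc[keep].reset_index(drop=True)

# Timestamps from the question's input sample
sample = pd.DataFrame({"time": pd.to_datetime([
    "2007-05-06 04:21:12", "2007-05-06 04:21:33", "2007-05-06 04:21:53",
    "2007-05-06 04:22:04", "2007-05-06 04:22:23", "2007-05-06 04:46:48",
    "2007-05-06 04:46:54", "2007-05-06 04:47:00", "2007-05-06 04:47:36",
    "2007-05-06 04:48:44", "2007-05-06 04:49:15", "2007-05-06 04:49:22",
    "2007-05-06 04:49:45", "2007-05-06 04:49:52",
])})

res_single_pass = greedy_downsample(sample)
print(res_single_pass)
```

On the sample data this should keep the same eight timestamps as the loop above (04:21:12, 04:21:53, 04:22:23, 04:46:48, 04:47:36, 04:48:44, 04:49:15, 04:49:45).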

Note: I am assuming here that the following Include/Exclude labels are the result of a misreading:

8    116.313567  40.073983 2007-05-06 04:47:36 -> Include
9    116.311133  40.073167 2007-05-06 04:48:44 -> Exclude

Row 9 should also be included (the difference is 1 minute 8 seconds).
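
One way to sanity-check this reading is to look at the gaps between consecutive kept rows. The sketch below is illustrative only: it hard-codes the eight timestamps from the result printed earlier, and the names kept and gaps are made up here.

```python
import pandas as pd

# Timestamps of the rows kept in the result above
kept = pd.Series(pd.to_datetime([
    "2007-05-06 04:21:12", "2007-05-06 04:21:53", "2007-05-06 04:22:23",
    "2007-05-06 04:46:48", "2007-05-06 04:47:36", "2007-05-06 04:48:44",
    "2007-05-06 04:49:15", "2007-05-06 04:49:45",
]))

# Time difference between each kept row and the previous kept row
gaps = kept.diff().dropna()
print(gaps.dt.total_seconds().tolist())

# Every gap should exceed the 25-second threshold
print((gaps > pd.Timedelta(seconds=25)).all())
```

The gap between 04:47:36 and 04:48:44 is 68 seconds, well above the 25-second threshold, which is why row 9 is kept.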

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/73700030
