首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何找出大熊猫数据栏(日期格式)中的空白?

如何找出大熊猫数据栏(日期格式)中的空白?
EN

Stack Overflow用户
提问于 2018-06-20 07:17:58
回答 1查看 3K关注 0票数 2

我有一只熊猫,如下所示:

代码语言:javascript
复制
name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-09-01 00:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00

我想找出有关特定名称的dataframe的年份列中可用的空白。例如,

  1. AAA名称在"2015-11-03 02:00:00“日期之前有差距,日期2小时。
  2. ZZZ的名字在"2015-11-01 :00:00:00“之前有差距,日期为5小时。
  3. ZZZ的名字在"2015-11-01 :00:00:00“日期之前有空档,日期为2小时。

我想生成两个包含内容的csv文件:

CSV-1:

代码语言:javascript
复制
name,year
AAA,2015-11-02 22:00:00,0
AAA,2015-11-02 23:00:00,0
AAA,2015-11-03 00:00:00,0
AAA,2015-11-03 01:00:00,0
AAA,2015-11-03 02:00:00,2
AAA,2015-11-03 05:00:00,0
ZZZ,2015-09-01 00:00:00,0
ZZZ,2015-11-01 01:00:00,5
ZZZ,2015-11-01 07:00:00,0
ZZZ,2015-11-01 08:00:00,0
ZZZ,2015-11-01 09:00:00,2
ZZZ,2015-11-01 12:00:00,0

CSV-2:

代码语言:javascript
复制
name,prev_year,next_year,gaps
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 03:00:00
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 02:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 03:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 05:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 06:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 10:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 11:00:00

我试过如下:

代码语言:javascript
复制
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
mask = df.groupby("name").year.diff() > pd.Timedelta('0 days 01:00:00')
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2018-06-20 07:25:15

要将空白输入数据,您需要重新分配生成的mask。要从总时数中得到这一点,只需将1小时除以:

代码语言:javascript
复制
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
df['Gap'] = (df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).fillna(0)

这给我们提供了以下数据:

代码语言:javascript
复制
   name                year     Gap
0   AAA 2015-11-02 22:00:00     0.0
1   AAA 2015-11-02 23:00:00     1.0
2   AAA 2015-11-03 00:00:00     1.0
3   AAA 2015-11-03 01:00:00     1.0
4   AAA 2015-11-03 02:00:00     1.0
5   AAA 2015-11-03 05:00:00     3.0
6   ZZZ 2015-09-01 00:00:00     0.0
7   ZZZ 2015-11-01 07:00:00     6.0
8   ZZZ 2015-11-01 08:00:00     1.0
9   ZZZ 2015-11-01 09:00:00     1.0
10  ZZZ 2015-11-01 12:00:00     3.0

为了获得它的起始时间附近的空白,并与"csv-1“所需的方式保持一致,我们只需在填充na值之前将其移到一行并减去一行:

代码语言:javascript
复制
df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)

这会得到:

代码语言:javascript
复制
   name                year  Gap
0   AAA 2015-11-02 22:00:00  0.0
1   AAA 2015-11-02 23:00:00  0.0
2   AAA 2015-11-03 00:00:00  0.0
3   AAA 2015-11-03 01:00:00  0.0
4   AAA 2015-11-03 02:00:00  2.0
5   AAA 2015-11-03 05:00:00  0.0
6   ZZZ 2015-11-01 01:00:00  5.0
7   ZZZ 2015-11-01 07:00:00  0.0
8   ZZZ 2015-11-01 08:00:00  0.0
9   ZZZ 2015-11-01 09:00:00  2.0
10  ZZZ 2015-11-01 12:00:00  0.0

为了获得您的第二个csv,我们可以执行以下操作:

代码语言:javascript
复制
df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)

df.set_index('year', inplace=True)
df = df.groupby('name', as_index=False)\
       .resample(rule='1H')\
       .ffill()\
       .reset_index()

gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]

gaps.rename({'year': 'gaps'}, index='columns', inplace=True)

首先,我们设置“前”和“后”列。然后,通过将索引更改为'year',我们可以使用.resample()方法来填充所有缺少的小时。通过在重采样时使用ffill(),我们将最后可用的记录复制到添加的所有新行中。我们知道,当'prev_year' != 'year'时,我们所处的行是以前在帧中不存在的,因此是空白之一,所以我们过滤到那些行,选择我们需要的列并重命名它们。这意味着:

代码语言:javascript
复制
   name           prev_year           next_year                year
5   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 03:00:00
6   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 04:00:00
9   ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 02:00:00
10  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 03:00:00
11  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 04:00:00
12  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 05:00:00
13  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 06:00:00
17  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 10:00:00
18  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 11:00:00

总之,您的脚本可以如下所示:

代码语言:javascript
复制
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)

df.to_csv('csv-1.csv', index=False)

df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)

df.set_index('year', inplace=True)
df = df.groupby('name', as_index=False)\
       .resample(rule='1H')\
       .ffill()\
       .reset_index()

gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]

gaps.rename({'year': 'gaps'}, index='columns', inplace=True)

gaps.to_csv('csv-2.csv', index=False)
票数 3
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/50942418

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档