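I'm currently trying to solve a problem and I'm stuck at the starting point. I have a DataFrame with 4 columns, and I want to find the times that overlap per day and id. For example, my df looks like this: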
+------+--------------+-----------------------+----------------------+
| id | date | time_start | end_time |
+--------------------------------------------------------------------+
| 123 | 2019-11-10 | 2019-11-10 08:00:00 | 2019-11-10 08:30:00 |
| | | | |
| 123 | 2019-11-10 | 2019-11-10 08:15:00 | 2019-11-10 08:30:00 |
| | | | |
| 123 | 2019-11-10 | 2019-11-10 08:25:00 | 2019-11-10 08:45:00 |
| | | | |
| 123 | 2019-11-11 | 2019-11-11 08:00:00 | 2019-11-11 08:30:00 |
| | | | |
| 123 | 2019-11-11 | 2019-11-11 08:30:00 | 2019-11-11 09:00:00 |
+------+--------------+-----------------------+----------------------+
import pandas as pd
data = {'id':['123', '123', '123', '123','123'], 'date':['2019-11-10', '2019-11-10', '2019-11-10', '2019-11-11', '2019-11-11'],
'time_start':['2019-11-10 08:00:00', '2019-11-10 08:15:00', '2019-11-10 08:25:00', '2019-11-11 08:00:00', '2019-11-11 08:30:00'],
'end_time':['2019-11-10 08:30:00','2019-11-10 08:30:00','2019-11-10 08:45:00','2019-11-11 08:30:00','2019-11-11 09:00:00']}
df = pd.DataFrame(data)
,id,date,time_start,end_time
0,123,2019-11-10,2019-11-10 08:00:00,2019-11-10 08:30:00
1,123,2019-11-10,2019-11-10 08:15:00,2019-11-10 08:30:00
2,123,2019-11-10,2019-11-10 08:25:00,2019-11-10 08:45:00
3,123,2019-11-11,2019-11-11 08:00:00,2019-11-11 08:30:00
4,123,2019-11-11,2019-11-11 08:30:00,2019-11-11 09:00:00
I'd like to see a result like the following:
+----+------------+----------------------+---------------------+---------------+-------------------------+-----------------+
|id | date | time_start | time_end | overlap_count | total_minutes_recorded | actual_minutes|
+--------------------------------------------------------------------------------------------------------------------------+
|123 | 2019-11-10 | 2019-11-10 08:00:00 | 2019-11-10 08:45:00 | 3 | 65 | 45 |
| | | | | | | |
|123 | 2019-11-11 | 2019-11-11 08:00:00 | 2019-11-11 09:00:00 | 0 | 60 | 60 |
+----+------------+----------------------+---------------------+---------------+-------------------------+-----------------+
I've looked at other answers, which have started to give me some insight into how to approach this, for example:
Pandas: Count time interval intersections over a group by
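Most of these answers only give me a count of the overlapping times, and they take a very long time to compute. Any suggestions on how to get started on this?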
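For reference, the expected figures for 2019-11-10 can be reproduced by hand. The sketch below is only one reading of the desired columns: it assumes that overlap_count is the number of interval pairs that strictly overlap, that total_minutes_recorded is the sum of the individual durations, and that actual_minutes is the time covered once overlaps are merged. The variable names are illustrative.

```python
from itertools import combinations

import pandas as pd

# The three intervals recorded for id 123 on 2019-11-10 (taken from the example above).
intervals = [
    (pd.Timestamp("2019-11-10 08:00:00"), pd.Timestamp("2019-11-10 08:30:00")),
    (pd.Timestamp("2019-11-10 08:15:00"), pd.Timestamp("2019-11-10 08:30:00")),
    (pd.Timestamp("2019-11-10 08:25:00"), pd.Timestamp("2019-11-10 08:45:00")),
]

# Assumption: overlap_count is the number of pairs that strictly overlap.
overlap_count = sum(
    1
    for (s1, e1), (s2, e2) in combinations(intervals, 2)
    if s1 < e2 and s2 < e1
)

# Sum of the individual durations: 30 + 15 + 20 minutes.
total_minutes_recorded = sum((e - s).total_seconds() / 60 for s, e in intervals)

# All three intervals chain together here, so the covered time is simply
# the span from the earliest start to the latest end.
earliest = min(s for s, _ in intervals)
latest = max(e for _, e in intervals)
actual_minutes = (latest - earliest).total_seconds() / 60

print(overlap_count, total_minutes_recorded, actual_minutes)
```

Under these assumptions the values come out as 3, 65 and 45, matching the first row of the desired result. On 2019-11-11 the two intervals only touch at 08:30, which a strict comparison does not count as an overlap.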
Answered 2020-08-13 16:05:03
I don't know how you got the values of overlap_count and total_minutes_recorded in the first row; I think they are wrong.
import numpy as np
import pandas as pd

df = pd.DataFrame({
'id':[123,123,123,123,123],
'date':['2019-11-10','2019-11-10','2019-11-10','2019-11-11','2019-11-11'],
'time_start':['2019-11-10 08:00:00','2019-11-10 08:15:00','2019-11-10 08:25:00',
'2019-11-11 08:00:00','2019-11-11 08:30:00'],
'end_time':['2019-11-10 08:30:00','2019-11-10 08:30:00','2019-11-10 08:45:00',
'2019-11-11 08:30:00','2019-11-11 09:00:00']
})
df['date'] = pd.to_datetime(df['date'])
df['time_start'] = pd.to_datetime(df['time_start'])
df['end_time'] = pd.to_datetime(df['end_time'])
df_temp=df
df = pd.merge(df,df_temp,on='id')
df=df[
((df.time_start_x - df.time_start_y) == np.timedelta64(1, 'D'))
]
df_temp=df[['id','date_x','time_start_x','end_time_x']]
df_temp1 = df[['id','date_y','time_start_y','end_time_y']]
df_temp=df_temp.rename(columns={"date_x": "date", "time_start_x": "time_start",
"end_time_x":"end_time"})
df_temp1=df_temp1.rename(columns={"date_y": "date", "time_start_y": "time_start",
"end_time_y":"end_time"})
df=pd.concat([df_temp,df_temp1])
df=df[['id','date','time_start','end_time']].sort_values(by='date')
df['total_minutes_recorded'] = df['end_time']-df['time_start']
print(df)
id date time_start end_time total_minutes_recorded
15 123 2019-11-10 2019-11-10 08:00:00 2019-11-10 08:30:00 00:30:00
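15 123 2019-11-11 2019-11-11 08:00:00 2019-11-11 08:30:00 00:30:00
Answered 2020-08-13 15:14:28
Use groupby to group by date, then define a function that receives each date's rows as a DataFrame. I've written get_minutes_recorded for you. get_overlap_counts is a bit more involved: you can solve it by keeping a vector of zeros with one entry per row, looping over all rows i, and setting vector[n] = 1 whenever end_time[i] of row i falls between the start and end of row n.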
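def function(sub_df):
    overlap_count = get_overlap_counts(sub_df)
    total_minutes_recorded = get_minutes_recorded(sub_df)
    return overlap_count, total_minutes_recorded
def get_overlap_counts(df):
    # Left unimplemented in this answer; see the description above.
    pass
def get_minutes_recorded(df):
    # Assumes time_start and end_time are already datetime64 columns.
    # Sums the individual durations and converts seconds to minutes.
    return (df['end_time'] - df['time_start']).dt.total_seconds().sum() / 60
df.groupby('date').apply(function)
https://stackoverflow.com/questions/63389605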
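The skeleton in the last answer can be filled in along the following lines. This is a minimal sketch, not the answerer's implementation: the `summarize` helper and the `s`/`e` column names are hypothetical, overlap_count is assumed to mean the number of strictly overlapping pairs within a group, and the pairwise test uses NumPy broadcasting instead of the explicit loop described in the answer.

```python
import numpy as np
import pandas as pd


def summarize(group: pd.DataFrame) -> dict:
    """Summarize the intervals of one (id, date) group. Hypothetical helper."""
    g = group.sort_values("time_start")
    start = g["time_start"].to_numpy()
    end = g["end_time"].to_numpy()

    # Pairwise strict-overlap test; the upper triangle counts each pair once.
    overlaps = (start[:, None] < end[None, :]) & (start[None, :] < end[:, None])
    overlap_count = int(np.triu(overlaps, k=1).sum())

    # Sum of the individual durations, in minutes.
    recorded = (g["end_time"] - g["time_start"]).dt.total_seconds().sum() / 60

    # Merge intervals: a new block starts when a start is later than
    # every end seen so far (touching intervals stay in the same block).
    prev_max_end = g["end_time"].cummax().shift()
    block = (g["time_start"] > prev_max_end).cumsum().to_numpy()
    merged = g.groupby(block).agg(s=("time_start", "min"), e=("end_time", "max"))
    actual = (merged["e"] - merged["s"]).dt.total_seconds().sum() / 60

    return {
        "time_start": g["time_start"].min(),
        "time_end": g["end_time"].max(),
        "overlap_count": overlap_count,
        "total_minutes_recorded": recorded,
        "actual_minutes": actual,
    }


# Sample data from the question.
df = pd.DataFrame({
    "id": ["123"] * 5,
    "date": ["2019-11-10"] * 3 + ["2019-11-11"] * 2,
    "time_start": ["2019-11-10 08:00:00", "2019-11-10 08:15:00", "2019-11-10 08:25:00",
                   "2019-11-11 08:00:00", "2019-11-11 08:30:00"],
    "end_time": ["2019-11-10 08:30:00", "2019-11-10 08:30:00", "2019-11-10 08:45:00",
                 "2019-11-11 08:30:00", "2019-11-11 09:00:00"],
})
for col in ("time_start", "end_time"):
    df[col] = pd.to_datetime(df[col])

# One summary row per (id, date) group.
rows = [
    {"id": id_, "date": date, **summarize(g)}
    for (id_, date), g in df.groupby(["id", "date"])
]
result = pd.DataFrame(rows)
print(result)
```

On the sample data this gives overlap_count 3 and 0, total_minutes_recorded 65 and 60, and actual_minutes 45 and 60. The broadcasted comparison builds an n × n matrix per group, so it suits groups with a modest number of rows; for very large groups a sorted sweep over start and end events would use less memory.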