我有以下代码数据..。
import pandas as pd, numpy as np
from datetime import datetime
end_dt = datetime.today()
st_dt = (end_dt + pd.DateOffset(-10)).date()
df_index = pd.date_range(st_dt, end_dt)
df = pd.DataFrame(index=df_index, columns=['in_range'])
data = [pd.to_datetime(['2022-11-08','2022-11-10']), pd.to_datetime(['2022-11-13','2022-11-15'])]
dt_ranges = pd.DataFrame(data,columns={'st_dt':'datetimens[64]', 'end_dt': 'datetimens[64]'})这将产生以下两个数据格式:
df:
in_range
2022-11-08 NaN
2022-11-09 NaN
2022-11-10 NaN
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 NaN
2022-11-14 NaN
2022-11-15 NaN
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaN和date_ranges:
st_dt end_dt
0 2022-11-08 2022-11-10
1 2022-11-13 2022-11-15我想更新“in_range”列,以指示索引是否在第二次数据访问的开始日期和结束日期的任何对之内。所以我最终应该是这样的:
in_range
2022-11-08 True
2022-11-09 True
2022-11-10 True
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 True
2022-11-14 True
2022-11-15 True
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaN我已经开始尝试使用lambda和迭代来完成这个任务了。但对我来说这似乎很有效率。
def in_range(index_date, date_ranges):
for r in date_ranges.values:
if (r[0] >= index_date) & (r[1] <= index_date):
return True
return False
df['in_range'] = df.reset_index().apply(lambda x: in_range(x.date, dt_ranges), axis=1)上面的代码总是将in_range设置为NaNs,尽管代码返回了正确的值。我怀疑这是因为我正在重置索引,所以它不能对齐。而且,正如前面提到的--这个解决方案可能效率很低。
是否有更多的丙酮/大流行的方式来做到这一点?
发布于 2022-11-18 21:10:11
s = df.index.to_series()
m = (pd.merge_asof(s.rename('st_dt'), dt_ranges)
['end_dt'].ge(s.to_numpy()).to_numpy()
)
df.loc[m, 'in_range'] = True注意:dt_ranges中的间隔应该是不重叠的。
输出:
in_range
2022-11-08 True
2022-11-09 True
2022-11-10 True
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 True
2022-11-14 True
2022-11-15 True
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaN发布于 2022-11-19 05:20:51
一种选择是使用加入计算非equi,它可以处理重叠:
# pip install pyjanitor
import pandas as pd
import janitor
(
df
.reset_index()
.conditional_join(
dt_ranges,
('index', 'st_dt', '>='),
('index', 'end_dt', '<='),
# depending on your data size
# setting use_numba to True
# can improve performance
# of course, this requires numba installed
use_numba = False,
how = 'left',
# performance is better when
# sort_by_appearance is False
sort_by_appearance=True)
.assign(in_range = lambda df: df.in_range.mask(df.st_dt.notna(), True))
.iloc[:, :2]
.set_index('index')
)
in_range
index
2022-11-08 True
2022-11-09 True
2022-11-10 True
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 True
2022-11-14 True
2022-11-15 True
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaNhttps://stackoverflow.com/questions/74495142
复制相似问题