我试图为DATE FROM列中的日期范围提取缺少的日期列表,对于列CURRENCY中的多个组提取DATE TO,对于每个组列CURRENCY,这些范围被分割成多个行:
例如:货币欧元在第0、1和2行中有三个日期范围,整个组中缺少的范围是2021-10-06至2021-10-10。
缺失范围在下面使用*符号很高,这个缺失的日期范围是我期望的输出。
df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
"DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
"5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
"DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
"5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})
CURRENCY DATE FROM DATE TO
0 EUR 2021-10-01 2021-10-05*
1 EUR 2021-10-11* 2021-10-18
2 EUR 2021-10-19 2021-10-23
3 GBP 2021-10-01 2021-10-04
4 GBP 2021-10-05 2021-10-07*
5 GBP 2021-10-11* 2021-10-18
6 USD 2021-10-01 2021-10-02*
7 USD 2021-10-05* 2021-10-10
8 SAR 2021-10-01 2021-10-01*
9 SAR 2021-10-05* 2021-10-10预期产出:
CURRENCY MISSING
0 EUR 2021-10-06
1 EUR 2021-10-07
2 EUR 2021-10-08
3 EUR 2021-10-09
4 EUR 2021-10-10
5 GBP 2021-10-08
6 GBP 2021-10-09
7 GBP 2021-10-10
8 USD 2021-10-03
9 USD 2021-10-04
10 SAR 2021-10-02
11 SAR 2021-10-03
12 SAR 2021-10-04下面是我尝试过的,但这似乎不是解决问题的有效方法,我也不确定如何对这些输出范围进行分组,以找到每个组的缺失日期(EUR,GBP...etc):
date_from_list = df['DATE FROM'].to_list()
date_to_list = df['DATE TO'].to_list()
curr_list = df['CURRENCY'].to_list()
for date_from, date_to, curr in zip(date_from_list, date_to_list, curr_list):
print(curr_list, pd.date_range(date_from, date_to))发布于 2021-11-03 12:50:11
您可以使用以下代码解决问题,即将每种货币的总范围列表与每一行的单个范围列表进行比较。使用嵌套循环的过滤可能会被优化。
import numpy as np
df['dateranges'] = df.apply(lambda x: pd.date_range(x['DATE FROM'],x['DATE TO']), axis=1)
total_ranges = (df.groupby("CURRENCY")
.apply(lambda x: pd.date_range(x['DATE FROM'].min(),x['DATE TO'].max())))
individual_ranges = df.groupby("CURRENCY")['dateranges'].apply(list).apply(np.concatenate)
currency = []
missing = []
for curr in total_ranges.index:
for date in total_ranges[curr]:
if date not in individual_ranges[curr]:
currency.append(curr)
missing.append(date)
result = pd.DataFrame({'CURRENCY': currency,
'MISSING':missing})发布于 2021-11-03 13:44:02
pd.date_range和pd.period_range都可以。我使用的日期范围,只有接近的左右,所以你可能要做一些筛选。代码如下
df= df.assign(end=df['DATE FROM'].shift(-1),start=df['DATE TO']).iloc[:-1 , :]#Define the start and end for date range
df=df.assign(Missing=df.apply(lambda x: pd.date_range(start=x['start'], end=x['end'], closed='right').tolist(), axis = 1)).explode('Missing').drop_duplicates('Missing').drop(['start','end'],axis=1)
CURRENCY DATE FROM DATE TO Missing
0 EUR 2021-10-01 2021-10-05 2021-10-06
0 EUR 2021-10-01 2021-10-05 2021-10-07
0 EUR 2021-10-01 2021-10-05 2021-10-08
0 EUR 2021-10-01 2021-10-05 2021-10-09
0 EUR 2021-10-01 2021-10-05 2021-10-10
0 EUR 2021-10-01 2021-10-05 2021-10-11
1 EUR 2021-10-11 2021-10-18 2021-10-19
2 EUR 2021-10-19 2021-10-23 NaT
3 GBP 2021-10-01 2021-10-04 2021-10-05
6 USD 2021-10-01 2021-10-02 2021-10-03
6 USD 2021-10-01 2021-10-02 2021-10-04
8 SAR 2021-10-01 2021-10-01 2021-10-02发布于 2021-11-03 13:10:05
该解决方案使用了一个名为piso的包(熊猫间隔设置操作),而且速度很快。
设置
df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
"DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
"5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
"DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
"5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})
df["DATE FROM"] = pd.to_datetime(df["DATE FROM"])
df["DATE TO"] = pd.to_datetime(df["DATE TO"]) + pd.Timedelta("1d")注意,我正在将数据更改为pandas.Timestamp,并将一天添加到结束日期。这是因为您似乎将日期解释为时间段,即一整天,在那里我们将及时处理即时的问题。
溶液
并为每种货币创建一个间隔索引
interval_arrays = df.groupby("CURRENCY").apply(lambda d: pd.IntervalIndex.from_arrays(d["DATE FROM"], d["DATE TO"]))interval_arrays看起来像这样
CURRENCY
EUR IntervalIndex([(2021-10-01, 2021-10-06], (2021...
GBP IntervalIndex([(2021-10-01, 2021-10-05], (2021...
SAR IntervalIndex([(2021-10-01, 2021-10-02], (2021...
USD IntervalIndex([(2021-10-01, 2021-10-03], (2021...
dtype: object创建缺少的间隔,作为pandas.IntervalIndex使用piso.complement。
import piso
missing = interval_arrays.apply(piso.complement)missing看起来就像
CURRENCY
EUR IntervalIndex([(2021-10-06, 2021-10-11]], ...
GBP IntervalIndex([(2021-10-08, 2021-10-11]], ...
SAR IntervalIndex([(2021-10-02, 2021-10-05]], ...
USD IntervalIndex([(2021-10-03, 2021-10-05]], ...
dtype: object接下来,我们需要将每个IntervalIndex中的时间间隔转换为日期范围,并将日期范围组合成一个pandas.DatetimeIndex,该pandas.DatetimeIndex封装在一个数据格式中。
def calc(currency, interval_index):
date_ranges = [pd.date_range(i.left, i.right, closed="left") for i in interval_index]
combined = date_ranges[0].union_many(date_ranges[1:])
return pd.DataFrame({"MISSING":combined}).assign(CURRENCY=currency)
pd.concat([calc(currency, interval_index) for currency, interval_index in missing.iteritems() if len(interval_index) > 0])结果是以下数据
MISSING CURRENCY
0 2021-10-06 EUR
1 2021-10-07 EUR
2 2021-10-08 EUR
3 2021-10-09 EUR
4 2021-10-10 EUR
0 2021-10-08 GBP
1 2021-10-09 GBP
2 2021-10-10 GBP
0 2021-10-02 SAR
1 2021-10-03 SAR
2 2021-10-04 SAR
0 2021-10-03 USD
1 2021-10-04 USDhttps://stackoverflow.com/questions/69824372
复制相似问题