首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >在多组的多个日期范围内查找缺少的日期

在多组的多个日期范围内查找缺少的日期
EN

Stack Overflow用户
提问于 2021-11-03 11:54:04
回答 3查看 141关注 0票数 3

我试图为DATE FROM列中的日期范围提取缺少的日期列表,对于列CURRENCY中的多个组提取DATE TO,对于每个组列CURRENCY,这些范围被分割成多个行:

例如:货币欧元在第0、1和2行中有三个日期范围,整个组中缺少的范围是2021-10-06至2021-10-10。

缺失范围在下面使用*符号很高,这个缺失的日期范围是我期望的输出。

代码语言:javascript
复制
df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
                   "DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
                                "5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
                   "DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
                              "5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})

    CURRENCY    DATE FROM   DATE TO
0   EUR         2021-10-01  2021-10-05*
1   EUR         2021-10-11* 2021-10-18
2   EUR         2021-10-19  2021-10-23
3   GBP         2021-10-01  2021-10-04
4   GBP         2021-10-05  2021-10-07*
5   GBP         2021-10-11* 2021-10-18
6   USD         2021-10-01  2021-10-02*
7   USD         2021-10-05* 2021-10-10
8   SAR         2021-10-01  2021-10-01*
9   SAR         2021-10-05* 2021-10-10

预期产出:

代码语言:javascript
复制
    CURRENCY    MISSING
0   EUR         2021-10-06
1   EUR         2021-10-07
2   EUR         2021-10-08
3   EUR         2021-10-09
4   EUR         2021-10-10
5   GBP         2021-10-08
6   GBP         2021-10-09
7   GBP         2021-10-10
8   USD         2021-10-03
9   USD         2021-10-04
10  SAR         2021-10-02
11  SAR         2021-10-03
12  SAR         2021-10-04

下面是我尝试过的,但这似乎不是解决问题的有效方法,我也不确定如何对这些输出范围进行分组,以找到每个组的缺失日期(EUR,GBP...etc):

代码语言:javascript
复制
date_from_list = df['DATE FROM'].to_list()
date_to_list = df['DATE TO'].to_list()
curr_list = df['CURRENCY'].to_list()

for date_from, date_to, curr in zip(date_from_list, date_to_list, curr_list):
    print(curr_list, pd.date_range(date_from, date_to))
EN

回答 3

Stack Overflow用户

回答已采纳

发布于 2021-11-03 12:50:11

您可以使用以下代码解决问题,即将每种货币的总范围列表与每一行的单个范围列表进行比较。使用嵌套循环的过滤可能会被优化。

代码语言:javascript
复制
import numpy as np
df['dateranges'] = df.apply(lambda x: pd.date_range(x['DATE FROM'],x['DATE TO']), axis=1)
total_ranges = (df.groupby("CURRENCY")
                .apply(lambda x: pd.date_range(x['DATE FROM'].min(),x['DATE TO'].max())))
individual_ranges = df.groupby("CURRENCY")['dateranges'].apply(list).apply(np.concatenate)

currency = []
missing = []
for curr in total_ranges.index:
    for date in total_ranges[curr]:
        if date not in individual_ranges[curr]:
            currency.append(curr)
            missing.append(date)

result = pd.DataFrame({'CURRENCY': currency,
                       'MISSING':missing})
票数 1
EN

Stack Overflow用户

发布于 2021-11-03 13:44:02

pd.date_range和pd.period_range都可以。我使用的日期范围,只有接近的左右,所以你可能要做一些筛选。代码如下

代码语言:javascript
复制
df= df.assign(end=df['DATE FROM'].shift(-1),start=df['DATE TO']).iloc[:-1 , :]#Define the start and end for date range

df=df.assign(Missing=df.apply(lambda x: pd.date_range(start=x['start'],  end=x['end'], closed='right').tolist(), axis = 1)).explode('Missing').drop_duplicates('Missing').drop(['start','end'],axis=1)

  CURRENCY   DATE FROM     DATE TO    Missing
0      EUR  2021-10-01  2021-10-05 2021-10-06
0      EUR  2021-10-01  2021-10-05 2021-10-07
0      EUR  2021-10-01  2021-10-05 2021-10-08
0      EUR  2021-10-01  2021-10-05 2021-10-09
0      EUR  2021-10-01  2021-10-05 2021-10-10
0      EUR  2021-10-01  2021-10-05 2021-10-11
1      EUR  2021-10-11  2021-10-18 2021-10-19
2      EUR  2021-10-19  2021-10-23        NaT
3      GBP  2021-10-01  2021-10-04 2021-10-05
6      USD  2021-10-01  2021-10-02 2021-10-03
6      USD  2021-10-01  2021-10-02 2021-10-04
8      SAR  2021-10-01  2021-10-01 2021-10-02
票数 2
EN

Stack Overflow用户

发布于 2021-11-03 13:10:05

该解决方案使用了一个名为piso的包(熊猫间隔设置操作),而且速度很快。

设置

代码语言:javascript
复制
df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
               "DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
                            "5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
               "DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
                          "5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})

df["DATE FROM"] = pd.to_datetime(df["DATE FROM"])
df["DATE TO"] = pd.to_datetime(df["DATE TO"]) + pd.Timedelta("1d")

注意,我正在将数据更改为pandas.Timestamp,并将一天添加到结束日期。这是因为您似乎将日期解释为时间段,即一整天,在那里我们将及时处理即时的问题。

溶液

并为每种货币创建一个间隔索引

代码语言:javascript
复制
interval_arrays = df.groupby("CURRENCY").apply(lambda d: pd.IntervalIndex.from_arrays(d["DATE FROM"], d["DATE TO"]))

interval_arrays看起来像这样

代码语言:javascript
复制
CURRENCY
EUR    IntervalIndex([(2021-10-01, 2021-10-06], (2021...
GBP    IntervalIndex([(2021-10-01, 2021-10-05], (2021...
SAR    IntervalIndex([(2021-10-01, 2021-10-02], (2021...
USD    IntervalIndex([(2021-10-01, 2021-10-03], (2021...
dtype: object

创建缺少的间隔,作为pandas.IntervalIndex使用piso.complement

代码语言:javascript
复制
import piso
missing = interval_arrays.apply(piso.complement)

missing看起来就像

代码语言:javascript
复制
CURRENCY
EUR    IntervalIndex([(2021-10-06, 2021-10-11]], ...
GBP    IntervalIndex([(2021-10-08, 2021-10-11]], ... 
SAR    IntervalIndex([(2021-10-02, 2021-10-05]], ...
USD    IntervalIndex([(2021-10-03, 2021-10-05]], ...
dtype: object

接下来,我们需要将每个IntervalIndex中的时间间隔转换为日期范围,并将日期范围组合成一个pandas.DatetimeIndex,该pandas.DatetimeIndex封装在一个数据格式中。

代码语言:javascript
复制
def calc(currency, interval_index):
    date_ranges = [pd.date_range(i.left, i.right, closed="left") for i in interval_index]
    combined = date_ranges[0].union_many(date_ranges[1:])
    return pd.DataFrame({"MISSING":combined}).assign(CURRENCY=currency)

pd.concat([calc(currency, interval_index) for currency, interval_index in missing.iteritems() if len(interval_index) > 0])

结果是以下数据

代码语言:javascript
复制
    MISSING CURRENCY
0 2021-10-06      EUR
1 2021-10-07      EUR
2 2021-10-08      EUR
3 2021-10-09      EUR
4 2021-10-10      EUR
0 2021-10-08      GBP
1 2021-10-09      GBP
2 2021-10-10      GBP
0 2021-10-02      SAR
1 2021-10-03      SAR
2 2021-10-04      SAR
0 2021-10-03      USD
1 2021-10-04      USD
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/69824372

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档