首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >在Pandas Dataframe中,Groupby和条件计数,得到周开始(星期日)和周结束日期列?

在Pandas Dataframe中,Groupby和条件计数,得到周开始(星期日)和周结束日期列?
EN

Stack Overflow用户
提问于 2021-12-22 14:23:33
回答 2查看 120关注 0票数 0

我有如下所示的数据集。我需要根据groupby对状态关闭的ticket_id进行计数,并且需要从ticket_time列中获得周开始(星期日)和周结束。输入数据

代码语言:javascript
复制
id       status     ticket_time         product         country
1260057   open     2021-10-03 01:20:00  Broadband       Grenada
2998178   open     2021-10-06 00:00:00  Fixed Voice     Bahamas
3762949  closed    2021-10-04 00:00:00  Fixed Voice     St Lucia
3766608  closed    2021-10-10 00:00:00  Broadband       St Lucia
3767125  closed    2021-10-03 00:00:00  TV              Antigua
etc.................

输出数据格式示例:

代码语言:javascript
复制
country  product    week_startdate     week_enddate          Value
antigua     TV      2021-10-03 00:00:00 2021-10-10 00:00:00  Count(id) with status=closed
St Lucia    Broadband   ...         ....                  ...
St Lucia    Fixedvoice  ...         ....                  ...

对于每一个国家和每一个产品,我应该得到id的计数,这是封闭的状态。按国家和产品分组会有效吗?

输入数据集

代码语言:javascript
复制
{'ticket_id': [1260057,
  2998178,
  3762949,
  3766608,
  3767125,
  6050009,
  6050608,
  6050972,
  6052253,
  6053697,
  6053812,
  6053848,
  6053970,
  6054363,
  6054549,
  6055529,
  6057303,
  6057364,
  6058428,
  6101321,
  6103451,
  6103688,
  6105180,
  693170,
  1259365,
  1259466,
  1260843,
  3762579,
  3762778,
  3764140,
  3768850,
  6050528,
  6050595,
  6051099,
  6053704,
  6054013,
  6054582,
  6055323,
  6056096,
  6056363,
  6057701,
  6058251,
  6058323,
  6102386,
  6103121,
  6104844,
  6104935,
  692914,
  1260730,
  2997275],
 'status': ['open',
  'open',
  'closed',
  'closed',
  'closed',
  'closed',
  'open',
  'open',
  'closed',
  'open',
  'open',
  'closed',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'closed',
  'closed',
  'closed',
  'open',
  'open',
  'closed',
  'closed',
  'open',
  'closed',
  'closed',
  'closed',
  'open',
  'closed',
  'closed',
  'closed',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'open',
  'cancelled',
  'open',
  'open',
  'open',
  'closed',
  'closed',
  'open'],
 'ticket_start_time': [Timestamp('2021-10-04 01:20:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-01 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-03 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-05 00:00:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-02 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-04 00:00:00'),
  Timestamp('2021-10-01 00:00:00')],
 'product_type': ['Broadband',
  'Fixed Voice',
  'Fixed Voice',
  'Broadband',
  'TV',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'TV',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Fixed Voice',
  'Broadband',
  'Fixed Voice',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'TV',
  'Fixed Voice',
  'TV',
  'TV',
  'TV',
  'TV',
  'TV',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Broadband',
  'Fixed Voice',
  'Fixed Voice'],
 'org_country': ['Grenada',
  'Bahamas',
  'St Lucia',
  'St Lucia',
  'Antigua',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Curacao',
  'Grenada',
  'Grenada',
  'Grenada',
  'St Lucia',
  'St Lucia',
  'St Vincent',
  'St Lucia',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Jamaica',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Trinidad',
  'Curacao',
  'Grenada',
  'Bahamas']}
EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2021-12-22 16:54:37

修正答案

经过许多解释后,OP似乎正在寻找一周的开始和结束作为聚合的结果,而不是作为一个分组。

因此,以下是修改后的答案:

代码语言:javascript
复制
weekday = 6  # Sunday
wo = pd.offsets.Week(weekday=weekday, normalize=True)

t = df['ticket_start_time'] + wo
gbcols = ['org_country', 'product_type']
agg_dict = {
    'ticket_id': 'count',
    'week_startdate': 'first',
    'week_enddate': 'last',
}
out = (
    df
    .assign(week_startdate=t - wo, week_enddate=t)
    .groupby(['ticket_id', 'status'] + gbcols).agg(agg_dict)
    .query("status == 'closed'").groupby(gbcols).agg(agg_dict)
    .reset_index()
    .rename(columns={'ticket_id': 'Value'})
)

从提供的“输入数据”中,我们得到:

代码语言:javascript
复制
>>> out
   org_country product_type  Value week_startdate week_enddate
0      Antigua           TV      1     2021-10-03   2021-10-10
1      Curacao    Broadband      1     2021-10-03   2021-10-10
2      Grenada    Broadband      1     2021-09-26   2021-10-03
3      Grenada  Fixed Voice      2     2021-09-26   2021-10-10
4      Jamaica    Broadband      5     2021-09-26   2021-10-10
5      Jamaica           TV      1     2021-09-26   2021-10-03
6     St Lucia    Broadband      2     2021-09-26   2021-10-10
7     St Lucia  Fixed Voice      1     2021-09-26   2021-10-03
8     St Lucia           TV      1     2021-09-26   2021-10-03
9   St Vincent    Broadband      1     2021-09-26   2021-10-03
10    Trinidad    Broadband      2     2021-10-03   2021-10-10
11    Trinidad           TV      1     2021-09-26   2021-10-03

解释

首先,添加/然后减去周偏移量wo的技巧绕过了当日期位于偏移边界时减去偏移量的问题。例如:

代码语言:javascript
复制
t = pd.to_datetime([
    '2021-10-02', '2021-10-03', '2021-10-03 01:00',
    '2021-10-04', '2021-10-09 23:59:59', '2021-10-10']
)
>>> pd.DataFrame(dict(t=t, ws=t + wo - wo, we=t + wo))
                    t         ws         we
0 2021-10-02 00:00:00 2021-09-26 2021-10-03
1 2021-10-03 00:00:00 2021-10-03 2021-10-10
2 2021-10-03 01:00:00 2021-10-03 2021-10-10
3 2021-10-04 00:00:00 2021-10-03 2021-10-10
4 2021-10-09 23:59:59 2021-10-03 2021-10-10
5 2021-10-10 00:00:00 2021-10-10 2021-10-17

接下来,我们希望ticket_id的计数是不同的,其中状态是'closed'。这有点棘手。我们和两个人一起做。第一种方法是统计每个(ticket, status) (和所有其他分组列)组的计数。可能会有多个相同的票被打开或关闭的实例,但是我们想忽略这一点。

代码语言:javascript
复制
tmp = (
    df
    .assign(week_startdate=t - wo, week_enddate=t)
    .groupby(['ticket_id', 'status'] + gbcols)
    .agg(agg_dict)
)
>>> tmp.head()
                                           ticket_id week_startdate  \
ticket_id status org_country product_type                             
692914    closed Curacao     Broadband             1     2021-10-03   
693170    open   Curacao     Broadband             1     2021-10-03   
1259365   closed Grenada     Fixed Voice           1     2021-09-26   
1259466   closed Grenada     Broadband             1     2021-09-26   
1260057   open   Grenada     Broadband             1     2021-10-03   

                                          week_enddate  
ticket_id status org_country product_type               
692914    closed Curacao     Broadband      2021-10-10  
693170    open   Curacao     Broadband      2021-10-10  
1259365   closed Grenada     Fixed Voice    2021-10-03  
1259466   closed Grenada     Broadband      2021-10-03  
1260057   open   Grenada     Broadband      2021-10-10  

最后,我们现在只是简单地再次聚合,但只在gbcols列上分组,并且状态在其中“关闭”,给出了上面的完整表达式。

附录

OP希望将week_startdate和end日期转换为字符串,并使用完整的H:M:S解析。为此:

代码语言:javascript
复制
for k in ['week_startdate', 'week_enddate']:
    out[k] = out[k].dt.strftime('%F %T')

现在:

代码语言:javascript
复制
>>> out
   org_country product_type  Value       week_startdate         week_enddate
0      Antigua           TV      1  2021-10-03 00:00:00  2021-10-10 00:00:00
1      Curacao    Broadband      1  2021-10-03 00:00:00  2021-10-10 00:00:00
2      Grenada    Broadband      1  2021-09-26 00:00:00  2021-10-03 00:00:00
3      Grenada  Fixed Voice      2  2021-09-26 00:00:00  2021-10-10 00:00:00
4      Jamaica    Broadband      5  2021-09-26 00:00:00  2021-10-10 00:00:00
5      Jamaica           TV      1  2021-09-26 00:00:00  2021-10-03 00:00:00
6     St Lucia    Broadband      2  2021-09-26 00:00:00  2021-10-10 00:00:00
7     St Lucia  Fixed Voice      1  2021-09-26 00:00:00  2021-10-03 00:00:00
8     St Lucia           TV      1  2021-09-26 00:00:00  2021-10-03 00:00:00
9   St Vincent    Broadband      1  2021-09-26 00:00:00  2021-10-03 00:00:00
10    Trinidad    Broadband      2  2021-10-03 00:00:00  2021-10-10 00:00:00
11    Trinidad           TV      1  2021-09-26 00:00:00  2021-10-03 00:00:00
票数 1
EN

Stack Overflow用户

发布于 2021-12-22 15:29:18

IIUC:

代码语言:javascript
复制
def week_start_day(s):
    return s - (pd.to_timedelta((s.dt.weekday + 1)%7, unit='D'))

def week_end_day(s):
    return s + pd.to_timedelta(7-(s.dt.weekday+2)%7, unit='D')

df['ticket_time'] = pd.to_datetime(df['ticket_time'])

df = df[df['status'] == 'closed'].groupby(['country', 'product']).agg(week_startdate = pd.NamedAgg(column='ticket_time', aggfunc=week_start_day),
                                                                      week_enddate = pd.NamedAgg(column='ticket_time', aggfunc=week_end_day),
                                                                      Value = pd.NamedAgg(column='id', aggfunc="size"))
df.reset_index(inplace=True)

OUTPUT

代码语言:javascript
复制
    country      product week_startdate week_enddate  Value
0   Antigua           TV     2021-10-03   2021-10-09      1
1  St Lucia    Broadband     2021-10-10   2021-10-16      1
2  St Lucia  Fixed Voice     2021-10-03   2021-10-09      1

Setup

代码语言:javascript
复制
data = {'id': [1260057, 2998178, 3762949, 3766608, 3767125],
          'status': ['open', 'open', 'closed', 'closed', 'closed'],
          'ticket_time': ['2021-10-03 01:20:00', '2021-10-06 00:00:00', '2021-10-04 00:00:00',
                          '2021-10-10 00:00:00', '2021-10-03 00:00:00'],
          'product': ['Broadband', 'Fixed Voice', 'Fixed Voice', 'Broadband', 'TV'],
          'country': ['Grenada', 'Bahamas', 'St Lucia', 'St Lucia', 'Antigua']}

df = pd.DataFrame.from_dict(data)
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/70450520

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档