我有如下所示的数据集。我需要根据groupby对状态关闭的ticket_id进行计数,并且需要从ticket_time列中获得周开始(星期日)和周结束。输入数据
id status ticket_time product country
1260057 open 2021-10-03 01:20:00 Broadband Grenada
2998178 open 2021-10-06 00:00:00 Fixed Voice Bahamas
3762949 closed 2021-10-04 00:00:00 Fixed Voice St Lucia
3766608 closed 2021-10-10 00:00:00 Broadband St Lucia
3767125 closed 2021-10-03 00:00:00 TV Antigua
etc.................输出数据格式示例:
country product week_startdate week_enddate Value
antigua TV 2021-10-03 00:00:00 2021-10-10 00:00:00 Count(id) with status=closed
St Lucia Broadband ... .... ...
St Lucia Fixedvoice ... .... ...对于每一个国家和每一个产品,我应该得到id的计数,这是封闭的状态。按国家和产品分组会有效吗?
输入数据集
{'ticket_id': [1260057,
2998178,
3762949,
3766608,
3767125,
6050009,
6050608,
6050972,
6052253,
6053697,
6053812,
6053848,
6053970,
6054363,
6054549,
6055529,
6057303,
6057364,
6058428,
6101321,
6103451,
6103688,
6105180,
693170,
1259365,
1259466,
1260843,
3762579,
3762778,
3764140,
3768850,
6050528,
6050595,
6051099,
6053704,
6054013,
6054582,
6055323,
6056096,
6056363,
6057701,
6058251,
6058323,
6102386,
6103121,
6104844,
6104935,
692914,
1260730,
2997275],
'status': ['open',
'open',
'closed',
'closed',
'closed',
'closed',
'open',
'open',
'closed',
'open',
'open',
'closed',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'closed',
'closed',
'closed',
'open',
'open',
'closed',
'closed',
'open',
'closed',
'closed',
'closed',
'open',
'closed',
'closed',
'closed',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'cancelled',
'open',
'open',
'open',
'closed',
'closed',
'open'],
'ticket_start_time': [Timestamp('2021-10-04 01:20:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-01 00:00:00')],
'product_type': ['Broadband',
'Fixed Voice',
'Fixed Voice',
'Broadband',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Fixed Voice',
'Broadband',
'Fixed Voice',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'TV',
'Fixed Voice',
'TV',
'TV',
'TV',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Fixed Voice',
'Fixed Voice'],
'org_country': ['Grenada',
'Bahamas',
'St Lucia',
'St Lucia',
'Antigua',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Trinidad',
'Trinidad',
'Trinidad',
'Trinidad',
'Curacao',
'Grenada',
'Grenada',
'Grenada',
'St Lucia',
'St Lucia',
'St Vincent',
'St Lucia',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Trinidad',
'Trinidad',
'Trinidad',
'Trinidad',
'Curacao',
'Grenada',
'Bahamas']}发布于 2021-12-22 16:54:37
修正答案
经过许多解释后,OP似乎正在寻找一周的开始和结束作为聚合的结果,而不是作为一个分组。
因此,以下是修改后的答案:
weekday = 6 # Sunday
wo = pd.offsets.Week(weekday=weekday, normalize=True)
t = df['ticket_start_time'] + wo
gbcols = ['org_country', 'product_type']
agg_dict = {
'ticket_id': 'count',
'week_startdate': 'first',
'week_enddate': 'last',
}
out = (
df
.assign(week_startdate=t - wo, week_enddate=t)
.groupby(['ticket_id', 'status'] + gbcols).agg(agg_dict)
.query("status == 'closed'").groupby(gbcols).agg(agg_dict)
.reset_index()
.rename(columns={'ticket_id': 'Value'})
)从提供的“输入数据”中,我们得到:
>>> out
org_country product_type Value week_startdate week_enddate
0 Antigua TV 1 2021-10-03 2021-10-10
1 Curacao Broadband 1 2021-10-03 2021-10-10
2 Grenada Broadband 1 2021-09-26 2021-10-03
3 Grenada Fixed Voice 2 2021-09-26 2021-10-10
4 Jamaica Broadband 5 2021-09-26 2021-10-10
5 Jamaica TV 1 2021-09-26 2021-10-03
6 St Lucia Broadband 2 2021-09-26 2021-10-10
7 St Lucia Fixed Voice 1 2021-09-26 2021-10-03
8 St Lucia TV 1 2021-09-26 2021-10-03
9 St Vincent Broadband 1 2021-09-26 2021-10-03
10 Trinidad Broadband 2 2021-10-03 2021-10-10
11 Trinidad TV 1 2021-09-26 2021-10-03解释
首先,添加/然后减去周偏移量wo的技巧绕过了当日期位于偏移边界时减去偏移量的问题。例如:
t = pd.to_datetime([
'2021-10-02', '2021-10-03', '2021-10-03 01:00',
'2021-10-04', '2021-10-09 23:59:59', '2021-10-10']
)
>>> pd.DataFrame(dict(t=t, ws=t + wo - wo, we=t + wo))
t ws we
0 2021-10-02 00:00:00 2021-09-26 2021-10-03
1 2021-10-03 00:00:00 2021-10-03 2021-10-10
2 2021-10-03 01:00:00 2021-10-03 2021-10-10
3 2021-10-04 00:00:00 2021-10-03 2021-10-10
4 2021-10-09 23:59:59 2021-10-03 2021-10-10
5 2021-10-10 00:00:00 2021-10-10 2021-10-17接下来,我们希望ticket_id的计数是不同的,其中状态是'closed'。这有点棘手。我们和两个人一起做。第一种方法是统计每个(ticket, status) (和所有其他分组列)组的计数。可能会有多个相同的票被打开或关闭的实例,但是我们想忽略这一点。
tmp = (
df
.assign(week_startdate=t - wo, week_enddate=t)
.groupby(['ticket_id', 'status'] + gbcols)
.agg(agg_dict)
)
>>> tmp.head()
ticket_id week_startdate \
ticket_id status org_country product_type
692914 closed Curacao Broadband 1 2021-10-03
693170 open Curacao Broadband 1 2021-10-03
1259365 closed Grenada Fixed Voice 1 2021-09-26
1259466 closed Grenada Broadband 1 2021-09-26
1260057 open Grenada Broadband 1 2021-10-03
week_enddate
ticket_id status org_country product_type
692914 closed Curacao Broadband 2021-10-10
693170 open Curacao Broadband 2021-10-10
1259365 closed Grenada Fixed Voice 2021-10-03
1259466 closed Grenada Broadband 2021-10-03
1260057 open Grenada Broadband 2021-10-10 最后,我们现在只是简单地再次聚合,但只在gbcols列上分组,并且状态在其中“关闭”,给出了上面的完整表达式。
附录
OP希望将week_startdate和end日期转换为字符串,并使用完整的H:M:S解析。为此:
for k in ['week_startdate', 'week_enddate']:
out[k] = out[k].dt.strftime('%F %T')现在:
>>> out
org_country product_type Value week_startdate week_enddate
0 Antigua TV 1 2021-10-03 00:00:00 2021-10-10 00:00:00
1 Curacao Broadband 1 2021-10-03 00:00:00 2021-10-10 00:00:00
2 Grenada Broadband 1 2021-09-26 00:00:00 2021-10-03 00:00:00
3 Grenada Fixed Voice 2 2021-09-26 00:00:00 2021-10-10 00:00:00
4 Jamaica Broadband 5 2021-09-26 00:00:00 2021-10-10 00:00:00
5 Jamaica TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
6 St Lucia Broadband 2 2021-09-26 00:00:00 2021-10-10 00:00:00
7 St Lucia Fixed Voice 1 2021-09-26 00:00:00 2021-10-03 00:00:00
8 St Lucia TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
9 St Vincent Broadband 1 2021-09-26 00:00:00 2021-10-03 00:00:00
10 Trinidad Broadband 2 2021-10-03 00:00:00 2021-10-10 00:00:00
11 Trinidad TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00发布于 2021-12-22 15:29:18
IIUC:
def week_start_day(s):
return s - (pd.to_timedelta((s.dt.weekday + 1)%7, unit='D'))
def week_end_day(s):
return s + pd.to_timedelta(7-(s.dt.weekday+2)%7, unit='D')
df['ticket_time'] = pd.to_datetime(df['ticket_time'])
df = df[df['status'] == 'closed'].groupby(['country', 'product']).agg(week_startdate = pd.NamedAgg(column='ticket_time', aggfunc=week_start_day),
week_enddate = pd.NamedAgg(column='ticket_time', aggfunc=week_end_day),
Value = pd.NamedAgg(column='id', aggfunc="size"))
df.reset_index(inplace=True)OUTPUT
country product week_startdate week_enddate Value
0 Antigua TV 2021-10-03 2021-10-09 1
1 St Lucia Broadband 2021-10-10 2021-10-16 1
2 St Lucia Fixed Voice 2021-10-03 2021-10-09 1Setup
data = {'id': [1260057, 2998178, 3762949, 3766608, 3767125],
'status': ['open', 'open', 'closed', 'closed', 'closed'],
'ticket_time': ['2021-10-03 01:20:00', '2021-10-06 00:00:00', '2021-10-04 00:00:00',
'2021-10-10 00:00:00', '2021-10-03 00:00:00'],
'product': ['Broadband', 'Fixed Voice', 'Fixed Voice', 'Broadband', 'TV'],
'country': ['Grenada', 'Bahamas', 'St Lucia', 'St Lucia', 'Antigua']}
df = pd.DataFrame.from_dict(data)https://stackoverflow.com/questions/70450520
复制相似问题