I'd really appreciate your help with this.
I have a table containing products, dates, and amounts. This is what the initial table looks like:
Product ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
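1 2020-10 2022-02 60
The goal is to split each row into one row per month. For the first year, the months run from the go-live month to the end of that year; for the end year, they run from January to the end month. Every year in between is split into 12 rows. Revenue has to be divided by the number of month rows in that year, because the Revenue in the first table covers the whole period of that row. The expected result looks like this: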
Product ID goliveyear endyear Year Month Revenue
1 2020-10 2022-02 2020 10 90/3=30
1 2020-10 2022-02 2020 11 30
1 2020-10 2022-02 2020 12 30
1 2020-10 2022-02 2021 01 140/12 =11.67
1 2020-10 2022-02 2021 02 11.67
1 2020-10 2022-02 2021 03 11.67
1 2020-10 2022-02 2021 04 11.67
... ... ... ... ... ...
1 2020-10 2022-02 2022 01 60/2 = 30
1 2020-10 2022-02 2022 02 30
Thank you all very much.
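For reference, the splitting rule described above can be sketched in plain Python. This is only an illustration of the arithmetic, assuming there is exactly one Revenue row per calendar year, in order, from the go-live year to the end year; the function names `months_in_year` and `split_revenue` are hypothetical and not part of any library.

```python
def months_in_year(start, end, year):
    """Return the month numbers of `year` that fall inside [start, end].

    `start` and `end` are (year, month) tuples; both ends are inclusive.
    """
    first = start[1] if year == start[0] else 1
    last = end[1] if year == end[0] else 12
    return list(range(first, last + 1))


def split_revenue(start, end, yearly_revenue):
    """Spread each year's revenue evenly over that year's active months.

    Assumption: `yearly_revenue` holds one amount per calendar year,
    in order, starting with the go-live year.
    """
    rows = []
    for offset, revenue in enumerate(yearly_revenue):
        year = start[0] + offset
        months = months_in_year(start, end, year)
        for month in months:
            rows.append((year, month, revenue / len(months)))
    return rows


# The sample data from the question: 2020-10 to 2022-02, revenues 90, 140, 60.
rows = split_revenue((2020, 10), (2022, 2), [90, 140, 60])
for row in rows:
    print(row)
```

With the sample data this yields 3 rows for 2020, 12 for 2021, and 2 for 2022.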
Posted on 2020-12-03 18:51:20
This takes quite a few steps.
Start by setting up df:
from io import StringIO
import pandas as pd
from datetime import datetime,timedelta
df = pd.read_csv(StringIO(
"""
Product_ID goliveyear endyear Revenue
1 2020-10 2022-02 90
1 2020-10 2022-02 140
1 2020-10 2022-02 60
"""), sep=r'\s+')  # delim_whitespace is deprecated in recent pandas versions
df['goliveyear'] = pd.to_datetime(df['goliveyear'])
df['endyear'] = pd.to_datetime(df['endyear'])
df
Then add the year_start, year_end, period_start, and period_end columns (ys, ye, ps, pe):
df['ys'] = df['goliveyear'].dt.year + df.groupby('Product_ID').cumcount()
df['ye'] = df['ys'] + 1
df['ys'] = pd.to_datetime(df['ys'], format = '%Y')
df['ye'] = pd.to_datetime(df['ye'], format = '%Y')+ timedelta(days=-1)
df['ps'] = df[['goliveyear','ys']].max(axis=1)
df['pe'] = df[['endyear','ye']].min(axis=1)
which produces:
Product_ID goliveyear endyear Revenue ys ye ps pe
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- -------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00
Next, add months as a list column (the closed argument of pd.date_range was removed in pandas 2.0, so it is left out here; by default both endpoints are included):
df['months'] = df.apply(lambda r: [d.month for d in pd.date_range(r['ps'], r['pe'], freq='MS')], axis=1)
Output:
Product_ID goliveyear endyear Revenue ys ye ps pe months
-- ------------ ------------------- ------------------- --------- ------------------- ------------------- ------------------- ------------------- ---------------------------------------
0 1 2020-10-01 00:00:00 2022-02-01 00:00:00 90 2020-01-01 00:00:00 2020-12-31 00:00:00 2020-10-01 00:00:00 2020-12-31 00:00:00 [10, 11, 12]
1 1 2020-10-01 00:00:00 2022-02-01 00:00:00 140 2021-01-01 00:00:00 2021-12-31 00:00:00 2021-01-01 00:00:00 2021-12-31 00:00:00 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
2 1 2020-10-01 00:00:00 2022-02-01 00:00:00 60 2022-01-01 00:00:00 2022-12-31 00:00:00 2022-01-01 00:00:00 2022-02-01 00:00:00 [1, 2]
Then explode months, perform the required calculation on Revenue, and drop the columns that are no longer needed:
df = df.explode('months')
df['Revenue'] = df['Revenue'] / df.groupby(['Product_ID','ys'])['months'].transform('count')
df = df.drop(columns = ['goliveyear','endyear','ye','ps','pe'])
df['ys'] = df['ys'].dt.year
to get:
Product_ID Revenue ys months
-- ------------ --------- ---- --------
0 1 30 2020 10
0 1 30 2020 11
0 1 30 2020 12
1 1 11.6667 2021 1
1 1 11.6667 2021 2
1 1 11.6667 2021 3
1 1 11.6667 2021 4
1 1 11.6667 2021 5
1 1 11.6667 2021 6
1 1 11.6667 2021 7
1 1 11.6667 2021 8
1 1 11.6667 2021 9
1 1 11.6667 2021 10
1 1 11.6667 2021 11
1 1 11.6667 2021 12
2 1 30 2022 1
2 1 30 2022 2
Posted on 2020-12-03 18:27:43
Try this:
import pandas as pd
from io import StringIO
s = """
Product ID,goliveyear,endyear,Revenue
1,2020-10,2022-02,90
1,2020-10,2022-02,140
1,2020-10,2022-02,60"""
df = pd.read_csv(StringIO(s))
# generate the list of month starts between these two months
df['rng'] = df.apply(lambda x: pd.date_range(x['goliveyear'], x['endyear'],
freq='MS'), axis=1)
# explode the dataframe by months list
df_exploded = df.explode('rng')
df_exploded['Year'] = df_exploded['rng'].dt.year
df_exploded['Month'] = df_exploded['rng'].dt.month
# the (index, year) pairs used to filter rows; this assumes one row per year, in order
filter_year = list(zip(df.index, df_exploded.Year.unique()))
# used columns
use_cols = ['Product ID', 'goliveyear', 'endyear', 'Revenue', 'Month']
# filter rows
df_filter = df_exploded.set_index([df_exploded.index,
df_exploded.Year]).loc[filter_year,
use_cols].reset_index().drop(columns='level_0')
# calculate the average Revenue
result = df_filter.set_index(['Year', "Month"]).assign(
Revenue=(df_filter.groupby(['Year', 'Month'])['Revenue'].sum() /
df_filter.groupby('Year')['Month'].count())
).reset_index()
result
Output:
Year Month Product ID goliveyear endyear Revenue
0 2020 10 1 2020-10 2022-02 30.000000
1 2020 11 1 2020-10 2022-02 30.000000
2 2020 12 1 2020-10 2022-02 30.000000
3 2021 1 1 2020-10 2022-02 11.666667
4 2021 2 1 2020-10 2022-02 11.666667
5 2021 3 1 2020-10 2022-02 11.666667
6 2021 4 1 2020-10 2022-02 11.666667
7 2021 5 1 2020-10 2022-02 11.666667
8 2021 6 1 2020-10 2022-02 11.666667
9 2021 7 1 2020-10 2022-02 11.666667
10 2021 8 1 2020-10 2022-02 11.666667
11 2021 9 1 2020-10 2022-02 11.666667
12 2021 10 1 2020-10 2022-02 11.666667
13 2021 11 1 2020-10 2022-02 11.666667
14 2021 12 1 2020-10 2022-02 11.666667
15 2022 1 1 2020-10 2022-02 30.000000
16 2022 2 1 2020-10 2022-02 30.000000
https://stackoverflow.com/questions/65122462
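As a cross-check on the outputs above, here is a minimal pandas sketch that builds the month rows directly and then verifies that the monthly amounts add back up to each year's original Revenue. It is not taken from either answer; it assumes, as both answers implicitly do, that the n-th row of a product belongs to the n-th calendar year after go-live. The names `first`, `last`, and `out` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Product ID": [1, 1, 1],
    "goliveyear": ["2020-10", "2020-10", "2020-10"],
    "endyear": ["2022-02", "2022-02", "2022-02"],
    "Revenue": [90, 140, 60],
})

golive = pd.to_datetime(df["goliveyear"], format="%Y-%m")
end = pd.to_datetime(df["endyear"], format="%Y-%m")

# Assumption: rows of each product are ordered one per calendar year.
out = df.assign(Year=golive.dt.year + df.groupby("Product ID").cumcount())

# First and last active month inside each row's calendar year.
first = golive.dt.month.where(out["Year"] == golive.dt.year, 1)
last = end.dt.month.where(out["Year"] == end.dt.year, 12)

# Divide the yearly revenue by the number of active months, then expand.
out["Revenue"] = out["Revenue"] / (last - first + 1)
out["Month"] = [list(range(a, b + 1)) for a, b in zip(first, last)]
out = out.explode("Month").reset_index(drop=True)
out["Month"] = out["Month"].astype(int)

# Sanity check: monthly amounts per year should sum to the input revenue.
yearly_totals = out.groupby("Year")["Revenue"].sum()
print(out)
print(yearly_totals)
```

The per-year totals are a cheap way to catch an off-by-one in the month ranges, since a missing or extra month changes the divisor but not the number of input rows.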