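I want to add a column to my df that shows the difference between CurrentScore and the base score for the same date, sector, and classification. The base scores are stored in a separate DataFrame named base_score_df, whose index is the date. If base_score_df has no base score for a given date, I want the result to be null.
The main df: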
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': '2022-2-1 2022-2-1 2022-2-2 2022-2-2 2022-2-2 2022-2-3 2022-2-3 2022-2-3'.split(),
                   'Name': 'Walmart Google Walmart Microsoft Target Walmart Google Microsoft'.split(),
                   'Sector': 'Retail Tech Retail Tech Retail Retail Tech Tech'.split(),
                   'Classification': '3 4 3 5 5 4 4 4'.split(),
                   'CurrentScore': '200 197 202 188 186 193 202 201'.split()
                   })
print(df)
       Date       Name  Sector Classification CurrentScore
0  2022-2-1    Walmart  Retail              3          200
1  2022-2-1     Google    Tech              4          197
2  2022-2-2    Walmart  Retail              3          202
3  2022-2-2  Microsoft    Tech              5          188
4  2022-2-2     Target  Retail              5          186
5  2022-2-3    Walmart  Retail              4          193
6  2022-2-3     Google    Tech              4          202
7  2022-2-3  Microsoft    Tech              4          201

base_score_df:
base_score_df = pd.DataFrame({'Date': '2022-2-1 2022-2-3'.split(),
                              'Retail 3': '100 97'.split(),
                              'Retail 4': '102 100'.split(),
                              'Retail 5': '103 101'.split(),
                              'Tech 3': '105 107'.split(),
                              'Tech 4': '110 109'.split(),
                              'Tech 5': '112 113'.split()
                              })
base_score_df.set_index(['Date'], inplace=True)
print(base_score_df)
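         Retail 3 Retail 4 Retail 5 Tech 3 Tech 4 Tech 5
Date
2022-2-1      100      102      103    105    110    112
2022-2-3       97      100      101    107    109    113

My solution is to: (1) concatenate Sector and Classification into a "Sector Classification" column, (2) use a for loop and row-wise iteration to look up the base score row by row and put it into a new "Base Score" column in df, (3) compute the difference in another column.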
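A minimal sketch of step (1), assuming Sector and Classification are both string columns as in the df above. The three-row frame is only a cut-down stand-in for illustration:

```python
import pandas as pd

# Cut-down stand-in for the question's df (first three rows only).
df = pd.DataFrame({
    'Sector': ['Retail', 'Tech', 'Retail'],
    'Classification': ['3', '4', '3'],
})

# Join the two string columns with a space so the result matches the
# base_score_df column headers such as 'Retail 3'.
df['Sector Classification'] = df['Sector'] + ' ' + df['Classification']
print(df['Sector Classification'].tolist())
```

If Classification were stored as integers instead, it would need `.astype(str)` before the concatenation.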
Code for step (2):
# Requires the 'Sector Classification' column from step (1).
def base_score_lookup(row):
    scoredate = row['Date']
    header = row['Sector Classification']
    return base_score_df.loc[scoredate, header]
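df['Base Score'] = df.apply(base_score_lookup, axis=1)

The problem is that the code fails if a date is missing from base_score_df. In that case I just want a null value and to move on to the next row. I would also like to know whether the code can be written differently to make it faster. Thanks in advance.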
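To illustrate the failure described above: `.loc` raises a `KeyError` when the date is not in the index of base_score_df. One way to keep the row-wise approach, shown here only as a sketch on a cut-down stand-in for the data, is to catch that error and return NaN. It assumes the "Sector Classification" column from step (1) already exists:

```python
import numpy as np
import pandas as pd

# Cut-down stand-ins for the question's data; 2022-2-2 has no base scores.
df = pd.DataFrame({
    'Date': ['2022-2-1', '2022-2-2', '2022-2-3'],
    'Sector Classification': ['Retail 3', 'Retail 3', 'Retail 4'],
    'CurrentScore': ['200', '202', '193'],
})
base_score_df = pd.DataFrame(
    {'Retail 3': ['100', '97'], 'Retail 4': ['102', '100']},
    index=pd.Index(['2022-2-1', '2022-2-3'], name='Date'),
)

def base_score_lookup(row):
    """Return the base score for this row, or NaN when it is not available."""
    try:
        return base_score_df.loc[row['Date'], row['Sector Classification']]
    except KeyError:
        # Missing date (or missing column header): fall back to a null value.
        return np.nan

df['Base Score'] = df.apply(base_score_lookup, axis=1)
df['ScoreDiff'] = df['CurrentScore'].astype(float) - df['Base Score'].astype(float)
print(df)
```

This still visits one row at a time, so it addresses the missing-date error but not the speed concern.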
Posted on 2022-02-07 02:40:03
Here is what you can do, with explanations in the comments:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': '2022-2-1 2022-2-1 2022-2-2 2022-2-2 2022-2-2 2022-2-3 2022-2-3 2022-2-3'.split(),
                   'Name': 'Walmart Google Walmart Microsoft Target Walmart Google Microsoft'.split(),
                   'Sector': 'Retail Tech Retail Tech Retail Retail Tech Tech'.split(),
                   'Classification': '3 4 3 5 5 4 4 4'.split(),
                   'CurrentScore': '200 197 202 188 186 193 202 201'.split()
                   })
base_score_df = pd.DataFrame({'Date': '2022-2-1 2022-2-3'.split(),
                              'Retail 3': '100 97'.split(),
                              'Retail 4': '102 100'.split(),
                              'Retail 5': '103 101'.split(),
                              'Tech 3': '105 107'.split(),
                              'Tech 4': '110 109'.split(),
                              'Tech 5': '112 113'.split()
                              })
# ensure date column is in the same format
df['Date'] = pd.to_datetime(df.Date)
base_score_df['Date'] = pd.to_datetime(base_score_df.Date)
# melt the base score df into a long format
base_score_df = pd.melt(base_score_df,
                        id_vars=['Date'],
                        value_vars=[_ for _ in base_score_df.columns if _ != 'Date'])
base_score_df.columns = ['Date', 'category', 'BaseScore']
# split the category into Sector and Classification
base_score_df['Sector'], base_score_df['Classification'] = zip(*base_score_df.category.str.split(' '))
base_score_df.drop('category', axis=1, inplace=True)
# merge back with original dataframe
df = pd.merge(df,
              base_score_df,
              on=['Date', 'Sector', 'Classification'],
              how='left')
# calculate score difference
df['ScoreDiff'] = df['CurrentScore'].astype(float) - df['BaseScore'].astype(float)
# output
df
        Date       Name  Sector Classification CurrentScore BaseScore  ScoreDiff
0 2022-02-01    Walmart  Retail              3          200       100      100.0
1 2022-02-01     Google    Tech              4          197       110       87.0
2 2022-02-02    Walmart  Retail              3          202       NaN        NaN
3 2022-02-02  Microsoft    Tech              5          188       NaN        NaN
4 2022-02-02     Target  Retail              5          186       NaN        NaN
5 2022-02-03    Walmart  Retail              4          193       100       93.0
6 2022-02-03     Google    Tech              4          202       109       93.0
7 2022-02-03  Microsoft    Tech              4          201       109       92.0

https://stackoverflow.com/questions/71012715
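As a possible alternative to melt plus merge, the wide table could also be stacked into a Series and looked up with `reindex`, which returns NaN for missing keys. This is only a sketch and not part of the answer above: the names `base_scores` and `keys` are illustrative, the frames are cut-down stand-ins, and the Date values in both frames are assumed to be strings in the same format with Date already set as the index of base_score_df:

```python
import pandas as pd

# Cut-down stand-ins for the question's data; 2022-2-2 has no base scores.
df = pd.DataFrame({
    'Date': ['2022-2-1', '2022-2-2', '2022-2-3'],
    'Sector': ['Retail', 'Tech', 'Tech'],
    'Classification': ['3', '5', '4'],
    'CurrentScore': ['200', '188', '202'],
})
base_score_df = pd.DataFrame(
    {'Retail 3': ['100', '97'], 'Tech 4': ['110', '109'], 'Tech 5': ['112', '113']},
    index=pd.Index(['2022-2-1', '2022-2-3'], name='Date'),
)

# Stack the wide table into a Series keyed by (Date, 'Sector Classification').
base_scores = base_score_df.astype(float).stack()

# Build the same kind of key for every row of df.
keys = pd.MultiIndex.from_arrays(
    [df['Date'], df['Sector'] + ' ' + df['Classification']]
)

# reindex yields NaN for keys that are absent from base_scores.
df['BaseScore'] = base_scores.reindex(keys).to_numpy()
df['ScoreDiff'] = df['CurrentScore'].astype(float) - df['BaseScore']
print(df)
```

Both this and the merge-based version avoid the Python-level loop over rows, which is usually where the time goes on larger frames.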