I have the following dataset:

The formula for calculating the hazard rate is:
For Year = 1: Hazard_rate(Year) = PD(Year)
For Year > 1: Hazard_rate(Year) = (PD(Year) + Hazard_rate(Year - 1) * (Year - 1)) / (Year)
Assumption: for each customer_ID, the years are monotonic and strictly > 0.
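The recursion can be unrolled by hand, which helps when looking for a loop-free version. This is only a sketch and rests on an assumption that the question does not state explicitly: each customer's years run 1, 2, ..., t with no gaps. The symbols h_t (for Hazard_rate(t)) and p_t (for PD(t)) are shorthand introduced here.

```latex
\begin{align*}
h_1 &= p_1, \qquad h_t = \frac{p_t + (t-1)\,h_{t-1}}{t} \quad (t > 1) \\
t\,h_t &= p_t + (t-1)\,h_{t-1} \\
       &= p_t + p_{t-1} + (t-2)\,h_{t-2} \\
       &= \dots = \sum_{k=1}^{t} p_k \\
h_t &= \frac{1}{t}\sum_{k=1}^{t} p_k
\end{align*}
```

Under that assumption, the hazard rate for year t is simply the mean of the PD values for years 1 through t.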
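Because the formula is recursive and needs the previous year's hazard rate, the code below is slow and becomes unmanageable for large datasets. Is there a way to vectorize this operation, or at least make the loop faster?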
# Calculate the hazard rates
# Initialise a list to collect the hazard rate for each calculation, particularly useful for the recursive nature
# of the formula
hr = []
# Loop through the dataframe, executing the hazard rate formula
# If time_period (year) = 1 then the hazard rate is equal to the pd
for index, row in df.iterrows():
    if row["Year"] == 1:
        hr.append(row["PD"])
    elif row["Year"] > 1:
        # The previous row holds this customer's hazard rate for Year - 1
        # (assumes rows are sorted by customer_ID, then Year, with no gaps);
        # indexing hr by Year alone would read another customer's values
        hr.append((row["PD"] + hr[-1] * (row["Year"] - 1)) / row["Year"])
    else:
        raise ValueError("Index contains negative or zero values")
# Attach the hazard_rates list to the dataframe
df["hazard_rate"] = hr
Posted on 2019-11-24 19:21:18
This function computes the n-th hazard rate:
# Base case: Hazard_rate(1) = PD(1), here 0.05
computed = {1: 0.05}
def func(n, computed = computed):
    '''
    Parameters:
        @n: int, year number
        @computed: dictionary with hazard rate already computed
    Returns:
        computed[n]: n-th hazard rate
    '''
    if n not in computed:
        computed[n] = (df.loc[n,'PD'] + func(n-1, computed)*(n-1))/n
    return computed[n]
Now let's compute the hazard rate for each year:
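df.set_index('year', inplace=True)
df['Hazard_rate'] = [func(i) for i in df.index]
Note that the function does not care whether the dataframe is sorted by year, but I assume the dataframe is indexed by year.
If you want the column back, just reset the index:
df.reset_index(inplace=True)
With the introduction of Customer_ID, the process becomes more complicated: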
# Function depends upon the dataframe passed as argument
def func(df, n, computed):
    if n not in computed:
        # Pass df through to the recursive call as well
        computed[n] = (df.loc[n,'PD'] + func(df, n-1, computed)*(n-1))/n
    return computed[n]
# Set index
df.set_index('year', inplace=True)
# Initialize Hazard_rate column as float
df['Hazard_rate'] = 0.0
# Iterate over each distinct customer (not over every row)
for c in df['Customer_ID'].unique():
    # Create a customer mask
    c_mask = (df['Customer_ID'] == c)
    # Select this customer's rows once
    c_df = df.loc[c_mask]
    # Initialize computed dictionary for given customer
    c_computed = {1: c_df.loc[1,'PD']}
    # Assign through a single .loc call so the write reaches df rather than a copy
    df.loc[c_mask, 'Hazard_rate'] = [func(c_df, i, c_computed) for i in c_df.index]
https://stackoverflow.com/questions/59016561
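As a possible loop-free alternative: multiplying the formula by Year gives Year * Hazard_rate(Year) = PD(Year) + (Year - 1) * Hazard_rate(Year - 1), so Year * Hazard_rate(Year) is a running sum of PD within each customer. The sketch below relies on that, and it assumes each customer's years are 1, 2, 3, ... with no gaps. The toy data and the helper name add_hazard_rate are hypothetical; the column names Customer_ID, Year and PD are taken from the code in the question, since the original dataset is not shown.

```python
import pandas as pd


def add_hazard_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Return a sorted copy of df with a 'hazard_rate' column (hypothetical helper)."""
    # Sort so that each customer's rows appear in year order.
    out = df.sort_values(["Customer_ID", "Year"]).reset_index(drop=True)
    # Running sum of PD per customer, divided by the year number.
    # Assumption: years are consecutive and start at 1 for every customer.
    out["hazard_rate"] = out.groupby("Customer_ID")["PD"].cumsum() / out["Year"]
    return out


# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Customer_ID": [1, 1, 1, 2, 2],
    "Year": [1, 2, 3, 1, 2],
    "PD": [0.05, 0.07, 0.09, 0.10, 0.20],
})
result = add_hazard_rate(toy)
print(result)
```

If a customer's years have gaps, the running-sum shortcut no longer matches the recursive formula, and a per-customer loop like the one above is the safer choice.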