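I am working on a case study in which I have to predict the number of claims per policy. Because my target variable ClaimNb is not binary, I cannot use logistic regression and have to use Poisson regression instead. My GLM code: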
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula= 'ClaimNb ~ BonusMalus+VehAge+Freq+VehGas+Exposure+VehPower+Density+DrivAge'
model = smf.glm(formula = formula, data=df,
                family=sm.families.Poisson())
I also split my data:
# train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=0)
# separate the target and the independent variables
train_x = train.drop(columns=['ClaimNb'],axis=1)
train_y = train['ClaimNb']
test_x = test.drop(columns=['ClaimNb'],axis=1)
test_y = test['ClaimNb']
My problem now is the prediction. I tried the following, but it does not work:
from sklearn.linear_model import PoissonRegressor
model = PoissonRegressor(alpha=1e-3, max_iter=1000)
model.fit(train_x,train_y)
predict = model.predict(test_x)
Is there another way to make predictions and check the model's accuracy?
Thanks
Posted on 2020-11-29 02:49:46
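Unlike sklearn, statsmodels returns a separate results object from model.fit(), so you need to assign that result and call predict on it. Also, if you use the formula interface, it is better to split your data frame into train and test sets as whole data frames and pass those to predict. For example: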
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,(50,4)),columns=['ClaimNb','BonusMalus','VehAge','Freq'])
#X = df[['BonusMalus','VehAge','Freq']]
#y = df['ClaimNb']
df_train = df.sample(round(len(df)*0.8))
df_test = df.drop(df_train.index)
formula= 'ClaimNb ~ BonusMalus+VehAge+Freq'
# fit on the training rows only, so that df_test stays unseen
model = smf.glm(formula=formula, data=df_train, family=sm.families.Poisson())
result = model.fit()
Then we can predict:
result.predict(df_train)
or:
result.predict(df_test)
https://stackoverflow.com/questions/65053036