LightFM train_interactions shared between train and test sets: This will cause incorrect evaluation, check your data split.

Stack Overflow user

Asked on 2020-04-02 04:03:54

Answers: 1 · Views: 2K · Followers: 0 · Votes: 3

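tl;dr: I am building a recommender system with Yelp data, but when I run the code below LightFM raises the error "Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split."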

test_auc = auc_score(model,
                    test,
                    #train_interactions=train, #Unable to run with this line uncommented
                    item_features=sparse_features_matrix,
                    num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)

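Full story: I am building a recommender system using the Yelp dataset.

I am working from the code provided in the example documentation (https://making.lyst.com/lightfm/docs/examples/hybrid_crossvalidated.html) for hybrid collaborative filtering.

I ran the code as follows: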

from sklearn.model_selection import train_test_split
from lightfm import LightFM
from scipy import sparse
from lightfm.evaluation import auc_score

train, test = train_test_split(sparse_Rating_Matrix, test_size=0.25,random_state=4)
# Set the number of threads; you can increase this
# if you have more physical cores available.
NUM_THREADS = 2
NUM_COMPONENTS = 100
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                item_features=sparse_features_matrix,
                epochs=NUM_EPOCHS,
                num_threads=NUM_THREADS)

# Don't forget to pass in the item features again!
train_auc = auc_score(model,
                      train,
                      item_features=sparse_features_matrix,
                      num_threads=NUM_THREADS).mean()
print('Hybrid training set AUC: %s' % train_auc)

test_auc = auc_score(model,
                    test,
                    #train_interactions=train, # Unable to run with this line uncommented
                    item_features=sparse_features_matrix,
                    num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)

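I have two problems:

1) Running with the line uncommented (train_interactions=train) initially failed because the shapes of the two matrices were inconsistent.

The "test" dataset was modified by the following code block, which appends a block of zeros beneath it until its dimensions match those of my train dataset (per this recommendation: https://github.com/lyst/lightfm/issues/369):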

import numpy as np

# Add empty users (all-zero rows) to test so that its row count matches train
N = train.shape[0] #Rows in Train set
n,m = test.shape #Rows & columns in Test set

z = np.zeros([(N-n),m]) #Create the necessary rows of zeros with m columns
test = test.todense() #Temporarily convert Test into a numpy array
test = np.vstack((test,z)) #Vertically stack Test on top of the blank users
test = sparse.csr_matrix(test) #Convert back to sparse
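
The padding step above converts the test matrix to a dense array, which can be expensive for a large user x item matrix. Below is a minimal sketch of the same padding done entirely in sparse form with scipy.sparse.vstack. The helper name pad_rows and the toy matrices are hypothetical, introduced only for illustration. Note also that sklearn's train_test_split shuffles and splits rows (users), so padding makes the shapes match but does not make row i of test refer to the same user as row i of train.

```python
import numpy as np
from scipy import sparse


def pad_rows(matrix, n_rows):
    """Append all-zero rows to a sparse matrix until it has n_rows rows."""
    matrix = sparse.csr_matrix(matrix)
    missing = n_rows - matrix.shape[0]
    if missing < 0:
        raise ValueError("matrix already has more rows than n_rows")
    if missing == 0:
        return matrix
    # An empty sparse block costs almost no memory, unlike np.zeros.
    zeros = sparse.csr_matrix((missing, matrix.shape[1]), dtype=matrix.dtype)
    return sparse.vstack([matrix, zeros], format="csr")


# Toy stand-ins for the question's matrices (hypothetical data).
train = sparse.csr_matrix(np.array([[5, 0, 3], [0, 4, 0], [1, 0, 0]]))
test = sparse.csr_matrix(np.array([[0, 2, 0]]))

padded_test = pad_rows(test, train.shape[0])
print(padded_test.shape)
```

The stored ratings are unchanged; only empty rows are added at the bottom.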

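2) Once the shape problem was resolved, I tried to pass "train_interactions=train",

but ran into "Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split."

I do not know how to solve the second problem. Any ideas?

Details:

- "sparse_features_matrix" is a sparse {items x categories} matrix. If an item is "Italian" and "Pizza", then the "Italian" and "Pizza" categories have the value "1" in that item's row, and all other categories have "0".

- "sparse_Rating_Matrix" is a sparse {users x items} matrix containing the users' rating values for restaurants (items).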
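
To make the layout of the item feature matrix concrete, here is a small sketch that builds an {items x categories} indicator matrix with scipy. The restaurant names and categories are made up for illustration and are not taken from the Yelp data.

```python
import numpy as np
from scipy import sparse

# Hypothetical items and categories, for illustration only.
item_categories = {
    "Luigi's": ["Italian", "Pizza"],
    "Thai Palace": ["Thai"],
    "Slice House": ["Pizza"],
}
items = sorted(item_categories)
categories = sorted({c for cats in item_categories.values() for c in cats})
cat_index = {c: j for j, c in enumerate(categories)}

# One (row, col) pair per category an item belongs to.
rows, cols = [], []
for i, item in enumerate(items):
    for cat in item_categories[item]:
        rows.append(i)
        cols.append(cat_index[cat])

features = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.float32), (rows, cols)),
    shape=(len(items), len(categories)),
)
print(items)
print(categories)
print(features.toarray())
```

Each row is one item and each column one category, so the row order must match the item order used for the columns of the rating matrix.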

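Update 04/08/2020:

LightFM has a whole Dataset() class object that you should use to prepare your dataset before model evaluation. I found a great GitHub post (https://github.com/lyst/lightfm/issues/494) in which user Med provides a great small test dataset.

When I prepared my data through this method, I was able to add the user_features I wanted to model (e.g. User_1592 likes "Thai", "Mexican", and "Sushi" cuisines).

Following Turbo's comment, I switched to LightFM's random_train_test_split method (I had originally split the data with sklearn's train_test_split method) and ran auc_score with the new train/test sets and a model that, as far as I can tell, was prepared correctly. I still run into the same error:

Input: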

%%time
(train,test) = random_train_test_split(lightfm_interactions,test_percentage=0.25) #LightFM's method to split
# Don't forget to pass in the user features again!
train_auc = auc_score(model_users,
                      train,
                      user_features=lightfm_user_features_list,
                      num_threads=NUM_THREADS).mean()
print('User_feature training set AUC: %s' % train_auc)

test_auc = auc_score(model_users,
                    test,
                    #train_interactions=train, #Still can't get this to function
                    user_features=lightfm_user_features_list,
                    num_threads=NUM_THREADS).mean()
print('User_feature test set AUC: %s' % test_auc)

Output when "train_interactions=train" is used:

ValueError: Test interactions matrix and train interactions matrix share 435 interactions. This will cause incorrect evaluation, check your data split.

The good news is ... so I guess it is important to stick with LightFM's own methods where they are available!
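
One possible cause of the error persisting after an interaction-level split, which nothing in this thread confirms, is that the interactions matrix stores the same (user, item) pair more than once. The COO format allows such duplicates, and a split over stored entries can then place one copy in train and another in test. The sketch below reproduces that situation with scipy alone and shows that merging duplicates first removes the overlap. The data and the helper take_entries are hypothetical, and the overlap count via an elementwise product is my understanding of the kind of check behind the error message, not a quote of LightFM's code.

```python
import numpy as np
from scipy import sparse

# Hypothetical interactions: user 0 rated item 1 twice (e.g. two reviews).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 1, 0, 2])
vals = np.array([4.0, 5.0, 3.0, 2.0])
interactions = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3))


def take_entries(coo, idx):
    """Build a CSR matrix from a subset of the stored COO entries."""
    return sparse.coo_matrix(
        (coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape
    ).tocsr()


# An entry-level split that happens to separate the two copies.
train = take_entries(interactions, [0, 2, 3])
test = take_entries(interactions, [1])
overlap_before = train.multiply(test).nnz

# Converting to CSR sums duplicates, leaving one entry per (user, item).
# Summing the ratings is only one choice; keeping the latest or the mean
# may suit real data better.
deduped = interactions.tocsr().tocoo()
train_d = take_entries(deduped, [0, 1])
test_d = take_entries(deduped, [2])
overlap_after = train_d.multiply(test_d).nnz

print(deduped.nnz, overlap_before, overlap_after)
```

If deduplicating the input removes the shared interactions on the real data, the duplicates were the cause; if not, the split itself needs another look.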

python

recommendation-engine

lightfm

1 Answer

Stack Overflow user

Posted on 2020-04-07 15:35:54

LightFM provides a method for splitting datasets; have you looked at it? It might work with that. https://making.lyst.com/lightfm/docs/cross_validation.html
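
For context, the LightFM method in question is lightfm.cross_validation.random_train_test_split, which splits individual interactions rather than rows, so train and test keep the same {users x items} shape. The sketch below illustrates that idea with scipy only. It is not LightFM's implementation, and the function name split_interactions and the toy ratings are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse


def split_interactions(interactions, test_fraction=0.25, seed=0):
    """Split the stored entries (not the rows) of a user x item matrix."""
    # Merge repeated (user, item) pairs so no pair can land on both sides.
    csr = sparse.csr_matrix(interactions, copy=True)
    csr.sum_duplicates()
    coo = csr.tocoo()

    rng = np.random.default_rng(seed)
    order = rng.permutation(coo.nnz)
    n_test = int(round(test_fraction * coo.nnz))
    test_idx, train_idx = order[:n_test], order[n_test:]

    def subset(idx):
        # Same shape as the input; only the chosen entries are kept.
        return sparse.coo_matrix(
            (coo.data[idx], (coo.row[idx], coo.col[idx])), shape=coo.shape
        ).tocsr()

    return subset(train_idx), subset(test_idx)


# Toy user x item ratings (hypothetical data).
ratings = sparse.csr_matrix(np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 1],
    [2, 0, 0, 5],
    [0, 3, 4, 0],
]))
train, test = split_interactions(ratings, test_fraction=0.25, seed=4)
shared = train.multiply(test).nnz
print(train.shape, test.shape, shared)
```

Because both halves share the original shape and no (user, item) pair appears in both, no zero padding is needed before passing train_interactions.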

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60984051
