Edited following suggestions in the comment thread: I have now narrowed the problem down to drop_duplicates(), which causes the function to run forever. After removing drop_duplicates(), the function reaches the df_output.to_csv() step quickly, but gets no further. I suspect the duplicate rows are causing the problem. Do any pandas experts have a suggestion?
The function that creates the output looks like this:
def create_output(model, users_to_recommend, n_rec, print_csv=True):
    recomendation = model.recommend(users=users_to_recommend, k=n_rec)
    df_rec = recomendation.to_dataframe()
    df_rec['recommendedProducts'] = df_rec.groupby([user_id])[page_id] \
        .transform(lambda x: '|'.join(x.astype(str)))
    df_output = df_rec[['userID', 'recommendedProducts']].drop_duplicates() \
        .sort_values('userID').set_index('userID')
    if print_csv:
        df_output.to_csv('output/normdata_recommendation.csv')
        print("An output file can be found in 'output' folder with name 'normdata_recommendation.csv'")
    return df_output

Output when the function is called:
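What the groupby/transform plus drop_duplicates pattern in the function does can be illustrated on a small made-up frame (the data below is hypothetical, standing in for the recommender's output):

```python
import pandas as pd

# Toy recommendation frame: one row per (user, item) pair
df_rec = pd.DataFrame({
    'userID': ['u1', 'u1', 'u2'],
    'pageID': ['p1', 'p2', 'p3'],
})

# transform broadcasts the joined string back onto every row of its group,
# so all of a user's rows carry the same pipe-separated item list
df_rec['recommendedProducts'] = df_rec.groupby('userID')['pageID'] \
    .transform(lambda x: '|'.join(x.astype(str)))

# drop_duplicates then collapses the now-identical rows to one per user
df_output = df_rec[['userID', 'recommendedProducts']].drop_duplicates() \
    .sort_values('userID').set_index('userID')
print(df_output)
# u1 -> 'p1|p2', u2 -> 'p3'
```

Note that after the transform, drop_duplicates is comparing long joined strings row by row, which is where the cost grows with hundreds of thousands of users.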
recommendations finished on 1000/617256 queries. users per second: 263089
recommendations finished on 2000/617256 queries. users per second: 179340
recommendations finished on 3000/617256 queries. users per second: 152447
.
.
.
recommendations finished on 615000/617256 queries. users per second: 105123
recommendations finished on 616000/617256 queries. users per second: 104996
recommendations finished on 617000/617256 queries. users per second: 104910

The function call used to create the output:
# constant variables to define field names include:
user_id = 'userID'
page_id = 'pageID'
users_to_recommend = list(page_usage[user_id])
n_rec = 10 # number of items to recommend
n_display = 30 # to display the first few rows in an output dataset
name = 'popularity' # popularity model chosen
target = 'scaled_visit_freq'
popularity = model(train_data_norm, name, user_id, page_id, target, users_to_recommend, n_rec, n_display)
df_output = create_output(popularity, users_to_recommend, n_rec, print_csv=True)

The model function uses Turicreate to return the selected model and trains it; that step completes successfully. Output of the model function:
Preparing data set.
Data has 20799 observations with 1138 users and 511 items.
Data prepared in: 0.06968s
20799 observations to process; with 511 unique items.
recommendations finished on 1000/617256 queries. users per second: 270490
recommendations finished on 2000/617256 queries. users per second: 244499
.
.
.
recommendations finished on 615000/617256 queries. users per second: 108578
recommendations finished on 616000/617256 queries. users per second: 108591
recommendations finished on 617000/617256 queries. users per second: 108611

Posted on 2019-09-16 04:55:31
I solved the problem by adding drop_duplicates() immediately after df_rec = recomendation.to_dataframe(). This greatly reduces the amount of computation needed.
For anyone who finds these code snippets useful, especially on large datasets: remember to change that line to df_rec = recomendation.to_dataframe().drop_duplicates()
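The fix can be sketched as follows (toy data standing in for the Turicreate output): dropping exact duplicate rows right after to_dataframe() shrinks the frame while its values are still small scalars, before the expensive groupby/transform and before the later drop_duplicates has to compare long joined strings.

```python
import pandas as pd

# Simulated model output containing fully duplicated (user, item) rows
raw = pd.DataFrame({
    'userID': ['u1', 'u1', 'u1', 'u2'],
    'pageID': ['p1', 'p1', 'p2', 'p3'],
})

# Dedupe first, while comparisons are cheap; .copy() avoids a
# SettingWithCopyWarning on the column assignment below
df_rec = raw.drop_duplicates().copy()

df_rec['recommendedProducts'] = df_rec.groupby('userID')['pageID'] \
    .transform(lambda x: '|'.join(x.astype(str)))
df_output = df_rec[['userID', 'recommendedProducts']].drop_duplicates() \
    .sort_values('userID').set_index('userID')
print(df_output)
# u1 -> 'p1|p2', u2 -> 'p3'
```

The final result is identical to deduplicating only at the end; the early drop_duplicates simply removes redundant rows before they are fed through the string-joining transform.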
https://stackoverflow.com/questions/57936232