
Pandas drop_duplicates() hangs in Jupyter Notebook: how can drop_duplicates() performance be improved?

Stack Overflow user
Asked 2019-09-14 22:03:10

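Edited following suggestions in the comments. I have now narrowed the problem down to drop_duplicates(), which makes the function run forever. With drop_duplicates() removed, the function reaches the df_output.to_csv() step within a short time, but then stalls there. I suspect that duplicates are causing the problem. Does any pandas expert have a suggestion?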

The function that creates the output is as follows:

def create_output(model, users_to_recommend, n_rec, print_csv=True):
    recomendation = model.recommend(users=users_to_recommend, k=n_rec)
    df_rec = recomendation.to_dataframe()
    df_rec['recommendedProducts'] = df_rec.groupby([user_id])[page_id] \
        .transform(lambda x: '|'.join(x.astype(str)))
    df_output = df_rec[['userID', 'recommendedProducts']].drop_duplicates() \
        .sort_values('userID').set_index('userID')

    if print_csv:
        df_output.to_csv('output/normdata_recommendation.csv')
        print("An output file can be found in 'output' folder with name 'normdata_recommendation.csv'")
    return df_output

Output when the function is called:

recommendations finished on 1000/617256 queries. users per second: 263089
recommendations finished on 2000/617256 queries. users per second: 179340
recommendations finished on 3000/617256 queries. users per second: 152447
.
.
.
recommendations finished on 615000/617256 queries. users per second: 105123
recommendations finished on 616000/617256 queries. users per second: 104996
recommendations finished on 617000/617256 queries. users per second: 104910

The function call used to create the output:

# constant variables to define field names include:
user_id = 'userID'
page_id = 'pageID'
users_to_recommend = list(page_usage[user_id])
n_rec = 10 # number of items to recommend
n_display = 30 # to display the first few rows in an output dataset

name = 'popularity' # popularity model chosen
target = 'scaled_visit_freq'

popularity = model(train_data_norm, name, user_id, page_id, target, users_to_recommend, n_rec, n_display)

df_output = create_output(popularity, users_to_recommend, n_rec, print_csv=True)

The model function uses Turi Create to return the selected model and train it; this step runs successfully. Output of the model function:

Preparing data set.
    Data has 20799 observations with 1138 users and 511 items.
    Data prepared in: 0.06968s
20799 observations to process; with 511 unique items.
recommendations finished on 1000/617256 queries. users per second: 270490
recommendations finished on 2000/617256 queries. users per second: 244499
.
.
.
recommendations finished on 615000/617256 queries. users per second: 108578
recommendations finished on 616000/617256 queries. users per second: 108591
recommendations finished on 617000/617256 queries. users per second: 108611

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-09-16 04:55:31

I solved this by adding drop_duplicates() right after df_rec = recomendation.to_dataframe(). This greatly reduces the computation required.
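A minimal sketch of why the early call helps, assuming the recommendation frame contains fully repeated rows (for example, because the same user was queried more than once). The DataFrame below is synthetic and the helper name join_pages is hypothetical; only the column names and the transform/drop_duplicates chain come from the question. With late deduplication, every repeated row is joined into the per-user string first, so the strings grow with the number of repeats and drop_duplicates() then has to compare those long strings.

```python
import pandas as pd

# Synthetic stand-in for recomendation.to_dataframe(); values are made up.
# User "u1" appears twice with identical rows, mimicking a repeated query.
df_rec = pd.DataFrame({
    "userID": ["u1", "u1", "u1", "u1", "u2", "u2"],
    "pageID": ["a", "b", "a", "b", "c", "d"],
})

def join_pages(df):
    """Pipe-join pageID per user, then keep one row per user (as in the question)."""
    df = df.copy()
    df["recommendedProducts"] = df.groupby("userID")["pageID"].transform(
        lambda x: "|".join(x.astype(str))
    )
    return (
        df[["userID", "recommendedProducts"]]
        .drop_duplicates()
        .sort_values("userID")
        .set_index("userID")
    )

late = join_pages(df_rec)                     # deduplicate only at the end
early = join_pages(df_rec.drop_duplicates())  # deduplicate first, as in this answer

print(late.loc["u1", "recommendedProducts"])   # a|b|a|b
print(early.loc["u1", "recommendedProducts"])  # a|b
```

Note that under this assumption the two variants do not produce the same strings: deduplicating first also keeps repeated page IDs out of the joined result.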

For anyone who finds these code snippets useful, especially with large datasets, remember to change that line to df_rec = recomendation.to_dataframe().drop_duplicates()
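As a possible alternative for large datasets (not part of the original answer), the per-user string can be built with groupby().agg(), which yields one row per user directly and so removes the need for the second drop_duplicates() on the wide string column. This is a sketch on synthetic data; the column names follow the question, and whether it is faster on the real data is untested.

```python
import pandas as pd

# Synthetic stand-in for recomendation.to_dataframe(); values are made up.
df_rec = pd.DataFrame({
    "userID": ["u2", "u1", "u1", "u2", "u1", "u1"],
    "pageID": ["c", "a", "b", "d", "a", "b"],
})

df_output = (
    df_rec.drop_duplicates()                 # remove fully repeated rows first
    .groupby("userID")["pageID"]             # groupby sorts the userID keys by default
    .agg(lambda x: "|".join(x.astype(str)))  # one joined string per user
    .rename("recommendedProducts")
    .to_frame()                              # DataFrame indexed by userID
)

print(df_output)
```

The result has the same shape as df_output in the question: userID as the sorted index and a single recommendedProducts column.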

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/57936232
