Memory Error when training a multinomial Naive Bayes model

Stack Overflow user
Asked 2018-02-09 14:34:05
Answers: 1 | Views: 592 | Followers: 0 | Votes: 0

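The training part of my code copes fine with data on the order of 10^4 comments, but my full dataset contains roughly 500,000 comments and I would like to train on more of it. When I run the trainer on 100,000 comments, I seem to run out of memory.

My get_features function appears to be the culprit.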

data = get_data(limit=size)
data = clean_data(data)
all_words = [w.lower() for (comment, category) in data for w in comment]
word_features = []
for i in nltk.FreqDist(all_words).most_common(3000):
    word_features.append(i[0])
random.shuffle(data)

def get_features(comment):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))  # error here
    return features

# I can do it myself like this:
feature_set = [(get_features(comment), category) for
                (comment, category) in data]

# Or use nltk's lazy map implementation, which arguably does the same thing:
# feature_set = nltk.classify.apply_features(get_features, data, labeled=True)

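Running this on 100,000 comments eats up all 32 GB of my RAM and eventually crashes with a MemoryError on the line features[word] = (word in set(comment)).

What can I do to alleviate this problem?

Edit: I have drastically reduced the number of features. I now use only the 300 most common words as features, which has greatly improved performance (for obvious reasons). I also fixed a minor mistake that @Marat pointed out.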

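1 Answer

Stack Overflow user

Accepted answer

Answered 2018-02-09 15:42:00

Disclaimer: this code has a lot of potential flaws, so I expect it will take a few iterations to get to the root cause.

Parameter mismatch: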

# defined with one parameter
def get_features(comment):
    ...

# called with two
... get_features(comment, word_features), ...
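
One way to remove the mismatch, as a minimal sketch: pass the vocabulary in explicitly so that the definition and the call agree. The sample vocabulary and comment below are made up for illustration and are not from the question.

```python
def get_features(comment, word_features):
    """Map each vocabulary word to whether it occurs in the tokenized comment."""
    comment_words = set(comment)
    return {word: word in comment_words for word in word_features}


# Hypothetical sample data, not from the original question.
word_features = ["good", "bad", "movie"]
features = get_features(["a", "good", "movie"], word_features)
print(features)
```

Keeping get_features(comment) with a single parameter and relying on the module-level word_features works just as well; what matters is that every call site matches the signature.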

Suboptimal word lookup:

# set(comment) executed on every iteration
for word in word_features:
    features[word] = (word in set(comment))

# can be transformed into something like:
word_set = set(comment)
for word in word_features:
    features[word] = word in word_set

# if typical comment length is < 30, list lookup is faster
for word in word_features:
    features[word] = word in comment
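
To see which lookup variant is fastest on your own data, a rough timing sketch like the one below may help. The 3000-word vocabulary and the short comment are hypothetical stand-ins, and the timings will vary with machine and comment length, so treat the numbers as indicative only.

```python
import timeit

# Hypothetical stand-ins for the question's data: a 3000-word vocabulary
# and one short tokenized comment.
word_features = [f"w{i}" for i in range(3000)]
comment = ["w1", "w42", "w2999", "unseen"]


def rebuild_set_each_time(comment):
    # Mirrors the question: set(comment) is rebuilt for every vocabulary word.
    return {word: word in set(comment) for word in word_features}


def build_set_once(comment):
    # The set is built a single time per comment.
    word_set = set(comment)
    return {word: word in word_set for word in word_features}


def list_lookup(comment):
    # Linear scan of the token list; can be competitive for short comments.
    return {word: word in comment for word in word_features}


variants = (rebuild_set_each_time, build_set_once, list_lookup)
results = [fn(comment) for fn in variants]
for fn in variants:
    seconds = timeit.timeit(lambda: fn(comment), number=200)
    print(f"{fn.__name__}: {seconds:.4f} s for 200 calls")
```

All three variants return the same dictionary; only the cost of the membership test differs. Note that this affects speed rather than memory, since each call still produces one entry per vocabulary word.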

Suboptimal feature computation:

# it is cheaper to set few positives than to check all word_features
# also MUCH more memory efficient
from collections import defaultdict
...
def get_features(comment):
    features = defaultdict(bool)
    for word in comment:
        features[word] = True
    return features
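
Note that the version above records every word in the comment, including words outside word_features. If you want to keep the vocabulary limit, a sparse variant could look like the sketch below. The names get_dense_features, get_sparse_features and vocabulary are introduced here for illustration, and the data is a hypothetical stand-in.

```python
# Hypothetical stand-ins for the question's vocabulary and a tokenized comment.
word_features = [f"w{i}" for i in range(3000)]
vocabulary = frozenset(word_features)  # constant-time membership tests


def get_dense_features(comment):
    """One entry per vocabulary word, as in the question."""
    word_set = set(comment)
    return {word: word in word_set for word in word_features}


def get_sparse_features(comment):
    """Only the vocabulary words that actually occur in the comment."""
    return {word: True for word in comment if word in vocabulary}


comment = ["w1", "w42", "w42", "not_in_vocabulary"]
dense = get_dense_features(comment)
sparse = get_sparse_features(comment)
print(len(dense), len(sparse))  # 3000 entries versus 2
```

The dense dictionary stores 3000 entries per comment regardless of its length, while the sparse one stores only the hits. Whether a classifier treats a missing key the same way as an explicit False depends on the classifier, so that is worth checking before switching.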

Suboptimal feature storage:

# numpy array is much more efficient than a list of dicts
# .. and with pandas on top it's even nicer:
import pandas as pd
...
feature_set = pd.DataFrame(
    ({word: True for word in comment}
      for (comment, _) in data),
    columns = word_features
).fillna(False)
feature_set['category'] = [category for (_, category) in data]
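
For comparison, here is a minimal sketch of the plain numpy variant mentioned in the comment above. It uses a tiny made-up dataset, and the names column_of, X and y are introduced here for illustration. A boolean array uses one byte per cell, so 100,000 comments by 3000 features would come to roughly 300 MB.

```python
import numpy as np

# Hypothetical stand-ins for the question's data: (tokens, category) pairs.
data = [
    (["good", "movie"], "pos"),
    (["bad", "movie", "bad"], "neg"),
    (["unknown", "words"], "neg"),
]
word_features = ["good", "bad", "movie"]
column_of = {word: j for j, word in enumerate(word_features)}

# One row per comment, one column per vocabulary word, one byte per cell.
X = np.zeros((len(data), len(word_features)), dtype=bool)
for i, (comment, _) in enumerate(data):
    for word in comment:
        j = column_of.get(word)
        if j is not None:  # skip words outside the vocabulary
            X[i, j] = True
y = np.array([category for _, category in data])

print(X.astype(int))
print(X.nbytes, "bytes")
```

A matrix like this does not plug into a classifier that expects a list of (dict, label) pairs, so using it assumes you train with a library that accepts arrays, for example scikit-learn's naive Bayes estimators.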

Votes: 2

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/48708135
