The training portion of my code can handle data on the order of 10^4, but since my full dataset contains roughly 500,000 comments, I would like to train on more of it. When I run the trainer on 100,000 comments, I appear to run out of memory.
My get_features function seems to be the culprit.
import random
import nltk

data = get_data(limit=size)
data = clean_data(data)
all_words = [w.lower() for (comment, category) in data for w in comment]
word_features = [w for (w, _) in nltk.FreqDist(all_words).most_common(3000)]
random.shuffle(data)
def get_features(comment):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))  # error here
    return features
# I can do it myself like this:
feature_set = [(get_features(comment), category) for
               (comment, category) in data]
# Or use nltk's LazyMap implementation, which arguably does the same thing:
# feature_set = nltk.classify.apply_features(get_features, data, labeled=True)

Running this for 100,000 comments eats up all 32 GB of my RAM and eventually crashes with a MemoryError on the line features[word] = (word in set(comment)).
What can I do to mitigate this problem?
Edit: I have drastically reduced the number of features: I now use only the 300 most common words as features, which has improved performance considerably (for obvious reasons). I have also corrected a small mistake pointed out by @Marat.
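For reference, the memory pressure comes from materializing all 100,000 feature dicts at once; the LazyMap route mentioned above avoids that by computing each dict on demand. Below is a minimal, stdlib-only sketch of that lazy idea (the dummy data and word list are stand-ins for the real `data` and `word_features`, which are not shown here):

```python
# Dummy stand-ins for the question's `data` and `word_features`.
data = [(["good", "movie"], "pos"), (["bad", "plot"], "neg")]
word_features = ["good", "bad", "plot", "movie"]

def get_features(comment):
    comment_words = set(comment)  # build the set once, not once per word
    return {w: (w in comment_words) for w in word_features}

def lazy_feature_set(data):
    # Yields one (features, label) pair at a time, so only a single
    # feature dict is in memory at any moment -- the same idea as
    # nltk's apply_features, minus its random-access support.
    for comment, category in data:
        yield get_features(comment), category

pairs = list(lazy_feature_set(data))
```

Consuming the generator lazily (instead of calling list() on it) is what keeps peak memory flat.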
Posted on 2018-02-09 15:42:00
Disclaimer: there are many potential flaws in this code, so I expect it may take a few iterations to find the root cause.
Parameter mismatch:
# defined with one parameter
def get_features(comment):
    ...
# called with two
... get_features(comment, word_features), ...

Suboptimal word lookup:
# set(comment) is executed on every iteration
for word in word_features:
    features[word] = (word in set(comment))

# it can be transformed into something like:
word_set = set(comment)
for word in word_features:
    features[word] = word in word_set

# if the typical comment length is < 30, a list lookup is faster
for word in word_features:
    features[word] = word in comment

Suboptimal feature computation:
# it is cheaper to set the few positives than to check all word_features
# it is also MUCH more memory efficient
from collections import defaultdict
...

def get_features(comment):
    features = defaultdict(bool)
    for word in comment:
        features[word] = True
    return features

Suboptimal feature storage:
# a numpy array is much more efficient than a list of dicts
# .. and with pandas on top it's even nicer:
import pandas as pd
...

feature_set = pd.DataFrame(
    ({word: True for word in comment}
     for (comment, _) in data),
    columns=word_features
).fillna(False)
feature_set['category'] = [category for (_, category) in data]

https://stackoverflow.com/questions/48708135
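For completeness, the DataFrame recipe above runs end to end like this on dummy stand-ins for the question's `data` and `word_features` (which are assumptions here, not the asker's real variables):

```python
import pandas as pd

# Dummy stand-ins for the question's `data` and `word_features`.
data = [(["good", "movie"], "pos"), (["bad", "plot"], "neg")]
word_features = ["good", "bad", "plot", "movie"]

# Each generated dict names only the words present in that comment;
# fillna(False) supplies the negatives instead of storing them per dict.
feature_set = pd.DataFrame(
    ({word: True for word in comment}
     for (comment, _) in data),
    columns=word_features
).fillna(False)
feature_set["category"] = [category for (_, category) in data]
```

One row per comment and one column per feature word, so the negatives cost a single cell each rather than a dict entry each.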