The training portion of my code can handle data on the order of 10^4, but since my full dataset contains roughly 500,000 comments, I would like to train on more of it. When I run the trainer on 100,000 comments, I appear to run out of memory.
My get_features function seems to be the culprit.
import random
import nltk

data = get_data(limit=size)
data = clean_data(data)
all_words = [w.lower() for (comment, category) in data for w in comment]
word_features = [w for (w, _) in nltk.FreqDist(all_words).most_common(3000)]
random.shuffle(data)
def get_features(comment):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))  # error here
    return features
# I can do it myself like this:
feature_set = [(get_features(comment), category) for
               (comment, category) in data]
# Or use nltk's LazyMap implementation, which arguably does the same thing:
# feature_set = nltk.classify.apply_features(get_features, data, labeled=True)

Running this for 100,000 comments eats up all 32 GB of my RAM and eventually crashes with a MemoryError on the line features[word] = (word in set(comment)).
What can I do to mitigate this problem?
Edit: I have drastically reduced the number of features: I now use only the 300 most common words as features, which has improved performance considerably (for obvious reasons). I have also corrected a small mistake pointed out by @Marat.
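For reference, the memory pressure comes from materializing all 100,000 feature dicts at once; the LazyMap route mentioned above avoids that by computing each dict on demand. Below is a minimal, stdlib-only sketch of that lazy idea (the dummy data and word list are stand-ins for the real `data` and `word_features`, which are not shown here):

```python
# Dummy stand-ins for the question's `data` and `word_features`.
data = [(["good", "movie"], "pos"), (["bad", "plot"], "neg")]
word_features = ["good", "bad", "plot", "movie"]

def get_features(comment):
    comment_words = set(comment)  # build the set once, not once per word
    return {w: (w in comment_words) for w in word_features}

def lazy_feature_set(data):
    # Yields one (features, label) pair at a time, so only a single
    # feature dict is in memory at any moment -- the same idea as
    # nltk's apply_features, minus its random-access support.
    for comment, category in data:
        yield get_features(comment), category

pairs = list(lazy_feature_set(data))
```

Consuming the generator lazily (instead of calling list() on it) is what keeps peak memory flat.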
Posted on 2018-02-09 15:42:00
Disclaimer: there are many potential flaws in this code, so I expect it may take a few iterations to find the root cause.
Parameter mismatch:
# defined with one parameter
def get_features(comment):
    ...
# called with two
... get_features(comment, word_features), ...

Suboptimal word lookup:
# set(comment) is executed on every iteration
for word in word_features:
    features[word] = (word in set(comment))

# it can be transformed into something like:
word_set = set(comment)
for word in word_features:
    features[word] = word in word_set

# if the typical comment length is < 30, a list lookup is faster
for word in word_features:
    features[word] = word in comment

Suboptimal feature computation:
# it is cheaper to set the few positives than to check all word_features
# it is also MUCH more memory efficient
from collections import defaultdict
...

def get_features(comment):
    features = defaultdict(bool)
    for word in comment:
        features[word] = True
    return features

Suboptimal feature storage:
# a numpy array is much more efficient than a list of dicts
# .. and with pandas on top it's even nicer:
import pandas as pd
...

feature_set = pd.DataFrame(
    ({word: True for word in comment}
     for (comment, _) in data),
    columns=word_features
).fillna(False)
feature_set['category'] = [category for (_, category) in data]

https://stackoverflow.com/questions/48708135
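For completeness, the DataFrame recipe above runs end to end like this on dummy stand-ins for the question's `data` and `word_features` (which are assumptions here, not the asker's real variables):

```python
import pandas as pd

# Dummy stand-ins for the question's `data` and `word_features`.
data = [(["good", "movie"], "pos"), (["bad", "plot"], "neg")]
word_features = ["good", "bad", "plot", "movie"]

# Each generated dict names only the words present in that comment;
# fillna(False) supplies the negatives instead of storing them per dict.
feature_set = pd.DataFrame(
    ({word: True for word in comment}
     for (comment, _) in data),
    columns=word_features
).fillna(False)
feature_set["category"] = [category for (_, category) in data]
```

One row per comment and one column per feature word, so the negatives cost a single cell each rather than a dict entry each.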