I have a DataFrame whose columns contain words:
df
arg1 predicate
0 PERSON be
1 it Pick
2 details Edit
3 title Display
4 title Display

I used a pre-trained word2vec model to create a new df in which every word is replaced by its vector (a 1-D numpy array):
# get updated_df
updated_df = df.applymap(lambda x: self.filterWords(x))

def filterWords(self, x):
    # NOTE: this reloads the multi-gigabyte model for every single cell;
    # load it once outside filterWords and reuse it instead
    model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
    if x in model.vocab:
        return model[x]
    else:
        return model['xxxxx']

updated_df prints:
arg1 \
0 [0.16992188, -0.48632812, 0.080566406, 0.33593...
1 [0.084472656, -0.0003528595, 0.053222656, 0.09...
2 [0.06347656, -0.067871094, 0.07714844, -0.2197...
3 [0.06640625, -0.032714844, -0.060791016, -0.19...
4 [0.06640625, -0.032714844, -0.060791016, -0.19...
predicate
0 [-0.22851562, -0.088378906, 0.12792969, 0.1503...
1 [0.018676758, 0.28515625, 0.08886719, 0.213867...
2 [-0.032714844, 0.18066406, -0.140625, 0.115722...
3 [0.265625, -0.036865234, -0.17285156, -0.07128...
4 [0.265625, -0.036865234, -0.17285156, -0.07128...

I need to train a support-vector machine (sklearn LinearSVC) on this data. When I pass updated_df as X_train, I get:
clf.fit(updated_df, out_df.values.ravel())

array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence

What is the correct way to pass this as input data to the classifier? My y_train is fine. If instead I build updated_df from hashes of the words, as below, everything works:
updated_df = df.applymap(lambda x: hash(x))

But I need to pass the word2vec vectors so the model can capture the relationships between words. I am new to Python/ML and would appreciate any guidance.
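The error occurs because each cell of updated_df is itself an array, so sklearn cannot coerce the frame into a 2-D numeric matrix. One way around this is to flatten each row's per-cell vectors into a single long feature vector before calling fit. A minimal sketch, using toy 4-dimensional vectors in place of the 300-dimensional word2vec ones (the column names match the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dim = 4  # toy dimensionality; the GoogleNews model uses 300

# updated_df-like frame: every cell holds a 1-D vector
updated_df = pd.DataFrame({
    'arg1':      [rng.standard_normal(dim) for _ in range(5)],
    'predicate': [rng.standard_normal(dim) for _ in range(5)],
})

# Concatenate the per-cell vectors of each row into one flat row,
# giving a plain 2-D float matrix that sklearn estimators accept.
X_train = np.array([np.concatenate(row) for row in updated_df.to_numpy()])
print(X_train.shape)  # (5, 8): 5 rows, 2 columns * 4 dims each
```

The resulting X_train can then be passed directly to LinearSVC().fit(X_train, y_train).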
Edit, showing the current state after following the suggestion in the answer below:
class ConcatVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))
        print("self.dim = ", self.dim)

    def fit(self, X, y):
        print("entering concat embedding fit")
        print("fit X.shape = ", X.shape)
        return self

    def transform(self, X):
        print("entering concat embedding transform")
        print("transform X.shape = ", X.shape)
        # map punctuation-only and missing cells to a placeholder token
        X = X.replace(to_replace=[':', '?', '', ' '], value='None')
        X = X.fillna('None')
        print("X = ", X)
        X_array = X.values
        print("X_array = ", X_array)
        vectorized_array = np.array([
            np.concatenate([self.word2vec[w] for w in words if w in self.word2vec]
                           or [np.zeros(self.dim)], axis=0)
            for words in X_array
        ])
        print("vectorized array", vectorized_array)
        print("vectorized array.shape", vectorized_array.shape)
        return vectorized_array
model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
w2v = {w: vec for w, vec in zip(model.wv.index2word, model.wv.syn0)}
etree_w2v_concat = Pipeline([
("word2vec vectorizer", ConcatVectorizer(w2v)),
("extra trees", ExtraTreesClassifier(n_estimators=200))])
rf.testWordEmbClassifier(etree_w2v_concat)
def testWordEmbClassifier(self, pipe_obj):
    kb_fname = 'kb_data_3.csv'
    test_fname = 'kb_test_data_3.csv'
    kb_data = pd.read_csv(path + kb_fname, usecols=['arg1',
                                                    'feature_word_0',
                                                    'feature_word_1',
                                                    'feature_word_2',
                                                    'predicate'])
    kb_data_small = kb_data.iloc[0:5]
    kb_data_out = pd.read_csv(path + kb_fname, usecols=['output'])
    kb_data_out_small = kb_data_out.iloc[0:5]
    print(kb_data_small)
    pipe_obj.fit(kb_data_small, kb_data_out_small.values.ravel())
    print(pipe_obj.predict(kb_data_small))
    self.wordemb_predictResult(pipe_obj, test_fname, report=True)

Posted on 2018-03-18 10:19:31
It looks to me like scikit-learn raises the error because updated_df consists of two features (columns) whose cells are list-like. So for a given observation x_i:

x_i = [arg1_i, predicate_i] = [[vector_arg1_i], [vector_predicate_i]]

Scikit-learn cannot handle input features in this format.
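A quick way to see the problem, again with toy vectors: the frame's underlying array is a 2-D object array whose elements are themselves arrays rather than scalars, which is what sklearn's input validation chokes on:

```python
import numpy as np
import pandas as pd

# each cell holds a length-3 vector instead of a scalar
updated_df = pd.DataFrame({
    'arg1':      [np.zeros(3), np.ones(3)],
    'predicate': [np.zeros(3), np.ones(3)],
})

values = updated_df.to_numpy()
print(values.shape)       # (2, 2): one element per cell
print(values.dtype)       # object: elements are arrays, not numbers
print(values[0, 0].shape) # (3,): a whole vector hiding inside one cell
```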
After Word2Vec text processing there are several ways to train a machine-learning model on the result. A common approach is to sum or average the arg1 and predicate columns, so that x_i has the structure:

x_i = [(arg1_i + predicate_i) / 2] = [(vector_arg1_i + vector_predicate_i) / 2]

For more explanation of text classification, and a comparison of the Word2Vec and CountVectorizer feature-engineering approaches, see:
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
https://datascience.stackexchange.com/questions/29141
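The averaging approach above can be sketched as follows (toy 4-dimensional vectors again stand in for the word2vec embeddings; the names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dim = 4  # toy dimensionality

updated_df = pd.DataFrame({
    'arg1':      [rng.standard_normal(dim) for _ in range(5)],
    'predicate': [rng.standard_normal(dim) for _ in range(5)],
})

# Average the two word vectors of each row, yielding one dim-length
# feature vector per observation: x_i = (arg1_i + predicate_i) / 2
X_train = np.array([(row['arg1'] + row['predicate']) / 2
                    for _, row in updated_df.iterrows()])
print(X_train.shape)  # (5, 4): one averaged vector per observation
```

Unlike concatenation, averaging keeps the feature dimensionality fixed regardless of how many word columns are combined, at the cost of blurring which column contributed what.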