Question: Using word vector data as input features for a support vector machine

Data Science user

Asked 2018-03-16 01:35:16

1 answer, 6.5K views, 0 followers, 0 votes

I have data made up of columns of words.

df

        arg1 predicate
    0   PERSON        be
    1       it      Pick
    2  details      Edit
    3    title   Display
    4    title   Display

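I used a pre-trained word2vec model to create a new df in which every word is replaced by its vector (a one-dimensional numpy array).

Getting updated_df:

    import gensim

    # Load the pre-trained model once rather than on every cell lookup.
    model = gensim.models.KeyedVectors.load_word2vec_format(
        './model/GoogleNews-vectors-negative300.bin', binary=True)

    def filterWords(x):
        # Out-of-vocabulary words fall back to the vector for 'xxxxx'.
        if x in model.vocab:
            return model[x]
        return model['xxxxx']

    updated_df = df.applymap(filterWords)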

Printing updated_df gives:

             arg1  \
        0  [0.16992188, -0.48632812, 0.080566406, 0.33593...   
        1  [0.084472656, -0.0003528595, 0.053222656, 0.09...   
        2  [0.06347656, -0.067871094, 0.07714844, -0.2197...   
        3  [0.06640625, -0.032714844, -0.060791016, -0.19...   
        4  [0.06640625, -0.032714844, -0.060791016, -0.19...   

                                                   predicate  
        0  [-0.22851562, -0.088378906, 0.12792969, 0.1503...  
        1  [0.018676758, 0.28515625, 0.08886719, 0.213867...  
        2  [-0.032714844, 0.18066406, -0.140625, 0.115722...  
        3  [0.265625, -0.036865234, -0.17285156, -0.07128...  
        4  [0.265625, -0.036865234, -0.17285156, -0.07128...

I need to train a support vector machine (sklearn's LinearSVC) on this data. When I pass updated_df as X_train, I get:

clf.fit(updated_df, out_df.values.ravel())    
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence

What is the correct way to pass this as input data to the classifier? My y_train is fine. If I instead build updated_df from hashes of the words, as below, it works fine.

updated_df = df.applymap(lambda x: hash(x))

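But I need to pass the word2vec vectors so that the relationships between words are captured. I am new to Python/ML and would appreciate the guidance.

Edit with the current state, following 西德博尔德's suggestion: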

import gensim
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline


class ConcatVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # Missing words map to a vector of zeros with the same
        # dimensionality as all the other vectors.
        self.dim = len(next(iter(word2vec.values())))
        print("self.dim =", self.dim)

    def fit(self, X, y=None):
        print("fit X.shape =", X.shape)
        return self

    def transform(self, X):
        print("transform X.shape =", X.shape)
        # Normalise punctuation-only, empty and missing cells to 'None'.
        X = X.replace(to_replace=[':', '?', '', ' '], value='None')
        X = X.fillna('None')
        X_array = X.values

        # Concatenate one vector per column. Substituting zeros for
        # out-of-vocabulary words keeps every row the same length.
        zeros = np.zeros(self.dim)
        vectorized_array = np.array([
            np.concatenate([self.word2vec.get(w, zeros) for w in words])
            for words in X_array
        ])

        print("vectorized array.shape", vectorized_array.shape)
        return vectorized_array


model = gensim.models.KeyedVectors.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)
w2v = {w: vec for w, vec in zip(model.wv.index2word, model.wv.syn0)}
etree_w2v_concat = Pipeline([
    ("word2vec vectorizer", ConcatVectorizer(w2v)),
    ("extra trees", ExtraTreesClassifier(n_estimators=200))])
rf.testWordEmbClassifier(etree_w2v_concat)

# Method of the class that rf is an instance of; path is defined elsewhere.
def testWordEmbClassifier(self, pipe_obj):
    kb_fname = 'kb_data_3.csv'
    test_fname = 'kb_test_data_3.csv'
    kb_data = pd.read_csv(path + kb_fname, usecols=['arg1',
                                                    'feature_word_0',
                                                    'feature_word_1',
                                                    'feature_word_2',
                                                    'predicate'])
    kb_data_small = kb_data.iloc[0:5]
    kb_data_out = pd.read_csv(path + kb_fname, usecols=['output'])
    kb_data_out_small = kb_data_out.iloc[0:5]
    print(kb_data_small)
    pipe_obj.fit(kb_data_small, kb_data_out_small.values.ravel())
    print(pipe_obj.predict(kb_data_small))
    self.wordemb_predictResult(pipe_obj, test_fname, report=True)

Tags: python, scikit-learn, pandas, svm, numpy

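1 Answer

Data Science user

Accepted answer

Answered 2018-03-18 10:19:31

It seems to me that scikit-learn raises an error because updated_df consists of two features (columns) whose values are lists. So, for a given observation x_i: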

x_i = [arg1_i, predicate_i] = [[vector_arg1_i], [vector_predicate_i]].

Scikit-learn cannot handle input features in this format.
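
A minimal sketch of why the conversion fails, under stated assumptions: `toy_vectors` is a hypothetical lookup table of 3-dimensional vectors standing in for the real 300-dimensional word2vec model, and the two-row `df` is made up. Each cell holds a whole array, so the data cannot be turned into the 2-D numeric matrix that the classifier expects. Laying each row's vectors side by side is one possible way to get such a matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the word2vec model: 3-dimensional toy vectors.
toy_vectors = {
    "title": np.array([0.1, 0.2, 0.3]),
    "Display": np.array([0.4, 0.5, 0.6]),
    "details": np.array([0.7, 0.8, 0.9]),
    "Edit": np.array([1.0, 1.1, 1.2]),
}
df = pd.DataFrame({"arg1": ["title", "details"],
                   "predicate": ["Display", "Edit"]})

# Mimic updated_df: a 2x2 grid whose cells are themselves arrays.
cells = np.empty(df.shape, dtype=object)
for i, row in enumerate(df.values):
    for j, word in enumerate(row):
        cells[i, j] = toy_vectors[word]

# Asking for a numeric matrix fails, because each cell is a sequence.
try:
    np.asarray(cells, dtype=float)
    conversion_failed = False
except (ValueError, TypeError) as err:
    conversion_failed = True
    print("conversion failed:", err)

# One possible fix: lay each row's vectors side by side.
X = np.array([np.concatenate(list(row)) for row in cells])
print(X.shape, X.dtype)  # (2, 6) float64
```

With the real model the same layout would give 600 columns for two 300-dimensional word columns.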

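There are several ways to train a supervised machine learning model after Word2Vec text processing. A common approach is to sum or average the arg1 and predicate columns, so that x_i has the following structure: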

x_i = [(arg1_i + predicate_i) / 2] = [(vector_arg_i + vector_predicate_i) / 2]
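
The averaging idea could be sketched as follows. This is an illustration under assumptions, not the asker's actual setup: `toy_vectors` holds random 4-dimensional vectors in place of the real word2vec model, the labels `y` are made up, and `row_mean` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the word2vec model: random 4-dimensional vectors.
rng = np.random.RandomState(0)
vocab = ["PERSON", "be", "it", "Pick", "details", "Edit", "title", "Display"]
dim = 4
toy_vectors = {w: rng.normal(size=dim) for w in vocab}

df = pd.DataFrame({
    "arg1": ["PERSON", "it", "details", "title", "title"],
    "predicate": ["be", "Pick", "Edit", "Display", "Display"],
})
y = np.array([0, 1, 1, 0, 0])  # made-up labels, for illustration only


def row_mean(words, vectors, dim):
    """Average the vectors of the known words in one row."""
    found = [vectors[w] for w in words if w in vectors]
    # A row with no known words becomes a zero vector.
    return np.mean(found, axis=0) if found else np.zeros(dim)


# One fixed-length numeric row per observation: shape (n_samples, dim).
X = np.array([row_mean(row, toy_vectors, dim) for row in df.values])

clf = LinearSVC()
clf.fit(X, y)
preds = clf.predict(X)
print(X.shape, preds.shape)
```

Averaging keeps the feature count equal to the embedding dimension regardless of how many word columns there are, at the cost of discarding which column each word came from.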

For more explanation of text classification, and a comparison of the Word2Vec and CountVectorizer feature-engineering approaches, see:

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

1 vote

Original page content provided by Data Science.

Original link:

https://datascience.stackexchange.com/questions/29141
