Consider the following example:
library(text2vec)
library(glmnet)
library(dplyr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
dataframe <- tibble(id = c(1, 2, 3, 4),
                    text = c("this is a test", "this is another", "hello", "what???"),
                    value = c(200, 400, 120, 300),
                    output = c("win", "lose", "win", "lose"))
> dataframe
# A tibble: 4 × 4
     id text            value output
  <dbl> <chr>           <dbl> <chr>
1     1 this is a test    200 win
2     2 this is another   400 lose
3     3 hello             120 win
4     4 what???           300 lose

Now, I can use the excellent text2vec to get the sparse matrix corresponding to the text column. To do that, I just follow the text2vec tutorial:
it_train = itoken(dataframe$text,
                  ids = dataframe$id,
                  progressbar = FALSE)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
> dtm_train
4 x 7 sparse Matrix of class "dgCMatrix"
  hello another what??? a is test this
1     .       .       . 1  1    1    1
2     .       1       . .  1    .    1
3     1       .       . .  .    .    .
4     .       .       1 .  .    .    .

This dtm sparse matrix can be fed to an ML model. But my question is: how can I also use the value variable?
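That is, as input predictors in glmnet or xgboost, I want to use my sparse matrix (from the text variable) together with the value variable, which contains some valuable information. How can I do that? Can we somehow add that information to the sparse matrix?
Thanks!
Answered 2020-02-26 14:53:57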
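You can use scipy.sparse.hstack. Note that this is Python: it assumes dtm_train is a SciPy sparse matrix and dataframe is a pandas DataFrame.
import numpy as np
from scipy.sparse import hstack
# [:, None] turns the 1-D value array into an (n, 1) column before stacking
dtm_train = hstack((dtm_train, np.array(dataframe['value'])[:, None]))

Remember that you must apply the same operation to your held-out data!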
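To make the stacking step concrete, here is a minimal, self-contained sketch in Python. The helper name add_numeric_column, the hand-built toy matrices, and the held-out row with value 250 are all hypothetical, made up for illustration; only the 4 x 7 term matrix and the value column mirror the question's data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack


def add_numeric_column(dtm, values):
    """Append one numeric column to the right of a sparse matrix.

    Hypothetical helper for illustration; not part of scipy or text2vec.
    """
    # Reshape the 1-D values into an (n_rows, 1) sparse column.
    column = csr_matrix(np.asarray(values, dtype=float).reshape(-1, 1))
    # format="csr" keeps the result sparse and row-sliceable.
    return hstack([dtm, column], format="csr")


# Toy document-term matrix mirroring the 4 x 7 example from the question
# (columns: hello, another, what???, a, is, test, this).
dtm_train = csr_matrix(np.array([
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
], dtype=float))
value_train = [200, 400, 120, 300]

X_train = add_numeric_column(dtm_train, value_train)

# Assumed held-out document ("this is") with an assumed value of 250.
# It goes through the same helper so the column layout matches training.
dtm_test = csr_matrix(np.array([[0, 0, 0, 0, 1, 0, 1]], dtype=float))
X_test = add_numeric_column(dtm_test, [250])
```

The result stays sparse, with the numeric feature as the last column, so train and held-out matrices share the same layout. In R, the setting of the question, the analogous step would presumably be column-binding the dgCMatrix with the value vector using cbind from the Matrix package; that variant is not shown or tested here.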
https://stackoverflow.com/questions/44424849