Q: Text2Vec classification with caret problem

Asked by a Stack Overflow user on 2016-08-04 21:19:11

1 answer, 1.1K views, 0 followers, 4 votes

Some context: Working with text classification and big sparse matrices in R

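I have been working on a multi-class text classification problem using the text2vec package together with caret. The plan is to use text2vec to build the document-term matrix, prune the vocabulary, and handle the various preprocessing steps, and then try different models with caret. However, I cannot get any results, because during training caret throws errors like the following: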

+ Fold02.Rep1: cost=0.25 
predictions failed for Fold01.Rep1: cost=0.25 Error in as.vector(data) : 
no method for coercing this S4 class to a vector

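This happens for every fold and every repeat. I suspect there is a problem converting the document-term matrix produced by text2vec into a vector, which caret needs for some of its computations, but honestly I am not sure, and that is the main reason for this question.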

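The code I used, with some parts skipped, is shown below. Note that I pass the document-term matrix returned by text2vec directly to caret, and I am not entirely sure that this is correct.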

library(text2vec)
library(caret)
data("movie_review")
train = movie_review[1:4000, ]
test = movie_review[4001:5000, ]

it <- itoken(train$review, preprocess_function = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it, stopwords = tokenizers::stopwords())
pruned_vocab <- prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001)

vectorizer <- vocab_vectorizer(pruned_vocab)
it = itoken(train$review, tokenizer = word_tokenizer, ids = train$id)
dtm_train = create_dtm(it, vectorizer)
it = itoken(test$review, tokenizer = word_tokenizer, ids = test$id)
dtm_test = create_dtm(it, vectorizer)

ctrl.svm.1 <- trainControl(method="repeatedcv",
                           number=10,
                           repeats=5,
                           summaryFunction = multiClassSummary,
                           verboseIter = TRUE)

fit.svm.1 <- train(x = dtm_train, y= as.factor(train$sentiment), 
                   method="svmLinear2",  
                   metric="Accuracy", 
                   trControl = ctrl.svm.1, 
                   scale = FALSE, verbose = TRUE)

As I said, the problem appears when the train() function is launched. The dtm_train object has the following class:

[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"

Its structure looks like this:

str(dtm_train)
> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:368047] 2582 2995 3879 3233 2118 2416 2468 2471 3044 3669 ...
  ..@ p       : int [1:6566] 0 0 3 4 4 10 10 14 14 22 ...
  ..@ Dim     : int [1:2] 4000 6565
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:4000] "5814_8" "2381_9" "7759_3" "3630_4" ...
  .. ..$ : chr [1:6565] "floriane" "lil" "elm" "kolchak" ...
  ..@ x       : num [1:368047] 1 1 1 1 1 1 2 2 1 3 ...
  ..@ factors : list()

What am I doing wrong? Why can't caret handle this kind of data if the documentation implies that it can?

Tags: svm, r-caret, text-classification, text2vec

1 Answer

Answered by a Stack Overflow user on 2016-08-06 16:15:18 (accepted answer)

If you convert your S4-class dtm_train into a plain matrix, the code works.

fit.svm.1 <- train(x = as.matrix(dtm_train), y= as.factor(train$sentiment), 
                   method="svmLinear2",  
                   metric="Accuracy", 
                   trControl = ctrl.svm.1, 
                   scale = FALSE, verbose = TRUE)
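To see what the conversion does in isolation, here is a minimal sketch on a toy sparse matrix. The object toy_dtm and its contents are made up for illustration and are not taken from movie_review; only the Matrix package is assumed to be installed.

```r
library(Matrix)
library(methods)

# Hypothetical 3 x 4 document-term matrix with four non-zero counts
toy_dtm <- sparseMatrix(
  i = c(1, 2, 3, 3),
  j = c(1, 3, 2, 4),
  x = c(1, 2, 1, 3),
  dims = c(3, 4),
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("good", "bad", "film", "plot"))
)

is(toy_dtm, "dgCMatrix")   # TRUE: same S4 class as dtm_train in the question
is.matrix(toy_dtm)         # FALSE: not a base R matrix

# as.matrix() materializes every cell, zeros included, as a base R matrix
dense_dtm <- as.matrix(toy_dtm)
is.matrix(dense_dtm)       # TRUE
dense_dtm["doc3", "plot"]  # 3, the count stored for that document and term
```

Row and column names carry over to the dense copy, so the terms still line up with the vocabulary after conversion.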

Don't forget to do the same for your dtm_test, otherwise the predict function will complain as well.

pred <- predict(fit.svm.1, newdata = as.matrix(dtm_test))
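Keep in mind that a dense matrix stores every cell, zeros included. The following back-of-the-envelope estimate uses the dimensions and non-zero count from the str() output in the question. It assumes 8 bytes per double and 4 bytes per integer index, and it ignores dimnames and object overhead, so treat the numbers as rough.

```r
# Figures read off str(dtm_train) in the question
n_rows    <- 4000
n_cols    <- 6565
n_nonzero <- 368047

# Dense: one double per cell
dense_bytes <- n_rows * n_cols * 8

# Sparse (column-compressed): values + row indices + column pointers
sparse_bytes <- n_nonzero * (8 + 4) + (n_cols + 1) * 4

dense_bytes / 1e6               # about 210 MB
sparse_bytes / 1e6              # about 4.4 MB
n_nonzero / (n_rows * n_cols)   # about 0.014, i.e. roughly 1.4% of cells are non-zero
```

At this size the dense copy is manageable, but the cost grows with rows times columns, so a larger corpus or vocabulary may not fit in memory after conversion.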

Votes: 4

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/38768499
