Question: H2O: how to use gradient boosting on textual data?

Stack Overflow user

Asked 2017-06-14 21:28:08

Answers: 1 · Views: 502 · Followers: 0 · Votes: 2

I am trying to implement a very simple ML problem in which I use text to predict some outcome. In R, a basic example would be:

Import some fake but funny text data

library(caret)
library(dplyr)
library(text2vec)

dataframe <- data_frame(id = c(1,2,3,4),
                        text = c("this is a this", "this is another",
                                 'hello', 'what???'),
                        value = c(200,400,120,300),
                        output = c('win', 'lose','win','lose'))

> dataframe
# A tibble: 4 x 4
     id            text value output
  <dbl>           <chr> <dbl>  <chr>
1     1  this is a this   200    win
2     2 this is another   400   lose
3     3           hello   120    win
4     4         what???   300   lose

Use text2vec to obtain a sparse matrix representation of the text (see also https://github.com/dselivanov/text2vec/blob/master/vignettes/text-vectorization.Rmd)

# tolower (base R) lowercases the text; word_tokenizer (text2vec) splits it into tokens
prep_fun = tolower
tok_fun = word_tokenizer 

#create the tokens
train_tokens = dataframe$text %>% 
  prep_fun %>% 
  tok_fun

it_train = itoken(train_tokens)     
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)

> dtm_train
4 x 6 sparse Matrix of class "dgCMatrix"
  what hello another a is this
1    .     .       . 1  1    2
2    .     .       1 .  1    1
3    .     1       . .  .    .
4    1     .       . .  .    .
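To make the matrix above concrete, here is a minimal base-R sketch that recomputes the same term counts without text2vec. It assumes that splitting on non-letter characters is close enough to `word_tokenizer` for this toy data; the names `texts`, `tokens`, `vocab`, and `counts` are illustrative only, and the columns come out in alphabetical order rather than in text2vec's order.

```r
# Toy texts from the question
texts <- c("this is a this", "this is another", "hello", "what???")

# Lowercase, split on runs of non-letters, and drop empty tokens
tokens <- strsplit(tolower(texts), "[^a-z]+")
tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])

# Vocabulary: every distinct token, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document (rows = documents, columns = terms)
counts <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
dimnames(counts) <- list(as.character(seq_along(texts)), vocab)

print(counts)
```

Each row is one document and each column one vocabulary term, which is the same information `create_dtm` stores in sparse form.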

Finally, train an algorithm (for example, using caret) to predict output from my sparse matrix.

mymodel <- train(x=dtm_train, y =dataframe$output, method="xgbTree")

> confusionMatrix(mymodel)
Bootstrapped (25 reps) Confusion Matrix 

(entries are percentual average cell counts across resamples)

          Reference
Prediction lose  win
      lose 17.6 44.1
      win  29.4  8.8

 Accuracy (average) : 0.264
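As a quick check on how the reported accuracy relates to the table, the sketch below re-enters the percentages by hand (the object names `cm` and `accuracy` are illustrative). It assumes the cells are percentages of all resampled predictions, as the caret output states, so the accuracy is the sum of the diagonal divided by 100.

```r
# Cell percentages copied from the confusion matrix above
cm <- matrix(c(17.6, 29.4, 44.1, 8.8), nrow = 2,
             dimnames = list(Prediction = c("lose", "win"),
                             Reference  = c("lose", "win")))

# Correct predictions sit on the diagonal: lose/lose and win/win
accuracy <- sum(diag(cm)) / 100
print(accuracy)
```

With only four rows and bootstrap resampling, an accuracy this far below 0.5 mostly reflects how little data there is, not anything specific to the model.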

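My problem is:

I see how to import data into h2o using spark_read_csv, rsparkling and as_h2o_frame. However, for the second and third steps above (building the sparse matrix and training the model) I am completely lost.

Can someone please give me some hints, or tell me whether this approach is even possible with h2o?

Many thanks!!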

Tags: apache-spark, h2o, sparklyr, text2vec

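1 Answer

Stack Overflow user

Accepted answer

Answered 2017-06-15 06:25:22

You can solve this in one of two ways: (1) do the text processing in R first and then move to H2O for modeling, or (2) do everything in H2O using H2O's word2vec implementation.

For the first option, use R data.frames and text2vec, then convert the sparse matrix to an H2O frame and do the modeling in H2O.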

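library(h2o)
h2o.init()  # start or connect to a local H2O cluster

# Use the same code as above to build dtm_train, then:

# Convert the dgCMatrix to an H2OFrame and add the response column
train <- as.h2o(dtm_train)
train$y <- as.h2o(dataframe$output)
train$y <- as.factor(train$y)  # bernoulli needs a categorical response

# Train any H2O model (e.g. GBM); on the 4-row toy data you may also
# need min_rows = 1, since the default of 10 expects more rows
mymodel <- h2o.gbm(y = "y", training_frame = train,
                   distribution = "bernoulli", seed = 1)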

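Alternatively, you can train a word2vec embedding in H2O and apply it to your text to get the equivalent of the sparse matrix, then use that to train an H2O machine learning model. I will try to edit this answer later with a working example using your data, but in the meantime there is an example demonstrating the use of H2O's word2vec functionality in R.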
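A rough sketch of what the word2vec route could look like on the question's toy data follows. It is not the answerer's promised example. It assumes a recent `h2o` R package and a local cluster, and it borrows the tokenizing regex from H2O's own word2vec demo. The object names (`hf`, `words`, `w2v`, `doc_vecs`, `gbm`) and the settings `vec_size = 10`, `min_word_freq = 1`, and `min_rows = 1` are choices made here so that a four-row corpus can train at all. They are not recommendations.

```r
library(h2o)
h2o.init()  # assumes a local H2O cluster can be started

# Toy data from the question
texts  <- c("this is a this", "this is another", "hello", "what???")
output <- c("win", "lose", "win", "lose")

# Upload to H2O; h2o.tokenize needs a string column, the GBM a factor response
hf <- as.h2o(data.frame(text = texts, output = output,
                        stringsAsFactors = FALSE))
hf$text   <- as.character(hf$text)
hf$output <- as.factor(hf$output)

# Split on non-word characters and lowercase; the result is one column of
# words in which NA rows mark document boundaries
words <- h2o.tolower(h2o.tokenize(hf$text, "\\\\W+"))

# Train a small embedding; min_word_freq = 1 keeps every word of this tiny corpus
w2v <- h2o.word2vec(words, vec_size = 10, min_word_freq = 1, epochs = 10)

# Average the word vectors of each document: one dense row per text
doc_vecs <- h2o.transform(w2v, words, aggregate_method = "AVERAGE")

# Attach the response and train a GBM on the embedding columns
train <- h2o.cbind(doc_vecs, hf$output)
gbm <- h2o.gbm(x = names(doc_vecs), y = "output", training_frame = train,
               distribution = "bernoulli", min_rows = 1, seed = 1)
```

Unlike the document-term matrix, the averaged embedding is dense and has a fixed width (`vec_size`) regardless of vocabulary size, which is the main trade-off between the two approaches.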

Votes: 1

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/44555015
