I am trying to use the dplyr pipe and apply word_tokenizer from the text2vec package.
Here is some data:
text <- c("Because I could not stop for Death I add additional text-",
"He kindly stopped for me some additional text to act as a filler -",
"The Carriage held but just Ourselves more additional text to add to the body of the text-",
"and Immortality plus some more words to fill the text a little")
ID <- c(1,2,3,4)
output <- c(1,0,0,1)
df <- data.frame(cbind(ID, text, output))
df$text <- as.character(df$text)
library(text2vec)
df %>%
  word_tokenizer(text)

throws a warning, while

df %>%
  mutate(word_tokenizer(text))

gives some output, but not in the list format I expect.
The correct call is word_tokenizer(df$text). I just want to know how to do this with the pipe, since I have some other processing before this step.
I would also like to finish the pipeline with itoken() and create_vocabulary().
Posted on 2019-09-10 20:46:27
You can do this with with. The key is to understand how the pipe works and how word_tokenizer works.
The pipe takes the output of the left-hand side (LHS) and passes it as the first argument (by default, though it can be any argument) to the function on the right-hand side (RHS). word_tokenizer expects a character vector as its argument.
The LHS of your pipe is a data frame, so on the RHS you need a function that accepts a data frame as an argument and can pass a column of that data frame on to another function; in this case, passing the character vector in the text column to word_tokenizer. with does exactly that.
text <- c("Because I could not stop for Death I add additional text-",
"He kindly stopped for me some additional text to act as a filler -",
"The Carriage held but just Ourselves more additional text to add to the body of the text-",
"and Immortality plus some more words to fill the text a little")
ID <- c(1,2,3,4)
output <- c(1,0,0,1)
df <- data.frame(cbind(ID, text, output))
df$text <- as.character(df$text)
library(text2vec)
df %>%
  with(word_tokenizer(text))
# [[1]]
# [1] "Because" "I" "could" "not" "stop"
# [6] "for" "Death" "I" "add" "additional"
# [11] "text"
#
# [[2]]
# [1] "He" "kindly" "stopped" "for" "me"
# [6] "some" "additional" "text" "to" "act"
# [11] "as" "a" "filler"
#
# [[3]]
# [1] "The" "Carriage" "held" "but" "just"
# [6] "Ourselves" "more" "additional" "text" "to"
# [11] "add" "to" "the" "body" "of"
# [16] "the" "text"
#
# [[4]]
# [1] "and" "Immortality" "plus" "some"
# [5] "more" "words" "to" "fill"
# [9] "the" "text" "a" "little" 您还询问了如何通过管道将text2vec的输出传输到itoken,以及如何将其输出传输到create_vocabulary。同样,关键是要理解函数LHS返回什么,以及RHS上的函数期望什么。text2vec返回一个列表,而itoken需要一个可迭代的对象;列表是可迭代的,因此只需通过管道将text2vec的输出直接传递给itoken即可。在您的评论中,您试图再次使用with,就好像text2vec的输出是一个数据帧一样。我是通过查看您正在使用的函数的帮助页面来弄清楚这一点的;这向我显示了它们所期望的参数类型。如果您不知道函数返回的类型,可以参考帮助页面或将其输出通过管道传输到class。
library(text2vec)
df %>%
  with(word_tokenizer(text)) %>%
  itoken() %>%
  create_vocabulary()
# |===============================================================| 100%
# Number of docs: 4
# 0 stopwords: ...
# ngram_min = 1; ngram_max = 1
# Vocabulary:
# term term_count doc_count
# 1: Because 1 1
# 2: stop 1 1
# 3: just 1 1
# 4: not 1 1
# 5: Immortality 1 1
# 6: little 1 1
# 7: filler 1 1
# 8: kindly 1 1
# 9: of 1 1
# 10: and 1 1
# 11: plus 1 1
# 12: fill 1 1
# 13: could 1 1
# 14: me 1 1
# 15: Carriage 1 1
# 16: but 1 1
# 17: body 1 1
# 18: stopped 1 1
# 19: as 1 1
# 20: He 1 1
# 21: act 1 1
# 22: The 1 1
# 23: Death 1 1
# 24: words 1 1
# 25: held 1 1
# 26: Ourselves 1 1
# 27: some 2 2
# 28: more 2 2
# 29: I 2 1
# 30: a 2 2
# 31: add 2 2
# 32: for 2 2
# 33: the 3 2
# 34: additional 3 3
# 35: to 4 3
# 36: text 5 4
#          term term_count doc_count
https://stackoverflow.com/questions/57870395
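As the answer suggests, piping a step's output into class() is a quick way to check what it returns before deciding what can come next in the pipeline; a small sketch, assuming dplyr and text2vec are installed:

```r
library(dplyr)      # provides the %>% pipe
library(text2vec)

df <- data.frame(text = c("stop for Death", "kindly stopped"),
                 stringsAsFactors = FALSE)

# word_tokenizer returns a plain list, which is why it can be
# piped straight into itoken()
df %>%
  with(word_tokenizer(text)) %>%
  class()
# [1] "list"
```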