I am trying to use the dplyr pipe and apply word_tokenizer from the text2vec package.
Here is some data:
text <- c("Because I could not stop for Death I add additional text-",
"He kindly stopped for me some additional text to act as a filler -",
"The Carriage held but just Ourselves more additional text to add to the body of the text-",
"and Immortality plus some more words to fill the text a little")
ID <- c(1,2,3,4)
output <- c(1,0,0,1)
df <- data.frame(cbind(ID, text, output))
df$text <- as.character(df$text)
library(text2vec)
df %>%
  word_tokenizer(text)

throws a warning, while

df %>%
  mutate(word_tokenizer(text))

gives some output, but not in the list format I expect.
The correct call is word_tokenizer(df$text). I just want to know how to do this with the pipe, since I have some other processing before this step.
I would also like to finish the pipeline with itoken() and create_vocabulary().
Posted on 2019-09-10 20:46:27
You can do this with with. The key is to understand how the pipe works and how word_tokenizer works.
The pipe takes the output of the left-hand side (LHS) and passes it as the first argument (by default, though it can be any argument) to the function on the right-hand side (RHS). word_tokenizer expects a character vector as its argument.
The LHS of your pipe is a data frame, so on the RHS you need a function that accepts a data frame as an argument and can pass a column of that data frame on to another function; in this case, passing the character vector in the text column to word_tokenizer. with does exactly that.
text <- c("Because I could not stop for Death I add additional text-",
"He kindly stopped for me some additional text to act as a filler -",
"The Carriage held but just Ourselves more additional text to add to the body of the text-",
"and Immortality plus some more words to fill the text a little")
ID <- c(1,2,3,4)
output <- c(1,0,0,1)
df <- data.frame(cbind(ID, text, output))
df$text <- as.character(df$text)
library(text2vec)
df %>%
  with(word_tokenizer(text))
# [[1]]
# [1] "Because" "I" "could" "not" "stop"
# [6] "for" "Death" "I" "add" "additional"
# [11] "text"
#
# [[2]]
# [1] "He" "kindly" "stopped" "for" "me"
# [6] "some" "additional" "text" "to" "act"
# [11] "as" "a" "filler"
#
# [[3]]
# [1] "The" "Carriage" "held" "but" "just"
# [6] "Ourselves" "more" "additional" "text" "to"
# [11] "add" "to" "the" "body" "of"
# [16] "the" "text"
#
# [[4]]
# [1] "and" "Immortality" "plus" "some"
# [5] "more" "words" "to" "fill"
# [9] "the" "text" "a" "little" 您还询问了如何通过管道将text2vec的输出传输到itoken,以及如何将其输出传输到create_vocabulary。同样,关键是要理解函数LHS返回什么,以及RHS上的函数期望什么。text2vec返回一个列表,而itoken需要一个可迭代的对象;列表是可迭代的,因此只需通过管道将text2vec的输出直接传递给itoken即可。在您的评论中,您试图再次使用with,就好像text2vec的输出是一个数据帧一样。我是通过查看您正在使用的函数的帮助页面来弄清楚这一点的;这向我显示了它们所期望的参数类型。如果您不知道函数返回的类型,可以参考帮助页面或将其输出通过管道传输到class。
library(text2vec)
df %>%
  with(word_tokenizer(text)) %>%
  itoken() %>%
  create_vocabulary()
# |===============================================================| 100%
# Number of docs: 4
# 0 stopwords: ...
# ngram_min = 1; ngram_max = 1
# Vocabulary:
# term term_count doc_count
# 1: Because 1 1
# 2: stop 1 1
# 3: just 1 1
# 4: not 1 1
# 5: Immortality 1 1
# 6: little 1 1
# 7: filler 1 1
# 8: kindly 1 1
# 9: of 1 1
# 10: and 1 1
# 11: plus 1 1
# 12: fill 1 1
# 13: could 1 1
# 14: me 1 1
# 15: Carriage 1 1
# 16: but 1 1
# 17: body 1 1
# 18: stopped 1 1
# 19: as 1 1
# 20: He 1 1
# 21: act 1 1
# 22: The 1 1
# 23: Death 1 1
# 24: words 1 1
# 25: held 1 1
# 26: Ourselves 1 1
# 27: some 2 2
# 28: more 2 2
# 29: I 2 1
# 30: a 2 2
# 31: add 2 2
# 32: for 2 2
# 33: the 3 2
# 34: additional 3 3
# 35: to 4 3
# 36: text 5 4
#          term term_count doc_count
https://stackoverflow.com/questions/57870395
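As the answer suggests, piping a step's output into class() is a quick way to check what it returns before deciding what can come next in the pipeline; a small sketch, assuming dplyr and text2vec are installed:

```r
library(dplyr)      # provides the %>% pipe
library(text2vec)

df <- data.frame(text = c("stop for Death", "kindly stopped"),
                 stringsAsFactors = FALSE)

# word_tokenizer returns a plain list, which is why it can be
# piped straight into itoken()
df %>%
  with(word_tokenizer(text)) %>%
  class()
# [1] "list"
```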