I have a dataset that looks something like this:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
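df <- data.frame(sentences, id)

I would like a count that shows how often certain bigrams occur. So let's say I have:

trigger_bg_1 <- "sample text"

I expect the output 2 (because "sample text" occurs twice across the two sentences). I know how to do a count for single words, like this: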
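trigger_word <- "sample"  # hypothetical single trigger word, for illustration only
trigger_word_sentence <- 0
for (i in 1:nrow(df)) {
  # split the sentence into words (as.character guards against factor columns)
  words <- strsplit(as.character(df$sentences[i]), " ")
  for (word in unlist(words)) {
    # compare each word with the trigger word, not with the counter
    if (word == trigger_word) {
      trigger_word_sentence <- trigger_word_sentence + 1
    }
  }
}

But I can't get anything working for a bigram. Any ideas on how to modify the code to make it work?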
Since I have a long list of trigger words to test, I need to compute this count programmatically.

Answered 2020-07-13 13:40:28
If you want to count the sentences that match, you can use grep:
length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2

If you want to count how many times trigger_bg_1 is found, you can use gregexpr:
sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE)
, function(x) sum(x>0))))
#[1] 2

Answered 2020-07-13 13:34:52
You can sum a grepl:
sum(grepl(trigger_bg_1, df$sentences))
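[1] 2

Answered 2020-07-13 15:03:16
If you are really interested in bigrams, rather than just a fixed set of word combinations, the quanteda package offers a more fully-fledged and systematic way forward:
Data: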
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)

Solution:
library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(tokens_ngrams(words, n = 2, concatenator = " ")))), decreasing = TRUE)

Result:
bigrams

in sentence sample text     text in  sentence 1  sentence 2 
          2           2           2           1           1 

If you want to check the frequency count of a specific bigram:
bigrams["in sentence"]
in sentence 
          2 

https://stackoverflow.com/questions/62876572
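
Since the question mentions a long list of triggers, the gregexpr approach from the first answer could be wrapped in a small helper that returns one count per trigger. This is only a minimal sketch: the function name count_triggers and the example trigger list are made up for illustration and do not come from any of the answers.

```r
# Hypothetical helper (not from the answers): for each trigger, count its
# fixed-string occurrences across all sentences.
count_triggers <- function(triggers, sentences) {
  vapply(triggers, function(trigger) {
    matches <- gregexpr(trigger, sentences, fixed = TRUE)
    # gregexpr returns -1 when there is no match, so count only positive positions
    sum(vapply(matches, function(m) sum(m > 0), numeric(1)))
  }, numeric(1))
}

sentences <- c("sample text in sentence 1", "sample text in sentence 2")
triggers <- c("sample text", "in sentence", "not present")  # illustrative list
counts <- count_triggers(triggers, sentences)
print(counts)
```

The result is a named numeric vector, so counts["sample text"] looks up a single trigger. Because the matching is on fixed substrings, a trigger such as "sample text" would also be counted inside "sample texts"; if whole-token bigrams matter, the quanteda approach above is probably the safer route.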