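I am having trouble removing profanity from my n-grams. The getProfanityWords function below correctly creates a character vector. The rest of the script works, but the profanity remains.
I did wonder whether this had something to do with the hyphens in the 2- and 3-grams, but the same thing happens with 1-grams.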
getProfanityWords <- function() {
  # Download profanity file to disk if not done so already
  profanityFileName <- "profanity.txt"
  if (!file.exists(profanityFileName)) {
    profanity.url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
    download.file(profanity.url, destfile = profanityFileName, method = "curl")
  }
  # if profanity file not in memory, then load it
  if (sum(ls() == "profanity") < 1) {
    profanity <- read.csv(profanityFileName, header = FALSE, stringsAsFactors = FALSE)
    profanity <- profanity$V1
    profanity <- profanity[1:length(profanity)-1]
  }
  return(profanity)
}
makeSentences <- function(input) {
  output <- tokens(input, what = "sentence", remove_numbers = TRUE,
                   remove_punct = TRUE, remove_separators = TRUE,
                   remove_hyphens = TRUE,
                   remove_twitter = TRUE,
                   remove_symbols = TRUE,
                   include_docvars = FALSE)
  output <- tokens_remove(output, getProfanityWords())
  unlist(output)
}
makeNGrams <- function(text, n = 1L) {
  tokens(
    text,
    what = "word",
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_separators = TRUE,
    remove_twitter = TRUE,
    remove_symbols = TRUE,
    ngrams = n
  )
}
corpora <- corpus(textData)
sentences <- makeSentences(corpora)
ngram1 <- makeNGrams(sentences, 1)
dfm1 <- dfm(ngram1)
ngram2 <- makeNGrams(sentences, 2)
dfm2 <- dfm(ngram2)
ngram3 <- makeNGrams(sentences, 3)
dfm3 <- dfm(ngram3)
I tried adding
dfm3 <- dfm(ngram3, remove=getProfanityWords())
and something similar inside the makeNGrams function, but it made no difference.
What am I doing wrong?
Thanks,
Chris.
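Answered 2019-08-30 22:10:37
I think I have a solution for you.
tokens_remove is meant for removing words (whole tokens), not parts of a sentence. With what = "sentence", each token is an entire sentence, so it never matches a single profane word.
However, tokens_remove works well with dictionary objects, so the first step is to put the profane words in a dictionary.
dict <- dictionary(list(bad_words = getProfanityWords()))
Next, you can wrap tokens_remove inside the makeNGrams function.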
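makeNGrams <- function(text, n = 1L) {
  # Tokenize into words first, then drop every token that matches the dictionary
  out <- tokens_remove(tokens(text), dict)
  # Clean the remaining tokens and build the n-grams
  tokens(
    out,
    what = "word",
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_separators = TRUE,
    remove_twitter = TRUE,
    remove_symbols = TRUE,
    ngrams = n
  )
}
This should remove the profane words from your text. It does so in the simple example I created for myself, which I am not posting here because of the profanity rules :-)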
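To illustrate why the removal has to happen on word tokens, here is a minimal sketch. It assumes a recent quanteda release in which n-grams are built with tokens_ngrams(); the word list and the sample text are hypothetical stand-ins ("darn" plays the role of a profane word) rather than the real profanity file.

```r
library(quanteda)

# Hypothetical stand-in for the profanity list (assumption, not the real file)
dict <- dictionary(list(bad_words = c("darn")))
txt <- "This darn thing is broken. It works now."

# Sentence tokens: every token is a whole sentence, so nothing equals "darn"
sent_toks <- tokens(txt, what = "sentence")
sent_toks_removed <- tokens_remove(sent_toks, dict)

# Word tokens: "darn" is a token of its own, so it is dropped
word_toks <- tokens(txt, what = "word", remove_punct = TRUE)
clean_toks <- tokens_remove(word_toks, dict)

# Build the n-grams only after the unwanted words are gone
bigrams <- tokens_ngrams(clean_toks, n = 2)

print(sent_toks_removed)
print(clean_toks)
print(bigrams)
```

Printing the three objects should show the word surviving inside the sentence token and missing from both the word tokens and the bigrams. Note that tokens_ngrams() joins words with an underscore by default, and that removing a word makes its former neighbours adjacent when the n-grams are formed.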
Addition
Below is the makeSentences function I used. Combined with the code above, it works as expected. I cannot seem to reproduce your error.
makeSentences <- function(input) {
  output <- tokens(input, what = "sentence", remove_numbers = TRUE,
                   remove_punct = TRUE, remove_separators = TRUE,
                   remove_hyphens = TRUE,
                   remove_twitter = TRUE,
                   remove_symbols = TRUE,
                   include_docvars = FALSE)
  unlist(output)
}
# txt <- "add profane text example here"
corpora <- corpus(txt)
sentences <- makeSentences(corpora)
ngram1 <- makeNGrams(sentences, 1)
ngram2 <- makeNGrams(sentences, 2)
ngram3 <- makeNGrams(sentences, 3)
https://stackoverflow.com/questions/57727865