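I'm doing text analysis in R. Is there a way, using tm or stringi, to remove all words that do not start with a capital letter?
If I have something like this
Albert Einstein went to the store and saw his friend Nikola Tesla ... + 200 pages
I would like to convert it to
Albert Einstein Nikola Tesla
Kind regards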
Posted on 2016-05-03 19:56:19
Just use grep and a regular expression:
words <- 'Albert Einstein went to the store and saw his friend Nikola Tesla'
# split to vector of individual words
vec <- unlist(strsplit(words, ' '))
# just the capitalized ones
caps <- grep('^[A-Z]', vec, value = TRUE)
# assemble back to a single string, if you want
paste(caps, collapse=' ')
# [1] "Albert Einstein Nikola Tesla"
Posted on 2016-05-03 19:58:59
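You can remove these words with a simple regular expression.
x <- "Albert Einstein went to the store and saw his friend Nikola Tesla"
gsub("\\b[a-z]+\\s+", "", x)
# [1] "Albert Einstein Nikola Tesla"
This simply looks for a word boundary, then one or more lowercase letters, then the whitespace that follows them, and removes the whole match.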
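To see exactly what that pattern deletes, it can help to list the matches before removing them. The following is only a sketch in base R: x is the sentence from the question, while the variable names and the second input y are assumptions added for illustration and are not part of the original answer.

```r
# The sentence from the question
x <- "Albert Einstein went to the store and saw his friend Nikola Tesla"
pattern <- "\\b[a-z]+\\s+"

# List the substrings the pattern matches, i.e. what gsub() will delete
removed <- regmatches(x, gregexpr(pattern, x))[[1]]
print(removed)

# Deleting those matches leaves only the capitalized words
kept <- gsub(pattern, "", x)
print(kept)

# Assumed extra input (not from the original post): a lowercase word at the
# very end has no whitespace after it, so the pattern does not match it
y <- "Albert Einstein went home"
print(gsub(pattern, "", y))
```

Because the pattern requires at least one whitespace character after the word, a lowercase word at the very end of the string is left in place, as the last call shows.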
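However, for words such as don't you will need a more complex regular expression, because [a-z]+ stops at the apostrophe. Something like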
x <- "if Albert Einstein didn't see his friend Nikola Tesla leavin'"
gsub("\\b[a-z][^ ]*(\\s+)?", "", x)
# [1] "Albert Einstein Nikola Tesla "
https://stackoverflow.com/questions/37013143