In the source code of the tm text-mining R package, the file transform.R contains the removePunctuation() function, currently defined as:
function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
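}
I need to parse and mine some abstracts from a scientific conference (scraped from their website as UTF-8). The abstracts contain Unicode characters that need to be removed, especially at word boundaries. There is the usual ASCII punctuation, but also Unicode dashes, Unicode quotation marks, and mathematical symbols.
The text also contains URLs, where the intra-word punctuation has to be preserved. tm's built-in removePunctuation() function is too aggressive.
So I need a custom removePunctuation() function that removes punctuation according to my requirements.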
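To make "too aggressive" concrete, here is a minimal base-R sketch that copies the logic of the function quoted above into a stand-alone helper. The name `remove_punct_like_tm` and the example URL are assumptions made for illustration; they are not part of tm.

```r
# Hypothetical stand-alone copy of the logic quoted above (not tm itself).
remove_punct_like_tm <- function(x, preserve_intra_word_dashes = FALSE) {
  if (!preserve_intra_word_dashes) {
    gsub("[[:punct:]]+", "", x)
  } else {
    # Protect intra-word dashes with the ASCII 1 sentinel, strip, then restore.
    x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
    x <- gsub("[[:punct:]]+", "", x)
    gsub("\1", "-", x, fixed = TRUE)
  }
}

s <- "see http://example.org/a-b.html"

remove_punct_like_tm(s)
# [1] "see httpexampleorgabhtml"

remove_punct_like_tm(s, preserve_intra_word_dashes = TRUE)
# [1] "see httpexampleorga-bhtml"
```

In both calls the colon, slashes and dots inside the URL are gone; only the dash can be kept, which is why a different rule is needed for intra-word punctuation.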
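My custom Unicode function currently looks like this, but it does not work as expected. I rarely use R, so getting things done in R takes me some time, even for the simplest tasks.
My function: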
corpus <- tm_map(corpus, rmPunc = function(x){
    # lookbehinds
    # need to be careful to specify fixed-width conditions
    # so that it can be used in lookbehind
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ;
    # lookaheads (can use variable-width conditions)
    x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;
    # remove all strings that consist *only* of punct chars
    gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;
}
It does not work as expected. I think it has no effect at all. The punctuation is still in the term-document matrix, see:
head(Terms(tdm), n=30)
[1] "<></>" "---"
[3] "--," ":</>"
[5] ":()" "/)."
[7] "/++" "/++,"
[9] "..," "..."
[11] "...," "..)"
[13] "“”," "(|)"
[15] "(/)" "(.."
[17] "(..," "()=(|=)."
[19] "()," "()."
[21] "(&)" "++,"
[23] "(0°" "0.001),"
[25] "0.003" "=0.005)"
[27] "0.006" "=0.007)"
[29] "000km" "0.01)"
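...
So my questions are:
\P{ASCII} or \P{PUNCT}? I think this is not supported (by default). PCRE: "Only support for various Unicode properties with \p is incomplete, though the most important ones are supported."
Posted on 2015-07-13 14:04:07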
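Much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocument objects, and the metadata is destroyed).
You will get a list and the following error: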
Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
Using
tm_map(your_corpus, PlainTextDocument)
gives you back your corpus, but with a broken $meta (in particular, the document IDs will be missing).
Solution
Use content_transformer
toSpace <- content_transformer(function(x, pattern)
    gsub(pattern, " ", x))
your_corpus <- tm_map(your_corpus, toSpace, "„")
Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/
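As a rough sketch of how the same content_transformer idea could cover the requirements in the question (Unicode punctuation removed at word edges, intra-word punctuation in URLs kept), the helper below strips punctuation and symbols only at the start and end of whitespace-separated tokens. The name `strip_edge_punct`, the example sentence, and the choice of the `\p{P}` and `\p{S}` property classes are assumptions made here, not something from the thread; they rely on gsub() being called with perl = TRUE.

```r
# Hypothetical helper (not part of tm): remove punctuation and symbols from
# the edges of each whitespace-separated token, leaving characters inside a
# token (for example the dots and slashes of a URL) untouched.
strip_edge_punct <- function(x) {
  x <- enc2utf8(x)
  # \p{P} = Unicode punctuation, \p{S} = Unicode symbols (PCRE, perl = TRUE)
  edge <- "[\\p{P}\\p{S}]+"
  # leading run: at the start of the string or right after whitespace
  x <- gsub(paste0("(^|\\s)", edge), "\\1", x, perl = TRUE)
  # trailing run: right before whitespace or the end of the string
  gsub(paste0(edge, "(?=\\s|$)"), "", x, perl = TRUE)
}

txt <- "\u201cquoted\u201d text, see http://example.org/a-b.html."
strip_edge_punct(txt)
# [1] "quoted text see http://example.org/a-b.html"

# Wrapping it for a corpus, only if tm is installed:
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(txt))
  corpus <- tm::tm_map(corpus, tm::content_transformer(strip_edge_punct))
}
```

Tokens that consist only of punctuation end up as extra whitespace, so a whitespace-collapsing step afterwards may still be wanted.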
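Update
This function removes everything that is not alphanumeric or whitespace (e.g. UTF-8 emoticons etc.).
removeNonAlnum <- function(x){
    # negated class: drop every character that is neither alphanumeric nor whitespace
    gsub("[^[:alnum:][:space:]]", "", x)
}
Posted on 2013-04-05 09:08:28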
I had the same problem: my custom function did not work. It turned out that the first line below has to be added.
Regards
Susana
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)
replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
    x <- gsub(".", " ", x, ignore.case = FALSE, fixed = TRUE)
    x <- gsub(",", " ", x, ignore.case = FALSE, fixed = TRUE)
    x <- gsub(":", " ", x, ignore.case = FALSE, fixed = TRUE)
    return(x)
}
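To illustrate what that first line does, here is a small base-R sketch of S3 dispatch using the same generic name. The method body is simplified to a single character class, which is an assumption made for brevity rather than the code from this answer.

```r
# The generic: this is the line that lets R pick a method by the class of x.
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)

# Method for plain character vectors (simplified to one character class).
replaceExpressions.character <- function(x) {
  gsub("[.,:]", " ", x)
}

replaceExpressions("alpha.beta,gamma:delta")
# [1] "alpha beta gamma delta"

# An object whose class has no method fails with "no applicable method".
res <- try(replaceExpressions(42), silent = TRUE)
inherits(res, "try-error")
# [1] TRUE
```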
notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)
https://stackoverflow.com/questions/14281282