Q: How do I write a custom removePunctuation() function that handles Unicode characters better?

Stack Overflow user

Asked 2013-01-11 15:26:34

2 answers, 5.3K views, 0 followers, 8 votes

In the source code of the tm text-mining R package, the file transform.R contains the removePunctuation() function, currently defined as:

function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
}

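I need to parse and mine abstracts from a scientific conference (fetched from its website as UTF-8). The abstracts contain Unicode characters that need to be removed, especially at word boundaries. There is the usual ASCII punctuation, but also Unicode dashes, Unicode quotation marks, and math symbols.

The text also contains URLs, and punctuation inside words needs to be preserved. tm's built-in removePunctuation() function is too aggressive for this.

So I need a custom removePunctuation() function that removes characters according to my requirements.

My custom Unicode function currently looks like this, but it does not work as expected. I rarely use R, so getting things done in R takes me a while, even for the simplest tasks.

My function: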

corpus <- tm_map(corpus, rmPunc =  function(x){ 
# lookbehinds 
# need to be careful to specify fixed-width conditions 
# so that it can be used in lookbehind

x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ; 
# lookaheads (can use variable-width conditions) 
x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;

# remove all strings that consist *only* of punct chars 
gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;

}

It does not work as expected. In fact, I think it has no effect at all. The punctuation is still in the term-document matrix, see:

 head(Terms(tdm), n=30)

  [1] "<></>"                      "---"                       
  [3] "--,"                        ":</>"                      
  [5] ":()"                        "/)."                       
  [7] "/++"                        "/++,"                      
  [9] "..,"                        "..."                       
 [11] "...,"                       "..)"                       
 [13] "“”,"                        "(|)"                       
 [15] "(/)"                        "(.."                       
 [17] "(..,"                       "()=(|=)."                  
 [19] "(),"                        "()."                       
 [21] "(&)"                        "++,"                       
 [23] "(0°"                        "0.001),"                   
 [25] "0.003"                      "=0.005)"                   
 [27] "0.006"                      "=0.007)"                   
 [29] "000km"                      "0.01)" 
...

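So my questions are:

Why does the call to my function(){} not have the intended effect? How can my function be improved?
Are Unicode pattern classes such as \P{ASCII} or \P{PUNCT} supported in R's Perl-compatible regular expressions? I think they are not (by default). The PCRE documentation says: "Only the support for various Unicode properties with \p is incomplete, though the most important ones are supported."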

unicode

text-mining

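2 Answers

Stack Overflow user

Accepted answer

Answered 2015-07-13 14:04:07

As much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocument objects, and the metadata is destroyed).

You will get a list and the following error: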

Error in UseMethod("meta", x) : 
no applicable method for 'meta' applied to an object of class "character"

Using

tm_map(your_corpus, PlainTextDocument)

will give you your corpus back, but with a broken $meta (in particular, the document IDs will be missing).

Solution

Use content_transformer:

toSpace <- content_transformer(function(x,pattern)
    gsub(pattern," ", x))
your_corpus <- tm_map(your_corpus,toSpace,"„")

Source: Hands-On Data Science with R, Text Mining, Graham.Williams@togaware.com http://onepager.togaware.com/

Update

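This function removes every character that is neither alphanumeric nor whitespace (e.g. UTF-8 emoticons etc.).

removeNonAlnum <- function(x){
  # Inside a bracket expression only the first ^ negates; a second ^ would be
  # treated as a literal caret and kept, so it is left out here.
  gsub("[^[:alnum:][:space:]]","",x)
}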
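
For the case in the question (punctuation at the edges of words should go, punctuation inside URLs and hyphenated words should stay), a sketch along the following lines may help. It relies on the Unicode property classes \p{P} (punctuation) and \p{S} (symbols) with perl = TRUE, which assumes that R's PCRE was built with Unicode property support. The helper name strip_edge_punct and the example.org URL are made up for illustration and are not part of tm. Two details in the question's code are also worth noting: in the lookahead line the replacement "\1 " is the control character U+0001 rather than the backreference "\\1", and that pattern asserts an alphanumeric character at the same position where it then requires punctuation, so it cannot match.

```r
# Hypothetical helper, not part of tm: strip punctuation and symbols from the
# edges of whitespace-separated tokens while keeping punctuation inside them.
strip_edge_punct <- function(x) {
  x <- enc2utf8(x)                      # make sure PCRE sees UTF-8 input
  edge <- "[\\p{P}\\p{S}]+"             # Unicode punctuation and symbols
  # Leading run: directly after the string start or after whitespace.
  x <- gsub(paste0("(^|\\s)", edge), "\\1", x, perl = TRUE)
  # Trailing run: directly before whitespace or the string end.
  x <- gsub(paste0(edge, "(?=\\s|$)"), "", x, perl = TRUE)
  # Tokens made only of punctuation are now empty; tidy the whitespace.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  gsub("^ | $", "", x)
}

# \u escapes keep the example independent of the source file's encoding.
abstract <- paste0("\u201cSea-level\u201d rise \u2013 see ",
                   "http://example.org/a-b, (p = 0.001), \u00b1 --- <></>")
print(strip_edge_punct(abstract))
```

To apply it to a corpus without losing the metadata, it can presumably be wrapped the same way as toSpace above, i.e. tm_map(your_corpus, content_transformer(strip_edge_punct)). Note that a URL ending in "/" would lose that trailing slash with this approach.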

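Votes: 2

Stack Overflow user

Answered 2013-04-05 09:08:28

I had the same problem: my custom function did not work either. It turned out that the first line below (the UseMethod generic) has to be added.

Regards

Susana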

replaceExpressions <- function(x) UseMethod("replaceExpressions", x)

replaceExpressions.PlainTextDocument <- replaceExpressions.character  <- function(x) {
    x <- gsub(".", " ", x, ignore.case =FALSE, fixed = TRUE)
    x <- gsub(",", " ", x, ignore.case =FALSE, fixed = TRUE)
    x <- gsub(":", " ", x, ignore.case =FALSE, fixed = TRUE)
    return(x)
}

notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)
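
As a side note, the three fixed-string replacements in this answer can be collapsed into one call with a character class. This is only a base-R sketch of the replacement step itself; the function name replace_dots_commas_colons is hypothetical, and the S3 dispatch and tm_map call from the answer are left out.

```r
# Hypothetical base-R equivalent of the three gsub(..., fixed = TRUE) calls:
# inside [...] the characters '.', ',' and ':' are matched literally.
replace_dots_commas_colons <- function(x) {
  gsub("[.,:]", " ", x)
}

print(replace_dots_commas_colons("a.b,c:d"))
```

The ignore.case = FALSE argument in the original calls is the default and can be omitted.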

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/14281282
