
R: split text with multiple regex patterns and exceptions

Stack Overflow user
Asked 2013-09-09 11:16:27
1 answer, 3.1K views, 0 followers, 9 votes

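I want to split the elements of a character vector, text, into sentences. There is more than one splitting pattern ("and/ERT", "/$"), and the patterns also have exceptions (":/$.", "and/ERT then", "./$. Smiley").

The attempt: match the places where a split should happen, insert an unusual marker ("^&*") at each of them, then strsplit on that marker.

The problem: I don't know how to handle the exceptions properly. There are explicit cases where the marker ("^&*") should be removed and the original text restored before running strsplit.

Code: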

text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")

patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address each of them explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")

exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")

# Without exceptions this works. Unfortunately it splits "*/$*" into "*" and "/$*", which would be convenient to avoid. See the "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)

# Ideal split:
> textsplitted
[[1]]
 [1] "This are faulty propositions one and/ERT" 
 [2] "two ,/$," 
 [3] "which I want to split ./$."
 [4] "There are cases where I explicitly want and/ERT" 
 [5] "some where I don't want to split ./$." 
 [6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
 [7] "This is also one case where I dont't want to split ./$. Smiley !/$." 
 [8] "Thank you ./$!"

[[2]]
 [1] "This are the same faulty propositions one and/ERT"
 [2] "two ,/$,"
#...      

# This attempt doesn't work!
text <- gsub(patternSplit, "^&*\\1", text)
# Placeholder only: the replacement should be the matched exception without "^&*"
text <- gsub(exceptionsSplit, '[original text without "^&*"]', text)
textsplitted <- strsplit(text, "^&*", fixed = TRUE)
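
For reference, the marker idea described above can be made to work if the marker replaces the whitespace after each candidate split point and the exceptions are restored as literal strings afterwards. The following is only a minimal sketch under stated assumptions: the input `x` is a hypothetical shortened string, the names `marked`, `exceptions` and `with_marker` are illustrative, and each exception is assumed to contain exactly one space, located where the split would otherwise fall.

```r
# Hypothetical shortened input, modelled on the data above
x <- "one and/ERT two ./$. For example :/$. when and/ERT then stop ./$. Smiley !/$. bye ./$!"
marker <- "^&*"

# Step 1: replace the whitespace after every candidate split point with the marker
marked <- gsub("(and/ERT|/\\$[[:punct:]])\\s", paste0("\\1", marker), x, perl = TRUE)

# Step 2: restore the exceptions.
# Assumption: each exception is literal text with exactly one space,
# sitting at the position where the marker was inserted.
exceptions <- c(":/$. ", "and/ERT then", "./$. Smiley")
for (ex in exceptions) {
  with_marker <- sub(" ", marker, ex, fixed = TRUE)   # the exception as it looks after step 1
  marked <- gsub(with_marker, ex, marked, fixed = TRUE)
}

# Step 3: split on the markers that are left
pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]
print(pieces)
```

Because both the restore step and the final split use `fixed = TRUE`, the marker characters never need regex escaping; only the first `gsub` interprets a pattern.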

1 Answer

Stack Overflow user

Accepted answer

Answered 2013-09-09 12:55:06

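I think you can use this expression to get the splits you want. Because strsplit consumes the characters it splits on, you have to split on the whitespace that follows the things to match or not match (which is what the desired output in the question shows):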

strsplit( text[[1]] , "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"  , perl = TRUE )
#[[1]]
#[1] "This are faulty propositions one and/ERT"                                 
#[2] "two ,/$,"                                                                 
#[3] "which I want to split ./$."                                               
#[4] "There are cases where I explicitly want and/ERT"                          
#[5] "some where I don't want to split ./$."                                    
#[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
#[7] "This is also one case where I dont't want to split ./$. Smiley !/$."      
#[8] "Thank you ./$!" 
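
Since strsplit is vectorized over its first argument, the same pattern can presumably be applied to the whole vector rather than to `text[[1]]` alone. A small sketch, using a hypothetical two-element vector shortened from the question's data (the names `pattern` and `pieces` are illustrative):

```r
# Hypothetical shortened inputs, not the original data
text <- c("one and/ERT two ,/$, three ./$. Smiley !/$. done ./$!",
          "For example :/$. an and/ERT then no split ./$. end ./$!")

# Same expression as above, stored in a variable for reuse
pattern <- "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# One list element of pieces per input string
pieces <- strsplit(text, pattern, perl = TRUE)
print(pieces)
```

The result is a list with one character vector per element of `text`, so it can be passed on to `lapply` or flattened with `unlist` as needed.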

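Explanation

  • (?<=and/ERT)\\s - split on a whitespace character, \\s, that IS preceded, (?<=...), by "and/ERT"
  • (?!then) - but only if that whitespace IS NOT followed, (?!...), by "then"
  • | - OR operator to chain the next expression
  • (?<=/\\$[[:punct:]]) - positive lookbehind assertion for "/$" followed by any punctuation character
  • (?<!:/\\$[[:punct:]])\\s(?!Smiley) - match a whitespace character that IS NOT preceded by ":/$" plus a punctuation character (but, per the previous point, IS preceded by "/$" plus a punctuation character) and IS NOT followed, (?!...), by "Smiley"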
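
The two branches of the expression can be checked in isolation on tiny inputs. This is only an illustrative sketch; the test strings and the variable names `first`, `second` and `r1` to `r5` are made up for the demonstration.

```r
# The two alternatives of the expression, kept separate
first  <- "(?<=and/ERT)\\s(?!then)"
second <- "(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# First branch: split after "and/ERT" unless "then" follows
r1 <- strsplit("a and/ERT b", first, perl = TRUE)[[1]]
r2 <- strsplit("a and/ERT then b", first, perl = TRUE)[[1]]

# Second branch: split after "/$" plus punctuation,
# unless ":" precedes it or "Smiley" follows
r3 <- strsplit("x ./$. y", second, perl = TRUE)[[1]]
r4 <- strsplit("x :/$. y", second, perl = TRUE)[[1]]
r5 <- strsplit("x ./$. Smiley", second, perl = TRUE)[[1]]

print(list(r1, r2, r3, r4, r5))

# Joining both branches with "|" reproduces the full expression
both <- paste(first, second, sep = "|")
```

Only `r1` and `r3` come back as two pieces; the other three inputs hit an exception and are returned unsplit.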
10 votes

Page content originally provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/18697005
