I have the following string:
example_string <- "In this document, Defined Terms are quotation marks, followed by definition. \"Third Party Software\" is software owned by third parties. \"USA\" the United States of America. \"Breach of Contract\" is in accordance with the Services Description."
I want to extract every substring that is enclosed in quotation marks and is at least partly capitalized. The output should therefore be:
"Third Party Software" "USA" "Breach of Contract"
This is as far as I have gotten with stringr:
str_extract_all(example_string, "(?:\")\\w(\\s*\\w+)*")
[[1]]
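[1] "\"Third Party Software" "\"USA" "\"Breach of Contract"
I cannot work out how to avoid matching the leading escaped quote \". I know I could add a gsub line after extracting the defined terms to strip it, but I think there must be a way to do all of this in a single regex call.
Any suggestions are much appreciated!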
Posted on 2020-09-10 12:32:51
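In your expression (?:")\w(\s*\w+)*, the non-capturing group (?:") still matches and consumes the " char, so the quote ends up in the match value.
You may want to use
"(?<=\")\\w(\\s*\\w+)*"
where (?<=") is a positive lookbehind that matches a location immediately preceded by a " char, without consuming it.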
However, when the left and right delimiters are the same single character, I would rather use a capturing approach.
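A small sketch of why identical delimiters make the lookbehind fragile. The input x is made up for illustration and is not from the question, and the capturing pattern here is a simplified stand-in without the uppercase requirement. A lookbehind only checks that a quote precedes the match, so it cannot tell an opening quote from a closing one:

```r
library(stringr)

# Hypothetical input: a word directly follows the closing quote.
x <- 'say "hello"world'

# The lookbehind is satisfied after ANY quote, including the closing one,
# so the text right after the closing quote is matched as well.
lookbehind_hits <- str_extract_all(x, '(?<=")\\w(\\s*\\w+)*')[[1]]

# Consuming both quotes and capturing what lies between them avoids this:
# column 2 of the match matrix holds capture group 1.
captured_hits <- str_match_all(x, '"(\\w[^"]*)"')[[1]][, 2]

print(lookbehind_hits)
print(captured_hits)
```

On this input the lookbehind version should return both hello and world, while the capturing version returns only hello, because the closing quote has already been consumed by the first match.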
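You can use stringr::str_match_all with
"(\p{Lu}[^"]*)"
Alternatively, you can keep your own pattern, slightly modified:
"(\p{Lu}\w*(?:\s+\w+)*)"
Details:
" - a " char
(\p{Lu}[^"]*) - capturing group 1:
    \p{Lu} - any Unicode uppercase letter
    [^"]* - zero or more chars other than "
\w*(?:\s+\w+)* - zero or more letters, digits or underscores, then zero or more occurrences of one or more whitespace chars followed by one or more letters, digits or underscores
" - a " char.
library(stringr)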
example_string <- "In this document, Defined Terms are quotation marks, followed by definition. \"Third Party Software\" is software owned by third parties. \"USA\" the United States of America. \"Breach of Contract\" is in accordance with the Services Description."
res <- str_match_all(example_string, '"(\\p{Lu}[^"]*)"')
# Column 1 holds the whole match (quotes included); dropping it leaves capture group 1.
unlist(lapply(res, function(x) x[,-1]))
## => [1] "Third Party Software" "USA" "Breach of Contract"
https://stackoverflow.com/questions/63829665
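The two patterns from the answer are not interchangeable on every input, which the following sketch tries to illustrate. The helper extract_terms and the sample text y are hypothetical and not part of the original post. The [^"]* variant accepts anything up to the closing quote, while the \w-based variant only accepts word characters separated by whitespace:

```r
library(stringr)

# Hypothetical helper: returns capture group 1 of every match in `text`.
extract_terms <- function(text, pattern) str_match_all(text, pattern)[[1]][, 2]

loose  <- '"(\\p{Lu}[^"]*)"'              # any chars except a quote after the first capital
strict <- '"(\\p{Lu}\\w*(?:\\s+\\w+)*)"'  # word chars separated by whitespace only

# Made-up input: the first defined term contains a hyphen.
y <- 'The "Third-Party Software" is stored in the "Data Room".'

loose_terms  <- extract_terms(y, loose)
strict_terms <- extract_terms(y, strict)

print(loose_terms)
print(strict_terms)
```

Here the loose pattern should return both terms, whereas the strict one skips the hyphenated term and returns only Data Room. Which behaviour is preferable depends on what the defined terms in the real documents look like.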