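I have an ingredients dataset in which each row is a comma-separated list of ingredients, for example:
Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour
I would like to parse the file and replace the commas inside parentheses with semicolons. The parentheses can be nested to any depth and can contain any number of commas. The result should look like this:
Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour
Could I get some help with a regular expression to solve this? Thanks in advance.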
Posted on 2021-09-15 07:00:47
You can use the recursive construct (?R), for example:
i <- gregexpr("\\(([^()]|(?R))*\\)", s, perl=TRUE)
regmatches(s, i)[[1]] <- gsub(",", ";", regmatches(s, i)[[1]])
s
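#[1] "Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour"
Here (?R) recurses into the whole pattern. As a simpler example, a(?R)?z is a recursion that matches one or more letters a followed by exactly the same number of letters z. In the pattern above, the same idea pairs each ( with its matching ), so every match is one complete top-level parenthesized group, nested groups included.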
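To make the recursion concrete, here is a minimal sketch in base R. The test strings are made up for illustration and are not part of the original data; only gregexpr() and regmatches() with perl = TRUE are used.

```r
# Hypothetical inputs, chosen only to illustrate how (?R) behaves.

# 1) a(?R)?z: n letters "a" followed by the same number of letters "z".
x <- "aaazzz aazzz az"
m1 <- regmatches(x, gregexpr("a(?R)?z", x, perl = TRUE))[[1]]
m1
# "aaazzz" "aazz" "az"   (the surplus z in "aazzz" is left unmatched)

# 2) The balanced-parentheses pattern from the answer: each match is one
#    complete top-level group; an unclosed "(" produces no match.
y <- "f(a, (b, c)), d (e"
m2 <- regmatches(y, gregexpr("\\(([^()]|(?R))*\\)", y, perl = TRUE))[[1]]
m2
# "(a, (b, c))"
```

Because each match already contains its nested groups, a plain gsub(",", ";", ...) on the matches is enough to change every comma at depth one or deeper.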
Data
s <- "Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour"
Posted on 2021-09-15 03:51:26
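1) gsubfn. This can be done with gsubfn without a complicated regular expression. The regular expression consisting of a single dot matches each character in turn. For each string in the input character vector, the pre function first initializes the counter k to 0; then, for each match, fun is run with the matched character passed to it through the x argument. In fun, the counter k is incremented by 1 each time a ( is encountered and decremented by 1 each time a ) is encountered. If the counter is nonzero and the character is a comma, a semicolon is returned in its place; otherwise the input character is returned unchanged. This is vectorized: it also works if the input s is a character vector, in which case each component is processed separately.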
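library(gsubfn)
p <- proto(k = 0,                       # k: current parenthesis depth
  pre = function(this) this$k <- 0,     # reset the depth for each input string
  fun = function(this, x) {             # called once per matched character x
    if (x == "(") this$k <- k + 1
    if (x == ")") this$k <- k - 1
    if (k && x == ",") ";" else x       # comma inside parentheses becomes ;
  })
gsubfn(".", p, s)
giving: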
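[1] "Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour"
2) Base R. A base R solution is to split the input into single characters, giving a list L of character vectors. Then, for each component chars of L, create a counter vector k of the same length as chars, where each element is the number of ( up to that point minus the number of ) up to that point. Then replace the commas that correspond to a nonzero k with semicolons, and collapse chars back into a single string. Like (1), this works on character vectors.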
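Before the full code, a small trace may make the counter concrete. This is only a sketch on a short made-up string (not the original data); the names chars, depth and out are illustrative.

```r
# Hypothetical short input, used only to show the running depth counter.
chars <- strsplit("a,(b,(c,d)),e", "")[[1]]

# +1 at every "(", -1 at every ")"; the cumulative sum is the nesting depth.
depth <- cumsum((chars == "(") - (chars == ")"))

# Depth at each comma: the first and last commas are at top level (depth 0).
depth[chars == ","]
# 0 1 2 0

# Only commas at nonzero depth are replaced.
chars[depth > 0 & chars == ","] <- ";"
out <- paste(chars, collapse = "")
out
# "a,(b;(c;d)),e"
```

Note that this counting assumes the parentheses are balanced; an unclosed ( would leave the depth nonzero for the rest of the string.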
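L <- strsplit(s, "")                             # one character vector per input string
sapply(L, function(chars) {
  k <- cumsum((chars == "(") - (chars == ")"))   # nesting depth at each character
  chars[k & chars == ","] <- ";"                 # commas at nonzero depth
  paste(chars, collapse = "")
})
Note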
The input string s is shown below.
s <- "Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour"
https://stackoverflow.com/questions/69186679