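I'm trying to write code that creates a new column labelling each compound with a number. Some compounds are repeated in my list, and I need those repeats to share the same number, with a letter to tell them apart. I don't know how to code this. Thanks. Here is an example: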
Fructose 1
Maltose 2
Sucrose 3
Sucrose 4

What I want:
Fructose 1
Maltose 2
Sucrose 3
Sucrose 3b

I can't label each compound by hand because my dataset is so large.
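Posted on 2015-02-26 23:12:02
Here's how I would do it using R and the data.table package.
First, we key (and thereby sort) the data by compound. Then we create our own index and append letters to its dupes (though I'm not sure how to handle groups larger than 26).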
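On the open point about groups larger than 26: one possible approach, sketched here in base R, is to generate spreadsheet-style suffixes (a ... z, aa, ab, ...). The helper name `to_letters` is hypothetical and not part of any package, and the "first occurrence gets no suffix, the second gets b" convention is taken from the question's desired output.

```r
# Hypothetical helper (not from any package): bijective base-26 labels,
# 1 -> "a", 26 -> "z", 27 -> "aa", 28 -> "ab", ...
to_letters <- function(k) {
  vapply(k, function(n) {
    out <- character(0)
    while (n > 0) {
      r <- (n - 1) %% 26            # 0-based position of the last letter
      out <- c(letters[r + 1], out) # prepend that letter
      n <- (n - 1) %/% 26           # move on to the remaining "digits"
    }
    paste(out, collapse = "")
  }, character(1))
}

compound <- c("Fructose", "Maltose", "Sucrose", "Sucrose")

# Group number, in order of first appearance
grp <- match(compound, unique(compound))
# Running position within each compound: 1, 2, 3, ...
pos <- ave(seq_along(compound), compound, FUN = seq_along)
# Assumed convention from the question: no suffix for the first
# occurrence, "b" for the second, "c" for the third, and so on
suffix <- ifelse(pos == 1, "", to_letters(pos))

labels <- paste0(grp, suffix)
labels
# "1"  "2"  "3"  "3b"
```

The answer's own code, which follows, indexes the built-in `letters` vector directly and so covers at most 26 suffixes.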
library(data.table)
setkey(setDT(df), compound)[, indx := as.character(.GRP), by = compound]
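# Compare on compound only: recent data.table versions compare all columns by default.
# Grouping by compound restarts the letter sequence for each compound.
df[duplicated(df, by = "compound"), indx := paste0(indx, letters[seq_len(.N)]), by = compound]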
df
# compound number indx
# 1: Fructose 1 1
# 2: Maltose 2 2
# 3: Sucrose 3 3
# 4: Sucrose 4 3a

Posted on 2015-02-26 22:49:21
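Here is a solution to your question:
You can read about BY-group processing in the DATA step to improve your SAS skills.
Here is a complete working example: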
data have;
length Carb $10;
input Carb;
datalines;
Fructose
Maltose
Sucrose
Sucrose
Sucrose
Pasta
Pasta
Rice
Rice
Rice
Quinoa
Bread
;
proc format;
value dupFormat
1 = 'b'
2 = 'c'
3 = 'd'
;
run;
proc sort data=have;
by Carb;
run;
data want(keep=Carb Number);
length Carb $10;
length Number $3;
set have;
by Carb;
/* nCarbs is the number of distinct carbs written so far */
if _n_=1 then nCarbs = 0;
if first.Carb then do;
nCarbs+1;
count_dup = 0; /* the number of duplicate records for the current carb */
Number = left(put(nCarbs,3.));
end;
else do;
count_dup+1;
Number = cats(put(nCarbs,3.), put(count_dup, dupFormat.));
end;
run;
proc print data=want;
run;

Posted on 2015-02-27 07:49:20
Using @jaamor's data, you can do this in base R.
# The question's data (overwritten by the larger example on the next lines)
x <- c('Fructose','Maltose','Sucrose','Sucrose')
x <- c('Fructose','Maltose','Sucrose','Sucrose','Sucrose','Pasta',
'Pasta','Rice','Rice','Rice','Quinoa','Bread')
y <- gsub('a', '', letters[ave(seq_along(x), x, FUN = seq_along)])
data.frame(x = x, y = paste0(cumsum(!duplicated(x)), y))
# x y
# 1 Fructose 1
# 2 Maltose 2
# 3 Sucrose 3
# 4 Sucrose 3b
# 5 Sucrose 3c
# 6 Pasta 4
# 7 Pasta 4b
# 8 Rice 5
# 9 Rice 5b
# 10 Rice 5c
# 11 Quinoa 6
# 12 Bread 7

https://stackoverflow.com/questions/28753120