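I have a CSV file in the following format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group the rows by the unique id in the first column and concatenate the types onto a single line, like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I have found that awk handles this kind of task well, but the best I could manage is:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also fix the formatting of the second column?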
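Posted on 2017-10-12 21:55:26
Quick solution:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ is true only the first time a given line is seen, so repeated lines are skipped.
If the whole second column should be inside a single pair of double quotes: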
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
Posted on 2017-10-12 23:06:37
Using GNU awk for true multidimensional arrays, gensub(), and sorted_in:
$ awk -F'|' '
    { a[$1][gensub(/"/,"","g",$2)] }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (i in a) {
            c = 0
            for (j in a[i]) {
                printf "%s%s", (c++ ? ":" : i "|\""), j
            }
            print "\""
        }
    }
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
Both the rows and the columns of the output will be sorted in ascending string order (i.e., alphabetically by character).
Posted on 2017-10-12 22:00:34
Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
Output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
If the double quotes between the items should be removed, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
Output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
https://stackoverflow.com/questions/46711259