I have a large database (database.csv) containing entries in the following format:
SOME_ID_NUMBER
Some delimited columns of data here
More delimited columns of data here
Tonsof delimited columns of data here
#########
SOME_ID_NUMBER_2
Other delimited columns of data here
Cool delimited columns of data here
Awesome delimited columns of data here
Extra delimited columns of data here
#########
OTHER_ID_NAMES
Lame delimited columns of data here
Boring delimited columns of data here
Okay delimited columns of data here
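#########
Each entry starts with the entry name, followed by a varying number of lines of delimited data, and ends with a line of '#' characters.
I also have a large list of patterns in another file (patterns.csv), with entries such as: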
Some_ID_NUMBER
OTHER_ID_NAMES
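ID_NOT_IN_LIST
I want to extract the entries from the database file that match the patterns in the pattern file. Here is the desired output for the data above: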
SOME_ID_NUMBER
Some delimited columns of data here
More delimited columns of data here
Tonsof delimited columns of data here
#########
OTHER_ID_NAMES
Lame delimited columns of data here
Boring delimited columns of data here
Okay delimited columns of data here
#########
Or, even better, this output:
SOME_ID_NUMBER Some delimited columns of data here
SOME_ID_NUMBER More delimited columns of data here
SOME_ID_NUMBER Tonsof delimited columns of data here
OTHER_ID_NAMES Lame delimited columns of data here
OTHER_ID_NAMES Boring delimited columns of data here
OTHER_ID_NAMES Okay delimited columns of data here
ID_NOT_IN_LIST
Here is my attempt:
while read line
do
awk -v start="$line" -v last="#" '/^"start"/,/^"last"/' database.csv >>matches.csv
done<patterns.csv
Posted on 2016-03-03 07:27:20
With GNU awk for multi-char RS and ENDFILE:
$ cat tst.awk
NR==FNR { patterns[toupper($0)]; next }
ENDFILE { RS=ORS="\n#########\n"; FS="\n" }
toupper($1) in patterns
$ gawk -f tst.awk patterns.csv database.csv
SOME_ID_NUMBER
Some delimited columns of data here
More delimited columns of data here
Tonsof delimited columns of data here
#########
OTHER_ID_NAMES
Lame delimited columns of data here
Boring delimited columns of data here
Okay delimited columns of data here
#########
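If gawk is not available, roughly the same result can be had with a line-by-line state machine that needs neither a multi-character RS nor ENDFILE. The following is a minimal sketch, not part of the original answer. The names `workdir`, `wanted`, `at_start` and `keep` are made up for illustration, and it assumes that every entry ends with a line consisting only of `#` characters and that patterns.csv is not empty. The sample files from the question are recreated in a scratch directory so the snippet runs on its own.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)

cat > "$workdir/database.csv" <<'EOF'
SOME_ID_NUMBER
Some delimited columns of data here
More delimited columns of data here
Tonsof delimited columns of data here
#########
SOME_ID_NUMBER_2
Other delimited columns of data here
Cool delimited columns of data here
Awesome delimited columns of data here
Extra delimited columns of data here
#########
OTHER_ID_NAMES
Lame delimited columns of data here
Boring delimited columns of data here
Okay delimited columns of data here
#########
EOF

cat > "$workdir/patterns.csv" <<'EOF'
Some_ID_NUMBER
OTHER_ID_NAMES
ID_NOT_IN_LIST
EOF

# Hypothetical variable names: wanted, at_start, keep.
awk '
    NR == FNR { wanted[toupper($0)]; next }   # first file: store each ID, upper-cased
    FNR == 1  { at_start = 1 }                # the first database line is an entry name
    at_start  { keep = (toupper($0) in wanted); at_start = 0 }
    keep                                      # print every line of a wanted entry
    /^#+$/    { at_start = 1 }                # separator: the next line starts a new entry
' "$workdir/patterns.csv" "$workdir/database.csv" > "$workdir/matches.csv"

cat "$workdir/matches.csv"
```

This reads both files exactly once, like the gawk version, and decides on each entry name whether to keep the lines up to and including the next separator.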
Or, for the second output format:
$ cat tst.awk
NR==FNR { patterns[toupper($0)]; next }
ENDFILE { RS="\n#########\n"; FS="\n" }
toupper($1) in patterns {
    patterns[toupper($1)]++    # count the hit under the same upper-cased key that was stored
    for (i=2;i<=NF;i++) {
        print $1, $i
    }
}
END {
    for (pat in patterns) {
        if (patterns[pat] == 0) {
            print pat
        }
    }
}
$ gawk -f tst.awk patterns.csv database.csv
SOME_ID_NUMBER Some delimited columns of data here
SOME_ID_NUMBER More delimited columns of data here
SOME_ID_NUMBER Tonsof delimited columns of data here
OTHER_ID_NAMES Lame delimited columns of data here
OTHER_ID_NAMES Boring delimited columns of data here
OTHER_ID_NAMES Okay delimited columns of data here
ID_NOT_IN_LIST
If you're ever considering writing a shell loop just to manipulate text again, see also https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
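As an aside on the loop in the question: it writes nothing, because awk does not expand variables inside a /.../ regex literal, so /^"start"/ looks for lines that begin with the literal text "start", double quotes included. A minimal sketch of a loop that does produce the first output format, assuming the same sample files, compares the whole line against the variable instead (a prefix regex such as `$0 ~ "^" start` would also pick up SOME_ID_NUMBER_2). The names `workdir` and `id` are made up for illustration. It still rereads database.csv once per pattern, so it is shown only to illustrate the fix, not as a recommendation.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)

cat > "$workdir/database.csv" <<'EOF'
SOME_ID_NUMBER
Some delimited columns of data here
More delimited columns of data here
Tonsof delimited columns of data here
#########
SOME_ID_NUMBER_2
Other delimited columns of data here
Cool delimited columns of data here
Awesome delimited columns of data here
Extra delimited columns of data here
#########
OTHER_ID_NAMES
Lame delimited columns of data here
Boring delimited columns of data here
Okay delimited columns of data here
#########
EOF

cat > "$workdir/patterns.csv" <<'EOF'
Some_ID_NUMBER
OTHER_ID_NAMES
ID_NOT_IN_LIST
EOF

while IFS= read -r id
do
    # The range starts on a line equal to the ID (ignoring case)
    # and ends on the next line made up only of # characters.
    awk -v start="$id" 'toupper($0) == toupper(start), /^#+$/' "$workdir/database.csv"
done < "$workdir/patterns.csv" > "$workdir/matches.csv"

cat "$workdir/matches.csv"
```

Redirecting once after `done` also avoids reopening matches.csv for appending on every iteration.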
https://stackoverflow.com/questions/35760108