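Suppose I have a predictable text document containing items identified by an ID called X: and known combinations of properties, such as a category Y: with a known number of instances (for example, there is always exactly one Y: after each X: in the series):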
X:37
# more data pertaining to item 37
Y:BLUE
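# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
I want to retrieve the list of item IDs of all blue items. I do not care whether there are duplicate IDs, only which ID values occur in the document. I then want to sort the list and compare it with the list of blue-item IDs from another structured text document with exactly the same structure ("Which blue things are common to both documents?", "Which blue things are in document 1 but not in document 2?").
I know I can easily grep for all the Y:BLUE lines, but which additional commands do I need to find the preceding X: for each such match and to pass the sorted result list to diff? I haven't used the command line intensively since AmiShell... sorry :-( Is there a cookbook online for this kind of use case?
Posted on 2018-06-04 15:37:42
Suppose you have the following two input documents: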
$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
X:1
# more data pertaining to item 37
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:2
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:3
# more data pertaining to item 37
# more data pertaining to item 37
Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:4
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
X:4
# more data pertaining to item 37
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:3
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:2
# more data pertaining to item 37
# more data pertaining to item 37
Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:1
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
You can use the following awk command on each document to get the IDs:
$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1
1
2
4
$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2
1
3
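4
Explanation:
-F':' defines : as the field separator.
/X:[0-9]+$/{tmp=$2} saves the ID value in the tmp variable (assuming the ID consists only of digits and nothing else follows it on the line); if that is not the case, adapt the filter regex to your needs.
/Y:BLUE$/{a[NR]=tmp} when we reach a line matching Y:BLUE (assuming the end of line comes right after BLUE), adds the value saved in tmp to an array.
END{asort(a); for(i in a){print a[i]}} sorts the array and prints its values once all input has been read. asort is a gawk extension.
Alternatively, the awk command can be changed to awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' | sort -n, which leaves the sorting to sort.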
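For awk implementations other than gawk, the extraction can be sketched as a small shell function that leaves sorting to sort. The function name blue_ids is hypothetical, and the regexes are anchored with ^ on the assumption that X: and Y: always start their lines, which is slightly stricter than the patterns in the answer. This variant also avoids relying on the iteration order of for (i in a), which awk does not guarantee, and it drops duplicate IDs.

```shell
# blue_ids FILE: print the numerically sorted, de-duplicated IDs of all
# blue items in FILE. Hypothetical helper name, not part of the answer.
# Assumes every "X:<digits>" line comes before the "Y:" line of its item.
blue_ids() {
    awk -F':' '
        /^X:[0-9]+$/ { id = $2 }    # remember the most recent ID
        /^Y:BLUE$/   { print id }   # emit it once the item turns out to be blue
    ' "$1" | sort -n -u             # numeric sort, duplicates removed
}
```

With the sample files above, blue_ids doc1 should print 1, 2 and 4 on separate lines.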
You can then combine them as follows to find the differences between the two documents in terms of blue IDs:
$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)
2c2
< 2
---
> 3
Or find the blue IDs common to both:
$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)
1
4
https://stackoverflow.com/questions/50670760
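The question also asks which blue IDs are in document 1 but not in document 2. A minimal sketch of all three comparisons with comm follows. The names blue_ids_lex and blue_compare are hypothetical. Because comm expects its inputs in lexical (collation) order, the IDs are sorted with plain sort -u here rather than numerically; with single-digit IDs the two orders coincide, but with IDs such as 2 and 10 they do not. Temporary files are used instead of <(...) so the sketch does not depend on bash.

```shell
# Hypothetical helper: blue IDs of FILE, sorted lexically (the order comm
# expects) with duplicates removed.
blue_ids_lex() {
    awk -F':' '/^X:[0-9]+$/ { id = $2 } /^Y:BLUE$/ { print id }' "$1" | sort -u
}

# blue_compare MODE FILE1 FILE2 (hypothetical helper)
#   common : IDs that are blue in both files
#   only1  : IDs that are blue in FILE1 but not in FILE2
#   only2  : IDs that are blue in FILE2 but not in FILE1
blue_compare() {
    a=$(mktemp) && b=$(mktemp) || return 1
    blue_ids_lex "$2" > "$a"
    blue_ids_lex "$3" > "$b"
    case "$1" in
        common) comm -12 "$a" "$b" ;;   # suppress columns 1 and 2
        only1)  comm -23 "$a" "$b" ;;   # suppress columns 2 and 3
        only2)  comm -13 "$a" "$b" ;;   # suppress columns 1 and 3
        *)      echo "usage: blue_compare common|only1|only2 FILE1 FILE2" >&2 ;;
    esac
    rm -f "$a" "$b"
}
```

With the sample files, blue_compare only1 doc1 doc2 should print 2, and blue_compare common doc1 doc2 should print 1 and 4, which agrees with the diff and comm output shown in the answer.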