Q: Quickly looking up structured text data from the command line?

Stack Overflow user

Asked 2018-06-04 04:39:21

Answers: 1 · Views: 54 · Followers: 0 · Votes: 0

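Suppose I have a predictable text document containing some combination of IDs, called X:, and known attributes, e.g. a category Y:, with a known number of instances (e.g. there is always exactly 1 Y: after each X: in the series):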

  X:37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

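I want to retrieve a list of the item IDs of all blue things. I don't care whether there are duplicate IDs, only which ID values are present in the document. I then want to sort the list and compare it with the list of blue-thing IDs from another structured text document that has exactly the same structure ("Which blue things are common to both documents?" "Which blue things are in document 1 but not in document 2?").

I know I can easily grep for all the Y:BLUE lines, but which additional commands do I need to find the preceding X: for each such match and pass the sorted result list to diff? I haven't used the command line intensively since AmiShell... sorry :-( Is there a cookbook online for this kind of use case?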

shell

awk

data-structures

grep

1 Answer

Stack Overflow user

Answered 2018-06-04 15:37:42

Suppose you have the following two input documents:

$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

You can use the following awk command on each document to get the IDs:

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1
1
2
4

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2
1
3
4

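Explanation:

- -F':' defines : as the field separator
- /X:[0-9]+$/{tmp=$2} saves the ID value in the tmp variable (assuming it consists only of digits and nothing else follows it on the line); if that is not the case, you can adapt the filtering regex to your needs
- /Y:BLUE$/{a[NR]=tmp} when we reach a line with the pattern Y:BLUE (assuming the end of line immediately follows BLUE), we add the value saved in tmp to an array
- at the end of processing, we sort the array (asort() is a GNU awk extension) and print it. Note that you can change the awk command to awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' | sort -n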
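The print-and-sort variant of the command can be sketched as a small self-contained script. This is only an illustration under stated assumptions: the sample data written to the temporary file is made up, and the -u flag on sort is an addition that drops duplicate IDs, since the question says duplicates do not matter.

```shell
#!/bin/sh
# Hypothetical sample input, made up for this sketch.
sample=$(mktemp)
cat > "$sample" <<'EOF'
  X:10
# data for item 10
  Y:BLUE
  X:2
# data for item 2
  Y:RED
  X:3
# data for item 3
  Y:BLUE
  X:3
# a repeated ID
  Y:BLUE
EOF

# Remember the most recent X: value and print it whenever a Y:BLUE line follows.
# sort -n orders the IDs numerically; -u drops duplicates (an addition in this sketch).
blue_ids=$(awk -F':' '/X:[0-9]+$/{tmp=$2} /Y:BLUE$/{print tmp}' "$sample" | sort -n -u)

printf '%s\n' "$blue_ids"
rm -f "$sample"
```

Because the sorting is done by sort rather than asort(), this form should not depend on GNU awk.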

You can then combine them as follows to find the differences between the two documents in terms of blue IDs:

$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)                                                    
2c2
< 2
---
> 3

Or to find the blue IDs they have in common:

$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)
1
4
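
The question also asks which blue IDs are in document 1 but not in document 2. One possible sketch uses comm -23, which prints only the lines unique to the first input. Note that comm expects its inputs sorted in collation order, so this sketch uses a plain sort -u rather than a numeric sort; that matters once IDs have more than one digit. The blue_ids helper, the temporary file names, and the shortened input data are all made up for illustration; the inputs carry the same blue IDs as doc1 (1, 2, 4) and doc2 (1, 3, 4).

```shell
#!/bin/sh
# Hypothetical helper: print the sorted, de-duplicated IDs of BLUE items in file $1.
blue_ids() {
    awk -F':' '/X:[0-9]+$/{tmp=$2} /Y:BLUE$/{print tmp}' "$1" | sort -u
}

dir=$(mktemp -d)
# Made-up inputs with the same blue IDs as doc1 (1, 2, 4) and doc2 (1, 3, 4).
printf '  X:1\n  Y:BLUE\n  X:2\n  Y:BLUE\n  X:3\n  Y:RED\n  X:4\n  Y:BLUE\n' > "$dir/a"
printf '  X:4\n  Y:BLUE\n  X:3\n  Y:BLUE\n  X:2\n  Y:RED\n  X:1\n  Y:BLUE\n' > "$dir/b"

blue_ids "$dir/a" > "$dir/a.ids"
blue_ids "$dir/b" > "$dir/b.ids"

# -23 suppresses columns 2 and 3, leaving lines unique to the first list.
only_in_first=$(comm -23 "$dir/a.ids" "$dir/b.ids")
# -12 suppresses columns 1 and 2, leaving lines common to both lists.
in_both=$(comm -12 "$dir/a.ids" "$dir/b.ids")

printf 'only in first:\n%s\n' "$only_in_first"
printf 'in both:\n%s\n' "$in_both"
rm -rf "$dir"
```

Writing the ID lists to temporary files instead of using <( ... ) keeps the sketch runnable in a plain POSIX sh; in bash the process-substitution form shown in the answer works just as well.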

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/50670760
