Quickly finding structured text data from the command line?

Stack Overflow user
Asked 2018-06-04 04:39:21
1 answer, 54 views, 0 followers, 0 votes

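Suppose I have a predictable text document that contains some combination of IDs, written as X:, and known attributes, such as a category Y:, with a known number of instances (for example, there is always exactly one Y: after each X: in the series):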

  X:37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

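I want to retrieve the list of item IDs for all blue items. I don't care whether an ID appears more than once, only which ID values are present in the document. I then want to sort that list and compare it with the list of blue item IDs from another structured text document with exactly the same structure ("Which blue items are common to both documents?", "Which blue items are in document 1 but not in document 2?").

I know I can easily grep for all the Y:BLUE lines, but which additional commands do I need to find the preceding X: for each such match and pass the sorted result list to diff? I haven't used a command line intensively since AmiShell, sorry :-( Is there a cookbook online for this kind of use case?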


1 Answer

Stack Overflow user

Posted 2018-06-04 15:37:42

Suppose you have the following two input documents:

$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

You can run the following awk command on each document to get the IDs:

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1
1
2
4

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2
1
3
4

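Explanation:

  • -F':' sets the field separator to :
  • /X:[0-9]+$/{tmp=$2} saves the ID in the tmp variable. This assumes the ID consists only of digits and that nothing else follows it on the line; if that is not the case, adapt the filtering regex to your needs
  • /Y:BLUE$/{a[NR]=tmp} runs when we reach a line matching Y:BLUE (assuming the end of line comes right after BLUE) and adds the value saved in tmp to an array
  • At the end of processing, the END block sorts the array and prints it. Note that asort is a gawk extension. You could instead change the awk command to awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' doc1 | sort -n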
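The pipeline variant from the last point can be wrapped in a small shell function. This is only a sketch under the same assumptions as above (digit-only IDs, nothing after BLUE on the line); the function name blue_ids and the sample file are made up for illustration. It does not need gawk's asort, and sort -u also drops duplicate IDs, which the question says do not matter.

```shell
#!/bin/sh
# Hypothetical helper: print the sorted, de-duplicated IDs of all BLUE items in a file.
blue_ids() {
  awk -F':' '
    /X:[0-9]+$/ { id = $2 }    # remember the most recent ID
    /Y:BLUE$/   { print id }   # print it once its category turns out to be BLUE
  ' "$1" | sort -n -u          # numeric sort, duplicates removed
}

# Small sample input: IDs 10 and 3 are BLUE (3 appears twice), ID 2 is RED.
sample=$(mktemp)
printf '  X:10\n# data\n  Y:BLUE\n  X:2\n# data\n  Y:RED\n  X:3\n  Y:BLUE\n  X:3\n  Y:BLUE\n' > "$sample"

blue_ids "$sample"
```

For the sample input this prints 3 and then 10, one per line.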

You can then combine them as follows to see how the two documents differ in their blue IDs:

$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)                                                    
2c2
< 2
---
> 3

Or to find the blue IDs that both documents have in common:

$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)
1
4
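The question also asks which blue IDs are in document 1 but not in document 2. A possible sketch uses the other comm column selections: -23 keeps lines found only in the first file, and -13 keeps lines found only in the second. The helper name blue_ids, the temporary directory, and the shortened sample files are assumptions for illustration. Because comm expects its inputs in the sort order of the current locale, the lists here are sorted with plain sort -u rather than numerically, which matters once IDs have more than one digit.

```shell
#!/bin/sh
# Hypothetical helper: BLUE IDs of one file, sorted the way comm expects, duplicates removed.
blue_ids() {
  awk -F':' '/X:[0-9]+$/ { id = $2 } /Y:BLUE$/ { print id }' "$1" | sort -u
}

# Two small sample documents mirroring doc1 and doc2 above (comment lines omitted).
dir=$(mktemp -d)
printf '  X:1\n  Y:BLUE\n  X:2\n  Y:BLUE\n  X:3\n  Y:RED\n  X:4\n  Y:BLUE\n' > "$dir/doc1"
printf '  X:4\n  Y:BLUE\n  X:3\n  Y:BLUE\n  X:2\n  Y:RED\n  X:1\n  Y:BLUE\n' > "$dir/doc2"

# Write the ID lists to files so the sketch does not depend on process substitution.
blue_ids "$dir/doc1" > "$dir/ids1"
blue_ids "$dir/doc2" > "$dir/ids2"

only_in_doc1=$(comm -23 "$dir/ids1" "$dir/ids2")   # suppress columns 2 and 3
only_in_doc2=$(comm -13 "$dir/ids1" "$dir/ids2")   # suppress columns 1 and 3

echo "only in doc1: $only_in_doc1"
echo "only in doc2: $only_in_doc2"
```

With these sample files the script reports 2 as the only blue ID unique to doc1 and 3 as the only one unique to doc2, which matches the diff output shown earlier.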
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50670760
