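I have data inside HTML that I'm trying to match. I'm using bash for this, and since bash can't parse HTML on its own, I run the HTML through pup (as recommended here on StackOverflow) and use pup to extract some patterns. That leaves me with a lot of JSON and data I don't need, so I then run sed commands to delete the unwanted lines. I'm trying to find a way to use jq to select only the data I need, so that I don't have to run sed at all.
So I run the following command:
cat test.html | pup 'div.scene json{}' > out.json
which generates the following.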
[
{
"children": [
{
"children": [
{
"class": "icon-new active",
"tag": "div"
},
{
"children": [
{
"children": [
{
"alt": "Album Title - Artist Name - 1",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 2",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 3",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 4",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 5",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"class": "last",
"tag": "span"
}
],
"class": "sample-picker clearfix",
"data-trackid": "bhangra-tracking-id",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-overlay",
"tag": "div"
},
{
"children": [
{
"alt": "Album Title",
"class": "lazy card-main-img",
"data-src": "",
"tag": "img",
"title": "Album Title"
}
],
"data-trackid": "bhangra-tracking-id ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-image",
"tag": "div"
},
{
"children": [
{
"children": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"class": "scene-card-title",
"tag": "div"
},
{
"children": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"class": "model-names",
"tag": "div"
},
{
"tag": "time",
"text": "September 08, 2018"
},
{
"children": [
{
"children": [
{
"class": "label-left-box",
"tag": "span",
"text": "Website Name"
},
{
"class": "label-text",
"tag": "span",
"text": "Website URL"
}
],
"class": "collection label-small",
"data-trackid": "scene-card-collection",
"href": "/bhangra/main/id/url/",
"tag": "a",
"title": "Website URL"
},
{
"class": "label-hd ",
"tag": "span"
},
{
"children": [
{
"children": [
{
"class": "icons like-icon",
"tag": "span"
},
{
"class": "like-amount",
"tag": "var",
"text": "0"
}
],
"class": "likes",
"tag": "span"
},
{
"children": [
{
"class": "icons dislike-icon",
"tag": "span"
},
{
"class": "dislike-amount",
"tag": "var",
"text": "0"
}
],
"class": "dislikes",
"tag": "span"
}
],
"class": "label-rating",
"tag": "span"
}
],
"class": "bhangra-information",
"tag": "div"
}
],
"class": "scene-card-info",
"tag": "div"
}
],
"class": "bhangra-card scene ",
"tag": "div"
}
]
I then use jq to return some of the details I want.
cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'
This returns the following.
{
"1": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"2": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"date": "September 08, 2018"
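}
To work around this, I run a bunch of sed commands to remove the unwanted lines from the output above.
Below is what I would like the initial jq query to return, but I'm not sure how to get just that specific data back.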
{
"1" : {
"album": "Album Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist Name",
"date": "September 08, 2018"
},
"2" : {
"album": "Album1 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist1 Name",
"date": "September 08, 2018"
},
"3" : {
"album": "Album2 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist2 Name",
"date": "September 09, 2018"
}
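}
Update (edited 2018-09-11)
I've made some progress on this. Using the query below I managed to pull back the data I need, but the results still come back as separate objects.
cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'
The output below gets me closer to what I want (as shown in the previous example).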
{
"WTF": "Album Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"text": "Album Title"
}
{
"WTF": "Artist Name",
"href": "/bhangra/profile/view/2842847/artist-name/",
"text": "Artist Name"
}
{
"WTF": "Null",
"href": "Null",
"text": "September 08, 2018"
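}
Answered 2018-09-11 23:55:56
The connection between the input JSON and the desired JSON output seems tenuous, but one way to solve the problem of labeling objects with sequentially numbered keys is to use the following function: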
def tag(s):
  reduce s as $x ({n:0, o:{}} ;
    .n += 1
    | .o += { (.n|tostring): $x})
  | .o;
Here, s should be a stream of JSON entities, and the result is an object with the keys "1", "2", and so on.
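As a quick check of what tag does, here is a minimal sketch that applies it to a made-up stream of three strings. The strings and the shell variable name result are only for illustration. The commas inside the parentheses form a single stream argument, since jq separates function arguments with semicolons.

```shell
# Apply tag to a hypothetical three-item stream; -n means no input is read,
# -c prints the result on one line.
result=$(jq -nc '
  def tag(s):
    reduce s as $x ({n: 0, o: {}};
      .n += 1
      | .o += {(.n | tostring): $x})
    | .o;
  tag("a", "b", "c")
')
echo "$result"
```

Each item in the stream ends up under the next numeric key, so the three strings land under "1", "2" and "3".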
So the task now is to generate the stream of desired objects. Since it is not clear exactly what you want, the following may serve as an illustration.
{date: first(.. | objects | select(.tag == "time" and has("text")) | .text)} as $date
| tag(..
      | objects
      | select(has("title") and (has("children")|not) and .title == "Album Title")
        + $date )
Output:
{
"1": {
"alt": "Album Title - Artist Name - 1",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"2": {
"alt": "Album Title - Artist Name - 2",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"3": {
"alt": "Album Title - Artist Name - 3",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"4": {
"alt": "Album Title - Artist Name - 4",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"5": {
"alt": "Album Title - Artist Name - 5",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"6": {
"alt": "Album Title",
"class": "lazy card-main-img",
"data-src": "",
"tag": "img",
"title": "Album Title",
"date": "September 08, 2018"
},
"7": {
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title",
"date": "September 08, 2018"
}
}
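For the shape asked for in the question (album, href, artist and date under numbered keys), one possible sketch combines tag with a per-card filter. It assumes that every card marks its title block with class "scene-card-title", its artist block with class "model-names", and its date with a time tag, as in the sample above. The helper name card and the trimmed two-card input (the second card reuses the placeholder values from the desired output) are made up for illustration.

```shell
# Trimmed stand-in for out.json: two cards, keeping only the nodes the filter reads.
sample='[
  {"tag": "div", "class": "bhangra-card scene ", "children": [
    {"tag": "div", "class": "scene-card-info", "children": [
      {"tag": "div", "class": "scene-card-title", "children": [
        {"tag": "a", "href": "/bhangra/album/view/2842847/title-of-album/", "text": "Album Title"}]},
      {"tag": "div", "class": "model-names", "children": [
        {"tag": "a", "href": "/bhangra/profile/view/2842847/artist-name/", "text": "Artist Name"}]},
      {"tag": "time", "text": "September 08, 2018"}]}]},
  {"tag": "div", "class": "bhangra-card scene ", "children": [
    {"tag": "div", "class": "scene-card-info", "children": [
      {"tag": "div", "class": "scene-card-title", "children": [
        {"tag": "a", "href": "/bhangra/album/view/2842847/title-of-album/", "text": "Album1 Title"}]},
      {"tag": "div", "class": "model-names", "children": [
        {"tag": "a", "href": "/bhangra/profile/view/2842847/artist-name/", "text": "Artist1 Name"}]},
      {"tag": "time", "text": "September 08, 2018"}]}]}
]'

result=$(printf '%s' "$sample" | jq -c '
  # Number the items of stream s as keys "1", "2", ... (same function as in the answer).
  def tag(s):
    reduce s as $x ({n: 0, o: {}};
      .n += 1
      | .o += {(.n | tostring): $x})
    | .o;

  # Hypothetical helper: build one summary object per card, locating nodes by class or tag.
  def card:
    first(.. | objects | select(.class == "scene-card-title") | .children[0]) as $album
    | first(.. | objects | select(.class == "model-names") | .children[0]) as $artist
    | { album:  $album.text,
        href:   $album.href,
        artist: $artist.text,
        date:   first(.. | objects | select(.tag == "time") | .text) };

  tag(.[] | card)
')
echo "$result"
```

Looking nodes up by class rather than by position such as .children[1].children[0] is a design choice that should tolerate small layout changes. A card that lacks one of these nodes produces no object and is silently skipped, so the numbering would then no longer match the card positions.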