
Using pup / jq to convert HTML to JSON and extract data into variables

Stack Overflow user
Asked 2018-09-11 04:15:09
1 answer, 2.6K views, 3 votes

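I have data inside HTML that I am trying to extract. I am doing this in bash, and since bash cannot parse HTML by itself, I pipe the HTML through pup (as recommended here on Stack Overflow) and use pup to extract some patterns. That leaves me with a large amount of JSON, much of it data I don't need, so I then run sed commands to delete the unwanted lines. I am trying to find a way to use jq to select only the data I need, so that I no longer have to run sed to strip the unwanted lines.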

So I run the following command:

cat test.html | pup 'div.scene json{}' > out.json

which produces the following output.

 [
  {
   "children": [
    {
     "children": [
      {
       "class": "icon-new active",
       "tag": "div"
      },
      {
       "children": [
        {
         "children": [
          {
           "alt": "Album Title - Artist Name - 1",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 2",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 3",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 4",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 5",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "class": "last",
           "tag": "span"
          }
         ],
         "class": "sample-picker clearfix",
         "data-trackid": "bhangra-tracking-id",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "title": "Album Title"
        }
       ],
       "class": "card-overlay",
       "tag": "div"
      },
      {
       "children": [
       {
         "alt": "Album Title",
         "class": "lazy card-main-img",
         "data-src": "",
         "tag": "img",
         "title": "Album Title"
        }
       ],
       "data-trackid": "bhangra-tracking-id  ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "title": "Album Title"
      }
     ],
     "class": "card-image",
     "tag": "div"
    },
    {
     "children": [
      {
       "children": [
        {
         "data-trackid": "scene-card-info-title Album Title ",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "text": "Album Title",
         "title": "Album Title"
        }
       ],
       "class": "scene-card-title",
       "tag": "div"
      },
      {
       "children": [
        {
         "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
         "href": "/bhangra/profile/view/2842847/artist-name/",
         "tag": "a",
         "text": "Artist Name",
         "title": "Artist Name"
        }
       ],
       "class": "model-names",
       "tag": "div"
      },
      {
       "tag": "time",
       "text": "September 08, 2018"
      },
      {
       "children": [
        {
         "children": [
          {
           "class": "label-left-box",
           "tag": "span",
           "text": "Website Name"
          },
          {
           "class": "label-text",
           "tag": "span",
           "text": "Website URL"
          }
         ],
         "class": "collection label-small",
         "data-trackid": "scene-card-collection",
         "href": "/bhangra/main/id/url/",
         "tag": "a",
         "title": "Website URL"
        },
        {
         "class": "label-hd ",
         "tag": "span"
        },
        {
         "children": [
          {
           "children": [
            {
             "class": "icons like-icon",
             "tag": "span"
            },
            {
             "class": "like-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "likes",
           "tag": "span"
          },
          {
           "children": [
            {
             "class": "icons dislike-icon",
             "tag": "span"
            },
            {
             "class": "dislike-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "dislikes",
           "tag": "span"
          }
         ],
         "class": "label-rating",
         "tag": "span"
        }
       ],
       "class": "bhangra-information",
       "tag": "div"
      }
     ],
     "class": "scene-card-info",
     "tag": "div"
    }
   ],
   "class": "bhangra-card scene ",
   "tag": "div"
  }
 ]

I then use jq to return some of the details I want.

 cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'

This returns the following.

 {
   "1": [
     {
       "data-trackid": "scene-card-info-title Album Title ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "text": "Album Title",
       "title": "Album Title"
     }
   ],
   "2": [
     {
       "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
       "href": "/bhangra/profile/view/2842847/artist-name/",
       "tag": "a",
       "text": "Artist Name",
       "title": "Artist Name"
     }
   ],
   "date": "September 08, 2018"
 }

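To work around this, I run a series of sed commands to remove the unwanted lines from the output above.

Below is what I would like the initial jq query to return; I'm just not sure how to get only this specific data back.
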
 { 
   "1" : {
            "album": "Album Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist Name",
            "date": "September 08, 2018"
   },
   "2" : {
            "album": "Album1 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist1 Name",
            "date": "September 08, 2018"
   },
   "3" : {
            "album": "Album2 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist2 Name",
            "date": "September 09, 2018"
   }
 }

Update (edit 2018-09-11)

I have made some progress on this. With the query below I managed to pull back the data I need, but the results still come back as separate objects.

 cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'

The output below gets me closer to what I want (shown in the previous example).

 {
   "WTF": "Album Title",
   "href": "/bhangra/album/view/2842847/title-of-album/",
   "text": "Album Title"
 }
 {
   "WTF": "Artist Name",
   "href": "/bhangra/profile/view/2842847/artist-name/",
   "text": "Artist Name"
 }
 {
   "WTF": null,
   "href": null,
   "text": "September 08, 2018"
 }

1 Answer

Stack Overflow user

Answered 2018-09-11 23:55:56

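The connection between the input JSON and the desired output JSON seems tenuous, but one way to tackle the problem of labeling objects with sequentially numbered keys is to use the following function:
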
def tag(s):
  reduce s as $x ({n:0, o:{}} ;
    .n += 1
    | .o += { (.n|tostring): $x})
  | .o;

Here, s should be a stream of JSON entities; the result is an object with the keys "1", "2", and so on.
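To make the behaviour of `tag` concrete, here is a minimal sketch that applies it to a literal stream of three strings. The strings and the shell variable name `result` are only illustrative; the function body is the one given above.

```shell
# Apply tag/1 to a literal stream of three strings.
# -n: run without reading input; -c: compact one-line output.
result=$(jq -nc '
  def tag(s):
    reduce s as $x ({n: 0, o: {}};
      .n += 1
      | .o += { (.n | tostring): $x })
    | .o;

  # The commas build a stream of three values passed as the single argument s.
  tag("a", "b", "c")
')
echo "$result"
# → {"1":"a","2":"b","3":"c"}
```

Each item of the stream increments the counter `.n`, and the counter (converted to a string) becomes the key under which that item is stored.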

So the task now is to generate the stream of desired objects. Since it is not clear exactly what you want, the following may serve as an illustration.

{date: first(.. | objects | select(.tag == "time" and has("text")) | .text)} as $date
| tag(.. 
      | objects
      | select(has("title") and (has("children")|not) and .title == "Album Title")
      + $date )

Output

{
  "1": {
    "alt": "Album Title - Artist Name - 1",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "2": {
    "alt": "Album Title - Artist Name - 2",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "3": {
    "alt": "Album Title - Artist Name - 3",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "4": {
    "alt": "Album Title - Artist Name - 4",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "5": {
    "alt": "Album Title - Artist Name - 5",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "6": {
    "alt": "Album Title",
    "class": "lazy card-main-img",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "7": {
    "data-trackid": "scene-card-info-title Album Title ",
    "href": "/bhangra/album/view/2842847/title-of-album/",
    "tag": "a",
    "text": "Album Title",
    "title": "Album Title",
    "date": "September 08, 2018"
  }
}
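If the goal is the album/href/artist/date shape shown in the question, one possible approach is to select elements by their `class` rather than by position. The following is only a sketch under assumptions: the `cards` variable is a trimmed, hypothetical stand-in for out.json that keeps just the fields used here, the helper `link` is a name introduced for illustration, and the class names `scene-card-title` and `model-names` are taken from the pup output in the question.

```shell
# Trimmed, hypothetical stand-in for out.json (one card, only the fields used below).
cards='[
  {"tag":"div","class":"bhangra-card scene ","children":[
    {"tag":"div","class":"scene-card-info","children":[
      {"tag":"div","class":"scene-card-title","children":[
        {"tag":"a","href":"/bhangra/album/view/2842847/title-of-album/","text":"Album Title"}]},
      {"tag":"div","class":"model-names","children":[
        {"tag":"a","href":"/bhangra/profile/view/2842847/artist-name/","text":"Artist Name"}]},
      {"tag":"time","text":"September 08, 2018"}
    ]}
  ]}
]'

result=$(printf '%s' "$cards" | jq -c '
  def tag(s):
    reduce s as $x ({n: 0, o: {}};
      .n += 1 | .o += { (.n | tostring): $x })
    | .o;

  # Hypothetical helper: first <a> child of the element with class $c.
  def link($c):
    first(.. | objects | select(.class == $c) | .children[]? | select(.tag == "a"));

  # Build one flat object per card, then number the objects with tag/1.
  tag(.[]
      | link("scene-card-title") as $album
      | link("model-names") as $artist
      | { album:  $album.text,
          href:   $album.href,
          artist: $artist.text,
          date:   first(.. | objects | select(.tag == "time") | .text) })
')
echo "$result"
```

With the real file, the same filter could be run as `jq -c '...' out.json`. A card that lacks one of these elements produces no object at all in this sketch, so real pages may need a fallback for missing fields.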
Votes: 1
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/52264863
