Python3: How to convert plain HTML into a nested dictionary based on 'h' tag level?

Stack Overflow user

Asked on 2020-01-27 10:13:41

2 answers, 293 views, 0 followers, 1 vote

I have HTML like the following:

<h1>Sanctuary Verses</h1>
    <h2>Purpose and Importance of the Sanctuary</h2>
       <p>Ps 73:17\nUntil I went into the sanctuary of God; [then] understood I their end.</p>
       <p>...</p>
    <h2>Some other title</h2>
        <p>...</p>
         <h3>sub-sub-title</h3>
             <p>sub-sub-content</p>
    <h2>Some different title</h2>
        <p>...</p>...

There are no div or section tags grouping the p tags. That works fine for display purposes, but I want to extract the data in a structured form.

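Desired behavior:

The h tags should appear as titles and be nested according to their level, and each p tag should be added to the contents of the heading it appears under.

Desired output: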

{
  "title": "Sanctuary Verses",
  "contents": [
    {"title": "Purpose and Importance of the Sanctuary",
     "contents": ["Ps 73:17\nUntil I went into the sanctuary of God; [then] understood I their end.",
                  "...."
                 ]
    },
    {"title": "Some other title",
     "contents": ["...",
                  {"title": "sub-sub-title",
                   "contents": ["sub-sub-content"]
                  }
                 ]
    },
    {"title": "Some different title",
     "contents": ["...", "..."]
    }
  ]
}

I wrote some workaround code that got me the desired output. I would like to know the easiest way to get it.

Tags: python, dictionary, beautifulsoup, nested, tuples

2 Answers

Stack Overflow user

Accepted answer

Posted on 2020-01-27 13:38:45

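This is a stack problem / graph problem. Call it a tree. (Or a document, or whatever.)

I think your initial tuples could be improved: (text, depth, type).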
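As a rough sketch of how such (text, depth, type) tuples might be produced from the flat markup, the following uses the standard library `html.parser` module rather than BeautifulSoup. The names `TupleExtractor` and `html_to_tuples` are hypothetical, and the sketch assumes that a `<p>` takes the depth of the most recent heading.

```python
from html.parser import HTMLParser
import re


class TupleExtractor(HTMLParser):
    """Collects (text, depth, kind) tuples from flat <h1>-<h6> and <p> markup."""

    def __init__(self):
        super().__init__()
        self.items = []    # resulting (text, depth, kind) tuples
        self.depth = 0     # level of the most recent heading
        self.kind = None   # "h" or "p" while inside such a tag, else None
        self.buffer = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        heading = re.fullmatch(r"h([1-6])", tag)
        if heading:
            self.depth = int(heading.group(1))
            self.kind = "h"
        elif tag == "p":
            self.kind = "p"   # assumption: a <p> inherits the last heading's depth
        else:
            return            # inline tags such as <b> do not start a new item
        self.buffer = []

    def handle_data(self, data):
        if self.kind:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.kind and (tag == "p" or re.fullmatch(r"h[1-6]", tag)):
            text = "".join(self.buffer).strip()
            self.items.append((text, self.depth, self.kind))
            self.kind = None


def html_to_tuples(markup):
    parser = TupleExtractor()
    parser.feed(markup)
    parser.close()
    return parser.items


sample = "<h1>Sanctuary Verses</h1>\n<h2>Some other title</h2>\n<p>...</p>"
print(html_to_tuples(sample))
```

Storing the kind as "h" or "p" keeps the later nesting step independent of the parser, so the same tuples could equally come from BeautifulSoup.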

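# Assumed input: (text, depth, kind) tuples in document order. kind is "h" or
# "p", and a <p> carries the depth of the heading it belongs to.
list_of_tuples = [
    ("Sanctuary Verses", 1, "h"),
    ("Purpose and Importance of the Sanctuary", 2, "h"),
    ("...", 2, "p"),
    ("Some other title", 2, "h"),
    ("...", 2, "p"),
    ("sub-sub-title", 3, "h"),
    ("sub-sub-content", 3, "p"),
    ("Some different title", 2, "h"),
    ("...", 2, "p"),
]

stack = []  # open ancestors of `current`
depth = 0
root = current = {"title": "root", "contents": []}
for text, item_depth, kind in list_of_tuples:
    if item_depth > depth:
        # Deeper: `current` stays open and the new heading becomes its child.
        node = {"title": text, "contents": []}
        current["contents"].append(node)
        stack.append(current)
        current = node
        depth = item_depth
    elif item_depth < depth:
        # Shallower: close `current` and its ancestors down to this depth.
        while depth > item_depth:
            stack.pop()
            depth -= 1
        current = {"title": text, "contents": []}
        stack[-1]["contents"].append(current)
    elif kind == "p":
        # Same depth, <p> element: the text is added to the current element.
        current["contents"].append(text)
    else:
        # Same depth, <h> element: close `current`, start a sibling under its parent.
        current = {"title": text, "contents": []}
        stack[-1]["contents"].append(current)

# root["contents"][0] now holds the tree rooted at the <h1>.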

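This creates a graph of arbitrary depth. It assumes the depth increases by only 1 at a time, but it can decrease by any amount.

It would be better to record the depth in the dictionary, so that you can move multiple depths at once. Instead of just "title" and "contents", perhaps "title", "depth", and "contents".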
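A minimal sketch of that variant, under the same assumed (text, depth, kind) tuple layout: each node stores its own "depth", so closing open elements is a comparison against the node on top of the stack and a heading may skip levels (for example an h3 directly under an h1). The function name `build_tree` is hypothetical, and heading depths are assumed to be at least 1.

```python
def build_tree(items):
    """Nest (text, depth, kind) tuples; every node remembers its own depth."""
    root = {"title": "root", "depth": 0, "contents": []}
    stack = [root]  # open nodes, outermost first
    for text, depth, kind in items:
        if kind == "p":
            # Paragraph text goes into the innermost open heading.
            stack[-1]["contents"].append(text)
            continue
        # Close every open node at this depth or deeper, however many levels.
        # The root has depth 0, so it is never popped for depths >= 1.
        while stack[-1]["depth"] >= depth:
            stack.pop()
        node = {"title": text, "depth": depth, "contents": []}
        stack[-1]["contents"].append(node)
        stack.append(node)
    return root


tree = build_tree([
    ("Sanctuary Verses", 1, "h"),
    ("sub-sub-title", 3, "h"),      # jumps from depth 1 straight to depth 3
    ("sub-sub-content", 3, "p"),
    ("Some different title", 2, "h"),
])
print(tree["contents"][0]["title"])
```

Because the depth lives in the node, no separate depth counter has to be kept in step with the stack.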

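Explanation

The stack tracks the open elements, and the current element is the one we are building.

If we find a depth greater than our current depth, we put the current element on the stack (it is still open) and start working on the next-level element.

If the depth is less than that of the current element, we close the current element and its parents down to the same depth.

Finally, if it is the same depth, we decide whether it is a 'p' element, which simply gets added, or another 'h', which closes the current element and starts a new one.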

Votes: 1

Stack Overflow user

Posted on 2020-01-27 16:08:43

You can use recursion with itertools.groupby.

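import itertools as it
import json
import re

# Assumes `html` holds the markup as a string, indented by exactly four spaces
# per nesting level: nesting is inferred from indentation, not from tag names.
def to_tree(d):
    # d is a list of (indent, line) pairs; split it into runs of
    # unindented and indented lines.
    v, r = [list(b) for _, b in it.groupby(d, key=lambda x: not x[0])], []
    for i in v:
        if r and isinstance(r[-1], dict) and not r[-1]['content']:
            # A run that follows a heading with no content yet becomes its content.
            r[-1]['content'] = to_tree([(j[4:], k) for j, k in i])
        else:
            for _, k in i:
                text = re.sub(r'</*\w+>', '', k)  # strip the tags
                r.append({'title': text, 'content': []} if re.match(r'<h', k) else text)
    return r

# Pair each non-empty line with its leading whitespace.
lines = [(re.match(r'\s*', i).group(), i.lstrip()) for i in filter(None, html.split('\n'))]
result = to_tree(lines)
print(json.dumps(result[0], indent=4))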

Output:

{
    "title": "Sanctuary Verses",
    "content": [
        {
            "title": "Purpose and Importance of the Sanctuary",
            "content": [
                "Ps 73:17 Until I went into the sanctuary of God; [then] understood I their end.",
                "..."
            ]
        },
        {
            "title": "Some other title",
            "content": [
                "...",
                {
                    "title": "sub-sub-title",
                    "content": [
                        "sub-sub-content"
                    ]
                }
            ]
        },
        {
            "title": "Some different title",
            "content": [
                "..."
            ]
        }
    ]
}
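To see what the grouping step in this answer does, here is a small sketch of `itertools.groupby` applied to (indent, line) pairs with the same key, `not indent`. The sample pairs are illustrative and assume four spaces of indentation per level; the name `runs` is hypothetical.

```python
import itertools as it

# (indent, line) pairs as they might look one level below the <h1>.
pairs = [
    ("", "<h2>Some other title</h2>"),
    ("    ", "<p>...</p>"),
    ("    ", "<h3>sub-sub-title</h3>"),
    ("        ", "<p>sub-sub-content</p>"),
    ("", "<h2>Some different title</h2>"),
]

# groupby yields one group per run of consecutive pairs with the same key, so
# unindented headings and the indented lines that follow them alternate.
runs = [
    (is_top_level, [line for _, line in group])
    for is_top_level, group in it.groupby(pairs, key=lambda pair: not pair[0])
]
for is_top_level, lines in runs:
    print(is_top_level, lines)
```

Each indented run is what the recursive call receives after four characters of indentation are stripped, which is why the approach depends on consistent indentation in the source text.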

Votes: 1

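The original content of this page is provided by Stack Overflow. Translation support is provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: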

https://stackoverflow.com/questions/59929011
