I have an HTML document like this:
<h1>Sanctuary Verses</h1>
<h2>Purpose and Importance of the Sanctuary</h2>
<p>Ps 73:17\nUntil I went into the sanctuary of God; [then] understood I their end.</p>
<p>...</p>
<h2>Some other title</h2>
<p>...</p>
<h3>sub-sub-title</h3>
<p>sub-sub-content</p>
<h2>Some different title</h2>
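<p>...</p>
...
There are no div or section tags grouping the p tags. That works fine for display purposes, but I want to extract the data in a nested form.
Desired output: each h tag should appear as a title, and each p tag should be added to the contents of the heading it falls under, according to that heading's level.
Expected output: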
{
    "title": "Sanctuary Verses",
    "contents": [
        {
            "title": "Purpose and Importance of the Sanctuary",
            "contents": [
                "Ps 73:17\nUntil I went into the sanctuary of God; [then] understood I their end.",
                "...."
            ]
        },
        {
            "title": "Some other title",
            "contents": [
                "...",
                {
                    "title": "sub-sub-title",
                    "contents": ["sub-sub-content"]
                }
            ]
        },
        {
            "title": "Some different title",
            "contents": ["...", "..."]
        }
    ]
}
I wrote some workaround code that got me the desired output. I would like to know the easiest way to get it.
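Answered 2020-01-27 13:38:45
This is a stack problem / graph problem. Call it a tree. (Or a document, or whatever.)
I think your initial tuples could be improved: (text, depth, type).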
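One way such (text, depth, type) tuples might be produced is sketched below with the standard-library html.parser module. The names TupleCollector and html_to_tuples are hypothetical, the type is encoded as the string "h" or "p", and the sketch assumes a paragraph takes the depth of the most recent heading; none of this comes from the question itself.

```python
from html.parser import HTMLParser


def heading_level(tag):
    """Return 1..6 for 'h1'..'h6', or None for any other tag."""
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return None


class TupleCollector(HTMLParser):
    """Collect (text, depth, kind) tuples from flat heading and <p> tags."""

    def __init__(self):
        super().__init__()
        self.tuples = []
        self.depth = 0     # level of the most recent heading (0 before any)
        self.kind = None   # "h" or "p" while inside such a tag, else None
        self.parts = []    # text fragments of the tag being read

    def handle_starttag(self, tag, attrs):
        level = heading_level(tag)
        if level is not None:
            self.kind, self.depth, self.parts = "h", level, []
        elif tag == "p":
            self.kind, self.parts = "p", []

    def handle_data(self, data):
        # Text outside headings and paragraphs is ignored.
        if self.kind is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        closes_p = self.kind == "p" and tag == "p"
        closes_h = self.kind == "h" and heading_level(tag) is not None
        if closes_p or closes_h:
            text = "".join(self.parts).strip()
            self.tuples.append((text, self.depth, self.kind))
            self.kind = None


def html_to_tuples(html):
    collector = TupleCollector()
    collector.feed(html)
    collector.close()
    return collector.tuples
```

Calling html_to_tuples on the markup from the question would give one tuple per heading or paragraph, in document order, ready to be fed to a tree-building loop.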
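# list_of_tuples holds (text, depth, kind) tuples. Assumed here: kind is "h"
# for a heading or "p" for a paragraph, and a paragraph carries the depth of
# the heading it belongs to.
stack = []  # ancestors of the node currently being built (still open)
depth = 0
root = current = {"title": "root", "contents": []}
for text, item_depth, kind in list_of_tuples:
    if item_depth > depth:
        # Deeper: open a child and keep the current node on the stack.
        child = {"title": text, "contents": []}
        current["contents"].append(child)
        stack.append(current)
        current = child
        depth = item_depth
    elif item_depth < depth:
        # Shallower: close nodes until the stack top is the new node's parent.
        while depth > item_depth:
            stack.pop()
            depth -= 1
        current = {"title": text, "contents": []}
        stack[-1]["contents"].append(current)
    elif kind == "p":
        # Same depth, <p> element: add the text to the current node.
        current["contents"].append(text)
    else:
        # Same depth, <h> element: add a new node to the parent of current.
        current = {"title": text, "contents": []}
        stack[-1]["contents"].append(current)
This builds a tree of arbitrary depth under root. It assumes the depth only ever increases by 1 at a time, though it can decrease by any amount.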
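It would be better to record the depth in the dictionary itself, so that you can move more than one level at a time. Instead of just "title" and "contents", perhaps "title", "depth", and "contents".
Explanation
The stack keeps track of the elements that are still open; current is the element we are building.
If we find a depth greater than our current depth, we push the current element onto the stack (it is still open) and start working on the element at the next level.
If the depth is less than that of the current element, we close the current element and its parents until we are back at the same depth.
Finally, if it is the same depth, we decide whether it is a 'p' element, which simply gets added, or another 'h', which closes the current element and starts a new one.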
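A minimal sketch of that suggestion follows, under a few assumptions: the function name build_tree is hypothetical, the tuples are (text, depth, kind) with kind equal to "h" or "p", heading depths are at least 1, and a paragraph always goes to the innermost open heading (its own depth field is ignored). Because every node stores its depth, a heading can jump several levels in either direction.

```python
def build_tree(items):
    """Nest (text, depth, kind) tuples; each node records its own depth."""
    root = {"title": "root", "depth": 0, "contents": []}
    stack = [root]  # open nodes, outermost first
    for text, depth, kind in items:
        if kind == "p":
            # A paragraph belongs to the innermost open heading.
            stack[-1]["contents"].append(text)
            continue
        # Close every open heading at the same or a deeper level.
        # The root (depth 0) is never popped as long as depth >= 1.
        while stack[-1]["depth"] >= depth:
            stack.pop()
        node = {"title": text, "depth": depth, "contents": []}
        stack[-1]["contents"].append(node)
        stack.append(node)
    return root
```

With this layout the stack top is always the node being built, so no separate current variable is needed, and going from an h1 straight to an h3 and back to an h2 still attaches both to the h1.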
Answered 2020-01-27 16:08:43
You can use recursion with itertools.groupby.
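As a small illustration of the grouping step this approach relies on, the sketch below shows how itertools.groupby splits a list of (indent, text) pairs into alternating runs of unindented and indented lines. The sample lines and the variable names are made up for the example, and the 4-space indentation per level is an assumption.

```python
from itertools import groupby

# Hypothetical sample: (indent, text) pairs, 4 spaces per nesting level.
lines = [
    ("", "<h1>Title</h1>"),
    ("    ", "<h2>Section</h2>"),
    ("        ", "<p>text</p>"),
    ("    ", "<h2>Other</h2>"),
]

# groupby only merges consecutive items that share a key, so each run of
# unindented lines and each run of indented lines becomes its own group.
runs = [
    (top_level, [text for _, text in group])
    for top_level, group in groupby(lines, key=lambda pair: not pair[0])
]
```

Here runs holds two groups: the unindented heading on its own, then the three indented lines that belong under it. A recursive function can strip one indentation level from the second group and repeat the same split.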
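import itertools as it
import json
import re

def to_tree(d):
    # d is a list of (indent, line) pairs; split it into consecutive runs of
    # unindented and indented lines.
    v, r = [list(b) for _, b in it.groupby(d, key=lambda x: not x[0])], []
    for i in v:
        if r and isinstance(r[-1], dict) and not r[-1]['content']:
            # An indented run right after a heading: strip one 4-space level
            # and recurse to build that heading's content.
            r[-1]['content'] = to_tree([(j[4:], k) for j, k in i])
        else:
            for _, k in i:
                text = re.sub(r'</*\w+>', '', k)  # drop the tags, keep the text
                r.append({'title': text, 'content': []} if re.match(r'<h', k) else text)
    return r

# html is the source string; this relies on it being indented by 4 spaces
# per nesting level.
lines = [(re.match(r'\s*', i).group(), i.lstrip()) for i in filter(None, html.split('\n'))]
result = to_tree(lines)
print(json.dumps(result[0], indent=4))
Output: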
{
    "title": "Sanctuary Verses",
    "content": [
        {
            "title": "Purpose and Importance of the Sanctuary",
            "content": [
                "Ps 73:17 Until I went into the sanctuary of God; [then] understood I their end.",
                "..."
            ]
        },
        {
            "title": "Some other title",
            "content": [
                "...",
                {
                    "title": "sub-sub-title",
                    "content": [
                        "sub-sub-content"
                    ]
                }
            ]
        },
        {
            "title": "Some different title",
            "content": [
                "..."
            ]
        }
    ]
}
https://stackoverflow.com/questions/59929011