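I'm trying to extract some data from the following bs4 element (sample below). Specifically, I want to build a loop that pulls out all the company names (and possibly the locations):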
[<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
</div>
<a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>]
The names are the "Hak Industrial..." strings.
Output: two lists, like
[Hak Industrial Services B.V., Hak Industrial Services B.V., Hak Industrial Services Middle East LLC, Hak Industrial Services SEA Sdn. Bhd., Hak Industrial Services USLLC]
and
[Nederland, Nederland, Verenigde Arabische Emiraten, Maleisië, Verenigde Staten van Amerika]
Does anyone know how to do this in bs4?
Thanks in advance,
Answered 2018-06-21 21:41:48
I recently had to accomplish a similar goal. I built a function to parse the HTML in emails. It looks like this:
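from bs4 import BeautifulSoup as bs

def parser(data):
    # Parse the HTML from the ticket and build a list of its text fragments.
    html = data
    parsed = bs(html, "lxml")
    # stripped_strings yields every non-empty text node with whitespace trimmed.
    data = [line.strip() for line in parsed.stripped_strings]
    print(data)

Passing in the HTML gives output like this (run under Python 2, hence the u prefixes):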
[u'[', u'Nevenvestiging:', u'Hak Industrial Services B.V., Hoogeveen', u'Nederland', u'blabla useless data', u'Hak Industrial Services B.V., Nieuw Heeten', u'Nederland', u'blabla useless data', u'Hak Industrial Services Middle East LLC, Abu Dhabi', u'Verenigde Arabische Emiraten', u'blabla useless data', u'Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor', u'Maleisi\xeb', u'blabla useless data', u'Hak Industrial Services USLLC, Houston', u'Verenigde Staten van Amerika', u'blabla useless data', u'Toon nevenvestigingen', u']']
You can refactor it a little to get closer to what you're looking for, but I hope this points you in the right direction.
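One possible way to do that refactoring, offered as a rough sketch rather than as part of the function above: walk only the "wrapper" div, collect the text fragments, and close a record each time an `<hr/>` is reached. The function name `extract_branches` is made up for this example, and the sketch assumes three things that hold for the sample but may not hold for other pages: the first line of a record is "name, city", the second line is the country, and the company name itself contains no comma. It is written for Python 3 and uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened copy of the sample from the question (outer divs omitted).
HTML = """<div class="field-content"><div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
</div>
<a class="toggle" href="#">Toon nevenvestigingen</a></div>"""


def extract_branches(html):
    """Return (names, countries) for the <hr/>-separated records (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", class_="wrapper")
    names, countries = [], []
    record = []  # non-empty text fragments seen since the last <hr/>
    for node in wrapper.descendants:
        if isinstance(node, Tag) and node.name == "hr":
            # An <hr/> closes the current record.
            if len(record) >= 2:
                # Assumption: everything before the first comma is the name.
                names.append(record[0].split(",", 1)[0].strip())
                countries.append(record[1])
            record = []
        elif isinstance(node, NavigableString) and node.strip():
            record.append(node.strip())
    return names, countries


names, countries = extract_branches(HTML)
print(names)
print(countries)
```

Splitting on the first comma keeps "Petaling Jaya, Selangor" together as the location. If a company name can contain a comma, that split needs a different rule.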
Answered 2018-06-22 07:01:03
What format does the data need to be in? I tried to analyze it a bit.
# coding: utf-8
from __future__ import unicode_literals

from bs4 import BeautifulSoup
from bs4 import NavigableString, Tag

html = """<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
</div>
<a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>"""

if __name__ == "__main__":
    soup = BeautifulSoup(html, "lxml")
    companies = []
    # Every record ends with an <hr/>, so start at each <hr/> and walk
    # backwards through its siblings until the previous <hr/> is reached.
    for child in soup.find("div", class_="wrapper hidden").contents:
        if isinstance(child, Tag) and child.name == "hr":
            siblings = []
            previous = child.previous_sibling
            while previous is not None:
                if (isinstance(previous, Tag) and previous.name != "hr") or isinstance(previous, NavigableString):
                    siblings.append(previous)
                    previous = previous.previous_sibling
                else:
                    break
            # The nodes were collected back to front, so reverse them.
            companies.append(siblings[::-1])
            print(siblings[::-1])

https://stackoverflow.com/questions/50969205