Extracting text from a bs4 element

Stack Overflow user
Asked 2018-06-21 21:00:49
Answers: 2 · Views: 98 · Followers: 0 · Votes: 2

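I am trying to extract some data from the bs4 element below (sample shown). Specifically, I want to build a loop that pulls out all the company names (and possibly the locations):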

    [<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
 <p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
 Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
 Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
 Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
 Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
 </div>
 <a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>]

The names are the "Hak Industrial ..." strings.

Desired output: two lists, such as

[Hak Industrial Services B.V., Hak Industrial Services B.V., Hak Industrial Services Middle East LLC, Hak Industrial Services SEA Sdn. Bhd., Hak Industrial Services USLLC]

[Nederland, Nederland, Verenigde Arabische Emiraten, Maleisië, Verenigde Staten van Amerika]

Does anyone know how to do this in bs4?

Thanks in advance,


2 Answers

Stack Overflow user

Answered 2018-06-21 21:41:48

I recently had to do something similar. I built a function to parse the HTML in emails. It looks like this:

from bs4 import BeautifulSoup as bs

def parser(data):
    # Parse the HTML from the ticket and collect its non-empty text fragments.
    parsed = bs(data, "lxml")
    strings = [line.strip() for line in parsed.stripped_strings]
    print(strings)
    return strings

Passing in the HTML gives output like the following (shown as printed by Python 2, hence the u'' prefixes):

代码语言:javascript
复制
[u'[', u'Nevenvestiging:', u'Hak Industrial Services B.V., Hoogeveen', u'Nederland', u'blabla useless data', u'Hak Industrial Services B.V., Nieuw Heeten', u'Nederland', u'blabla useless data', u'Hak Industrial Services Middle East LLC, Abu Dhabi', u'Verenigde Arabische Emiraten', u'blabla useless data', u'Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor', u'Maleisi\xeb', u'blabla useless data', u'Hak Industrial Services USLLC, Houston', u'Verenigde Staten van Amerika', u'blabla useless data', u'Toon nevenvestigingen', u']']

You can refactor it a bit to get closer to what you are looking for, but I hope this points you in the right direction.
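
One possible refactoring of that flat list into the two lists the question asks for, as a minimal sketch and not part of the original answer: drop the wrapper strings, then read the remaining fragments in groups of three. The helper name `split_records`, the `noise` set, and the assumption that every record has exactly three fragments (name and city, country, filler) are illustrative and only checked against the sample data above.

```python
def split_records(strings, record_size=3):
    """Group a flat list of stripped strings into (name, city, country) tuples."""
    # Assumption: these wrapper strings are specific to the sample page.
    noise = {"[", "]", "Nevenvestiging:", "Toon nevenvestigingen"}
    cleaned = [s for s in strings if s not in noise]

    records = []
    # Walk the cleaned list in fixed-size steps; an incomplete tail is ignored.
    for i in range(0, len(cleaned) - record_size + 1, record_size):
        name_and_city, country = cleaned[i], cleaned[i + 1]
        # Assumption: the company name is everything before the first comma.
        name, _, city = name_and_city.partition(",")
        records.append((name.strip(), city.strip(), country))
    return records


# The list printed by parser() above, written as plain Python 3 strings.
strings = [
    "[", "Nevenvestiging:",
    "Hak Industrial Services B.V., Hoogeveen", "Nederland", "blabla useless data",
    "Hak Industrial Services B.V., Nieuw Heeten", "Nederland", "blabla useless data",
    "Hak Industrial Services Middle East LLC, Abu Dhabi",
    "Verenigde Arabische Emiraten", "blabla useless data",
    "Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor",
    "Maleisië", "blabla useless data",
    "Hak Industrial Services USLLC, Houston",
    "Verenigde Staten van Amerika", "blabla useless data",
    "Toon nevenvestigingen", "]",
]

records = split_records(strings)
names = [name for name, _city, _country in records]
countries = [country for _name, _city, country in records]
print(names)
print(countries)
```

Splitting on the first comma works for this sample because none of the company names contain one; a name with an embedded comma would need a different rule.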

Votes: 0

Stack Overflow user

Answered 2018-06-22 07:01:03

What format does the data need to be in? I tried parsing it a little.

# coding: utf-8
from __future__ import unicode_literals
from bs4 import BeautifulSoup
from bs4 import NavigableString, Tag

html = """<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
 <p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
  Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
   Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
    Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
     Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
      </div>
       <a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>"""

if __name__ == "__main__":
    soup = BeautifulSoup(html, "lxml")
    # Each <hr/> closes one record, so walk backwards from it
    # until the previous <hr/> (or the start of the wrapper).
    for child in soup.find("div", class_="wrapper hidden").contents:
        if isinstance(child, Tag) and child.name == "hr":
            siblings = []
            previous = child.previous_sibling
            while previous is not None:
                if isinstance(previous, Tag) and previous.name == "hr":
                    break
                siblings.append(previous)
                previous = previous.previous_sibling
            # Nodes were collected back to front; restore document order.
            print(siblings[::-1])
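
A possible way to go from those per-`<hr/>` groups to the two lists in the question, as a sketch under a few assumptions and not the answerer's own code: walk the wrapper's descendants in document order, collect the non-empty text nodes, and close a record at each `<hr/>`. The function name `extract_names_and_countries`, the shortened two-record sample, the use of the built-in `html.parser` instead of `lxml`, and the rule that the name ends at the first comma are all choices made for this illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened sample with the same shape as the HTML in the question.
sample_html = """<div class="field-content"><div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
</div></div>"""


def extract_names_and_countries(html):
    # html.parser ships with Python; "lxml" should also work if installed.
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", class_="wrapper")

    names, countries = [], []
    record = []  # text fragments seen since the last <hr/>
    for node in wrapper.descendants:
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:
                record.append(text)
        elif isinstance(node, Tag) and node.name == "hr":
            # Assumption: fragment 0 is "name, city" and fragment 1 is the country.
            if len(record) >= 2:
                names.append(record[0].split(",")[0].strip())
                countries.append(record[1])
            record = []
    return names, countries


names, countries = extract_names_and_countries(sample_html)
print(names)
print(countries)
```

Using `descendants` rather than `contents` means the first record is picked up even though it is wrapped in a `<p>` while the others are bare text nodes.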
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/50969205
