
Getting clean text from a BeautifulSoup page

Stack Overflow user
Asked 2021-07-02 05:34:50
Answers: 1 · Views: 36 · Followers: 0 · Votes: 0

I'm building a web scraper to find job descriptions from a URL. Here is my current code:

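import requests
from bs4 import BeautifulSoup


def getJobDesc(url):
    """Return the job-description element for the page at url, or "" on a request failure."""
    try:
        req = requests.get(url)
        page = BeautifulSoup(req.text, 'html.parser')
        # The description sits in the div marked data-automation="jobDescription".
        jd = page.find("div", {"data-automation": "jobDescription"})
        return jd
    except requests.RequestException:
        # Catch only network/HTTP errors instead of using a bare except.
        return ""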

So it does what it's supposed to do, and the jd from a test URL looks like this:

<div class="vDEj0_0" data-automation="jobDescription"><span class="FYwKg _2Bz3E C6ZIU_0 _6ufcS_0 _2DNlq_0 _29m7__0"><div class="FYwKg"><p><strong>Job Responsibilities:</strong></p><ul><li><span style="color:black">Provide innovative solutions to complex business problems</span></li><li><span style="color:black">Plan, develop and implement large-scale projects from conception to completion</span></li><li><span style="color:black">Develop and architect lifecycle of projects working on different technologies and platforms</span></li><li><span style="color:black">Design, develop and implement new integration</span></li></ul><p><strong>Job Requirements:</strong></p><ul><li><span style="color:black">Proficient in Java and preferably in Python as well</span></li><li><span style="color:black">Basic understanding of database i.e MongoDB, MySQL databases is a plus</span></li><li><span style="color:black">Good understanding of </span><strong>Object-oriented</strong><span style="color:black"> programming</span></li><li><span style="color:black">Basic understanding in version control systems e.g. Git</span></li><li><span style="color:black">Basic understanding in Linux operating system</span></li><li><span style="color:black">Basic understanding of cloud services – Azure, AWS, etc</span></li><li><span style="color:black">Basic understanding of Devops</span></li><li><span style="color:black">A degree in Computer Science or equivalent industry experience</span></li><li><span style="color:black">Passionate with building elegant, scalable software that solves practical problems</span></li><li><span style="color:black">Team player and can do attitude</span></li><li><span style="color:black">Good problem solving skills and attention to detail</span></li></ul></div></span></div>

However, when I change it to return jd.text, the result is:

'Job Responsibilities:Provide innovative solutions to complex business problemsPlan, develop and implement large-scale projects from conception to completionDevelop and architect lifecycle of projects working on different technologies and platformsDesign, develop and implement new integrationJob Requirements:Proficient in Java and preferably in Python as wellBasic understanding of database i.e MongoDB, MySQL databases is a plusGood understanding of\xa0Object-oriented\xa0programmingBasic understanding in version control systems e.g. GitBasic understanding in Linux operating systemBasic understanding of cloud services – Azure, AWS, etcBasic understanding of DevopsA degree in Computer Science or equivalent industry experiencePassionate with building elegant, scalable software that solves practical problemsTeam player and can do attitudeGood problem solving skills and attention to detail'

So I have two problems:

  1. The lists are not converted correctly.
  2. Formatted text (the word Object-oriented in this case) is not parsed correctly.

1 Answer

Stack Overflow user

Accepted answer

Answered 2021-07-02 05:41:57

You can use the get_text() method and pass a space as the separator= argument to "un-nest" the text.

So instead of:

return jd.text

Use:

return jd.get_text(separator=" ")

You can also use:

jd.get_text(separator="\n")

to output the text on separate lines.
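As a minimal sketch of the difference between the three calls, the snippet below runs them on a shortened, hypothetical version of the markup from the question (the `html` string is made up for illustration and is not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the job-description markup in the question.
html = (
    '<div data-automation="jobDescription">'
    "<p><strong>Job Requirements:</strong></p>"
    "<ul><li><span>Proficient in Java</span></li>"
    "<li><span>Basic understanding of Devops</span></li></ul>"
    "</div>"
)

jd = BeautifulSoup(html, "html.parser").find("div", {"data-automation": "jobDescription"})

plain = jd.text                       # text nodes joined with no separator
spaced = jd.get_text(separator=" ")   # text nodes joined with a single space
lines = jd.get_text(separator="\n")   # one text node per line

print(repr(plain))
print(repr(spaced))
print(lines)
```

With `.text` the list items run together, while a separator keeps each text node distinct. Note that the separator is inserted between every text node, so inline tags such as `<strong>` inside a sentence also introduce a break.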

(Note: I couldn't reproduce your second problem, but see whether this solves it.)
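If the second problem comes from the `\xa0` (non-breaking space) characters around the inline `<strong>` tag, one possible approach, which is not part of the answer above, is to extract text per `<p>` and `<li>` element and normalize whitespace within each one. The helper name `tidy_lines`, the `- ` bullet prefix, and the sample `html` are all assumptions made for this sketch, and it assumes no `<p>` is nested inside an `<li>`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the "Object-oriented" list item from the question.
html = (
    '<div data-automation="jobDescription">'
    "<p><strong>Job Requirements:</strong></p>"
    "<ul>"
    "<li><span>Good understanding of\xa0</span><strong>Object-oriented</strong>"
    "<span>\xa0programming</span></li>"
    "<li><span>Basic understanding of Devops</span></li>"
    "</ul></div>"
)


def tidy_lines(container):
    """Return one line per <p>/<li>, with list items prefixed by '- '."""
    lines = []
    for node in container.find_all(["p", "li"]):
        # str.split() with no arguments splits on any Unicode whitespace,
        # including \xa0, so rejoining with " " normalizes the spacing.
        text = " ".join(node.get_text().split())
        prefix = "- " if node.name == "li" else ""
        lines.append(prefix + text)
    return "\n".join(lines)


jd = BeautifulSoup(html, "html.parser").find("div", {"data-automation": "jobDescription"})
print(tidy_lines(jd))
```

Because each element's text is taken as a whole, the inline `<strong>` stays on the same line as the rest of its sentence.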

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/68220065
