
Format paragraph text in HTML as a single line

Stack Overflow user
Asked 2019-05-04 07:39:29
2 answers · 542 views · 0 followers · 0 votes

I tried extracting the text from an HTML page with the usual Beautiful Soup approach, following the code from another SO answer.

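import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
# Python 3: urlopen lives in urllib.request
html = urllib.request.urlopen(url).read()
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")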

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

This extracts the text correctly for most pages. On some pages, however, such as the one above, I get newlines between words inside a paragraph.

Result:

\nAt Orizon, we use our extensive consulting, management, technology and\nengineering capabilities to design, develop,\ntest, deploy, and sustain business and mission-critical solutions to government\nclients worldwide.\nBy using proven management and technology deployment\npractices, we enable our clients to respond faster to opportunities,\nachieve more from their operations, and ultimately exceed\ntheir mission requirements.\nWhere\nconverge\nTechnology & Innovation\n© Copyright 2019 Orizon Inc., All Rights Reserved.\n>'

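The result has a newline between "technology and" and "engineering", between "develop," and "test", and so on.

All of this text belongs to the same paragraph.

If we look at it in the HTML source, it is fine: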

<p>
            At Orizon, we use our extensive consulting, management, technology and 
            engineering capabilities to design, develop, 
        test, deploy, and sustain business and mission-critical solutions to government 
            clients worldwide. 
    </p>
    <p>
            By using proven management and technology deployment 
            practices, we enable our clients to respond faster to opportunities, 
            achieve more from their operations, and ultimately exceed 
            their mission requirements.
    </p>

What is the reason for this? How can I extract the text accurately?


2 Answers

Stack Overflow user

Answered 2019-05-04 08:32:57

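Rather than splitting the text per line, split it per HTML tag, because for each paragraph and each title you want the line breaks inside the text removed.

You can do that by iterating over all elements of interest (I included p, h2 and h1, but you can extend the list). For each element, strip out any newlines, then append a newline to the end of the element to create a line break before the next element.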
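The "strip out any newlines" step comes down to whitespace normalization, which can be seen in isolation. This is a minimal sketch using only built-in string methods; `collapse_whitespace` and `raw` are names made up for this illustration, and the sample text is shortened from the page in the question.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs and newlines with a single space."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace, so joining with " " normalizes it.
    return " ".join(text.split())


# Text as it appears inside the <p> element in the HTML source:
# indented, and wrapped across several lines.
raw = """
            At Orizon, we use our extensive consulting, management, technology and
            engineering capabilities to design, develop,
        test, deploy, and sustain business and mission-critical solutions
"""

print(collapse_whitespace(raw))
```

Browsers apply a similar collapse when rendering ordinary paragraphs, which is why the page looks fine on screen while the extracted text keeps the line breaks from the source.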

Here is a working implementation:

import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# put text inside paragraphs and titles on a single line
for p in soup(['h1','h2','p']):
    p.string = " ".join(p.text.split()) + '\n'

text = soup.text
# remove duplicate newlines in the text
text = '\n\n'.join(x for x in text.splitlines() if x.strip())

print(text)

Sample output:

login

About Us

At Orizon, we use our extensive consulting, management, technology and engineering capabilities to design, develop, test, deploy, and sustain business and mission-critical solutions to government clients worldwide.

By using proven management and technology deployment practices, we enable our clients to respond faster to opportunities, achieve more from their operations, and ultimately exceed their mission requirements.

If you don't want blank lines between paragraphs/titles, use:

text = '\n'.join(x for x in text.splitlines() if x.strip())
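To try the approach without a network request, the same steps can be wrapped in a function and run on an inline document. This is only a sketch: `extract_text`, its `separator` parameter and `SAMPLE_HTML` are hypothetical names introduced here, and the sample markup is a shortened stand-in for the real page. Unlike the code above, it also strips each output line, on the assumption that indentation left over from the markup is unwanted.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page in the question.
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body>
  <h1>About Us</h1>
  <p>
      At Orizon, we use our extensive consulting, management, technology and
      engineering capabilities to design, develop,
  test, deploy, and sustain solutions.
  </p>
  <p>
      By using proven management and technology deployment
      practices, we enable our clients to respond faster.
  </p>
</body></html>
"""


def extract_text(html, separator="\n\n"):
    """Return the visible text with each heading/paragraph on one line."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop script and style elements so their contents are not extracted.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Collapse whitespace inside each element of interest and end it
    # with a newline so that the next element starts on a new line.
    for element in soup(["h1", "h2", "p"]):
        element.string = " ".join(element.get_text().split()) + "\n"

    # Keep non-blank lines only, stripping indentation left by the markup.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return separator.join(line for line in lines if line)


print(extract_text(SAMPLE_HTML))                  # blank line between blocks
print(extract_text(SAMPLE_HTML, separator="\n"))  # one block per line
```

Passing the separator as an argument covers both variants shown above without editing the function body.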
Votes: 2

Stack Overflow user

Answered 2019-05-04 07:51:41

If you only want the content inside the paragraph tags, try the following:

paragraph = soup.find('p').getText()
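Note that `find('p')` returns only the first `<p>` element, and `getText()` keeps the line breaks from the source. A hedged sketch of how this idea might be extended to every paragraph, with whitespace collapsed, is shown below; the inline `html` string and the `paragraphs` name are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs, each wrapped across lines.
html = "<p>first\n   paragraph</p><div><p>second\n paragraph</p></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching element, not just the first one.
# Splitting and re-joining collapses the newlines inside each paragraph.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

print("\n".join(paragraphs))
```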
Votes: -1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/55980523
