I tried extracting text from an HTML page using the usual Beautiful Soup approach, following the code from another SO answer.
import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)

This extracts the text correctly for most pages. However, on some pages, such as the one above, I get line breaks between words that belong to the same paragraph.
Result:

\nAt Orizon, we use our extensive consulting, management, technology and\nengineering capabilities to design, develop,\ntest, deploy, and sustain business and mission-critical solutions to government\nclients worldwide.\nBy using proven management and technology deployment\npractices, we enable our clients to respond faster to opportunities,\nachieve more from their operations, and ultimately exceed\ntheir mission requirements.\nWhere\nconverge\nTechnology & Innovation\n© Copyright 2019 Orizon Inc., All Rights Reserved.

The result has line breaks between "technology and" / "engineering", "develop," / "test", and so on. These are all part of the same paragraph. If we look at the HTML source, it is correct:
<p>
At Orizon, we use our extensive consulting, management, technology and
engineering capabilities to design, develop,
test, deploy, and sustain business and mission-critical solutions to government
clients worldwide.
</p>
<p>
By using proven management and technology deployment
practices, we enable our clients to respond faster to opportunities,
achieve more from their operations, and ultimately exceed
their mission requirements.
</p>

What causes this, and how can I extract the text accurately?
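What is happening: the HTML source wraps long paragraphs across physical lines, and `get_text()` preserves those newlines verbatim. A minimal stdlib-only sketch of the symptom and the whitespace-collapsing fix (the sample string is taken from the page's first paragraph):

```python
# get_text() returns the paragraph with the source's hard line breaks
# still embedded; str.split() with no argument splits on any run of
# whitespace, so re-joining with single spaces flattens the paragraph.
raw = ("At Orizon, we use our extensive consulting, management, technology and\n"
       "engineering capabilities to design, develop,\n"
       "test, deploy, and sustain business and mission-critical solutions to government\n"
       "clients worldwide.")
flattened = " ".join(raw.split())
print(flattened)  # one line, no embedded newlines
```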
Posted on 2019-05-04 08:32:57
Rather than splitting the text on every line, split it per HTML tag, since you want to remove the line breaks within the text of each paragraph and each heading.

You can do this by iterating over all elements of interest (I included p, h2, and h1, but you can extend the list). For each element, strip out any newlines, then append a newline at the end of the element so that a line break is created before the next element.

Here is a working implementation:
import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# put text inside paragraphs and titles on a single line
for p in soup(['h1', 'h2', 'p']):
    p.string = " ".join(p.text.split()) + '\n'

text = soup.text
# remove duplicate newlines in the text
text = '\n\n'.join(x for x in text.splitlines() if x.strip())
print(text)

Sample output:
login
About Us
At Orizon, we use our extensive consulting, management, technology and engineering capabilities to design, develop, test, deploy, and sustain business and mission-critical solutions to government clients worldwide.
By using proven management and technology deployment practices, we enable our clients to respond faster to opportunities, achieve more from their operations, and ultimately exceed their mission requirements.

If you don't want blank lines between paragraphs/headings, use:
text = '\n'.join(x for x in text.splitlines() if x.strip())

Posted on 2019-05-04 07:51:41
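The difference between the two join variants above can be checked offline; a small sketch with a made-up sample string:

```python
# splitlines() + a truthiness filter drops blank lines; the join
# separator then decides whether paragraphs are packed one per line
# or separated by a blank line.
text = "login\n\nAbout Us\n\nFirst paragraph.\n\n\nSecond paragraph.\n"
compact = '\n'.join(x for x in text.splitlines() if x.strip())
spaced = '\n\n'.join(x for x in text.splitlines() if x.strip())
print(compact)  # one item per line, no blank lines
print(spaced)   # items separated by a blank line
```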
If you only want the content inside a paragraph tag, try the following:
paragraph = soup.find('p').getText()

https://stackoverflow.com/questions/55980523
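Note that `find('p')` returns only the first matching tag. If the goal is the text of every paragraph, `find_all` can be used instead; a minimal sketch with a made-up HTML string:

```python
from bs4 import BeautifulSoup

html = "<p>First paragraph.</p><p>Second paragraph.</p>"
soup = BeautifulSoup(html, "html.parser")

# find('p') returns only the first <p>; find_all('p') returns all of them
first = soup.find('p').get_text()
all_paragraphs = [p.get_text() for p in soup.find_all('p')]
print(first)            # First paragraph.
print(all_paragraphs)   # ['First paragraph.', 'Second paragraph.']
```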