I tried extracting text from an HTML page using the usual Beautiful Soup approach, following the code from another SO answer.
import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)

This extracts the text correctly for most pages. However, on some pages, such as the one above, I get line breaks between words that belong to the same paragraph.
Result:

\nAt Orizon, we use our extensive consulting, management, technology and\nengineering capabilities to design, develop,\ntest, deploy, and sustain business and mission-critical solutions to government\nclients worldwide.\nBy using proven management and technology deployment\npractices, we enable our clients to respond faster to opportunities,\nachieve more from their operations, and ultimately exceed\ntheir mission requirements.\nWhere\nconverge\nTechnology & Innovation\n© Copyright 2019 Orizon Inc., All Rights Reserved.

The result has line breaks between "technology and" / "engineering", "develop," / "test", and so on. These are all part of the same paragraph. If we look at the HTML source, it is correct:
<p>
At Orizon, we use our extensive consulting, management, technology and
engineering capabilities to design, develop,
test, deploy, and sustain business and mission-critical solutions to government
clients worldwide.
</p>
<p>
By using proven management and technology deployment
practices, we enable our clients to respond faster to opportunities,
achieve more from their operations, and ultimately exceed
their mission requirements.
</p>

What causes this, and how can I extract the text accurately?
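What is happening: the HTML source wraps long paragraphs across physical lines, and `get_text()` preserves those newlines verbatim. A minimal stdlib-only sketch of the symptom and the whitespace-collapsing fix (the sample string is taken from the page's first paragraph):

```python
# get_text() returns the paragraph with the source's hard line breaks
# still embedded; str.split() with no argument splits on any run of
# whitespace, so re-joining with single spaces flattens the paragraph.
raw = ("At Orizon, we use our extensive consulting, management, technology and\n"
       "engineering capabilities to design, develop,\n"
       "test, deploy, and sustain business and mission-critical solutions to government\n"
       "clients worldwide.")
flattened = " ".join(raw.split())
print(flattened)  # one line, no embedded newlines
```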
Posted on 2019-05-04 08:32:57
Rather than splitting the text on every line, split it per HTML tag, since you want to remove the line breaks within the text of each paragraph and each heading.

You can do this by iterating over all elements of interest (I included p, h2, and h1, but you can extend the list). For each element, strip out any newlines, then append a newline at the end of the element so that a line break is created before the next element.

Here is a working implementation:
import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# put text inside paragraphs and titles on a single line
for p in soup(['h1', 'h2', 'p']):
    p.string = " ".join(p.text.split()) + '\n'

text = soup.text
# remove duplicate newlines in the text
text = '\n\n'.join(x for x in text.splitlines() if x.strip())
print(text)

Sample output:
login
About Us
At Orizon, we use our extensive consulting, management, technology and engineering capabilities to design, develop, test, deploy, and sustain business and mission-critical solutions to government clients worldwide.
By using proven management and technology deployment practices, we enable our clients to respond faster to opportunities, achieve more from their operations, and ultimately exceed their mission requirements.

If you don't want blank lines between paragraphs/headings, use:
text = '\n'.join(x for x in text.splitlines() if x.strip())

Posted on 2019-05-04 07:51:41
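The difference between the two join variants above can be checked offline; a small sketch with a made-up sample string:

```python
# splitlines() + a truthiness filter drops blank lines; the join
# separator then decides whether paragraphs are packed one per line
# or separated by a blank line.
text = "login\n\nAbout Us\n\nFirst paragraph.\n\n\nSecond paragraph.\n"
compact = '\n'.join(x for x in text.splitlines() if x.strip())
spaced = '\n\n'.join(x for x in text.splitlines() if x.strip())
print(compact)  # one item per line, no blank lines
print(spaced)   # items separated by a blank line
```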
If you only want the content inside a paragraph tag, try the following:
paragraph = soup.find('p').getText()

https://stackoverflow.com/questions/55980523
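Note that `find('p')` returns only the first matching tag. If the goal is the text of every paragraph, `find_all` can be used instead; a minimal sketch with a made-up HTML string:

```python
from bs4 import BeautifulSoup

html = "<p>First paragraph.</p><p>Second paragraph.</p>"
soup = BeautifulSoup(html, "html.parser")

# find('p') returns only the first <p>; find_all('p') returns all of them
first = soup.find('p').get_text()
all_paragraphs = [p.get_text() for p in soup.find_all('p')]
print(first)            # First paragraph.
print(all_paragraphs)   # ['First paragraph.', 'Second paragraph.']
```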