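I'm using BeautifulSoup to parse a scraped web page and, as usual, the page's regular formatting has some odd exceptions.
What I have so far is a table: I collect all of its rows into rows and, for each row, all of its cells (the <td> elements) into cols, then take the plain text of each element for later use.
It looks like this: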
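from bs4 import BeautifulSoup

def parse_table(html):  # illustrative wrapper; the original snippet showed only the body
    soup = BeautifulSoup(html)
    table = soup.find("table", {"class": "election"})
    rows = table.findAll("tr")
    data = []
    for row in rows:
        cols = row.findAll('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values
    return data

The problem is that sometimes one of the <td>s contains several <li>s, which I'd like to replace with \n. Right now, using ele's .text attribute strips all markup, including the <li> tags.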
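My question is: is it possible to use .text in a way that keeps only specific tags? I know I could convert ele to a string first, but then I wouldn't have Beautiful Soup automatically stripping out all the other ugly tags.
Here is a sample of the HTML where a <td> contains <li>s: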
<td> November General Election Day.Scheduled Elections:
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li>
<li>County Offices</li>
<li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
</td>

Right now, my code outputs:
u'November General Election Day.Scheduled Elections:Federal, Statewide, Legislative and Judicial OfficesCounty OfficesInitiatives and Constitutional Amendments, if applicable'

I'd like it to look more like:
u'November General Election Day.Scheduled Elections:\nFederal, Statewide, Legislative and Judicial Offices\nCounty Offices\nInitiatives and Constitutional Amendments, if applicable'

Answered 2014-09-16 13:27:43
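I'm still not sure what the motivation behind the question is, but here is my idea.
Find all of the li tags and insert() a newline character at the beginning of each one's contents.
Working example (I added some other tags to the td to demonstrate the behavior):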
from bs4 import BeautifulSoup
data = """
<td> November General Election Day.Scheduled Elections:
<b>My Test String </b>
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
<p>New Paragraph</p>
</td>
"""
soup = BeautifulSoup(data, 'html.parser')
for element in soup.td.find_all('li'):
    element.insert(0, '\n')  # put a newline at the start of each <li>
print(soup.td.text)

Prints:
November General Election Day.Scheduled Elections:
My Test String
Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable
New Paragraph

Here is how to apply the solution in your case:
from bs4 import BeautifulSoup
html = """
<table class="election">
<tr>
<td> November General Election Day.Scheduled Elections:
<b>My Test String </b>
<ul class="vtips">
<li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
</ul>
<p>New Paragraph</p>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"class": "election"})
rows = table.find_all("tr")
data = []
for row in rows:
    # insert a newline at the start of every <li> inside this row's cells
    for element in row.select('td li'):
        element.insert(0, '\n')
    data.append([ele.text.strip() for ele in row('td')])
print(data)

Source: https://stackoverflow.com/questions/25869533
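If this needs to be reused, the same idea could be wrapped in a small function. The following is only a sketch under a few assumptions: the function name table_rows_with_li_breaks and its table_class parameter are hypothetical, the empty-cell filter is carried over from the question's code rather than from the answer, and html.parser is chosen simply because it ships with Python.

```python
from bs4 import BeautifulSoup


def table_rows_with_li_breaks(html, table_class="election"):
    """Return one list of non-empty cell texts per table row.

    Hypothetical helper: a newline is inserted at the start of every <li>
    so list items stay on separate lines once the markup is stripped.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": table_class})
    if table is None:
        # No matching table on the page: return an empty result.
        return []
    data = []
    for row in table.find_all("tr"):
        # Modify the tree first, then read the text of each cell.
        for li in row.select("td li"):
            li.insert(0, "\n")
        cells = [td.get_text().strip() for td in row.find_all("td")]
        # Drop empty cells, as the question's original loop did.
        data.append([text for text in cells if text])
    return data


sample = """
<table class="election">
<tr><td>Scheduled Elections:<ul class="vtips"><li>County Offices</li><li>Initiatives</li></ul></td><td></td></tr>
</table>
"""
rows = table_rows_with_li_breaks(sample)
print(rows)
```

Note that the newline is added to the parsed tree itself, so the soup object is modified in place. If the original markup is needed afterwards, parse the HTML a second time.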