Can I modify the text inside a BeautifulSoup tag without converting it to a string?

Stack Overflow user

Asked 2014-09-16 13:02:33

Answers: 1 | Views: 4.2K | Followers: 0 | Votes: 3

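I'm using BeautifulSoup to parse a scraped web page and, as usual, there are some odd exceptions to the page's general formatting.

What I have so far is a table: I've collected all of its rows into rows and all of the columns into cols (which holds all the <td>s), and then I extract the plain text from each element for later use.

It looks like this: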

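from bs4 import BeautifulSoup

# The snippet ends in `return`, so it is shown here as a function body;
# the function name is illustrative only.
def parse_table(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("table", {"class": "election"})
    rows = table.find_all("tr")
    data = []

    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values

    return data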

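The problem is that sometimes one of the <td>s contains several <li>s, which I would like to replace with \n. Right now, using the .text attribute of ele removes all tags, including the <li>s.

My question: is it possible to use .text in a way that preserves only specific tags? I know I could convert ele to a string first, but then I can't have BeautifulSoup automatically strip out all the other ugly tags.

Here is an example of the HTML where a <td> contains <li>s: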

<td> November General Election Day.Scheduled Elections:
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li>
        <li>County Offices</li>
        <li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
</td>

Right now, my code outputs:

u'November General Election Day.Scheduled Elections:Federal, Statewide, Legislative and Judicial OfficesCounty OfficesInitiatives and Constitutional Amendments, if applicable'

I would like it to look more like:

u'November General Election Day.Scheduled Elections:\nFederal, Statewide, Legislative and Judicial Offices\nCounty Offices\nInitiatives and Constitutional Amendments, if applicable'

Tags: python, html, beautifulsoup

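1 Answer

Stack Overflow user

Accepted answer

Posted 2014-09-16 13:27:43

I'm still not sure I understand the motivation behind the question, but here's the idea.

Find all of the li tags and insert() a newline character at the beginning of each one's contents.

Working example (I've added a few other tags to the td to demonstrate the behavior):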

from bs4 import BeautifulSoup

data = """
<td> November General Election Day.Scheduled Elections:
    <b>My Test String </b>
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
    <p>New Paragraph</p>
</td>
"""

soup = BeautifulSoup(data, 'html.parser')
for element in soup.td.find_all('li'):
    element.insert(0, '\n')

print(soup.td.text)

Prints:

November General Election Day.Scheduled Elections:
    My Test String 


Federal, Statewide, Legislative and Judicial Offices
County Offices
Initiatives and Constitutional Amendments, if applicable

New Paragraph
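
The printed text still carries blank lines and indentation from the markup. If that matters, here is a minimal sketch of tidying it up, assuming you want one non-empty, stripped line per entry. `text_with_li_breaks` is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

def text_with_li_breaks(tag):
    """Return the tag's text with each <li> starting on its own line.

    Hypothetical helper: it modifies the tree in place by inserting
    a newline string at the start of every <li>.
    """
    for li in tag.find_all('li'):
        li.insert(0, '\n')  # same trick as above: newline before the item's text
    # Strip each line and drop the ones that end up empty
    lines = (line.strip() for line in tag.get_text().splitlines())
    return '\n'.join(line for line in lines if line)

sample = """
<td> November General Election Day.Scheduled Elections:
    <b>My Test String </b>
    <ul class="vtips">
        <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
    </ul>
    <p>New Paragraph</p>
</td>
"""

soup = BeautifulSoup(sample, 'html.parser')
print(text_with_li_breaks(soup.td))
```

Because the helper inserts into the parsed tree, call it once per tag; calling it again on the same tag would add another newline to each li.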

Here is how to apply the solution in your case:

from bs4 import BeautifulSoup

html = """
<table class="election">
    <tr>
        <td> November General Election Day.Scheduled Elections:
            <b>My Test String </b>
            <ul class="vtips">
                <li>Federal, Statewide, Legislative and Judicial Offices</li><li>County Offices</li><li>Initiatives and Constitutional Amendments, if applicable</li>
            </ul>
            <p>New Paragraph</p>
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"class": "election"})
rows = table.find_all("tr")

data = []
for row in rows:
    for element in row.select('td li'):
        element.insert(0, '\n')
    data.append([ele.text.strip() for ele in row('td')])

print(data)
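
A possible alternative that leaves the tree untouched, offered as a sketch rather than as part of the approach above: `get_text()` accepts a separator string and a `strip` flag, so every text node in a cell can be joined with `\n`. This assumes it is acceptable for all text nodes, not only the `<li>` items, to land on their own lines; text inside inline tags such as `<b>` would be split off as well. The sample HTML is the cell from the question, wrapped in a table for illustration:

```python
from bs4 import BeautifulSoup

html = """
<table class="election">
    <tr>
        <td> November General Election Day.Scheduled Elections:
            <ul class="vtips">
                <li>Federal, Statewide, Legislative and Judicial Offices</li>
                <li>County Offices</li>
                <li>Initiatives and Constitutional Amendments, if applicable</li>
            </ul>
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'election'})

data = []
for row in table.find_all('tr'):
    # Join the stripped text nodes of each cell with a newline
    cols = [td.get_text('\n', strip=True) for td in row.find_all('td')]
    data.append([col for col in cols if col])  # drop empty cells

print(data)
```

For this input, the single cell comes out as the heading followed by the three list items, each separated by `\n`, which matches the output the question asks for.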

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/25869533