Q: Beautiful Soup's extract() goes wrong

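Stack Overflow user

Asked 2009-05-12 22:00:30

Answers: 2, Views: 2.1K, Followers: 0, Votes: 0

I'm working on some screen-scraping software and have run into an issue with Beautiful Soup. I'm using Python 2.4.3 and Beautiful Soup 3.0.7a.

I need to remove an <hr> tag, but it can have many different attributes, so a simple replace() call won't remove it.

Given the following HTML: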

<h1>foo</h1>
<h2><hr/>bar</h2>

and the following code:

soup = BeautifulSoup(string)

# Remove every <hr> tag, whatever attributes it carries.
for tag in soup.findAll('hr'):
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i
    print i.string

The output is:

<h1>foo</h1>
foo
<h2>bar</h2>
None

Am I misunderstanding the extract() function, or is this a bug in Beautiful Soup?

python

beautifulsoup

2 Answers

Stack Overflow user

Accepted answer

Posted 2009-05-12 23:58:13

It may be a bug. Fortunately, there is another way to get the string:

from BeautifulSoup import BeautifulSoup

string = \
"""<h1>foo</h1>
<h2><hr/>bar</h2>"""

soup = BeautifulSoup(string)

# Remove every <hr> tag, whatever attributes it carries.
for tag in soup.findAll('hr'):
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i, i.next

# <h1>foo</h1> foo
# <h2>bar</h2> bar
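
For readers on the newer bs4 package (not the Beautiful Soup 3.0.7a release used in the question), here is a minimal sketch of the same steps. It assumes bs4 is installed and uses the standard-library html.parser backend. In bs4, .string is evaluated when it is accessed, so it should reflect the tree after the <hr> tag is gone, and get_text() does not rely on .string at all. The names html and results are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<h1>foo</h1>
<h2><hr/>bar</h2>"""

# Assumption: the standard-library parser backend; lxml would also work if installed.
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so removing tags while looping over it is safe.
for tag in soup.find_all("hr"):
    tag.decompose()  # like extract(), but discards the removed tag

# Collect (.string, get_text()) for each heading.
results = [(h.string, h.get_text()) for h in soup.find_all(["h1", "h2"])]
for string_value, text in results:
    print(string_value, text)
```

If .string is still unreliable in a given version, get_text() is the safer call, because it simply concatenates whatever text nodes remain under the tag.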

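Votes: 2

Stack Overflow user

Posted 2016-04-22 21:23:06

I have the same problem. I don't know why, but I suspect it has to do with the empty elements that BS creates.

For example, if I have the following code: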

from bs4 import BeautifulSoup

html ='            \
<a>                \
    <b test="help">            \
        hello there!  \
        <d>        \
        now what?  \
        </d>    \
        <e>        \
            <f>        \
            </f>    \
        </e>    \
    </b>        \
    <c>            \
    </c>        \
</a>            \
'

soup = BeautifulSoup(html,'lxml')
#print(soup.find('b').attrs)

print(soup.find('b').contents)

t = soup.find('b').findAll()
#t.reverse()
for c in t:
    gb = c.extract()

print(soup.find('b').contents)

soup.find('b').text.strip()

I get the following error:

'NoneType' object has no attribute 'next_element'

On the first print I get:

>>> print(soup.find('b').contents)
[u' ', <d> </d>, u' ', <e> <f> </f> </e>, u' ']

On the second print I get:

>>> print(soup.find('b').contents)
[u' ', u' ', u' ']

I'm pretty sure it is the empty elements in the middle that are causing the problem.

A workaround I found is to recreate the soup:

soup = BeautifulSoup(str(soup), 'lxml')  # same parser as the first parse
soup.find('b').text.strip()

Now it prints:

>>> soup.find('b').text.strip()
u'hello there!'
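
One possible cause, offered as an assumption rather than a confirmed diagnosis: findAll() with no arguments returns every descendant tag, so <f> gets extracted after its parent <e> has already left the tree. The sketch below avoids that by extracting only the direct children of <b>. It assumes bs4 with the standard-library html.parser backend (the code above used lxml) and a compacted version of the same markup. The name remaining_text is illustrative.

```python
from bs4 import BeautifulSoup

# Compact version of the markup from the example above.
html = '<a><b test="help">hello there!<d>now what?</d><e><f></f></e></b><c></c></a>'

soup = BeautifulSoup(html, "html.parser")
b = soup.find("b")

# recursive=False limits the search to direct child tags of <b>, so nested
# tags such as <f> leave the tree together with their parent <e>.
for child in b.find_all(recursive=False):
    child.extract()

# strip=True trims each remaining text node and skips empty ones.
remaining_text = b.get_text(strip=True)
print(remaining_text)
```

Each tag is detached exactly once this way, so the soup does not need to be rebuilt from its string form.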

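I hope this helps.

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/855087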
