Q: BeautifulSoup: <div class> <span class></span> <span class>TEXT I WANT</span>

Stack Overflow user

Asked 2013-07-12 19:10:13

3 answers, 31.1K views, 0 followers, 5 votes

I am trying to use BeautifulSoup to extract the string contained in the span with id="titleDescriptionID".

<div class="itemText">
    <div class="wrapper">
        <span class="itemPromo">Customer Choice Award Winner</span>
        <a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details" >
            <span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz &#40;3.9GHz Turbo&#41; LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
            <span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K Ivy Bridge 3.5GHz &#40;3.9GHz Turbo&#41; LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
        </a>
    </div>

Code snippet:

f = open('egg.data', 'rb')
content = f.read()
content = content.decode('utf-8', 'replace')
content = ''.join([x for x in content if ord(x) < 128])

soup = bs(content)

for itemText in soup.find_all('div', attrs={'class':'itemText'}):
    wrapper = itemText.div
    wrapper_href = wrapper.a
    for child in wrapper_href.descendants:
        if child['id'] == 'titleDescriptionID':
           print(child, "\n")

Traceback:

Traceback (most recent call last):
  File "egg.py", line 66, in <module>
    if child['id'] == 'titleDescriptionID':
TypeError: string indices must be integers

python

3 Answers

Stack Overflow user

Accepted answer

Answered 2013-07-12 19:13:22

spans = soup.find_all('span', attrs={'id':'titleDescriptionID'})
for span in spans:
    print(span.string)

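In your code, wrapper_href.descendants contains at least 4 elements: the 2 span tags and the 2 strings enclosed by those span tags. It walks the children recursively, so text nodes are yielded alongside tags.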
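To make the point about `.descendants` concrete, here is a minimal sketch that lists the node types it yields. It assumes the `bs4` package (BeautifulSoup 4) and the built-in `html.parser`; the `HTML` string is a shortened stand-in for the markup in the question, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup: one link wrapping two spans, no whitespace between tags.
HTML = (
    '<a href="#">'
    '<span id="titleDescriptionID">Title text</span>'
    '<span id="lineDescriptionID">Line text</span>'
    '</a>'
)

soup = BeautifulSoup(HTML, "html.parser")
link = soup.a

# .descendants walks the subtree in document order, yielding tags AND text nodes.
kinds = [type(node).__name__ for node in link.descendants]
print(kinds)
```

With real, indented markup the list would also contain whitespace-only strings, which is why the count is "at least" four.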

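Votes: 15

Stack Overflow user

Answered 2013-07-12 19:14:42

wrapper_href.descendants includes any NavigableString objects, and that is what you are tripping over. A NavigableString is essentially a string object, and the child['id'] line tries to index it with a string key: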

>>> next(wrapper_href.descendants)
u'\n'
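If you do want to keep the loop over `.descendants`, one possible fix is to skip the text nodes before indexing. The sketch below assumes `bs4` with `html.parser`; `SAMPLE` and the helper name `title_spans` are illustrative, not from the original post.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample shaped like the markup in the question.
SAMPLE = """
<div class="itemText">
  <div class="wrapper">
    <a href="#">
      <span class="itemDescription" id="titleDescriptionID">Intel Core i7-3770K</span>
      <span class="itemDescription" id="lineDescriptionID">Intel Core i7-3770K</span>
    </a>
  </div>
</div>
"""


def title_spans(markup):
    """Collect the text of every descendant tag whose id is titleDescriptionID."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for item_text in soup.find_all("div", attrs={"class": "itemText"}):
        for child in item_text.div.a.descendants:
            # Only Tag objects carry attributes; NavigableString nodes are skipped.
            # .get() returns None instead of raising when the attribute is absent.
            if isinstance(child, Tag) and child.get("id") == "titleDescriptionID":
                found.append(child.get_text())
    return found


titles = title_spans(SAMPLE)
print(titles)
```

The `isinstance` check avoids the `TypeError`, and `child.get("id")` avoids a `KeyError` on tags that have no `id` attribute.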

Why not just use itemText.find('span', id='titleDescriptionID') to fetch the tag directly?

Demo:

>>> for itemText in soup.find_all('div', attrs={'class':'itemText'}):
...     print(itemText.find('span', id='titleDescriptionID'))
...     print(itemText.find('span', id='titleDescriptionID').text)
... 
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K

Votes: 2

Stack Overflow user

Answered 2013-07-12 19:21:02

from BeautifulSoup import BeautifulSoup
pool = BeautifulSoup(html) # where html contains the whole html as string

for item in pool.findAll('span', attrs={'id' : 'titleDescriptionID'}):
    print(item.string)

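When we search for a tag with BeautifulSoup, we get back a BeautifulSoup.Tag object, which can be used directly to access the tag's other properties, such as its inner content, style, href, and so on.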
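As a rough illustration of reading those properties from a tag, the sketch below uses the newer `bs4` package rather than the BeautifulSoup 3 import shown above (an assumption on my part), and a made-up `SNIPPET` with a placeholder URL.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the href is a placeholder, not the real product URL.
SNIPPET = (
    '<a href="http://example.com/item" title="View Details">'
    '<span class="itemDescription" id="titleDescriptionID" '
    'style="display:inline">Sample title</span></a>'
)

soup = BeautifulSoup(SNIPPET, "html.parser")
span = soup.find("span", id="titleDescriptionID")

inner_text = span.string             # the single text node inside the span
style = span["style"]                # dictionary-style attribute access
css_classes = span["class"]          # class is multi-valued, so bs4 returns a list
href = span.parent["href"]           # the enclosing <a> tag holds the href
missing = span.get("data-missing")   # .get() gives None for absent attributes

print(inner_text, style, css_classes, href, missing)
```

Indexing with `tag["name"]` raises `KeyError` when the attribute is absent, so `tag.get("name")` is usually the safer choice for optional attributes.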

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/17613606
