Q: Unable to get <span></span> text

Stack Overflow user

Asked 2018-06-01 08:10:01

3 answers, 166 views, 0 followers, 2 votes

I am unable to get the span text inside the "table". Thanks!

from bs4 import BeautifulSoup
import urllib2

url1 = "url"

content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1,"lxml")
table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})
rows = table.find_all('span',recursive=False)
for row in rows:
    print(row.text)

python

beautifulsoup

3 Answers

Stack Overflow user

Accepted answer

Answered 2018-06-01 08:41:36

table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})

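In the line above, findAll() returns a list of matching elements (a ResultSet), not a single element. The next line therefore raises an error, because find_all() has to be called on a single Tag, not on the list.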
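To see the failure in isolation, here is a minimal sketch. It assumes a tiny, made-up HTML snippet in place of the real page (the markup and the "Race 1" text are hypothetical) and uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html_doc = (
    '<div class="iw_component" id="c1417094965154">'
    "<p><span>Race 1</span></p>"
    "</div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# find_all() (alias: findAll) returns a list-like ResultSet of Tag objects.
matches = soup.find_all("div", {"class": "iw_component", "id": "c1417094965154"})

# Calling find_all() on the list itself fails with AttributeError.
error_message = None
try:
    matches.find_all("span")
except AttributeError as exc:
    error_message = str(exc)
    print("AttributeError:", error_message)

# Indexing into the list yields a single Tag, which does support find_all().
first_div = matches[0]
span_texts = [span.get_text() for span in first_div.find_all("span")]
print(span_texts)
```

The same idea applies to the original script: pick one element out of the result before searching inside it.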

If you expect only one table, just replace

rows = table.find_all('span',recursive=False)

with

rows = table[0].find_all('span')

If you expect several tables in the page, run a for loop over table and execute the remaining statements inside that loop.
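A rough sketch of that loop, assuming a made-up page with two matching div elements (the markup and the span texts are hypothetical, and "html.parser" is used only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching containers; "\t" is a real tab here.
html_doc = """
<div class="iw_component"><span>first\t</span><span>second</span></div>
<div class="iw_component"><span>third</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns every matching div, so iterate instead of indexing [0].
tables = soup.find_all("div", {"class": "iw_component"})

texts_per_table = []
for table in tables:
    # Collect the span texts of this one div, with tab characters removed.
    texts = [span.get_text().replace("\t", "") for span in table.find_all("span")]
    texts_per_table.append(texts)

print(texts_per_table)
```

Each inner list holds the span texts of one div, so the results stay grouped by table.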

Also, for cleaner output, you can strip the tab characters from each row, as in the following code:

row = row.get_text()
row = row.replace('\t', '')
print(row)

Your final working code is:

from bs4 import BeautifulSoup
import urllib2

url1 = "url"

content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1,"lxml")
table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})
rows = table[0].find_all('span')
for row in rows:
    row_str = row.get_text()
    row_str = row_str.replace('\t', '')
    print(row_str)

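As for the recursive=False argument: when it is set to False, find_all() only looks at the direct children of the element, which in your case gives no results because the span elements are nested deeper.

The recursive argument in find():

"If you only want Beautiful Soup to consider direct children, you can pass in recursive=False."

Votes: 2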
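The difference is easy to check on a small example. The snippet below is a sketch over made-up markup (the id "box" and the span texts are hypothetical), with one span as a direct child of the div and one nested inside a p element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span is a direct child, the other is nested in <p>.
html_doc = '<div id="box"><span>direct</span><p><span>nested</span></p></div>'
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find("div", id="box")

# recursive=False restricts the search to the div's direct children.
direct_only = [s.get_text() for s in div.find_all("span", recursive=False)]

# The default (recursive=True) searches all descendants.
all_spans = [s.get_text() for s in div.find_all("span")]

print(direct_only)
print(all_spans)
```

If every span in the target div is wrapped in another element, the recursive=False call returns an empty list, which matches the behaviour described in the question.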

Stack Overflow user

Answered 2018-06-01 08:30:18

You seem to be using Python 2.x. Here is a Python 3.x solution, since I do not have a Python 2.x environment at hand:

from bs4 import BeautifulSoup
import urllib.request as urllib


url1 = "<URL>"

# Read the HTML page
content1 = urllib.urlopen(url1).read()
soup = BeautifulSoup(content1, "lxml")

# Find the div (there is only one, so you do not need findAll) -> this is your problem
div = soup.find("div", class_="iw_component", id="c1417094965154")
# Now you retrieve all the span within this div
rows = div.find_all("span")

# You can do what you want with it !
line = ""
for row in rows:
    row_str = row.get_text()
    row_str = row_str.replace('\t', '')
    line += row_str + ", "
print(line)

Votes: 1

Stack Overflow user

Answered 2018-06-01 08:46:52

Here is another approach that uses lxml instead of Beautiful Soup:

import requests
from lxml import html

req = requests.get("<URL>")
raw_html = html.fromstring(req.text)
spans = raw_html.xpath('//div[@id="c1417094965154"]//span/text()')
print("".join([x.replace("\t", "").replace("\r\n","").strip() for x in spans]))

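Output: Kranji Mile Day simulcast races, Kranji Racecourse, SINClass 3 Handicap - 1200M TURFSaturday, 26 May 2008Race 1, 5:15

As you can see, the output still needs some formatting. spans is a list containing the text of every span, so you can do whatever processing you need on it.

Votes: 1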
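One possible way to do that formatting is sketched below. It assumes spans is a plain list of strings like the one xpath() returns; the sample values and the " | " separator are made up for illustration and are not taken from the real page.

```python
# Hypothetical sample of what the xpath() call might return: raw text nodes
# with stray tabs, line breaks, and whitespace-only entries.
spans = [
    "\r\n\t\tKranji Mile Day simulcast races",
    "\r\n\tClass 3 Handicap - 1200M TURF",
    "   ",
    "Race 1",
]

# Collapse every run of whitespace (tabs, CR/LF, spaces) into a single space.
cleaned = [" ".join(text.split()) for text in spans]

# Drop entries that were whitespace only.
cleaned = [text for text in cleaned if text]

# Join with a visible separator so adjacent spans do not run together.
line = " | ".join(cleaned)
print(line)
```

Joining with a separator instead of an empty string keeps neighbouring span texts from fusing into one word.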

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/50638894
