
Q: Scraperwiki - python - skipping table rows

Stack Overflow user

Asked 2014-05-14 22:10:05

Answers: 1 · Views: 547 · Followers: 0 · Votes: 2

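I'm trying to scrape a table whose rows use a TH as the leading column element, followed by a TD tag. The problem is that the table contains intermittent divider rows that need to be skipped, because they have no TH tag.

Here is a sample from the table: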

<tr><th scope="row">Availability (non-CRS):</th><td></td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Start Date:</th><td>01 Jun 2012</td></tr>
<tr><th scope="row">Expiry Date:</th><td>31 May 2015</td></tr>
<tr><th scope="row">Duration:</th><td>36 months</td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Total Value:</th><td>&pound;18,720,000<i>(estimated)</i></td></tr>

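I'm using Python in ScraperWiki to collect the data, but I'm having trouble skipping the offending rows.

Without any condition, my code stops as soon as it reaches a row with no TH tag, so I'm currently using an if statement to make sure I only scrape rows that don't contain a non-breaking space. However, my variable (data) never gets defined, so the if statement is not working as intended.

This is the first piece of code I've written outside of a tutorial, so I expect the answer is very simple; I just don't know what it is.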

#!/usr/bin/env python

import scraperwiki
import requests
from bs4 import BeautifulSoup

base_url = 'http://www.londoncontractsregister.co.uk/public_crs/contracts/contract-048024/'

html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")

table = soup.findAll('table')
rows = table[0].findAll('tr')


for row in rows:
    th_cell = row.findAll('th')
    td_cell = row.findAll('td')
    if td_cell[0].get_text() == '&nbsp;':
        data = {
           'description' : th_cell[0].get_text(),
           'record' : td_cell[0].get_text()
        }

print data

python-2.7

web-scraping

scraperwiki

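1 Answer

Stack Overflow user

Accepted answer

Posted 2014-05-23 18:26:54

This is a bit simple (there are probably better ways), but it is based on your code and seems to get what I think you want: try to get the data, and handle the exception when that fails: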

data = []

for row in rows:
    th_cell = row.findAll('th')
    td_cell = row.findAll('td')
    try:
        data.append({'description': th_cell[0].get_text(),
                     'record' : td_cell[0].get_text()})
    except IndexError:
        pass

# Print each collected row, not the whole list on every pass
for item in data:
    print item
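As an alternative to catching IndexError, the divider rows can be skipped with an explicit check for the TH cell. The following is only a sketch under a few assumptions: it parses the sample rows from the question inline instead of fetching the live page, the helper name extract_rows is made up for illustration, and it uses the print function so it runs on Python 2.7 and Python 3 alike. One likely reason the original if statement never matched is that BeautifulSoup decodes the &nbsp; entity, so get_text() on a divider cell returns the character u'\xa0' rather than the literal string '&nbsp;'.

```python
from __future__ import print_function

from bs4 import BeautifulSoup

# Sample rows copied from the question, wrapped in a <table> for parsing.
SAMPLE_HTML = u"""
<table>
<tr><th scope="row">Availability (non-CRS):</th><td></td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Start Date:</th><td>01 Jun 2012</td></tr>
<tr><th scope="row">Expiry Date:</th><td>31 May 2015</td></tr>
<tr><th scope="row">Duration:</th><td>36 months</td></tr>
<tr><td colspan="2" class="fieldDivider"><div>&nbsp;</div></td></tr>
<tr><th scope="row">Total Value:</th><td>&pound;18,720,000<i>(estimated)</i></td></tr>
</table>
"""


def extract_rows(html):
    """Return one dict per table row that has both a TH and a TD cell."""
    soup = BeautifulSoup(html, "html.parser")
    data = []
    for row in soup.find("table").find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        # Divider rows have no TH, so find() returns None and we skip them.
        if th is None or td is None:
            continue
        data.append({
            "description": th.get_text().strip(),
            "record": td.get_text().strip(),
        })
    return data


records = extract_rows(SAMPLE_HTML)
for item in records:
    print(item)
```

Checking for the TH directly states the intent (keep only labelled rows) and avoids hiding an unrelated IndexError inside the try block.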

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/23666029
