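I'm new to Scrapy, and one of my spiders is giving me trouble. I need some help finding the error in my code. After following some URLs, I loop over a table. The loop visits every row, but I only get the data from the first row.
Here is my code: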
    def parse(self, response):
        Caballo = response.url
        jockey_url = response.xpath(
            './/*[@id="site-content"]/div/main/div/div[1]/div[1]/div/div/div/div[1]/div[2]/div[2]/div/div[1]/ul/li[4]/a/@href').get()
        loader = ItemLoader(item=DailyItem(), response=response)
        loader.add_value('Caballo', Caballo)
        loader.add_xpath('jockey', './/*[@id="site-content"]/div/main/div/div[1]/div[1]/div/div/div/div[1]/div[2]/div[2]/div/div[1]/ul/li[4]/a/text()')
        new_items = loader.load_item()
        yield response.follow(jockey_url, self.parse_jockey, meta={'item': new_items})

    def parse_jockey(self, response):
        new_items = response.meta['item']
        table = response.xpath('//*[@id="tab-form-alltime"]/div/table/tbody/tr')
        for t in table:
            loader = ItemLoader(item=new_items, selector=t)
            loader.add_xpath('Type', './/td[1]/text()')
            loader.add_xpath('Rate', './/td[6]/text()')
            yield loader.load_item()

This is the output for one of the URLs. As you can see, there are 3 rows, which is correct, but the data in every row is the same:
{"Caballo": "https://www.attheraces.com/form/horse/Alexanderthegreat/FR/3022995?raceid=1149928", "jockey": "Jason Hart", "Type": "Flat Turf", "Rate": "11.57%"},
{"Caballo": "https://www.attheraces.com/form/horse/Alexanderthegreat/FR/3022995?raceid=1149928", "jockey": "Jason Hart", "Type": "Flat Turf", "Rate": "11.57%"},
{"Caballo": "https://www.attheraces.com/form/horse/Alexanderthegreat/FR/3022995?raceid=1149928", "jockey": "Jason Hart", "Type": "Flat Turf", "Rate": "11.57%"},

This is what the scrape should return:
Posted on 2020-07-15 23:35:08
So I think the crux of the problem is your XPath.
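Alongside the XPath, it may be worth ruling out a second cause of identical rows: passing the same item object to every row's loader. The sketch below is only a plain-Python stand-in, not Scrapy's actual ItemLoader. It assumes a loader that seeds itself with the values already on the item and outputs the first value per field (which is how I understand a loader with TakeFirst to behave). The helper `load_row` and the second and third row values are hypothetical.

```python
def load_row(item, row):
    """Hypothetical stand-in for a loader seeded with `item` that outputs the first value per field."""
    # Seed the collected values with whatever the item already holds.
    collected = {key: [value] for key, value in item.items()}
    # Append the values scraped from this table row.
    for key, value in row.items():
        collected.setdefault(key, []).append(value)
    # "Take first" output: write the first collected value back onto the item.
    for key, values in collected.items():
        item[key] = values[0]
    return item


base = {"jockey": "Jason Hart"}
rows = [
    {"Type": "Flat Turf", "Rate": "11.57%"},
    {"Type": "placeholder B", "Rate": "placeholder b"},  # made-up row values
    {"Type": "placeholder C", "Rate": "placeholder c"},  # made-up row values
]

# Reusing one item: after the first row, its values are already on the item and always win.
shared_item = dict(base)
reused = [load_row(shared_item, row) for row in rows]

# Copying the base item per row: each row starts clean and keeps its own values.
copied = [load_row(dict(base), row) for row in rows]

print([r["Type"] for r in reused])
print([r["Type"] for r in copied])
```

If this is what is happening in the spider, handing each row's loader a copy of the item from response.meta (for example `new_items.copy()`) should avoid it.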
Code example: test.py
    import scrapy
    from scrapy.loader import ItemLoader

    from ..items import DailyItem


    class TestSpider(scrapy.Spider):
        name = 'test'
        allowed_domains = ['www.attheraces.com']
        start_urls = ['https://www.attheraces.com/form/jockey/Jason-Hart/1354728?raceid=1149928']

        def parse(self, response):
            Caballo = response.url
            jockey_url = 'https://www.attheraces.com/form/horse/Strongbowe/FR/3091730?raceid=1150331'
            loader = ItemLoader(item=DailyItem(), response=response)
            loader.add_value('Caballo', Caballo)
            loader.add_xpath('Jockey', '//h1[@class="h3"]/text()')
            new_items = loader.load_item()
            # Pass the partially filled item on to the next callback.
            yield response.follow(jockey_url, self.parse_jockey, meta={'item': new_items})

        def parse_jockey(self, response):
            table = response.xpath('//div[@id="tab-form-flat-form"]/div[2]/table/tbody/tr')
            new_items = response.meta['item']
            for t in table:
                # Skip rows that have no value in the first column.
                if not t.xpath('.//td[1]/div/span[2]/text()'):
                    continue
                # Load into a copy so each row gets its own item instead of
                # re-using (and re-yielding) one shared object.
                loader = ItemLoader(item=new_items.copy(), selector=t)
                loader.add_xpath('Type', './/td[1]/div/span[2]/text()')
                loader.add_xpath('Rate', './/td[6]/text()')
                yield loader.load_item()

Code example: items.py
    import scrapy
    # Newer Scrapy releases also expose this as itemloaders.processors.TakeFirst.
    from scrapy.loader.processors import TakeFirst


    class DailyItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # TakeFirst returns the first non-empty value collected for each field.
        Caballo = scrapy.Field(output_processor=TakeFirst())
        Jockey = scrapy.Field(output_processor=TakeFirst())
        Type = scrapy.Field(output_processor=TakeFirst())
        Rate = scrapy.Field(output_processor=TakeFirst())

Output
{"Caballo": "https://www.attheraces.com/form/jockey/Jason-Hart/1354728?raceid=1149928", "Jockey": "Jason Hart", "Type": "Turf", "Rate": "50.0%"}

Tips
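1. If the tag you want to extract has
2. For long attribute values, use the contains function in XPath, which matches any element whose attribute value contains the string you specify.
For example,
'//div[contains(@class,"jock")]' will match any div whose class attribute contains jock.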
Before writing any code,
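To make the contains tip concrete, here is a minimal sketch of the substring test that contains(@class, "jock") performs. Python's standard-library ElementTree does not support contains() in its XPath subset, so the check is emulated in plain Python here. In a spider you would pass the XPath string straight to response.xpath instead. The markup and class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two class names contain "jock", one does not, one div has no class.
markup = """
<root>
  <div class="jockey-name">Jason Hart</div>
  <div class="top-jockeys">more text</div>
  <div class="trainer">other text</div>
  <div>no class attribute</div>
</root>
"""

root = ET.fromstring(markup)

# Equivalent in spirit to //div[contains(@class, "jock")]:
# keep every div whose class attribute has "jock" anywhere inside it.
matches = [div for div in root.iter("div") if "jock" in div.get("class", "")]

for div in matches:
    print(div.get("class"), "->", div.text)
```

Because the match is a plain substring test, a short fragment can also pick up elements you did not intend, so it helps to choose a fragment that is specific enough.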
https://stackoverflow.com/questions/62917794