I started a Scrapy project and created this spider:
import scrapy


class CarSpider(scrapy.Spider):
    name = 'Car_Scrape'
    page_number = 2
    start_urls = [
        'https://www.finn.no/car/used/search.html?orgId=9117269&page=1'
    ]

    def parse(self, response):
        for quote in response.css('article.ads__unit'):
            yield {
                'title': quote.css('a.ads__unit__link::text').get(),
                'img:url': quote.css('img.img-format__img::attr(src)').get(),
                'link': quote.css('a.ads__unit__link::attr(href)').get(),
                'model_year': int(quote.css('div.ads__unit__content__keys div:nth-child(1)::text').get()),
                'mileage': int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(2)::text').get())))),
                'price': int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get())))),
            }

The problem is that when I try to run the crawl command:
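scrapy crawl Car_Scrape -o data.json

it only scrapes the first 23 cars. However, when I run this in the scrapy shell for the same URL:

for quote in response.css('article.ads__unit'):
    print(quote.css('a.ads__unit__link::text').get())

the whole page is scraped. I would like to get the same result from the CarSpider class. Am I doing something wrong? Could someone check whether they hit the same problem, or whether my project is set up incorrectly? Any help would be greatly appreciated.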
Posted on 2020-10-05 02:03:10
If I run your spider, I get 26 items, but it throws an error:
2020-10-04 19:52:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.finn.no/car/used/search.html?orgId=9117269&page=1> (referer: None)
Traceback (most recent call last):
  File "c:\program files\python37\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Users\Ivan\Documents\Python\a.py", line 22, in parse
    'price': int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get())))),
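ValueError: invalid literal for int() with base 10: ''

Looking at the page, the problematic listing contains "Solgt" (Norwegian for "sold") where you expect the price, and your code does not handle that case. "Solgt" has no digits, so the filter produces an empty string and int('') raises ValueError. The exception stops the parse generator, so the remaining listings on the page are never yielded.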
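One way to handle this, as a minimal sketch: move the digit extraction into a small helper that returns None when the text has no digits. The name `to_int` is hypothetical (it is not part of Scrapy), and treating a sold listing as "price is None" is an assumption about what you want in the output.

```python
from typing import Optional


def to_int(text: Optional[str]) -> Optional[int]:
    """Return the digits found in `text` as an int, or None if there are none."""
    # Selector.get() returns None when nothing matches, so guard against that.
    if text is None:
        return None
    # Same digit filter as in the spider, but without calling int() on ''.
    digits = ''.join(filter(str.isdigit, text))
    return int(digits) if digits else None


print(to_int('249 000 kr'))  # a normal price string
print(to_int('Solgt'))       # a sold listing: no digits, so None
```

In the spider, each int(...) expression would then become something like `'price': to_int(quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get())`, so a sold listing is yielded with a None price instead of aborting parse.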
https://stackoverflow.com/questions/64194041