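I am currently writing a job vacancy scraper with Scrapy to parse about 3M vacancy items. The spider now works: it scrapes items and stores them in PostgreSQL successfully, but it does so quite slowly. In 1 hour I stored only 12,000 vacancies, so I am still very far from 3M. Eventually I will need to scrape and update the data once a day, and at the current rate it would take more than a day just to parse everything.
I am new to data scraping, so I may be making some basic mistakes, and I would be very grateful for any help.
My spider code: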
import scrapy
import urllib.request
from lxml import html
from ..items import JobItem


class AdzunaSpider(scrapy.Spider):
    name = "adzuna"
    start_urls = [
        'https://www.adzuna.ru/search?loc=136073&pp=10'
    ]

    def parse(self, response):
        job_items = JobItem()
        items = response.xpath("//div[@class='sr']/div[@class='a']")

        def get_redirect(url):
            response = urllib.request.urlopen(url)
            response_code = response.read()
            result = str(response_code, 'utf-8')
            root = html.fromstring(result)
            final_url = root.xpath('//p/a/@href')[0]
            final_final_url = final_url.split('?utm', 1)[0]
            return final_final_url

        for item in items:
            id = None
            data_aid = item.xpath(".//@data-aid").get()
            redirect = item.xpath(".//h2/a/@href").get()
            url = get_redirect(redirect)
            url_header = item.xpath(".//h2/a/strong/text()").get()
            if item.xpath(".//p[@class='as']/@data-company-name").get() == None:
                company = item.xpath(".//p[@class='as']/text()").get()
            else:
                company = item.xpath(".//p[@class='as']/@data-company-name").get()
            loc = item.xpath(".//p/span[@class='loc']/text()").get()
            text = item.xpath(".//p[@class='at']/span[@class='at_tr']/text()").get()
            salary = item.xpath(".//p[@class='at']/span[@class='at_sl']/text()").get()
            job_items['id'] = id
            job_items['data_aid'] = data_aid
            job_items['url'] = url
            job_items['url_header'] = url_header
            job_items['company'] = company
            job_items['loc'] = loc
            job_items['text'] = text
            job_items['salary'] = salary
            yield job_items

        next_page = response.css("table.pg td:last-child ::attr('href')").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Posted on 2019-11-03 13:54:36
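Use indexes in your table.
Bulk insert rather than inserting rows one at a time.
Keep the use of meta in your Request to a minimum where possible.
Set CONCURRENT_ITEMS=100; setting it higher decreases performance.
Use fewer middlewares and pipelines in settings.py.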
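
The bulk-insert advice above could look roughly like the pipeline below. This is only a sketch under stated assumptions: the class name BatchInsertPipeline, the helper insert_rows_postgres, the jobs table with its column list, the batch size of 500, and the POSTGRES_DSN setting are all hypothetical names that do not come from the question. It buffers items and writes each batch in a single statement via psycopg2's execute_values instead of one INSERT per item.

```python
# Sketch of a batching item pipeline. The table name "jobs", the column list,
# and the POSTGRES_DSN setting are assumptions made for illustration.
#
# An index on the lookup column (hypothetical) would be created once in SQL:
#   CREATE INDEX jobs_data_aid_idx ON jobs (data_aid);

COLUMNS = ("data_aid", "url", "url_header", "company", "loc", "text", "salary")


def insert_rows_postgres(conn, rows):
    """Write all rows with one multi-row INSERT using psycopg2."""
    from psycopg2.extras import execute_values  # imported lazily

    sql = "INSERT INTO jobs ({}) VALUES %s".format(", ".join(COLUMNS))
    with conn.cursor() as cur:
        execute_values(cur, sql, rows)
    conn.commit()


class BatchInsertPipeline:
    def __init__(self, flush, batch_size=500):
        self.flush = flush            # callable taking a list of row tuples
        self.batch_size = batch_size  # rows collected before each write
        self.buffer = []

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the pipeline; the connection
        # string is read from a hypothetical POSTGRES_DSN setting.
        import psycopg2

        conn = psycopg2.connect(crawler.settings.get("POSTGRES_DSN"))
        return cls(flush=lambda rows: insert_rows_postgres(conn, rows))

    def process_item(self, item, spider):
        # Missing fields become None, which psycopg2 sends as NULL.
        self.buffer.append(tuple(item.get(c) for c in COLUMNS))
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        # Write whatever is left when the crawl ends.
        self._flush()

    def _flush(self):
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
```

To try it, the class would be registered under ITEM_PIPELINES in settings.py, next to CONCURRENT_ITEMS = 100 (which is also Scrapy's default). Injecting the flush callable keeps the batching logic testable without a database. One caveat: the pipeline returns the same item object it received, so a spider that reuses a single JobItem instance across iterations still works here only because the values are copied into a tuple before buffering.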
https://stackoverflow.com/questions/58674325