I am trying to scrape doctor details with Python's Scrapy module, but I cannot get the pagination part of the crawler to work. I get partially correct output, but as I said, the spider does not scrape the subsequent pages of the site.
import scrapy
from time import sleep

from ..items import SunwayscrapyItem


class SunwaySpider(scrapy.Spider):
    name = "sunway"
    page_number = 20
    allowed_domains = ['https://www.sunwaymedical.com/find-a-doctor/']
    start_urls = [
        'https://www.sunwaymedical.com/find-a-doctor/search/0/?specialty=&doctor=&name='
    ]
    def parse(self, response):
        # all_details = response.css('.col-lg-9')
        # for details in all_details:
        for SunwaySpider.page_number in range(0, 220, 20):
            items = SunwayscrapyItem()
            next_page = "https://www.sunwaymedical.com/find-a-doctor/search/" + str(
                SunwaySpider.page_number) + "/?specialty=&doctor=&name="
            if SunwaySpider.page_number < 220:
                name = response.css('.doctor_name a::text').extract()
                specialty = response.css('.doc_label3:nth-child(4)::text').extract()
                language = response.css('.doc_label3:nth-child(8)::text').extract()
                gender = response.css('.doc_label3:nth-child(12)::text').extract()
                qualifications = response.css('.doc_label3:nth-child(16)::text').extract()
                location = response.css('.doc_label3:nth-child(20)::text').extract()
                contact = response.css('.doc_label3 a::text').extract()
                items['Name'] = name
                items['Specialty'] = list(map(str.strip, specialty))
                items['Languages'] = list(map(str.strip, language))
                items['Gender'] = list(map(str.strip, gender))
                items['Qualifications'] = list(map(str.strip, qualifications))
                items['Location'] = list(map(str.strip, location))
                items['Contact'] = list(map(str.strip, contact))
                yield items
                sleep(3)
                yield response.follow(next_page, callback=self.parse)

Posted on 2019-09-17 06:50:49
You are not setting up the pagination correctly. Implementing pagination and item generation in a single method is not recommended. Have a look at the sample code below:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import add_or_replace_parameter

# AnswersMicrosoftItem and the get_* helpers (get_title, get_votes, ...) are
# defined elsewhere in the answerer's project and are not shown here.


class AnswersMicrosoft(CrawlSpider):
    name = 'answersmicrosoft'
    allowed_domains = ['answers.microsoft.com']
    start_urls = ['https://answers.microsoft.com/en-us']

    listings_css = ['#categoryListGridMed', '.nav-links']
    products_css = ['#threads .c-card .thread-title']

    rules = (
        Rule(LinkExtractor(restrict_css=products_css), callback='parse_item'),
        Rule(LinkExtractor(restrict_css=listings_css), callback='parse_pagination'),
    )

    def parse_pagination(self, response):
        forum_id_css = '#currentForumId::attr(value)'
        forum_id = response.css(forum_id_css).get()
        url = 'https://answers.microsoft.com/en-us/forum/forumthreadlist?forumId=' + forum_id
        yield Request(url, callback=self.get_max_page, meta={'url': response.url})

    def get_max_page(self, response):
        max_page_css = '.currentMaxPage::attr(value)'
        max_page = int(response.css(max_page_css).get())
        url = response.url
        for page in range(max_page):
            # add_or_replace_parameter (from w3lib.url, installed with Scrapy)
            # sets or overwrites the 'page' query parameter in the URL.
            updated_url = add_or_replace_parameter(url, 'page', page)
            yield Request(updated_url, callback=self.parse)

    def parse_item(self, response):
        article = AnswersMicrosoftItem()
        article["title"] = self.get_title(response).strip()
        article["url"] = response.url
        article["votes"] = self.get_votes(response)
        article["replies"] = self.get_replies(response)
        article["category"] = self.get_category(response)
        article["views"] = self.get_views(response)
        article["date"] = self.get_date(response).strip()
        article["last_updated"] = self.get_last_updated(response).strip()
        yield article

Make sure parse_pagination is implemented, and note how the rules are set up to call it. If you are new and do not know much about rules, I would encourage you to read up on them; they will help you a great deal going forward. Also, try to keep the approach modular. The rules above do only two things: if they see a product, they call parse_item, and if they see a next page, they call parse_pagination.
I hope you see my point. Good luck!
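Applied to the spider from the question, a minimal sketch of the same idea might look like the following. It keeps pagination (one request per results page) separate from item extraction, reuses the question's selectors and SunwayscrapyItem, and assumes, as the original loop does, that results are paged by an offset of 20 up to 200; none of this has been verified against the live site.

import scrapy

from ..items import SunwayscrapyItem


class SunwaySpider(scrapy.Spider):
    name = "sunway"
    # Domain only; the question used a full URL here, which Scrapy's
    # offsite filter does not expect.
    allowed_domains = ["sunwaymedical.com"]

    def start_requests(self):
        # One request per results page; the offsets 0, 20, ..., 200 are taken
        # from the original loop and are an assumption about the site.
        base = "https://www.sunwaymedical.com/find-a-doctor/search/{}/?specialty=&doctor=&name="
        for offset in range(0, 220, 20):
            yield scrapy.Request(base.format(offset), callback=self.parse_doctors)

    def parse_doctors(self, response):
        # Item extraction only; selectors copied from the question.
        items = SunwayscrapyItem()
        items["Name"] = response.css(".doctor_name a::text").getall()
        items["Specialty"] = [s.strip() for s in response.css(".doc_label3:nth-child(4)::text").getall()]
        items["Languages"] = [s.strip() for s in response.css(".doc_label3:nth-child(8)::text").getall()]
        items["Gender"] = [s.strip() for s in response.css(".doc_label3:nth-child(12)::text").getall()]
        items["Qualifications"] = [s.strip() for s in response.css(".doc_label3:nth-child(16)::text").getall()]
        items["Location"] = [s.strip() for s in response.css(".doc_label3:nth-child(20)::text").getall()]
        items["Contact"] = [s.strip() for s in response.css(".doc_label3 a::text").getall()]
        yield items

The key difference from the original parse method is that each page gets its own request and its own callback invocation, instead of extracting from the same response repeatedly inside a loop.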
https://stackoverflow.com/questions/57966928