
Can't get pagination scraper to work in Python 3
Stack Overflow user
Asked on 2019-09-17 03:24:03
1 answer · 43 views · 0 followers · 0 votes

I am trying to scrape doctor details using the Scrapy module in Python, but I can't get the pagination to work. I get partially correct output, but as mentioned above, the spider is not scraping the subsequent pages of the site.

import scrapy
from time import sleep
from ..items import SunwayscrapyItem


class SunwaySpider(scrapy.Spider):
    name = "sunway"
    page_number = 20
    allowed_domains = ['https://www.sunwaymedical.com/find-a-doctor/']
    start_urls = [
        'https://www.sunwaymedical.com/find-a-doctor/search/0/?specialty=&doctor=&name='
    ]

    def parse(self, response):
        # all_details = response.css('.col-lg-9')
        # for details in all_details:
        for SunwaySpider.page_number in range(0, 220, 20):
            items = SunwayscrapyItem()
            next_page = "https://www.sunwaymedical.com/find-a-doctor/search/" + str(
                SunwaySpider.page_number) + "/?specialty=&doctor=&name="
            if SunwaySpider.page_number < 220:
                name = response.css('.doctor_name a::text').extract()
                specialty = response.css('.doc_label3:nth-child(4)::text').extract()
                language = response.css('.doc_label3:nth-child(8)::text').extract()
                gender = response.css('.doc_label3:nth-child(12)::text').extract()
                qualifications = response.css('.doc_label3:nth-child(16)::text').extract()
                location = response.css('.doc_label3:nth-child(20)::text').extract()
                contact = response.css('.doc_label3 a::text').extract()

                items['Name'] = name
                items['Specialty'] = list(map(str.strip, specialty))
                items['Languages'] = list(map(str.strip, language))
                items['Gender'] = list(map(str.strip, gender))
                items['Qualifications'] = list(map(str.strip, qualifications))
                items['Location'] = list(map(str.strip, location))
                items['Contact'] = list(map(str.strip, contact))
                yield items
            sleep(3)
            yield response.follow(next_page, callback=self.parse)

1 Answer

Stack Overflow user

Accepted answer

Answered on 2019-09-17 06:50:49

You haven't structured the pagination correctly. Implementing pagination and item generation in a single method is not recommended. Take a look at the example code below:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import add_or_replace_parameter


class AnswersMicrosoft(CrawlSpider):
    name = 'answersmicrosoft'
    allowed_domains = ['answers.microsoft.com']
    start_urls = ['https://answers.microsoft.com/en-us']

    listings_css = ['#categoryListGridMed', '.nav-links']
    products_css = ['#threads .c-card .thread-title']

    rules = (
        Rule(LinkExtractor(restrict_css=products_css), callback='parse_item'),
        Rule(LinkExtractor(restrict_css=listings_css), callback='parse_pagination'),
    )

    def parse_pagination(self, response):
        forum_id_css = '#currentForumId::attr(value)'
        forum_id = response.css(forum_id_css).get()

        url = 'https://answers.microsoft.com/en-us/forum/forumthreadlist?forumId=' + forum_id
        yield Request(url, callback=self.get_max_page, meta={'url': response.url})

    def get_max_page(self, response):
        max_page_css = '.currentMaxPage::attr(value)'
        max_page = int(response.css(max_page_css).get())
        url = response.url

        for page in range(max_page):
            updated_url = add_or_replace_parameter(url, 'page', page)
            yield Request(updated_url, callback=self.parse)

    def parse_item(self, response):
        # AnswersMicrosoftItem and the get_* helpers are defined elsewhere
        # in the answerer's project.
        article = AnswersMicrosoftItem()
        article["title"] = self.get_title(response).strip()
        article["url"] = response.url
        article["votes"] = self.get_votes(response)
        article["replies"] = self.get_replies(response)
        article["category"] = self.get_category(response)
        article["views"] = self.get_views(response)
        article["date"] = self.get_date(response).strip()
        article["last_updated"] = self.get_last_updated(response).strip()
        yield article

Make sure you implement parse_pagination and understand how the rules invoke that method. If you're new to Scrapy and don't know much about rules, I'd suggest you read up on them; they will help you a lot going forward. Also, try to keep your approach modular. The rules above do only two things: if they see a product link, they call parse_item, and if they see a listing (next page) link, they call parse_pagination.
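Applying that same separation to the asker's site, the page URLs can be generated up front rather than rebuilt inside parse(). A minimal sketch, assuming the URL pattern from the question (the helper name and parameters are hypothetical, not from the answer):

```python
BASE = "https://www.sunwaymedical.com/find-a-doctor/search/"

def page_urls(base=BASE, step=20, last=220):
    """Return one search URL per results page (offsets 0, 20, ..., 200)."""
    return [f"{base}{offset}/?specialty=&doctor=&name="
            for offset in range(0, last, step)]

# In the spider, these would become start_urls (or be yielded as Requests
# from start_requests), so that parse() only extracts doctor items from
# the one response it was called with:
#     start_urls = page_urls()
```

This removes the bug in the original spider, where every loop iteration re-read the same response while only the next_page string changed.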

I hope you get my point. Good luck!

Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/57966928