I am trying to build a scraper for a website's products. My plan is to extract all category links from the navigation menu, follow them, and extract all product links, which I will later parse in the parse_product function. But I don't know the best way to do this. I am struggling with the parse_menu callback below and with the further extraction of product links. Please critique my code.
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DiorSpider(CrawlSpider):
    name = 'newdior'
    allowed_domains = ['www.dior.com']
    start_urls = ['https://www.dior.com/en_us/']
    rules = (
        Rule(LinkExtractor(allow=(r'^https?://www.dior.com/en_us',)),
             callback='parse_menu'),
        Rule(LinkExtractor(allow=(r'^https?://www.dior.com/en_us/products/.*',)),
             callback='parse_product'),
    )

    def parse_menu(self, response):
        menu = response.xpath('//a[@class="navigation-item-link"]').extract()
        for item in menu:
            link = re.compile(r'a class="navigation-item-link" href="([a-zA-Z0-9_/-]*)"').findall(item)
            if link:
                absolute_url = response.urljoin(link[0])
                yield absolute_url

    def parse_product(self, response):
        ...

Posted on 2019-01-03 16:25:53
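For reference, Scrapy's response.urljoin used above behaves like the standard library's urllib.parse.urljoin with response.url as the base, so relative hrefs pulled from the menu become absolute URLs. A minimal illustration:

```python
from urllib.parse import urljoin

base = 'https://www.dior.com/en_us/'

# A root-relative href from the navigation menu becomes an absolute URL:
assert urljoin(base, '/en_us/fashion') == 'https://www.dior.com/en_us/fashion'

# An already-absolute href is returned unchanged:
assert urljoin(base, 'https://www.dior.com/en_us/fragrance') == \
       'https://www.dior.com/en_us/fragrance'
```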
from scrapy import Request, Spider

class DiorSpider(Spider):  # CrawlSpider is mostly used together with link extractors.
    name = 'newdior'
    allowed_domains = ['www.dior.com']
    start_urls = ['https://www.dior.com/en_us/']
    # If you are walking the navigation bar yourself, there is no need for Rules.

    def parse(self, response):
        # The hrefs can be extracted directly with XPath;
        # no regex over the raw anchor tags is needed.
        links = response.xpath('//a[@class="navigation-item-link"]/@href').extract()
        for link in links:
            absolute_url = response.urljoin(link)
            yield Request(absolute_url, self.parse_product)

    def parse_product(self, response):
        ...

Source: https://stackoverflow.com/questions/54025952
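Since the Spider version drops the Rules, the product-URL pattern from the question's second Rule can still be reused in parse to decide which followed links should be handed to parse_product. A minimal sketch of that check (the helper name is mine, not from the answer; the regex is the question's pattern with the dots escaped):

```python
import re

# Pattern from the question's second Rule: product pages live under /en_us/products/.
PRODUCT_RE = re.compile(r'^https?://www\.dior\.com/en_us/products/.+')

def is_product_url(url):
    """Hypothetical helper: True if the URL looks like a product page."""
    return bool(PRODUCT_RE.match(url))

# Category pages would be followed further; product pages go to parse_product.
assert is_product_url('https://www.dior.com/en_us/products/some-item')
assert not is_product_url('https://www.dior.com/en_us/fashion')
```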