
Scrapy honoring rel="nofollow"

Asked by a Stack Overflow user on 2014-01-27 21:38:14
2 answers · 1.4K views · 0 followers · Score: 1

Can Scrapy ignore rel="nofollow" links? Looking at sgml.py in Scrapy 0.22, it looks like this:

How do I enable it?

2 Answers

Stack Overflow user

Accepted answer

Posted on 2015-06-23 14:34:16

Paul's spot on; this is how I did it:

rules = (
    # Extract all pages, follow links, use 'parse_page' as the response callback,
    # and run 'links_processor' over extracted links before following them
    Rule(LinkExtractor(allow=('', '/')), follow=True, callback='parse_page', process_links='links_processor'),
)

This is the actual function (I'm new to Python, and I'm sure there's a better way to drop items inside the loop without creating a new list):

def links_processor(self, links):
    # A hook into link processing for each page, used to skip "nofollow" links
    ret_links = list()
    if links:
        for link in links:
            if not link.nofollow:
                ret_links.append(link)
    return ret_links

Easy peasy.
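As the answer itself hints, the same filtering can be written without the explicit loop, as a list comprehension. A minimal, Scrapy-free sketch (the namedtuple below is a stand-in for scrapy.link.Link, which exposes a boolean `nofollow` attribute):

```python
from collections import namedtuple

# Stand-in for scrapy.link.Link: only the attributes the filter needs.
Link = namedtuple('Link', ['url', 'nofollow'])

def links_processor(links):
    # Keep only links whose anchor tag did not carry rel="nofollow".
    return [link for link in links if not link.nofollow]

links = [
    Link('https://example.com/follow-me', nofollow=False),
    Link('https://example.com/ignore-me', nofollow=True),
]
print([l.url for l in links_processor(links)])
# Prints only the first URL
```

Because the comprehension already yields an empty list for empty input, the `if links:` guard from the original is no longer needed.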

Score: 4

Stack Overflow user

Posted on 2015-09-26 10:22:37

Itamar Gero's answer is correct. For my own blog, I implemented a CrawlSpider that uses a LinkExtractor-based Rule to extract all relevant links from my blog pages:

# -*- coding: utf-8 -*-

'''
*   This program is free software: you can redistribute it and/or modify
*   it under the terms of the GNU General Public License as published by
*   the Free Software Foundation, either version 3 of the License, or
*   (at your option) any later version.
*
*   This program is distributed in the hope that it will be useful,
*   but WITHOUT ANY WARRANTY; without even the implied warranty of
*   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
*   GNU General Public License for more details.
*
*   You should have received a copy of the GNU General Public License
*   along with this program.  If not, see <http://www.gnu.org/licenses/>.
*
*   @author Marcel Lange <info@ask-sheldon.com>
*   @package ScrapyCrawler 
 '''


from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

import Crawler.settings
from Crawler.items import PageCrawlerItem


class SheldonSpider(CrawlSpider):
    name = Crawler.settings.CRAWLER_NAME
    allowed_domains = Crawler.settings.CRAWLER_DOMAINS
    start_urls = Crawler.settings.CRAWLER_START_URLS
    rules = (
        Rule(
            LinkExtractor(
                allow_domains=Crawler.settings.CRAWLER_DOMAINS,
                allow=Crawler.settings.CRAWLER_ALLOW_REGEX,
                deny=Crawler.settings.CRAWLER_DENY_REGEX,
                restrict_css=Crawler.settings.CSS_SELECTORS,
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback='parse_item',
            process_links='filter_links'
        ),
    )

    # Filter links with the nofollow attribute
    def filter_links(self, links):
        return_links = list()
        if links:
            for link in links:
                if not link.nofollow:
                    return_links.append(link)
                else:
                    self.logger.debug('Dropped link %s because nofollow attribute was set.' % link.url)
        return return_links

    def parse_item(self, response):
        # self.logger.info('Parsed URL: %s with STATUS %s', response.url, response.status)
        item = PageCrawlerItem()
        item['status'] = response.status
        item['title'] = response.xpath('//title/text()')[0].extract()
        item['url'] = response.url
        item['headers'] = response.headers
        return item

At https://www.ask-sheldon.com/build-a-website-crawler-using-scrapy-framework/ I've described in detail how I implemented a website crawler to warm up my WordPress fullpage cache.
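For background on where the `nofollow` flag checked in `filter_links` comes from: Scrapy's LinkExtractor sets it from the anchor tag's `rel` attribute. The check can be sketched with only the standard library (a simplified stand-in for illustration, not Scrapy's actual extractor):

```python
from html.parser import HTMLParser

class NofollowAwareExtractor(HTMLParser):
    """Collects (href, nofollow) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        href = attrs.get('href')
        if href is None:
            return
        # rel is a space-separated token list; "nofollow" may appear among others.
        rel_tokens = (attrs.get('rel') or '').split()
        self.links.append((href, 'nofollow' in rel_tokens))

parser = NofollowAwareExtractor()
parser.feed('<a href="/a" rel="nofollow">x</a><a href="/b">y</a>')
print(parser.links)  # [('/a', True), ('/b', False)]
```

A `process_links` hook like the ones above then simply drops every pair whose flag is True.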

Score: 1
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/21392222