My current CrawlSpider code is:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['http://answerstedhctbek.onion/questions']
    allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion']
    rules = (
        Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'),
        Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath')
    )

def makeAbsolutePath(links):
    for i in range(links):
        links[i] = links[i].replace("../","")
    return links

Because the forum uses relative paths, I tried creating a custom process_links callback to strip the "../", but when I run my code I still get:
2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes>

As you can see, I am still getting 400 errors because of the bad paths. Why isn't my code removing the "../" from the links?
Thanks!
Answered on 2017-11-12 00:52:44
The problem is probably that makeAbsolutePaths is not part of the spider class. The documentation states:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used)

You did not use self in makeAbsolutePaths, so I assume this is not just an indentation error. There are other errors in makeAbsolutePaths as well. If we correct the code to this state:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['file:///home/user/testscrapy/test.html']
    allowed_domains = []
    rules = (
        Rule(LinkExtractor(allow=(r'.*')), follow=True, process_links='makeAbsolutePath'),
    )

    def makeAbsolutePath(self, links):
        print(links)
        for i in range(links):
            links[i] = links[i].replace("../","")
        return links

it will produce this error:
TypeError: 'list' object cannot be interpreted as an integer

This is because the call to range is missing a call to len(): range operates only on integers. It wants a number and yields the range from 0 up to that number minus 1.
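The difference is easy to reproduce in isolation (a minimal sketch using made-up link strings, not scrapy objects):

```python
links = ["../badges", "../contact-us"]

try:
    range(links)                          # a list is not an integer
except TypeError as e:
    print(e)                              # 'list' object cannot be interpreted as an integer

for i in range(len(links)):               # correct: iterate over indices 0 .. len-1
    links[i] = links[i].replace("../", "")
print(links)                              # ['badges', 'contact-us']
```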
After fixing this, it will show the next error:
AttributeError: 'Link' object has no attribute 'replace'

This is because, contrary to what you assumed, links is not a list of strings holding the contents of the href="" attributes. Instead, it is a list of Link objects.
I suggest you print the contents of links inside makeAbsolutePath and see whether there is anything to do at all. In my opinion, scrapy should stop resolving the .. operator once it reaches the domain level, so your links should end up pointing to http://answerstedhctbek.onion/<number>/<title> even though the site uses the .. operator without an actual folder level (because the URL is /questions and not /questions/).
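That resolution behaviour can be checked with the standard library, independent of scrapy (the paths below are just illustrations of the URLs from the log):

```python
from urllib.parse import urljoin

# RFC 3986 resolution: ".." cannot climb above the domain root,
# so a leading "../" relative to /questions simply collapses away.
base = "http://answerstedhctbek.onion/questions"
print(urljoin(base, "../badges"))              # http://answerstedhctbek.onion/badges
print(urljoin(base, "../questions?sort=hot"))  # http://answerstedhctbek.onion/questions?sort=hot
```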
Something like this:
def makeAbsolutePath(self, links):
    for i in range(len(links)):
        print(links[i].url)
    return []

(The benefit of returning an empty list here is that the spider stops, so you can inspect the console output.)
If you then find that the URLs really are wrong, you can do some work on them through the url attribute:

links[i].url = 'http://example.com'

https://stackoverflow.com/questions/47239217
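Putting this together, one possible shape for the callback (a sketch only, assuming the extracted URLs really do contain a spurious "../" and that Link.url is writable, as it is on scrapy's Link class) is:

```python
# Stand-in for scrapy.link.Link, just to make the sketch runnable here.
class Link:
    def __init__(self, url):
        self.url = url

def makeAbsolutePath(links):
    # Inside the spider this would be a method taking (self, links).
    for link in links:
        # Rewrite the url attribute instead of the Link object itself.
        link.url = link.url.replace("/../", "/")
    return links

fixed = makeAbsolutePath([Link("http://answerstedhctbek.onion/../badges")])
print(fixed[0].url)  # http://answerstedhctbek.onion/badges
```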