My current CrawlSpider code is:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['http://answerstedhctbek.onion/questions']
    allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion']
    rules = (
        Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'),
        Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath')
    )

def makeAbsolutePath(links):
    for i in range(links):
        links[i] = links[i].replace("../","")
    return links

Because the forum uses relative paths, I tried creating a custom process_links callback to strip the "../", but when I run my code I still get:
2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes>

As you can see, I am still getting 400 errors because of the bad paths. Why isn't my code removing the "../" from the links?
Thanks!
Answered on 2017-11-12 00:52:44
The problem is probably that makeAbsolutePaths is not part of the spider class. The documentation states:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used)

You did not use self in makeAbsolutePaths, so I assume this is not just an indentation error. There are other errors in makeAbsolutePaths as well. If we correct the code to this state:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['file:///home/user/testscrapy/test.html']
    allowed_domains = []
    rules = (
        Rule(LinkExtractor(allow=(r'.*')), follow=True, process_links='makeAbsolutePath'),
    )

    def makeAbsolutePath(self, links):
        print(links)
        for i in range(links):
            links[i] = links[i].replace("../","")
        return links

it will produce this error:
TypeError: 'list' object cannot be interpreted as an integer

This is because the call to range is missing a call to len(): range operates only on integers. It wants a number and yields the range from 0 up to that number minus 1.
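The difference is easy to reproduce in isolation (a minimal sketch using made-up link strings, not scrapy objects):

```python
links = ["../badges", "../contact-us"]

try:
    range(links)                          # a list is not an integer
except TypeError as e:
    print(e)                              # 'list' object cannot be interpreted as an integer

for i in range(len(links)):               # correct: iterate over indices 0 .. len-1
    links[i] = links[i].replace("../", "")
print(links)                              # ['badges', 'contact-us']
```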
After fixing this, it will show the next error:
AttributeError: 'Link' object has no attribute 'replace'

This is because, contrary to what you assumed, links is not a list of strings holding the contents of the href="" attributes. Instead, it is a list of Link objects.
I suggest you print the contents of links inside makeAbsolutePath and see whether there is anything to do at all. In my opinion, scrapy should stop resolving the .. operator once it reaches the domain level, so your links should end up pointing to http://answerstedhctbek.onion/<number>/<title> even though the site uses the .. operator without an actual folder level (because the URL is /questions and not /questions/).
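That resolution behaviour can be checked with the standard library, independent of scrapy (the paths below are just illustrations of the URLs from the log):

```python
from urllib.parse import urljoin

# RFC 3986 resolution: ".." cannot climb above the domain root,
# so a leading "../" relative to /questions simply collapses away.
base = "http://answerstedhctbek.onion/questions"
print(urljoin(base, "../badges"))              # http://answerstedhctbek.onion/badges
print(urljoin(base, "../questions?sort=hot"))  # http://answerstedhctbek.onion/questions?sort=hot
```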
Something like this:
def makeAbsolutePath(self, links):
    for i in range(len(links)):
        print(links[i].url)
    return []

(The benefit of returning an empty list here is that the spider stops, so you can inspect the console output.)
If you then find that the URLs really are wrong, you can do some work on them through the url attribute:

links[i].url = 'http://example.com'

https://stackoverflow.com/questions/47239217
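Putting this together, one possible shape for the callback (a sketch only, assuming the extracted URLs really do contain a spurious "../" and that Link.url is writable, as it is on scrapy's Link class) is:

```python
# Stand-in for scrapy.link.Link, just to make the sketch runnable here.
class Link:
    def __init__(self, url):
        self.url = url

def makeAbsolutePath(links):
    # Inside the spider this would be a method taking (self, links).
    for link in links:
        # Rewrite the url attribute instead of the Link object itself.
        link.url = link.url.replace("/../", "/")
    return links

fixed = makeAbsolutePath([Link("http://answerstedhctbek.onion/../badges")])
print(fixed[0].url)  # http://answerstedhctbek.onion/badges
```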