I have built a crawler that crawls within a fixed domain and extracts URLs matching a fixed regular expression. The crawler follows a link whenever it sees a URL of a specific form. The crawler extracts the URLs fine, but every time I run it, it returns a different number of links, i.e. the link count differs between runs. I am crawling with Scrapy. Is this a problem with Scrapy? The code is:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]

    rules = (
        # Parse vacancy detail pages
        Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
        # Follow pagination links
        Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),
    )

    def parse_item(self, response):
        outputfile = open('urllist.txt', 'a')
        print response.url
        outputfile.write(response.url + '\n')

Answered on 2014-04-07 12:54:43
Instead of manually writing the links to a file opened in append mode inside the parse_item() method, use Scrapy's built-in item exporters. Define an Item with a url field:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]

    rules = (
        Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        yield item

Source: https://stackoverflow.com/questions/22912259
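With the item defined, Scrapy's feed exports can write the output file for you instead of you managing it in the callback. A minimal usage sketch (assuming the spider sits in an ordinary Scrapy project):

scrapy crawl xyz -o urllist.csv

Note that depending on the Scrapy version, -o may append to an existing output file, so delete urllist.csv between runs if you want each run's link count in isolation.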