While trying to scrape a particular website I ran into a strange problem. If I use a BaseSpider to scrape a few pages, the code works fine, but if I change the code to use a CrawlSpider, the spider finishes without any errors, yet nothing is crawled.
The following code works fine:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log
class hushBabiesSpider(BaseSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]

    def parse(self, response):
        print response.body
        print "Inside parse Item"
        return []

The following code does not work:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log
class hushBabiesSpider(CrawlSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=()),
             'parseItem',
             follow=True,
        ),
    )

    def parseItem(self, response):
        print response.body
        print "Inside parse Item"
        return []

The output of the Scrapy run looks like this:
scrapy crawl hushbabies
2012-07-23 18:50:37+0000 [scrapy] INFO: Scrapy 0.15.1-198-g831a450 started (bot: SKBot)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, WebService, CoreStats, MemoryUsage, SpiderState, CloseSpider
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled item pipelines: SQLStorePipeline
2012-07-23 18:50:37+0000 [hushbabies] INFO: Spider opened
2012-07-23 18:50:37+0000 [hushbabies] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-23 18:50:37+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/robots.txt> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/category/mommy-newborn.html> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Closing spider (finished)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Dumping spider stats:
{'downloader/request_bytes': 634,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 44395,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 23, 18, 50, 39, 674965),
'scheduler/memory_enqueued': 2,
'start_time': datetime.datetime(2012, 7, 23, 18, 50, 37, 700711)}
2012-07-23 18:50:39+0000 [hushbabies] INFO: Spider closed (finished)
2012-07-23 18:50:39+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 27820032, 'memusage/startup': 27652096}

Changing the site from hushbabies.com to another one makes the code work.
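The log above is consistent with the rule's link extractor returning no links: each start URL is downloaded, zero links are extracted from it, the scheduler queue empties, and the spider closes "finished" without an error. A minimal sketch of that crawl loop (a simplified illustration, not Scrapy's actual scheduler code):

```python
from collections import deque

def crawl(start_urls, extract_links):
    """Simplified sketch of CrawlSpider's rule-following loop:
    fetch a URL, run the Rule's link extractor on it, enqueue
    any links not seen before."""
    seen = set(start_urls)
    queue = deque(start_urls)
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)              # the "download" step
        for link in extract_links(url):  # the Rule's link extractor
            if link not in seen:         # duplicate filter
                seen.add(link)
                queue.append(link)
    return crawled

# An extractor that always returns [] (as SgmlLinkExtractor does on
# this site): only the start URLs are fetched, then the crawl ends.
print(crawl(["http://example.com/a", "http://example.com/b"], lambda u: []))
# → ['http://example.com/a', 'http://example.com/b']
```

With a working extractor the queue keeps refilling and the crawl continues; with one that returns nothing, the spider stops after the start URLs, which matches the three requests in the stats above.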
Answered on 2012-07-24 05:20:10
There seems to be a problem in sgmllib, the underlying SGML parser used by SgmlLinkExtractor.
The following code, run in the scrapy shell, returns zero links:
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> fetch('http://www.hushbabies.com/')
>>> len(SgmlLinkExtractor().extract_links(response))
0

You can try Slybot's alternative link extractor, which relies on Scrapely:
>>> from slybot.linkextractor import LinkExtractor
>>> from scrapely.htmlpage import HtmlPage
>>> p = HtmlPage(body=response.body_as_unicode())
>>> sum(1 for _ in LinkExtractor().links_to_follow(p))
314

https://stackoverflow.com/questions/11618641
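If pulling in Slybot/Scrapely is not an option, anchors can also be collected with the standard library's HTML parser, which tends to cope with broken markup better than the old sgmllib. This is a hedged sketch of the idea, not the extractor Scrapy itself uses:

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect href values from <a> tags using the lenient
    stdlib parser (html.parser in Python 3)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value).
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = AnchorExtractor()
# Deliberately sloppy HTML: unquoted values, stray attribute, unclosed tags.
parser.feed('<a href=/one x><A HREF="/two">t<a name=skip>')
print(parser.links)  # → ['/one', '/two']
```

The extracted hrefs would still need to be resolved against the page URL (e.g. with urljoin) and filtered by allowed_domains before being fed back into the crawl.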