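I'm using Scrapy to retrieve information about projects on https://www.indiegogo.com. I want to scrape all pages with the URL format www.indiegogo.com/projects/[NameOfProject]. However, I'm not sure how to reach all of those pages during a crawl. I can't find a master page that hard-codes links to all of the /projects/ pages. All projects seem to be reachable from https://www.indiegogo.com/explore (through visible links and the search function), but I can't determine a set of links/search queries that would return all pages. My spider code is given below. These start_urls and rules scrape about 6000 pages, but I've heard there should be closer to 10 times that many.
About the URLs with parameters: the filter_quick parameter values used come from the "Trending", "Final Countdown", "New This Week", and "Most Funded" links on the Explore page, and obviously miss unpopular and poorly funded projects. There is no maximum value on the per_page URL parameter.
Any suggestions? Thanks!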
class IndiegogoSpider(CrawlSpider):
    name = "indiegogo"
    allowed_domains = ["indiegogo.com"]
    start_urls = [
        "https://www.indiegogo.com/sitemap",
        "https://www.indiegogo.com/explore",
        "http://go.indiegogo.com/blog/category/campaigns-2",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=countdown&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=new&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=most_funded&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=popular_all&per_page=50000"
    ]
    rules = (
        Rule(LinkExtractor(allow=('/explore?'))),
        Rule(LinkExtractor(allow=('/campaigns-2/'))),
        Rule(LinkExtractor(allow=('/projects/')), callback='parse_item'),
    )

    def parse_item(self, response):
        [...]

Side note: there are other URL formats, www.indiegogo.com/projects/[NameOfProject]/[OtherStuff], which either redirect to the desired URL format or give a 404 error when I try to load them in a browser. I'm assuming Scrapy handles the redirects and blank pages correctly, but I'm open to ways of verifying this.
Posted on 2014-11-05 08:10:35
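If you have a link to a sitemap, it will be quicker to let Scrapy fetch the pages from there and process them. That works roughly as follows.

# In Scrapy 1.0 and later: from scrapy.spiders import SitemapSpider
from scrapy.contrib.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    # Rules for which extracted URLs to process go under sitemap_rules.
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    # Only follow sitemaps whose URL matches this pattern.
    sitemap_follow = ['/sitemap_shop']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...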
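To make the idea behind sitemap_rules more concrete, here is a minimal standard-library sketch of what a sitemap-driven crawl boils down to: read the `<loc>` entries from a sitemap document and keep those whose URL matches a regular expression. The sample XML and the helper name `project_urls_from_sitemap` are made up for illustration, and whether Indiegogo's sitemap actually lists every project page is an assumption that would need checking.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data; a real sitemap would be downloaded first.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.indiegogo.com/projects/example-one</loc></url>
  <url><loc>https://www.indiegogo.com/explore</loc></url>
  <url><loc>https://www.indiegogo.com/projects/example-two</loc></url>
</urlset>
"""


def project_urls_from_sitemap(xml_text, pattern=r"/projects/"):
    """Return the <loc> URLs in a sitemap that match `pattern`."""
    # Encode to bytes so the XML encoding declaration is handled by the parser.
    root = ET.fromstring(xml_text.encode("utf-8"))
    regex = re.compile(pattern)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if regex.search(url):
            urls.append(url)
    return urls


if __name__ == "__main__":
    for url in project_urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
```

The `pattern` argument plays the same role as the first element of each sitemap_rules entry: only URLs it matches are passed on for scraping.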
Posted on 2014-11-07 02:55:56
Try the code below; it crawls the site and scrapes only the pages under "indiegogo.com/projects/".
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from sitemap.items import myitem


class DmozSpider(CrawlSpider):
    name = 'indiego'
    allowed_domains = ['indiegogo.com']
    start_urls = [
        'http://indiegogo.com'
    ]
    # Match project pages by URL regex; allow_domains expects bare
    # domain names, not paths, so a path there would match nothing.
    rules = (
        Rule(LinkExtractor(allow=(r'/projects/',)), callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        item = myitem()
        item['link'] = response.request.url
        item['title'] = response.xpath('//title').extract()
        yield item

https://stackoverflow.com/questions/26745024
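One detail worth spelling out: LinkExtractor's `allow` argument takes regular expressions matched against the URL, whereas `allow_domains` takes bare domain names. The standard-library sketch below shows the kind of pattern that selects project pages and reduces the www.indiegogo.com/projects/[NameOfProject]/[OtherStuff] variants mentioned in the question to the base project URL, so that each project is counted once. The names `canonical_project_url` and `unique_projects` are hypothetical, and treating the first path segment after /projects/ as the project identifier is an assumption based on the URL format described in the question.

```python
import re
from urllib.parse import urlsplit

# Matches /projects/<name> with optional trailing segments; the name is
# assumed to be the first path segment after /projects/.
PROJECT_PATH = re.compile(r"^/projects/([^/]+)(?:/.*)?$")


def canonical_project_url(url):
    """Return the base project URL, or None if `url` is not a project page."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Accept the bare domain and any subdomain such as www.
    if host != "indiegogo.com" and not host.endswith(".indiegogo.com"):
        return None
    match = PROJECT_PATH.match(parts.path)
    if match is None:
        return None
    return "https://www.indiegogo.com/projects/" + match.group(1)


def unique_projects(urls):
    """Collapse crawled URLs to a sorted list of distinct project URLs."""
    seen = {canonical_project_url(u) for u in urls}
    seen.discard(None)  # drop URLs that were not project pages
    return sorted(seen)
```

Running the scraped `link` values through `unique_projects` gives a quick way to compare the number of distinct projects found against the expected total.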