
Getting the download installers from softpedia.com with Scrapy

Stack Overflow user
Asked on 2013-11-04 18:52:53
1 answer · 823 views · 0 followers · 1 vote

Currently I can crawl endlessly through the links on softpedia.com (including the installer links I want, such as a1keylogger.zip?item=33649-3&affiliate=22260).

spider.py is as follows:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """
    name = "softpedia"

    # Stay within these domains when crawling
    allowed_domains = ["www.softpedia.com"]

    start_urls = [
        "http://win.softpedia.com/",
    ]

    download_delay = 2

    # Add our callback which will be called for every found link
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]
```

items.py, pipelines.py and settings.py are the defaults, except for one line added to settings.py:

```python
FILES_STORE = '/home/test/softpedia/downloads'
```
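For the FilesPipeline route discussed below, FILES_STORE alone is not enough — the pipeline also has to be enabled in settings.py. A minimal sketch, assuming the Scrapy 0.20-era contrib import path (older releases used a list rather than a dict for ITEM_PIPELINES):

```python
# settings.py -- enable the built-in files pipeline
# (import path as of the Scrapy 0.20-era contrib layout; an assumption here)
ITEM_PIPELINES = {'scrapy.contrib.pipeline.files.FilesPipeline': 1}

# Directory where downloaded files are stored (from the question)
FILES_STORE = '/home/test/softpedia/downloads'
```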

Using urllib2 I can tell whether a link points to an installer; in this case I get an "application" type in content_type:

```python
>>> import urllib2
>>> url = 'http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260'
>>> response = urllib2.urlopen(url)
>>> content_type = response.info().get('Content-Type')
>>> print content_type
application/zip
```
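That Content-Type check can be folded into a small predicate so only binary downloads are kept; a minimal sketch (the helper name is my own, not part of the question's code):

```python
def is_installer(content_type):
    """Return True when a Content-Type header looks like a binary download."""
    if not content_type:
        return False
    # Drop any parameters such as "; charset=..." before comparing the MIME type
    mime = content_type.split(';')[0].strip().lower()
    return mime.startswith('application/')
```

With the header from the session above, is_installer('application/zip') is True, while an HTML landing page ('text/html') is rejected.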

My question is: how do I collect the installer links I want and download them to my target folder? Thanks in advance!

PS:

I have found two possible approaches so far, but I cannot get either of them to work:

1. https://stackoverflow.com/a/7169241/2092480 — following this answer, I added the code below to the spider:

```python
def parse_installer(self, response):
    # extract links
    lx = SgmlLinkExtractor()
    urls = lx.extract_links(response)
    for url in urls:
        yield Request(url, callback=self.save_installer)

def save_installer(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:  # or using wget
        f.write(response.body)
```

The spider just runs as if this code did not exist, and I get no files downloaded. Can anybody see where it went wrong?
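The snippet above also calls a self.get_path helper that is never shown; a minimal sketch of what such a helper could look like (the file-naming scheme is an assumption, reusing the FILES_STORE directory from the question's settings.py):

```python
import os

FILES_STORE = '/home/test/softpedia/downloads'

def get_path(url):
    """Map a download URL to a local file path under FILES_STORE (sketch)."""
    # Take the last path segment and drop any query string such as "?item=..."
    name = url.split('?')[0].rstrip('/').split('/')[-1] or 'download.bin'
    return os.path.join(FILES_STORE, name)
```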

2. https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ — this approach works on its own when I feed predefined links into "file_urls". But how do I get Scrapy to collect all the installer links into "file_urls"? Besides, for a task this simple the approach above should be enough.

1 Answer

Stack Overflow user

Accepted answer

Answered on 2013-11-22 08:08:23

I combined the two approaches mentioned above to get the actual/mirror installer download links, then used the files download pipeline to perform the actual download. However, it does not seem to work when the file download URL is dynamic/complex, e.g. http://www.softpedia.com/dyn-postdownload.php?p=00000&t=0&i=1, but it works for simpler links such as http://www.ietf.org/rfc/rfc2616.txt:

```python
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.contrib.loader import XPathItemLoader
from scrapy import log
from datetime import datetime
from scrapy.conf import settings
from myscraper.items import SoftpediaItem

class SoftpediaSpider(CrawlSpider):
    name = "sosoftpedia"
    allowed_domains = ["www.softpedia.com"]
    start_urls = ['http://www.softpedia.com/get/Antivirus/']
    rules = (
        Rule(SgmlLinkExtractor(allow=('/get/',),
                               allow_domains=("www.softpedia.com"),
                               restrict_xpaths=("//td[@class='padding_tlr15px']",)),
             callback='parse_links', follow=True),
    )

    def parse_start_url(self, response):
        return self.parse_links(response)

    def parse_links(self, response):
        print "PRODUCT DOWNLOAD PAGE: " + response.url
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//a[contains(@itemprop, 'downloadURL')]/@href").extract()
        for url in urls:
            item = SoftpediaItem()
            request = Request(url=url, callback=self.parse_downloaddetail)
            request.meta['item'] = item
            yield request

    def parse_downloaddetail(self, response):
        item = response.meta['item']
        hxs = HtmlXPathSelector(response)
        item["file_urls"] = hxs.select('//p[@class="fontsize16"]/b/a/@href').extract()  # ["http://www.ietf.org/rfc/rfc2616.txt"]
        print "ACTUAL DOWNLOAD LINKS " + hxs.select('//p[@class="fontsize16"]/b/a/@href').extract()[0]
        yield item
```
1 vote
Page content originally provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/19774912