Question: Speeding up the download of files from the web

Stack Overflow user

Asked 2016-12-07 02:25:01

Answers: 2, Views: 1.9K, Followers: 0, Votes: 1

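I'm writing a program that has to download a bunch of files from the web before it can run, so I created a function called init_program that downloads all of the files and "initializes" the program. It works by running through a couple of dicts that hold URLs to gist files on GitHub. It pulls out the URLs and downloads them with urllib2. I can't add all of the files here, but you can try it yourself by cloning the repo linked in the original post. Here is the function that creates the files from the gists: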

def init_program():
    """ Initialize the program and allow all the files to be downloaded
        This will take awhile to process, but I'm working on the processing
        speed """

    downloaded_wordlists = []  # Used to count the amount of items downloaded
    downloaded_rainbow_tables = []

    print("\n")
    banner("Initializing program and downloading files, this may take awhile..")
    print("\n")

    # INIT_FILE is a file that will contain "false" if the program is not initialized
    # And "true" if the program is initialized
    with open(INIT_FILE) as data: 
        if data.read() == "false": 
            for item in GIST_DICT_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} wordlists.. ".format(len(downloaded_wordlists) + 1, 
                                                                                  len(GIST_DICT_LINKS.keys())))
                sys.stdout.flush()
                new_wordlist = open("dicts/included_dicts/wordlists/{}.txt".format(item), "a+") 
                # Download the wordlists and save them into a file
                wordlist_data = urllib2.urlopen(GIST_DICT_LINKS[item])
                new_wordlist.write(wordlist_data.read())
                downloaded_wordlists.append(item + ".txt")
                new_wordlist.close()

            print("\n")
            banner("Done with wordlists, moving to rainbow tables..")
            print("\n")

            for table in GIST_RAINBOW_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} rainbow tables".format(len(downloaded_rainbow_tables) + 1, 
                                                                                    len(GIST_RAINBOW_LINKS.keys())))
                # Open in "a+" mode like the wordlists; the default read-only mode would make write() fail
                new_rainbowtable = open("dicts/included_dicts/rainbow_tables/{}.rtc".format(table), "a+")
                # Download the rainbow tables and save them into a file
                rainbow_data = urllib2.urlopen(GIST_RAINBOW_LINKS[table])
                new_rainbowtable.write(rainbow_data.read())
                downloaded_rainbow_tables.append(table + ".rtc")
                new_rainbowtable.close()

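            # Reopen INIT_FILE by path for writing; `data` is a read-only file object, not a path
            with open(INIT_FILE, "w") as init_file:
                init_file.write("true")  # Will never be initialized again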
        else:
            pass

    return downloaded_wordlists, downloaded_rainbow_tables

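This works, but it is extremely slow because of the size of the files: each one has at least 100,000 lines. How can I speed this function up so that it is faster and more user friendly?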

urllib2

gist

python

performance

python-2.7

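2 Answers

Stack Overflow user

Accepted answer

Answered 2016-12-07 03:24:57

A few weeks ago I was in a similar situation: I needed to download many large files, and none of the simple pure-Python solutions I found were good enough in terms of download optimization. I ended up using Axel (https://github.com/eribertomota/axel), a light command-line download accelerator for Linux and Unix.

What is Axel? Axel tries to accelerate the download process by using multiple connections for one file, similar to DownThemAll and other well-known programs. It can also use multiple mirrors for one download. According to some tests, Axel can speed up a download by up to 60%.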

Usage: axel [options] url1 [url2] [url...]

--max-speed=x        -s x     Specify maximum speed (bytes per second)
--num-connections=x  -n x     Specify maximum number of connections
--output=f           -o f     Specify local output file
--search[=x]         -S [x]   Search for mirrors and download from x servers
--header=x           -H x     Add header string
--user-agent=x       -U x     Set user agent
--no-proxy           -N       Just don't use any proxy server
--quiet              -q       Leave stdout alone
--verbose            -v       More status information
--alternate          -a       Alternate progress indicator
--help               -h       This information
--version            -V       Version information

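Since Axel is written in C and there is no C extension for it in Python, I used the subprocess module to run it as an external process, and that works very well for me.

You can do something like this: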

import subprocess

# n_connections, filename and url are supplied by the caller
cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o',
       "{0}".format(filename), url]
process = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

You can also track the progress of each download by parsing what Axel writes to stdout.

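    while True:
        line = process.stdout.readline()
        if not line:  # empty read means axel closed stdout (it has exited)
            break
        match = YOUR_GREAT_REGEX.match(line)
        if match:  # skip lines that carry no progress information
            progress = match.groups()
            ...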
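
To apply this to a dict of links like the ones in the question, the pieces above could be wrapped roughly as follows. This is only a sketch: `build_axel_command` and `download_with_axel` are hypothetical helper names, it assumes `axel` is on the PATH and exits with a non-zero status when a download fails, and it is written for Python 3 (on Python 2.7 use `links.iteritems()` or keep `links.items()`, both work). Only the `-n` and `-o` flags from the usage text above are used.

```python
import subprocess


def build_axel_command(url, filename, n_connections=4, axel_path="axel"):
    """Build the argument list for a single axel download."""
    # -n: number of connections, -o: local output file (see usage text above)
    return [axel_path, "-n", str(n_connections), "-o", filename, url]


def download_with_axel(links, dest_template, n_connections=4):
    """Run axel once per entry of `links` ({name: url}).

    `dest_template` is a format string such as
    "dicts/included_dicts/wordlists/{}.txt". Returns the names whose
    download ended with a non-zero exit status (assumed to mean failure).
    """
    failed = []
    for name, url in links.items():
        cmd = build_axel_command(url, dest_template.format(name), n_connections)
        if subprocess.call(cmd) != 0:
            failed.append(name)
    return failed
```

A call might look like `download_with_axel(GIST_DICT_LINKS, "dicts/included_dicts/wordlists/{}.txt", n_connections=8)`. The files are still fetched one after another here; the speed-up comes from Axel splitting each file across several connections.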

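Votes: 1

Stack Overflow user

Answered 2016-12-07 08:20:08

You are blocking while you wait for each download, so the total time is the sum of the round-trip times of all the downloads. Your code probably spends most of its time waiting on network traffic. One way to improve this is to not block while waiting for each response. You can do that in several ways, for example by handing each request to a separate thread (or process), or by using an event loop and coroutines. Read up on the threading and asyncio modules.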
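
As an illustration of the thread-based option, here is a minimal sketch using a thread pool. It is an assumption-laden example, not the asker's code: it targets Python 3 (`urllib.request` and `concurrent.futures`; on Python 2.7 the equivalents would be `urllib2` and the `futures` backport), and `fetch_url`, `download_all`, the `fetch` parameter and `max_workers=8` are names and values chosen for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_url(url):
    """Return the body of `url` as bytes (a blocking network call)."""
    with urlopen(url) as response:
        return response.read()


def download_all(links, dest_dir, extension, fetch=fetch_url, max_workers=8):
    """Download every URL in `links` ({name: url}) using a pool of threads.

    Each file is saved as <dest_dir>/<name><extension>. Returns the sorted
    list of file names that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)

    def download_one(name):
        filename = name + extension
        data = fetch(links[name])  # threads overlap while waiting on the network
        with open(os.path.join(dest_dir, filename), "wb") as out:
            out.write(data)
        return filename

    written = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download_one, name) for name in links]
        for future in as_completed(futures):
            written.append(future.result())  # re-raises any download error
    return sorted(written)
```

With the dicts from the question this might be called as `download_all(GIST_DICT_LINKS, "dicts/included_dicts/wordlists", ".txt")`. The `fetch` parameter exists so the network call can be swapped out, for example in tests.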

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/41008390
