
Why isn't downloading a text file working?

Stack Overflow user
Asked 2013-04-18 21:18:57
Answers: 1 · Views: 747 · Followers: 0 · Votes: 1

I am using Python 3.3.1. I have written a function called download_file() that downloads files and saves them to disk.

#!/usr/bin/python3
# -*- coding: utf8 -*-

import datetime
import os
import urllib.error
import urllib.request


def download_file(*urls, download_location=os.getcwd(), debugging=False):
    """Downloads the files provided as multiple url arguments.

    Provide the url for files to be downloaded as strings. Separate the
    files to be downloaded by a comma.

    The function would download the files and save it in the folder
    provided as keyword-argument for download_location. If
    download_location is not provided, then the file would be saved in
    the current working directory. Folder for download_location would be
    created if it doesn't already exist. Do not worry about trailing
    slash at the end for download_location. The code would take care of
    it for you.

    If the download encounters an error it would alert about it and
    provide the information about the Error Code and Error Reason (if
    received from the server).

    Normal Usage:
    >>> download_file('http://localhost/index.html',
                      'http://localhost/info.php')
    >>> download_file('http://localhost/index.html',
                      'http://localhost/info.php',
                      download_location='/home/aditya/Download/test')
    >>> download_file('http://localhost/index.html',
                      'http://localhost/info.php',
                      download_location='/home/aditya/Download/test/')

    In Debug Mode, files are not downloaded, neither there is any
    attempt to establish the connection with the server. It just prints
    out the filename and its url that would have been attempted to be
    downloaded in Normal Mode.

    By Default, Debug Mode is inactive. In order to activate it, we
    need to supply a keyword-argument as 'debugging=True', like:
    >>> download_file('http://localhost/index.html',
                      'http://localhost/info.php',
                      debugging=True)
    >>> download_file('http://localhost/index.html',
                      'http://localhost/info.php',
                      download_location='/home/aditya/Download/test',
                      debugging=True)

    """
    # Append a trailing slash at the end of download_location if not
    # already present
    if download_location[-1] != '/':
        download_location = download_location + '/'

    # Create the folder for download_location if not already present
    os.makedirs(download_location, exist_ok=True)

    # Other variables
    time_format = '%Y-%b-%d %H:%M:%S'   # '2000-Jan-01 22:10:00'

    # "Request Headers" information for the file to be downloaded
    accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    accept_encoding = 'gzip, deflate'
    accept_language = 'en-US,en;q=0.5'
    connection = 'keep-alive'
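    # Adjacent string literals are joined, so the header value contains
    # no stray indentation (a backslash inside the quotes would keep it).
    user_agent = ('Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) '
                  'Gecko/20100101 Firefox/20.0')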
    headers = {'Accept': accept,
               'Accept-Encoding': accept_encoding,
               'Accept-Language': accept_language,
               'Connection': connection,
               'User-Agent': user_agent,
               }

    # Loop through all the files to be downloaded
    for url in urls:
        filename = os.path.basename(url)
        if not debugging:
            try:
                request_sent = urllib.request.Request(url, None, headers)
                response_received = urllib.request.urlopen(request_sent)
            except urllib.error.URLError as error_encountered:
                print(datetime.datetime.now().strftime(time_format),
                      ':', filename, '- The file could not be downloaded.')
                if hasattr(error_encountered, 'code'):
                    print(' ' * 22, 'Error Code -', error_encountered.code)
                if hasattr(error_encountered, 'reason'):
                    print(' ' * 22, 'Reason -', error_encountered.reason)
            else:
                read_response = response_received.read()
                output_file = download_location + filename
                with open(output_file, 'wb') as downloaded_file:
                    downloaded_file.write(read_response)
                print(datetime.datetime.now().strftime(time_format),
                      ':', filename, '- Downloaded successfully.')
        else:
            print(datetime.datetime.now().strftime(time_format),
                  ': Debugging :', filename, 'would be downloaded from :\n',
                  ' ' * 21, url)

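This function works for downloading PDFs, images and other formats, but it runs into trouble with text documents such as HTML files. I suspect the problem has to do with this line near the end: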

with open(output_file, 'wb') as downloaded_file:

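So I also tried opening the file in wt mode, and in plain w mode as well, but that did not solve the problem.

Another possible culprit is the encoding, so I also included this as the second line of the script: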

# -*- coding: utf8 -*-

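But it still doesn't work. What could the problem be, and how can I make the function work for both text files and binary files?

An example that doesn't work: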

>>> download_file("http://docs.python.org/3/tutorial/index.html")

When I open the downloaded file in Gedit, it shows up as:

[screenshot]

Similarly, when it is opened in Firefox:

[screenshot]

1 Answer

Stack Overflow user

Accepted answer

Answered 2013-04-18 21:46:10

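The file you are downloading is being sent with gzip encoding. You can verify this by running zcat index.html, which displays the downloaded file correctly. In your code, you may want to add something like the following: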

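# Requires `import zlib` at the top of the module.
if response_received.headers.get('Content-Encoding') == 'gzip':
    # 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer.
    read_response = zlib.decompress(read_response, 16 + zlib.MAX_WBITS)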
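
As a fuller sketch of the same idea, the helper below decodes a body according to its Content-Encoding value. The name `decode_body` is hypothetical and not part of the original code, and the sketch assumes the server applies at most one of `gzip` or `deflate`, the two encodings that the question's `accept_encoding` value advertises.

```python
import zlib


def decode_body(raw, content_encoding):
    """Return the decoded bytes of an HTTP body (hypothetical helper).

    raw              -- bytes as returned by response.read()
    content_encoding -- value of the Content-Encoding header, or None
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    if encoding == 'deflate':
        try:
            # Try the zlib-wrapped form of deflate first.
            return zlib.decompress(raw)
        except zlib.error:
            # Fall back to a raw deflate stream without the zlib wrapper.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not handle: return bytes as-is.
    return raw
```

Inside the question's `else:` branch this could be called as `read_response = decode_body(response_received.read(), response_received.headers.get('Content-Encoding'))`, after which the existing `'wb'` write would store the decoded bytes for text and binary files alike.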

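Edit:

Well, I can't say why it works on Windows (unfortunately I don't have a Windows box to test it on), but if you post a dump of the response (that is, the response object converted to a string), that might offer some insight. Presumably the server chose not to send the file with gzip encoding there, but given that this code is so explicit about its headers, I'm not sure what would be different.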
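
One possible way to produce such a dump is sketched below. The helper name `describe_response` is hypothetical; it assumes the object exposes `status`, `reason` and a mapping-like `headers`, as the object returned by `urllib.request.urlopen()` does for HTTP URLs.

```python
def describe_response(response):
    """Return a printable summary of an HTTP response (hypothetical helper).

    Expects an object with .status, .reason and a mapping-like .headers.
    """
    # First line: status code and reason phrase, e.g. "200 OK".
    lines = ['{} {}'.format(response.status, response.reason)]
    # One line per response header, including Content-Encoding if present.
    for name, value in response.headers.items():
        lines.append('{}: {}'.format(name, value))
    return '\n'.join(lines)
```

Calling `print(describe_response(response_received))` on both systems and comparing the `Content-Encoding` lines would show whether the server compressed the body in each case.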

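For what it's worth, your headers explicitly state that gzip and deflate are acceptable (see accept_encoding). If you remove that header, you shouldn't have to worry about decompressing the response in any case.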
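
A minimal sketch of that alternative follows, assuming a hypothetical helper named `build_request` that assembles the question's request headers without the `Accept-Encoding` entry. The other header values are copied from the question.

```python
import urllib.request


def build_request(url, user_agent):
    """Build a request that does not advertise gzip or deflate support
    (hypothetical helper)."""
    headers = {
        'Accept': ('text/html,application/xhtml+xml,'
                   'application/xml;q=0.9,*/*;q=0.8'),
        'Accept-Language': 'en-US,en;q=0.5',
        'User-Agent': user_agent,
        # 'Accept-Encoding' is deliberately left out, so the request no
        # longer tells the server that a compressed body is acceptable.
    }
    return urllib.request.Request(url, None, headers)
```

The returned object can be passed to `urllib.request.urlopen()` exactly as `request_sent` is in the question. The trade-off is that responses are no longer compressed in transit, so downloads of large text files may take longer.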

Votes: 2

Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/16084117
