文章/答案/技术大牛

发布

社区首页 >问答首页 >LZW减压库

问LZW减压库
EN

Code Review用户

提问于 2018-03-16 13:32:12

回答 2查看 1.5K关注 0票数 6

我被要求为面试过程写一个解压缩程序库。是LZW解压缩库。我已经试过了。我正在寻找任何反馈的结构，演示，可读性从专业人士谁曾作为软件工程师。

LZW解释。解压缩部分中的第二个表解释了字典是如何构建的，在我的例子中，键大小被限制在12位。

"""
Decompressor library for Lempel–Ziv–Welch compression algorithm.
Main Functions:
    decompress(file_path)
    extract_archive_to_file(input_path, output_path)
Test code in the end.
"""


def read_compressed_codes_from_archive(file_path: str) -> [int]:
    """
    Reads file at specified path and returns content as an int array.
    :param file_path: path to archive
    :return: array of int codes
    """
    compressed_codes = []
    with open(file_path, "rb") as archive:
        # Read the 3 bytes, which contain two 12-bit codes
        bytes = archive.read(3)
        while bytes:
            if len(bytes) == 3:
                code1, code2 = unpack_codes_from_3_bytes(bytes)

                compressed_codes.append(code1)
                compressed_codes.append(code2)
            elif len(bytes) == 2:
                # Odd number of codes in the end of the file
                # Extract a single 12-bit code which is padded to 16-bits
                code = (0b0000111111111111 & (bytes[0] << 8)) | bytes[1]
                compressed_codes.append(code)

            bytes = archive.read(3)

    return compressed_codes


def unpack_codes_from_3_bytes(packed_bytes: [bytes]) -> (int, int):
    """
    Extracts two 12-bit numbers from 3 bytes (24-bits)
    :param packed_bytes: array of 3 int bytes
    :return: extracted int codes
    """
    # Create 16-bit int vars to store extracted codes in
    code1 = 0b0000111111111111
    code2 = 0b0000111111111111

    # Copy over 8 most significant bits and shift them to correct place
    code1 = code1 & packed_bytes[0] << 4
    # Copy over the 4 least significant bits from second byte
    code1 = code1 | (packed_bytes[1] >> 4)

    # Copy over the 4 most significant bits from second byte
    code2 = code2 & packed_bytes[1] << 8
    # Copy over the 8 least significant bits from third byte
    code2 = code2 | packed_bytes[2]

    return code1, code2


def convert_codes_to_text(compressed_codes: [int]) -> str:
    """
    Decompresses codes into text data.
    :param compressed_codes: compressed codes from archive
    :return: string data from compressed codes
    """
    # Create initial dictionary with individual char compressed_codes
    dictionary_size = 256
    dictionary = {i: chr(i) for i in range(dictionary_size)}

    text_data = ""
    previous_code = None

    # Loop over compressed compressed_codes and convert them to text values.
    for code in compressed_codes:
        if code in dictionary:
            text_data += dictionary[code]

            if previous_code:
                # Make sure new value is not already in the dict,
                # otherwise add it to the end.
                if dictionary[previous_code] + dictionary[code][0] not in \
                        dictionary.keys():
                    dictionary[dictionary_size] = dictionary[previous_code] \
                                                + dictionary[code][0]

                    dictionary_size += 1
                    # If dictionary is full, reset back to initial dict
                    # That is 4096 in binary ie one and 12 zeros
                    if dictionary_size == 0b1000000000000:
                        dictionary_size = 256
        else:
            assert ("Decompression error. Code {} not present in "
                    "dictionary".format(code))

        previous_code = code

    return text_data


def decompress(file_path: str) -> str:
    """
    Decompress an archive and get the text data.
    :param filepath: path to archive
    :return: text data as string
    """
    codes = read_compressed_codes_from_archive(file_path)
    text_data = convert_codes_to_text(codes)
    return text_data


def extract_archive_to_file(input_path: str, output_path: str):
    """
    Extracts an archive and saves it on disk.
    :param input_path: path to archive
    :param output_path: path to extracted file
    :return: none
    """
    text_data = decompress(input_path)
    with open(output_path, "w") as output_file:
        output_file.write(text_data)


def test_library():
    """
    Example test function.
    """
    file_path = "LzwInputData/compressedfile3.z"
    text_data = decompress(file_path)
    print(text_data)

    output_path = "LzwInputData/compressedfile3.txt"
    extract_archive_to_file(file_path, output_path)

test_library()

谢谢

python

interview-questions

回答 2

Code Review用户

回答已采纳

发布于 2018-03-16 13:59:04

一般来说，这看起来像是写得很好，文档完整，干净的python代码--所以总体来说很好！但如果我非要挑剔的话，有几件事我可以提.

code1, code2 = unpack_codes_from_3_bytes(bytes)
compressed_codes.append(code1)
compressed_codes.append(code2)

可缩短为：

compressed_codes.extend(unpack_codes_from_3_bytes(bytes))

我有点不确定你为什么这么做：

if dictionary_size == 0b1000000000000:

首先，将其定义为int (dictionary_size = 256)，然后惟一的修改是int加法(dictionary_size += 1)，那么为什么要进行二进制比较呢？

你能解释一下为什么要使用assert()吗？在这样的模块中，我通常倾向于显式地引发异常--甚至可以生成自定义的异常类，以便模块的最终用户能够轻松地捕获它们--或者如果这是用于报告，则使用logging.error或类似的。

很少有地方可以生成变量，并且只使用它们一次，例如：

text_data = convert_codes_to_text(codes)
return text_data

可以是：

return convert_codes_to_text(codes)

或者：

text_data = decompress(input_path)
with open(output_path, "w") as output_file:
    output_file.write(text_data)

可以是：

with open(output_path, "w") as output_file:
    output_file.write(decompress(input_path))

最后，not in dictionary.keys():可以是not in dictionary:

票数 3

Code Review用户

发布于 2018-03-19 21:41:12

首先，我建议您更改read_compressed_codes_from_archive的API，使其不是文件路径，而是一个类似文件的对象；请参见这个职位。如果有一天你不想只读一个文件，它会给你更多的自由。举个例子，我以前使用的一个应用程序(不是Python)最终遇到了一些问题，我们使用的第三方API只接受文件名，最后我们得到了错误报告，说它对大型文件崩溃了。事实证明，按照他们设计库的方式，他们只能一次加载整个文件来处理文件，而不是通过流或增量加载文件。显式地支持流或类似于启动的东西是处理它的好方法。

接下来，出于类似的原因，您应该让读者成为一个生成器。总的来说，我可能会重写代码，使其看起来像这样(注意，我不知道生成器的正确文档和类型语法是什么，所以我保留了原样)。

import typing

def read_compressed_codes_from_archive(archive: typing.BinaryIO) -> [int]:
    """
    Reads file at specified path and returns content as an int array.
    :param file_path: path to archive
    :return: array of int codes
    """

    bytes = archive.read(3)

    while bytes:
        if len(bytes) == 3:
            code1, code2 = unpack_codes_from_3_bytes(bytes)

            yield code1
            yield code2
        elif len(bytes) == 2:
            yield (TWELVE_BIT_MASK & (bytes[0] << 8)) | bytes[1]

        bytes = archive.read(3)

我还定义了一个全局常量，TWELVE_BIT_MASK = 0b0000111111111111，所以更清楚地知道在那里到底发生了什么。然后，我们可以简化位旋转一点；得到前12位数不需要任何掩蔽。从那里开始，我会将所有的数据都移动到unpack_codes_from_3_bytes中，并使用yield from来获取所有的值，不管在何种情况下。最后，我们可以编写一个函数来处理获取前12位和最后12位的每个用例，如下所示：

import typing

def read_compressed_codes_from_archive(archive: typing.BinaryIO) -> [int]
    bytes = archive.read(3)

    while bytes:
        yield from unpack_codes_from_3_bytes(bytes)
        bytes = archive.read(3)

def unpack_codes_from_3_bytes(packed_bytes: [bytes]) -> int:

    if len(bytes) == 3:
        yield get_first_twelve_bits(packed_bytes)
    yield get_last_twelve_bits(packed_bytes)

def get_first_twelve_bits(packed_bytes: [bytes]) -> int:
    return (packed_bytes[0] << 4) | (packed_bytes[1] >> 4)

def get_last_twelve_bits(packed_bytes: [bytes]) -> int:
    return (TWELVE_BIT_MASK & packed_bytes[-2] << 8) | packed_bytes[-1]

您甚至可以考虑创建一个类来表示一个可迭代的LZW存档，就像这样(从我的头上，所以可能是不正确的)

class LzwArchive:

    def __init__(self, archive: typing.BinaryIO):
        self.archive = archive

    def __iter__(self):
        codes = self.archive.read(3)

        while codes:
            yield from LzwArchive._unpack_codes(codes)
            codes = self.archive.read(3)

    @staticmethod
    def _unpack_codes(codes: [bytes]) -> [int]:
        if len(codes) == 3:
            yield LzwArchive._get_first_twelve_bits(codes)
        yield LzwArchive._get_last_twelve_bits(codes)

    @staticmethod
    def _get_first_twelve_bits(packed_bytes: [bytes]) -> int:
        return (packed_bytes[0] << 4) | (packed_bytes[1] >> 4)

    @staticmethod
    def _get_last_twelve_bits(packed_bytes: [bytes]) -> int:
        return (TWELVE_BIT_MASK & packed_bytes[-2] << 8) | packed_bytes[-1]

然后你可以做这样的事情：

with open(myfile, 'rb') as f:
    archive = LzwArchive(f)
    for code in archive:
        process(code)

让发电机开动！我们也可以使convert_codes_to_text成为可迭代的

def convert_codes_to_text(compressed_codes: [int]) -> str:
    dictionary_size = MIN_DICT_SIZE
    dictionary = {i: chr(i) for i in range(dictionary_size)}

    text_data = ""
    previous_code = None

    # Loop over compressed compressed_codes and convert them to text values.
    for code in compressed_codes:
        try:
            yield dictionary[code]
        except KeyError:
            # Do your thing here, but something better than an assert

        if previous_code:
            new_code = dictionary[code][0] + dictionary[previous_code]
            if new_code not in dictionary:
                dictionary[dictionary_size] = new_code
                dictionary_size += 1

            if dictionary_size = MAX_DICT_SIZE:
                dictionary_size = MIN_DICT_SIZE

        previous_code = code

我觉得我们在这里做了太多的工作来维护我们的字典，所以我认为我们也可以在那里建立一个更聪明的数据结构。我们可以利用LzwDictionary使其成为子类dict，并真正像dict (实际上，子类defaultdict和/或OrderedDict也可以给您带来一些非常酷的东西)，这比您现在需要的更多。

def convert_codes_to_text(compressed_codes: [int]) -> str:
    dictionary = LzwDictionary()
    previous_code = None

    for code in compressed_codes:
        try:
            yield dictionary[code]
        except KeyError:
            # Do your thing here, but something better than an assert

        if previous_code:
            new_code = dictionary[code][0] + dictionary[previous_code]
            if new_code not in dictionary:
                dictionary.add_code(new_code)

        previous_code = code

class LzwDictionary:
    MAX_SIZE = 4096
    MIN_SIZE = 256

    def __init__(self):
        self.data = {i: chr(i) for i in range(LzwDictionary.MIN_SIZE)}
        self.size = LzwDictionary.MIN_SIZE

    def __getitem__(self, key):
        return self.data[key]

    def __contains__(self, key):
        return key in self.data

    def add_code(self, code):
        self.data[self.size] = code
        self.size += 1
        self._check_size()

    def _check_size(self):
        if self.size > LzwDictionary.MAX_SIZE:
            self.size = LzwDictionary.MIN_SIZE

此时，我们只需要实际使用迭代器；您可以执行''.join(convert_codes_to_text(codes))或增量写入。我可能最终会有这样的结果：

def extract_archive_to_file(input_path, output_path):

    with open(input_path, 'rb') as lz_file:
        archive = LzwArchive(lz_file)
        decompressed = convert_codes_to_text(archive)
        with open(output_path, 'w') as out_file:
            for character in decompressed:
                out_file.write(character)

所有这些都是在不访问Python的情况下编写的，所以请原谅任何输入或错误。

票数 2

页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://codereview.stackexchange.com/questions/189761

复制

相似问题

问LZW减压库
EN

回答 2

Code Review用户

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问LZW减压库EN

回答 2

Code Review用户

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问LZW减压库
EN