首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >LZW减压库

LZW减压库
EN

Code Review用户
提问于 2018-03-16 13:32:12
回答 2查看 1.5K关注 0票数 6

我被要求为面试过程写一个解压缩程序库。是LZW解压缩库。我已经试过了。我正在寻找任何反馈的结构,演示,可读性从专业人士谁曾作为软件工程师。

LZW解释。解压缩部分中的第二个表解释了字典是如何构建的,在我的例子中,键大小被限制在12位。

代码语言:javascript
复制
"""
Decompressor library for Lempel–Ziv–Welch compression algorithm.
Main Functions:
    decompress(file_path)
    extract_archive_to_file(input_path, output_path)
Test code in the end.
"""


def read_compressed_codes_from_archive(file_path: str) -> [int]:
    """
    Reads file at specified path and returns content as an int array.
    :param file_path: path to archive
    :return: array of int codes
    """
    compressed_codes = []
    with open(file_path, "rb") as archive:
        # Read the 3 bytes, which contain two 12-bit codes
        bytes = archive.read(3)
        while bytes:
            if len(bytes) == 3:
                code1, code2 = unpack_codes_from_3_bytes(bytes)

                compressed_codes.append(code1)
                compressed_codes.append(code2)
            elif len(bytes) == 2:
                # Odd number of codes in the end of the file
                # Extract a single 12-bit code which is padded to 16-bits
                code = (0b0000111111111111 & (bytes[0] << 8)) | bytes[1]
                compressed_codes.append(code)

            bytes = archive.read(3)

    return compressed_codes


def unpack_codes_from_3_bytes(packed_bytes: [bytes]) -> (int, int):
    """
    Extracts two 12-bit numbers from 3 bytes (24-bits)
    :param packed_bytes: array of 3 int bytes
    :return: extracted int codes
    """
    # Create 16-bit int vars to store extracted codes in
    code1 = 0b0000111111111111
    code2 = 0b0000111111111111

    # Copy over 8 most significant bits and shift them to correct place
    code1 = code1 & packed_bytes[0] << 4
    # Copy over the 4 least significant bits from second byte
    code1 = code1 | (packed_bytes[1] >> 4)

    # Copy over the 4 most significant bits from second byte
    code2 = code2 & packed_bytes[1] << 8
    # Copy over the 8 least significant bits from third byte
    code2 = code2 | packed_bytes[2]

    return code1, code2


def convert_codes_to_text(compressed_codes: [int]) -> str:
    """
    Decompresses codes into text data.
    :param compressed_codes: compressed codes from archive
    :return: string data from compressed codes
    """
    # Create initial dictionary with individual char compressed_codes
    dictionary_size = 256
    dictionary = {i: chr(i) for i in range(dictionary_size)}

    text_data = ""
    previous_code = None

    # Loop over compressed compressed_codes and convert them to text values.
    for code in compressed_codes:
        if code in dictionary:
            text_data += dictionary[code]

            if previous_code:
                # Make sure new value is not already in the dict,
                # otherwise add it to the end.
                if dictionary[previous_code] + dictionary[code][0] not in \
                        dictionary.keys():
                    dictionary[dictionary_size] = dictionary[previous_code] \
                                                + dictionary[code][0]

                    dictionary_size += 1
                    # If dictionary is full, reset back to initial dict
                    # That is 4096 in binary ie one and 12 zeros
                    if dictionary_size == 0b1000000000000:
                        dictionary_size = 256
        else:
            assert ("Decompression error. Code {} not present in "
                    "dictionary".format(code))

        previous_code = code

    return text_data


def decompress(file_path: str) -> str:
    """
    Decompress an archive and get the text data.
    :param filepath: path to archive
    :return: text data as string
    """
    codes = read_compressed_codes_from_archive(file_path)
    text_data = convert_codes_to_text(codes)
    return text_data


def extract_archive_to_file(input_path: str, output_path: str):
    """
    Extracts an archive and saves it on disk.
    :param input_path: path to archive
    :param output_path: path to extracted file
    :return: none
    """
    text_data = decompress(input_path)
    with open(output_path, "w") as output_file:
        output_file.write(text_data)


def test_library():
    """
    Example test function.
    """
    file_path = "LzwInputData/compressedfile3.z"
    text_data = decompress(file_path)
    print(text_data)

    output_path = "LzwInputData/compressedfile3.txt"
    extract_archive_to_file(file_path, output_path)

test_library()

谢谢

EN

回答 2

Code Review用户

回答已采纳

发布于 2018-03-16 13:59:04

一般来说,这看起来像是写得很好,文档完整,干净的python代码--所以总体来说很好!但如果我非要挑剔的话,有几件事我可以提.

代码语言:javascript
复制
code1, code2 = unpack_codes_from_3_bytes(bytes)
compressed_codes.append(code1)
compressed_codes.append(code2)

可缩短为:

代码语言:javascript
复制
compressed_codes.extend(unpack_codes_from_3_bytes(bytes))

我有点不确定你为什么这么做:

代码语言:javascript
复制
if dictionary_size == 0b1000000000000:

首先,将其定义为int (dictionary_size = 256),然后惟一的修改是int加法(dictionary_size += 1),那么为什么要进行二进制比较呢?

你能解释一下为什么要使用assert()吗?在这样的模块中,我通常倾向于显式地引发异常--甚至可以生成自定义的异常类,以便模块的最终用户能够轻松地捕获它们--或者如果这是用于报告,则使用logging.error或类似的。

很少有地方可以生成变量,并且只使用它们一次,例如:

代码语言:javascript
复制
text_data = convert_codes_to_text(codes)
return text_data

可以是:

代码语言:javascript
复制
return convert_codes_to_text(codes)

或者:

代码语言:javascript
复制
text_data = decompress(input_path)
with open(output_path, "w") as output_file:
    output_file.write(text_data)

可以是:

代码语言:javascript
复制
with open(output_path, "w") as output_file:
    output_file.write(decompress(input_path))

最后,not in dictionary.keys():可以是not in dictionary:

票数 3
EN

Code Review用户

发布于 2018-03-19 21:41:12

首先,我建议您更改read_compressed_codes_from_archive的API,使其不是文件路径,而是一个类似文件的对象;请参见这个职位。如果有一天你不想只读一个文件,它会给你更多的自由。举个例子,我以前使用的一个应用程序(不是Python)最终遇到了一些问题,我们使用的第三方API只接受文件名,最后我们得到了错误报告,说它对大型文件崩溃了。事实证明,按照他们设计库的方式,他们只能一次加载整个文件来处理文件,而不是通过流或增量加载文件。显式地支持流或类似于启动的东西是处理它的好方法。

接下来,出于类似的原因,您应该让读者成为一个生成器。总的来说,我可能会重写代码,使其看起来像这样(注意,我不知道生成器的正确文档和类型语法是什么,所以我保留了原样)。

代码语言:javascript
复制
import typing

def read_compressed_codes_from_archive(archive: typing.BinaryIO) -> [int]:
    """
    Reads file at specified path and returns content as an int array.
    :param file_path: path to archive
    :return: array of int codes
    """

    bytes = archive.read(3)

    while bytes:
        if len(bytes) == 3:
            code1, code2 = unpack_codes_from_3_bytes(bytes)

            yield code1
            yield code2
        elif len(bytes) == 2:
            yield (TWELVE_BIT_MASK & (bytes[0] << 8)) | bytes[1]

        bytes = archive.read(3)

我还定义了一个全局常量,TWELVE_BIT_MASK = 0b0000111111111111,所以更清楚地知道在那里到底发生了什么。然后,我们可以简化位旋转一点;得到前12位数不需要任何掩蔽。从那里开始,我会将所有的数据都移动到unpack_codes_from_3_bytes中,并使用yield from来获取所有的值,不管在何种情况下。最后,我们可以编写一个函数来处理获取前12位和最后12位的每个用例,如下所示:

代码语言:javascript
复制
import typing

def read_compressed_codes_from_archive(archive: typing.BinaryIO) -> [int]
    bytes = archive.read(3)

    while bytes:
        yield from unpack_codes_from_3_bytes(bytes)
        bytes = archive.read(3)

def unpack_codes_from_3_bytes(packed_bytes: [bytes]) -> int:

    if len(bytes) == 3:
        yield get_first_twelve_bits(packed_bytes)
    yield get_last_twelve_bits(packed_bytes)

def get_first_twelve_bits(packed_bytes: [bytes]) -> int:
    return (packed_bytes[0] << 4) | (packed_bytes[1] >> 4)

def get_last_twelve_bits(packed_bytes: [bytes]) -> int:
    return (TWELVE_BIT_MASK & packed_bytes[-2] << 8) | packed_bytes[-1]

您甚至可以考虑创建一个类来表示一个可迭代的LZW存档,就像这样(从我的头上,所以可能是不正确的)

代码语言:javascript
复制
class LzwArchive:

    def __init__(self, archive: typing.BinaryIO):
        self.archive = archive

    def __iter__(self):
        codes = self.archive.read(3)

        while codes:
            yield from LzwArchive._unpack_codes(codes)
            codes = self.archive.read(3)

    @staticmethod
    def _unpack_codes(codes: [bytes]) -> [int]:
        if len(codes) == 3:
            yield LzwArchive._get_first_twelve_bits(codes)
        yield LzwArchive._get_last_twelve_bits(codes)

    @staticmethod
    def _get_first_twelve_bits(packed_bytes: [bytes]) -> int:
        return (packed_bytes[0] << 4) | (packed_bytes[1] >> 4)

    @staticmethod
    def _get_last_twelve_bits(packed_bytes: [bytes]) -> int:
        return (TWELVE_BIT_MASK & packed_bytes[-2] << 8) | packed_bytes[-1]

然后你可以做这样的事情:

代码语言:javascript
复制
with open(myfile, 'rb') as f:
    archive = LzwArchive(f)
    for code in archive:
        process(code)

让发电机开动!我们也可以使convert_codes_to_text成为可迭代的

代码语言:javascript
复制
def convert_codes_to_text(compressed_codes: [int]) -> str:
    dictionary_size = MIN_DICT_SIZE
    dictionary = {i: chr(i) for i in range(dictionary_size)}

    text_data = ""
    previous_code = None

    # Loop over compressed compressed_codes and convert them to text values.
    for code in compressed_codes:
        try:
            yield dictionary[code]
        except KeyError:
            # Do your thing here, but something better than an assert

        if previous_code:
            new_code = dictionary[code][0] + dictionary[previous_code]
            if new_code not in dictionary:
                dictionary[dictionary_size] = new_code
                dictionary_size += 1

            if dictionary_size = MAX_DICT_SIZE:
                dictionary_size = MIN_DICT_SIZE

        previous_code = code

我觉得我们在这里做了太多的工作来维护我们的字典,所以我认为我们也可以在那里建立一个更聪明的数据结构。我们可以利用LzwDictionary使其成为子类dict,并真正像dict (实际上,子类defaultdict和/或OrderedDict也可以给您带来一些非常酷的东西),这比您现在需要的更多。

代码语言:javascript
复制
def convert_codes_to_text(compressed_codes: [int]) -> str:
    dictionary = LzwDictionary()
    previous_code = None

    for code in compressed_codes:
        try:
            yield dictionary[code]
        except KeyError:
            # Do your thing here, but something better than an assert

        if previous_code:
            new_code = dictionary[code][0] + dictionary[previous_code]
            if new_code not in dictionary:
                dictionary.add_code(new_code)

        previous_code = code

class LzwDictionary:
    MAX_SIZE = 4096
    MIN_SIZE = 256

    def __init__(self):
        self.data = {i: chr(i) for i in range(LzwDictionary.MIN_SIZE)}
        self.size = LzwDictionary.MIN_SIZE

    def __getitem__(self, key):
        return self.data[key]

    def __contains__(self, key):
        return key in self.data

    def add_code(self, code):
        self.data[self.size] = code
        self.size += 1
        self._check_size()

    def _check_size(self):
        if self.size > LzwDictionary.MAX_SIZE:
            self.size = LzwDictionary.MIN_SIZE

此时,我们只需要实际使用迭代器;您可以执行''.join(convert_codes_to_text(codes))或增量写入。我可能最终会有这样的结果:

代码语言:javascript
复制
def extract_archive_to_file(input_path, output_path):

    with open(input_path, 'rb') as lz_file:
        archive = LzwArchive(lz_file)
        decompressed = convert_codes_to_text(archive)
        with open(output_path, 'w') as out_file:
            for character in decompressed:
                out_file.write(character)

所有这些都是在不访问Python的情况下编写的,所以请原谅任何输入或错误。

票数 2
EN
页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://codereview.stackexchange.com/questions/189761

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档