我被要求为面试过程写一个解压缩程序库。是LZW解压缩库。我已经试过了。我正在寻找任何反馈的结构,演示,可读性从专业人士谁曾作为软件工程师。
LZW解释。解压缩部分中的第二个表解释了字典是如何构建的,在我的例子中,键大小被限制在12位。
"""
Decompressor library for Lempel–Ziv–Welch compression algorithm.
Main Functions:
decompress(file_path)
extract_archive_to_file(input_path, output_path)
Test code in the end.
"""
def read_compressed_codes_from_archive(file_path: str) -> [int]:
"""
Reads file at specified path and returns content as an int array.
:param file_path: path to archive
:return: array of int codes
"""
compressed_codes = []
with open(file_path, "rb") as archive:
# Read the 3 bytes, which contain two 12-bit codes
bytes = archive.read(3)
while bytes:
if len(bytes) == 3:
code1, code2 = unpack_codes_from_3_bytes(bytes)
compressed_codes.append(code1)
compressed_codes.append(code2)
elif len(bytes) == 2:
# Odd number of codes in the end of the file
# Extract a single 12-bit code which is padded to 16-bits
code = (0b0000111111111111 & (bytes[0] << 8)) | bytes[1]
compressed_codes.append(code)
bytes = archive.read(3)
return compressed_codes
def unpack_codes_from_3_bytes(packed_bytes: [bytes]) -> (int, int):
"""
Extracts two 12-bit numbers from 3 bytes (24-bits)
:param packed_bytes: array of 3 int bytes
:return: extracted int codes
"""
# Create 16-bit int vars to store extracted codes in
code1 = 0b0000111111111111
code2 = 0b0000111111111111
# Copy over 8 most significant bits and shift them to correct place
code1 = code1 & packed_bytes[0] << 4
# Copy over the 4 least significant bits from second byte
code1 = code1 | (packed_bytes[1] >> 4)
# Copy over the 4 most significant bits from second byte
code2 = code2 & packed_bytes[1] << 8
# Copy over the 8 least significant bits from third byte
code2 = code2 | packed_bytes[2]
return code1, code2
def convert_codes_to_text(compressed_codes: [int]) -> str:
"""
Decompresses codes into text data.
:param compressed_codes: compressed codes from archive
:return: string data from compressed codes
"""
# Create initial dictionary with individual char compressed_codes
dictionary_size = 256
dictionary = {i: chr(i) for i in range(dictionary_size)}
text_data = ""
previous_code = None
# Loop over compressed compressed_codes and convert them to text values.
for code in compressed_codes:
if code in dictionary:
text_data += dictionary[code]
if previous_code:
# Make sure new value is not already in the dict,
# otherwise add it to the end.
if dictionary[previous_code] + dictionary[code][0] not in \
dictionary.keys():
dictionary[dictionary_size] = dictionary[previous_code] \
+ dictionary[code][0]
dictionary_size += 1
# If dictionary is full, reset back to initial dict
# That is 4096 in binary ie one and 12 zeros
if dictionary_size == 0b1000000000000:
dictionary_size = 256
else:
assert ("Decompression error. Code {} not present in "
"dictionary".format(code))
previous_code = code
return text_data
def decompress(file_path: str) -> str:
"""
Decompress an archive and get the text data.
:param filepath: path to archive
:return: text data as string
"""
codes = read_compressed_codes_from_archive(file_path)
text_data = convert_codes_to_text(codes)
return text_data
def extract_archive_to_file(input_path: str, output_path: str):
"""
Extracts an archive and saves it on disk.
:param input_path: path to archive
:param output_path: path to extracted file
:return: none
"""
text_data = decompress(input_path)
with open(output_path, "w") as output_file:
output_file.write(text_data)
def test_library():
"""
Example test function.
"""
file_path = "LzwInputData/compressedfile3.z"
text_data = decompress(file_path)
print(text_data)
output_path = "LzwInputData/compressedfile3.txt"
extract_archive_to_file(file_path, output_path)
test_library()谢谢
发布于 2018-03-16 13:59:04
一般来说,这看起来像是写得很好,文档完整,干净的python代码--所以总体来说很好!但如果我非要挑剔的话,有几件事我可以提.
code1, code2 = unpack_codes_from_3_bytes(bytes)
compressed_codes.append(code1)
compressed_codes.append(code2)可缩短为:
compressed_codes.extend(unpack_codes_from_3_bytes(bytes))我有点不确定你为什么这么做:
if dictionary_size == 0b1000000000000:首先,将其定义为int (dictionary_size = 256),然后惟一的修改是int加法(dictionary_size += 1),那么为什么要进行二进制比较呢?
你能解释一下为什么要使用assert()吗?在这样的模块中,我通常倾向于显式地引发异常--甚至可以生成自定义的异常类,以便模块的最终用户能够轻松地捕获它们--或者如果这是用于报告,则使用logging.error或类似的。
很少有地方可以生成变量,并且只使用它们一次,例如:
text_data = convert_codes_to_text(codes)
return text_data可以是:
return convert_codes_to_text(codes)或者:
text_data = decompress(input_path)
with open(output_path, "w") as output_file:
output_file.write(text_data)可以是:
with open(output_path, "w") as output_file:
output_file.write(decompress(input_path))最后,not in dictionary.keys():可以是not in dictionary:
发布于 2018-03-19 21:41:12
首先,我建议您更改read_compressed_codes_from_archive的API,使其不是文件路径,而是一个类似文件的对象;请参见这个职位。如果有一天你不想只读一个文件,它会给你更多的自由。举个例子,我以前使用的一个应用程序(不是Python)最终遇到了一些问题,我们使用的第三方API只接受文件名,最后我们得到了错误报告,说它对大型文件崩溃了。事实证明,按照他们设计库的方式,他们只能一次加载整个文件来处理文件,而不是通过流或增量加载文件。显式地支持流或类似于启动的东西是处理它的好方法。
接下来,出于类似的原因,您应该让读者成为一个生成器。总的来说,我可能会重写代码,使其看起来像这样(注意,我不知道生成器的正确文档和类型语法是什么,所以我保留了原样)。
import typing
def read_compressed_codes_from_archive(archive: typing.BinaryIO) -> [int]:
"""
Reads file at specified path and returns content as an int array.
:param file_path: path to archive
:return: array of int codes
"""
bytes = archive.read(3)
while bytes:
if len(bytes) == 3:
code1, code2 = unpack_codes_from_3_bytes(bytes)
yield code1
yield code2
elif len(bytes) == 2:
yield (TWELVE_BIT_MASK & (bytes[0] << 8)) | bytes[1]
bytes = archive.read(3)我还定义了一个全局常量,TWELVE_BIT_MASK = 0b0000111111111111,所以更清楚地知道在那里到底发生了什么。然后,我们可以简化位旋转一点;得到前12位数不需要任何掩蔽。从那里开始,我会将所有的数据都移动到unpack_codes_from_3_bytes中,并使用yield from来获取所有的值,不管在何种情况下。最后,我们可以编写一个函数来处理获取前12位和最后12位的每个用例,如下所示:
import typing
def read_compressed_codes_from_archive(archive: typing.BinaryIO) -> [int]
bytes = archive.read(3)
while bytes:
yield from unpack_codes_from_3_bytes(bytes)
bytes = archive.read(3)
def unpack_codes_from_3_bytes(packed_bytes: [bytes]) -> int:
if len(bytes) == 3:
yield get_first_twelve_bits(packed_bytes)
yield get_last_twelve_bits(packed_bytes)
def get_first_twelve_bits(packed_bytes: [bytes]) -> int:
return (packed_bytes[0] << 4) | (packed_bytes[1] >> 4)
def get_last_twelve_bits(packed_bytes: [bytes]) -> int:
return (TWELVE_BIT_MASK & packed_bytes[-2] << 8) | packed_bytes[-1]您甚至可以考虑创建一个类来表示一个可迭代的LZW存档,就像这样(从我的头上,所以可能是不正确的)
class LzwArchive:
def __init__(self, archive: typing.BinaryIO):
self.archive = archive
def __iter__(self):
codes = self.archive.read(3)
while codes:
yield from LzwArchive._unpack_codes(codes)
codes = self.archive.read(3)
@staticmethod
def _unpack_codes(codes: [bytes]) -> [int]:
if len(codes) == 3:
yield LzwArchive._get_first_twelve_bits(codes)
yield LzwArchive._get_last_twelve_bits(codes)
@staticmethod
def _get_first_twelve_bits(packed_bytes: [bytes]) -> int:
return (packed_bytes[0] << 4) | (packed_bytes[1] >> 4)
@staticmethod
def _get_last_twelve_bits(packed_bytes: [bytes]) -> int:
return (TWELVE_BIT_MASK & packed_bytes[-2] << 8) | packed_bytes[-1]然后你可以做这样的事情:
with open(myfile, 'rb') as f:
archive = LzwArchive(f)
for code in archive:
process(code)让发电机开动!我们也可以使convert_codes_to_text成为可迭代的
def convert_codes_to_text(compressed_codes: [int]) -> str:
dictionary_size = MIN_DICT_SIZE
dictionary = {i: chr(i) for i in range(dictionary_size)}
text_data = ""
previous_code = None
# Loop over compressed compressed_codes and convert them to text values.
for code in compressed_codes:
try:
yield dictionary[code]
except KeyError:
# Do your thing here, but something better than an assert
if previous_code:
new_code = dictionary[code][0] + dictionary[previous_code]
if new_code not in dictionary:
dictionary[dictionary_size] = new_code
dictionary_size += 1
if dictionary_size = MAX_DICT_SIZE:
dictionary_size = MIN_DICT_SIZE
previous_code = code我觉得我们在这里做了太多的工作来维护我们的字典,所以我认为我们也可以在那里建立一个更聪明的数据结构。我们可以利用LzwDictionary使其成为子类dict,并真正像dict (实际上,子类defaultdict和/或OrderedDict也可以给您带来一些非常酷的东西),这比您现在需要的更多。
def convert_codes_to_text(compressed_codes: [int]) -> str:
dictionary = LzwDictionary()
previous_code = None
for code in compressed_codes:
try:
yield dictionary[code]
except KeyError:
# Do your thing here, but something better than an assert
if previous_code:
new_code = dictionary[code][0] + dictionary[previous_code]
if new_code not in dictionary:
dictionary.add_code(new_code)
previous_code = code
class LzwDictionary:
MAX_SIZE = 4096
MIN_SIZE = 256
def __init__(self):
self.data = {i: chr(i) for i in range(LzwDictionary.MIN_SIZE)}
self.size = LzwDictionary.MIN_SIZE
def __getitem__(self, key):
return self.data[key]
def __contains__(self, key):
return key in self.data
def add_code(self, code):
self.data[self.size] = code
self.size += 1
self._check_size()
def _check_size(self):
if self.size > LzwDictionary.MAX_SIZE:
self.size = LzwDictionary.MIN_SIZE此时,我们只需要实际使用迭代器;您可以执行''.join(convert_codes_to_text(codes))或增量写入。我可能最终会有这样的结果:
def extract_archive_to_file(input_path, output_path):
with open(input_path, 'rb') as lz_file:
archive = LzwArchive(lz_file)
decompressed = convert_codes_to_text(archive)
with open(output_path, 'w') as out_file:
for character in decompressed:
out_file.write(character)所有这些都是在不访问Python的情况下编写的,所以请原谅任何输入或错误。
https://codereview.stackexchange.com/questions/189761
复制相似问题