我正在实现LZ77压缩算法。为了压缩任何文件类型,我使用它的二进制表示,然后将它作为chars (因为1 char等于1字节,afaik)读入std::string。当前程序版本压缩和解压缩文件(.txt、.bmp等)--以字节表示的原始文件的大小与未压缩文件的大小相匹配。
现在是实际代码部分。下面是关于LZ77如何处理压缩的简短信息:


以下是两个主要功能:压缩和findLongestMatch:
compress在两个缓冲区之间移动字符数据,并保存编码的元组⟨偏移量、长度、nextchar⟩。findLongestMatch在historyBuffer中找到最长的lookheadBuffer匹配那么,一般说来,有什么方法可以提高效率(时间/记忆)呢?搜索比较?而且,我不太喜欢我的循环,有什么想法如何使事情更优雅吗?UPD:为了澄清问题,我希望得到所有的代码审查。谢谢!
代码:
LZ77::Triplet LZ77::slidingWindow::findLongestPrefix()
{
// Minimal tuple (if no match >1 is found)
Triplet n(0, 0, lookheadBuffer[0]);
size_t lookCurrLen = lookheadBuffer.length() - 1;
size_t histCurrLen = historyBuffer.length();
// Increasing the substring (match) length on every iteration
for (size_t i = 1; i <= std::min(lookCurrLen, histCurrLen); i++)
{
// Getting the substring
std::string s = lookheadBuffer.substr(0, i);
size_t pos = historyBuffer.find(s);
if (pos == std::string::npos)
break;
if ((historyBuffer.compare(histCurrLen - i, i, s) == 0) && (lookheadBuffer[0] == lookheadBuffer[i]))
pos = histCurrLen - i;
// check if there are any repeats
// following the of current longest substring in lookheadBuffer
int extra = 0;
if (histCurrLen == pos + i)
{
// Check for full following repeats
while ((lookCurrLen >= i + extra + i) && (lookheadBuffer.compare(i + extra, i, s) == 0))
extra += i;
// Check for partial following repeats
int extraextra = i - 1;
while (extraextra > 0)
{
if ((lookCurrLen >= i + extra + extraextra) && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
break;
extraextra--;
}
extra += extraextra;
}
// Compare the lengths of matches
if (n.length <= i + extra)
n = Triplet(histCurrLen - pos, i + extra, lookheadBuffer[i + extra]);
}
return n;
}
void LZ77::compress()
{
do
{
if ((window.lookheadBuffer.length() < window.lookBufferMax) && (byteDataString.length() != 0))
{
int len = window.lookBufferMax - window.lookheadBuffer.length();
window.lookheadBuffer.append(byteDataString, 0, len);
byteDataString.erase(0, len);
}
LZ77::Triplet tiplet = window.findLongestPrefix();
// Move the used part of lookheadBuffer to historyBuffer
window.historyBuffer.append(window.lookheadBuffer, 0, tiplet.length + 1);
window.lookheadBuffer.erase(0, tiplet.length + 1);
// If historyBuffer's size exceeds max, delete oldest part
if (window.historyBuffer.length() > window.histBufferMax)
window.historyBuffer.erase(0, window.historyBuffer.length() - window.histBufferMax);
encoded.push_back(tiplet);
} while (window.lookheadBuffer.length());
}
void LZ77::decompress()
{
for (auto code : encoded)
{
int length = code.length;
if (length)
{
// Getting the substring
std::string s = byteDataString.substr(byteDataString.length() - code.offset, std::min(length, code.offset));
// Considering the repeats
while (length)
{
int repeat = std::min(length, static_cast<int>(s.length()));
byteDataString.append(s, 0, repeat);
length -= repeat;
}
}
byteDataString.append(1, code.next);
}
}
void LZ77::readFileUncompressed(std::filesystem::path& path)
{
std::ifstream file(path, std::ios::in | std::ios::binary);
if (file.is_open())
{
// Building the byte data string
byteDataString = std::string(std::istreambuf_iterator<char>(file), {});
file.close();
}
else
throw std::ios_base::failure("Failed to open the file");
}
void LZ77::createFileCompressed(std::filesystem::path& path)
{
std::ofstream out(path / "packed.lz77", std::ios::out | std::ios::binary);
if (out.is_open())
{
for (auto triplet : encoded)
{
intToBytes(out, triplet.offset);
out << triplet.next;
intToBytes(out, triplet.length);
}
out.close();
}
else
throw std::ios_base::failure("Failed to open the file");
}
void LZ77::readFileCompressed(std::filesystem::path& path)
{
std::ifstream file(path, std::ios::in | std::ios::binary);
if (file.is_open())
{
Triplet element;
while (file.peek() != std::ifstream::traits_type::eof())
{
element.offset = intFromBytes(file);
file.get(element.next);
element.length = intFromBytes(file);
encoded.push_back(element);
}
file.close();
}
else
throw std::ios_base::failure("Failed to open the file");
}
void LZ77::createFileUncompressed(std::filesystem::path& path)
{
std::ofstream out(path / "unpacked.unlz77", std::ios::out | std::ios::binary);
out << byteDataString;
out.close();
}头文件:
#include <iostream>
#include <fstream>
#include <iterator>
#include <vector>
#include <string>
#include <filesystem>
#include <algorithm>
class LZ77
{
struct Triplet
{
int offset;
int length;
char next;
Triplet() = default;
Triplet(int off, int len, char next) : offset(off), length(len), next(next) {};
~Triplet() = default;
};
struct slidingWindow
{
// History and lookhead buffer
std::string lookheadBuffer;
std::string historyBuffer;
// Max buffers sizes
size_t lookBufferMax;
size_t histBufferMax;
slidingWindow() = default;
slidingWindow(std::string hbf, std::string lhbf) : historyBuffer(hbf), lookheadBuffer(lhbf) {}
~slidingWindow() = default;
// Function to get longest match
Triplet getLongestPrefix();
};
// Sliding window (2 buffers)
LZ77::slidingWindow window;
// Byte data string
std::string byteDataString;
// Vector of encoded tuples <offset, length, next>
std::vector<Triplet> encoded;
public:
void compress();
void decompress();
void readFileUncompressed(std::filesystem::path& pathToFile);
void createFileCompressed(std::filesystem::path& pathToFile);
void readFileCompressed(std::filesystem::path& pathToFile);
void createFileUncompressed(std::filesystem::path& pathToFile);
void reset()
{
encoded.clear();
window.historyBuffer.clear();
window.lookheadBuffer.clear();
byteDataString.clear();
}
LZ77(size_t lookBufMaxSize, size_t histBufMaxSize)
{
window.lookBufferMax = lookBufMaxSize * 1024;
window.histBufferMax = histBufMaxSize * 1024;
};
};发布于 2019-12-18 22:39:28
代码的主要问题是如何构造API。这有点奇怪,而且不是一般的。如果我想压缩一个文件,我需要编写四行代码:
LZ77 lz77(5, 7);
lz77.readFileUncompressed("somepath");
lz77.compress();
lz77.createFileCompressed("somepath");首先,我应该为lookBufMaxSize和histBufMaxSize指定哪些数字?除非我对算法了如指掌,否则我不知道什么数字是好的。设置默认值更好,设置这些参数的一种简单方法,类似于gzip如何将压缩级别从1级到9级。
其次,为什么您必须指定一个目录,而不是指定要读取或写入的实际文件名?如果应用程序能够选择完整的文件名,情况会更好。
第三,我需要读取该文件,压缩它,并将其以不同的步骤写入。但是,我不太可能想要以不同的顺序运行这些步骤。因此,让一个函数一次完成所有操作,这样我就可以编写这样的东西,这样就更有意义了:
LZ77 lz77(...);
lz77.compress("somepath/uncompressed.txt", "somepath/compressed.lz77");但是,更好的做法是避免实例化类。如果我们可以写:
LZ77::compress("somepath/uncompressed.txt", "somepath/compressed.lz77");可能有一个可选的第三个参数来设置压缩级别。这将使API更加简单,并防止错误的发生。例如,在您的API中,如果不在操作之间调用class LZ77实例,那么重用reset()实例可能是不安全的。
通过将类的输入和输出方法限制在读取和写入文件上,使代码变得不那么通用。如果应用程序已经有一些数据在内存中,并且希望通过网络发送压缩版本的数据,该怎么办?必须将其写入文件,然后对其进行压缩,将其写入另一个文件,然后再读取,这是非常低效率的。此外,如果应用程序确实希望对文件进行读写,但已经有一个开放的输入或输出流,那么如果可以传递这些流,则会好得多。
如果您支持某种流接口,那么拥有类保持状态将更有意义,在这种情况下,应用程序一次可以向状态中输入少量数据。
改善内存使用的一种方法是消除临时缓冲区,如果您做了前面提到的API更改,就可以这样做。如果您正在压缩N个字节的文件,这将为您节省2*N字节的内存。
https://codereview.stackexchange.com/questions/233865
复制相似问题