文章/答案/技术大牛

发布

社区首页 >问答首页 >LZ77压缩算法(一般编码效率)

问LZ77压缩算法(一般编码效率)
EN

Code Review用户

提问于 2019-12-11 12:40:51

回答 1查看 1.8K关注 0票数 7

我正在实现LZ77压缩算法。为了压缩任何文件类型，我使用它的二进制表示，然后将它作为chars (因为1 char等于1字节，afaik)读入std::string。当前程序版本压缩和解压缩文件(.txt、.bmp等)--以字节表示的原始文件的大小与未压缩文件的大小相匹配。

现在是实际代码部分。下面是关于LZ77如何处理压缩的简短信息：

以下是两个主要功能:压缩和findLongestMatch：

compress在两个缓冲区之间移动字符数据，并保存编码的元组⟨偏移量、长度、nextchar⟩。
findLongestMatch在historyBuffer中找到最长的lookheadBuffer匹配

那么，一般说来，有什么方法可以提高效率(时间/记忆)呢？搜索比较？而且，我不太喜欢我的循环，有什么想法如何使事情更优雅吗？UPD:为了澄清问题，我希望得到所有的代码审查。谢谢!

代码：

LZ77::Triplet LZ77::slidingWindow::findLongestPrefix()
{
    // Minimal tuple (if no match >1 is found)
    Triplet n(0, 0, lookheadBuffer[0]);

    size_t lookCurrLen = lookheadBuffer.length() - 1;
    size_t histCurrLen = historyBuffer.length();

    // Increasing the substring (match) length on every iteration 
    for (size_t i = 1; i <= std::min(lookCurrLen, histCurrLen); i++)
    {
        // Getting the substring
        std::string s = lookheadBuffer.substr(0, i);

        size_t pos = historyBuffer.find(s);
        if (pos == std::string::npos)
            break;

        if ((historyBuffer.compare(histCurrLen - i, i, s) == 0) && (lookheadBuffer[0] == lookheadBuffer[i]))
            pos = histCurrLen - i;

        // check if there are any repeats
       // following the of current longest substring in lookheadBuffer
        int extra = 0;
        if (histCurrLen == pos + i)
        {
            // Check for full following repeats
            while ((lookCurrLen >= i + extra + i) && (lookheadBuffer.compare(i + extra, i, s) == 0))
                extra += i;

            // Check for partial following repeats
            int extraextra = i - 1;
            while (extraextra > 0)
            {
                if ((lookCurrLen >= i + extra + extraextra) && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
                    break;
                extraextra--;
            }

            extra += extraextra;
        }

        // Compare the lengths of matches
        if (n.length <= i + extra)
            n = Triplet(histCurrLen - pos, i + extra, lookheadBuffer[i + extra]);
    }

    return n;
}

void LZ77::compress()
{
    do
    {
        if ((window.lookheadBuffer.length() < window.lookBufferMax) && (byteDataString.length() != 0))
        {
            int len = window.lookBufferMax - window.lookheadBuffer.length();
            window.lookheadBuffer.append(byteDataString, 0, len);
            byteDataString.erase(0, len);
        }

        LZ77::Triplet tiplet = window.findLongestPrefix();

        // Move the used part of lookheadBuffer to historyBuffer
        window.historyBuffer.append(window.lookheadBuffer, 0, tiplet.length + 1);
        window.lookheadBuffer.erase(0, tiplet.length + 1);

        // If historyBuffer's size exceeds max, delete oldest part 
        if (window.historyBuffer.length() > window.histBufferMax)
            window.historyBuffer.erase(0, window.historyBuffer.length() - window.histBufferMax);

        encoded.push_back(tiplet);

    } while (window.lookheadBuffer.length());
} 

void LZ77::decompress()
{
    for (auto code : encoded)
    {
        int length = code.length;
        if (length)
        {
            // Getting the substring
            std::string s = byteDataString.substr(byteDataString.length() - code.offset, std::min(length, code.offset));
            // Considering the repeats
            while (length)
            {
                int repeat = std::min(length, static_cast<int>(s.length()));
                byteDataString.append(s, 0, repeat);
                length -= repeat;
            }
        }
        byteDataString.append(1, code.next);
    }
}


void LZ77::readFileUncompressed(std::filesystem::path& path)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);

    if (file.is_open())
    {
        // Building the byte data string
        byteDataString = std::string(std::istreambuf_iterator<char>(file), {});
        file.close();
    }
    else
        throw std::ios_base::failure("Failed to open the file");
}


void LZ77::createFileCompressed(std::filesystem::path& path)
{
    std::ofstream out(path / "packed.lz77", std::ios::out | std::ios::binary);

    if (out.is_open())
    {
        for (auto triplet : encoded)
        {
            intToBytes(out, triplet.offset);
            out << triplet.next;
            intToBytes(out, triplet.length);
        }
        out.close();
    }
    else
        throw std::ios_base::failure("Failed to open the file");
}


void LZ77::readFileCompressed(std::filesystem::path& path)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);

    if (file.is_open())
    {

        Triplet element;

        while (file.peek() != std::ifstream::traits_type::eof())
        {
            element.offset = intFromBytes(file);
            file.get(element.next);
            element.length = intFromBytes(file);
            encoded.push_back(element);
        }
        file.close();
    }
    else
        throw std::ios_base::failure("Failed to open the file");
}


void LZ77::createFileUncompressed(std::filesystem::path& path)
{
    std::ofstream out(path / "unpacked.unlz77", std::ios::out | std::ios::binary);
    out << byteDataString;
    out.close();
}

头文件：

#include <iostream>
#include <fstream>
#include <iterator>
#include <vector>
#include <string>
#include <filesystem>
#include <algorithm>


class LZ77
{
    struct Triplet
    {
        int offset;
        int length;
        char next;

        Triplet() = default;
        Triplet(int off, int len, char next) : offset(off), length(len), next(next) {};
        ~Triplet() = default;
    };

    struct slidingWindow
    {
        // History and lookhead buffer
        std::string lookheadBuffer;
        std::string historyBuffer;

        // Max buffers sizes
        size_t lookBufferMax;
        size_t histBufferMax;

        slidingWindow() = default;
        slidingWindow(std::string hbf, std::string lhbf) : historyBuffer(hbf), lookheadBuffer(lhbf) {}
        ~slidingWindow() = default;

        // Function to get longest match
        Triplet getLongestPrefix();
    };

    // Sliding window (2 buffers)
    LZ77::slidingWindow window;

    // Byte data string
    std::string byteDataString;

    // Vector of encoded tuples <offset, length, next>
    std::vector<Triplet> encoded;

public:

    void compress();

    void decompress();

    void readFileUncompressed(std::filesystem::path& pathToFile);

    void createFileCompressed(std::filesystem::path& pathToFile);

    void readFileCompressed(std::filesystem::path& pathToFile);

    void createFileUncompressed(std::filesystem::path& pathToFile);

    void reset()
    {
        encoded.clear();
        window.historyBuffer.clear();
        window.lookheadBuffer.clear();
        byteDataString.clear();
    }

    LZ77(size_t lookBufMaxSize, size_t histBufMaxSize)
    {
        window.lookBufferMax = lookBufMaxSize * 1024;
        window.histBufferMax = histBufMaxSize * 1024;
    };
};

c++

performance

c++17

compression

回答 1

Code Review用户

发布于 2019-12-18 22:39:28

重新工作API

代码的主要问题是如何构造API。这有点奇怪，而且不是一般的。如果我想压缩一个文件，我需要编写四行代码：

LZ77 lz77(5, 7);
lz77.readFileUncompressed("somepath");
lz77.compress();
lz77.createFileCompressed("somepath");

首先，我应该为lookBufMaxSize和histBufMaxSize指定哪些数字？除非我对算法了如指掌，否则我不知道什么数字是好的。设置默认值更好，设置这些参数的一种简单方法，类似于gzip如何将压缩级别从1级到9级。

其次，为什么您必须指定一个目录，而不是指定要读取或写入的实际文件名？如果应用程序能够选择完整的文件名，情况会更好。

第三，我需要读取该文件，压缩它，并将其以不同的步骤写入。但是，我不太可能想要以不同的顺序运行这些步骤。因此，让一个函数一次完成所有操作，这样我就可以编写这样的东西，这样就更有意义了：

LZ77 lz77(...);
lz77.compress("somepath/uncompressed.txt", "somepath/compressed.lz77");

但是，更好的做法是避免实例化类。如果我们可以写：

LZ77::compress("somepath/uncompressed.txt", "somepath/compressed.lz77");

可能有一个可选的第三个参数来设置压缩级别。这将使API更加简单，并防止错误的发生。例如，在您的API中，如果不在操作之间调用class LZ77实例，那么重用reset()实例可能是不安全的。

通过将类的输入和输出方法限制在读取和写入文件上，使代码变得不那么通用。如果应用程序已经有一些数据在内存中，并且希望通过网络发送压缩版本的数据，该怎么办？必须将其写入文件，然后对其进行压缩，将其写入另一个文件，然后再读取，这是非常低效率的。此外，如果应用程序确实希望对文件进行读写，但已经有一个开放的输入或输出流，那么如果可以传递这些流，则会好得多。

如果您支持某种流接口，那么拥有类保持状态将更有意义，在这种情况下，应用程序一次可以向状态中输入少量数据。

避免大型临时缓冲区

改善内存使用的一种方法是消除临时缓冲区，如果您做了前面提到的API更改，就可以这样做。如果您正在压缩N个字节的文件，这将为您节省2*N字节的内存。

票数 3

页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://codereview.stackexchange.com/questions/233865

复制

相似问题

问LZ77压缩算法(一般编码效率)
EN

回答 1

Code Review用户

重新工作API

避免大型临时缓冲区

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问LZ77压缩算法(一般编码效率)EN

回答 1

Code Review用户

重新工作API

避免大型临时缓冲区

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问LZ77压缩算法(一般编码效率)
EN