文章/答案/技术大牛

发布

社区首页 >问答首页 >C++ LZ77压缩算法

问C++ LZ77压缩算法
EN

Code Review用户

提问于 2017-05-24 00:03:40

回答 2查看 6.4K关注 0票数 4

我在使用int / size_t时遇到了困难，因为我们实际上不能减一个size_t。所以我不知道我是否很好地使用了size_t，但是我想尽可能大地处理输入字符串。

lz77编码和ilz77解码。

作为我在这里没有复制的外部函数，有：

int bin_to_dec(std::string bin)，它将二进制数字转换为十进制表示形式。

std::string dec_to_bin(int dec, int nb_bits=-1)，它执行反向操作:取一个十进制数，并在必要时在左边添加额外的0，从而输出其大小为nb_bits的二进制表示。

这是我的算法：

#include "BinaryTools.h"

#include <string>
#include <cmath>
#include <algorithm>
#include <tuple>


/**
returns a pair (p, n) where p is the position of the first character of the match relatively to the end of dictionary (p = 0 means no match)
    and n is the length of the match
the match is the longest substring of dictionary + lookahead_buffer which is a prefix of lookahead_buffer (where + represents string concatenation)
*/
std::pair<int, int> get_longest_match(const std::string& dictionary, const std::string& lookahead_buffer)
{
    int p = 0; // q = dictionary.size() - p
    int n = 0;

    int dict_size = dictionary.size();
    int buffer_size = lookahead_buffer.size();

    for(int q = 0; q < dictionary.size(); q++)
    {
        int local_n = 0;
        while(local_n < std::min(buffer_size, dict_size - q) && dictionary[q + local_n] == lookahead_buffer[local_n])
            local_n++;
        if(local_n == dict_size - q) // the match may continue in the buffer
        {
            int buffer_index = 0;
            while(local_n < buffer_size && lookahead_buffer[buffer_index] == lookahead_buffer[local_n])
            {
                buffer_index++;
                local_n++;
            }
        }
        if(local_n > n)
        {
            n = local_n;
            p = dict_size - q;
        }
    }

    std::cout << n << " ";
    return std::make_pair(p, n);
}

/**
0: char on 8 bits
1: tuple (p, n) on 15 + 8 bits (only if n >= 3, otherwise send chars)
    note: p is shifted 1 down and n is shifted 3 down
*/
std::string lz77(const std::string& input)
{
    size_t cursor = 0;
    size_t input_size = input.size();

    int dictionary_size = pow(2, 15); // (2^15 - 1) + 1 because p >= 1 if there is a match
    int buffer_size = pow(2, 8) + 2; // (2^8 - 1) + 3 because n >= 3

    std::string output, output_bin;
    size_t output_bin_cursor = 0;

    while(cursor < input_size)
    {
        // if cursor < dictionary_size then substr(0, cursor) else substr(cursor - dictionary_size, dictionary_size)
        std::string dictionary = input.substr(std::max(0, (int)cursor - dictionary_size), std::min(cursor, (size_t)dictionary_size)); // PB
        std::string buffer = input.substr(cursor, buffer_size);

        std::pair<int, int> x = get_longest_match(dictionary, buffer);

        int p = x.first; // in {1, ..., 2^15}, will be shifted to {0, ..., 2^15 - 1} ; 2^15 = 32768
        int n = x.second; // in {3, ..., 2^8 + 2}, will be shited to {0, ..., 2^8 - 1} ; 2^8 = 256
        if(n < 3)
        {
            output_bin += "0";
            output_bin += dec_to_bin((int)(unsigned char)input[cursor], 8);
            cursor++;
        }
        else
        {
            output_bin += "1";
            output_bin += dec_to_bin(p - 1, 15);
            output_bin += dec_to_bin(n - 3, 8);
            cursor += n;
        }

        int output_bin_size = output_bin.size();
        while(output_bin_size - output_bin_cursor >= 8)
        {
            std::string c = output_bin.substr(output_bin_cursor, 8);
            output += (char)bin_to_dec(c);
            output_bin_cursor += 8;
        }

        if(output_bin_size > output_bin.max_size() - 100)
        {
            output_bin.erase(0, cursor);
            output_bin_cursor = 0;
        }
    }

    // fill last byte with 0 (minimal size to encode a something is 9 anyway)
    output_bin.erase(0, output_bin_cursor);
    if(!output_bin.empty())
    {
        std::string fill_byte(8 - output_bin.size(), '0');
        output_bin += fill_byte;
        std::string c = output_bin.substr(0, 8);
        output += (char)bin_to_dec(c);
    }

    return output;
}

std::string ilz77(const std::string& input)
{
    size_t input_size = input.size();
    size_t input_cursor = 0;

    std::string input_bin;
    size_t input_bin_cursor = 0;

    // initial input_bin fill
    while(input_cursor < input_size && input_bin.size() < input_bin.max_size() - 100)
    {
        input_bin += dec_to_bin((int)(unsigned char)input[input_cursor], 8);
        input_cursor++;
    }

    std::string output;

    while(!input_bin.empty())
    {
        char type = input_bin[input_bin_cursor];
        input_bin_cursor++;

        if(input_bin_cursor > input_bin.size() || input_bin.size() - (int)input_bin_cursor < 8) // only last byte filling bits remain
            return output;

        if(type == '0')
        {
            std::string c = input_bin.substr(input_bin_cursor, 8);
            output += (char)bin_to_dec(c);
            input_bin_cursor += 8;
        }
        else
        {
            int p = bin_to_dec(input_bin.substr(input_bin_cursor, 15)) + 1;
            input_bin_cursor += 15;
            int n = bin_to_dec(input_bin.substr(input_bin_cursor, 8)) + 3;
            input_bin_cursor += 8;

            size_t match_begin = output.size() - p; // not just substr because of the eventual overlap
            for(int k = 0; k < n; k++)
                output += output[match_begin + k];
        }

        // refill input_bin
        if(input_bin.size() - (int)input_bin_cursor < 100 && input_cursor < input_size)
        {
            input_bin.erase(0, input_bin_cursor);
            input_bin_cursor = 0;

            while(input_cursor < input_size && input_bin.size() < input_bin.max_size() - 100)
            {
                input_bin += dec_to_bin((int)(unsigned char)input[input_cursor], 8);
                input_cursor++;
            }
        }
    }

    return output;
}

我的get_longest_match函数非常慢，但我已经看到并设想使用哈希表，稍后我将尝试实现该函数。

更新:根据请求，这是我的BinaryTools.h文件：

#ifndef BINARYTOOLS_H
#define BINARYTOOLS_H

#include <string>

std::string next_bin(std::string bin)
{
    int k = bin.size() - 1;
    while(bin[k] == '1')
    {
        bin[k] = '0';
        k--;
    }
    if(k == -1)
        bin = "1" + bin;
    else
        bin[k] = '1';
    return bin;
}

int bin_to_dec(std::string bin)
{
    int dec = 0;
    int n = bin.size();
    int k = 1;
    for(int i = 0; i < n; i++)
    {
        if(bin[n-1-i] == '1')
            dec += k;
        k *= 2;
    }
    return dec;
}

// if nb_bits > 1, add 0s at the beginning of the output string so that its size is nb_bits
std::string dec_to_bin(int dec, int nb_bits=-1)
{
    std::string bin;
    while(dec != 0)
    {
        if(dec % 2 == 0)
        {
            bin = "0" + bin;
            dec /= 2;
        }
        else
        {
            bin = "1" + bin;
            dec = (dec - 1) / 2;
        }
    }

    if(nb_bits > 0 && bin.size() < nb_bits)
    {
        std::string fill(nb_bits - bin.size(), '0');
        return fill + bin;
    }
    return bin;
}


#endif // BINARYTOOLS_H

c++

compression

回答 2

Code Review用户

发布于 2017-05-24 14:41:09

欢迎来到代码评审，这是一个不错的第一个问题。代码写得很好，可读性也很好。只是一些可能有助于改进代码的观察：

正如@TobySpeight所提到的，您应该将变量更改为size_t，这样警告消息就会消失。

缺少的头文件

代码丢失了

#include <iostream>

这导致了上面提到的bug @TobySpeight。

头文件中的

函数

显然，将函数体放入头文件是有效的，但是，将函数原型放入头文件和将函数体放入cpp源文件更为常见。这样做的原因是，如果包含函数体的头文件被多个文件包括在内，则函数现在被乘定义，用户在链接时会遇到多个定义错误。一种方法是使头文件中的函数是静态的，但是最好只是在头文件中有原型并将多个对象文件链接在一起。

降低复杂度，遵循SRP

单一责任原则指出，每个模块或类都应该对软件提供的功能的单个部分负有责任，这种责任应该完全由类封装。它的所有服务都应与这一责任密切配合。

Robert C. Martin expresses the principle as follows:
    `A class should have only one reason to change.`

虽然这主要针对面向对象语言中的类，但它很好地应用于函数和子程序。

lz77(const std::string& input)函数可以分解为多个函数，特别是外部while循环。内部虽然循环似乎是一个好的候选函数，第一个if语句的两个子句似乎也是很好的函数候选。

函数越独立，理解或读取代码就越容易。这也使任何程序员更容易维护或调试代码。

函数ilz77(const std::string& input)看起来也过于复杂，可能也会被分解。

数字常数与符号常数

有几个数字常量(8，15，3,100)确实没有得到解释。最好使用符号常量来给这些值赋予名称和意义。这在C++中做了很多事情

它清楚地定义了数字的含义，从而使代码更加可读性。
它为类型检查提供了一种类型，使C++成为一种强类型语言。
它允许在一个地方更改值，而不是通过遍历所有代码来进行必要的更改，从而使更改算法变得更容易。

类与过程

如果此代码的目标是一组库例程，那么将其封装在类中可能更有意义。这将为已经存在的功能提供一个或更少的接口来调用以实现压缩。

不一致变量命名约定

代码包含名字很好的变量，如dict_size和buffer_size，但是它包含的变量名为p、n和q，很难理解。请记住，代码需要维护，特性可能需要在3到5年内添加。即使是代码的原始作者，在这样长的时间之后，也可能很难维护具有单个字符变量名称的代码。

票数 2

Code Review用户

发布于 2017-05-24 12:38:36

您应该修复这些编译器警告：

g++ -std=c++17 -fPIC -g -Wall -Wextra -Wwrite-strings -Wno-parentheses -Wpedantic -Warray-bounds -Weffc++      164064.cpp    -o 164064
164064.cpp: In function ‘std::__cxx11::string dec_to_bin(int, int)’:
164064.cpp:53:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     if(nb_bits > 0 && bin.size() < nb_bits)
                       ~~~~~~~~~~~^~~~~~~~~

164064.cpp: In function ‘std::pair<int, int> get_longest_match(const string&, const string&)’:
164064.cpp:84:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(int q = 0; q < dictionary.size(); q++)
                    ~~^~~~~~~~~~~~~~~~~~~

164064.cpp: In function ‘std::__cxx11::string lz77(const string&)’:
164064.cpp:157:28: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         if(output_bin_size > output_bin.max_size() - 100)
            ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

要修复其中的中间部分，我认为get_longest_match()应该只对无符号类型进行操作：

std::pair<int, int> get_longest_match(const std::string& dictionary,
                                      const std::string& lookahead_buffer)
{
    auto n = 0u;

    auto const dict_size = dictionary.size();
    auto const buffer_size = lookahead_buffer.size();

    for (size_t q = 0; q < dictionary.size(); q++) {
        auto local_n = 0u;
        while (local_n < std::min(buffer_size, dict_size - q) && dictionary[q + local_n] == lookahead_buffer[local_n])
            ++local_n;

        if (local_n == dict_size - q) {
            // the match may continue in the buffer
            for (int buffer_index = 0;
                 local_n < buffer_size && lookahead_buffer[buffer_index] == lookahead_buffer[local_n];
                 ++buffer_index)
                ++local_n;
        }
        if (local_n > n)
            n = local_n;
    }

    return std::make_pair(dict_size - q, n);
}

另外，

164064.cpp:105:5: error: ‘cout’ is not a member of ‘std’
     std::cout << n << " ";
     ^~~

我猜那只是你无意中留下的一些错误的调试。

票数 1

页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://codereview.stackexchange.com/questions/164064

复制

相似问题

问C++ LZ77压缩算法
EN

回答 2

Code Review用户

缺少的头文件

函数

降低复杂度，遵循SRP

数字常数与符号常数

类与过程

不一致变量命名约定

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问C++ LZ77压缩算法EN

回答 2

Code Review用户

缺少的头文件

函数

降低复杂度，遵循SRP

数字常数与符号常数

类与过程

不一致变量命名约定

Code Review用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问C++ LZ77压缩算法
EN