
<blockquote>
 I don't want the dictionary stored with each string, because that would be high overhead.
</blockquote>

So build a single string with all of the desired contents, and compress it all at once with whichever compressor you prefer. This also solves the "header is too large" problem, because the header is paid once for the whole collection rather than once per string.

You can do this in a variety of ways. Probably the simplest is to create the <code>repr()</code> of a list of the strings; or you can use the <code>pickle</code>, <code>shelve</code> or <code>json</code> modules to create some other sort of serialized form.
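A minimal sketch of that idea, assuming <code>json</code> for the serialized form and <code>zlib</code> as the compressor (both are just illustrative choices; the sample list is made up and stands in for the real data):

```python
import json
import zlib

# Hypothetical sample data standing in for the real 80M strings.
strings = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000

# Serialize the whole list into one byte string, then compress it once,
# so any per-stream overhead is paid a single time.
blob = json.dumps(strings).encode("utf-8")
packed = zlib.compress(blob, 9)

# Round trip: decompress and deserialize to recover the original list.
restored = json.loads(zlib.decompress(packed).decode("utf-8"))
```

The catch is that reading any one string means decompressing everything, so this fits best when the strings are loaded and used as a whole.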


Make a dictionary of all words. Then, convert all words to numbers corresponding to the offset in the dictionary. If needed, you can use the first bit to indicate that the word is capitalized.
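One way this could look, as a rough sketch. The helper names (<code>build_vocab</code>, <code>encode</code>, <code>decode</code>) are made up for illustration, the capitalization flag is kept in the lowest bit of each code rather than the first, and it assumes words are separated by single spaces and are either all lowercase or capitalized only on the first letter:

```python
def build_vocab(strings):
    """Return (vocab list, word -> offset map) over all lowercased words."""
    vocab = sorted({w.lower() for s in strings for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    return vocab, index

def encode(s, index):
    """Turn a string into a list of ints: (offset << 1) | capitalized flag."""
    codes = []
    for w in s.split():
        cap = 1 if w[:1].isupper() else 0
        codes.append((index[w.lower()] << 1) | cap)
    return codes

def decode(codes, vocab):
    """Inverse of encode, under the assumptions stated above."""
    words = []
    for c in codes:
        w = vocab[c >> 1]
        words.append(w.capitalize() if c & 1 else w)
    return " ".join(words)

# Hypothetical sample data.
strings = ["Hello world", "hello World again"]
vocab, index = build_vocab(strings)
encoded = [encode(s, index) for s in strings]
decoded = [decode(c, vocab) for c in encoded]
```

The integer codes still need a compact storage format (for example fixed-width or variable-length integers) before any memory is actually saved.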


How about using <a href="http://docs.python.org/library/zipfile.html" rel="nofollow">zipfile</a> from the standard library?
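For illustration, a small sketch of keeping such an archive entirely in memory with <code>io.BytesIO</code>. The member name <code>strings.txt</code> and the sample list are made up, and it assumes the strings contain no newlines so they can be stored one per line:

```python
import io
import zipfile

# Hypothetical sample data; assumed to contain no newline characters.
strings = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000

# Write all strings into a single deflate-compressed member of an in-memory zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("strings.txt", "\n".join(strings))
archive = buf.getvalue()

# Read the member back and split it into the original strings.
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    restored = zf.read("strings.txt").decode("utf-8").split("\n")
```

Note that the zip container adds its own per-member headers, so it only pays off when many strings share one member.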


There are no more than 128 different characters in English strings, so you can describe each character with a 7-bit code. See <a href="https://stackoverflow.com/questions/1837686/compressing-utf-8or-other-8-bit-encoding-to-7-or-fewer-bits">Compressing UTF-8(or other 8-bit encoding) to 7 or fewer bits</a>.
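A minimal sketch of 7-bit packing, assuming the input is pure ASCII and that the character count is stored separately. The function names <code>pack7</code> and <code>unpack7</code> are made up for this example:

```python
def pack7(text):
    """Pack an ASCII string into 7 bits per character (big-endian bit order)."""
    data = text.encode("ascii")
    acc = 0
    for b in data:
        acc = (acc << 7) | b
    nbits = 7 * len(data)
    nbytes = (nbits + 7) // 8
    # Left-align the bit stream so padding bits sit at the end of the last byte.
    acc <<= nbytes * 8 - nbits
    return acc.to_bytes(nbytes, "big")

def unpack7(packed, n):
    """Recover n characters from bytes produced by pack7."""
    acc = int.from_bytes(packed, "big")
    acc >>= len(packed) * 8 - 7 * n  # drop the padding bits
    chars = [(acc >> (7 * (n - 1 - i))) & 0x7F for i in range(n)]
    return bytes(chars).decode("ascii")

packed = pack7("hello world")
restored = unpack7(packed, 11)
```

On its own this saves only one bit in eight (12.5%), so a 20-character string shrinks from 20 bytes to 18.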


First, if you compress each 20-byte string individually, your compression ratio will be miserable. You need to compress a lot of strings together to see any tangible benefit.

Second, 80M strings is a lot, and if you have to decompress them all to extract a single one of them, you'll be displeased by the performance. Chunk your input into smaller but still large enough blocks. A typical value would be 64KB, which holds roughly 3,200 strings of 20 bytes each.

Then, you can compress each 64KB block independently. When you need to access a single string in a block, you need to decode the entire block.

So there is a trade-off to decide between compression ratio (which favors larger blocks) and random access speed (which favors smaller blocks). You'll be the judge of which balance is best.
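The blocking scheme can be sketched as follows. This uses <code>zlib</code> from the standard library purely as a stand-in compressor, assumes the strings contain no newlines, and the names <code>build_blocks</code>, <code>get_string</code> and <code>BLOCK_STRINGS</code> are made up for this example:

```python
import zlib

BLOCK_STRINGS = 3200  # strings per block, roughly 64KB of 20-byte strings

def build_blocks(strings, per_block=BLOCK_STRINGS):
    """Compress the strings in independent blocks of per_block strings each."""
    blocks = []
    for i in range(0, len(strings), per_block):
        chunk = "\n".join(strings[i:i + per_block]).encode("utf-8")
        blocks.append(zlib.compress(chunk, 9))
    return blocks

def get_string(blocks, i, per_block=BLOCK_STRINGS):
    """Random access: decode only the block that holds string number i."""
    block = zlib.decompress(blocks[i // per_block]).decode("utf-8")
    return block.split("\n")[i % per_block]

# Hypothetical sample data.
strings = ["string number %d" % i for i in range(10000)]
blocks = build_blocks(strings)
value = get_string(blocks, 7777)
```

Changing <code>per_block</code> is how you move along the trade-off: larger blocks usually compress better, smaller blocks make each lookup cheaper.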

Quick note: random access on in-memory structures usually favors fast compression algorithms rather than strong ones. If you compress only once but access randomly many times, prefer a highly asymmetric algorithm, such as LZ4-HC:
<a href="http://code.google.com/p/lz4hc/" rel="nofollow">http://code.google.com/p/lz4hc/</a>

According to the benchmark, compression speed is only 15MB/s, but decoding speed is about 1GB/s. That translates into about 16K blocks of 64KB decoded per second (1GB / 64KB).

I would like to fit 80M strings of length &lt; 20 characters in memory and use as little memory as possible.

I would like a compression library that I can drive from Python, that will allow me to compress short (&lt;20 char) English strings. I have about 80M of them, and I would like them to fit in as little memory as possible.

I would like maximum lossless compression. CPU time is not the bottleneck.

I don't want the dictionary stored with each string, because that would be high overhead.

I want to compress to &lt;20% of the original size. This is plausible, given that the upper bound of the entropy of English is 1.75 bits per character (Brown et al, 1992, <a href="http://acl.ldc.upenn.edu/J/J92/J92-1002.pdf" rel="nofollow">http://acl.ldc.upenn.edu/J/J92/J92-1002.pdf</a>) = 22% compression (1.75/8).

Edit:

I can't use zlib because the header is too large. (If I have a string that starts at 20 bytes, there can be NO header for there to be good compression. zlib header = 200 bytes according to Roland Illing. I haven't double-checked, but I know it's bigger than 20.)

Huffman coding sounds nice, except it is based upon individual tokens, and can't do n-grams (multiple characters).

smaz has a crappy dictionary, and compresses to only 50%.

I strongly prefer to use existing code, rather than implement a compression algorithm.

Compressing short English strings in Python?
