
The problem you're running into is very common, since <code>difflib</code> is not optimized for speed. Here are some tricks I've found over the years while developing a tool that compares HTML documents.

<h2>Files fit in memory</h2>

Create two lists containing the lines of each file, then pass them to <code>difflib.SequenceMatcher</code>. <code>SequenceMatcher</code> accepts any sequence of hashable items, so the comparison runs line by line instead of character by character, which is much faster. The trade-off is precision: two lines that differ by a single character count as completely different.

Take a look at <a href="https://github.com/andresriancho/w3af/blob/43aeb4482a3701a05a0c0c594d22321b9969c6b6/w3af/core/controllers/misc/fuzzy_string_cmp.py#L86-L99" rel="nofollow noreferrer">fuzzy_string_cmp.py</a> and <a href="https://github.com/andresriancho/w3af/blob/43aeb4482a3701a05a0c0c594d22321b9969c6b6/w3af/core/controllers/misc/diff.py#L67-L120" rel="nofollow noreferrer">diff.py</a> to see how I'm doing exactly this.
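A minimal sketch of the line-based comparison described above. The helper name <code>line_ratio</code> and the choice of <code>autojunk=False</code> are assumptions for illustration, not taken from the linked w3af files:

```python
import difflib


def line_ratio(text_a: str, text_b: str) -> float:
    """Similarity in [0, 1], computed over lines instead of characters."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # Assumption: autojunk=False turns off difflib's "popular element"
    # heuristic, which on sequences of 200+ items can ignore lines that
    # repeat often. Drop the argument to keep difflib's default behaviour.
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()


a = "alpha\nbeta\ngamma\ndelta\n"
b = "alpha\nbeta\nGAMMA\ndelta\n"
print(line_ratio(a, b))  # 3 of 4 lines match on each side -> 0.75
```

Because whole lines are compared, the changed line contributes nothing to the score even though only its case differs; that is the loss of precision mentioned above.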

<h2>Alternative</h2>

There is a great library called <a href="https://pypi.org/project/diff-match-patch/" rel="nofollow noreferrer">diff_match_patch</a>, which is available on PyPI. The library performs fast diffs between two strings and returns the changes (line added, line equal, line removed).

By leveraging <a href="https://pypi.org/project/diff-match-patch/" rel="nofollow noreferrer">diff_match_patch</a> you should be able to create your own <code>dmp_quick_ratio</code> function.

For inspiration on writing <code>dmp_quick_ratio</code>, see how I use the library in <a href="https://github.com/andresriancho/w3af/blob/43aeb4482a3701a05a0c0c594d22321b9969c6b6/w3af/core/controllers/misc/diff.py#L30-L64" rel="nofollow noreferrer">diff.py</a>.

My tests showed that using <a href="https://pypi.org/project/diff-match-patch/" rel="nofollow noreferrer">diff_match_patch</a> was 20 times faster than Python's <code>difflib</code>.
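One possible shape for such a function is sketched below. It is not the code from diff.py: the helper <code>ratio_from_diffs</code> and the <code>timeout</code> parameter are hypothetical, and the score reuses difflib's formula (2 × matched length / total length) on the output of <code>diff_main</code>. Since the two libraries can align the texts differently, the values may not match <code>SequenceMatcher.ratio()</code> exactly.

```python
from typing import Iterable, Tuple

# Operation codes used in diff_match_patch diff tuples.
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def ratio_from_diffs(diffs: Iterable[Tuple[int, str]]) -> float:
    """Hypothetical helper: turn (op, text) diff tuples into a 0..1 score."""
    equal = changed = 0
    for op, text in diffs:
        if op == DIFF_EQUAL:
            equal += len(text)
        else:
            changed += len(text)
    # len(a) + len(b) == 2 * equal + deleted + inserted
    total = 2 * equal + changed
    return 1.0 if total == 0 else 2.0 * equal / total


def dmp_quick_ratio(text_a: str, text_b: str, timeout: float = 1.0) -> float:
    # Imported lazily so ratio_from_diffs stays usable without the package.
    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    # Seconds allowed per diff; 0 removes the limit. When the limit is hit,
    # the diff is still valid but may not be minimal.
    dmp.Diff_Timeout = timeout
    return ratio_from_diffs(dmp.diff_main(text_a, text_b))


# Hand-built diff: "abcde" -> "abcxye"
sample = [(DIFF_EQUAL, "abc"), (DIFF_DELETE, "d"), (DIFF_INSERT, "xy"), (DIFF_EQUAL, "e")]
print(ratio_from_diffs(sample))  # 2 * 4 / 11
```

Keeping the scoring separate from the library call makes it easy to swap in a line-mode diff later without touching the ratio logic.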


There is a C implementation of <code>difflib.SequenceMatcher</code>: <a href="https://pypi.org/project/cdifflib" rel="nofollow noreferrer">cdifflib</a>.
Replace <code>SequenceMatcher</code> with it and all difflib operations will be about 4x faster:
<pre><code>from cdifflib import CSequenceMatcher
import difflib
# Patch the module so code that looks up difflib.SequenceMatcher gets the C version
difflib.SequenceMatcher = CSequenceMatcher
</code></pre>
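If patching the module globally is undesirable, the class can also be used directly. The sketch below assumes <code>cdifflib</code> may not be installed everywhere and falls back to the standard implementation; the names <code>Matcher</code> and <code>similarity</code> are illustrative only:

```python
import difflib

try:
    # Requires `pip install cdifflib`; used here as a drop-in for SequenceMatcher.
    from cdifflib import CSequenceMatcher as Matcher
except ImportError:
    # Fall back to the pure-Python implementation from the standard library.
    Matcher = difflib.SequenceMatcher


def similarity(lines_a: list, lines_b: list) -> float:
    """Ratio between two lists of lines, using the C matcher when available."""
    return Matcher(None, lines_a, lines_b).ratio()


print(similarity(["a", "b", "c", "d"], ["b", "c", "d", "e"]))  # 3 shared items -> 0.75
```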


You can get a small speedup by running your script under PyPy:

<a href="http://pypy.org/" rel="nofollow">http://pypy.org/</a>

I'm using difflib's SequenceMatcher (the ratio() method) to measure the similarity between text files. difflib is reasonably fast for a small set of text files, but it does not scale: comparing 10 files of 70 KB on average against each other (46 comparisons) takes about 80 seconds.

The problem is that I have a collection of 3000 text files (75 KB on average). A rough estimate of the time SequenceMatcher needs to finish the comparison job is 80 days!

I tried the "real_quick_ratio()" and "quick_ratio()" methods, but they don't fit our needs.

Is there any way to speed up the comparison process?
If not, is there a faster method for this task, even if it is not in Python?

Python's difflib SequenceMatcher speed up
