How to compare text and select similar sentences?

Stack Overflow user

Asked 2022-05-08 05:38:19

2 answers, 151 views, 0 followers, 0 votes

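I am using NLP to extract sentences containing specific keywords from SEC filings of different years. I store the output in sqlite through a pandas DataFrame. So far so good. The problem arises when I want to compare the sentences from two different years, say 2022 and 2021.

I have been using the following query: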

query = "select Nvidia_2022.Research as Research_2022, Nvidia_2021.Research as Research_2021 from Nvidia_2022 join Nvidia_2021 where '%' || Nvidia_2022.Research || '%' like '%' || Nvidia_2021.Research || '%'"

This works in most cases when the sentences are exactly the same. Here is the output:

['Such license and development arrangements can further enhance the reach of our technology.'

'Such license and development arrangements can further enhance the reach of our technology.']

However, sometimes the sentences differ slightly, as shown below:

['We have invested over $29 billion in research and development since our inception, yielding inventions that are essential to modern computing.'

'We have invested over $24 billion in research and development since our inception, yielding inventions that are essential to modern computing.']

$29 billion vs. $24 billion

Or there are other differences at the end of the sentence:

'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; cryptocurrency mining processors, or CMP; Jetson for robotics and other embedded platforms; and NVIDIA AI Enterprise and other software.'

'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; and Jetson for robotics and other embedded platforms.'

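My question:

Is there a way in sqlite, or another SQL database, to do as much of the text comparison work as possible and then pass only the most complex sentences to Python for a comparison based on something like levenshtein_distance or sentence transformers?

Or should I stop using SQL comparison queries and move the heavy lifting to Python right away?

I am trying to use SQL as much as possible because it is much faster than computing distances in Python.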

python

sql

sqlite

sentence-similarity

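2 Answers

Stack Overflow user

Accepted answer

Posted 2022-05-08 07:01:42

sqlite3 supports full-text search through the FTS5 extension.

You have to create a virtual table, and then you can use the MATCH keyword.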

-- create a virtual table
CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

-- populate it ...

-- perform a full text search
SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;
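
Applied to the question's data from Python, this could look roughly like the sketch below. It assumes the SQLite library behind Python's sqlite3 module was compiled with FTS5 (common, but not guaranteed). The table name sentences_2021, the column research, and the helpers build_index, fts_query, and best_matches are hypothetical names chosen for illustration. Each word of the 2022 sentence is quoted and joined with OR, so a sentence with a few changed words still matches, and ORDER BY rank puts the closest match first.

```python
import re
import sqlite3

# 2021 sentences quoted in the question.
SENTENCES_2021 = [
    "Such license and development arrangements can further enhance "
    "the reach of our technology.",
    "We have invested over $24 billion in research and development since "
    "our inception, yielding inventions that are essential to modern computing.",
    "Our Compute & Networking segment includes Data Center platforms and "
    "systems for AI, HPC, and accelerated computing; Mellanox networking and "
    "interconnect solutions; automotive AI Cockpit, autonomous driving "
    "development agreements, and autonomous vehicle solutions; and Jetson "
    "for robotics and other embedded platforms.",
]


def build_index(conn, sentences):
    """Create a (hypothetical) FTS5 table and load the sentences into it."""
    conn.execute("CREATE VIRTUAL TABLE sentences_2021 USING fts5(research)")
    conn.executemany(
        "INSERT INTO sentences_2021(research) VALUES (?)",
        [(s,) for s in sentences],
    )


def fts_query(sentence):
    """Turn a sentence into an FTS5 query: quoted words joined with OR."""
    words = re.findall(r"[A-Za-z0-9]+", sentence)
    return " OR ".join(f'"{w}"' for w in words)


def best_matches(conn, sentence, limit=3):
    """Return the indexed sentences that best match, best first."""
    rows = conn.execute(
        "SELECT research FROM sentences_2021 "
        "WHERE sentences_2021 MATCH ? ORDER BY rank LIMIT ?",
        (fts_query(sentence), limit),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, SENTENCES_2021)

query_2022 = (
    "We have invested over $29 billion in research and development since "
    "our inception, yielding inventions that are essential to modern computing."
)
print(best_matches(conn, query_2022, limit=1))
```

The full-text search only narrows down candidates by shared words; it does not say how similar two sentences are, so a distance measure would still be needed for the final comparison.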

Votes: 1

Stack Overflow user

Posted 2022-05-08 05:54:48

Some implementations, such as Snowflake, have an edit distance function: https://docs.snowflake.com/en/sql-reference/functions/editdistance.html
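
As far as I know, stock SQLite does not ship an equivalent function in its core, but Python's sqlite3 module can register one with Connection.create_function. The following is a minimal sketch under that assumption: the SQL function name editdistance and the helpers levenshtein and near_matches are hypothetical, while the table and column names come from the question. Note that the distance is still computed in Python, once per pair of rows, so this is a convenience rather than a speedup.

```python
import sqlite3


def levenshtein(a, b):
    """Classic dynamic-programming edit distance; None in, None out."""
    if a is None or b is None:
        return None
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_matches(conn, max_distance):
    """Pairs of 2022/2021 sentences within max_distance edits."""
    return conn.execute(
        "SELECT a.Research, b.Research, editdistance(a.Research, b.Research) "
        "FROM Nvidia_2022 AS a CROSS JOIN Nvidia_2021 AS b "
        "WHERE editdistance(a.Research, b.Research) <= ?",
        (max_distance,),
    ).fetchall()


S2022 = ("We have invested over $29 billion in research and development since "
         "our inception, yielding inventions that are essential to modern computing.")
S2021 = ("We have invested over $24 billion in research and development since "
         "our inception, yielding inventions that are essential to modern computing.")
OTHER = ("Such license and development arrangements can further enhance "
         "the reach of our technology.")

conn = sqlite3.connect(":memory:")
conn.create_function("editdistance", 2, levenshtein)  # expose to SQL
conn.execute("CREATE TABLE Nvidia_2022 (Research TEXT)")
conn.execute("CREATE TABLE Nvidia_2021 (Research TEXT)")
conn.execute("INSERT INTO Nvidia_2022 VALUES (?)", (S2022,))
conn.executemany("INSERT INTO Nvidia_2021 VALUES (?)", [(S2021,), (OTHER,)])

for row in near_matches(conn, max_distance=5):
    print(row)
```

The cross join compares every 2022 sentence with every 2021 sentence, so the cost grows with the product of the two table sizes; filtering candidates first keeps that manageable.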

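If you really want to do this in SQL, you can tokenize the sentences:

1. Split the varchar on spaces -> array
2. Unnest/flatten the array in a CTE
3. Repeat steps 1 and 2 for the sentence to compare with
4. Join the two CTEs to see how many tokens they have in common

But I don't think SQL is necessarily faster for this kind of operation, and it is not as robust as the Python libraries.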
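
For comparison, the same token-overlap idea can be sketched in plain Python instead of SQL CTEs. This is only an illustration: the helper names tokens, shared_tokens, and jaccard are hypothetical, splitting is done on whitespace as in step 1, and the Jaccard ratio is an addition of mine to turn the shared-token count into a score between 0 and 1.

```python
def tokens(sentence):
    """Steps 1-2: split on whitespace and keep the distinct lowercase tokens."""
    return set(sentence.lower().split())


def shared_tokens(a, b):
    """Step 4: number of tokens the two sentences have in common."""
    return len(tokens(a) & tokens(b))


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens (1.0 means identical sets)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


s2022 = ("We have invested over $29 billion in research and development since "
         "our inception, yielding inventions that are essential to modern computing.")
s2021 = ("We have invested over $24 billion in research and development since "
         "our inception, yielding inventions that are essential to modern computing.")

print(shared_tokens(s2022, s2021))
print(round(jaccard(s2022, s2021), 3))
```

Because punctuation stays attached to the words with this split, "computing." and "computing" count as different tokens; a stricter tokenizer would change the scores slightly.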

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/72158390
