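I am using NLP to extract sentences containing specific keywords from SEC filings for different years. I store the output in SQLite via pandas DataFrames. So far, so good. The problem arises when I want to compare sentences from two different years, say 2022 and 2021.
I have been using the following query: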
query = "select Nvidia_2022.Research as Research_2022, Nvidia_2021.Research as Research_2021 from Nvidia_2022 join Nvidia_2021 where '%' || Nvidia_2022.Research || '%' like '%' || Nvidia_2021.Research || '%'"
This works in most cases for sentences that are exactly the same. Here is the output:
['Such license and development arrangements can further enhance the reach of our technology.'
'Such license and development arrangements can further enhance the reach of our technology.']
However, sometimes the sentences differ slightly, as shown below:
['We have invested over $29 billion in research and development since our inception, yielding inventions that are essential to modern computing.'
'We have invested over $24 billion in research and development since our inception, yielding inventions that are essential to modern computing.']
$29 billion versus $24 billion.
Or there are other differences toward the end of the sentence:
'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; cryptocurrency mining processors, or CMP; Jetson for robotics and other embedded platforms; and NVIDIA AI Enterprise and other software.'
'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; and Jetson for robotics and other embedded platforms.'
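My question:
Is there a way in SQLite or another SQL database to do as much of the text comparison as possible, and then pass only the most complex sentences to Python for a comparison such as levenshtein_distance or sentence transformers?
Or should I stop using SQL comparison queries and do the heavy lifting in Python right away?
I am trying to use SQL as much as possible because it is much faster than computing the distances in Python.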
Posted on 2022-05-08 07:01:42
Posted on 2022-05-08 05:54:48
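Some implementations, such as Snowflake, have an edit distance function: https://docs.snowflake.com/en/sql-reference/functions/editdistance.html
If you really want to do this in SQL, you could tokenize the text.
But I don't think SQL is necessarily faster for this kind of operation, and it is not as robust as the Python libraries.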
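One middle ground, sketched here as an illustration rather than a tested solution for your data: SQLite lets you register a Python function and call it from SQL, so the join stays in the query while the fuzzy comparison runs in Python. The sketch assumes the table and column names from the question (Nvidia_2022, Nvidia_2021, Research). The helper names similarity and find_similar, the 0.8 threshold, and the choice of difflib.SequenceMatcher (a standard-library ratio, not a true Levenshtein distance) are all assumptions made for the example.

```python
import sqlite3
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio in [0, 1]; NULL inputs score 0."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def find_similar(conn, threshold=0.8):
    """Pair 2022 and 2021 sentences whose ratio meets the threshold."""
    # Expose the Python function to SQL under the (hypothetical) name "similarity".
    conn.create_function("similarity", 2, similarity)
    query = """
        SELECT a.Research AS Research_2022,
               b.Research AS Research_2021,
               similarity(a.Research, b.Research) AS score
        FROM Nvidia_2022 AS a
        CROSS JOIN Nvidia_2021 AS b
        WHERE similarity(a.Research, b.Research) >= ?
        ORDER BY score DESC
    """
    return conn.execute(query, (threshold,)).fetchall()


def demo():
    """Build a tiny in-memory database and run the comparison."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Nvidia_2022 (Research TEXT)")
    conn.execute("CREATE TABLE Nvidia_2021 (Research TEXT)")
    conn.execute(
        "INSERT INTO Nvidia_2022 VALUES (?)",
        ("We have invested over $29 billion in research and development since our inception.",),
    )
    conn.executemany(
        "INSERT INTO Nvidia_2021 VALUES (?)",
        [
            ("We have invested over $24 billion in research and development since our inception.",),
            ("Such license and development arrangements can further enhance the reach of our technology.",),
        ],
    )
    return find_similar(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Note that this still compares every 2022 sentence with every 2021 sentence, and each comparison is a call back into Python, so it is a convenience more than a speed-up. A cheap SQL-side prefilter (for example, the exact match from the question, or a length check) could reduce how many pairs reach the Python function.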
https://stackoverflow.com/questions/72158390