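I have a folder with subtitles for a series, and I want to end up with one subtitle file per episode. My problem is that some subtitles belong to the same episode but have different names, for example: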
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.720p.HDTV.x264-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.902.720p.HDTV.x264.MOMENTUM.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.9X02.HDTV.XviD-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.HDTV.XviD-MOMENTUM.srt
So they are very similar, but not 100% identical.
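How can I remove the duplicate documents while keeping the subtitles for distinct episodes?
I would attach what I have tried, but unfortunately I am pretty clueless...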
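Posted on 2017-03-20 12:14:32
You can use cosine similarity between the documents.
Similar documents should have a high similarity score, so you can apply a threshold: any pair of documents scoring above it is treated as the same document.
For example, if these are your documents: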
1. "The child went home today, and his mother waited for him"
2. "My car is big, so said my mother"
3. "The kid went to his house today, while his mama waited for him to come"
I used the code from the answer by vpekar and did the following:
>>> v1 = text_to_vector("the child went home today, and his mother waited for him")
>>> v2 = text_to_vector("My car is big, so said my mother")
>>> v3 = text_to_vector("The kid went to his house today, while his mama waited for him to come")
The cosine similarities between the vectors are:
>>> get_cosine(v1,v2)
0.10660035817780521
>>> get_cosine(v1,v3)
0.48420012470625223
>>> get_cosine(v2,v3)
0.0
Clearly, documents 1 and 3 are the most similar, so they are probably subtitles for the same episode.
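The steps above can be sketched end to end in Python. This is a minimal illustration, not the exact code from vpekar's answer: `text_to_vector` and `get_cosine` are written here as plain word-count helpers under the assumption that they behave like the functions used in the session above, while `keep_distinct` and the threshold of 0.4 are hypothetical and chosen only so that the three example sentences separate. A real threshold would need tuning on actual subtitle files.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Bag-of-words vector: token -> count (case-sensitive, as assumed here)."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two token-count vectors."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[t] * vec2[t] for t in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if not norm1 or not norm2:
        return 0.0  # an empty document is similar to nothing
    return numerator / (norm1 * norm2)


def keep_distinct(docs, threshold=0.4):
    """Greedy filter (hypothetical helper): keep a document only if its
    similarity to every already-kept document is below the threshold."""
    kept = []  # list of (name, vector) pairs
    for name, text in docs.items():
        vec = text_to_vector(text)
        if all(get_cosine(vec, other) < threshold for _, other in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]


docs = {
    "doc1": "the child went home today, and his mother waited for him",
    "doc2": "My car is big, so said my mother",
    "doc3": "The kid went to his house today, while his mama waited for him to come",
}
print(keep_distinct(docs))  # ['doc1', 'doc2']
```

For the subtitle folder, `docs` would map each file path to the text of that file, and the returned list would be the files to keep. The greedy pass keeps whichever near-duplicate it sees first, so the iteration order decides which release of an episode survives.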
https://stackoverflow.com/questions/42903174