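I'm trying to extract keywords from a text. Using the "en_core_sci_lg" model, I get a tuple of phrases/words, and I want to remove the duplicate entries from it. I tried the usual deduplication approaches for lists and tuples, but they didn't work. Can anyone help? I'd really appreciate it.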
text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The MIT library is published under the MIT license and its main developers are Matthew Honnibal and Ines Honnibal, the founders of the software company Explosion."""

The code I tried:
import spacy
nlp = spacy.load("en_core_sci_lg")
doc = nlp(text)
my_tuple = list(set(doc.ents))
print('original tuple', doc.ents, len(doc.ents))
print('after set function', my_tuple, len(my_tuple))

Output:
original tuple: (spaCy, open-source software library, programming languages, Python, Cython, MIT, library, published, MIT, license, developers, Matthew Honnibal, Ines, Honnibal, founders, software company Explosion) 16
after set function: [Honnibal, MIT, Ines, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, MIT, published, open-source software library, spaCy] 16

Expected output (there should be only one MIT, and the name Ines Honnibal should stay together):
[Ines Honnibal, MIT, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, published, open-source software library, spaCy]

Posted 2022-02-09 22:08:46
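doc.ents is not a list of strings; it is a tuple of Span objects. When you print a Span you see its text, but each Span is a separate object tied to its own position in the document, which is why set does not treat the two MIT entries as duplicates. The hint is that the printed output has no quotes: if these were strings, you would see quotes around each item.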
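To make that concrete without loading a model, here is a minimal sketch that uses a hypothetical stand-in class, FakeSpan, in place of spaCy's Span. It assumes, as I understand spaCy's behaviour, that two spans with the same text at different positions compare as different objects. The token offsets are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span (not the real class).

    Equality and hashing use text AND position, so the same word at two
    different places in the document counts as two different objects.
    """
    text: str
    start: int
    end: int

    def __repr__(self):
        # Like a Span, print only the text, with no quotes around it.
        return self.text


# Two mentions of "MIT" at different (made-up) token offsets.
spans = (FakeSpan("MIT", 20, 21), FakeSpan("MIT", 25, 26))

print(spans)                         # looks like duplicates: (MIT, MIT)
print(len(set(spans)))               # 2: the set keeps both objects
print(len({s.text for s in spans}))  # 1: comparing the text removes the duplicate
```

The printed tuple looks like it holds two identical items, but the set only collapses them once the comparison is done on the text.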
To remove the duplicates, build the set from the text of each entity instead of from the Span objects themselves:
my_tuple = list(set(e.text for e in doc.ents))

https://stackoverflow.com/questions/71057313
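One thing to be aware of with the set-based fix is that a set does not keep the original order of the entities. If order matters, a possible variant is sketched below. The helper name unique_texts is hypothetical, and the SimpleNamespace objects are only stand-ins for anything with a .text attribute; with spaCy you would pass doc.ents instead.

```python
from types import SimpleNamespace


def unique_texts(spans):
    """Return each span's text once, in order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(span.text for span in spans))


# Stand-in entities; with spaCy this would be doc.ents.
ents = [SimpleNamespace(text=t) for t in ["spaCy", "MIT", "library", "MIT", "license"]]

result = unique_texts(ents)
print(result)  # ['spaCy', 'MIT', 'library', 'license']
```

Neither version will join Ines and Honnibal into one name: in the output above the model returns them as two separate entities, and deduplication only removes repeated text.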