I'm using gensim for topic modeling in Python. Currently I have a problem: after lemmatization, people's names disappear from the list of most relevant terms for each topic.
Can you recommend a different lemmatization approach that handles person names, or is there any literature reporting that lemmatization may not handle person names correctly? I used the code below.
Thanks!
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
import spacy
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])
import gensim.corpora as corpora
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])

Posted on 2022-08-04 06:39:36
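For reference, each entry of `corpus` above is a bag-of-words document: a list of (token_id, count) pairs. Here is a minimal pure-Python sketch of what `corpora.Dictionary` and `doc2bow` compute (the names `build_id2word` and `doc2bow` below are illustrative stand-ins; the real gensim classes do more, e.g. incremental updates and token filtering):

```python
from collections import Counter

# Illustrative stand-in for gensim's corpora.Dictionary -- not the real API.
def build_id2word(texts):
    """Map each unique token to an integer id, like corpora.Dictionary."""
    vocab = sorted({tok for text in texts for tok in text})
    return {tok: i for i, tok in enumerate(vocab)}

def doc2bow(id2word, text):
    """Return sorted (token_id, count) pairs, like Dictionary.doc2bow."""
    counts = Counter(id2word[tok] for tok in text if tok in id2word)
    return sorted(counts.items())

texts = [["topic", "model", "topic"], ["model", "term"]]
id2word = build_id2word(texts)  # {'model': 0, 'term': 1, 'topic': 2}
corpus = [doc2bow(id2word, t) for t in texts]
print(corpus)  # [[(0, 1), (2, 2)], [(0, 1), (1, 1)]]
```

Note that a token dropped during lemmatization (such as a person's name) never enters `id2word`, so it can never appear among a topic's terms downstream.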
Person names are tagged with the 'PROPN' POS tag. Adding 'PROPN' to the allowed_postags list should fix the problem.
Edit: a simplified example (it accepts raw texts, prints the POS-tag list as debug output, and returns the lemmatized texts unsplit). Basically nothing other than allowed_postags should need to change, but this may give you an idea of which tags you actually need.
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PROPN']):
    texts_out = []
    for sent in texts:
        doc = nlp(sent)
        print([token.pos_ for token in doc])
        texts_out.append(' '.join(token.lemma_ for token in doc if token.pos_ in allowed_postags))
    return texts_out
texts = [
    'Alex Jones caught in lie about Sandy Hook texts during brutal cross-examination',
    'China sets military drills around Taiwan',
    'Analysis: "Slap in the face": Biden\'s fist bump with MBS fails to pay off',
    'Closing arguments expected in WNBA star Brittney Griner\'s Russia drug-smuggling trial'
]
lemmatization(texts)

https://datascience.stackexchange.com/questions/113229
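To see why 'PROPN' matters, here is a self-contained illustration of the filtering step alone. The (lemma, pos) pairs are hand-written to approximate what spaCy would produce for the first headline (in the real pipeline they come from `nlp(...)`); `keep` is a hypothetical helper, not part of spaCy or gensim:

```python
# Hand-labeled (lemma, pos) pairs approximating spaCy's analysis of
# "Alex Jones caught in lie about Sandy Hook texts" -- illustration only.
tagged = [
    ("Alex", "PROPN"), ("Jones", "PROPN"), ("catch", "VERB"),
    ("in", "ADP"), ("lie", "NOUN"), ("about", "ADP"),
    ("Sandy", "PROPN"), ("Hook", "PROPN"), ("text", "NOUN"),
]

def keep(tagged, allowed_postags):
    """Keep only lemmas whose POS tag is in the allowed set."""
    return [lemma for lemma, pos in tagged if pos in allowed_postags]

# Without PROPN the person/place names vanish from the output ...
print(keep(tagged, ["NOUN", "ADJ", "VERB", "ADV"]))
# -> ['catch', 'lie', 'text']

# ... and with PROPN they survive.
print(keep(tagged, ["NOUN", "ADJ", "VERB", "ADV", "PROPN"]))
# -> ['Alex', 'Jones', 'catch', 'lie', 'Sandy', 'Hook', 'text']
```

Since the dictionary and corpus are built from the lemmatized texts, whatever this filter drops can never reach the topic model's term lists.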