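I have a dataframe with 800,000 rows, and for each row I want to find the person mentioned in its comment (row.comment). I want to use Stanza because it is more accurate, and I parallelized the loop over df.iterrows() with multiprocessing to speed things up. Finding the names with Stanza works when I do not use multiprocessing, and the same multiprocessing code works when I use spaCy instead of Stanza, so the problem seems to be specific to the Stanza package.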
import multiprocessing as mp

import stanza

# Initialize the English neural pipeline (tokenization + NER)
nlp = stanza.Pipeline(lang='en', processors='tokenize, ner')

def stanza_function(arg):
    person_name = ''  # returned when zero or several people are found, or on error
    try:
        idx, row = arg
        comment = preprocess_comment(str(row['comment']))  # Retrieve body of the comment
        doc = nlp(str(comment))
        persons_mentioned = [ent.text for ent in doc.ents if ent.type == 'PERSON']
        if len(persons_mentioned) == 1:
            person_name = persons_mentioned[0]
    except Exception as exc:
        print("Error:", exc)
    return person_name

def spacy_function(arg):
    idx, row = arg
    comment = preprocess_comment(str(row['comment']))  # Retrieve body of the comment
    person_name = ''
    comment_NER = NER(str(comment))  # Run NER; `NER` is the spaCy pipeline, loaded elsewhere
    persons_mentioned = [ent.text for ent in comment_NER.ents if ent.label_ == 'PERSON']
    print(persons_mentioned)
    if len(persons_mentioned) == 1:
        person_name = persons_mentioned[0]
    return person_name

pool = mp.Pool(processes=mp.cpu_count())
persons = pool.map(stanza_function, [(idx, row) for idx, row in df.iterrows()])
df['person_name'] = persons

Posted on 2022-06-07 19:09:16
https://github.com/stanfordnlp/stanza/issues/1007
As explained in that issue, multiprocessing will not help with Stanza in any case, especially when using a GPU.
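Since multiprocessing is not expected to help, one alternative is to hand Stanza many comments at once in a single process. The sketch below is only an illustration and is not taken from the linked issue. It assumes the pattern from the Stanza documentation of calling a pipeline on a list of stanza.Document objects. The helper names (chunked, single_person, extract_persons, stanza_batch_processor) and the batch_size value are hypothetical, and whether this is faster on a given machine has not been measured here.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def single_person(ents) -> str:
    """Return the text of the only PERSON entity, or '' if there are zero or several."""
    persons = [ent.text for ent in ents if ent.type == 'PERSON']
    return persons[0] if len(persons) == 1 else ''


def extract_persons(comments: Sequence[str],
                    process_batch: Callable[[List[str]], list],
                    batch_size: int = 256) -> List[str]:
    """Run `process_batch` over the comments in chunks, one name per comment.

    `process_batch` takes a list of strings and returns one parsed document
    (an object with an `.ents` attribute) per string, in the same order.
    """
    names: List[str] = []
    for batch in chunked(comments, batch_size):
        docs = process_batch(list(batch))
        names.extend(single_person(doc.ents) for doc in docs)
    return names


def stanza_batch_processor() -> Callable[[List[str]], list]:
    """Build a batch function backed by Stanza (hypothetical helper).

    Stanza is imported lazily here; the English models must already be downloaded.
    """
    import stanza

    nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')

    def process_batch(texts: List[str]) -> list:
        # Assumption: a list of Document objects is processed as one batch.
        return nlp([stanza.Document([], text=text) for text in texts])

    return process_batch


# Usage, assuming `df` and `preprocess_comment` from the question:
# comments = [preprocess_comment(str(c)) for c in df['comment']]
# df['person_name'] = extract_persons(comments, stanza_batch_processor())
```

The pipeline is built once and reused for every batch, and only plain strings are passed around, so nothing has to be pickled and sent to worker processes. The chunk size bounds how many documents are in memory at a time and may need tuning for 800,000 rows.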
https://stackoverflow.com/questions/71950766