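Sorry, this is my first question!
I have started doing named entity recognition in Python, using ne_chunk and stanza.
I would like to know the difference between their pretrained NER models. How do they recognize named entities?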
Posted on 2021-01-21 16:07:08
Stanza
Advantages
It takes you from raw text to annotations in a single pipeline.

pip install stanza
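import stanza

# Download the English models first (assumed step: the original snippet only lists the URLs).
# It will download the following links:
# http://nlp.stanford.edu/software/stanza/1.1.0/en/default.zip
# https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json
stanza.download('en')

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')  # initialize the English neural pipeline
doc = nlp("John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City.")  # run annotation over a sentence
print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')

Initialization log
2021-01-21 11:36:23 INFO: Loading these models for language: en (English):
=========================
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| ner       | ontonotes |
=========================
2021-01-21 11:36:23 INFO: Use device: cpu
2021-01-21 11:36:23 INFO: Loading: tokenize
2021-01-21 11:36:23 INFO: Loading: ner
2021-01-21 11:36:26 INFO: Done loading processors!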
Stanza output
entity: John F. Kennedy International Airport type: FAC
entity: Queens type: GPE
entity: New York type: GPE
entity: USA type: GPE
entity: one type: CARDINAL
entity: New York City type: GPE

ne_chunk
Here are some code snippets that show how to use both approaches and how to convert between them:
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk import pos_tag
from nltk import word_tokenize
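from nltk.chunk import ne_chunk
from nltk.tree import Tree  # needed for the subtree check below

The model behind ne_chunk ships pretrained with NLTK. (To my knowledge it is a maximum-entropy chunker trained on ACE data; CoNLL, named after the CoNLL conference, refers here to the IOB tag format that tree2conlltags and conlltags2tree convert to and from.) Since tokenization, POS tagging and chunking are already done, all we need for tree-based tags is conlltags2tree.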
sentence = """John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City."""
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))
ne_in_sent = []
for subtree in ne_tree:
    if isinstance(subtree, Tree):  # a subtree is a named-entity chunk, i.e. NE != "O"
        # print(subtree.label())
        # print(subtree.leaves())
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)

NLTK output
[('John', 'PERSON'), ('Kennedy International Airport', 'PERSON'), ('Queens', 'GPE'), ('New York', 'GPE'), ('USA', 'ORGANIZATION'), ('New York City', 'GPE')]

In my opinion, Stanza is more advanced than NLTK.
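Some examples are ambiguous or confusing, for instance John F. Kennedy (airport vs. person).
Stanza NER identified entity: John F. Kennedy International Airport, type: FAC (facility: buildings, airports, highways, bridges, etc.).
Stanza identified it correctly.
NLTK NER identified entities: ('John', 'PERSON'), ('Kennedy International Airport', 'PERSON').
NLTK identified it only partially.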
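To make the "correct" versus "partial" distinction concrete, here is a minimal sketch that compares the two entity lists span by span. The entity lists are copied from the Stanza and NLTK outputs above; the helper names (is_sublist, compare) and the four verdict labels are my own, not part of either library.

```python
# Entity lists copied from the Stanza and NLTK outputs in this answer.
stanza_ents = [
    ("John F. Kennedy International Airport", "FAC"),
    ("Queens", "GPE"),
    ("New York", "GPE"),
    ("USA", "GPE"),
    ("one", "CARDINAL"),
    ("New York City", "GPE"),
]
nltk_ents = [
    ("John", "PERSON"),
    ("Kennedy International Airport", "PERSON"),
    ("Queens", "GPE"),
    ("New York", "GPE"),
    ("USA", "ORGANIZATION"),
    ("New York City", "GPE"),
]


def is_sublist(small, big):
    """Return True if the token list `small` occurs contiguously inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def compare(reference, candidate):
    """Give each reference entity a verdict: exact, label mismatch, partial or missing."""
    cand = dict(candidate)
    report = {}
    for text, label in reference:
        if text in cand:
            report[text] = "exact" if cand[text] == label else "label mismatch"
        else:
            ref_tokens = text.split()
            # A candidate entity that covers only part of the reference span.
            fragments = [c for c in cand if is_sublist(c.split(), ref_tokens)]
            report[text] = "partial" if fragments else "missing"
    return report


report = compare(stanza_ents, nltk_ents)
for text, verdict in report.items():
    print(f"{text}: {verdict}")
```

Taking Stanza as the reference is only a convenience here; it does not mean Stanza's output is ground truth.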
I also cross-checked with AllenNLP.
AllenNLP
import allennlp
from allennlp.predictors.predictor import Predictor
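# Note: this archive is named "bert-base-srl" (semantic role labeling), while convert_results
# below expects an NER tagger's "words"/"tags" output, so an NER model archive was presumably loaded.
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")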
document = """John F. Kennedy International Airport is an international airport in Queens, New York, USA, and one of the primary airports serving New York City. """
def convert_results(allen_results):
    # Collect (entity text, entity type) pairs from BIOUL tags.
    ents = set()
    for word, tag in zip(allen_results["words"], allen_results["tags"]):
        if tag != "O":
            ent_position, ent_type = tag.split("-")
            if ent_position == "U":  # single-token entity
                ents.add((word, ent_type))
            else:
                if ent_position == "B":  # first token of a multi-token entity
                    w = word
                elif ent_position == "I":  # inside token
                    w += " " + word
                elif ent_position == "L":  # last token: the entity is complete
                    w += " " + word
                    ents.add((w, ent_type))
    return ents

def allennlp_ner(document):
    return convert_results(predictor.predict(sentence=document))

allennlp_ner(document)

AllenNLP output
{('John F. Kennedy International Airport', 'LOC'), ('New York', 'LOC'), ('New York City', 'LOC'), ('Queens', 'LOC'), ('USA', 'LOC')}

https://stackoverflow.com/questions/64671500
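The three tools use different label sets (FAC/GPE/CARDINAL from Stanza's ontonotes package, PERSON/ORGANIZATION/GPE from NLTK, LOC from AllenNLP), so the raw outputs are not directly comparable. A rough sketch of one way to line them up is to map every label onto a few coarse classes first. The COARSE mapping below is my own assumption (for example, treating FAC and GPE as LOC), not something any of the libraries defines, and the entity sets are copied from the outputs above.

```python
# Hypothetical mapping onto coarse classes; labels not listed (e.g. CARDINAL) are dropped.
COARSE = {
    "PERSON": "PER", "PER": "PER",
    "ORGANIZATION": "ORG", "ORG": "ORG",
    "GPE": "LOC", "FAC": "LOC", "LOC": "LOC",
}


def normalize(entities):
    """Map each (text, label) pair onto the coarse label set, dropping unmapped labels."""
    return {(text, COARSE[label]) for text, label in entities if label in COARSE}


# Entity sets copied from the three outputs shown in this answer.
stanza_out = {
    ("John F. Kennedy International Airport", "FAC"), ("Queens", "GPE"),
    ("New York", "GPE"), ("USA", "GPE"), ("one", "CARDINAL"), ("New York City", "GPE"),
}
nltk_out = {
    ("John", "PERSON"), ("Kennedy International Airport", "PERSON"), ("Queens", "GPE"),
    ("New York", "GPE"), ("USA", "ORGANIZATION"), ("New York City", "GPE"),
}
allennlp_out = {
    ("John F. Kennedy International Airport", "LOC"), ("New York", "LOC"),
    ("New York City", "LOC"), ("Queens", "LOC"), ("USA", "LOC"),
}

# Entities on which all three tools agree after normalization.
agreed = normalize(stanza_out) & normalize(nltk_out) & normalize(allennlp_out)
# Entities that only NLTK produced.
nltk_only = normalize(nltk_out) - normalize(stanza_out) - normalize(allennlp_out)

print(sorted(agreed))
print(sorted(nltk_only))
```

Under this mapping, Stanza and AllenNLP agree on all five place entities for this sentence, and the disagreements all come from NLTK's split of the airport name and its ORGANIZATION label for USA. A different mapping would give a different picture, so treat this as a sanity check on one sentence rather than an evaluation.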